– OVERVIEW / THOUGHT PROCESS –
For months now, I’ve been dealing with customers calling about their email services no longer being available. By the time I could troubleshoot the problem, their email had come back online, which gave the problem a ‘could not duplicate’ finality. I got extremely curious, so I set up some email accounts for myself which I would only access from my home. After the accounts were created, I simply waited until the problem showed its ugly face so that I could duplicate it and slay the beast.
It didn’t take long (around 30 minutes) until I started seeing the anomalies. My first reaction was to simply fix the problem and make my email accounts work once again. Instead, I decided to wait it out for a little while to see if I could be the customer that says “it’s up again, don’t worry about it!”
After some analysis and troubleshooting, I figured it out… and here’s the story of the fix.
– ANALYSIS –
My primary means of troubleshooting is to use the 50/50 method. This means that if I can eliminate 50% of the possibilities, I can spend more time troubleshooting things that will get me closer to the actual problem. Once I’m down to 50%, I split that in half again and narrow it down further. The end result is usually a true analysis of the actual problem… as opposed to going down rabbit holes and wasting time.
So how would I start my 50/50 approach to this problem? Well, it was quite simple. The first split would be between hardware and software.
HARDWARE: Linux box / 4GB RAM / 4 Processors / 2 Network Cards
SOFTWARE: CentOS 6.5 / mysql / email program installs (postfix / dovecot / amavisd / etc…)
To eliminate one of these, I simply took advantage of the time available when my email accounts were down. When they went down, I ssh’d into the machine, ran a ‘top’ command, a few ‘ping’ commands, and restarted some services to determine that the hardware was just fine. With that, I moved to the second 50%, which was the software.
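The quick hardware check above can be captured in a short sketch. This is roughly the kind of triage I ran over ssh; the service names are from my install and the exact commands are illustrative, not a fixed recipe.

```shell
# Quick triage after ssh'ing into the box; service names are placeholders
# for whatever is actually installed.
triage() {
    uptime                       # load averages: is the box CPU-bound?
    ping -c 3 -W 2 127.0.0.1     # basic sanity check of the network stack
    sudo service postfix status  # are the mail services still running?
    sudo service dovecot status
}
# triage   # uncomment to run (the service checks need root)
```

Wrapping it in a function keeps the commands handy without firing them off every time the file is sourced.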
With the software portion, I knew exactly what was installed on my machine. I had just performed a migration to a different box and, with that, only installed certain programs which I controlled via the configuration files. So at this point, I had to narrow it down and eliminate some of the programs as the problem. Once I could figure out which program it was, or at least which email features were being affected, I could start the troubleshooting portion of this problem solve.
My software consisted of a standard CentOS 6.5 installation, the email programs described above, and an installation of mysql. Although email was the subject of contention for my customers, I didn’t want to assume that the email programs were the main problem. Therefore, I also included the Apache web service in the mix (used for webmail). For my way ahead, my analysis into the root cause was going to be to check webpages, check my local email programs to see if the settings were wrong, and finally to check server services that come installed with CentOS, such as the firewall. In order to apply the 50/50 approach to these, I would simply isolate each program one by one by turning them off and noting the effects.
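The one-by-one isolation described above can be sketched as a small helper. The service names in the comment are examples from my install; substitute whatever is actually running on your box.

```shell
# One-at-a-time isolation: stop a service, observe the effect, bring it back.
isolate() {
    # $1 = service name (e.g. httpd, postfix, dovecot, amavisd, mysqld)
    sudo service "$1" stop
    echo "$1 stopped -- test email/webmail now, then press Enter"
    read _
    sudo service "$1" start
}
```

Pausing for input between stop and start gives you time to note whether the symptom changed before restoring the service.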
As always, I like to work with the easy thing first. To me, the easiest item of the bunch was going to be the firewall. I don’t want to spoil the story, but I got pretty lucky on this one. I waited until my email was no longer available, and immediately ran a ‘sudo service iptables restart’ command. As soon as I did this and the tables were reset, my email returned without a hitch. With confidence, I waited for the email services to go down once again (it took a little longer this time) and tried the same technique. As expected, resetting the firewall brought back all email services.
– TROUBLESHOOTING –
Now that I’d narrowed the problem down to the most likely culprit being the firewall, I started troubleshooting for that specific problem. It was quite obvious that the firewall was doing what I told it to do: let in only the traffic that I specify via the /etc/sysconfig/iptables configuration file. In this case, that was the standard email ports, which consisted of (S)POP3, (S)IMAP, SSH, HTTP(S), and a few others for my purposes. So now I wanted to establish my baseline for troubleshooting.
To get a baseline, meaning that I wanted to know exactly what the firewall was doing in a known state, I restarted it once again and ran the ‘sudo iptables -L’ command. This lists the status of the firewall via a printout in the terminal. Mine was as follows:
————-IPTABLES BASELINE EXAMPLE———————-
————-END IPTABLES BASELINE—————————
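One way to make the baseline comparison mechanical is to snapshot the rule listing to a file and diff it against a later snapshot. This is a sketch of that idea; the file paths are arbitrary choices of mine, and the snapshot step needs root.

```shell
# Snapshot the firewall rules right after a clean restart (the baseline),
# then diff against a later snapshot taken while email is down.

snapshot_rules() {
    # $1 = output file; -n skips DNS lookups so the listing is stable
    sudo iptables -L -n > "$1"
}

compare_to_baseline() {
    # $1 = baseline file, $2 = current file; prints only the rules that differ
    diff "$1" "$2"
}

# Usage:
#   snapshot_rules /tmp/iptables.baseline    # right after the restart
#   snapshot_rules /tmp/iptables.broken      # when email stops working
#   compare_to_baseline /tmp/iptables.baseline /tmp/iptables.broken
```

Any line prefixed with `>` in the diff output is a rule that appeared after the baseline was taken, which is exactly what you want to stare at.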
From here, it was a bit of a waiting game once again so that I could compare the results of that same command after my email went down once more. Again, it didn’t take long before my email was inoperable. As soon as I saw this, I ran the ‘iptables -L’ command, and here was the result:
————-IPTABLES CONTROL—————————-
————-END IPTABLES CONTROL—————————-
When you compare the baseline to the table after the email locked up, you can see a clear representation of the root cause of the problem:
————-IPTABLES CONTROL CROPPED————————
————-END IPTABLES CONTROL CROPPED——————-
There’s clearly an entry that was not created by the /etc/sysconfig/iptables file. This last entry exists because, during the install, I gave one of those programs I mentioned earlier permission to write to my iptables. That program is ‘fail2ban’, which is given administrative privileges via its configuration files under /etc/fail2ban/ (the jail settings themselves live in /etc/fail2ban/jail.conf).
Inside the jail configuration file, it shows that an IP address that fails more than three connection attempts will be put in ‘jail’ for a period of 900 seconds. There are plenty more settings that can be made in that file, but the one creating the problem is also the one that can solve it.
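You can also ask fail2ban directly which addresses it currently has banned, instead of reading the raw iptables listing. A sketch; the jail name (‘ssh-iptables’ here) is an assumption, so check the first command’s output for the names your install actually uses.

```shell
# Ask fail2ban for its own view of the bans (needs root and a running daemon).
show_bans() {
    sudo fail2ban-client status               # lists the active jails
    sudo fail2ban-client status ssh-iptables  # banned IPs in one jail (name varies)
}
# show_bans   # uncomment to run on a live system
```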
– SOLUTION –
In the end, there are a few ways to fix the problem: stop using fail2ban altogether, create an exception for each IP address that is a known client, or simply allow a greater tolerance for wrong logon attempts.
This snippet shows the portion of my fail2ban configuration that both caused, and solved, the problem.
————-FAIL2BAN.CONF INSERT——————-
————-END FAIL2BAN.CONF INSERT——————-
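The original snippet didn’t survive in this copy, but for reference, the relevant knobs look like this in a typical fail2ban jail configuration. The values and the whitelisted network below are illustrative, not my exact file:

```ini
# Illustrative fail2ban jail settings (example values, adjust to taste).
[DEFAULT]
bantime  = 900   ; how long a banned IP stays in jail, in seconds
findtime = 600   ; window, in seconds, over which failures are counted
maxretry = 3     ; failed attempts allowed before a ban
; Whitelist known-good clients so they can never be banned:
ignoreip = 127.0.0.1/8 192.168.1.0/24
```

Raising `maxretry`, lowering `bantime`, or whitelisting known clients via `ignoreip` are exactly the three fixes described above.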
– FINAL THOUGHTS –
[CAUTION – TANGENT] I absolutely LOVE Linux for its capabilities. On a Windows system, I don’t think that this type of troubleshooting would have been possible. I have been a Linux and Windows administrator for over 12 years, and quite often I compare the amount of time and effort it takes to troubleshoot both systems.
Years back, I wrote a paper outlining experiments that I performed which specifically baselined production time lost per worker using Windows and Linux. In the Windows environment, production time lost per worker was as high as 90 minutes per day. This was based on waiting for programs to respond, blue screens of death, restarts, and shutting down programs via force quit.
The Linux system was quick to respond via the ‘kill’ command, whereas the Windows system had delays of as much as five minutes using the Ctrl-Alt-Del and force-quit method. Mind you, this paper was written a while ago; however, the trend remains the same with each iteration of Windows that is released. [END TANGENT]
With any Linux system, your log files are the key! You can troubleshoot just about any problem by looking at your log files to find out what’s going on. But that’s not always enough, and this post was the perfect example. I was able to figure out that something was happening with my firewall, but it was actually being caused by a specific program called fail2ban. By thinking outside the box and creating a baseline, I was able to compare that baseline against any condition that I created. When in doubt, find your baseline and use it to gather the information that makes you successful!
Always remember… WHAT IF AND WHY NOT?!?