Cursed at 497 days!

One morning back in November 2015, we had connectivity issues with our Asia branch offices that took a good day to resolve. The offices use Fortigate 100D firewalls and connect to our Sydney core firewall, which runs on a Fortigate VM01 virtualised appliance.

The main symptoms were sluggish remote access (in both directions) and application timeouts. A quick check of the firewalls revealed that the IPSEC tunnels kept flapping, with the system messages below showing on the core firewall (similar messages were showing on the branch firewalls).

 Link monitor: Interface SYD-HK-peer1 was turned down
 Link monitor: Interface SYD-SG-peer1 was turned down
 Link monitor: Interface SYD-HK-peer1 was turned up
 Link monitor: Interface SYD-SG-peer1 was turned up

Whilst I personally vouch for the rock solid stability of Fortigate firewalls, we have had some notable and strange issues over the years. A particular one was when emails started bouncing back one morning, and the cause was eventually traced to a Fortigate antivirus update (it’s set to update automatically every hour) that rejected every email coming its way. In this VPN flapping case, AV filtering and a few other controls/features (e.g. IPS) were temporarily turned off, but the flapping persisted.

We checked time zone settings and NTP server availability, reviewed the IPSEC VPN settings (which had not been changed recently), ran VPN debugs, etc., but still no joy. Anyone looking after IT systems would know that when you encounter weird technical issues like this, which affect the business while the phones are running hot, part of you suddenly wants to be a yoga instructor or do some job that has nothing to do with technology!

Guess what fixed it later that night…a reboot of the core firewall!

We had it running for 497 days non-stop so it probably needed a little break 🙂
Further research revealed that this behaviour has also been reported on some Windows and Linux systems and on devices from F5 and Cisco!

Why 497 days?
Plenty of documents online put this down to a coding limitation: the number of 10 ms ticks since boot is kept in a 32-bit unsigned integer, which can’t hold a count beyond, guess what, 497 days!
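To sanity-check that number, here is a quick back-of-the-envelope sketch in Python. It assumes exactly what the explanation above says (a 10 ms tick and a 32-bit unsigned counter); the function names are made up for illustration and have nothing to do with how FortiOS or any other vendor actually implements its uptime counter.

```python
TICK_MS = 10        # assumed tick length, per the explanation above
COUNTER_BITS = 32   # assumed width of the unsigned tick counter


def days_until_wrap(tick_ms: int = TICK_MS, bits: int = COUNTER_BITS) -> float:
    """Days of uptime before a counter of this width rolls over to zero."""
    max_ticks = 2 ** bits                 # number of distinct counter values
    seconds = max_ticks * tick_ms / 1000  # uptime covered before rollover
    return seconds / 86400                # 86400 seconds in a day


def tick_counter(uptime_seconds: int, tick_ms: int = TICK_MS,
                 bits: int = COUNTER_BITS) -> int:
    """Value a wrapping counter would hold after the given uptime."""
    ticks = uptime_seconds * 1000 // tick_ms
    return ticks % (2 ** bits)            # unsigned overflow wraps modulo 2**bits


if __name__ == "__main__":
    print(f"Counter wraps after about {days_until_wrap():.1f} days")
    # One second either side of the rollover point:
    print(tick_counter(42_949_672))  # still a huge number
    print(tick_counter(42_949_673))  # back to a tiny number
```

The rollover itself is harmless if the code comparing timestamps expects it. The trouble starts when something assumes "now" is always a bigger number than "earlier", which would be one plausible reason for timers such as link monitors to misbehave right after the counter wraps.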

So when in doubt, the age-old recommendation of rebooting the device won’t hurt. Who would have thought, eh?
