RTFM & Keep Your Server Sync with NTP

This was a rookie mistake.

At Panda Network, we generate internal reports every week by combining data residing in different sources, one of which happens to be Google Analytics (GA). Since these reports contain hand-picked content, the process is not completely automated. A big button on our bootstrap-y looking admin panel to start preparing the report, a couple of API calls to gather the data, some more clicks to add more information, and you are done. Manually triggered. Every week. But it only takes maximum of 10 minutes. Hence, not a big deal.

A few months ago, it was that time of the week. I hit the big button and saw some errors on the developer console. Some of the HTTP requests made in the process, had failed. As any horrified developer would do, I hit the button again. Swish. This time, all requests were completed successfully. With the joy of completing very first task of the week, I went through my Monday routine by turning up the volume for Spotify's always impressive Discover Weekly with a fresh cup of coffee!

Next week, same thing happened. This time, I wasn't lucky enough to generate the report at the second shot. It took a few more tries before working properly. There evidently was a problem. Upon digging into logs and spending some time debugging, I realized that ~4 out of every 10 calls to Google's Core Reporting API were failing due to an authentication error. We use service tokens to authenticate our requests in server-to-server interactions with Google APIs. I checked parameters of the JWT client and public/private key pair to see if anything is wrong. But everything seemed fine, at least on my local development environment.

While googling about that invalid_grant error response, I recalled that recently, we had been getting a warning on the GA dashboard about data collection limits.

Your data volume exceeds the limit of 10M hits per month as outlined in our Terms of Service. If you continue to exceed the limit, you may lose access to future data.

A property was sending more hits per month to GA than allowed volume, which, according to the warning, may prevent us from accessing reports. So, I instinctively, yet incorrectly connected the dots: having reached the monthly data limit, Google was rejecting to respond to some of our requests.

Months passed. Although we have reduced the data volume sent to GA, getting errors has become my new routine on Mondays. Apparently, I was wrong when I thought that authentication errors are related with rate limits. It had to be something else. However since I could still create reports on a couple of tries, it is not something public, and we already have a lot on our roadmap; I haven't spared any time to wipe this issue out. Until three weeks ago.

No matter how many times I tried, I was not able to generate the report without errors. Desperately, I started reading Analytics Core Reporting API docs. I double checked authorization system on our end, which looks completely inline with the documentation and works fine. At the bottom of the page, I came on to this:

If you receive an invalid_grant error response when attempting to use a refresh token, the cause of the error may be due to the following reasons:

  1. Your server's clock is not in sync with NTP.
  2. The refresh token limit has been exceeded.

Knowing that it wasn't due to refresh token limit, I quickly checked server clock on our production instances.

node01:~$ date  
Tue Dec 22 17:09:28 UTC 2015

node02:~$ date  
Tue Dec 22 17:07:14 UTC 2015  

They were supposed to be 17:02 UTC. Both of them were out of sync with NTP. Here is what was happening for a long while:

  • We have two instances serving our API, and a load balancer in front.
  • NTP service on both of our servers is configured to sync time only at boot time.
  • Well, we rarely reboot our servers.
  • A few months ago, first instance has gotten out of sync with NTP. Load balancer were directing ~4/10 of the requests to this instance and Google API were rejecting them. I had thought that they were dropping some of our requests since we were over data limits.
  • After months, second instance has also gotten out of sync with NTP. Now, all requests were being rejected by Google API, which caused me to investigate the issue.

Upon fixing the server clock, report generation was flawless. Luckily, there is an easy way to prevent facing the same problem in the future. I have set up a daily cron job to sync server with NTP regularly. If you want to do the same on Ubuntu, place this little script1 under /etc/cron.daily:

/usr/sbin/ntpd -s ntp.ubuntu.com

Had I read the f-ing manual (and seen Google API does not work with non-sync servers), I would have spotted the issue quickly. Since we have multiple instances, I couldn't find the error until each one of them is async with NTP. So, better read the docs first. Then, keep your server's clock correct.

  1. Check out ntpd manual.