Graph API call counts with Google BigQuery and DataDog

(Screenshot: the DataDog time series graph of per-minute API call counts)

At work we serve tens of thousands of API calls per minute.

I wanted a graph with detailed statistics of the API calls in real time. There are services that will do this for you: SumoLogic is one, and I’m sure DataDog will offer it in the near future, but using them would be cost prohibitive. We generate 30-50 gigs of access logs per day. We already use SumoLogic for debugging, but that volume of data would drastically exceed our quota.

So I took a day to see if I could solve this problem some other way. We run on AWS, so I was able to use lambdas and the built-in capabilities of the Elastic Load Balancers.

What I built:

  • 1 Golang lambda that gets notified whenever a new log file becomes available. It takes the raw log file and generates a CSV with the timestamp, the path, and the query string of each log entry. The resulting CSVs are then uploaded to Google BigQuery.
  • 1 Golang lambda that runs every 5 minutes, executes 5-6 SQL queries against BigQuery, and uploads the results as custom metrics to DataDog.
  • I then graph the custom metrics in a time series chart on DataDog. DataDog handles storage, so I’ll be able to look months into the past once the data has been uploaded.

Total Cost: $50-$100 a month.

So far the maintenance cost of the lambdas has been minimal, although I’m not sure that will continue. I chose BigQuery and lambdas so that I wouldn’t have to maintain any infrastructure.

Specifics

We use a standard load balancer on AWS. First, I went into the settings and turned on the access logs. I had the system upload the log files to S3 every 5 minutes.

Then I created my Go lambda, and I added a trigger to run the lambda whenever new files were added to the S3 bucket.
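
The post doesn’t include the lambda source, so here is a rough sketch of what that first lambda could look like, assuming the aws-lambda-go and aws-sdk-go libraries and the classic ELB access-log line format. The bucket handling, the /tmp/out.csv path, and the field positions are illustrative assumptions, not the author’s actual code.

```go
// Hypothetical sketch: S3-triggered lambda that turns an ELB access log
// into a CSV of (timestamp, path, query string) rows.
package main

import (
	"bufio"
	"context"
	"encoding/csv"
	"fmt"
	"net/url"
	"os"
	"strings"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func handler(ctx context.Context, evt events.S3Event) error {
	sess := session.Must(session.NewSession())
	client := s3.New(sess)

	for _, rec := range evt.Records {
		obj, err := client.GetObject(&s3.GetObjectInput{
			Bucket: aws.String(rec.S3.Bucket.Name),
			Key:    aws.String(rec.S3.Object.Key),
		})
		if err != nil {
			return err
		}

		out, err := os.Create("/tmp/out.csv") // lambda's writable scratch space (placeholder path)
		if err != nil {
			return err
		}
		w := csv.NewWriter(out)

		scanner := bufio.NewScanner(obj.Body)
		for scanner.Scan() {
			line := scanner.Text()
			// Classic ELB log lines look like:
			//   2015-05-13T23:39:43Z my-elb 1.2.3.4:5678 ... "GET http://host/path?x=1 HTTP/1.1" ...
			parts := strings.SplitN(line, `"`, 3)
			if len(parts) < 2 {
				continue // malformed line
			}
			fields := strings.Fields(parts[0]) // fields[0] is the timestamp
			req := strings.Fields(parts[1])    // method, URL, protocol
			if len(fields) < 1 || len(req) < 2 {
				continue
			}
			u, err := url.Parse(req[1])
			if err != nil {
				continue
			}
			w.Write([]string{fields[0], u.Path, u.RawQuery})
		}
		w.Flush()
		out.Close()
		obj.Body.Close()
		fmt.Println("wrote CSV for", rec.S3.Object.Key)
		// ...then load /tmp/out.csv into BigQuery (see the next sketch)...
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```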

Adding the BigQuery libraries and credentials to the lambda was trivial because Go compiles to a single fat binary that includes all of its dependencies. It was just: build, upload, done. Ironically, I wrote a Python script to make the build + upload cycle faster.
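
For the BigQuery side, a minimal load using the official Go client (cloud.google.com/go/bigquery) might look like this. The project, dataset, and table names are placeholders; it assumes credentials are picked up via GOOGLE_APPLICATION_CREDENTIALS and that the destination table already exists with a (timestamp, path, query) schema.

```go
// Minimal sketch: load a CSV file into an existing BigQuery table.
package main

import (
	"context"
	"fmt"
	"os"

	"cloud.google.com/go/bigquery"
)

func loadCSV(ctx context.Context, path string) error {
	// With GOOGLE_APPLICATION_CREDENTIALS pointing at a service-account key
	// shipped alongside the binary, NewClient finds credentials automatically.
	client, err := bigquery.NewClient(ctx, "my-gcp-project") // placeholder project ID
	if err != nil {
		return err
	}
	defer client.Close()

	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	src := bigquery.NewReaderSource(f)
	src.SourceFormat = bigquery.CSV

	loader := client.Dataset("access_logs").Table("requests").LoaderFrom(src)
	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if status.Err() != nil {
		return fmt.Errorf("load job failed: %v", status.Err())
	}
	return nil
}
```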

Next, I looked at the DataDog custom metrics API and added a custom metric for each API call I wanted to track. I constructed SQL queries to run against BigQuery every 5 minutes. Each query gave me a per-minute tally of one API call over the last 10 minutes. I won’t go into the specifics of the queries, but in a nutshell it was a date_trunc() run over the last N minutes.
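
The post doesn’t show the actual queries, but a per-minute tally over the last 10 minutes might look roughly like this. The table, column, and endpoint names are made up, and I’ve used BigQuery standard SQL’s TIMESTAMP_TRUNC where the post mentions date_trunc().

```go
// Sketch: run a per-minute count for one endpoint over the last 10 minutes.
package main

import (
	"context"
	"time"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

const perMinuteTally = `
SELECT
  TIMESTAMP_TRUNC(ts, MINUTE) AS minute,
  COUNT(*) AS calls
FROM access_logs.requests
WHERE ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
  AND path = '/api/v1/widgets'
GROUP BY minute
ORDER BY minute`

func tally(ctx context.Context, client *bigquery.Client) (map[time.Time]int64, error) {
	it, err := client.Query(perMinuteTally).Read(ctx)
	if err != nil {
		return nil, err
	}
	out := map[time.Time]int64{}
	for {
		var row struct {
			Minute time.Time
			Calls  int64
		}
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			return nil, err
		}
		out[row.Minute] = row.Calls
	}
	return out, nil
}
```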

The lambda runs on a cron schedule that fires every 5 minutes. I made the execution interval smaller than the window covered by the SQL query, so the windows of consecutive lambda executions overlap generously. This way, if the data for any minute was incomplete the first time the job ran, it would be overwritten on the next execution.
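
Pushing the per-minute counts to DataDog can be done with a plain POST to the v1 series endpoint. The metric name and tag below are made up, and the real lambda may well use a DataDog client library instead; resubmitting the same minutes on the next run is what gives the overwrite behaviour described above.

```go
// Sketch: submit per-minute counts as gauge points to DataDog's v1 series API.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type series struct {
	Metric string       `json:"metric"`
	Points [][2]float64 `json:"points"` // [unix timestamp, value] pairs
	Type   string       `json:"type"`
	Tags   []string     `json:"tags,omitempty"`
}

func pushCounts(metric string, counts map[time.Time]int64) error {
	var points [][2]float64
	for minute, calls := range counts {
		points = append(points, [2]float64{float64(minute.Unix()), float64(calls)})
	}
	body, err := json.Marshal(map[string][]series{
		"series": {{Metric: metric, Points: points, Type: "gauge", Tags: []string{"env:prod"}}},
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequest("POST", "https://api.datadoghq.com/api/v1/series", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("DD-API-KEY", os.Getenv("DD_API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("datadog returned %s", resp.Status)
	}
	return nil
}
```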

Finally, I added SQL to delete data older than a couple of hours in BigQuery, since I only needed the most recent 10 minutes.
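
The cleanup can be a plain DML DELETE. This is a hypothetical version with placeholder table and column names, using the same client as the load sketch and a 2-hour cutoff to match the "couple of hours" retention above.

```go
// Sketch: delete rows older than two hours from the working table.
package main

import (
	"context"

	"cloud.google.com/go/bigquery"
)

const deleteOld = `
DELETE FROM access_logs.requests
WHERE ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)`

func cleanup(ctx context.Context, client *bigquery.Client) error {
	job, err := client.Query(deleteOld).Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	return status.Err()
}
```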


Problems

Cost

BigQuery charges you based on the amount of data your SQL queries have to process. When I let the data stack up for more than a day, my table size was 10-20 gigs. Even though I was only querying the last 10 minutes, each query still processed more than a gig of data (BigQuery scans the full columns it touches unless the table is partitioned, so query cost scales with table size, not with the time window). Executing 6 of them every 5 minutes, 24 hours a day, adds up quickly. I believe it would have cost more than $1,000 a month just to drive 1 or 2 graphs on DataDog.

So instead, I added the deletion query and kept only an hour of data at a time, and the costs dropped by a factor of 20.

CSV Formatting

Whenever I generate CSVs, I inevitably run into formatting problems. My code hits an escape sequence I forgot about, the CSV format blows up, and the whole log file gets rejected.

Fortunately, the fix is usually as simple as surrounding the offending field with quotes and stripping any quotes inside the field itself, but it’s still a hassle given that the problem is different every time.
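
For what it’s worth, Go’s standard encoding/csv writer quotes and escapes embedded quotes, commas, and newlines automatically (I don’t know whether the author’s lambda uses it), which removes most of these surprises, though the loader on the other end still has to accept quoted fields.

```go
// Sketch: encoding/csv handles the quoting that hand-built CSV lines get wrong.
package main

import (
	"encoding/csv"
	"os"
)

func main() {
	w := csv.NewWriter(os.Stdout)
	// A query string containing quotes and commas would break a naive
	// Sprintf-built CSV line; the csv writer quotes and doubles them.
	w.Write([]string{
		"2023-01-01T00:00:00Z", // illustrative values
		"/api/v1/search",
		`q="red shoes, size 9"`,
	})
	w.Flush()
	// Output: 2023-01-01T00:00:00Z,/api/v1/search,"q=""red shoes, size 9"""
}
```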

Upload Quota

BigQuery has an upload quota that prevents you from loading too much into a single table within a given amount of time. Unfortunately, with a 5-minute log reporting interval on the ELB access logs, the system generates log files faster than the quota allows.

I still haven’t found a good solution to this problem. The end result is that not all of our log files make it into the BigQuery table, so we see dips in our graphs at certain times of day (you can see this in the screenshot above), usually when traffic on the site is heavier. The only way to fix it would be to increase the batch size of our uploads, but then we’d have to store the intermediate logs somewhere, and the whole thing would get much more complicated.


Do you have any experience creating something similar to what I did above?

Leave me a comment below.