Big Data with Golang Instead of MapReduce


This is one of those software engineering ideas that I would normally warn you about. So many people use MapReduce that it seems foolhardy to use something else.

But in this case, it turned out well. The project was a success, and we were able to accomplish our goals more quickly and with fewer resources than it would have taken with a MapReduce cluster.

Background

I work on 3d Warehouse for Trimble SketchUp (formerly Google).

One of our focuses over the last year has been analytics – both for our customers and for our own internal use. Most business intelligence providers are expensive – anywhere from $100K to $500K per year. Even at that price, it’s still cheaper than engineering time, so that was the path we originally took.

Unfortunately, we ran into trouble with scalability. 3d Warehouse is a medium-sized web platform, and we generate about 1 billion data points annually. It’s enough data that you’ll run into serious problems without optimization, and our third-party analytics provider couldn’t handle it. Our page would time out after 5 minutes because our queries were so slow.

So starting in December 2016, we began building our own stack using AWS Redshift.

Today, we’re very happy with Redshift, and we have 5 billion rows stored in our database. Query times are excellent, and we only spend $5-10K a year on machine costs. Maintenance has also turned out to be less work than our integration with the third-party provider.

I won’t go into all the details of how our pipeline works – here, I want to talk about generating historical data.

When we switched providers, we wanted our customers to see graphs going back several years, right at launch. So somehow we needed to generate the data from the past.

Fortunately, we use AWS CloudFront, and we’ve been collecting CloudFront logs for a number of years. Since the logs record every API access, they contain everything we need to generate historical events.

The problem is that we get between 40 and 60 compressed gigs of logs per day. Uncompressed, that expands to 400-600 gigs. It’s enough data that you need some form of parallelism.

Approaches

The standard approach to this problem would be Elastic MapReduce using either Hadoop or Spark.

We started that way. Previously, 2 people on my team had completed separate projects using Python Spark to process the logs.

The problem was that each project took 3-4 weeks. Spark has a huge learning curve. Writing the code is easy enough, but then you add all the details of running on a production AWS cluster. Some of the debugging pains we had:

  • Random machines running out of memory
  • The job not doing anything because you’re using Spark and didn’t specify an output – you can’t just “write a mapper” without reducing
  • Data not being distributed correctly between nodes on the cluster
  • Pain finding the running logs for debugging
  • Pain understanding all the various resource management systems that go into running a Hadoop cluster
  • Pain setting up the AWS security groups so you can SSH to the proper machine and see what’s going on

I was also shocked to find that the Python implementation of Spark doesn’t support counters. I’m not sure how people run multi-day jobs without being able to see running values, error rates, etc.

At the beginning of this project, I tried to adopt what my coworkers had done, and I spent a week fighting all the issues above in both Python and Scala. All I could think about was how much easier writing MapReduces at Google had been.

Then I figured there had to be an easier way.

I had been messing around with Go off and on for various projects. I was impressed by its built-in concurrency support, and I loved the language’s simplicity.

I looked around on AWS and realized you could get 32- and 64-core machines for around 50 cents an hour using Spot pricing.

So that’s what I did. Instead of running an entire MapReduce cluster, I decided to run one massive but cheap machine using Go for parallelism.

Design

The application was simple:

  1. List all the files in S3 for a particular date – CloudFront generates around 20K individual log files per day.
  2. Download each file.
  3. Process the lines in each file, discarding the static file accesses and other lines we didn’t care about.
  4. Look up the creatorid of each model and collection referenced from our API.
  5. Look up the GeoIP data for the IP address using a MaxMind database.
  6. Transform the output into a zipped CSV we could import into Redshift.
  7. Run on an AWS machine for close network proximity to S3 and plenty of processors.

Parallelizing

I used a goroutine pool for each of the separate parts of the problem:

  1. File Listing
  2. File Downloading
  3. Line Processing
  4. CSV Aggregation

Essentially, the file lister would list all the S3 objects and pass them via a channel to the file downloaders. One of the downloader routines would fetch the file, split its lines into groups of 5,000, and pass those along to the line processors. The processors would handle the lines, do all the required lookups, and pass the resulting values along to the CSV aggregator.

I kept the CSV lines in slices and wrote them out to separate CSV files in groups of 50,000, which I then uploaded to S3.
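A minimal sketch of that pipeline, assuming illustrative stage functions, pool sizes, and filtering rules (the real code also handled S3 listing, gzip, GeoIP lookups, and batched CSV output):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// processLines turns a batch of raw log lines into CSV rows,
// discarding lines we don't care about (here: anything not a GET).
func processLines(batch []string) []string {
	var rows []string
	for _, line := range batch {
		if !strings.HasPrefix(line, "GET ") {
			continue
		}
		rows = append(rows, strings.TrimPrefix(line, "GET "))
	}
	return rows
}

// runPipeline wires the stages together with channels:
// file lister -> downloader pool -> line-processor pool -> aggregator.
func runPipeline(files []string, download func(string) []string, workers int) []string {
	fileCh := make(chan string)
	batchCh := make(chan []string)
	rowCh := make(chan []string)

	// Stage 1: file lister feeds the downloaders.
	go func() {
		for _, f := range files {
			fileCh <- f
		}
		close(fileCh)
	}()

	// Stage 2: downloader pool.
	var dlWg sync.WaitGroup
	for i := 0; i < workers; i++ {
		dlWg.Add(1)
		go func() {
			defer dlWg.Done()
			for f := range fileCh {
				batchCh <- download(f)
			}
		}()
	}
	go func() { dlWg.Wait(); close(batchCh) }()

	// Stage 3: line-processor pool.
	var procWg sync.WaitGroup
	for i := 0; i < workers; i++ {
		procWg.Add(1)
		go func() {
			defer procWg.Done()
			for batch := range batchCh {
				rowCh <- processLines(batch)
			}
		}()
	}
	go func() { procWg.Wait(); close(rowCh) }()

	// Stage 4: aggregator runs in the calling goroutine,
	// so the output slice needs no locking.
	var all []string
	for rows := range rowCh {
		all = append(all, rows...)
	}
	return all
}

func main() {
	download := func(f string) []string {
		return []string{"GET /model/" + f, "HEAD /static/logo.png"}
	}
	rows := runPipeline([]string{"a", "b", "c"}, download, 4)
	fmt.Println(len(rows)) // 3
}
```

Keeping the aggregator single-threaded is what makes the output stage lock-free.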

Result

Eventually I got it working, and I was able to process one day of data in 10-20 minutes, depending on whether I was using a 32-core or a 64-core machine.

Debugging, for the most part, was dramatically easier than with Spark. If something went wrong, I only had to SSH to a single machine to look at the logs.

I didn’t use anything complex to run the job; a simple screen session was good enough.

Of course, using Go had its own set of issues.

Hard Won Lessons

CPU Utilization

Getting full utilization on a machine of any size is hard. For most of my processing I settled for CPU utilization of 60-75%. At the end of the project I finally had a couple of breakthroughs that allowed me to reach 95% utilization:

Counters

I published counter values to a simple web interface to help monitor the job. The code for each counter was straightforward:

Simple enough, but it turns out the mutex implementation of this is extremely slow. As you increase your goroutines, you’re going to face lots of lock contention.

Thanks to Bjorn Rabenstein’s GopherCon talk, I learned that using the atomic package is dramatically better.

This Go Example is a great way to implement a simple counter using that package.

This simple change allowed me to squeeze an extra 10% CPU utilization out of the system. Amazing how small things can improve your performance.

Work Size

A bigger issue was the amount of work I was passing between the channels.

The groups of 5000 lines I was passing between the file downloader and the line processor were just not big enough.

As an experiment, I tried eliminating the separate goroutines, and I just processed everything inside the file downloader. I then increased my goroutine pool size for the downloaders, given that many of them would be blocked on IO most of the time.

With that change and the counters change, my CPU utilization shot to 95%.

It turns out that passing work through channels is expensive. If the work items aren’t big enough, you’ll spend a lot of time moving data around while many of your processors sit idle.

Memory Leaks

The toughest issue I faced was a nasty memory leak. After an hour or so of running, my Go program would eventually consume all 200-300 GB of RAM on my AWS machine. I didn’t even think it was possible for a program to use that much RAM.

I narrowed it down to the following problem:

I was saving multiple string slices, with one entry per line in the CSV file. I would output these slices to a CSV file when they reached a length of 50,000.

Because I had different types of data, I had 3 of these slices hanging around at any given time, each writing to a different type of CSV file.

2 of the slices recorded events that happened frequently. Because these events were so common, the slices would empty and get recycled multiple times per minute.

The third type of event was much rarer, so it took a long time to accumulate 50,000 events in the slice – sometimes 10 minutes or more with the program running full bore.

For some reason, that long-lived slice caused massive problems. I went through the code line by line looking for references that were sticking around when they shouldn’t have been, and I couldn’t find squat.

The slice itself was never big enough to cause the memory leak; each CSV file output was only 5 Megs. All the request objects associated with the output slice weren’t even big enough to cause that much memory consumption.

My only explanation is that somehow that slice was causing lots of discarded objects to stick around. I discarded 9 out of every 10 lines in each log file, but first, each had to be parsed into strings, a Request object, etc.

Unfortunately, I never got to the bottom of the issue at the time. I ran out of time, and I had to “solve” it by changing the slice size for that event type from 50,000 to 5,000. This increased the recycling frequency and made the problem go away.

My solution was a cop-out, but I’ve seen others having trouble with long-lived slices as well. So, note to the reader: if you’re having memory problems with your Go program, your slices are one place to start looking.

In retrospect, I could have avoided slices altogether and just written line by line to a file, uploading to S3 whenever I reached a certain number of lines. That way, any objects associated with a CSV line could be discarded immediately.

But frankly, I don’t have to worry about memory that often in web programming, so the file-based approach didn’t even occur to me at the beginning of the project.

Update:

With the help of Howard Shaw’s comment below, I was able to figure this out. Here’s what was happening.

Reading in each log file I had code like this:
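A rough reconstruction (error handling and gzip elided); the key detail is the []byte-to-string conversion followed by the split:

```go
package main

import (
	"fmt"
	"strings"
)

// readLines converts the raw downloaded bytes into one big string and
// splits it into lines. Every string in the returned slice points into
// that single large allocation, so holding onto any one line keeps the
// whole file's worth of bytes alive.
func readLines(body []byte) []string {
	return strings.Split(string(body), "\n")
}

func main() {
	lineArray := readLines([]byte("GET /a\nGET /b\nGET /c"))
	fmt.Println(len(lineArray)) // 3
}
```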

Later on, I would parse each line into a custom Request object, and part of the Request object would refer to the original string – e.g. lineArray[i].

Then I would drop all references to the lineArray slice, and I assumed that once the string slice went away, all the unused lines in the array could be garbage collected.

But here’s the problem: the string slice was cast from a []byte in the first step.

As Howard pointed out, the underlying byte slice can’t be garbage collected until there are no more references to it. So even if I only use 1/1000 lines in the string slice, all 1000 lines will stay in memory because the underlying []byte structure has to stick around.

The next question was why did this only manifest with my long lived output slice?

Let’s say you’re tracking two events from log files – views and downloads, and let’s say that views are 50X as common as downloads.

So in your average 1000 line log file, let’s assume there are 50 view events and only 1 download event.

Remember that I’m outputting csv files in batches of 50,000 events.

Because download events are rarer, I have to look at 50X as many log files to fill the downloads slice as I do to fill the views slice.

With the problem above, I was inadvertently keeping the entirety of each log file in memory, as long as there was at least one event recorded from that file.

If you assume each log file is 1000 lines, and 50 of those lines represent view events, and only 1 represents download events, then I have to keep 50,000 files in memory to output a single download CSV, whereas I only have to keep 1000 files in memory to output a single view CSV.

The output view CSV was still using way more memory than it should have; it just wasn’t enough to cause a crash.

Solution

The solution was a one liner. I added a string copy before parsing each line into my Request structure. This allowed the underlying []byte to be garbage collected, so I stopped storing the entire log file in memory when I was only using a handful of lines.
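In sketch form – copyString is a hypothetical helper name; on modern Go, strings.Clone does the same job:

```go
package main

import (
	"fmt"
	"strings"
)

// copyString forces a fresh allocation, so the result no longer shares
// a backing array with the (possibly huge) original string.
func copyString(s string) string {
	return string([]byte(s))
}

func main() {
	file := strings.Repeat("x", 1<<20) // pretend this is a whole log file
	line := file[:10]                  // this substring pins all 1 MB
	keep := copyString(line)           // a 10-byte copy; the file can be GC'd
	fmt.Println(len(keep)) // 10
}
```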

Memory usage dropped from 40-50% to 3-4% on a machine with 244 GB of RAM.

Thanks to Howard for his help, and beware of underlying byte slices!

Summary

I wanted to solve this problem using Go to prove that you don’t always HAVE to use MapReduce to handle your big data struggles. Most people don’t have enough data to justify a whole cluster of machines, and you can do a lot with multiple cores on one box.

That being said, using Go isn’t “free”. Debugging the memory and CPU utilization issues took days of time – time I could have been using to debug my MapReduce cluster.

And if the problem had been larger, requiring multiple machines, or if it were something my team had to maintain, I would have gone with Spark. After all, Hadoop is the accepted approach to big data problems, so it has a large community, and there are hundreds of teams using Elastic MapReduce on AWS.

As for the language, I love Go. In the past I would have used Python for something like this, but I’ve experienced maintenance problems with Python programs bigger than ~1,000 lines. My problems are so-called “latent bugs” – imagine a mistyped variable hidden behind an if statement or buried inside an error clause. Unless you have perfect unit test coverage (which nobody does), you might not see that bug until many hours into a run.

The type checker helps in these situations – it serves as a “stupid” unit test that checks your work. The Go compiler and tooling (I use VS Code) are good and fast enough that I don’t think they hinder development. Plus, I found parallelization much easier in Go than in either Python or Java.

So in general I would definitely use Go again. It’s an easy way to parallelize medium-sized data problems.

Photo Credit: whitehart1882
