
Making grep an order of magnitude (or more) faster.

Update: The final code can now be found as part of my WikiLinks project on my GitHub account.

Grep is an amazing tool for searching through files. In a recent project I needed to grep through a 45GB dataset as quickly as possible. While working on the project I developed a couple of strategies for making grep tasks an order of magnitude faster.

The problem.

For this project I needed to analyse a 45GB text file, looking for URLs pointing to a specific domain. The desired output was one or more files containing one matched URL per line for further analysis.

After exploring a few options, the simplest method seemed to be to apply a regular expression to the input file. After working out my regular expressions against a small 250MB test file, I was happy with the output.

When I applied the regular expression to the full dataset, the data was being processed at around 1.8MB/s. Some quick maths (Wolfram Alpha rocks!) estimated a runtime of ~7 hours.
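As a rough check on that estimate (treating 45GB as 45 × 1024 MB of input at the observed rate):

 45 × 1024 MB ÷ 1.8 MB/s ≈ 25,600 s ≈ 7.1 hours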

If this was a one-time job I would have left it there. But partly out of curiosity and partly for practical purposes I decided to have a go at optimising the task.

The basic case.

I’m not a regex expert (far far from it!) but I decided to go with a simple match in order to extract the URLs from the text file.

The regex I settled on produces reasonably clean output while being very straightforward:

'[^(|=[&;*/ ]*example\.co\.uk[^| ]*'

In simple terms, the domain is matched and the match is extended until it finds a prohibited character such as | or =.

Wrapping this into a simple bash script gave me a run time of 590 seconds of clock time (589.8 seconds of user time) for my 2GB test file.
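As a sketch, the baseline was essentially a single grep -o over the whole file (file names here are illustrative):

 grep -o '[^(|=[&;*/ ]*example\.co\.uk[^| ]*' dataset.txt > urls.txt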

With a baseline known and the problem described, let's dive into how I made it fast.

Less input.

Part of why grep is so amazing is that it can perform some searches in less than linear time. That is, you can sometimes double the input and it won't run for twice as long. This is due to some magic called the Boyer-Moore algorithm.

Even so, giving grep less input to process can result in huge performance improvements.

To give grep less input for our dataset we have two options: preprocessing the input file, or reducing the input on the fly.

For this script I went with reducing input on the fly, by simply chaining greps together in a pipeline.

 grep "url.co.uk" file | grep -o '[^(|=[&;*/ ]*url\.co\.uk[^| ]*'

The first grep in the chain is a fast, simple match, keeping just the lines which include the domain we are looking for.

The second grep in the chain runs our previous regex over the much smaller filtered input.

For the dataset I am processing this technique proved to increase performance by 3.5-4x. The runtime for the test dataset dropped from 590 to 148 seconds!

With my input vastly reduced and the performance already quadrupled, I started looking at other methods to eke out more performance. GNU Parallel looked like a good bet.

Fun with parallelism.

Grep is single threaded, which means by itself it is unable to exploit multi-core processors. However, the strictly single-threaded nature of grep means it is trivial to run multiple grep commands in parallel.

For my script, this meant two things: first, I needed to split my input; second, I needed a way of running grep commands in parallel.

Splitting the input is easy (using the unix split utility). I used split to generate four files, each roughly 25% of the input file (11GB each). For my 2GB test file I created four ~512MB files.
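With GNU coreutils the split can be done along line boundaries, something like the following (the flags and file names here are illustrative rather than the exact invocation I used):

 split -n l/4 --additional-suffix=.lump testdata.txt perftest/

The l/4 chunk format keeps whole lines together, which matters here because each URL match lives on a single line.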

Next I needed an easy way to dispatch grep commands in parallel. I googled around and found GNU Parallel.

GNU Parallel makes it insanely easy to split up tasks.

ls perftest/*.lump | parallel "grep 'url' {}"

The first command matches the four files (named chunks for the big data and lumps for my test files), then parallel passes each file to a grep command.

The performance improvement was less impressive than reducing input. In the small test, run time fell from 590 to 324 seconds, a ~1.8x improvement.

Putting it all together

For my final script I combined both reducing input and parallelism.
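As a sketch, the combined pipeline looks something like this (domain and file names are illustrative, with all output collected into one file for the analysis step):

 ls perftest/*.lump | parallel "grep 'url.co.uk' {} | grep -o '[^(|=[&;*/ ]*url\.co\.uk[^| ]*'" > urls.txt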

For the small scale test this resulted in a total performance improvement of ~6.5x, dropping from 590 to 91 seconds.

Applied to the larger dataset (11GB, ~25% of the full dataset), the performance improvement increased to ~11.5x, dropping from 1535 to just 133 seconds.

Final words.

Achieving an order of magnitude improvement in performance is no mean feat, but by providing simple building blocks the unix environment makes it easy.

When applied to my full dataset, the run time fell from an estimated 7 hours to around 10 minutes in total; a 42x improvement in performance.

The solution is embarrassingly parallel; my workstation has 4 cores, but an 8-core or 16-core box would increase the performance further.

The unix way of combining simple, best-of-breed tools makes for a deeply expressive environment for processing data.

Published: 30 September 2013
