Saturday, November 22, 2008

How to sort 1TB of data in 68 seconds or even 1 PB

“We are excited to announce we were able to sort 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds.” – Google Blog

But that was not enough they tried with 1000 times more data. 1PB.

“It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers.”

The method they use is MapReduce that makes it possible to run multiple processes simultaneously on multiple computers.

Programming model for MapReduce

Input & Output: each a set of key/value pairs

Programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)

  • Processes input key/value pair
  • Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
  • Combines all intermediate values for a particular key
  • Produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other languages.

Some of the latest blog posts

Subscribe to RSS headline updates from:
Powered by FeedBurner