But that was not enough they tried with 1000 times more data. 1PB.
“It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers.”
The method they use is MapReduce that makes it possible to run multiple processes simultaneously on multiple computers.
Programming model for MapReduce
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
- Processes input key/value pair
- Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by similar primitives in LISP and other languages.
- Combines all intermediate values for a particular key
- Produces a set of merged output values (usually just one)