
Hadoop Designing With Patterns

Filtering Patterns: These don’t change records; they only return part of the data. Examples: Simple Filter, Bloom Filter (more efficient), Sampling, Random Sampling, Top K

Top K

In an RDBMS you would normally sort the data first and then take the top K records. In MapReduce this approach does not work directly, because the data is not sorted and is processed on several machines. Instead, each mapper first finds its own local top K list, without sorting all of its data, and sends that list to a single reducer, which then finds the global top K list.
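The two-stage idea can be sketched in plain Java. This is only a simulation under stated assumptions, not Hadoop code: the class `TopKSketch` and the methods `topK` and `globalTopK` are hypothetical names, each inner list stands in for one mapper's input split, and records are ranked by string length. A bounded min-heap keeps at most K candidates at a time, so no full sort is needed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java simulation of the top-K pattern; no Hadoop classes are used.
class TopKSketch {

    // Ranking criterion (an assumption for this sketch): longer strings rank higher.
    static final Comparator<String> BY_LENGTH = Comparator.comparingInt(String::length);

    // Keep the k longest records seen so far in a min-heap ordered by length.
    // The heap never holds more than k + 1 items, so the input is never fully sorted.
    static List<String> topK(Iterable<String> records, int k) {
        PriorityQueue<String> heap = new PriorityQueue<>(BY_LENGTH);
        for (String record : records) {
            heap.offer(record);
            if (heap.size() > k) {
                heap.poll(); // drop the shortest of the k + 1 candidates
            }
        }
        List<String> result = new ArrayList<>(heap);
        // Sort only the k survivors, longest first, for readable output.
        result.sort(BY_LENGTH.reversed());
        return result;
    }

    // "Mapper" step: each split produces its own local top-K list.
    // "Reducer" step: the local lists are merged and reduced with the same routine.
    static List<String> globalTopK(List<List<String>> splits, int k) {
        List<String> candidates = new ArrayList<>();
        for (List<String> split : splits) {
            candidates.addAll(topK(split, k));
        }
        return topK(candidates, k);
    }
}
```

Because each simulated mapper forwards at most K records, the reducer step handles at most K times the number of splits, however large the input is. Records of equal length may come back in any order.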

Let’s find the top 5 longest posts in our forum!