Near-Optimal Parallel Join Processing in MapReduce

Video Link: https://www.youtube.com/watch?v=kiuUGXWRzPA



Duration: 57:00


Google Tech Talk (more info below)
May 5, 2011

Presented by Dr. Mirek Riedewald, Associate Professor, College of Computer and Information Science, Northeastern University. http://www.ccs.neu.edu/home/mirek/

ABSTRACT

As the amount and complexity of data in many fields increase rapidly, new tools are needed for exploratory analysis and scientific discovery. Our Scolopax system aims to address these challenges with novel techniques for large-scale parallel data management. In this talk, we will present an overview of Scolopax and then focus on parallel processing of joins. Joins combine information across data sets, e.g., to discover correlations. Our proposed join model simplifies reasoning about how to assign computation tasks to processors in MapReduce and other parallel environments. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins in a single MapReduce job. This algorithm requires only minimal statistics (input cardinality), and we provide proofs and strong evidence that, for a variety of join problems, its latency is either close to optimal or the best realizable option. For some popular joins, we show how to improve over 1-Bucket-Theta by exploiting additional input statistics. Most of these results will appear at SIGMOD 2011.
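
The abstract does not spell out how 1-Bucket-Theta works, so the following is only a rough in-memory sketch of the general idea of randomized join-matrix partitioning, not the authors' implementation. It assumes the cross product of the two inputs is viewed as a matrix cut into a grid of regions, one region per reducer. The function names `choose_grid` and `one_bucket_theta_sketch`, the grid-shape heuristic, and the single-process simulation of the map and reduce phases are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def choose_grid(n_s, n_t, reducers):
    """Hypothetical heuristic: pick a rows x cols grid (rows * cols <= reducers)
    that minimizes the expected input per region, n_s / rows + n_t / cols.
    Only the input cardinalities are needed."""
    best = None
    for rows in range(1, reducers + 1):
        cols = reducers // rows
        cost = n_s / rows + n_t / cols
        if best is None or cost < best[0]:
            best = (cost, rows, cols)
    return best[1], best[2]


def one_bucket_theta_sketch(S, T, theta, reducers, seed=0):
    """Simulate one map/reduce round for an arbitrary join predicate theta."""
    rng = random.Random(seed)
    rows, cols = choose_grid(len(S), len(T), reducers)

    # "Map" phase: each S tuple picks a random row band and is replicated to
    # every region in that band; each T tuple does the same with a column band.
    regions = defaultdict(lambda: ([], []))
    for s in S:
        r = rng.randrange(rows)
        for c in range(cols):
            regions[(r, c)][0].append(s)
    for t in T:
        c = rng.randrange(cols)
        for r in range(rows):
            regions[(r, c)][1].append(t)

    # "Reduce" phase: each region evaluates the predicate on its local pairs.
    # A pair (s, t) meets in exactly one region (s's row band, t's column band),
    # so every matching pair is produced exactly once.
    output = []
    for s_part, t_part in regions.values():
        for s in s_part:
            for t in t_part:
                if theta(s, t):
                    output.append((s, t))
    return output


if __name__ == "__main__":
    S = list(range(20))
    T = list(range(15))
    result = one_bucket_theta_sketch(S, T, lambda s, t: s < t, reducers=6)
    print(len(result), "matching pairs")
```

Because tuples are assigned to bands at random, the sketch never inspects join-attribute values, which is why it works for any predicate and needs nothing beyond the two input sizes. The price is replication: each S tuple is copied to every region in its row band, and each T tuple to every region in its column band.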







Tags:
google tech talk
mapreduce
database
data management