High-Throughput Data-Intensive Computing: Shared-Scan Scheduling in Scientific Databases & the Cloud

Subscribers: 344,000
Video Link: https://www.youtube.com/watch?v=NwwWnryd5Z8



Duration: 1:00:30
189 views


Data-intensive computing consists of batch-processing workloads that scan massive data sets in parallel. Because the emphasis is on data access, data movement, data ingest, and data production, these workloads overwhelm the network and I/O capabilities of data centers and supercomputers. Major throughput improvements are available by co-scheduling tasks that access the same data, so that multiple tasks complete their processing while the data is accessed and transferred only once. Multiple tasks share I/O, network data transfer, cache space, and even computation via SIMD or vector processing. This talk will review the evolution of co-scheduling in data-intensive computing systems, including shared-scan scheduling for map/reduce workloads (Agrawal et al., VLDB 2008), data-driven batch processing for scientific databases (LifeRaft and JAWS), shared streaming I/O for spatial workloads, and shared join processing for Pig programs and Nova workflows.
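
The core idea can be illustrated with a minimal sketch: group pending tasks by the data blocks they need, read each block once, and fan the data out to every task that wants it. The names Task, shared_scan, and read_block below are hypothetical and for illustration only; they are not the API of any of the systems mentioned above.

# A minimal sketch of shared-scan scheduling; Task, shared_scan, and
# read_block are hypothetical names, not any system's actual API.
from collections import defaultdict

class Task:
    """A batch task that must scan a declared set of data blocks."""
    def __init__(self, name, needed_blocks, fn):
        self.name = name
        self.needed_blocks = set(needed_blocks)  # block ids this task must read
        self.fn = fn                             # per-block processing callback

    def process(self, block_id, data):
        self.fn(block_id, data)

def shared_scan(tasks, read_block):
    """Read each requested block once and fan it out to every interested task,
    rather than letting each task issue its own redundant scan."""
    interested = defaultdict(list)               # block id -> tasks that need it
    for task in tasks:
        for block_id in task.needed_blocks:
            interested[block_id].append(task)

    for block_id, consumers in interested.items():
        data = read_block(block_id)              # I/O and transfer happen once
        for task in consumers:
            task.process(block_id, data)         # compute amortized over tasks

# Example: block 1 is needed by both tasks but is read from "storage" only once.
if __name__ == "__main__":
    storage = {0: b"alpha", 1: b"beta", 2: b"gamma"}
    counts = defaultdict(int)

    def count_bytes_into(key):
        def fn(block_id, data):
            counts[key] += len(data)
        return fn

    shared_scan(
        [Task("scan-A", [0, 1], count_bytes_into("scan-A")),
         Task("scan-B", [1, 2], count_bytes_into("scan-B"))],
        storage.__getitem__,
    )
    print(dict(counts))  # {'scan-A': 9, 'scan-B': 9}

In a real system the read_block call would stand in for disk or network I/O, which is exactly the resource the co-scheduling techniques in the talk aim to amortize across tasks.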


Tags:
microsoft research