Sampling Techniques for Massive Data

Video Link: https://www.youtube.com/watch?v=pU9QC75uUMY



Duration: 49:51


Google Tech Talks
March 27, 2007

ABSTRACT

Consider a giant data matrix A with N rows and D columns. At Web scale, both N and D can be on the order of billions. In applications such as duplicate document detection, word associations, databases, nearest-neighbor search, and kernel computation (e.g., for SVMs), it is often desirable to store only a very small fraction (a sample) of the data, small enough to fit in physical memory, so that summary statistics (e.g., L1 or L2 distances) can be computed quickly. Because the data are often highly sparse, conventional sampling methods (i.e., randomly selecting a few columns of the data matrix) do not work well. Two sampling methods, conditional random sampling (CRS) and stable random projections (SRP),...
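As a rough illustration of the random-projection idea named in the abstract, the sketch below estimates a squared L2 distance between two sparse rows from short Gaussian sketches. It covers only the normal (2-stable) case and is not the talk's own estimator, which the abstract does not give. The function names (`project`, `estimate_sq_l2`, `exact_sq_l2`), the sparse dict representation, the seed scheme, and the toy data are all assumptions made for this example.

```python
import random


def project(vec, k, seed):
    """Project a sparse row {column_index: value} onto k Gaussian directions.

    The D x k projection matrix (entries i.i.d. N(0, 1)) is never stored.
    Its row for column j is regenerated from (seed, j) on demand, so every
    data row sees the same projection matrix. Assumes column_index < 1_000_003
    so that the derived seeds do not collide.
    """
    out = [0.0] * k
    for j, value in vec.items():
        row_rng = random.Random(seed * 1_000_003 + j)
        for i in range(k):
            out[i] += value * row_rng.gauss(0.0, 1.0)
    return out


def estimate_sq_l2(sketch_u, sketch_v):
    """Estimate ||u - v||^2 as the mean squared difference of two sketches."""
    k = len(sketch_u)
    return sum((a - b) ** 2 for a, b in zip(sketch_u, sketch_v)) / k


def exact_sq_l2(u, v):
    """Exact squared L2 distance between two sparse rows, for comparison."""
    keys = set(u) | set(v)
    return sum((u.get(j, 0.0) - v.get(j, 0.0)) ** 2 for j in keys)


# Toy data (hypothetical): two sparse rows over D = 1,000,000 columns.
D = 1_000_000
data_rng = random.Random(0)
u = {data_rng.randrange(D): data_rng.uniform(1.0, 5.0) for _ in range(50)}
v = dict(u)
for j in list(v)[:10]:
    v[j] += 1.0  # perturb some shared entries
for _ in range(20):
    v[data_rng.randrange(D)] = data_rng.uniform(1.0, 5.0)  # add new entries

k = 1000  # sketch length, far smaller than D
exact = exact_sq_l2(u, v)
estimate = estimate_sq_l2(project(u, k, seed=1), project(v, k, seed=1))
print(f"exact = {exact:.2f}, estimate from k={k} projections = {estimate:.2f}")
```

Each row is reduced to k numbers regardless of D, and the relative standard deviation of this estimator is about sqrt(2/k), so a longer sketch trades memory for accuracy. Estimating L1 distances this way would need Cauchy (1-stable) projections and a different estimator, which this sketch does not attempt.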







Tags: google, howto, sampling, techniques, massive, data