2016-11-04

Daniel McDonald, Indiana University

"Approximation-regularization for the analysis of large data sets"

When data sets are extremely large, computational limits can prevent analysis using standard methods like linear regression or principal components analysis (PCA). To address this bottleneck, approximation techniques (variously referred to as "sketching", "preconditioning", or "compression") try to minimize the information loss relative to the unbiased full-data solution subject to computational constraints. However, because the full-data solution is unbiased, it tends to have large variance and therefore may not give the best results in terms of mean squared error. In this work we take a different approach: approximate solutions can be better than full-data solutions, because they actually decrease variance in some situations. We examine approximation techniques for least squares regression, discuss extensions to PCA and PCA regression, and demonstrate our results on genetics and astronomy data.
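To illustrate the kind of approximation the abstract describes, here is a minimal sketch of "compressed" least squares using a Gaussian sketching matrix. This is a generic example of the sketching idea, not the speaker's specific approximation-regularization method; the data, dimensions, and sketch size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, p predictors (illustrative sizes).
n, p = 10_000, 20
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# Full-data (unbiased) OLS solution.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sketched solution: project (X, y) down to m << n rows with a random
# Gaussian matrix S, then solve the much smaller least-squares problem.
m = 500
S = rng.standard_normal((m, n)) / np.sqrt(m)
beta_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

# The sketched problem has m rows instead of n, so it is far cheaper to
# solve, at the cost of some extra estimation error.
print(np.linalg.norm(beta_full - beta))
print(np.linalg.norm(beta_sketch - beta))
```

The sketched estimator is generally noisier than the full-data one, but as the abstract argues, in some regimes the implicit regularization from compression can reduce variance enough to improve mean squared error.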
