Preserving Statistical Validity in Adaptive Data Analysis
Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth
[arXiv]
A great deal of effort has been devoted to reducing the risk of spurious
scientific discoveries, from the use of sophisticated validation
techniques, to deep statistical methods for controlling the false discovery
rate in multiple hypothesis testing. However, there is a fundamental
disconnect between the theoretical results and the practice of data
analysis: the theory of statistical inference assumes a fixed collection of
hypotheses to be tested, or learning algorithms to be applied, selected
non-adaptively before the data are gathered, whereas in practice data are
shared and reused, with hypotheses and new analyses generated on the
basis of data exploration and the outcomes of previous analyses.
In this work we initiate a principled study of how to
guarantee the validity of statistical inference in adaptive data analysis. As
an instance of this problem, we propose and investigate the question of estimating the
expectations of m adaptively chosen functions on an unknown
distribution, given n random samples.
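To make the setting concrete, here is a minimal sketch in Python. The names and the naive mechanism are our illustration, not notation from the paper; the point is that each query may depend on the answers to the previous ones, which is exactly what breaks the classical guarantees.

    import numpy as np

    # Illustrative sketch of the adaptive setting: an analyst submits m
    # queries phi_1, ..., phi_m, each a function mapping a sample point
    # into [0, 1], and query phi_i may be chosen after seeing the answers
    # to phi_1, ..., phi_{i-1}. The mechanism must estimate E_P[phi_i]
    # from one fixed sample of n draws from the unknown distribution P.

    def naive_mechanism(sample):
        """Answer each query with its plain empirical mean on the sample.

        This is the standard estimator referred to in the abstract: under
        adaptive querying it stays accurate only for roughly a linear
        (in n) number of queries before it overfits the sample.
        """
        def answer(phi):
            return float(np.mean([phi(x) for x in sample]))
        return answer

    # Hypothetical usage: estimate E[1{x > 0}] under a standard normal P.
    rng = np.random.default_rng(0)
    sample = rng.normal(size=1000)
    estimate = naive_mechanism(sample)(lambda x: float(x > 0))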
We show that, surprisingly, it is possible to accurately estimate a
number of expectations that is exponential in n, even when the functions
are chosen adaptively. This gives an exponential improvement over standard
empirical estimators, which are limited to a linear number of estimates.
Our result follows from a general technique that, counter-intuitively, involves
actively perturbing and coordinating the estimates,
using techniques developed for privacy preservation.
We also give additional applications of this technique to our question.
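The connection to privacy can be made concrete with a small sketch. The snippet below answers each query with its empirical mean plus Laplace noise, in the style of the Laplace mechanism from differential privacy; the noise calibration is only an illustration of the perturbation idea, not the paper's construction or its parameter choices.

    import numpy as np

    def perturbed_mechanism(sample, epsilon=0.1, seed=0):
        """Answer [0,1]-valued queries with Laplace-perturbed empirical means.

        This follows the standard Laplace mechanism: the empirical mean of
        a [0,1]-valued query changes by at most 1/n when a single sample
        point changes, so Laplace noise of scale 1/(epsilon * n) gives
        epsilon-differential privacy per query. The calibration here is
        illustrative, not the paper's parameter setting.
        """
        n = len(sample)
        rng = np.random.default_rng(seed)
        def answer(phi):
            empirical = float(np.mean([phi(x) for x in sample]))
            return empirical + rng.laplace(scale=1.0 / (epsilon * n))
        return answer

Because the noisy answers reveal little about any individual sample point, an adaptively chosen next query cannot overfit to the sample the way it can against exact empirical means; this is the flavor of the perturbation-based technique the abstract describes.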