Presented by Anna Neufeld
Assistant Professor of Statistics
Williams College
While classical statistical methods are designed for testing hypotheses about pre-specified models, the reality of modern science is that analysts often explore their data before coming up with models and hypotheses of interest. We refer to the practice of using the same data to generate and then test a hypothesis, or to fit and then evaluate a model, as double dipping. Problems arise when standard statistical procedures are applied in settings that involve double dipping. Often, we avoid double dipping by splitting our observations into a training set and a test set. While this sample splitting approach is straightforward and easy to understand, it is generally inapplicable in unsupervised settings. Motivated by unsupervised problems that arise in the analysis of single-cell RNA sequencing data, we propose data thinning, an alternative to sample splitting that splits each observation in a dataset into two independent pieces. We show that this method provides an elegant solution to our motivating problems under distributional assumptions and discuss extensions that can be used when those assumptions are not met.
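To give a concrete sense of the idea, the following is a minimal illustrative sketch (not the speaker's implementation) of thinning for Poisson-distributed counts, which often serve as a model for single-cell RNA sequencing data: if X ~ Poisson(lambda) and X_train | X ~ Binomial(X, eps), then X_train and X_test = X - X_train are independent Poisson variables with means eps*lambda and (1 - eps)*lambda. The matrix dimensions and the value of eps below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: a small count matrix standing in for
# single-cell RNA-seq data, with each entry modeled as Poisson.
X = rng.poisson(lam=5.0, size=(100, 20))

eps = 0.5  # fraction of each count routed to the "training" piece

# Poisson thinning: draw X_train | X ~ Binomial(X, eps).
# Then X_train ~ Poisson(eps * lam), X_test ~ Poisson((1 - eps) * lam),
# and the two pieces are independent of each other.
X_train = rng.binomial(X, eps)
X_test = X - X_train
```

Under this sketch, X_train could be used to generate a hypothesis or fit a model (for example, clustering the cells), and X_test could be used to test or evaluate it without double dipping.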
A seminar tea will be held at 2:45 p.m. in University Office Plaza, Room 240.


