Transfer and Multi-task Learning: Statistical Insights for Modern Data Challenges
Presented by Ye Tian
Department of Statistics
Columbia University
* Candidate for faculty position in the Division of Biostatistics and Health Data Science
Knowledge transfer, a core human ability, has inspired numerous data integration methods in machine learning and statistics. However, data integration faces significant challenges: (1) unknown similarity between data sources; (2) data contamination; (3) high dimensionality; and (4) privacy constraints. This talk addresses these challenges across diverse contexts, presenting both innovative statistical methodologies and theoretical insights.
In Part I, I will introduce a transfer learning framework for high-dimensional generalized linear models that combines a pre-trained Lasso with a fine-tuning step. We provide theoretical guarantees for both estimation and inference, and apply the methods to predict county-level outcomes of the 2020 U.S. presidential election, discussing the insights the analysis reveals.
In Part II, I will explore an unsupervised learning setting where task-specific data are generated from a mixture model with heterogeneous mixture proportions. This complements the supervised setting of Part I, addressing scenarios where labeled data are unavailable. We propose a federated gradient EM algorithm that is communication-efficient and privacy-preserving, and provide estimation error bounds for the mixture model parameters. We demonstrate the method's effectiveness via applications to handwritten digit clustering.
In Part III, I will present a representation-based multi-task learning framework that diverges from the distance-based similarity notion explored in Parts I and II. This framework is connected to modern applications in representation learning for image classification and natural language processing. We establish theoretical results on the fundamental limits of representation-based multi-task learning under conditions of representation heterogeneity and task-level data contamination, offering novel insights into their impact on learning performance.
Finally, I will summarize the talk and briefly introduce my broader research contributions, including robust transfer learning, imbalanced classification error control, and advancements in high-dimensional statistics. The three main sections of this talk are based on a series of papers and a short course I co-taught at NESS 2024. More about me and my research can be found at https://yet123.com.
A seminar tea will be held at 2:45 p.m. in University Office Plaza, Room 240. All are welcome.