About this Event
200 W Packer Ave., Bethlehem, PA 18015
https://engineering.lehigh.edu/ise/news/ise-seminar-series
Speaker: Dr. Satyen Kale, Apple
Seminar Title: Stacking for faster pre-training of LLMs
Abstract: Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this talk, I will describe how stacking works, present some empirical results showcasing its effectiveness in training LLMs, and provide some explanations for the efficacy of stacking. The explanations are from two different perspectives: a heuristic one in which the few-shot learning capability of the LLMs is utilized, and a more theoretical one where stacking can be seen as performing a form of Nesterov's Accelerated Gradient Descent in function space. In the latter setting, I will also present a particular example of training a deep linear residual network where stacking provably provides an accelerated convergence rate.
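The abstract describes stacking as growing a residual network by adding layers initialized from copies of existing ones. As a rough illustration only (not the speaker's actual training recipe), here is a minimal PyTorch-style sketch with hypothetical names (`ResidualBlock`, `stack`) showing how depth can be doubled by duplicating the current blocks' parameters:

```python
# Illustrative sketch of the stacking heuristic: grow a residual network
# by appending copies of existing blocks, so new layers start from the
# older layers' parameters. Names and architecture here are hypothetical.
import copy
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One residual block: x + MLP(x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.mlp(x)


def stack(blocks: nn.ModuleList) -> nn.ModuleList:
    """Double the depth by appending deep copies of the current blocks."""
    new_blocks = [copy.deepcopy(b) for b in blocks]
    return nn.ModuleList(list(blocks) + new_blocks)


# Usage: train the shallow model, stack, then continue training the deeper one.
dim = 16
blocks = nn.ModuleList([ResidualBlock(dim) for _ in range(2)])
# ... train the 2-block model here ...
blocks = stack(blocks)  # now 4 blocks, the new ones initialized from the old
x = torch.randn(4, dim)
out = x
for block in blocks:
    out = block(out)
print(out.shape)  # torch.Size([4, 16])
```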
Bio: Satyen Kale is a research scientist at Apple working in the New York office. His current research focuses on the design of efficient and practical optimization algorithms for machine learning across several areas (LLMs, federated learning, differential privacy, etc.). His research has been recognized with several awards: a best paper award at ICML 2015, a best paper award at ICLR 2018, and a best student paper award at COLT 2018. He was a program chair of COLT 2017 and ALT 2019, and serves as an Associate Editor for Mathematics of Operations Research.