Risk Convergence and Algorithmic Regularization of Discrete-Stepsize (Stochastic) Gradient Descent

Video Link: https://www.youtube.com/watch?v=GBvupXn0CJw



Duration: 15:55


Jingfeng Wu (UC Berkeley)
https://simons.berkeley.edu/talks/jingfeng-wu-uc-berkeley-2023-09-08
Meet the Fellows Welcome Event Fall 2023

Gradient descent (GD) and stochastic gradient descent (SGD) are the fundamental algorithms for optimizing machine learning models, particularly in deep learning. However, certain observed behaviors of GD and SGD cannot be fully explained by classical optimization and statistical learning theory. For example, (1) the training loss induced by GD often oscillates locally yet still converges in the long run, and (2) SGD-trained models often generalize well even when the number of training samples is smaller than the number of parameters. I will discuss two new results on the risk convergence and algorithmic regularization effects of GD and SGD:

(1) Large-stepsize GD can minimize the risk in a non-monotonic manner for logistic regression with separable data (see the first sketch below).
(2) Online SGD (and a variant of it) can effectively learn linear regression and a single ReLU neuron in the overparameterized regime (see the second sketch below).
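To make the first phenomenon concrete, here is a minimal NumPy sketch, not taken from the talk or its underlying papers: the data, the stepsize eta = 50, and the iteration count are illustrative choices. It runs constant-stepsize GD on logistic regression with linearly separable data and records the training loss, which with a large stepsize typically oscillates early on yet becomes small over a longer horizon.

```python
# Toy sketch (illustrative only): large-stepsize GD on logistic regression
# with linearly separable data. The training loss is often non-monotonic
# at first but still decreases over the long run.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data: labels are the sign of a ground-truth direction.
n, d = 50, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = np.sign(X @ w_star)                  # y in {-1, +1}, separable by w_star

def logistic_loss(w):
    # Average logistic loss; np.logaddexp(0, z) = log(1 + exp(z)) is stable.
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w):
    s = -y * (X @ w)
    sigma = 0.5 * (1.0 + np.tanh(s / 2.0))   # numerically stable sigmoid(s)
    return (X.T @ (-y * sigma)) / n

eta = 50.0                                # deliberately large constant stepsize
w = np.zeros(d)
losses = []
for t in range(200):
    losses.append(logistic_loss(w))
    w = w - eta * gradient(w)

print("first 10 losses:", np.round(losses[:10], 3))   # often non-monotonic
print("final loss     :", round(losses[-1], 5))       # small in the long run
```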
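For the second result, the sketch below is again only an illustration: the anisotropic covariance, noise level, and stepsize are arbitrary choices and do not come from the talk. It runs one-pass ("online") SGD with a constant stepsize on linear regression where the dimension d exceeds the number of samples n, and reports the excess risk of the last and averaged iterates.

```python
# Toy sketch (illustrative only): one-pass online SGD for overparameterized
# linear regression (d > n); each sample is used exactly once.
import numpy as np

rng = np.random.default_rng(1)

d, n = 500, 200                           # overparameterized: d > n
# Anisotropic covariance: a few directions carry most of the signal.
eigs = 1.0 / (np.arange(1, d + 1) ** 2)
w_star = rng.normal(size=d) * np.sqrt(eigs)
noise_std = 0.1

def sample():
    x = rng.normal(size=d) * np.sqrt(eigs)
    y = x @ w_star + noise_std * rng.normal()
    return x, y

eta = 0.5                                 # constant stepsize
w = np.zeros(d)
w_sum = np.zeros(d)
for t in range(n):                        # one pass over the data stream
    x, y = sample()
    w -= eta * (x @ w - y) * x            # SGD step on one sample's squared loss
    w_sum += w
w_avg = w_sum / n                         # iterate averaging

# Excess risk E[(x^T w - x^T w_star)^2] = (w - w_star)^T Sigma (w - w_star),
# with Sigma = diag(eigs) for the Gaussian design above.
excess = lambda v: np.sum(eigs * (v - w_star) ** 2)
print("excess risk, last iterate    :", round(excess(w), 5))
print("excess risk, averaged iterate:", round(excess(w_avg), 5))
```

Even though the number of samples is well below the dimension, the excess risk of the (averaged) SGD iterate is small here because the signal is concentrated in the large-eigenvalue directions; this is only a toy illustration of the regime the talk studies, not its analysis.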







Tags:
Simons Institute
theoretical computer science
UC Berkeley
Computer Science
Theory of Computation
Theory of Computing
Meet the Fellows Welcome Event Fall 2023
Jingfeng Wu