
ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION

Diederik P. Kingma (University of Amsterdam) and Jimmy Lei Ba (University of Toronto), 2014

Paper Information
arXiv ID
1412.6980
Venue
International Conference on Learning Representations
Domain
Machine learning
SOTA Claim
Yes
Reproducibility
8/10

Abstract

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.

Summary

The paper introduces Adam, an efficient algorithm for stochastic optimization based on adaptive estimates of lower-order moments of the gradients. Adam combines the advantages of AdaGrad (which handles sparse gradients well) and RMSProp (which handles non-stationary objectives well), making it suitable for large-scale machine learning problems with high-dimensional parameter spaces and noisy objectives. The algorithm requires minimal memory, is easy to implement, and shows competitive convergence properties compared to existing methods. The authors analyze the theoretical convergence properties and empirical performance of Adam across various models and datasets, demonstrating its effectiveness, particularly on problems with noisy, high-dimensional data.
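
The update rule itself is compact. Below is a minimal NumPy sketch of Algorithm 1 from the paper, using the default hyper-parameters it recommends (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8); the noisy quadratic objective at the end is purely an illustrative stand-in, not an experiment from the paper.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    """Minimal sketch of Adam (Algorithm 1 of Kingma & Ba, 2014)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # 1st moment estimate (mean of gradients)
    v = np.zeros_like(theta)  # 2nd moment estimate (uncentered variance)
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                    # stochastic gradient at step t
        m = beta1 * m + (1 - beta1) * g       # update biased 1st moment
        v = beta2 * v + (1 - beta2) * g * g   # update biased 2nd moment
        m_hat = m / (1 - beta1 ** t)          # bias-corrected 1st moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected 2nd moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Illustrative use on a noisy quadratic objective (not from the paper):
rng = np.random.default_rng(0)
grad = lambda th: 2 * (th - 3.0) + rng.normal(scale=0.1, size=th.shape)
print(adam(grad, np.zeros(5)))  # converges toward the minimizer at 3.0
```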

Methods

This paper employs the following methods:

  • Adam
  • SGD
  • AdaGrad
  • RMSProp

Models Used

  • Logistic Regression
  • Multi-layer Neural Networks
  • Convolutional Neural Networks

Datasets

The following datasets were used in this research:

  • MNIST
  • IMDB movie reviews

Evaluation Metrics

  • Accuracy
  • Convergence Rate
  • Regret Bound (formal definition sketched after this list)
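
For reference, the regret analyzed in the paper's online convex optimization setting is the cumulative gap to the best fixed parameter chosen in hindsight; the paper proves it grows as O(√T), so the average regret vanishes. A sketch of the definitions, with notation following the paper:

```latex
% Regret over T rounds against the best fixed parameter chosen in hindsight
R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^{*}) \right],
\qquad
\theta^{*} = \arg\min_{\theta \in \mathcal{X}} \sum_{t=1}^{T} f_t(\theta)

% The paper proves R(T) = O(\sqrt{T}) for convex f_t,
% so the average regret R(T)/T = O(1/\sqrt{T}) vanishes as T grows.
```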

Results

  • Adam consistently outperforms other stochastic optimization methods for various models and datasets.
  • Adam yields similar convergence to SGD and converges faster than AdaGrad in the logistic regression experiments (a rough reproduction sketch follows this list).
  • In multi-layer networks and CNNs, Adam shows better convergence than other methods, particularly in the presence of dropout noise.
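
As a rough illustration only, the logistic regression comparison on MNIST could be reproduced along the following lines with PyTorch's built-in optimizers; the learning rates, batch size, and epoch count here are illustrative placeholders, not the hyper-parameters tuned in the paper.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

# Flatten each 28x28 MNIST image into a 784-dim vector for logistic regression.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1)),
])
train_set = torchvision.datasets.MNIST(root="./data", train=True,
                                       download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

def train_with(name):
    """Train a softmax (multiclass logistic regression) model with one optimizer."""
    model = nn.Linear(784, 10)
    if name == "adam":
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    elif name == "adagrad":
        opt = torch.optim.Adagrad(model.parameters(), lr=1e-2)
    else:  # SGD with Nesterov momentum
        opt = torch.optim.SGD(model.parameters(), lr=1e-2,
                              momentum=0.9, nesterov=True)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(3):  # epoch count is illustrative
        running = 0.0
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            running += loss.item() * x.size(0)
        print(f"{name} epoch {epoch}: train loss {running / len(train_set):.4f}")

for name in ("adam", "adagrad", "sgd"):
    train_with(name)
```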

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Adam, Stochastic optimization, Gradient descent, Adaptive optimization, Deep learning
