
ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION

Diederik P. Kingma (University of Amsterdam) and Jimmy Lei Ba (University of Toronto), 2014

Paper Information
arXiv ID
1412.6980
Venue
International Conference on Learning Representations
Domain
Machine learning
SOTA Claim
Yes
Reproducibility
8/10

Abstract

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.

Summary

The paper introduces Adam, an efficient algorithm for stochastic optimization based on adaptive estimates of lower-order moments of the gradients. Adam combines the advantages of AdaGrad (which handles sparse gradients well) and RMSProp (which handles non-stationary objectives well), making it suitable for large-scale machine learning problems with high-dimensional parameter spaces and noisy objectives. The algorithm requires minimal memory, is easy to implement, and shows competitive convergence properties compared to existing methods. The authors analyze the theoretical convergence properties and empirical performance of Adam across various models and datasets, demonstrating its effectiveness, particularly on problems with noisy, high-dimensional data.
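
The update rule itself is compact. Below is a minimal NumPy sketch of Algorithm 1 from the paper, using the default hyper-parameters it recommends (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8); the noisy quadratic objective at the end is purely an illustrative stand-in, not an experiment from the paper.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    """Minimal sketch of Adam (Algorithm 1 of Kingma & Ba, 2014)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # 1st moment estimate (mean of gradients)
    v = np.zeros_like(theta)  # 2nd moment estimate (uncentered variance)
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                    # stochastic gradient at step t
        m = beta1 * m + (1 - beta1) * g       # update biased 1st moment
        v = beta2 * v + (1 - beta2) * g * g   # update biased 2nd moment
        m_hat = m / (1 - beta1 ** t)          # bias-corrected 1st moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected 2nd moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Illustrative use on a noisy quadratic objective (not from the paper):
rng = np.random.default_rng(0)
grad = lambda th: 2 * (th - 3.0) + rng.normal(scale=0.1, size=th.shape)
print(adam(grad, np.zeros(5)))  # converges toward the minimizer at 3.0
```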

Methods

This paper employs the following methods:

  • Adam
  • SGD
  • AdaGrad
  • RMSProp

Models Used

  • Logistic Regression
  • Multi-layer Neural Networks
  • Convolutional Neural Networks

Datasets

The following datasets were used in this research:

  • MNIST
  • IMDB movie reviews

Evaluation Metrics

  • Accuracy
  • Convergence Rate
  • Regret Bound (formal definition sketched after this list)
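
For reference, the regret analyzed in the paper's online convex optimization setting is the cumulative gap to the best fixed parameter chosen in hindsight; the paper proves it grows as O(√T), so the average regret vanishes. A sketch of the definitions, with notation following the paper:

```latex
% Regret over T rounds against the best fixed parameter chosen in hindsight
R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^{*}) \right],
\qquad
\theta^{*} = \arg\min_{\theta \in \mathcal{X}} \sum_{t=1}^{T} f_t(\theta)

% The paper proves R(T) = O(\sqrt{T}) for convex f_t,
% so the average regret R(T)/T = O(1/\sqrt{T}) vanishes as T grows.
```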

Results

  • Adam consistently outperforms other stochastic optimization methods for various models and datasets.
  • Adam yields similar convergence to SGD and converges faster than AdaGrad in the logistic regression experiments (a rough reproduction sketch follows this list).
  • In multi-layer networks and CNNs, Adam shows better convergence than other methods, particularly in the presence of dropout noise.
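
As a rough illustration only, the logistic regression comparison on MNIST could be reproduced along the following lines with PyTorch's built-in optimizers; the learning rates, batch size, and epoch count here are illustrative placeholders, not the hyper-parameters tuned in the paper.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

# Flatten each 28x28 MNIST image into a 784-dim vector for logistic regression.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1)),
])
train_set = torchvision.datasets.MNIST(root="./data", train=True,
                                       download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

def train_with(name):
    """Train a softmax (multiclass logistic regression) model with one optimizer."""
    model = nn.Linear(784, 10)
    if name == "adam":
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    elif name == "adagrad":
        opt = torch.optim.Adagrad(model.parameters(), lr=1e-2)
    else:  # SGD with Nesterov momentum
        opt = torch.optim.SGD(model.parameters(), lr=1e-2,
                              momentum=0.9, nesterov=True)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(3):  # epoch count is illustrative
        running = 0.0
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            running += loss.item() * x.size(0)
        print(f"{name} epoch {epoch}: train loss {running / len(train_set):.4f}")

for name in ("adam", "adagrad", "sgd"):
    train_with(name)
```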

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Adam, Stochastic optimization, Gradient descent, Adaptive optimization, Deep learning
