Venue
USENIX Symposium on Operating Systems Design and Implementation
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
TensorFlow is a large-scale machine learning system designed to facilitate experimentation with new models and training algorithms while optimizing computation across heterogeneous devices, including CPUs, GPUs, and custom TPUs. It employs a dataflow graph model to manage computation and mutable state, enabling efficient distributed execution and strong support for deep neural networks. The paper details TensorFlow's execution model and its flexibility and extensibility across machine learning tasks, and showcases applications such as image classification on the ImageNet dataset and language modeling on the One Billion Word Benchmark.
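To make the dataflow model concrete, here is a minimal sketch in TensorFlow 1.x-style Python (the API generation contemporary with the paper); the variable names, shapes, and values are illustrative assumptions, not taken from the paper. It shows the three ingredients the abstract names: computation nodes, shared state held in a `tf.Variable`, and an operation that mutates that state.

```python
import tensorflow as tf  # TF 1.x-style API, contemporary with the paper

# Graph construction: nodes are operations, edges carry tensors.
x = tf.placeholder(tf.float32, shape=[None, 2], name="x")  # input edge
w = tf.Variable(tf.zeros([2, 1]), name="w")                # mutable shared state
y = tf.matmul(x, w, name="y")                              # computation node

# Operations that mutate state are themselves nodes in the graph.
update_w = tf.assign_add(w, tf.ones([2, 1]))

# Execution: a session maps graph nodes onto available devices and runs them.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0]]}))  # run the compute subgraph
    sess.run(update_w)                               # run the mutation node
```

Because state mutation is just another graph node, a developer can wire in a custom update rule instead of relying on a fixed parameter-server update, which is the flexibility the abstract contrasts with earlier designs.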
This paper employs the following methods:
- Dataflow graph (illustrated in the sketch above)
- Stochastic gradient descent (SGD) (a sketch follows this list)
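A companion sketch of SGD in the same dataflow style: the optimizer adds gradient and variable-update nodes to the graph, and each `sess.run(train_op)` call performs one descent step. The toy linear-regression model, data, and learning rate below are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
import tensorflow as tf  # TF 1.x-style API

# Toy linear-regression model y = x*w + b (illustrative, not from the paper).
x = tf.placeholder(tf.float32, shape=[None, 1])
y_true = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable([[0.0]])
b = tf.Variable([0.0])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y_true))

# minimize() inserts gradient computation and update ops into the graph.
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):               # each run of train_op is one SGD step
        xs = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y_true: 3.0 * xs + 1.0})
    print(sess.run([w, b]))            # should approach w = 3, b = 1
```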
The following datasets were used in this research:
- ImageNet
- One Billion Word Benchmark
The paper reports the following results:
- Achieved a training throughput of 2,300 images per second with the Inception-v3 model on the ImageNet dataset.
- Increased throughput for language modeling when using the sampled softmax technique (see the sketch after this list).
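Sampled softmax approximates a full softmax over a large vocabulary by scoring the true class against only a random sample of negatives, so per-step cost scales with the sample size rather than the vocabulary size. Below is a hedged sketch using `tf.nn.sampled_softmax_loss` from the TF 1.x API; the vocabulary size, hidden dimension, and sample count are illustrative assumptions, not the paper's settings.

```python
import tensorflow as tf  # TF 1.x-style API

vocab_size, hidden_dim, num_sampled = 50000, 512, 512  # illustrative sizes

# Output-layer parameters covering the full vocabulary.
softmax_w = tf.Variable(tf.random_normal([vocab_size, hidden_dim]))
softmax_b = tf.Variable(tf.zeros([vocab_size]))

hidden = tf.placeholder(tf.float32, [None, hidden_dim])  # e.g. LSTM outputs
labels = tf.placeholder(tf.int64, [None, 1])             # true next-word ids

# Score the true class plus num_sampled random negatives instead of all
# vocab_size classes; training cost becomes independent of vocabulary size.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_w, biases=softmax_b,
    labels=labels, inputs=hidden,
    num_sampled=num_sampled, num_classes=vocab_size))
```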
The authors identified the following limitations:
- Synchronous training is vulnerable to stragglers, since one slow worker delays the step for all replicas; the paper mitigates this with backup workers (see the sketch after this list).
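The backup-worker mitigation runs more replicas than the number of gradient contributions required per step, so a single straggler can be ignored without losing synchrony. TF 1.x exposes this pattern through `tf.train.SyncReplicasOptimizer`; the replica counts and learning rate below are illustrative assumptions.

```python
import tensorflow as tf  # TF 1.x-style API

# Wrap a base optimizer; aggregating 4 of 5 replicas leaves 1 backup worker,
# so one straggler per step no longer gates progress.
base_opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
sync_opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=4,  # gradient contributions needed per global step
    total_num_replicas=5)     # replicas actually running (one is backup)

# In a real job each worker would then build:
#   train_op = sync_opt.minimize(loss, global_step=global_step)
# and install sync_opt.make_session_run_hook(is_chief) on its session.
```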
- Number of GPUs per worker: 1
- GPU Type: NVIDIA K40
- Compute Requirements: The Inception-v3 benchmarks run on a cluster of GPU-enabled servers, with each worker using one NVIDIA K40 GPU and 5 Intel Ivy Bridge cores.