Venue
Knowledge Discovery and Data Mining
Domain
machine learning, data mining, computer science
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
This paper presents XGBoost, a scalable machine learning system for tree boosting that achieves state-of-the-art results across various machine learning challenges. The authors highlight key innovations including a sparsity-aware algorithm, a weighted quantile sketch for approximate tree learning, and optimizations for cache access patterns, data compression, and sharding, allowing XGBoost to handle billions of examples efficiently. Extensive evaluations demonstrate that XGBoost outperforms existing methods in speed and scalability. The paper also details the architecture, algorithms, and experimental results, showcasing XGBoost's effectiveness in practical applications such as insurance claim prediction, particle physics event classification, and ranking tasks.
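To make the additive-training idea behind tree boosting concrete, here is a minimal sketch using depth-1 "stump" learners and squared-error loss. All function names are illustrative, not from the paper, and the sketch omits what XGBoost actually contributes: regularized deep trees, second-order gradients, sparsity handling, and parallel split finding.

```python
# Minimal gradient boosting with decision stumps on squared-error loss.
# Illustrative only: real XGBoost grows regularized trees using
# second-order gradient statistics and scalable split finding.

def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error of the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, rounds=20, eta=0.5):
    """Additive training: each new stump fits the current residuals."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + eta * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(eta * s(x) for s in stumps)

# Toy 1-D regression data: a step function.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(xs, ys)
```

Each round shrinks the remaining residual by the learning rate `eta`, so the ensemble's prediction converges toward the targets as more rounds are added.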
This paper employs the following methods:
- Gradient tree boosting
- Sparsity-aware split finding for sparse data
- Weighted quantile sketch for approximate tree learning
- Cache-aware access, block compression, and sharding for out-of-core computation
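The weighted quantile step can be pictured as choosing split candidates so that consecutive candidates are at most ε apart in normalized weighted rank. The toy below computes this exactly in one pass over sorted data; the paper's actual contribution is a mergeable, prunable approximate sketch with provable error bounds (with second-order gradient statistics as the weights), which this simplification does not implement.

```python
# Toy weighted quantile candidate selection: pick cut points so that
# consecutive candidates differ by at most eps in normalized weighted rank.
# Assumes positive weights. XGBoost instead maintains an approximate,
# mergeable sketch suitable for distributed and out-of-core settings.

def weighted_quantile_candidates(values, weights, eps):
    total = sum(weights)
    pairs = sorted(zip(values, weights))
    candidates = [pairs[0][0]]          # always keep the minimum value
    rank, last_rank = 0.0, 0.0
    for v, w in pairs:
        rank += w
        if (rank - last_rank) / total >= eps:
            candidates.append(v)
            last_rank = rank
    if candidates[-1] != pairs[-1][0]:  # always keep the maximum value
        candidates.append(pairs[-1][0])
    return candidates
```

With uniform weights this reduces to ordinary ε-approximate quantiles; non-uniform weights shift candidates toward regions where the weighted mass concentrates.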
The following datasets were used in this research:
- Allstate
- Higgs
- Yahoo! learning to rank
- Criteo
The paper reports the following results:
- XGBoost runs more than ten times faster than existing popular solutions on a single machine.
- Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost.
- Every winning team in KDD Cup 2015 used XGBoost.
The authors identified the following limitations:
- None explicitly stated.
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords:
- XGBoost
- tree boosting
- large-scale machine learning
- scalability
- distributed learning
- sparsity-aware algorithms
- quantile sketch
- out-of-core computation