
XGBoost: A Scalable Tree Boosting System

Tianqi Chen and Carlos Guestrin, University of Washington (2016)

Paper Information

arXiv ID: 1603.02754
Venue: Knowledge Discovery and Data Mining (KDD 2016)
Domain: machine learning, data mining, computer science
SOTA Claim: Yes
Code: https://github.com/dmlc/xgboost
Reproducibility: 8/10

Abstract

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Keywords: Large-scale Machine Learning
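
The sparsity-aware split finding mentioned in the abstract enumerates only the non-missing entries of a feature and evaluates both default directions for missing values (send them left or send them right), keeping whichever gives the higher gain. Below is a minimal NumPy sketch of that idea, assuming the paper's split-gain formula; the function names, the `lam`/`gamma` defaults, and the midpoint thresholds are illustrative assumptions, not the library's internals.

```python
import numpy as np

def split_gain(G_L, H_L, G_total, H_total, lam=1.0, gamma=0.0):
    """Gain of a candidate split under the paper's regularized objective
    (G/H are sums of first/second-order gradients of the loss)."""
    G_R, H_R = G_total - G_L, H_total - H_L
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_total, H_total)) - gamma

def sparsity_aware_best_split(x, g, h, lam=1.0, gamma=0.0):
    """Sketch of sparsity-aware split finding for a single feature x
    (NaN marks missing values). Returns (gain, threshold, default_direction)."""
    present = ~np.isnan(x)
    G_total, H_total = g.sum(), h.sum()
    order = np.argsort(x[present])
    xs, gs, hs = x[present][order], g[present][order], h[present][order]

    best = (-np.inf, None, None)
    # Pass 1: missing values default to the right child; scan ascending and
    # accumulate the statistics of the non-missing left partition.
    G_L = H_L = 0.0
    for i in range(len(xs) - 1):
        G_L += gs[i]; H_L += hs[i]
        if xs[i] == xs[i + 1]:
            continue  # cannot split between equal feature values
        gain = split_gain(G_L, H_L, G_total, H_total, lam, gamma)
        if gain > best[0]:
            best = (gain, (xs[i] + xs[i + 1]) / 2, "right")
    # Pass 2: missing values default to the left child; scan descending and
    # accumulate the statistics of the non-missing right partition.
    G_R = H_R = 0.0
    for i in range(len(xs) - 1, 0, -1):
        G_R += gs[i]; H_R += hs[i]
        if xs[i] == xs[i - 1]:
            continue
        gain = split_gain(G_total - G_R, H_total - H_R, G_total, H_total, lam, gamma)
        if gain > best[0]:
            best = (gain, (xs[i] + xs[i - 1]) / 2, "left")
    return best
```

Because both passes touch only the non-missing entries, the cost of split finding grows with the number of present values rather than the full column length, which is where the paper's speedup on sparse data comes from.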

Summary

This paper presents XGBoost, a scalable machine learning system for tree boosting that achieves state-of-the-art results across various machine learning challenges. The authors highlight key innovations including a sparsity-aware algorithm, a weighted quantile sketch for approximate tree learning, and optimizations for cache access patterns, data compression, and sharding, allowing XGBoost to handle billions of examples efficiently. Extensive evaluations demonstrate that XGBoost outperforms existing methods in speed and scalability. The paper also details the architecture, algorithms, and experimental results, showcasing XGBoost's effectiveness in practical applications such as insurance claim prediction, particle physics event classification, and ranking tasks.
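
For context, here is a minimal usage sketch with the open-source Python package; the toy data, parameter values, and number of boosting rounds are illustrative assumptions rather than settings from the paper's experiments.

```python
import numpy as np
import xgboost as xgb

# Toy binary-classification data; NaN marks missing entries, which XGBoost
# routes along a learned default direction at each split (sparsity-awareness).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[rng.random(X.shape) < 0.3] = np.nan
y = (rng.random(1000) > 0.5).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "eta": 0.1,               # shrinkage (learning rate)
    "max_depth": 6,
    "lambda": 1.0,            # L2 regularization on leaf weights
    "tree_method": "approx",  # quantile-sketch-based approximate split finding
    "eval_metric": "auc",
}
bst = xgb.train(params, dtrain, num_boost_round=100)
preds = bst.predict(dtrain)
```

Setting `tree_method` to `"approx"` selects the approximate split finding built on the weighted quantile sketch, while missing values in the input are handled by the learned default directions described above.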

Methods

This paper employs the following methods:

  • Gradient Tree Boosting (a regularized objective optimized via a second-order approximation; see the formula sketch below)
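
A brief formula sketch of that objective, following the paper's notation (T is the number of leaves, w the vector of leaf weights, g_i and h_i the first- and second-order gradients of the loss, and I, I_L, I_R the instance sets of a node and its candidate children):

```latex
% Regularized objective over the K additive trees f_k
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

% Second-order approximation solved at boosting round t
\tilde{\mathcal{L}}^{(t)} \simeq \sum_i \left[ g_i f_t(\mathbf{x}_i)
  + \tfrac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)

% Optimal leaf weight and the gain used to score a candidate split
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
\qquad
\mathcal{L}_{\mathrm{split}} = \tfrac{1}{2}\left[
  \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda}
  + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda}
  - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
```

Each tree is grown greedily by taking the split with the largest gain; gamma penalizes the number of leaves and lambda the magnitude of the leaf weights.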

Models Used

  • XGBoost

Datasets

The following datasets were used in this research:

  • Allstate
  • Higgs
  • Yahoo! learning to rank
  • Criteo

Evaluation Metrics

  • AUC (classification tasks)
  • NDCG@10 (ranking task; both metrics are illustrated in the sketch below)
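
A minimal sketch of computing both metrics; the use of scikit-learn and the toy labels and scores are assumptions for illustration. In the paper, AUC is reported for the classification tasks and NDCG@10 for Yahoo! learning to rank.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, ndcg_score

# AUC for binary classification (e.g., insurance claim or Higgs event prediction).
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.80, 0.65, 0.30, 0.90])
print("AUC:", roc_auc_score(y_true, y_score))

# NDCG@10 for ranking: one row per query, columns are candidate documents.
relevance = np.array([[3, 2, 0, 1, 0]])          # graded relevance labels
scores = np.array([[0.9, 0.7, 0.2, 0.4, 0.1]])   # model scores per document
print("NDCG@10:", ndcg_score(relevance, scores, k=10))
```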

Results

  • XGBoost runs more than ten times faster than existing popular solutions on a single machine.
  • Among 29 winning solutions on Kaggle's blog during 2015, 17 used XGBoost.
  • Every winning team in the top 10 of KDD Cup 2015 used XGBoost.

Limitations

The authors identified the following limitations:

  • None specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

XGBoost, tree boosting, scalability, distributed learning, sparsity-aware algorithms, quantile sketch, out-of-core computation
