Venue
Knowledge Discovery and Data Mining
Domain
machine learning, data mining, computer science
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
This paper presents XGBoost, a scalable machine learning system for tree boosting that achieves state-of-the-art results across various machine learning challenges. The authors highlight key innovations including a sparsity-aware algorithm, a weighted quantile sketch for approximate tree learning, and optimizations for cache access patterns, data compression, and sharding, allowing XGBoost to handle billions of examples efficiently. Extensive evaluations demonstrate that XGBoost outperforms existing methods in speed and scalability. The paper also details the architecture, algorithms, and experimental results, showcasing XGBoost's effectiveness in practical applications such as insurance claim prediction, particle physics event classification, and ranking tasks.
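To make the additive-training idea behind tree boosting concrete, here is a minimal sketch using depth-1 "stump" learners and squared-error loss. All function names are illustrative, not from the paper, and the sketch omits what XGBoost actually contributes: regularized deep trees, second-order gradients, sparsity handling, and parallel split finding.

```python
# Minimal gradient boosting with decision stumps on squared-error loss.
# Illustrative only: real XGBoost grows regularized trees using
# second-order gradient statistics and scalable split finding.

def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error of the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, rounds=20, eta=0.5):
    """Additive training: each new stump fits the current residuals."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + eta * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(eta * s(x) for s in stumps)

# Toy 1-D regression data: a step function.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(xs, ys)
```

Each round shrinks the remaining residual by the learning rate `eta`, so the ensemble's prediction converges toward the targets as more rounds are added.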
This paper employs the following methods:
- Gradient tree boosting
- Sparsity-aware split finding for sparse data
- Weighted quantile sketch for approximate tree learning
- Cache-aware access, block compression, and sharding for out-of-core computation
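The weighted quantile step can be pictured as choosing split candidates so that consecutive candidates are at most ε apart in normalized weighted rank. The toy below computes this exactly in one pass over sorted data; the paper's actual contribution is a mergeable, prunable approximate sketch with provable error bounds (with second-order gradient statistics as the weights), which this simplification does not implement.

```python
# Toy weighted quantile candidate selection: pick cut points so that
# consecutive candidates differ by at most eps in normalized weighted rank.
# Assumes positive weights. XGBoost instead maintains an approximate,
# mergeable sketch suitable for distributed and out-of-core settings.

def weighted_quantile_candidates(values, weights, eps):
    total = sum(weights)
    pairs = sorted(zip(values, weights))
    candidates = [pairs[0][0]]          # always keep the minimum value
    rank, last_rank = 0.0, 0.0
    for v, w in pairs:
        rank += w
        if (rank - last_rank) / total >= eps:
            candidates.append(v)
            last_rank = rank
    if candidates[-1] != pairs[-1][0]:  # always keep the maximum value
        candidates.append(pairs[-1][0])
    return candidates
```

With uniform weights this reduces to ordinary ε-approximate quantiles; non-uniform weights shift candidates toward regions where the weighted mass concentrates.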
The following datasets were used in this research:
- Allstate
- Higgs
- Yahoo! learning to rank
- Criteo
The paper reports the following results:
- XGBoost runs more than ten times faster than existing popular solutions on a single machine.
- Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost.
- Every winning team in KDD Cup 2015 used XGBoost.
The authors identified the following limitations:
- None explicitly stated.
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords:
- XGBoost
- tree boosting
- large-scale machine learning
- scalability
- distributed learning
- sparsity-aware algorithms
- quantile sketch
- out-of-core computation