We introduce Argoverse 2 (AV2), a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras and two stereo cameras, in addition to lidar point clouds and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with predicting the future motion of "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD map with 3D lane and crosswalk geometry, sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.

In the last two years, the Argoverse team has hosted six competitions on 3D tracking, stereo depth estimation, and motion forecasting. We maintain evaluation servers and leaderboards for these tasks. We designed Argoverse 2 around the following guidelines:

1. Bigger isn't always better. Self-driving vehicles capture a flood of sensor data that is logistically difficult to work with. Sensor datasets are several terabytes in size, even when compressed. If standard benchmarks grow further, we risk alienating much of the academic community and leaving progress to well-resourced industry groups. For this reason, we match but do not exceed the scale of sensor data in nuScenes [4] and Waymo Open [45].

2. Make every instance count. Much of driving is boring. Datasets should focus on the difficult, interesting scenarios where current forecasting and perception systems struggle. Therefore, we mine for especially crowded, dynamic, and kinematically unusual scenarios.

3. Diversity matters. Training on data from wintertime Detroit is not sufficient for detecting objects in Miami; Miami has 15 times the frequency of motorcycles and mopeds. Behaviors differ as well, so learned pedestrian motion behavior might not generalize. Accordingly, each of our datasets is drawn from six diverse cities (Austin, Detroit, Miami, Palo Alto, Pittsburgh, and Washington D.C.) and from different seasons as well, from snowy to sunny.

4. Map the world. HD maps are powerful priors for perception and forecasting. Learning-based methods that found clever ways to encode map information [31] performed well in Argoverse competitions. For this reason, we augment our HD map representation with 3D lane geometry, paint markings, crosswalks, higher-resolution ground height, and more.

5. Self-supervise. Other machine learning domains have seen enormous success from self-supervised learning in recent years. Large-scale lidar data from dynamic scenes, paired with HD maps, could lead to better representations than those learned with current supervised approaches. For this reason, we build the largest dataset of lidar sensor data; a sketch of one candidate self-supervised objective follows this list.

6. Fight the heavy tail. Passenger vehicles are common, and thus we can assess our forecasting and detection accuracy for cars. However, with existing datasets, we cannot assess forecasting accuracy for buses and motorcycles with their distinct behaviors, nor can we evaluate stroller and wheelchair detection. Thus, we introduce the largest taxonomy to date for sensor and forecasting datasets, and we ensure enough samples of rare objects to train and evaluate models.
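Guideline 5 refers to point cloud forecasting, the self-supervised task of predicting future lidar sweeps from past ones. As a rough illustration of how such forecasts are commonly scored, the sketch below computes a symmetric Chamfer distance between a predicted sweep and the sweep actually observed. This is a standard choice in the point cloud forecasting literature, not a metric defined by the AV2 datasets, and the function and variable names are ours.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3).

    For every point in one cloud, take the squared distance to its nearest
    neighbor in the other cloud; average each direction and sum the two.
    Brute-force O(N*M) -- real pipelines would use a KD-tree or a GPU kernel.
    """
    d2 = np.sum((pred[:, None, :] - target[None, :, :]) ** 2, axis=-1)  # (N, M)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy usage: "forecast" the next sweep by copying the current one, then score it
# against a synthetic future sweep in which the whole scene moved 1 m forward.
rng = np.random.default_rng(0)
current_sweep = rng.uniform(-50.0, 50.0, size=(1024, 3))  # stand-in for a lidar sweep
future_sweep = current_sweep + np.array([1.0, 0.0, 0.0])
print(chamfer_distance(current_sweep, future_sweep))
```

A learned forecaster would replace the copy-the-last-sweep baseline above; the 20,000 unlabeled sequences of the Lidar Dataset provide the training signal for such a model without any human annotation.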
With these guidelines in mind, we built the three Argoverse 2 (AV2) datasets. Below, we highlight some of their contributions:

1. The 1,000-scenario Sensor Dataset has the largest self-driving taxonomy to date: 30 categories, 26 of which contain at least 6,000 cuboids, enabling training and testing over a diverse taxonomy. The dataset also has stereo imagery, unlike other recent self-driving datasets.

2. The 20,000-scenario Lidar Dataset is the largest dataset for self-supervised learning on lidar. The only similar dataset, the concurrently developed ONCE [36], does not have HD maps.

3. The 250,000-scenario Motion Forecasting Dataset has the largest taxonomy (five types of dynamic actors and five types of static actors) and covers the largest mapped area of any such dataset.

We believe these datasets will support research into problems such as 3D detection, 3D tracking, monocular and stereo depth estimation, motion forecasting, visual odometry, pose estimation, lane detection, map automation, self-supervised learning, structure from motion, scene flow, optical flow, time-to-contact estimation, and point cloud forecasting.

Related Work

The last few years have seen rapid progress in self-driving perception and forecasting research, catalyzed by many high-quality datasets.

Sensor datasets and 3D Object Detection and Tracking. New sensor datasets for 3D object detection [4, 45, 39, 40, 24, 33, 18, 14, 41, 36] have led to influential detection methods such as