Cityscapes is a benchmark suite and large-scale dataset for training and testing approaches to pixel-level and instance-level semantic labeling of complex real-world urban scenes. It comprises a diverse set of stereo video sequences recorded in the streets of 50 different cities, with 5000 images carrying high-quality pixel-level annotations and a further 20,000 images carrying coarse annotations to support methods that leverage weakly-labeled data. The dataset surpasses previous efforts in size, annotation richness, scene variability, and complexity. The authors also conducted an empirical study providing an in-depth analysis of the dataset's characteristics and evaluated several state-of-the-art approaches on their benchmark.

The data recording and annotation methodology was meticulously designed to capture the high variability of outdoor street scenes. The authors acquired several hundred thousand frames from a moving vehicle over several months, covering different seasons in 50 cities, primarily in Germany and neighboring countries. They intentionally avoided recording in adverse weather conditions such as heavy rain or snow, reasoning that such conditions would require specialized techniques and datasets.

The camera system and post-processing techniques represented the state of the art in the automotive domain at the time. The images were recorded with an automotive-grade stereo camera with a 22 cm baseline, using 1/3-inch CMOS 2 MP sensors (OnSemi AR0331) with rolling shutters at a frame rate of 17 Hz. The sensors, mounted behind the windshield, produced high-dynamic-range (HDR) images with 16-bit linear color depth. Each 16-bit stereo image pair was subsequently debayered and rectified. The authors relied on extrinsic and intrinsic calibration methods from a referenced source and re-calibrated on-site before each recording session to ensure accuracy.
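The debayer-then-rectify step described above can be illustrated with a minimal OpenCV sketch. The intrinsics, distortion coefficients, and Bayer pattern below are invented placeholders rather than the authors' calibration; only the 22 cm baseline comes from the text, and the 2048 x 1024 resolution is the standard Cityscapes image size.

```python
# Hypothetical sketch of a debayer -> stereo-rectify pipeline.
# All calibration values are invented placeholders, not Cityscapes parameters.
import cv2
import numpy as np

h, w = 1024, 2048  # standard Cityscapes resolution

# Placeholder 16-bit raw Bayer frames standing in for the HDR captures.
raw_left = np.random.randint(0, 2**16, (h, w), dtype=np.uint16)
raw_right = np.random.randint(0, 2**16, (h, w), dtype=np.uint16)

# Debayer each sensor's raw output (Bayer pattern assumed here).
left = cv2.cvtColor(raw_left, cv2.COLOR_BAYER_BG2BGR)
right = cv2.cvtColor(raw_right, cv2.COLOR_BAYER_BG2BGR)

# Invented intrinsics/extrinsics; the real values come from the on-site
# calibration performed before each recording session.
K = np.array([[2260.0, 0.0, w / 2], [0.0, 2260.0, h / 2], [0.0, 0.0, 1.0]])
dist = np.zeros(5)                      # assume negligible distortion
R = np.eye(3)                           # relative rotation between cameras
T = np.array([[-0.22], [0.0], [0.0]])   # 22 cm baseline along x

# Compute rectification transforms and per-pixel remapping tables.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K, dist, K, dist, (w, h), R, T)
m1x, m1y = cv2.initUndistortRectifyMap(K, dist, R1, P1, (w, h), cv2.CV_32FC1)
m2x, m2y = cv2.initUndistortRectifyMap(K, dist, R2, P2, (w, h), cv2.CV_32FC1)

# Warp both views so that epipolar lines become horizontal scanlines.
rect_left = cv2.remap(left, m1x, m1y, cv2.INTER_LINEAR)
rect_right = cv2.remap(right, m2x, m2y, cv2.INTER_LINEAR)
```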

To maintain comparability and compatibility with existing datasets, the authors also provided low-dynamic-range (LDR) 8-bit RGB images obtained through logarithmic compression curves. Such tone mappings are common in automotive vision, as they can be computed efficiently and independently for each pixel. For optimal annotation quality, the authors applied a separate tone mapping to each image, yielding less realistic but visually more pleasing images that proved easier to annotate.

From 27 of the 50 cities, the authors manually selected 5000 images for dense pixel-level annotation, aiming for diversity in foreground objects, background, and overall scene layout. Annotations were performed on the 20th frame of a 30-frame video snippet, with the full snippet provided to offer context. From the remaining 23 cities, a single image was selected every 20 seconds or every 20 meters of driving distance, whichever came first, yielding a total of 20,000 images with coarse annotations.
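The exact compression curve is not specified here, so the following is a minimal sketch of one plausible per-pixel logarithmic tone mapping from 16-bit linear HDR to 8-bit LDR; the `log1p` curve and its normalization constant are assumptions, not the authors' actual function.

```python
# Minimal sketch of a per-pixel logarithmic 16-bit -> 8-bit tone mapping.
# The specific curve is an assumption; only the idea of an efficient,
# pixel-independent logarithmic compression comes from the text above.
import numpy as np

def log_tonemap(hdr16: np.ndarray) -> np.ndarray:
    """Compress a 16-bit linear image to 8 bits with a log curve."""
    x = hdr16.astype(np.float64)
    # log(1 + x) maps [0, 65535] onto [0, log(65536)]; rescale to [0, 255].
    y = np.log1p(x) / np.log1p(65535.0) * 255.0
    return np.clip(np.rint(y), 0, 255).astype(np.uint8)

# Placeholder HDR frame; each pixel is mapped independently, so the
# operation is trivially vectorized.
hdr = np.random.randint(0, 2**16, (1024, 2048, 3), dtype=np.uint16)
ldr = log_tonemap(hdr)  # 8-bit RGB, comparable to existing LDR datasets
```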

Within DatasetNinja, statistics were calculated for the 5000-image (fine-annotation) version of the dataset.