Big Data on NYC Open Data

Author: Mark Bauer

Introduction

The Metropolitan Transportation Authority (MTA) recently released the 2023 and 2024 Subway Origin-Destination Ridership Estimates datasets on the New York State Open Data Portal. The 2023 dataset alone comprises approximately 116 million rows. While this dataset is fascinating, its size can pose challenges even for experienced analysts and researchers. This prompted me to investigate how large the largest datasets on the NYC Open Data Portal are and how users interact with them compared to smaller datasets.

This project aims to:

Investigate Dataset Sizes: Examine the largest datasets available on the NYC Open Data Portal.
Analyze Usage Patterns: Analyze row counts, download counts, and view counts to understand how these large datasets are used.
Explore Broader Implications: Address broader questions. Can open data include Big Data? If so, what characteristics define an optimal open big data platform? How can open big data benefit a broad range of producers and users? Which agencies produce the largest datasets, and what methods do they use to support users?

Overview of NYC Open Data

Metric	Value
Number of datasets	2,491
Number of agencies	201
Number of rows	5,965,739,051
Number of views	27,761,756
Number of downloads	11,529,518

Table xx: Summary statistics of NYC Open Data. Note: This analysis includes only assets whose asset type is dataset and whose display type is table, and that contain at least one row.

Figure xx: Top 10 Agencies by Total Number of Rows on NYC Open Data.

Figure xx: Top 10 Agencies by Median Number of Rows on NYC Open Data.

Figure xx: Boxplots of Download Counts by Number of Rows on NYC Open Data.

Figure xx: Total and Average Download Count with IQR Error Bars on NYC Open Data.

Dataset Analysis

id	name	attribution	count_rows	viewCount	downloadCount
rmhc-afj9	DSNY - PlowNYC Data	Department of Sanitation (DSNY)	376,404,531	1,854	504

Table xx: Dataset with the Largest Number of Rows on NYC Open Data.

Figure xx: Top 10 Datasets by Number of Rows on NYC Open Data.

Figure xx: Top 10 Datasets by Number of Rows on NYC Open Data (Excluding Taxi Data).

Figure xx: Top 10 Datasets by Number of Rows on NYC Open Data (Only Taxi Data).

Implications

The User Journey: Exporting Data on NYC Open Data

There are two main methods for exporting a dataset from NYC Open Data: 1) downloading files locally and 2) using the Socrata Open Data API (SODA API):

Download Files Locally: You can download tabular data in various formats, including JSON, CSV, RDF, RSS, TSV, and XML. For geospatial data, additional formats such as KML, KMZ, Shapefile, and GeoJSON are also available.
Socrata Open Data API (SODA API): The SODA API provides access via unique URLs, known as endpoints, that represent datasets or individual records. The API follows the REST (REpresentational State Transfer) design pattern, using HTTP methods for CRUD (Create, Read, Update, Delete) operations. It supports querying and filtering through the Socrata Query Language (SoQL), a SQL-like language tailored for web data. Note that for performance reasons, SODA APIs are paged and return a maximum of 50,000 records per page. More information is available in the SODA API endpoint documentation.
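Because results are paged, exporting a full dataset through the SODA API means requesting one page after another. The sketch below shows one way to do that with the SoQL `$limit`, `$offset`, and `$order` parameters. The helper names (`build_page_url`, `iter_rows`) and the `fetch_page` callable are hypothetical and are not taken from this project's notebooks; the page size reuses the 50,000-record figure cited above.

```python
from typing import Callable, Iterator, List
from urllib.parse import urlencode

PAGE_SIZE = 50_000  # per-page maximum cited above


def build_page_url(resource_url: str, offset: int, limit: int = PAGE_SIZE) -> str:
    """Return the URL for one page of a SODA resource (hypothetical helper)."""
    # Ordering by the system field :id keeps page boundaries stable between requests.
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    # Keep "$" and ":" unescaped so the URL stays readable.
    return f"{resource_url}?{urlencode(params, safe='$:')}"


def iter_rows(
    fetch_page: Callable[[int, int], List[dict]], limit: int = PAGE_SIZE
) -> Iterator[dict]:
    """Yield every row by calling fetch_page(offset, limit) until a short page arrives."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:  # a short (or empty) page means the data is exhausted
            break
        offset += limit
```

In practice, `fetch_page` would wrap an HTTP client, for example by requesting `build_page_url(url, offset, limit)` and decoding the JSON body, where `url` is a resource endpoint such as `https://data.cityofnewyork.us/resource/rmhc-afj9.json` (an assumed endpoint built from the dataset id shown in the table above). Separating URL construction from fetching keeps the paging logic testable without network access.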

Limitations: The methods mentioned above have some limitations. For instance, there is no support for columnar file formats such as Parquet, an open-source column-oriented file format optimized for efficient data storage and retrieval. Additionally, the SODA API can experience performance issues due to the overhead associated with HTTP requests and responses, particularly when querying large volumes of data.
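To put the request overhead in perspective, the following back-of-the-envelope calculation estimates how many paged requests a full export of the largest dataset would take. The row count comes from the table above and the page size from the SODA limit cited earlier; the five seconds per request is purely an illustrative assumption, not a measurement.

```python
ROWS = 376_404_531   # DSNY - PlowNYC Data row count reported above
PAGE_SIZE = 50_000   # SODA per-page maximum cited above


def pages_needed(rows: int, page_size: int = PAGE_SIZE) -> int:
    """Number of paged requests needed to retrieve every row (ceiling division)."""
    return (rows + page_size - 1) // page_size


def estimated_hours(rows: int, seconds_per_request: float, page_size: int = PAGE_SIZE) -> float:
    """Total sequential export time in hours for an assumed per-request latency."""
    return pages_needed(rows, page_size) * seconds_per_request / 3600


print(pages_needed(ROWS))                     # 7529 requests
print(round(estimated_hours(ROWS, 5.0), 1))   # about 10.5 hours at an assumed 5 s per request
```

Even under this optimistic assumption, a sequential export takes thousands of round trips, which is why a bulk columnar format is attractive for datasets of this size.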

The Gold Standard: NYC Taxi and Limousine Commission (TLC)

As highlighted earlier, many of the largest datasets on NYC Open Data originate from the NYC Taxi and Limousine Commission (TLC). It’s no surprise that these Taxi Trip Datasets are commonly used in big data tutorials, and popular cloud services often provide them for free as sample data. In addition to hosting datasets (typically by year) on NYC Open Data, TLC also publishes the trip data as Parquet files on its website, distributed via Amazon Web Services (AWS), specifically Amazon CloudFront.

These options cater to both types of users: those who prefer accessing data directly from NYC Open Data and those who opt for optimized Parquet files.
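As a rough illustration of the Parquet route, the sketch below reads only selected columns from one monthly trip file with pandas. The CloudFront base URL and file-naming pattern are assumptions based on the links published on the TLC Trip Record Data page and may change; the helper names are hypothetical. Reading Parquet with pandas also requires the `pyarrow` or `fastparquet` package.

```python
from typing import List

# Assumed base URL and naming pattern; verify against the TLC Trip Record Data page.
TLC_BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def tlc_parquet_url(year: int, month: int, service: str = "yellow") -> str:
    """Build the assumed URL of one monthly TLC trip file."""
    return f"{TLC_BASE_URL}/{service}_tripdata_{year}-{month:02d}.parquet"


def read_trip_columns(url: str, columns: List[str]):
    """Load only the requested columns from a Parquet file into a DataFrame."""
    import pandas as pd  # imported lazily; needs pandas plus pyarrow or fastparquet

    # A columnar format lets the reader skip every column that is not requested.
    return pd.read_parquet(url, columns=columns)
```

For example, `read_trip_columns(tlc_parquet_url(2024, 1), ["tpep_pickup_datetime", "trip_distance"])` would load two columns from the January 2024 yellow taxi file, assuming that file and those column names exist as expected.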

Code

The code to calculate count of rows for each dataset is located in the data-export.ipynb notebook.
Brief data cleaning before the analysis can be found in the data-cleaning.ipynb notebook.
The code to generate the figures can be found in the analysis.ipynb notebook.
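The notebooks remain the reference for how the row counts were actually computed. As a minimal sketch of one possible approach, a row count can be requested from a SODA endpoint with a SoQL `count(*)` aggregate. The helper names and the `row_count` alias below are hypothetical, and the sample response in the usage note is illustrative rather than captured from the API.

```python
import json
from urllib.parse import urlencode


def count_url(resource_url: str) -> str:
    """Build a SoQL query URL that asks the endpoint for its total row count."""
    query = {"$select": "count(*) AS row_count"}  # row_count is an arbitrary alias
    # Keep "$", "(", ")" and "*" unescaped so the query stays readable.
    return f"{resource_url}?{urlencode(query, safe='$()*')}"


def parse_count(body: str) -> int:
    """Extract the count from a JSON response shaped like [{"row_count": "123"}]."""
    rows = json.loads(body)
    # SODA JSON responses typically encode numbers as strings, so convert explicitly.
    return int(rows[0]["row_count"])
```

Given a response body such as `[{"row_count": "376404531"}]`, `parse_count` returns the integer 376404531. Running one such query per dataset avoids downloading any rows just to count them.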

Data

Data was retrieved from NYC Open Data.

Additional Resources

Socrata: NYC Open Data runs on the Socrata Platform.

Socrata - Home
Socrata System Architecture: This blog post from SODA 2.0 was originally released in 2011.

Note: Socrata was acquired by Tyler Technologies in 2018 and is now the Data and Insights division of Tyler.

NYC Open Data Portal

NYC Open Data Dashboard
NYC Open Data Overview
NYC Open Data Laws and Reports

NYC Taxi and Limousine Commission (TLC)

TLC Data: Aggregate and Raw Data
TLC Trip Record Data

Say Hello!

Feel free to reach out.

LinkedIn: markebauer
Portfolio: mebauer.github.io
GitHub: mebauer
