The source code for the WPES'23 paper "Unveiling the Impact of User-Agent Reduction and Client Hints: A Measurement Study"

ua-reduction/ua-client-hints-crawler

Unveiling the Impact of User-Agent Reduction and Client Hints: A Measurement Study (WPES'23)

This repository contains the code for the paper titled "Unveiling the Impact of User-Agent Reduction and Client Hints: A Measurement Study" (to be presented at WPES'23).

Background: Browsers including Chrome recently reduced the user-agent string to make it less identifying. Simultaneously, Chrome introduced several highly identifying (high-entropy) User-Agent Client Hints (UA-CH) to allow access to browser properties that are redacted from the user-agent string. In this empirical study, we attempt to characterize the effects of these major changes through a large-scale web measurement on the top 100K websites. Using an instrumented crawler, we quantify access to high-entropy browser features through UA-CH HTTP headers and the JavaScript API (mainly the navigator.userAgentData.getHighEntropyValues method). We measure access delegation to third parties and investigate whether the new client hints are already used by tracking, advertising and browser fingerprinting scripts.
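
To make the notion of "high-entropy" hints requested via HTTP headers concrete, the following is a minimal sketch (not the crawler's own code) that picks the high-entropy UA-CH names out of an Accept-CH response header value. The hint list reflects our reading of the UA-CH specification and may be incomplete; the function name and the example header are hypothetical.

```python
# Assumption: this set mirrors the high-entropy UA-CH request headers
# defined by the UA-CH specification; it is not taken from this repository.
HIGH_ENTROPY_UA_CH = {
    "sec-ch-ua-arch",
    "sec-ch-ua-bitness",
    "sec-ch-ua-full-version",
    "sec-ch-ua-full-version-list",
    "sec-ch-ua-model",
    "sec-ch-ua-platform-version",
    "sec-ch-ua-wow64",
}


def high_entropy_hints_requested(accept_ch_value: str) -> list[str]:
    """Return the high-entropy UA-CH names listed in an Accept-CH value."""
    # Accept-CH is a comma-separated list of case-insensitive hint names.
    tokens = {t.strip().lower() for t in accept_ch_value.split(",")}
    return sorted(tokens & HIGH_ENTROPY_UA_CH)


# Hypothetical header value, as a server might send it.
example = "Sec-CH-UA-Platform-Version, Sec-CH-UA-Model, Sec-CH-UA-Mobile, DPR"
print(high_entropy_hints_requested(example))
```

Low-entropy hints such as Sec-CH-UA-Mobile and non-UA hints such as DPR are ignored by this filter.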

Project website: For a more detailed overview please visit the project's website.

Crawler

We extended DuckDuckGo’s Tracker Radar Collector to record HTTP headers, JavaScript API calls and HTML elements that can be used to access, opt in to, or delegate User-Agent Client Hints. Our main modifications can be found in the following files:

To start a crawl, 1) clone this repo, 2) install the required npm packages (npm i) and 3) run the following command:

npm run crawl -- -u 'https://www.example.com' -o ./data/ -v -f -d "fingerprints,requests,cookies,screenshots,ch_delegation" --reporters 'cli,file' -l ./data/

Please check the upstream Tracker Radar Collector repository for other command line options.
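
To crawl more than one site with the same options, the command above can be assembled per URL. The sketch below only builds and prints the command lines; the helper name and the second example URL are hypothetical, and the flags are copied from the command shown above.

```python
import shlex

# Collector list copied from the crawl command in this README.
COLLECTORS = "fingerprints,requests,cookies,screenshots,ch_delegation"


def build_crawl_command(url: str, out_dir: str = "./data/") -> list[str]:
    """Build the argv list for one crawl, mirroring the documented command."""
    return [
        "npm", "run", "crawl", "--",
        "-u", url,
        "-o", out_dir,
        "-v", "-f",
        "-d", COLLECTORS,
        "--reporters", "cli,file",
        "-l", out_dir,
    ]


for site in ["https://www.example.com", "https://www.example.org"]:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(build_crawl_command(site)))
```

To actually run a crawl, each list could be passed to subprocess.run(cmd, check=True) from the repository root after npm i has completed.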

Data

Crawl data

The data from the crawl (performed in June 2023) will be made available soon. For each visited website, the crawler produces the following files:

  • homepage screenshot
  • homepage HTML source
  • a JSON file that contains HTTP request and response details, cookies, JavaScript API calls, details of User-Agent Client Hint delegation or opt-in via HTML
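
As an illustration of what "delegation or opt-in via HTML" can look like in a saved homepage source, here is a minimal sketch that extracts Accept-CH and Delegate-CH meta tags. It is not the crawler's implementation (the crawler is JavaScript); the class name and the sample HTML are hypothetical, and the Delegate-CH syntax (semicolon-separated entries, each a hint name followed by origins) is assumed from Chrome's documentation.

```python
from html.parser import HTMLParser


class ClientHintMetaParser(HTMLParser):
    """Collect Accept-CH and Delegate-CH <meta http-equiv> declarations."""

    def __init__(self):
        super().__init__()
        self.accept_ch = []    # hints the page opts in to for itself
        self.delegate_ch = {}  # hint name -> list of delegated origins

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        kind = attr.get("http-equiv", "").lower()
        content = attr.get("content", "")
        if kind == "accept-ch":
            # Comma-separated list of hint names.
            self.accept_ch += [h.strip().lower() for h in content.split(",") if h.strip()]
        elif kind == "delegate-ch":
            # Assumed form: "hint origin1 origin2; hint2 origin3".
            for entry in content.split(";"):
                parts = entry.split()
                if parts:
                    self.delegate_ch.setdefault(parts[0].lower(), []).extend(parts[1:])


# Hypothetical homepage snippet.
sample_html = """
<head>
  <meta http-equiv="Accept-CH" content="Sec-CH-UA-Model, Sec-CH-UA-Arch">
  <meta http-equiv="Delegate-CH" content="sec-ch-ua-model https://ads.example; sec-ch-ua-arch https://cdn.example">
</head>
"""

parser = ClientHintMetaParser()
parser.feed(sample_html)
print(parser.accept_ch)
print(parser.delegate_ch)
```

Delegation through an iframe's allow attribute is not covered by this sketch.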

Auxiliary data

The auxiliary data we use in the analysis includes the following:

  • 100k_nyc_all_reqs.csv: Request and response details extracted from the crawl JSONs.
  • 100k_nyc_delegation_df.csv: Information about websites where User-Agent Client Hints are delegated via HTML, obtained from the crawl JSON files.
  • 100k_nyc_leaky_reqs_with_hashes.csv: Request and response details where high-entropy hints are exfiltrated to remote servers, created using 100k_nyc_all_reqs.csv and the leak-detector code published in this repo. The leak detection methodology is based on the approach presented by Englehardt et al.
  • site_rank.txt: The ranking details associated with each visited website.
  • tracker_category.json: The categories of the domains that exfiltrate or access the User-Agent Client Hints, derived from DuckDuckGo's Tracker Radar dataset. All JSON files in the specified folders and their subfolders were processed to extract the category information.
  • tracker_owner.json: The owners of the tracker domains that exfiltrate or access the User-Agent Client Hints, sourced from DuckDuckGo's Tracker Radar dataset. All JSON files were parsed and the displayName field was extracted.
  • 100k_nyc_api_calls.csv: JavaScript calls and property accesses related to User-Agent Client Hints, including function arguments and return values, extracted from the crawl JSON files.
  • 100k_nyc_fp_attempts.csv: Detailed information about fingerprinting attempts, based on applying heuristics developed by Iqbal et al. to the crawl JSON files.
  • category_domains.json: The categories of the domains (exfiltrating or accessing the User-Agent Client Hints), determined using DuckDuckGo's Tracker Radar repository.
  • succeeded_hostnames.txt: The list of URLs we successfully visited during the crawl.
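
The idea behind the leak detection step that produces 100k_nyc_leaky_reqs_with_hashes.csv can be sketched as follows: generate encoded and hashed forms of each known hint value, then search outgoing request URLs for them. This is a heavily simplified illustration, not the leak-detector code in this repo; the function names, hint values and tracker URL are hypothetical, and the set of encodings is an assumption chosen for brevity.

```python
import base64
import hashlib
from urllib.parse import quote, unquote


def candidate_encodings(value: str) -> set[str]:
    """Return a few encoded/hashed forms of a hint value to search for."""
    raw = value.encode("utf-8")
    return {
        value,
        quote(value, safe=""),                  # URL-encoded
        base64.b64encode(raw).decode("ascii"),  # Base64
        hashlib.md5(raw).hexdigest(),
        hashlib.sha1(raw).hexdigest(),
        hashlib.sha256(raw).hexdigest(),
    }


def find_leaks(request_url: str, hint_values: dict[str, str]) -> list[str]:
    """Return names of hints whose value appears in the URL in any form."""
    haystacks = (request_url, unquote(request_url))
    return sorted(
        name
        for name, value in hint_values.items()
        if any(enc in h for enc in candidate_encodings(value) for h in haystacks)
    )


# Hypothetical high-entropy values reported by the crawling browser.
hints = {"platformVersion": "14.5.0", "model": "Pixel 7"}
# Hypothetical third-party request carrying a Base64-encoded platform version.
leaky_url = "https://tracker.example/c?pv=" + base64.b64encode(b"14.5.0").decode("ascii")
print(find_leaks(leaky_url, hints))
```

Very short values (for example a bitness of "64") would match many unrelated URLs under this naive substring search, so a real detector needs additional filtering.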

Data analysis code

The Jupyter notebooks used for the analyses can be found at https://github.com/ua-reduction/ua-client-hints-crawler/tree/main/analysis.

Reference

@inproceedings{senol2023unveiling,
  title={Unveiling the Impact of User-Agent Reduction and Client Hints: A Measurement Study},
  author={Senol, Asuman and Acar, Gunes},
  booktitle={Proceedings of the 22nd Workshop on Privacy in the Electronic Society},
  pages={91--106},
  year={2023}
}
