🗺️ Replication: Towards a Publicly Available Internet scale IP Geolocation Dataset (IMC 2023)

This repository contains the code needed to reproduce and replicate our results in our IMC 2023 paper.

Our study replicates the methodology of two papers that obtained outstanding results on geolocating IP addresses in terms of coverage and accuracy in nowadays Internet on the largest publicly available measurement platform, RIPE Atlas. These two papers are:

They are called million scale and street level papers throughout this README, as done in our paper.

Our code offers the possibility to:

reproduce our results using our measurement datasets.
replicate our methodology with different targets and vantage points. For now, only RIPE Atlas vantage points are supported, but it should not be difficult to adapt the code to handle other vantage points and targets.

Prerequisites

Our code performs measurements on RIPE Atlas, so be sure to have an account if you want to replicate our methodology with your own RIPE Atlas measurements.

⚠️ To replicate our RIPE Atlas measurements, you will need a lot of credits (millions).

Installation

Requirements

Python3.9 (or above)
Poetry
Docker

Download datasets

You can fetch our data our on FTP ftp.iris.dioptra.io that will give you the ClickHouse tables dumped in CSV format.

Clone the reprository

git clone https://github.com/dioptra-io/geoloc-imc-2023.git
cd geoloc-imc-2023

Installer

You can use the script install.sh to:

Pull the clickhouse docker image.
Start the clickhouse server.
Download clickhouse-client binary.
Install python project using poetry.
Create all tables and populate the database with our measurements.

source install.sh

If the installation fails, all necessary steps to use the project are described below.

Install source files

GeoScale uses poetry has dependency manager, install the project using:

poetry shell
poetry lock
poetry install

Clickhouse

We use docker to run clickhouse server, by default server is listening on localhost on port 8123 and tcp9000. If you prefer using your own docker configuration, please also modify default.py

# pull the docker image
docker pull clickhouse/clickhouse-server:22.6

# start the server
docker run --rm -d \
    -v ./clickhouse_files/data:/var/lib/clickhouse/ \
    -v ./clickhouse_files/logs:/var/log/clickhouse-server/ \
    -v ./clickhouse_files/users.d:/etc/clickhouse-server/users.d:ro \
    -v ./clickhouse_files/init-db.sh:/docker-entrypoint-initdb.d/init-db.sh \
    -p 8123:8123 \
    -p 9000:9000 \
    --ulimit nofile=262144:262144 \
    clickhouse/clickhouse-server:22.6

You can either install clickhouse-client or download clikhouse client binary (by default, install.sh download binary file).

curl https://clickhouse.com/ | sh
mv clickhouse ./clickhouse_files/

Finally, create all necessary tables and populate it with our own measurements with:

python scripts/utils/clickhouse_installer.py

Setttings

Our tool relies on ENV variables for configuring clickhouse or interacting with RIPE Atlas API. An example of necessary ENV variables is given in .env.example. Create your own env file with following values:

RIPE_USERNAME=
RIPE_SECRET_KEY=

⚠️ IF you used, your own clickhouse configuration, you can modify the following ENV:

# clickhouse settings
CLICKHOUSE_CLIENT=
CLICKHOUSE_HOST=
CLICKHOUSE_DB=
CLICKHOUSE_USER=
CLICKHOUSE_PASSWORD=

Further notice

Test environment

The project has been run on:

CentOS 7.5
Python 3.9
Server with 64GB RAM, 32 cores.

⚠️ Some scripts and analysis can use a lot of CPU and RAM (tens of GB) and last for hours.

Reproducing our results

We provide python scripts and jupyter notebooks to reproduce the results and the graphs that we got in replicating the million scale and the street level papers.

Million Scale

You can reproduce Million scale results using a jupyter notebook: million_scale.ipynb

Alternatively you can also use the python script in background, as some steps are vey long to execute (several hours):

nohup python analysis/million_scale.py > output.log &

All analysis results can be found in ./analysis/results

Street level

⚠️ The tier 1 of the Street-level replication (See the paper for more details) relies on results calculated by the million scale technique. You need to run the million scale notebook/scripts before running those of street-level.

No additional steps are necessary to reproduce the street-level experiment.

Generating figures

You can directly use notebooks plot.ipynb and tables.ipynb to produce the figures and tables of our paper.

Run your own measurements

You can also run your own measurements on custom datasets of targets (anchors) and vantage points (probes).

First step: generate targets and vantage points datasets

The jupyter notebook create_dataset will generate:

the set of probes (used as vantage points)
the set of anchors (used as targets)
filter both sets by removing problematic probes (wrongly geolocated for example)

All generated files will be placed in /datasets/user_datasets.

Second step: run measurements

With million_scale_measurements.ipynb, you can select a subset of vantage points and targets and run measurements on RIPE Atlas.

This script will start measurements for:

towards all targets from all vantage points
towards 3 responsive addresses for each target from all vantage points

⚠️ These measurements might cost a lot of RIPE Atlas credits and time if you run them on large datasets (default is only 2 targets and 4 vantage points).

Third step: analyze your results

Perform the analysis by using the same step described previously on your own measurements results and datasets by setting the boolean variable repro = True, at the beginning of million_scale.ipynb (or million_scale.py if you are using the script).

TODO: Street level

📚 Publications

@inproceedings{darwich2023replication,
  title={Replication: Towards a Publicly Available Internet scale IP Geolocation Dataset},
  author={Darwich, Omar and Rimlinger, Hugo and Dreyfus, Milo and Gouel, Matthieu and Vermeulen, Kevin},
  booktitle={Proceedings of the 2023 ACM on Internet Measurement Conference},
  pages={1--15},
  year={2023}
}

🧑‍💻 Authors

This project is the result of a collaboration between the LAAS-CNRS and Sorbonne Université.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🗺️ Replication: Towards a Publicly Available Internet scale IP Geolocation Dataset (IMC 2023)

Prerequisites

Table of contents

Installation

Requirements

Download datasets

Clone the reprository

Installer

Install source files

Clickhouse

Setttings

Further notice

Test environment

Reproducing our results

Million Scale

Street level

Generating figures

Run your own measurements

First step: generate targets and vantage points datasets

Second step: run measurements

Third step: analyze your results

📚 Publications

🧑‍💻 Authors

About

Releases

Packages

Contributors 5

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 137 Commits
analysis		analysis
clickhouse_files		clickhouse_files
datasets		datasets
measurements		measurements
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
default.py		default.py
install.sh		install.sh
logger.py		logger.py
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

dioptra-io/geoloc-imc-2023

Folders and files

Latest commit

History

Repository files navigation

🗺️ Replication: Towards a Publicly Available Internet scale IP Geolocation Dataset (IMC 2023)

Prerequisites

Table of contents

Test environment

Million Scale

Street level

Generating figures

First step: generate targets and vantage points datasets

Second step: run measurements

Third step: analyze your results

📚 Publications

🧑‍💻 Authors

About

Resources

Stars

Watchers

Forks

Languages