2 Download images

Hou Yujun edited this page Aug 15, 2024 · 5 revisions

Please download the folder code/download_imgs/.

Set up environment with requirements-non_cv.txt.

An access token is required to download data from Mapillary. You can register for one for free with Mapillary. Then set your Mapillary token in the following files:

  • code/download_imgs/download_jpegs.py
  • code/download_imgs/download_jpegs_mapillary.py

How the code works

download_jpegs.py downloads every image listed in an input csv file from Mapillary or KartaView and saves each one in .jpeg format to a specified output folder.

Input

The input csv should have one row per image and contain at least three columns:

  • uuid: the universally unique identifier (UUID) assigned to each image in the dataset. The downloaded image files will be named with their UUIDs, i.e. {uuid}.jpeg.
  • source: indicates whether the image was obtained from Mapillary or KartaView. The script uses this information to select the appropriate download function and API.
  • orig_id: the original image ID given by Mapillary or KartaView in metadata. This ID is used to query the Mapillary / KartaView API to download the images.

All three columns are present in every csv file we provide in the dataset, so you can use any of those files as input for download_jpegs.py.
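Before starting a long download, it can help to confirm that a csv has the three required columns. The following is a minimal sketch, not part of download_jpegs.py; the helper name missing_columns and the sample values are made up for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = {"uuid", "source", "orig_id"}


def missing_columns(csv_file):
    """Return the required columns absent from the csv header, sorted."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# Illustrative in-memory csv; the values are made up.
sample = io.StringIO("uuid,source,orig_id,lat\nabc-123,Mapillary,111,1.3\n")
print(missing_columns(sample))  # prints []
```

For a real file, pass an open file handle instead, e.g. open(in_csvPath, newline="").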

Output

Images are downloaded into subfolders with a maximum of 10,000 images per subfolder. Each image file is named by its UUID: {uuid}.jpeg.
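The grouping into subfolders can be pictured with a small sketch. It assumes a '{first}_{last}' naming pattern with 1-based inclusive bounds, inferred only from the 1_50 sample folder; the actual naming used by the script may differ, and subfolder_name is a hypothetical helper.

```python
def subfolder_name(index, total, chunk_size=10000):
    """Name of the subfolder for the image at 0-based position `index`.

    Assumption: folders are named '{first}_{last}' with 1-based,
    inclusive bounds, and the last folder is cut off at `total`.
    """
    first = (index // chunk_size) * chunk_size + 1
    last = min(first + chunk_size - 1, total)
    return f"{first}_{last}"


print(subfolder_name(0, 50))         # 1_50
print(subfolder_name(10000, 25000))  # 10001_20000
print(subfolder_name(24999, 25000))  # 20001_25000
```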

Adjustable variables

Users can adjust the following variables in download_jpegs.py to suit their needs:

  • access_token (str):
    • Insert your Mapillary access token.
  • in_csvPath (str):
    • Insert the path to your input csv.
  • out_mainFolder (str):
    • Insert the path to the main output folder. The script automatically creates subfolders under it to group the downloaded images, so that each subfolder holds at most 10,000 images.
  • chunk_size (int):
    • Maximum number of images per output subfolder. Defaults to 10000.
  • num_thread (int):
    • Number of threads or download tasks to run concurrently. Defaults to 100.
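To illustrate what num_thread controls, here is a minimal sketch of running download tasks concurrently with Python's standard thread pool. It is not the script's actual implementation: fake_download is a stand-in that makes no network request, and download_all is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(uuid):
    """Stand-in for a real download call; fails for one made-up UUID."""
    if uuid == "bad":
        raise IOError(f"image {uuid} unavailable")
    return f"{uuid}.jpeg"


def download_all(uuids, num_thread=100):
    """Run downloads concurrently; return (saved file names, failed UUIDs)."""
    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=num_thread) as pool:
        futures = {pool.submit(fake_download, u): u for u in uuids}
        for future in as_completed(futures):
            try:
                saved.append(future.result())
            except IOError:
                # A failed image does not stop the remaining tasks.
                failed.append(futures[future])
    return sorted(saved), failed


saved, failed = download_all(["a", "b", "bad"], num_thread=4)
print(saved, failed)  # ['a.jpeg', 'b.jpeg'] ['bad']
```

A larger num_thread means more requests in flight at once, which may speed up the download but also puts more load on your network connection.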

How to run the code

Set up environment with requirements-non_cv.txt.

To reproduce sample_output

Insert your access_token.

Modify out_mainFolder to your output folder.

Uncomment the line:

data_l = pd.concat([data_l[data_l['source']=='Mapillary'].sample(n=25, random_state=0), data_l[data_l['source']=='KartaView'].sample(n=25, random_state=0)], ignore_index=True) # sample 50 images to download just for illustration purpose

Then run:

python3 download_jpegs.py

About sample_output

We sampled 50 images from code/raw_download/sample_output/points.csv and downloaded their image files to code/download_imgs/sample_output/all/1_50.

These 50 images are also used as input to demonstrate the subsequent CV (computer vision) processing.

To download image files for the entire dataset

Use any of the csv files provided in our dataset as input.

Modify the adjustable variables to suit your needs.

Ensure that more than 6 TB of storage is available, since the full imagery takes up at least 6 TB.
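One way to check the available space before starting is sketched below using the standard library. The helper name has_enough_space is hypothetical, and 6 TB is taken here as 6 × 10^12 bytes, which is an assumption about the unit.

```python
import shutil

REQUIRED_BYTES = 6 * 10**12  # 6 TB, assuming decimal terabytes


def has_enough_space(folder, required_bytes=REQUIRED_BYTES):
    """Return True if the disk holding `folder` has enough free space."""
    free = shutil.disk_usage(folder).free
    return free >= required_bytes


# Check the disk that holds the current directory; use out_mainFolder in practice.
print(has_enough_space("."))
```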

Run

python3 download_jpegs.py

The whole download might take days to complete.

Download image files for a subset of data

You may only need the imagery for a subset of the dataset.

You can produce a subset of the dataset by filtering on the appropriate metadata. See info.csv for a list of the available features and their meanings.

The notebook sample_subset_download.ipynb contains an example of filtering for images from Singapore taken during the daytime.

As shown in the notebook, ensure the filtered csv file contains at least the three columns described above (uuid, source, and orig_id).
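The idea of filtering and keeping only the required columns can be sketched as follows. This is not the notebook's code: the column names city and lighting_condition, their values, and the helper filter_rows are hypothetical, so check info.csv for the real feature names.

```python
import csv
import io

KEEP = ["uuid", "source", "orig_id"]


def filter_rows(csv_file, **conditions):
    """Yield rows whose columns equal all given values, reduced to KEEP."""
    for row in csv.DictReader(csv_file):
        if all(row.get(col) == val for col, val in conditions.items()):
            yield {col: row[col] for col in KEEP}


# Made-up metadata; `city` and `lighting_condition` are hypothetical columns.
metadata = io.StringIO(
    "uuid,source,orig_id,city,lighting_condition\n"
    "u1,Mapillary,101,Singapore,day\n"
    "u2,KartaView,202,Singapore,night\n"
    "u3,Mapillary,303,Zurich,day\n"
)
subset = list(filter_rows(metadata, city="Singapore", lighting_condition="day"))

# Write the filtered rows in the shape download_jpegs.py expects.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=KEEP)
writer.writeheader()
writer.writerows(subset)
print(out.getvalue())
```

In practice, replace the in-memory buffers with real files and point in_csvPath at the file you wrote.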

Once you have saved the csv, change in_csvPath and out_mainFolder accordingly.

Run

python3 download_jpegs.py

Notes

  1. download_jpegs.py imports the functions from download_jpegs_mapillary.py and download_jpegs_kartaview.py to download imagery from Mapillary and KartaView respectively, by sending requests to the respective APIs. For this reason, it is best to keep these three .py files within the same folder so that download_jpegs.py works smoothly.
  2. Not all images can be downloaded in one go because of network issues. Re-run the script until the total number of downloaded images stops changing, or until the script reports that all images have been downloaded.
  3. Even after repeated runs, some images may still be missing. An image file can be unavailable despite the presence of its metadata, for reasons we cannot determine (e.g. the contributor deleted the image, or the image did not pass an internal check by Mapillary or KartaView). You may therefore see error messages during the download, but the download should continue on its own.
  4. If your download is interrupted, re-run python3 download_jpegs.py to resume it. The script checks the contents of out_mainFolder against your input csv (in_csvPath) and only attempts to download the images that do not yet exist in the output folder.
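The resume check described in note 4 can be sketched roughly as below. This is an illustration of the idea rather than the script's own code; pending_uuids is a hypothetical helper, and the example runs against a temporary folder with made-up UUIDs.

```python
from pathlib import Path
import tempfile


def pending_uuids(all_uuids, out_main_folder):
    """Return UUIDs with no '{uuid}.jpeg' anywhere under out_main_folder."""
    done = {p.stem for p in Path(out_main_folder).rglob("*.jpeg")}
    return [u for u in all_uuids if u not in done]


# Simulate an interrupted run: one of two images already exists on disk.
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "1_50"
    sub.mkdir()
    (sub / "u1.jpeg").touch()
    print(pending_uuids(["u1", "u2"], tmp))  # ['u2']
```

The same comparison also gives a quick way to list the images that remain unavailable after repeated runs.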