Xing Liu (UC Berkeley) and Anthony Kremin (Berkeley Lab), June 2024
DESI's early data release (EDR) is available to the public, free of charge, at the desidata S3 cloud storage "bucket" on Amazon Web Services (AWS).
Here, we provide a Docker image which makes it easy to work with both local and cloud-hosted DESI data. Our Docker image is a self-contained Linux environment which comes pre-packaged with
- A Jupyter server installed with general Python libraries for scientific programming, as well as DESI-specific libraries, and
- A filesystem mounted to the DESI S3 bucket, which automatically downloads the data you query and nothing more.
Most DESI code developed for NERSC can run on this Docker image with little to no modifications.
You are free to choose a combination of local/cloud-hosted databases and local/cloud-hosted programming environments to suit your workflow.
If your DESI data is hosted locally, or if you want to stream the DESI data from S3 and process it locally, please follow the instructions at Running the Docker image locally. We emphasize that local data processing is only practical on high-performance computers: due to the high resolution of DESI data, you should only run the image locally if your computer has at least 16 GB of memory (24 GB recommended).
Otherwise, we recommend running the Docker image at your institution's computing center, or on a commercial cloud computing service such as AWS Elastic Compute Cloud (EC2). A cloud compute instance gives you on-demand access to additional storage and processing power. AWS EC2, in particular, has a very high-bandwidth internal network integration with AWS S3. If you are interested, please follow the instructions for Running the Docker image on an AWS EC2 cloud compute instance.
- A modern version of Windows, macOS, or Linux
- Windows users need to first install Windows Subsystem for Linux (WSL) and Windows Terminal. Then, in Windows Terminal, switch from PowerShell to a Linux shell (such as Ubuntu).
- At least 16 GB of memory (24 GB recommended)
- At least 32 GB of free storage if streaming data from S3 (64 GB recommended); much more if hosting the data locally
We will be using Docker Engine, Docker's command-line tool.
- Windows and macOS users should install Docker Desktop, which comes bundled with Docker Engine.
- Linux users can either install the full Docker Desktop, or Docker Engine for Linux directly.
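Either way, you can confirm that the Docker command-line tool and daemon are working by running Docker's minimal test image from a terminal:

# Check the Docker installation
docker --version
docker run --rm hello-world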
Open your computer terminal, and navigate to the folder you use as your workspace for DESI.
If your DESI data is locally hosted at `local_data_path`, then enter this command:
docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
--volume "$(pwd):/home/synced" \
--volume "local_data_path:/home/desidata:ro" \
ghcr.io/desihub/desidocker:main
- If you want to give the Docker container write access to your data release, remove the `:ro` at the end of the flag.
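As a concrete (hypothetical) example, if your copy of the EDR lived under `$HOME/desi/edr`, the full command would read:

# Hypothetical example: EDR stored under $HOME/desi/edr on the host, mounted read-only
docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
--volume "$(pwd):/home/synced" \
--volume "$HOME/desi/edr:/home/desidata:ro" \
ghcr.io/desihub/desidocker:main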
Otherwise, to access the DESI data hosted on AWS S3, enter this command instead:
docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
--volume "$(pwd):/home/synced" \
--cap-add SYS_ADMIN --device /dev/fuse --security-opt apparmor:unconfined \
ghcr.io/desihub/desidocker:main
- Note that mounting the S3 bucket as a local filesystem requires granting the container sysadmin-level access to your computer's FUSE interface. This is not ideal for security, so if that is a major concern, then we do recommend running a cloud instance.
Once the image starts running, locate the line beginning with `http://127.0.0.1:8888/lab?token=...` in the output, and open the address in your browser.
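As a quick sanity check, you can open a terminal inside JupyterLab and inspect the mounted data; this assumes the image points `$DESI_ROOT` at the release selected by `DESI_RELEASE` (see the options listed near the end of this guide):

# Run inside a JupyterLab terminal in the container
echo "$DESI_ROOT"     # should point at the mounted data release
ls "$DESI_ROOT"       # with the S3 mount, files are downloaded only when accessed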
While you do not need an AWS account to access the DESI data locally, you do have to make one in order to use the AWS EC2 service. Follow the official instructions for First time users of AWS to get started. Once you’ve signed into your account, we recommend switching your region to us-west-2 (Oregon) as that is the region of our S3 bucket. Then, you can navigate to Services » EC2 to set up a cloud compute instance.
To access the Jupyter web server provided by our Docker image, we first need to create a security group that allows inbound network access to the server.
Navigate to Services » EC2 » Security groups, then click Create security group. Fill in the following fields —
- Basic details: Name the security group jupyter.
- Inbound rules: Add the following rules —
| Type | Protocol | Port range | Source type | Source | Description |
|---|---|---|---|---|---|
| Custom TCP | (TCP) | 8888 | My IP | (Your IP) | Open TCP port for Jupyter server |
| HTTPS | (TCP) | (443) | My IP | (Your IP) | Allow HTTPS for Jupyter server |
| SSH | (TCP) | (22) | My IP | (Your IP) | Allow SSH access to the instance |
- If your IP address is not fixed (for example, if you primarily use cellular data or are on a large Wi-Fi network), you should instead select "Custom" for Source type and enter the range of possible IP addresses you use in Source. A command for looking up your current public IP is shown after this list.
- Outbound rules: Add the following rule (if it isn't already there) —
| Type | Protocol | Port range | Source type | Source | Description |
|---|---|---|---|---|---|
| All traffic | (All) | (All) | Anywhere-IPv4 | (0.0.0.0/0) | Allow instance to access the whole internet |
Then click Create security group.
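To look up the public IP address to enter in the Source fields above, you can query AWS's IP echo service from any terminal (this is the same lookup used by the EC2 `docker run` command later in this guide):

# Print your current public IPv4 address
curl -s https://checkip.amazonaws.com

If you enter it under the Custom source type, the console expects CIDR notation, so append `/32` for a single address.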
Navigate to Services » EC2 » Instances, then click Launch instances. Fill in the following fields —
- Name and tags: Pick your own.
- Application and OS Images (Amazon Machine Image): We recommend selecting Amazon Linux, although Ubuntu and other Linux distributions should also work.
- Instance type: We recommend starting with t3.xlarge or t3.2xlarge, due to the memory-intensive nature of processing DESI data. You can upgrade to a larger instance type later if you need more processing power or memory.
- Key pair: Create your own and save the private key file.
- Network settings: Select the jupyter security group we created earlier.
- Configure storage: For free-tier accounts, we recommend the maximum available 30 GiB. There can be a lot of locally cached DESI data!
Then click Launch instance. After the instance has loaded, follow the official instructions to Connect to your instance.
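For reference, connecting over SSH from your local terminal typically looks like the following; the key file name and public DNS name are placeholders for your own values:

# Placeholders: my-key.pem is the private key you saved, the hostname is your instance's public DNS
chmod 400 my-key.pem
ssh -i my-key.pem ec2-user@ec2-XX-XX-XX-XX.us-west-2.compute.amazonaws.com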
Run the following lines to install Git and Docker on Amazon Linux, which uses the `yum` package management system.
# Install Git and Docker
sudo yum update
sudo yum install git
sudo yum install docker
# Give Docker extra permissions
sudo usermod -a -G docker ec2-user    # add ec2-user to the docker group so it can run Docker without sudo
id ec2-user                           # verify that ec2-user is now in the docker group
newgrp docker                         # pick up the new group membership in the current shell
sudo systemctl enable docker.service  # register the Docker service to start with the instance
If you are using a different Linux distribution on your instance, refer to the official instructions to install Docker Engine for Linux instead.
Run this command to start Docker:
sudo systemctl start docker.service
- This needs to be re-run every time you start your instance.
Finally, run this shell command to download and run the image.
docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
-e PUBLIC_IP=$(curl -s https://checkip.amazonaws.com) \
--volume "$(pwd):/home/synced" \
--cap-add SYS_ADMIN --device /dev/fuse --security-opt apparmor:unconfined \
ghcr.io/desihub/desidocker:main
- If you encounter an `unknown server OS` error, you may need to restart Docker.
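On Amazon Linux (and other systemd-based distributions), restarting Docker amounts to:

sudo systemctl restart docker.service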
Once the image starts running, locate the line beginning with `http://...:8888/lab?token=...` in the output, and open the address in your browser.
- To point `$DESI_ROOT` to another public data release, replace `edr` with the other release's name in the `-e DESI_RELEASE=edr` flag.
- The internal and external ports of the Jupyter server are respectively the first and second `8888` in `-p 8888:8888`. Adjust the external port (as well as the port security policy if using EC2) should you encounter port collision issues.
- To sync your changes in the container to a custom local folder, replace `$(pwd)` (which points to the folder where you entered the `docker run` command) with the absolute path to the custom folder in the `--volume "$(pwd):/home/synced"` flag.
- To build the image from source (requires some patience), enter the command `docker build github.com/desihub/desidocker.git --tag desi-docker`. Then, replace the tag `ghcr.io/desihub/desidocker:main` with `desi-docker` when running the image.
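As a worked example, building from source and then streaming from S3 (reusing the flags shown earlier) looks like this:

# Build the image from the GitHub repository and tag it desi-docker
docker build github.com/desihub/desidocker.git --tag desi-docker
# Run the locally built image instead of the published one
docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
--volume "$(pwd):/home/synced" \
--cap-add SYS_ADMIN --device /dev/fuse --security-opt apparmor:unconfined \
desi-docker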
To update your Docker image, run
docker pull ghcr.io/desihub/desidocker:main
See maintainance.md.