Downloading data from BaseSpace #8

Closed
arisp99 opened this issue Oct 7, 2021 · 22 comments · Fixed by #13 or #25

@arisp99
Member

arisp99 commented Oct 7, 2021

Proposal

Currently, we use an outdated python script to download data from BaseSpace: https://github.com/bailey-lab/MIPTools/blob/70c9c26cd86af33f5eb75bdd0c4c43edfedc26d4/bin/BaseSpaceRunDownloader_v2.py. BaseSpace has released tools to work with data on the CLI. To prevent code breakage down the line, we propose using BaseSpace's tools.

Working With BaseSpace CLI

Below, I discuss some of my thoughts from reading through the BaseSpace documentation.

Installation

Installation is straightforward. However, we may consider changing the installation location. We may also need to change file permissions using chmod.

# Install
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs

# Change file permission
chmod u+x $HOME/bin/bs

Authentication

Interactively, the user can run the authentication command and then go to the URL provided to sign in.

bs auth
#> Please go to this URL to authenticate:  https://basespace.illumina.com/oauth/device?code=6Cesj

The resulting config file will be stored in $HOME/.basespace.

However, we still need to decide on the best way to automate this process. A couple of notes to consider:

  1. We are able to specify the API server. It may be useful to let the user customize this depending on where they are located (e.g., US vs UK).
  2. The user can store config info in a file and then load it: bs load config. This may make it easier to inject credentials.
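As a rough illustration of the two notes above, the sketch below writes a config file non-interactively so that credentials could be injected without running bs auth each time. It assumes the CLI reads `apiServer` and `accessToken` key/value pairs from `default.cfg` inside the config directory; that layout, the helper name `write_bs_config`, and the `BS_TOKEN` variable are assumptions to check against the file that bs auth actually creates.

```shell
# Sketch: write a BaseSpace CLI config file without the interactive prompt.
# ASSUMPTION: the CLI reads "key = value" pairs (apiServer, accessToken)
# from <config dir>/default.cfg. Verify against a file created by `bs auth`.
write_bs_config() {
    cfg_dir=$1      # e.g. "$HOME/.basespace"
    api_server=$2   # API endpoint for the user's region
    token=$3        # access token obtained separately (hypothetical input)

    mkdir -p "$cfg_dir" || return 1
    # Use a restrictive umask in a subshell so a newly created file
    # is readable only by its owner.
    (
        umask 077
        printf 'apiServer   = %s\naccessToken = %s\n' "$api_server" "$token" \
            > "$cfg_dir/default.cfg"
    )
}

# Example call (not executed here):
# write_bs_config "$HOME/.basespace" "https://api.basespace.illumina.com" "$BS_TOKEN"
```

Keeping the API server as a parameter leaves room for users in different regions to supply their own endpoint.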

Downloading data

Downloading data is simple, but there are many options. What is the best strategy to implement for our purposes?

# Single run
bs download run -i <RunID> -o <output>

# Multiple runs in a project
bs download project -i <ProjectID> -o <output>

# Subset of a project
bs download project -i <ProjectID> -o <output> --extension=fastq.gz
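One possible way to handle the three cases above in a single place is a small helper that assembles the command and prints it, so the choice between run and project downloads can be reviewed before anything is fetched. The helper name `bs_download_cmd` is hypothetical, and it uses only the flags already shown above (`-i`, `-o`, `--extension`).

```shell
# Sketch: print the bs download command for a run or a project (dry run).
# Nothing is downloaded; the caller decides whether to execute the result.
bs_download_cmd() {
    kind=$1       # "run" or "project"
    id=$2         # run ID or project ID
    out=$3        # output directory
    ext=${4:-}    # optional file extension filter

    case $kind in
        run|project) ;;
        *) echo "unknown kind: $kind" >&2; return 1 ;;
    esac

    cmd="bs download $kind -i $id -o $out"
    # --extension is shown above only for projects, so it is applied
    # only in that case and ignored for single runs.
    if [ -n "$ext" ] && [ "$kind" = project ]; then
        cmd="$cmd --extension=$ext"
    fi
    printf '%s\n' "$cmd"
}

bs_download_cmd run 214264108 /opt/analysis
bs_download_cmd project 12345 /opt/analysis fastq.gz
```

The IDs and output path in the two calls are placeholders for illustration.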

Implementation

We can install the CLI tool into our container. We will need to modify the download app to call either a series of commands or a script rather than the python script. We can provide several options with default values:

Flag              Function
-s, --api-server  the API server
-i, --run-id      the run ID
-o, --out-dir     the output directory
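A minimal sketch of how the download app might parse these options with default values is shown below. The default API server and the default output directory are assumptions for illustration only, as is the function name `parse_download_args`. The sketch uses getopts, which handles only the short flags; the long forms would need additional handling.

```shell
# Sketch: parse -s, -i and -o with defaults for a download wrapper.
# ASSUMPTION: both default values below are placeholders, not the
# values chosen by the MIPTools maintainers.
parse_download_args() {
    api_server="https://api.basespace.illumina.com"
    run_id=""
    out_dir="/opt/analysis"

    OPTIND=1    # reset so the function can be called more than once
    while getopts "s:i:o:" opt; do
        case $opt in
            s) api_server=$OPTARG ;;
            i) run_id=$OPTARG ;;
            o) out_dir=$OPTARG ;;
            *) return 2 ;;
        esac
    done

    # The run ID has no sensible default, so it is required.
    if [ -z "$run_id" ]; then
        echo "error: a run ID (-i) is required" >&2
        return 2
    fi
    printf '%s %s %s\n' "$api_server" "$run_id" "$out_dir"
}

parse_download_args -i 214264108
```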
@arisp99 arisp99 added the breaking change ☠️ API change likely to affect existing code label Oct 7, 2021
@arisp99
Member Author

arisp99 commented Oct 7, 2021

@CliffO89: Can you provide some input into authentication? How have you been authenticating before downloading your data?

When we add this to MIPTools, it seems tedious to need to authenticate each time you'd like to download data. In that case, the user may want to store their config file somewhere and load it using bs load config.

@arisp99 arisp99 self-assigned this Nov 3, 2021
@arisp99 arisp99 added this to the 2.0.0 milestone Nov 3, 2021
@arisp99
Member Author

arisp99 commented Nov 4, 2021

This issue and accompanying PR will be incorporated into future versions of MIPTools.

@JeffAndBailey
Member

JeffAndBailey commented Feb 25, 2022

bs is a clean CLI for interacting with the Illumina cloud

  1. we use it for downloading MIP sequencing run
  2. it is not free -- so a user must add it not us to repository
  3. configuration and authentication is required
  4. users may want to access additional commands
    Therefore, to add or not to add: is it better for it to stand alone or be integrated into MIPTools? I would suggest that if it is a simple, single command, we have users run it outside of MIPTools. If we need it internally, then integrate it.
    The reason it was previously in the pipeline was, in large part, that there wasn't a clean interface.

@aydemiro
Contributor

What are the specific issues with the current Illumina downloader script? Why would it cause breakage down the line? Depending on what the problems are, we may try to fix that script instead.

@arisp99
Member Author

arisp99 commented Feb 25, 2022

  1. it is not free -- so a user must add it not us to the repository

In order to install the CLI, I did not have to pay or log in to any account. The CLI itself, I believe, is free. It can be downloaded simply by using wget, curl, or even brew.

  1. configuration and authentication is required

You do need an Illumina account in order to download data, but this is no different from the current download app.

  1. users may want to access additional commands

If it is installed within the container, users should still be able to access additional commands by using the singularity exec command.

In my view, the benefits of the proposed app compared to the current app are as follows:

  • Use an official tool instead of a web-scraping Python script
  • The Python script uses deprecated tools to download data and, therefore, is likely to break [1]
  • The official CLI is significantly faster in terms of download speeds.

Footnotes

  1. This could be fixed, but I believe it is better to use an official, faster tool.

@aydemiro
Contributor

If it is faster we should use CLI. This was the opposite when I was testing some years ago; bs was much slower. The current script is also Illumina's software, btw. It is probably not supported anymore, so it would be up to us to maintain it if needed.

@arisp99
Member Author

arisp99 commented Feb 25, 2022

If it is faster we should use CLI. This was the opposite when I was testing some years ago; bs was much slower.

With the example dataset I downloaded, the CLI was noticeably faster. I could certainly run some more tests to compare download speed as well...

The current script is also Illumina's software, btw. It is probably not supported anymore, so it would be up to us to maintain it if needed.

Ahh, I did not know this.

@arisp99 arisp99 removed the breaking change ☠️ API change likely to affect existing code label Feb 25, 2022
@aydemiro
Contributor

Some speed testing may be good but I don't know if we need extensive testing. As long as it is not noticeably slower, we should be fine. I wish they included bcl2fastq capability in the client as well.

@arisp99
Member Author

arisp99 commented Feb 26, 2022

I wish they included bcl2fastq capability in the client as well.

Yeah, that would have been nice... I haven't seen any changes to this since 2017, which I imagine is what we are using now...

@arisp99
Member Author

arisp99 commented Feb 26, 2022

A quick benchmark test comparing the two methods:

Python script run through singularity: 63.47s user 18.60s system 7% cpu 17:18.41 total
bs CLI run via command line [1]: 29.45s user 13.27s system 79% cpu 53.829 total

Footnotes

  1. Note that it was not run through singularity. I will test this when I have rebuilt the container with the proposed download app.

@JeffAndBailey
Member

OK. People can download it and place it in the container if they build. If they don't build, is it worth the effort to have them put it in the container somehow versus running it standalone? Do we use it for anything else? Why put it in the container for others if we just use it for a single command?

@arisp99
Member Author

arisp99 commented Feb 26, 2022

So in this case, users do not need to place anything in the container. The container packages together a set of software and tools for others to use in one environment. The CLI will be shipped with the container. This is essentially how all the other software in the container is used. For example, MIPWrangler and McCOILR do not need to be downloaded and placed in the container by a user; the tools are already installed in the container so that people can easily use the programs (for reference, the %post section of the definition file defines all the software installed in the container). This is exactly how the CLI will be installed.

To summarize, there is no extra work needed by users regardless of whether they build the container or not.

Why put it in the container for others if we just use it for a single command?

In my view, the reason for putting a tool like this in the container is to simplify the pipeline for analyzing data. By placing the download app in the container, we gain some control over the process the user follows (i.e., where the data is downloaded). All the downstream tools expect data in specific locations (e.g., /opt/data and /opt/analysis). If users were to manually install the CLI, download their data, and then use a container without the download app, I think there would be more variability in the inputs fed into the rest of the apps, which could break downstream methods. On top of that, shipping the tool with the container eliminates an extra step, as users no longer need to worry about installing it.

@JeffAndBailey
Member

I am not sure their license allows for it. I see you can add it to the definition and someone can build their own and it will pop right in. But can we distribute it in our prebuilt? If not, then where does a user drop it in? Or do they just use it externally if they download the prebuilt container?

@arisp99
Member Author

arisp99 commented Feb 26, 2022

I have not been able to find anything suggesting that we are unable to distribute the CLI in our prebuilt container. I can certainly continue looking, but I do not think there is anything preventing us from doing this.

@aydemiro
Contributor

A quick benchmark test comparing the two methods:

Python script run through singularity: 63.47s user 18.60s system 7% cpu 17:18.41 total
bs CLI run via command line [1]: 29.45s user 13.27s system 79% cpu 53.829 total

Footnotes

  1. Note that it was not run through singularity. I will test this when I have rebuilt the container with the proposed download app.

Am I reading this right? 53 sec vs 17 min? What is CLI downloading in 53 sec, an entire run?

@aydemiro
Contributor

I don't see any restrictions in terms of distributing the software. There is no license to be found and you don't have to agree to terms at any step. So I think it is safe to assume we can include it in the prebuilt.

@arisp99
Member Author

arisp99 commented Feb 28, 2022

A quick benchmark test comparing the two methods:
Python script run through singularity: 63.47s user 18.60s system 7% cpu 17:18.41 total
bs CLI run via command line [1]: 29.45s user 13.27s system 79% cpu 53.829 total

Footnotes

  1. Note that it was not run through singularity. I will test this when I have rebuilt the container with the proposed download app.

Am I reading this right? 53 sec vs 17 min? What is CLI downloading in 53 sec, an entire run?

So I rebuilt the container with the new download app and here are the results of the benchmarking. Benchmarks were run using hyperfine.

New Download App: singularity run \
    -B base_resources:/opt/resources -B download-test:/opt/analysis \
    --app download /work/apascha1/deploy-miptools/MIPTools/download.sif \
    -i 214264108
  Time (mean ± σ):     64.995 s ± 10.663 s    [User: 25.173 s, System: 12.232 s]
  Range (min … max):   44.411 s … 77.282 s    10 runs

Superseded Download App: singularity run \
    -B base_resources:/opt/resources -B download-superseded-test:/opt/analysis \
    --app download_superseded \
    /work/apascha1/deploy-miptools/MIPTools/download.sif \
    -r 214264108
  Time (mean ± σ):     833.713 s ± 22.844 s    [User: 41.189 s, System: 12.538 s]
  Range (min … max):   814.501 s … 872.990 s    5 runs

Both output directories have the same number of files as well... so the CLI is downloading an entire run about 13 times faster than the script.
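The "about 13 times" figure follows from the two mean times reported by hyperfine above. A small sketch of the arithmetic, using awk for the floating-point division (the function name `speedup` is only illustrative):

```shell
# Ratio of the mean wall-clock times reported above:
# superseded app (833.713 s) divided by new app (64.995 s).
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 833.713 64.995   # prints 12.8
```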

@aydemiro
Contributor

Looks great. Too great :) Do you know the size of this run? Typically a run will have some tens of GB of data to download. 1 min seems too short. Unless, of course, this is a small test run with little data? My worry is that the CLI may be downloading symlinks. If none of these concerns are valid, I am sold.

@arisp99
Member Author

arisp99 commented Feb 28, 2022

Do you know the size of this run?

Good question. The run size is about 2.66 GB (checked via bs get run and the basespace website).

My worry is that CLI may be downloading symlinks.

Comparing the directory sizes of the two folders where I downloaded data shows the exact same size for each folder. Given this, I do not believe we have any symlinking going on.

> du -sh download-test
3.3G	download-test

> du -sh download-superseded-test
3.3G	download-superseded-test
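Beyond comparing directory sizes, a more direct check for the symlink concern would be to count symbolic links in each download directory. The sketch below is one way to do that with standard tools; the helper names `count_symlinks` and `dir_kib` are hypothetical, and the directory names in the commented example are the ones used above.

```shell
# Count symbolic links anywhere under a directory.
count_symlinks() {
    find "$1" -type l | wc -l | tr -d ' '
}

# Disk usage of a directory in KiB, as a plain number for easy comparison.
dir_kib() {
    du -sk "$1" | cut -f1
}

# Example (not executed here):
# count_symlinks download-test
# count_symlinks download-superseded-test
# [ "$(dir_kib download-test)" -eq "$(dir_kib download-superseded-test)" ] && echo "same size"
```

A count of zero for both directories would rule out symlinks directly rather than inferring it from matching sizes.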

@aydemiro
Contributor

Looks good. I'll merge your PR and close the issue if you have no objections.

@arisp99
Member Author

arisp99 commented Feb 28, 2022

No objections! Thanks for all the comments!

@aydemiro
Contributor

Sure! Thanks for improving MIPTools.
