Downloading data from BaseSpace #8
Comments
@CliffO89: Can you provide some input into authentication? How have you been authenticating before downloading your data? When we add this to MIPTools, it seems tedious to need to authenticate each time you'd like to download data. In that case, the user may want to store their config file somewhere and load it using |
This issue and accompanying PR will be incorporated into future versions of MIPTools. |
bs is a clean CLI for interacting with the Illumina cloud
|
What are the specific issues with the current Illumina downloader script? Why would it cause breakage down the line? Depending on what the problems are, we may try to fix that script instead. |
In order to install the CLI, I did not have to pay or log in to any account. The CLI itself, I believe, is free. It can be downloaded simply by using
You do need an Illumina account in order to download data, but this is no different from the current download app.
If it is installed within the container, users should still be able to access additional commands by using the
In my view, the benefits of the proposed app compared to the current app are as follows:
Footnotes
|
If it is faster we should use CLI. This was the opposite when I was testing some years ago; bs was much slower. The current script is also Illumina's software, btw. It is probably not supported anymore, so it would be up to us to maintain it if needed. |
With the example dataset I downloaded, the CLI was noticeably faster. I could certainly run some more tests to compare download speed as well...
Ahh, I did not know this. |
Some speed testing may be good but I don't know if we need extensive testing. As long as it is not noticeably slower, we should be fine. I wish they included bcl2fastq capability in the client as well. |
Yeah, that would have been nice... I haven't seen any changes to this since 2017, which I imagine is what we are using now... |
A quick benchmark test comparing the two methods: Python script run through singularity: Footnotes
|
OK. People can download it and place it in the container if they build. If they don't build, is it worth the effort to have them put it in the container somehow versus running it standalone? Do we use it for anything else? Why put it in the container for others if we just use it for a single command? |
So in this case, users do not need to place anything in the container. The container packages together a set of software and tools for others to use in one environment. The CLI will be shipped with the container. This is essentially how all the other software in the container is used. For example, MIPWrangler and McCOILR do not need to be downloaded and placed in the container by a user; the tools are already installed in the container so that people can easily use the programs (for reference, the
To summarize, there is no extra work needed by users regardless of whether they build the container or not.
In my view, the reason for putting a tool like this in the container is to simplify the pipeline for analyzing data. By placing the download app in the container, we can leverage some control over the processes used by the user (i.e. where the data is downloaded). All the downstream tools expect data in specific locations (ex: |
I am not sure their license allows for it. I see you can add it to the definition and someone can build their own and it will pop right in. But can we distribute it in our prebuilt? If not, then where does a user drop it in? Or do they just use it externally if they download the prebuilt container? |
I have not been able to find anything suggesting that we are unable to distribute the CLI in our prebuilt container. I can certainly continue looking, but I do not think there is anything preventing us from doing this. |
Am I reading this right? 53 sec vs 17 min? What is CLI downloading in 53 sec, an entire run? |
I don't see any restrictions in terms of distributing the software. There is no license to be found and you don't have to agree to terms at any step. So I think it is safe to assume we can include it in the prebuilt. |
So I rebuilt the container with the new download app and here are the results of the benchmarking. Benchmarks were run using hyperfine.
New Download App:
singularity run \
-B base_resources:/opt/resources -B download-test:/opt/analysis \
--app download /work/apascha1/deploy-miptools/MIPTools/download.sif \
-i 214264108
Time (mean ± σ): 64.995 s ± 10.663 s [User: 25.173 s, System: 12.232 s]
Range (min … max): 44.411 s … 77.282 s 10 runs
Superseded Download App:
singularity run \
-B base_resources:/opt/resources -B download-superseded-test:/opt/analysis \
--app download_superseded \
/work/apascha1/deploy-miptools/MIPTools/download.sif \
-r 214264108
Time (mean ± σ): 833.713 s ± 22.844 s [User: 41.189 s, System: 12.538 s]
Range (min … max): 814.501 s … 872.990 s 5 runs
Both output directories have the same number of files as well... so the CLI is downloading an entire run about 13 times faster than the script. |
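For reference, the "about 13 times" figure follows from the two hyperfine means reported above. A small sketch of the arithmetic using awk (the function name `speedup` is only illustrative):

```shell
# Mean wall-clock times in seconds, copied from the hyperfine output above.
new_app_mean=64.995
superseded_mean=833.713

# speedup OLD NEW: prints OLD / NEW rounded to one decimal place.
speedup() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", old / new }'
}

speedup "$superseded_mean" "$new_app_mean"
```

Note that the two benchmarks used different run counts (10 vs 5), so the ratio is a rough comparison rather than a precise measurement.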
Looks great. Too great :) Do you know the size of this run? Typically a run will have some tens of GB of data to download. 1 min seems too short. Unless of course this is a small test run with little data? My worry is that the CLI may be downloading symlinks. If none of these concerns are valid, I am sold. |
Good question. The run size is about 2.66 GB (checked via
Comparing the directory sizes of the two folders where I downloaded data shows the exact same size for each folder. Given this, I do not believe we have any symlinking going on.
> du -sh download-test
3.3G download-test
> du -sh download-superseded-test
3.3G download-superseded-test |
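Directory size alone is indirect evidence, so the symlink concern could also be checked directly. A minimal sketch using `find`; the helper names are hypothetical, and the directory names in the usage comments are the ones from the benchmark above:

```shell
# count_symlinks DIR: prints the number of symbolic links anywhere under DIR.
count_symlinks() {
  find "$1" -type l | wc -l | tr -d ' '
}

# count_regular_files DIR: prints the number of regular files under DIR.
# find does not follow symlinks by default, so links are not counted here.
count_regular_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Example (not executed here):
#   count_symlinks download-test
#   count_regular_files download-test
#   count_regular_files download-superseded-test
```

If `count_symlinks` prints 0 for the CLI output and the regular file counts match across both directories, the downloads are real files rather than links.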
Looks good. I'll merge your PR and close the issue if you have no objections. |
No objections! Thanks for all the comments! |
Sure! Thanks for improving MIPTools. |
Proposal
Currently, we use an outdated Python script to download data from BaseSpace: https://github.com/bailey-lab/MIPTools/blob/70c9c26cd86af33f5eb75bdd0c4c43edfedc26d4/bin/BaseSpaceRunDownloader_v2.py. BaseSpace has released tools to work with data from the command line. To prevent code breakage down the line, we propose using BaseSpace's tools.
Working With BaseSpace CLI
Below, I discuss some of my thoughts in reading through the BaseSpace documentation.
Installation
Installation is straightforward; however, we may consider changing the installation location. We may also need to change file permissions using chmod.
Authentication
Interactively, the user can run the authentication command and then go to the URL provided to sign in.
bs auth
#> Please go to this URL to authenticate: https://basespace.illumina.com/oauth/device?code=6Cesj
The resulting config file will be stored in $HOME/.basespace.
However, there are a couple of additional factors to consider. We need to think about the best way to automate this process. A couple of notes to consider:
bs load config. This may make it easier to inject credentials.
Downloading data
Downloading data is simple, but there are many options. What is the best strategy to implement for our purposes?
Implementation
We can install the CLI tool into our container. We will need to modify the download app to call either a series of commands or a script rather than the python script. We can provide several options with default values:
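As a rough illustration of what such a script could look like, here is a minimal sketch. Everything in it is an assumption rather than the final design: the helper names are hypothetical, the `.cfg` extension for config files and the `--id`/`--output` flags of `bs download run` should be confirmed against `bs download run --help`, and the `/opt/analysis` default simply mirrors the bind path used in the benchmarks.

```shell
#!/bin/sh
# Sketch of a download wrapper; option names and defaults are hypothetical.

# has_basespace_config [DIR]: succeeds if DIR (default: $HOME/.basespace)
# contains at least one config file.
# Assumption: config files carry a .cfg extension.
has_basespace_config() {
  config_dir="${1:-$HOME/.basespace}"
  for f in "$config_dir"/*.cfg; do
    [ -f "$f" ] && return 0
  done
  return 1
}

# build_download_cmd RUN_ID [OUT_DIR]: prints the bs command that would be run.
# Assumption: `bs download run` accepts --id and --output.
build_download_cmd() {
  run_id="$1"
  out_dir="${2:-/opt/analysis}"   # hypothetical default output location
  printf 'bs download run --id %s --output %s\n' "$run_id" "$out_dir"
}

# Example (not executed here):
#   has_basespace_config || bs auth
#   eval "$(build_download_cmd 214264108)"
```

Printing the command instead of running it keeps the sketch testable without the CLI installed; checking for an existing config first means the user only needs to run `bs auth` once rather than before every download.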