changes in all EBI PRIDE urls #647

Merged
merged 6 commits into from
May 2, 2022

ypriverol
Member

Huge PR: it changes all public URLs to the new PRIDE Archive FTP URL. It also moves from ftp to http, because most users want to browse the dataset folder.

ypriverol requested a review from fabianegli May 1, 2022 07:59
ypriverol linked an issue May 1, 2022 that may be closed by this pull request
@ypriverol
Member Author

ypriverol commented May 1, 2022

This PR supersedes PR #629.

ypriverol requested a review from daichengxin May 1, 2022 17:47
@fabianegli
Contributor

fabianegli commented May 2, 2022

I wonder if using https instead of http in the links would have any drawbacks at all? It would definitely be safer to do so and put the links into the "current best practice for web traffic" category. If there is no really good reason not to adopt the secure http protocol we should adopt it.

@ypriverol
Member Author

I will double-check whether https works on the EBI FTP server. Give me one second and I will update the PR. @fabianegli

@fabianegli
Contributor

It should work for you, too - I checked with some urls before making the suggestion ;-)

@julianu
Contributor

julianu commented May 2, 2022

I wonder whether moving to HTTP instead of ftp is a good idea.
It might make it more comfortable for a user to browse to the location, but automatic handling by a tool is actually hindered by this: first, because it changes the protocol; second, because FTP content served over HTTP (and even more so over HTTPS) is much slower on most servers.

@fabianegli
Contributor

fabianegli commented May 2, 2022

HTTPS adds Transport Layer Security which FTP does not have. The added security may cost some performance, but that is generally speaking a worthwhile tradeoff. If you want to make a "fair" comparison, compare to FTPS (FTP over TLS). But that is not even available. So no, FTP is not a good option (any more). I think that download speeds are hardly a limiting factor nowadays - at least not in the context of bioinformatics.

@fabianegli
Contributor

fabianegli commented May 2, 2022

The protocol change should not matter for most users: wget and curl handle both protocols equally, with no need to change command-line arguments. I doubt that anyone using this new format will have trouble because of this change, but I am aware that I don't know all applications, and I would welcome concrete examples of applications where this change does break things.

@fabianegli
Contributor

It also seems that "FTP is faster than HTTPS" is a myth.

See the following Terminal log:

$ wget https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/README.txt
--2022-05-02 13:17:42--  https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/README.txt
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44742 (44K) [text/plain]
Saving to: ‘README.txt’

README.txt                          100%[================================================================>]  43.69K  --.-KB/s    in 0.05s   

2022-05-02 13:17:42 (812 KB/s) - ‘README.txt’ saved [44742/44742]

$ wget ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/README.txt
--2022-05-02 13:18:11--  ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/README.txt
           => ‘README.txt’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pride-archive/2019/09/PXD007058 ... done.
==> SIZE README.txt ... 44742
==> PASV ... done.    ==> RETR README.txt ... done.
Length: 44742 (44K) (unauthoritative)

README.txt                          100%[================================================================>]  43.69K  --.-KB/s    in 0.1s    

2022-05-02 13:18:11 (390 KB/s) - ‘README.txt’ saved [44742]

$ wget https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
--2022-05-02 13:22:30--  https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202338736 (193M) [application/octet-stream]
Saving to: ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’

SF_200217_pPeptideLibrary_pool1_HCD 100%[================================================================>] 192.96M  3.48MB/s    in 1m 56s  

2022-05-02 13:24:26 (1.67 MB/s) - ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’ saved [202338736/202338736]

$ rm SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw 
$ wget ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
--2022-05-02 13:24:51--  ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
           => ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pride-archive/2019/09/PXD007058 ... done.
==> SIZE SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw ... 202338736
==> PASV ... done.    ==> RETR SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw ... done.
Length: 202338736 (193M) (unauthoritative)

SF_200217_pPeptideLibrary_pool1_HCD 100%[================================================================>] 192.96M  7.19MB/s    in 2m 19s  

2022-05-02 13:27:11 (1.38 MB/s) - ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’ saved [202338736]

$ rm SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw 
$ wget ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
--2022-05-02 13:27:24--  ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
           => ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pride-archive/2019/09/PXD007058 ... done.
==> SIZE SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw ... 202338736
==> PASV ... done.    ==> RETR SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw ... done.
Length: 202338736 (193M) (unauthoritative)

SF_200217_pPeptideLibrary_pool1_HCD 100%[================================================================>] 192.96M  25.1MB/s    in 9.7s    

2022-05-02 13:27:34 (20.0 MB/s) - ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’ saved [202338736]

$ rm SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw 
$ wget https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
--2022-05-02 13:27:51--  https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202338736 (193M) [application/octet-stream]
Saving to: ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’

SF_200217_pPeptideLibrary_pool1_HCD 100%[================================================================>] 192.96M  15.2MB/s    in 16s     

2022-05-02 13:28:08 (11.9 MB/s) - ‘SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.raw’ saved [202338736/202338736]

$ wget ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw
--2022-05-02 13:29:18--  ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw
           => ‘SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pride-archive/2019/09/PXD007058 ... done.
==> SIZE SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw ... 516812072
==> PASV ... done.    ==> RETR SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw ... done.
Length: 516812072 (493M) (unauthoritative)

SF_200217_U2OS_TiO2_EThcD_IT_rep2.r 100%[================================================================>] 492.87M  26.2MB/s    in 23s     

2022-05-02 13:29:41 (21.8 MB/s) - ‘SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw’ saved [516812072]

$ wget https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw
--2022-05-02 13:30:01--  https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 516812072 (493M) [application/octet-stream]
Saving to: ‘SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw.1’

SF_200217_U2OS_TiO2_EThcD_IT_rep2.r 100%[================================================================>] 492.87M  27.9MB/s    in 19s     

2022-05-02 13:30:20 (26.2 MB/s) - ‘SF_200217_U2OS_TiO2_EThcD_IT_rep2.raw.1’ saved [516812072/516812072]

The point here is, I suspect, that our infrastructure (software and hardware) is constantly being optimized to work well with TLS network traffic. The second thing that can be guessed from the initially slow download speeds is that some protocol-independent optimization took place: either the archive server internally made the files of the PXD007058 folder available for fast download, or the network adapted in some other way (e.g. better routing). Anyway, this small session shows no clear speed benefit for FTP transfers.

@julianu
Contributor

julianu commented May 2, 2022

Ok, just tested it as well (in the university network) for the following file: ftp://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw
Actually, the test was already running while you answered

with FTP: real 0m18.087s
with HTTP: real 7m31.798s

So please don't disregard this: I have experienced it on several servers (e.g. PRIDE, UniProt), and before you ask, this is not a firewall issue.

time wget ftp://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw
--2022-05-02 13:29:30--  ftp://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw
           => ‘UPS1_12500amol_R1.raw.2’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pride-archive/2015/12/PXD001819 ... done.
==> SIZE UPS1_12500amol_R1.raw ... 1711906848
==> PASV ... done.    ==> RETR UPS1_12500amol_R1.raw ... done.
Length: 1711906848 (1.6G) (unauthoritative)

UPS1_12500amol_R1.raw.2       100%[=================================================>]   1.59G  79.8MB/s    in 18s

2022-05-02 13:29:48 (91.4 MB/s) - ‘UPS1_12500amol_R1.raw.2’ saved [1711906848]


real    0m18.087s
user    0m0.614s
sys     0m8.583s
time wget http://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw
--2022-05-02 13:30:14--  http://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1711906848 (1.6G) [application/octet-stream]
Saving to: ‘UPS1_12500amol_R1.raw.3’

UPS1_12500amol_R1.raw.3       100%[=================================================>]   1.59G  3.69MB/s    in 7m 32s

2022-05-02 13:37:46 (3.62 MB/s) - ‘UPS1_12500amol_R1.raw.3’ saved [1711906848/1711906848]

real    7m31.798s
user    0m2.394s
sys     0m18.875s
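
To put the two `time` results on the same scale, here is a minimal sketch that converts a file size and a wall-clock time into a mean transfer rate. The helper name `throughput` is made up for illustration; the byte count and timings are the ones reported in the log above.

```shell
# throughput BYTES SECONDS
# Print the mean transfer rate in MiB/s (1 MiB = 1048576 bytes),
# rounded to one decimal place. Hypothetical helper, for illustration only.
throughput() {
  LC_ALL=C awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1048576 }'
}

# Values from the log above: 1711906848 bytes, 18.087 s over FTP and
# 7m31.798s (451.798 s) over HTTP.
#   throughput 1711906848 18.087
#   throughput 1711906848 451.798
```

This works out to roughly 90 MiB/s over FTP and 3.6 MiB/s over HTTP, in line with the rates wget printed, so in this single run the gap is about a factor of 25.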

Regarding the protocol: whether the change matters obviously depends on the underlying programming language. But as SDRF is not yet widely used, all implementations could probably cope easily.

@julianu
Contributor

julianu commented May 2, 2022

I also tested your small examples: they are fine and quick, yes, no problem there. I even have faster downloads than you.
Could it be they are on different servers with different implementations?

@ypriverol
Member Author

The URL must be a public URL; there is no specification about the protocol. I prefer https because it has the additional value that people can browse. Remember that all of these are examples; this is not a resource.

@ypriverol
Member Author

Please remember to accept the PR when you are comfortable with it. I will merge it and clean up the issues.

@fabianegli
Contributor

fabianegli commented May 2, 2022

@julianu You tested HTTP, please try it with HTTPS.

Also note that if you downloaded it before, it might already be (partially) cached in various network nodes. Assessing network speeds is not trivial. In my example the first access to a file in the repo was really slow, the subsequent ones not so much, and the last time I tried, with a raw file I had not downloaded before, it was almost the same for FTP and HTTPS.

As I mentioned before, testing the impact of the protocol on network transfer the way we do is reading tea leaves at best, without in-depth knowledge of or access to the server logs of the EBI archive system. The point I am trying to make here is that the https protocol is not slow, probably because it is served with better protocols than plain http.
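
Since a single run says little, one way to make such comparisons less like reading tea leaves is to repeat each download several times and compare medians. This is a rough sketch, assuming curl is installed; `median` and `measure` are hypothetical helper names, and `%{speed_download}` is curl's write-out variable for the mean download rate in bytes per second.

```shell
# median: read one number per line on stdin and print the median.
median() {
  LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                        # no input, no median
      if (NR % 2) print v[(NR + 1) / 2]          # odd count: middle value
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2 # even count: mean of the two middle values
    }'
}

# measure URL: download to /dev/null and print the mean rate in bytes/s.
measure() {
  curl -s -o /dev/null -w '%{speed_download}\n' "$1"
}

# Example (needs network access; URL taken from this thread):
#   for i in 1 2 3 4 5; do
#     measure 'https://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/README.txt'
#   done | median
```

Running the same loop with the `ftp://` form of the URL gives the counterpart figure. This does not remove caching or routing effects; it only makes a single slow or fast outlier less decisive.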

Contributor

fabianegli left a comment

From my perspective the observable benefits gained by changing from the FTP to the HTTPS protocol outweigh any potential/hypothetical downsides.

I approve this PR with the caveat that I only sampled a few files for manual review. The fact that the sdrf-pipelines check passes eases my mind about also accepting the rest.

@ypriverol If you could document in this PR how you made the changes, that might be a useful resource to troubleshoot future issues, if any ever surface. I suspect a find/replace of some sort?
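
How the URLs were actually changed is not documented in this thread. If it was a find/replace, the scheme part of it could be sketched as below. `rewrite_scheme` is a hypothetical name, and any change of host or path would need its own pattern.

```shell
# Hypothetical filter, not the method used in the PR: switch links on
# ftp.ebi.ac.uk from ftp:// to https:// and leave everything else untouched.
# Reads text on stdin and writes the rewritten text to stdout.
rewrite_scheme() {
  sed 's|ftp://ftp\.ebi\.ac\.uk/|https://ftp.ebi.ac.uk/|g'
}

# Example on one URL from this thread:
#   echo 'ftp://ftp.ebi.ac.uk/pride-archive/2019/09/PXD007058/README.txt' | rewrite_scheme
```

Applied to a file, this would be `rewrite_scheme < old.sdrf.tsv > new.sdrf.tsv`, where both file names are placeholders. Writing to a new file rather than editing in place keeps the original around for a diff.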

ypriverol merged commit 6f31044 into bigbio:master on May 2, 2022