changes in all EBI PRIDE urls #647
This PR superseded PR #629 |
I wonder if using https instead of http in the links would have any drawbacks at all? It would definitely be safer to do so and put the links into the "current best practice for web traffic" category. If there is no really good reason not to adopt the secure http protocol we should adopt it. |
I will double check if https in the EBI FTP is working. Give me one second and I will update the PR. @fabianegli |
It should work for you, too - I checked with some urls before making the suggestion ;-) |
I wonder whether moving to HTTP instead of ftp is a good idea. |
HTTPS adds Transport Layer Security which FTP does not have. The added security may cost some performance, but that is generally speaking a worthwhile tradeoff. If you want to make a "fair" comparison, compare to FTPS (FTP over TLS). But that is not even available. So no, FTP is not a good option (any more). I think that download speeds are hardly a limiting factor nowadays - at least not in the context of bioinformatics. |
The protocol change should not matter for most, wget and curl handle both protocols equally with no need to change command line arguments. I doubt that someone using this new format will have any trouble because of this change, but I am aware that I don't know all applications and would welcome concrete examples of applications where this change indeed breaks things. |
It also seems that "FTP is faster than HTTPS" is a myth. See the following Terminal log:
I suspect two things are going on here. First, our infrastructure (software and hardware) is constantly being optimized to work well with TLS network traffic. Second, the initially slow download speeds suggest some protocol-independent optimization: either the archive server internally made the files of the PXD007058 folder available for fast download, or the network adapted in some other way (e.g. better routing). Either way, this small session shows no clear speed benefit for FTP transfers. |
Ok, just tested it as well (in the university network) for the following file: ftp://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw with FTP: real 0m18.087s. So, please don't disregard this. I experienced this on several servers (e.g. PRIDE, UniProt), and before you ask: this is no firewall issue.
Regarding the protocol: obviously depends on the underlying programming language. But as SDRF is not yet too widely spread, all implementations would probably easily cope. |
I also tested your small examples: they are fine and quick, yes, no problem there. I even have faster downloads than you. |
The URL must be a public URL; there is no specification about the protocol. I prefer https because it has the additional value that people can browse the folders. Remember that all of these are examples; this is not a resource. |
Please remember to accept the PR when you are comfortable with it. I will merge it and clean up the issues. |
@julianu You tested HTTP, please try it with HTTPS. Also note that if you downloaded the file before, it might already be (partially) cached in various network nodes. Assessing network speeds is not trivial. In my example, the first access to a file in the repo was really slow, the subsequent ones not so much, and the last time I tried, with a raw file I had not downloaded before, the speed was almost the same for FTP and HTTPS. As I mentioned before, testing the impact of the protocol on network transfer the way we do is reading tea leaves at best without in-depth knowledge of, or access to, the server logs of the EBI archive system. The point I am trying to make here is that the https protocol is not slow, probably because it has better protocols than http. |
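To make such comparisons a bit less like reading tea leaves, a repeated-measurement harness could look like the sketch below. It is only an illustration, not something used in this PR: the names `time_fetch` and `summarize` are hypothetical, and the download itself is passed in as a callable so that the timing logic stays independent of the protocol. The first run is reported separately because, as noted above, it may include cold-cache effects.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_fetch(fetch: Callable[[], object], repeats: int = 5) -> List[float]:
    """Call fetch() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Summarize timings, keeping the (possibly uncached) first run separate."""
    later = durations[1:] or durations  # fall back to the only run if there is just one
    return {
        "first": durations[0],
        "median_later": statistics.median(later),
        "min": min(durations),
        "max": max(durations),
    }
```

One possible use is to pass `lambda: urllib.request.urlopen(url).read()` as `fetch`, once with the `ftp://` URL and once with the `https://` URL of the same file, and compare the two summaries. Even then, caching along the route and server-side load can still dominate the result.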
From my perspective the observable benefits gained by changing from the FTP to the HTTPS protocol outweigh any potential/hypothetical downsides.
I approve this PR with the caveat that I only sampled a few files for manual review. The fact that the sdrf-pipelines check passes eases my mind about also accepting the rest.
@ypriverol If you could document in this PR how you made the changes, that might be a useful resource to troubleshoot future issues, if any ever surface. I suspect a find/replace of some sort?
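For reference, a find/replace of this sort could be sketched as follows. This is an assumption about what such a step might look like, not a record of how the changes in this PR were actually made: the function name `ftp_to_https` is hypothetical, and the sketch only swaps the scheme for links on `ftp.ebi.ac.uk` (the host from the example URL in this thread), leaving paths and other servers untouched.

```python
import re

# Assumption: only links on this host are rewritten; other FTP servers are left as-is.
EBI_FTP_URL = re.compile(r"\bftp://(ftp\.ebi\.ac\.uk/)")


def ftp_to_https(text: str) -> str:
    """Replace the ftp:// scheme with https:// for ftp.ebi.ac.uk links in `text`."""
    return EBI_FTP_URL.sub(r"https://\1", text)
```

Applied line by line to an annotation file, this leaves everything else in the line unchanged, and running it twice gives the same result as running it once.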
Huge PR: changes all public URLs to the new PRIDE Archive FTP URL. It also moves from ftp to http because most users want to see the folder of the dataset.