Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending ncbi_byid #101

Closed
boopsboops opened this issue Oct 1, 2017 · 5 comments
Closed

Extending ncbi_byid #101

boopsboops opened this issue Oct 1, 2017 · 5 comments
Milestone

Comments

@boopsboops
Copy link
Contributor

boopsboops commented Oct 1, 2017

Hi @sckott et al (and @dwinter too),

I just stumbled on the function ncbi_byid after it cropped in the r-sig-phylo list.

I'm working on assembling reference libraries for eDNA metabarcoding, and being able to curate, update and review the quality of your reference libraries is really important. Therefore having this data in a table, rather than fasta format is essential.

Your function seems super quick, and gives back a lovely table. However, it's lacking a lot of the fields I would require for better filtering and quality control of reference libraries, such as lat_lon, country, specimen_voucher, publication status etc etc etc.

All this kind of info is usually in the GenBank metadata, and I've already implemented such a function to "tabulize" it (gb2df), but I think you will agree, it's an absolute abomination. I used EBI rather than NCBI as it's much faster to download large numbers of sequences to a local tempfile, and then did a lot of multithreaded XML scraping (which is very inefficient).
https://github.com/boopsboops/SeaDNA/blob/master/scripts/gb2df_example.R https://github.com/boopsboops/SeaDNA/blob/master/scripts/gb2df.R

So, if you guys think that these would be important or appropriate additions to ncbi_byid, I'll be more than happy to help (although I confess to not really getting my head around how the function works yet).

Cheers,

Rupert

@sckott
Copy link
Contributor

sckott commented Oct 2, 2017

thanks for the issue @boopsboops

There's a few things going on here, so separating them out:

  • Add more fields to the output of ncbi_byid
  • EBI vs. NCBI: do you get the same data? or are they two complementary data sources (my ignorance of this space is coming through)
  • parallel XML processing: how much of a diff. does this make? curious cause it adds additional complexity so would only use if was really necessary

also:

@dwinter is there anything else to mention here?

@boopsboops
Copy link
Contributor Author

Hi @sckott

  • EBI vs. NCBI: Yes, in theory the European Nucleotide Archive (ENA) at the EBI is a mirror of NCBI GenBank. Although sometimes I find data on NCBI that hasn't found its way onto ENA yet. I used it because it serves up data quicker than NCBI, possibly because the servers are closer to me and/or that it gets a lot less traffic.
  • Parallel XML processing: it made a pretty big difference to my inefficient code, but the difference to yours is minimal. Took about 40 sec for 500 sequences with ncbi_byid, and about 30 sec with gb2df. Clearly that extra bloat isn't helping much. In fact, by multithreading ncbi_byid I managed to download 6,500 sequences in about three minutes, which is more than acceptable.

So, I had a much better look at your code, and it seems fairly trivial to insert the things I need, and have successfully done so. What threw me was that you were using "GBSeq" XML style, which I have never seen before, and did not know was available.

Anyway, if you think it will be valuable to offer these extra fields in your function, I can post the code here or do a pull request. If not, I'm happy to just fork and use it locally for my own work.

Cheers!

@sckott
Copy link
Contributor

sckott commented Oct 4, 2017

Thanks for the follow up. A PR with your changes sounds good

@sckott
Copy link
Contributor

sckott commented Oct 6, 2017

@boopsboops does your merged PR solve all issues you had?

@boopsboops
Copy link
Contributor Author

Sorry for the delay. Yes, it did. Thanks for the help :)

@sckott sckott added this to the v0.4 milestone Oct 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants