Write script to download and parse datasets #5

Closed
tmmurali opened this issue Apr 2, 2020 · 8 comments

@tmmurali
Member

tmmurali commented Apr 2, 2020

Instead of a manual, piecemeal approach to downloading datasets, implement a script that downloads every dataset and parses it into a format that the rest of the code can use. Some of the files we download may not change over the course of a few months (e.g., tissue-specific networks), while others may change rapidly (e.g., COVID-19-specific gene sets).

Initially, we will have to update the script every time we find a new source of interesting and useful information. In the longer run, the sources will stabilise and the script will mature.

A config (YAML) file can record and document the different sources: their URLs, data types, and the directories where we will store the files.
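A minimal sketch of how such a config could drive the downloads. This is not the script from this repository: the field names (`datasets_dir`, `sources`, `subdir`, and so on), the example URLs, and the helper names are all hypothetical. The config is written as a Python dict, shaped like what a YAML loader might return, so that the sketch runs with the standard library only.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical config, shaped like what a YAML loader might return.
# Every field name and URL here is an assumption for illustration.
CONFIG = {
    "datasets_dir": "datasets",
    "sources": [
        {
            "name": "example-networks",
            "url": "https://example.org/networks.zip",
            "data_type": "network",
            "subdir": "networks/example",
        },
        {
            "name": "example-gene-sets",
            "url": "https://example.org/gene-sets.gmt",
            "data_type": "gene_set",
            "file_type": "gmt",
            "subdir": "gene-sets/example",
        },
    ],
}


def local_path(config, source):
    """Return where a source's file should live under the shared datasets dir."""
    filename = source["url"].rsplit("/", 1)[-1]
    return Path(config["datasets_dir"]) / source["subdir"] / filename


def download_all(config, force=False, fetch=urlretrieve):
    """Download every source that is missing locally (or all, if force=True)."""
    downloaded = []
    for source in config["sources"]:
        target = local_path(config, source)
        if target.exists() and not force:
            continue  # slow-changing files need not be fetched again
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(source["url"], str(target))
        downloaded.append(target)
    return downloaded
```

Skipping files that already exist suits the slow-changing sources, while `force=True` would refresh the rapidly changing ones. A per-source refresh setting in the config would be another option.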

@jlaw9
Contributor

jlaw9 commented Apr 2, 2020

I implemented a master script and accompanying config file to download and parse network files (see commit 70284f4). So far I have only tested it with the TissueNet v2 collections. I plan to add gene sets soon.

@tmmurali
Member Author

tmmurali commented Apr 2, 2020

Excellent work! Noticed that the config file has your directory name. We need to make it generic somehow.

@jlaw9
Contributor

jlaw9 commented Apr 2, 2020

Oh my bad, that was just for testing. It will be the generic datasets folder. I'll fix that.

@tmmurali
Member Author

tmmurali commented Apr 6, 2020

  1. Should we add fields in the config file that name the key columns that we should parse, e.g., gene id, drug id, etc.?
  2. Virus-host networks may contain multiple species, certainly on the virus side and maybe on the host side as well. How do the config file and the parser handle such cases?

@jlaw9
Contributor

jlaw9 commented Apr 8, 2020

In response to your questions:

  1. Yes, but those fields will depend on the file type. For gene sets, I added the `file_type` field, which currently only supports `gmt`, but I can imagine a table format where we would need to specify the columns.
  2. Currently the config file has a `species` field, but it is unused by the parser. When we have data for multiple species, I plan to have the parser use the mapping file for the specified species.
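To make the two answers above concrete, here is a rough sketch of a parser that dispatches on `file_type`. Only `gmt` is mentioned as supported in this thread. The `table` type, the `set_column` and `gene_column` fields, and the `id_mapping` argument (standing in for a per-species mapping file that has already been loaded) are hypothetical, and none of the function names come from the actual script.

```python
import csv
import io


def parse_gmt(handle):
    """Parse GMT text: set name, description, then tab-separated gene ids."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and sets that list no genes
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def parse_table(handle, set_column, gene_column, delimiter="\t"):
    """Parse a delimited table with a header row, grouping genes by set_column."""
    gene_sets = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        gene_sets.setdefault(row[set_column], set()).add(row[gene_column])
    return gene_sets


def parse_gene_sets(handle, source, id_mapping=None):
    """Dispatch on the source's file_type, then optionally map gene ids."""
    file_type = source["file_type"]
    if file_type == "gmt":
        gene_sets = parse_gmt(handle)
    elif file_type == "table":
        # Hypothetical table type: the config names the key columns.
        gene_sets = parse_table(handle, source["set_column"], source["gene_column"])
    else:
        raise ValueError(f"unsupported file_type: {file_type!r}")
    if id_mapping is not None:
        # Assumption: ids absent from the species mapping are dropped.
        gene_sets = {
            name: {id_mapping[g] for g in genes if g in id_mapping}
            for name, genes in gene_sets.items()
        }
    return gene_sets


# Usage with an in-memory GMT snippet (made-up set and gene names)
example = io.StringIO("setA\tdescription\tG1\tG2\nsetB\tdescription\tG3\n")
print(parse_gene_sets(example, {"file_type": "gmt"}))
```

Keeping the column names in the config rather than in the parser means a new tabular source would only need a config entry, as long as its layout fits one of the supported types.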

@jlaw9
Contributor

jlaw9 commented Apr 8, 2020

Added the ability to download and parse gene sets in this commit: dc0b518

jlaw9 added a commit that referenced this issue Apr 9, 2020
Currently they are treated as gene sets.
Adds to issue #5
@tmmurali
Member Author

Can we close this issue now? For every new dataset we want to add, we can open another issue.

@jlaw9
Contributor

jlaw9 commented Apr 27, 2020

Sounds good.

@jlaw9 jlaw9 closed this as completed Apr 27, 2020
n-tasnina pushed a commit that referenced this issue May 16, 2020
Currently they are treated as gene sets.
Adds to issue #5