Write script to download and parse datasets #5

tmmurali · 2020-04-02T00:52:35Z

Instead of a manual, piecemeal approach to downloading datasets, implement a script that will download every dataset, and parse it into a format that the rest of the code can use. Some of the files we download may not change over the course of a few months (e.g., tissue-specific networks) while others may change rapidly (e.g., COVID-19-specific gene sets).

Initially, we will have to update the script each time we find a new source of interesting and useful information. In the longer run, the sources will stabilise and the script will mature.

A config (YAML) file can record and document the different sources, URLs, data types, and the directories where we will store the files.
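
A rough sketch of how such a config could drive the downloads is below. It is only an illustration: the field names (`datasets_dir`, `name`, `url`, `data_type`, `max_age_days`) and the example.org URLs are assumptions, not taken from the repository, and the config is written as a Python dict standing in for what a YAML file would hold. The `max_age_days` check is one possible way to handle sources that change rarely versus sources that change quickly.

```python
import os
import time
import urllib.request

# Hypothetical config; in practice this would be loaded from a YAML file.
# All field names and URLs are illustrative assumptions.
CONFIG = {
    "datasets_dir": "datasets",
    "datasets": [
        {
            "name": "tissue-networks",
            "url": "https://example.org/tissue_networks.tsv",
            "data_type": "network",
            "max_age_days": 90,  # changes rarely
        },
        {
            "name": "covid-gene-sets",
            "url": "https://example.org/covid_gene_sets.gmt",
            "data_type": "gene_set",
            "max_age_days": 1,  # changes quickly
        },
    ],
}


def local_path(config, dataset):
    """Storage location: <datasets_dir>/<data_type>/<name>/<file name from URL>."""
    file_name = dataset["url"].rsplit("/", 1)[-1]
    return os.path.join(
        config["datasets_dir"], dataset["data_type"], dataset["name"], file_name
    )


def needs_download(path, max_age_days, now=None):
    """True if the file is missing or older than max_age_days."""
    if not os.path.isfile(path):
        return True
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / 86400
    return age_days > max_age_days


def download_all(config, fetch=urllib.request.urlretrieve):
    """Fetch every missing or stale dataset; return the paths that were fetched."""
    fetched = []
    for dataset in config["datasets"]:
        path = local_path(config, dataset)
        if needs_download(path, dataset["max_age_days"]):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            fetch(dataset["url"], path)
            fetched.append(path)
    return fetched
```

Calling `download_all(CONFIG)` a second time right after a successful run would fetch nothing, since every file is newer than its `max_age_days` limit.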

jlaw9 · 2020-04-02T18:13:48Z

I implemented a master script and an accompanying config file to download and parse network files (see commit 70284f4). So far I have only tested it on the TissueNet v2 collections. I plan to add gene sets soon.

tmmurali · 2020-04-02T19:07:27Z

Excellent work! Noticed that the config file has your directory name. We need to make it generic somehow.

jlaw9 · 2020-04-02T19:10:29Z

Oh my bad, that was just for testing. It will be the generic datasets folder. I'll fix that.
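
One possible way to keep the path generic, sketched under the assumption that the config stores the folder under a hypothetical `datasets_dir` key: resolve relative paths against the config file's own location, so nothing user-specific needs to be committed.

```python
import os


def resolve_datasets_dir(config, config_path):
    """Return an absolute datasets directory.

    `datasets_dir` is a hypothetical config key. Relative values are
    resolved against the directory that contains the config file;
    `~` and environment variables are expanded first.
    """
    raw = config.get("datasets_dir", "datasets")
    expanded = os.path.expanduser(os.path.expandvars(raw))
    if os.path.isabs(expanded):
        return os.path.normpath(expanded)
    config_dir = os.path.dirname(os.path.abspath(config_path))
    return os.path.normpath(os.path.join(config_dir, expanded))
```

With this approach the committed config can simply say `datasets`, and anyone who wants the files elsewhere can set an absolute path or an environment variable locally.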

tmmurali · 2020-04-06T03:19:19Z

1. Should we add fields to the config file that name the key columns we should parse, e.g., gene id, drug id, etc.?
2. Virus-host networks may contain multiple species, certainly on the virus side and maybe on the host side as well. How do the config file and the parser handle such cases?

jlaw9 · 2020-04-08T06:39:03Z

In response to your questions:

1. Yes, but those fields will depend on the file type. For gene sets, I added the file_type field, which currently supports only gmt, but I can imagine a table format where we would need to specify the columns.
2. The config file currently has a species field, but the parser does not use it yet. Once we have data for multiple species, I plan to have the parser use the mapping file for the specified species.
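
To make the two answers concrete, here is a minimal sketch of dispatching on `file_type`. It is not the code in the repository: the `table` file type, the `set_column` and `gene_column` keys, and the `map_ids` helper are hypothetical. The GMT parser relies only on the standard layout of that format (tab-separated set name, description, then gene IDs on each line).

```python
import csv
import io


def parse_gmt(handle):
    """Parse GMT: one set per line, tab-separated: name, description, gene IDs."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene_sets[fields[0]] = {g for g in fields[2:] if g}
    return gene_sets


def parse_table(handle, set_column, gene_column, sep="\t"):
    """Parse a table with a header row; the config names the two key columns."""
    gene_sets = {}
    for row in csv.DictReader(handle, delimiter=sep):
        gene_sets.setdefault(row[set_column], set()).add(row[gene_column])
    return gene_sets


def parse_gene_sets(handle, dataset):
    """Pick a parser from the dataset's file_type (hypothetical config entry)."""
    file_type = dataset["file_type"]
    if file_type == "gmt":
        return parse_gmt(handle)
    if file_type == "table":
        return parse_table(handle, dataset["set_column"], dataset["gene_column"])
    raise ValueError(f"unsupported file_type: {file_type}")


def map_ids(gene_sets, mapping):
    """Translate IDs with a species-specific mapping; unmapped IDs are dropped."""
    return {
        name: {mapping[g] for g in genes if g in mapping}
        for name, genes in gene_sets.items()
    }
```

A drug-target table could then be read as gene sets with an entry such as `{"file_type": "table", "set_column": "drug_id", "gene_column": "gene_id"}`, and the `species` field would select which mapping dict to pass to `map_ids`.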

jlaw9 · 2020-04-08T06:39:52Z

Added the ability to download and parse gene sets in this commit: dc0b518


tmmurali · 2020-04-26T04:19:22Z

Can we close this issue now? For every new dataset we want to add, we can open another issue.

jlaw9 · 2020-04-27T18:34:28Z

Sounds good.


tmmurali assigned jlaw9 Apr 2, 2020

tmmurali mentioned this issue Apr 2, 2020

Compute enrichment of gene sets in our predictions #6

Closed

jlaw9 added a commit that referenced this issue Apr 9, 2020

Update to download and parse drug target datasets

8f24833

Currently they are treated as gene sets. Adds to issue #5

jlaw9 closed this as completed Apr 27, 2020

n-tasnina pushed a commit that referenced this issue May 16, 2020

Update to download and parse drug target datasets

3e33620

Currently they are treated as gene sets. Adds to issue #5
