DSERV-570 Clickhouse Data Ingestion #329
base: DSERV-467-json-based-loading
c522c9e to 209d8ba
# every collection has particular edge cases
# this is needed until we have all collections loaded
import pdb
pdb.set_trace()
For now, I thought about leaving this in until we ingest all collections into ClickHouse. Every dataset has its own small edge cases, and we will have to adapt to each of them.
I can also remove it for this PR; please let me know your thoughts.
sounds good
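If the breakpoint stays in while the remaining collections are ingested, one option is to gate it so unattended runs do not stop at the prompt. This is a minimal sketch, not the code in this PR: the helper `maybe_breakpoint` and the environment variable `CLICKHOUSE_INGEST_DEBUG` are hypothetical names chosen for illustration.

```python
import os
import pdb


def maybe_breakpoint(env_var: str = "CLICKHOUSE_INGEST_DEBUG") -> bool:
    """Enter pdb only when the (hypothetical) env var is set to "1".

    Returns True if the debugger was entered, False otherwise.
    """
    if os.environ.get(env_var) == "1":
        pdb.set_trace()
        return True
    return False
```

Calling `maybe_breakpoint()` in place of the bare `pdb.set_trace()` keeps the hook available for per-collection edge cases without blocking normal ingestion runs.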
inverse_name: str
biological_process: str
accessible_via:
  name: e-qtls
Unrelated question: what is `name` in `accessible_via` for?
data/db/clickhouse.py
Outdated
'to': relationship['to']
}

for s in clickhouse_schema:
Can you explain what the code does here for `relationship`?
temp_file_path = temp_file.name
onto = get_ontology(temp_file_path).load()
We are no longer going to download each ontology file on the fly. Each OWL file will have to be passed in individually.
data/schema-config.yaml
Outdated
from: sequence variant
to: protein
from: variants
to: proteins
`to` can also be complex
data/schema-config.yaml
Outdated
@@ -281,8 +281,8 @@ transcribed to:
label_as_edge: TRANSCRIBED_TO
db_collection_name: genes_transcripts
relationship:
from: gene
to: transcript
from: genes
This can also be from `mm_genes` to `mm_transcripts`.
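One way to accommodate more than one collection on either side is to let `from` and `to` hold either a single name or a list, and normalize both to lists when the schema is read. The sketch below assumes that list form, which is not something the current `schema-config.yaml` is confirmed to support; `endpoint_collections` is a hypothetical helper.

```python
from typing import List, Tuple, Union

Endpoint = Union[str, List[str]]


def endpoint_collections(relationship: dict) -> Tuple[List[str], List[str]]:
    """Return the `from` and `to` collections of a relationship as lists.

    Assumption: each endpoint is either one collection name or a list of
    names (e.g. genes and mm_genes).
    """
    def as_list(value: Endpoint) -> List[str]:
        # A bare string is a single collection; anything else is copied to a list.
        return [value] if isinstance(value, str) else list(value)

    return as_list(relationship["from"]), as_list(relationship["to"])
```

For example, `endpoint_collections({"from": ["genes", "mm_genes"], "to": ["transcripts", "mm_transcripts"]})` yields both species' collections on each side, while a single-name entry such as `{"from": "genes", "to": "transcripts"}` still works unchanged.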
if data.get('_from') is not None and data.get('_to') is not None:
    # removing collection prefix from _from and _to. E.g. genes/ENSG00000148584 => ENSG00000148584
    processed_data.append(data['_from'].split('/')[-1])
    processed_data.append(data['_to'].split('/')[-1])
How do you figure out which collection to go to if there is more than one collection in `from` or `to`?
good catch, will look into this!
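One possible direction, sketched under assumptions: instead of discarding the prefix, split `_from` and `_to` into a collection name and a key so the collection stays available when an endpoint can point at several collections. `split_document_id` is a hypothetical helper, not part of this PR.

```python
from typing import Optional, Tuple


def split_document_id(document_id: str) -> Tuple[Optional[str], str]:
    """Split "collection/key" into (collection, key).

    The key is the part after the last "/", matching split('/')[-1].
    If there is no "/", the collection is None.
    """
    collection, sep, key = document_id.rpartition("/")
    return (collection if sep else None), key
```

With this, `split_document_id("genes/ENSG00000148584")` gives `("genes", "ENSG00000148584")`, so the loader could store the key as before and use the collection name to pick the target table.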
Please see my comments. Thank you!
Some thoughts:
17e6d6e to 3d92bf6
DSERV-541 Adding CA Ids to Variants
No description provided.