DSERV-570 Clickhouse Data Ingestion #329

Open
wants to merge 19 commits into base: DSERV-467-json-based-loading
Conversation

@pedrohr (Collaborator) commented Sep 27, 2024

No description provided.

@pedrohr changed the base branch from dev to DSERV-467-json-based-loading on October 9, 2024, 05:24
# every collection has particular edge cases
# this is needed until we have all collections loaded
import pdb
pdb.set_trace()
pedrohr (Collaborator, author)

For now, I thought about leaving this in until we ingest all collections into ClickHouse. Every dataset has its own small edge cases, and we will have to adapt to each one of them.

I can also remove this for the PR; please let me know your thoughts.

Collaborator

sounds good
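
If the breakpoint does have to stay in the tree until all collections are loaded, a minimal sketch of gating it behind an environment variable so routine runs never hit the debugger (DEBUG_INGEST is a hypothetical name, not something in this PR):

import os
import pdb

# Hypothetical opt-in flag: only break when explicitly requested,
# e.g. DEBUG_INGEST=1 python ingest.py
if os.environ.get('DEBUG_INGEST'):
    pdb.set_trace()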

inverse_name: str
biological_process: str
accessible_via:
  name: e-qtls
Collaborator

Unrelated question: what is name in accessible_via for?

'to': relationship['to']
}

for s in clickhouse_schema:
Collaborator

Can you explain what the code does here for relationship?


temp_file_path = temp_file.name
onto = get_ontology(temp_file_path).load()

pedrohr (Collaborator, author)

We are no longer going to download each ontology file on the fly; each OWL file will have to be passed in individually.
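
A minimal sketch of what loading a pre-downloaded OWL file could look like, mirroring the get_ontology(...).load() call above; the file path is a hypothetical example:

from owlready2 import get_ontology

# Load an ontology from a local .owl file supplied up front,
# instead of downloading it on the fly during ingestion.
onto = get_ontology('data/ontologies/go.owl').load()
print(len(list(onto.classes())))  # sanity check: number of classes parsed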

from: sequence variant
to: protein
from: variants
to: proteins
Collaborator

to can also be complex.

@@ -281,8 +281,8 @@ transcribed to:
label_as_edge: TRANSCRIBED_TO
db_collection_name: genes_transcripts
relationship:
from: gene
to: transcript
from: genes
Collaborator

This can also go from mm_genes to mm_transcripts.
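
Both of the last two comments point at from/to values that name more than one collection (a complex to, or mm_genes/mm_transcripts alongside genes/transcripts). A minimal sketch of normalizing such values before schema generation; the list-valued YAML shape is an assumption, not necessarily the repo's current format:

def normalize_collections(value):
    # The config may give a single collection name ('genes') or
    # several (['genes', 'mm_genes']); always return a list.
    if isinstance(value, str):
        return [value]
    return list(value)

relationship = {'from': ['genes', 'mm_genes'], 'to': ['transcripts', 'mm_transcripts']}
from_collections = normalize_collections(relationship['from'])
to_collections = normalize_collections(relationship['to'])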

if data.get('_from') is not None and data.get('_to') is not None:
# removing collection prefix from _from and _to. E.g. genes/ENSG00000148584 => ENSG00000148584
processed_data.append(data['_from'].split('/')[-1])
processed_data.append(data['_to'].split('/')[-1])
Collaborator

How do you figure out which collection a row belongs to if there is more than one collection in from or to?

pedrohr (Collaborator, author)

good catch, will look into this!
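One possible direction, since the collection name is exactly the prefix being stripped: keep it as its own value instead of discarding it, so rows pointing at different collections stay distinguishable. A minimal sketch; the surrounding processed_data layout is hypothetical:

def split_ref(ref):
    # 'genes/ENSG00000148584' -> ('genes', 'ENSG00000148584')
    collection, _, key = ref.partition('/')
    return collection, key

from_collection, from_key = split_ref(data['_from'])
to_collection, to_key = split_ref(data['_to'])
# Keep the collection next to the key instead of throwing it away.
processed_data.extend([from_collection, from_key, to_collection, to_key])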

@mingjiecn (Collaborator) left a comment

Please see my comments. Thank you!

@ottojolanki (Contributor) commented Oct 25, 2024

Some thoughts:

  • It would probably make sense to put the most specific types (ClickHouse) into the schema abstraction config YAML and transform them to less specific types when generating the schema for ArangoDB.
  • How difficult would it be to make the schema treatment similar between Arango and ClickHouse? What I mean by this is that both would be generated from the schema YAML using scripts and then stored in the repo as JSON/SQL.
  • Schema parsing/generation should be separated from data parsing/loading/ingestion for both databases.
  • We should strive to use bulk S3 ingestion for ClickHouse (see this blog series for details about what I'm talking about: https://clickhouse.com/blog/getting-data-into-clickhouse-part-3-s3); a sketch follows after this list.
  • We need to add a way to define primary key columns.
  • Unfortunately, in some cases we may need to "manually" adjust ClickHouse schemas (for example, if we need transformations like removing a collection_name/ prefix from a value, or pulling a nested value such as a frequency out of the annotations for variants), or we need to adjust the parsing so the source JSONL contains these ClickHouse-specific values.
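
A minimal sketch of the bulk S3 ingestion and primary-key points above, using the clickhouse-connect client; the table name, columns, and bucket path are all hypothetical:

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# The primary key comes from ORDER BY on a MergeTree table.
client.command("""
    CREATE TABLE IF NOT EXISTS genes_transcripts (
        from_key String,
        to_key String,
        chr String
    ) ENGINE = MergeTree ORDER BY (from_key, to_key)
""")

# Bulk-load JSONL straight from S3 via the s3() table function,
# instead of inserting row by row from the ingestion script.
client.command("""
    INSERT INTO genes_transcripts
    SELECT _from AS from_key, _to AS to_key, chr
    FROM s3('https://bucket.s3.amazonaws.com/genes_transcripts/*.jsonl', 'JSONEachRow')
""")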
