DSERV-570 Clickhouse Data Ingestion #329

Open
wants to merge 19 commits into base: DSERV-467-json-based-loading
Conversation

@pedrohr (Collaborator) commented Sep 27, 2024

No description provided.

@pedrohr changed the base branch from dev to DSERV-467-json-based-loading on October 9, 2024, 05:24
# every collection has particular edge cases
# this is needed until we have all collections loaded
import pdb
pdb.set_trace()
pedrohr (Collaborator, author)

For now, I thought about leaving this in until we ingest all collections into ClickHouse. Every dataset has its own small edge cases, and we will have to adapt to each one of them.

I can also remove this for the PR; please let me know your thoughts.

Collaborator

sounds good
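
If the breakpoint does have to stay in the tree until all collections are loaded, a minimal sketch of gating it behind an environment variable so routine runs never hit the debugger (DEBUG_INGEST is a hypothetical name, not something in this PR):

import os
import pdb

# Hypothetical opt-in flag: only break when explicitly requested,
# e.g. DEBUG_INGEST=1 python ingest.py
if os.environ.get('DEBUG_INGEST'):
    pdb.set_trace()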

inverse_name: str
biological_process: str
accessible_via:
  name: e-qtls
Collaborator

Unrelated question: what is name in accessible_via for?

'to': relationship['to']
}

for s in clickhouse_schema:
Collaborator

Can you explain what the code does here for relationship?


temp_file_path = temp_file.name
onto = get_ontology(temp_file_path).load()

pedrohr (Collaborator, author)

We are no longer going to download each ontology file on the fly; each OWL file will have to be passed in individually.
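
A minimal sketch of what loading a pre-downloaded OWL file could look like, mirroring the get_ontology(...).load() call above; the file path is a hypothetical example:

from owlready2 import get_ontology

# Load an ontology from a local .owl file supplied up front,
# instead of downloading it on the fly during ingestion.
onto = get_ontology('data/ontologies/go.owl').load()
print(len(list(onto.classes())))  # sanity check: number of classes parsed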

from: sequence variant
to: protein
from: variants
to: proteins
Collaborator

to can also be complex.

@@ -281,8 +281,8 @@ transcribed to:
label_as_edge: TRANSCRIBED_TO
db_collection_name: genes_transcripts
relationship:
from: gene
to: transcript
from: genes
Collaborator

This can also go from mm_genes to mm_transcripts.
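
Both of the last two comments point at from/to values that name more than one collection (a complex to, or mm_genes/mm_transcripts alongside genes/transcripts). A minimal sketch of normalizing such values before schema generation; the list-valued YAML shape is an assumption, not necessarily the repo's current format:

def normalize_collections(value):
    # The config may give a single collection name ('genes') or
    # several (['genes', 'mm_genes']); always return a list.
    if isinstance(value, str):
        return [value]
    return list(value)

relationship = {'from': ['genes', 'mm_genes'], 'to': ['transcripts', 'mm_transcripts']}
from_collections = normalize_collections(relationship['from'])
to_collections = normalize_collections(relationship['to'])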

if data.get('_from') is not None and data.get('_to') is not None:
# removing collection prefix from _from and _to. E.g. genes/ENSG00000148584 => ENSG00000148584
processed_data.append(data['_from'].split('/')[-1])
processed_data.append(data['_to'].split('/')[-1])
Collaborator

How do you figure out which collection a row belongs to if there is more than one collection in from or to?

pedrohr (Collaborator, author)

good catch, will look into this!
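One possible direction, since the collection name is exactly the prefix being stripped: keep it as its own value instead of discarding it, so rows pointing at different collections stay distinguishable. A minimal sketch; the surrounding processed_data layout is hypothetical:

def split_ref(ref):
    # 'genes/ENSG00000148584' -> ('genes', 'ENSG00000148584')
    collection, _, key = ref.partition('/')
    return collection, key

from_collection, from_key = split_ref(data['_from'])
to_collection, to_key = split_ref(data['_to'])
# Keep the collection next to the key instead of throwing it away.
processed_data.extend([from_collection, from_key, to_collection, to_key])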

@mingjiecn (Collaborator) left a comment

Please see my comments. Thank you!

@ottojolanki (Contributor) commented Oct 25, 2024

Some thoughts:

  • It would probably make sense to put the most specific types (ClickHouse) into the schema abstraction config YAML and transform them to less specific types when generating the schema for ArangoDB.
  • How difficult would it be to make the schema treatment similar between Arango and ClickHouse? What I mean by this is that both would be generated from the schema YAML using scripts and then stored in the repo as JSON/SQL.
  • Schema parsing/generation should be separated from data parsing/loading/ingestion for both databases.
  • We should strive to use bulk S3 ingestion for ClickHouse (see this blog series for details about what I'm talking about: https://clickhouse.com/blog/getting-data-into-clickhouse-part-3-s3); a sketch follows after this list.
  • We need to add a way to define primary key columns.
  • Unfortunately, in some cases we may need to "manually" adjust ClickHouse schemas (for example, if we need transformations like removing a collection_name/ prefix from a value, or pulling a nested value such as a frequency out of the annotations for variants), or we need to adjust the parsing so the source JSONL contains these ClickHouse-specific values.
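
A minimal sketch of the bulk S3 ingestion and primary-key points above, using the clickhouse-connect client; the table name, columns, and bucket path are all hypothetical:

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# The primary key comes from ORDER BY on a MergeTree table.
client.command("""
    CREATE TABLE IF NOT EXISTS genes_transcripts (
        from_key String,
        to_key String,
        chr String
    ) ENGINE = MergeTree ORDER BY (from_key, to_key)
""")

# Bulk-load JSONL straight from S3 via the s3() table function,
# instead of inserting row by row from the ingestion script.
client.command("""
    INSERT INTO genes_transcripts
    SELECT _from AS from_key, _to AS to_key, chr
    FROM s3('https://bucket.s3.amazonaws.com/genes_transcripts/*.jsonl', 'JSONEachRow')
""")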
