-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(openchallenges): add EDAM Extract and Transform Processes #2564
Conversation
@tschaffter Quality Gate doesn't seem to be performing checks for this PR. |
@tschaffter This is ready for review. Version is now an environment variable and the Description includes a preview. |
@mdsage1 thanks for working on this!! You didn't ask me to, but I added some comments to the PR. Feel free to use them or ignore 😄 |
Quality Gate passed for 'openchallenges-edam-etl'Issues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdsage1 Why do you show the std, mean and other metrics for the id column?
Update: here are the information that the script should print
- Version of EDAM processed
- Number of concepts that will be added to the table
- Total number of concepts
- Number of concepts for the following category
- Data concepts
- Operation concepts
- Format concepts
- Operation concepts
- Other concepts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The config parameters should be validated before using them, otherwise the following behavior will occur:
$ nx serve-detach openchallenges-edam-etl
> nx run openchallenges-edam-etl:serve-detach
Container openchallenges-mariadb Recreate
Container openchallenges-mariadb Recreated
Container openchallenges-edam-etl Recreate
Container openchallenges-edam-etl Recreated
Container openchallenges-mariadb Starting
Container openchallenges-mariadb Started
Container openchallenges-mariadb Waiting
Container openchallenges-mariadb Healthy
Container openchallenges-edam-etl Starting
Container openchallenges-edam-etl Started
————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
> NX Successfully ran target serve-detach for project openchallenges-edam-etl (33s)
View logs and investigate cache misses at https://cloud.nx.app/runs/Ivfcx2Zd35
vscode@dee30b82cf44:/workspaces/sage-monorepo$ docker logs openchallenges-edam-etl
EDAM Version: None
OC DB URL: jdbc:mysql://openchallenges-mariadb:3306/challenge_service
Downloading the EDAM concepts from GitHub (CSV file)...
Error downloading EDAM concepts: 404 Client Error: Not Found for url: https://github.com/edamontology/edamontology/raw/main/releases/EDAM_None.csv
Processing the EDAM concepts...
File EDAM_None.csv not found.
No data available.
closed in error |
See the suggestion I made above. |
@mdsage1 alternatively, you can also do a EDIT: Since you're interested in the number of concepts per category, you can actually use pandas' >>> df["class_id"].str.contains("data")
0 True
1 True
2 True
3 True
4 True
...
3468 False
3469 False
3470 False
3471 False
3472 False |
Prefer exact match to using |
Good point. Just shooting my shot here, but this can be overcome by using |
@tschaffter I've updated the concept counts to use the class_id column and regex. The case has been ignored to avoid any future issues. I didn't use contains but used search() function from the regex module. I have prevented future issues with data, and any other concept name, listing as a match when there is an additional word following the word of interest by adding the underscore to the regex as @vpchung suggested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one final suggestion, but otherwise, the script looks good on my end!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tested the link and updated it
return None | ||
|
||
|
||
def count_occurrences(identifier_pattern: str, df) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def count_occurrences(identifier_pattern: str, df) -> int: | |
def count_occurrences(identifier_pattern: str, df: pd.DataFrame) -> np.int64: |
^^ IIRC. May need to double-check whether it is a numpy type being returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numpy int is being returned and made changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool cool. May need to add import numpy as np
in this case!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! I need to submit and patch for this project in #2594 and will fix this issue at the same time.
Description
EDAM ETL processes need to be developed to incorporate ETAM ontology in the Maria DB linking the ontology to existing data. This PR will address the extract and transform portion.
Related Issue
Contribute to #2524
Contribute to #2548
Fixes #2547
Fixes #2563
Changelog
Preview