feat(openchallenges): add Mariadb Connection and Load the EDAM Concepts #2898

Draft
wants to merge 10 commits into
base: main

mdsage1
Contributor

@mdsage1 mdsage1 commented Oct 21, 2024

Description

This is the most recent branch for this issue.

EDAM ETL processes need to be developed to incorporate the EDAM ontology into MariaDB, linking the ontology to existing data. This PR addresses the load portion.

Related Issue

Contribute to #2524
Contribute to #2548

Replaces #2680

Changelog

  1. Create connection to MariaDB using Python
  2. Load the data in the Pandas DataFrame that matches the content of this file
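
The two changelog steps could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: it assumes SQLAlchemy with the `mariadb+mariadbconnector` dialect and pandas' `DataFrame.to_sql`. The helper names `build_db_url` and `load_concepts`, the credentials, the host, and the port are hypothetical; the `edam` database and `edam_etl` table names come from the preview output in this description.

```python
from urllib.parse import quote_plus


def build_db_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a SQLAlchemy URL for MariaDB Connector/Python (hypothetical helper)."""
    # Percent-encode the credentials so characters such as '@' or ':' do not break the URL.
    return (
        f"mariadb+mariadbconnector://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def load_concepts(df, db_url: str, table: str = "edam_etl") -> None:
    """Write the concepts DataFrame to MariaDB, replacing the table if it exists."""
    # Imported lazily so the URL helper works without a database driver installed.
    from sqlalchemy import create_engine

    engine = create_engine(db_url)
    # The DataFrame is expected to hold the columns id, class_id, preferred_label.
    df.to_sql(table, con=engine, if_exists="replace", index=False)


if __name__ == "__main__":
    # Placeholder credentials and host, for illustration only.
    print(build_db_url("edam_user", "p@ss:word", "localhost", 3306, "edam"))
```

Using `if_exists="replace"` keeps repeated runs idempotent at the cost of dropping the table each time; `"append"` would be the alternative if the table must be preserved.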

Preview

This is the output as the project is run using nx serve openchallenges-edam-etl:

vscode@868062b6019c:/workspaces/sage-monorepo$ nx serve openchallenges-edam-etl

nx run openchallenges-edam-etl:serve

poetry run python src/main.py

EDAM Version: 1.25
OC DB URL: None
Downloading the EDAM concepts from GitHub (CSV file)...
EDAM concepts downloaded successfully.
Processing the EDAM concepts...
EDAM concepts processed successfully.
Number of Concepts Transformed: 3473
Column names: ['id', 'class_id', 'preferred_label']
Concept Counts:
Data: 1493
Operation: 802
Format: 728
Topic: 728
Identifier: 728
Other: 2
Establishing a connection to the MariaDB Platform.
Connection has been established to MariaDB Platform!
The table edam_etl has been added to the edam database!
The table edam_etl has been populated with the EDAM concepts!

———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

NX Successfully ran target serve for project openchallenges-edam-etl
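
The per-category counts in the output above add up to 4481 (1493 + 802 + 3 × 728 + 2) rather than the 3473 transformed concepts, and three categories report the same value. That may mean the categories overlap, or it may point to a counting issue. A quick sanity check could look like the sketch below. It assumes that `class_id` values look like `http://edamontology.org/data_0006` and that the category can be read from the prefix before the underscore; the mapping and function names are hypothetical, and the Identifier category is left out because this thread does not show how identifiers are detected.

```python
from collections import Counter

# Assumed mapping from class_id prefix to category name (hypothetical).
PREFIX_TO_CATEGORY = {
    "data": "Data",
    "operation": "Operation",
    "format": "Format",
    "topic": "Topic",
}


def categorize(class_id: str) -> str:
    """Return the category for a class_id, or 'Other' if the prefix is unknown."""
    local_name = class_id.rsplit("/", 1)[-1]  # e.g. "data_0006"
    prefix = local_name.split("_", 1)[0]      # e.g. "data"
    return PREFIX_TO_CATEGORY.get(prefix, "Other")


def count_concepts(class_ids) -> Counter:
    """Count concepts per category and verify the counts sum to the total."""
    class_ids = list(class_ids)
    counts = Counter(categorize(class_id) for class_id in class_ids)
    # Each concept lands in exactly one category, so the sum must match the total.
    assert sum(counts.values()) == len(class_ids)
    return counts
```

With a DataFrame, this could be called as `count_concepts(df["class_id"])`.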

nx prepare openchallenges-edam-etl was also tested and the resulting output is:

vscode@868062b6019c:/workspaces/sage-monorepo$ nx prepare openchallenges-edam-etl

nx run openchallenges-edam-etl:prepare

./prepare-python.sh

Using virtualenv: /workspaces/sage-monorepo/apps/openchallenges/edam-etl/.venv
Installing dependencies from lock file

No dependencies to install or update

Installing the current project: openchallenges-edam-etl (0.1.0)
Hit:1 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:2 https://apt.releases.hashicorp.com jammy InRelease
Hit:3 https://cli.github.com/packages stable InRelease
Hit:4 https://deb.nodesource.com/node_20.x nodistro InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 https://ngrok-agent.s3.amazonaws.com bullseye InRelease
Hit:9 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists... Done
W: https://apt.releases.hashicorp.com/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libmariadb-dev is already the newest version (1:10.6.18-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 21 not upgraded.
Requirement already satisfied: packaging in /home/vscode/.pyenv/versions/3.12.0/lib/python3.12/site-packages (24.1)

[notice] A new release of pip is available: 23.2.1 -> 24.2
[notice] To update, run: pip install --upgrade pip

———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

NX Successfully ran target prepare for project openchallenges-edam-etl (5s)

nx serve-detach openchallenges-edam-etl was used to run the project using Docker and the resulting output is:

vscode@868062b6019c:/workspaces/sage-monorepo$ nx serve-detach openchallenges-edam-etl

nx run openchallenges-edam-etl:serve-detach

docker/openchallenges/serve-detach.sh openchallenges-edam-etl

[+] Running 2/2
✔ Container openchallenges-mariadb Healthy 0.5s
✔ Container openchallenges-edam-etl Started 1.1s

———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

NX Successfully ran target serve-detach for project openchallenges-edam-etl (1s)

@mdsage1 mdsage1 changed the title feat(openchallenges): add Mariadb Connection and Load the EDAM Concepts (#2897) feat(openchallenges): add Mariadb Connection and Load the EDAM Concepts Oct 21, 2024
@mdsage1 mdsage1 self-assigned this Oct 21, 2024
@mdsage1 mdsage1 marked this pull request as ready for review October 21, 2024 21:39
@mdsage1 mdsage1 requested a review from rrchai as a code owner October 21, 2024 21:39
@tschaffter
Member

From the first comment:

Fixes #2548

However this PR only aims to connect to the DB, not to load the data.

Member

@tschaffter tschaffter left a comment

There are Python dependencies missing from the project (see below). Try to remove the folder .venv in the project folder and run nx prepare openchallenges-edam-etl to use a fresh virtual environment.

Serve

$ nx serve openchallenges-edam-etl

> nx run openchallenges-edam-etl:serve

> poetry run python src/main.py

Traceback (most recent call last):
  File "/workspaces/sage-monorepo/apps/openchallenges/edam-etl/src/main.py", line 6, in <module>
    import mariadb
ModuleNotFoundError: No module named 'mariadb'
Warning: command "poetry run python src/main.py" exited with non-zero status code

Serve-detach

$ nx run-many --target=build-image --projects=openchallenges-mariadb,openchallenges-edam-etl
$ nx serve-detach openchallenges-edam-etl
$ docker logs openchallenges-edam-etl
Traceback (most recent call last):
  File "/opt/app/src/main.py", line 6, in <module>
    import mariadb
ModuleNotFoundError: No module named 'mariadb'
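
A small check like the following could confirm whether the virtual environment actually contains the required packages before `src/main.py` runs. This is a sketch, not part of the project: the function name and the module list are assumptions, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The module list is an assumption; adjust it to the project's real imports.
    missing = missing_modules(["mariadb", "sqlalchemy", "pandas"])
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Running it through `poetry run python` would check the same interpreter and virtual environment that the serve target uses.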


sonarcloud bot commented Oct 22, 2024

@mdsage1
Contributor Author

mdsage1 commented Oct 23, 2024

There are Python dependencies missing from the project (see below). Try to remove the folder .venv in the project folder and run nx prepare openchallenges-edam-etl to use a fresh virtual environment.

Serve

$ nx serve openchallenges-edam-etl

> nx run openchallenges-edam-etl:serve

> poetry run python src/main.py

Traceback (most recent call last):
  File "/workspaces/sage-monorepo/apps/openchallenges/edam-etl/src/main.py", line 6, in <module>
    import mariadb
ModuleNotFoundError: No module named 'mariadb'
Warning: command "poetry run python src/main.py" exited with non-zero status code

Serve-detach

$ nx run-many --target=build-image --projects=openchallenges-mariadb,openchallenges-edam-etl
$ nx serve-detach openchallenges-edam-etl
$ docker logs openchallenges-edam-etl
Traceback (most recent call last):
  File "/opt/app/src/main.py", line 6, in <module>
    import mariadb
ModuleNotFoundError: No module named 'mariadb'

I found a fix for this ModuleNotFoundError: I updated the command for the nx serve target in the edam-etl project.json so that it runs several commands that ensure mariadb and sqlalchemy are installed (the missing sqlalchemy module surfaced after I corrected the mariadb issue). I've also removed the unused "import mysql.connector" from the ./src/main.py file.

I tested these changes by running nx env remove --all followed by nx prepare openchallenges-edam-etl, then executing poetry add mariadb and poetry add sqlalchemy. Once the poetry.lock and poetry.toml files were updated, the nx serve openchallenges-edam-etl command shows:

nx serve openchallenges-edam-etl --verbose

nx run openchallenges-edam-etl:serve

sudo apt-get update && sudo apt-get install --no-install-recommends -qq -y gosu libmariadb-dev gcc && pip install mariadb && poetry run python src/main.py

Hit:1 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://apt.releases.hashicorp.com jammy InRelease
Hit:4 https://deb.nodesource.com/node_20.x nodistro InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 https://ngrok-agent.s3.amazonaws.com bullseye InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:9 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists... Done
W: https://apt.releases.hashicorp.com/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: mariadb in /home/vscode/.local/lib/python3.10/site-packages (1.1.10)
Requirement already satisfied: packaging in /home/vscode/.local/lib/python3.10/site-packages (from mariadb) (24.1)
EDAM Version: 1.25
OC DB URL: None
Downloading the EDAM concepts from GitHub (CSV file)...
EDAM concepts downloaded successfully.
Processing the EDAM concepts...
EDAM concepts processed successfully.
Number of Concepts Transformed: 3473
Column names: ['id', 'class_id', 'preferred_label']
Concept Counts:
Data: 1493
Operation: 802
Format: 728
Topic: 728
Identifier: 728
Other: 2
Establishing a connection to the MariaDB Platform.
Connection has been established to MariaDB Platform!
The table edam_etl has been added to the edam database!
The table edam_etl has been populated with the EDAM concepts!

———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

NX Successfully ran target serve for project openchallenges-edam-etl

The CI PR check then failed for the same reason when building the dev container by running nx prepare openchallenges-edam-etl, so I added the same commands to the nx prepare target in project.json.

The command nx run-many --target=build-image --projects=openchallenges-mariadb,openchallenges-edam-etl results in:

✔ nx run openchallenges-mariadb:build-image:local (2s)
✔ nx run openchallenges-edam-etl:build-image:local (44s)

————————————————————————————————————————————————————————————————————————————————————————————————————————————————

NX Successfully ran target build-image for 2 projects (44s)

The output of nx serve-detach was:

nx serve-detach openchallenges-edam-etl

nx run openchallenges-edam-etl:serve-detach

docker/openchallenges/serve-detach.sh openchallenges-edam-etl

[+] Running 2/2
✔ Container openchallenges-edam-etl Started 32.7s
✔ Container openchallenges-mariadb Healthy 32.1s

————————————————————————————————————————————————————————————————————————————————————————————————————————————————

NX Successfully ran target serve-detach for project openchallenges-edam-etl (33s)

docker logs openchallenges-edam-etl now reports that the CSV of the EDAM concepts can't be accessed, so I'll continue to investigate what's happening with the Docker container build.

@tschaffter
Member

nx env remove --all

What is this command? It's not listed under nx --help

@mdsage1
Contributor Author

mdsage1 commented Oct 24, 2024

nx env remove --all

What is this command? It's not listed under nx --help

@tschaffter My apologies. That command was my mistake in typing. It should be poetry env remove --all and here's a slack forum where I learned about the command.

@mdsage1
Contributor Author

mdsage1 commented Oct 24, 2024

@tschaffter Would you be able to provide some insight into what's going on behind the docker logs openchallenges-edam-etl command? I'm unsure how docker logs would have anything to show after I run nx serve-detach openchallenges-edam-etl, since to my understanding the container isn't actively running. Is that the case? When I run docker logs openchallenges-edam-etl it prints the following, although no issue shows up on GitHub:

$ docker logs openchallenges-edam-etl
EDAM Version: 1.25
OC DB URL: None
Downloading the EDAM concepts from GitHub (CSV file)...
Traceback (most recent call last):
  File "/opt/app/src/main.py", line 171, in <module>
    main()
  File "/opt/app/src/main.py", line 164, in main
    if download_edam_csv(url, VERSION):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app/src/main.py", line 84, in download_edam_csv
    with open(f"EDAM_{version}.csv", "wb") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: 'EDAM_1.25.csv'
EDAM Version: 1.25
OC DB URL: None
Downloading the EDAM concepts from GitHub (CSV file)...
Traceback (most recent call last):
  File "/opt/app/src/main.py", line 171, in <module>
    main()
  File "/opt/app/src/main.py", line 164, in main
    if download_edam_csv(url, VERSION):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app/src/main.py", line 84, in download_edam_csv
    with open(f"EDAM_{version}.csv", "wb") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: 'EDAM_1.25.csv'
EDAM Version: 1.25
OC DB URL: None
Downloading the EDAM concepts from GitHub (CSV file)...
Traceback (most recent call last):
  File "/opt/app/src/main.py", line 171, in <module>
    main()
  File "/opt/app/src/main.py", line 164, in main
    if download_edam_csv(url, VERSION):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app/src/main.py", line 84, in download_edam_csv
    with open(f"EDAM_{version}.csv", "wb") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: 'EDAM_1.25.csv'
EDAM Version: 1.25
OC DB URL: None
Downloading the EDAM concepts from GitHub (CSV file)...
Traceback (most recent call last):
  File "/opt/app/src/main.py", line 171, in <module>
    main()
  File "/opt/app/src/main.py", line 164, in main
    if download_edam_csv(url, VERSION):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app/src/main.py", line 84, in download_edam_csv
    with open(f"EDAM_{version}.csv", "wb") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: 'EDAM_1.25.csv'

The EDAM_1.25.csv should be created within the container as it runs but I'd think that if it wasn't actively running then it wouldn't exist anymore. Please let me know if I'm misunderstanding.
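
The traceback shows `open(f"EDAM_{version}.csv", "wb")` failing with Errno 13, which typically means the process's current working directory is not writable by the user the container runs as (an assumption here, since the Dockerfile is not shown in this thread). One possible workaround is to write the file to a directory that is known to be writable. The sketch below uses only the standard library; the function names and the `EDAM_ETL_DATA_DIR` environment variable are hypothetical and not defined by the project.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def writable_output_path(version: str, base_dir: Optional[str] = None) -> Path:
    """Pick a location for EDAM_<version>.csv that the current user can write to."""
    # Try an explicit directory, then a (hypothetical) environment variable,
    # then the working directory, and finally the system temp directory.
    candidates = [
        base_dir,
        os.environ.get("EDAM_ETL_DATA_DIR"),
        os.getcwd(),
        tempfile.gettempdir(),
    ]
    for candidate in candidates:
        if candidate and os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            return Path(candidate) / f"EDAM_{version}.csv"
    raise PermissionError("No writable directory found for the EDAM CSV file")


def save_csv(content: bytes, version: str, base_dir: Optional[str] = None) -> Path:
    """Write the downloaded CSV bytes to a writable location and return its path."""
    path = writable_output_path(version, base_dir)
    with open(path, "wb") as f:
        f.write(content)
    return path
```

The caller would then read the CSV from the returned path instead of assuming it sits in the working directory. Alternatively, the image could make the application directory writable for the runtime user.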
