kiyoon committed Aug 1, 2024
1 parent 0400b2d commit 3f20900
Showing 8 changed files with 98 additions and 6 deletions.
38 changes: 37 additions & 1 deletion README.md
@@ -1,5 +1,6 @@
# bio-data-to-db: make Uniprot PostgreSQL database


[![image](https://img.shields.io/pypi/v/bio-data-to-db.svg)](https://pypi.python.org/pypi/bio-data-to-db)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/bio-data-to-db)](https://pypi.python.org/pypi/bio-data-to-db)
[![image](https://img.shields.io/pypi/l/bio-data-to-db.svg)](https://pypi.python.org/pypi/bio-data-to-db)
@@ -19,6 +20,8 @@ Written in Rust, thus equipped with extremely fast parsers. Packaged for python,

So far, there is only one function implemented: **convert uniprot data to postgresql**. This package focuses more on parsing the data and inserting it into the database, rather than curating the data.

[📚 Documentation](https://deargen.github.io/bio-data-to-db/)
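
Based on the type stub touched later in this commit (`src/bio_data_to_db/bio_data_to_db.pyi`), a minimal usage sketch could look like the following. The XML path and connection URI are placeholders, and the function may also be re-exported under a shorter import path (see the docs):

```python
from bio_data_to_db.bio_data_to_db import uniprot_xml_to_postgresql

# Placeholders: point these at your own UniProt XML dump and PostgreSQL server.
uniprot_xml_to_postgresql(
    uniprot_xml_path="uniprot_sprot.xml",
    uri="postgresql://username:password@localhost:5432/uniprot",
)
```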

## 🛠️ Installation

```bash
@@ -77,6 +80,35 @@ from bio_data_to_db.bindingdb.fix_tables import fix_assay_table
fix_assay_table("mysql://username:password@localhost/bind")
```
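
The docstring change further down in this commit notes that this decodes HTML entities (e.g. `&#39;`) and strips stray whitespace. A rough standalone sketch of that kind of cleanup, illustrative only and not the package's actual implementation, could be:

```python
import html


def clean_text(value: str) -> str:
    # Decode HTML entities (e.g. "&#39;" -> "'") and strip surrounding whitespace.
    return html.unescape(value).strip()


print(clean_text("  BindingDB&#39;s assay description "))  # -> "BindingDB's assay description"
```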

### PostgreSQL Helpers, SMILES, Polars utils and more

Some useful functions for working with PostgreSQL, SMILES, and Polars.

```python
from bio_data_to_db.utils.postgresql import (
    create_db_if_not_exists,
    create_schema_if_not_exists,
    set_column_as_primary_key,
    make_columns_unique,
    make_large_columns_unique,
    split_column_str_to_list,
    polars_write_database,
)

from bio_data_to_db.utils.smiles import (
    canonical_smiles_wo_salt,
    polars_canonical_smiles_wo_salt,
)

from bio_data_to_db.utils.polars import (
    w_pbar,
)
```

You can find the usage in the [📚 documentation](https://deargen.github.io/bio-data-to-db/).
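
For instance, the PostgreSQL helpers (whose signatures appear in the `src/bio_data_to_db/utils/postgresql.py` part of this diff) can be used roughly like this; the server URI, database name, and schema name are placeholders:

```python
from bio_data_to_db.utils.postgresql import (
    create_db_if_not_exists,
    create_schema_if_not_exists,
)

# Placeholders: replace with your own PostgreSQL server URI and names.
create_db_if_not_exists("postgresql://username:password@localhost:5432", "my_db")
create_schema_if_not_exists(
    "postgresql://username:password@localhost:5432/my_db", "my_schema"
)
```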


## 👨‍💻️ Maintenance Notes

### Install from source
@@ -88,10 +120,14 @@ bash scripts/install.sh
uv pip install -r deps/requirements_dev.in
```

### Compile requirements (generate lockfiles)
### Generate lockfiles

Use GitHub Actions: `apply-pip-compile.yml`. Manually launch the workflow and it will make a commit with the updated lockfiles.

### Publish a new version to PyPI

Use GitHub Actions: `deploy.yml`. Manually launch the workflow and it will compile on all architectures and publish the new version to PyPI.
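
If you prefer the command line over the Actions tab, both workflows can also be dispatched with the GitHub CLI, assuming `gh` is installed and authenticated for this repository:

```bash
gh workflow run apply-pip-compile.yml  # regenerate the lockfiles
gh workflow run deploy.yml             # build wheels and publish to PyPI
```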

### About sqlx

Sqlx offline mode should be configured so you can compile the code without a database present.
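
A sketch of the usual sqlx offline workflow, assuming `sqlx-cli` is installed and a database is reachable once for the prepare step (in this repo the crate lives under `rust/`):

```bash
cargo install sqlx-cli
cd rust
# Cache query metadata so later builds don't need a live database.
DATABASE_URL=postgresql://username:password@localhost:5432/uniprot cargo sqlx prepare
# Compile using the cached metadata only.
SQLX_OFFLINE=true cargo build --release
```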
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -66,6 +66,8 @@ plugins:
          options:
            show_symbol_type_heading: true
            show_symbol_type_toc: true
            members_order: source
            allow_inspection: false # for .pyi stubs to work
          paths: [src] # search packages in the src folder

extra:
2 changes: 1 addition & 1 deletion rust/Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion scripts/gen_ref_nav.py
@@ -5,7 +5,7 @@
import mkdocs_gen_files

IGNORE_MODULES_EXACT = {
    "bio_data_to_db.__init__",
    # "bio_data_to_db.__init__",
}

IGNORE_MODULES_STARTSWITH = {
2 changes: 1 addition & 1 deletion src/bio_data_to_db/bindingdb/fix_tables.py
@@ -6,7 +6,7 @@

def fix_assay_table(uri: str):
    """
    Fix the assay table in MySQL by decoding HTML entities like ''' and stripping empty spaces.
    Fix the assay table in MySQL by decoding HTML entities like `'` and stripping empty spaces.

    Notes:
        - the table is replaced.
6 changes: 5 additions & 1 deletion src/bio_data_to_db/bio_data_to_db.pyi
@@ -2,4 +2,8 @@ def uniprot_xml_to_postgresql(
    *,
    uniprot_xml_path: str,
    uri: str,
) -> None: ...
) -> None:
    """
    (Rust) Load UniProt XML file into PostgreSQL database.
    """

    ...

GitHub Actions / ruff-lint flagged two Ruff failures on this file:

- src/bio_data_to_db/bio_data_to_db.pyi:6:5: PYI021 Docstrings should not be included in stubs
- src/bio_data_to_db/bio_data_to_db.pyi:9:5: PIE790 Unnecessary `...` literal
40 changes: 40 additions & 0 deletions src/bio_data_to_db/uniprot/utils.py
@@ -16,6 +16,32 @@
def create_empty_table(
    uri: str,
):
    """
    Create an empty table in the database. Necessary to create the table structure before inserting data.

    Note:
        It runs the following SQL query:

        ```sql
        CREATE TABLE public.uniprot_info (
            uniprot_pk_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
            accessions TEXT[],
            names TEXT[],
            protein_names TEXT[],
            gene_names TEXT[],
            organism_scientific TEXT,
            organism_commons TEXT[],
            organism_synonyms TEXT[],
            ncbi_taxonomy_id INT,
            deargen_ncbi_taxonomy_id INT,
            lineage TEXT[],
            keywords TEXT[],
            geneontology_ids TEXT[],
            geneontology_names TEXT[],
            sequence TEXT,
            deargen_molecular_functions TEXT[]
        )
        ```
    """
    uri_wo_dbname, dbname = uri.rsplit("/", 1)
    create_db_if_not_exists(uri_wo_dbname, dbname)
    create_schema_if_not_exists(uri, "public")
@@ -55,6 +81,17 @@ def create_empty_table(


def create_accession_to_pk_id(uri: str):
    """
    Create a table to map accession to uniprot_pk_id, from the uniprot_info table.

    It creates the following tables:

    - accession_to_pk_id
    - accession_to_pk_id_list

    Note:
        The mapping is not unique. It is possible to have multiple uniprot_pk_id for a single accession and vice versa.
    """
    with psycopg.connect(
        conninfo=uri,
    ) as conn:
@@ -118,6 +155,9 @@ def keywords_tsv_to_postgresql(
    schema_name="public",
    table_name="keywords",
):
    """
    Load the keywords_all_2024_06_26.tsv (or similar version) file into the database.
    """
    tsv_columns = [
        "Keyword ID",
        "Name",
12 changes: 11 additions & 1 deletion src/bio_data_to_db/utils/postgresql.py
@@ -70,6 +70,9 @@ def polars_datatype_to_sqlalchemy_type(


def create_db_if_not_exists(uri_wo_db: str, db_name: str, comment: str | None = None):
    """
    Create a database if it doesn't exist.
    """
    with psycopg.connect(
        conninfo=f"{uri_wo_db}",
    ) as conn:
@@ -110,6 +113,9 @@ def create_db_if_not_exists(uri_wo_db: str, db_name: str, comment: str | None =


def create_schema_if_not_exists(uri: str, schema_name: str, comment: str | None = None):
    """
    Create a schema if it doesn't exist. The DB should already exist.
    """
    db_name = uri.split("/")[-1]
    with psycopg.connect(
        conninfo=uri,
@@ -318,6 +324,9 @@ def split_column_str_to_list(
    separator: str,
    pg_element_type: str = "text",
):
    """
    Split a string column into a list column.
    """
    if pg_element_type.lower() not in {
        "text",
    }:
@@ -458,7 +467,8 @@ def polars_write_database(
    """
    pl.DataFrame.write_database(), but addresses the issue of writing unsigned and list columns to the database.
    https://stackoverflow.com/questions/77098480/polars-psycopg2-write-column-of-lists-to-postgresql
    Reference:
        - https://stackoverflow.com/questions/77098480/polars-psycopg2-write-column-of-lists-to-postgresql
    """
    if isinstance(connection, str):
        connection = create_engine(connection)
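
As an illustration of the helper above, a hypothetical call might look like this; the argument names are modeled on `pl.DataFrame.write_database()` and are assumptions here, so check the actual signature in the documentation:

```python
import polars as pl

from bio_data_to_db.utils.postgresql import polars_write_database

# A list column is exactly the case this helper exists to handle.
df = pl.DataFrame({"accession": ["P12345"], "gene_names": [["GENE1", "GENE2"]]})

polars_write_database(
    df,  # hypothetical positional argument
    table_name="public.example",  # placeholder table name
    connection="postgresql://username:password@localhost:5432/my_db",  # placeholder URI
)
```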
