Bouzyges

Bouzyges (pronounced boo-zee-jes) is a Python program that interactively generates semantic graphs of medical terms using SNOMED CT attribute-value pairs. The script can be connected to an LLM to generate graphs automatically. The end result is a set of SNOMED CT concepts that serve as the closest possible strict supertypes and together fully capture the meaning of the input term.

Citing Bouzyges

The authors of Bouzyges kindly ask you to cite the following conference publication if you use Bouzyges in your research or publish a derived work: https://www.ohdsi.org/2024showcase-32/

We include BibTeX and Hayagriva citation snippets for your convenience:

BibTeX

@inproceedings{Bouzyges,
    title = "Automating data standardization through ad hoc SNOMED modeling with LLM: proof of concept",
    author = "Eduard Korchmar and Vojtech Huser and Christian Reich and Alexander Davydov",
    howpublished = "\url{https://www.ohdsi.org/2024showcase-32/}",
    organization = "OHDSI",
    type = "Collaboration Showcase",
    booktitle = "OHDSI 2024 Global Symposium",
    conference = "OHDSI 2024 Global Symposium",
    year = "2024",
    month = "October",
    day = "20",
    address = "New Brunswick, NJ, USA"
}

Hayagriva

bouzyges:
    type: article
    title: "Automating data standardization through ad hoc SNOMED modeling with LLM: proof of concept"
    author:
        - Eduard Korchmar
        - Vojtech Huser
        - Christian Reich
        - Alexander Davydov
    date: 2024-10-20
    url: https://www.ohdsi.org/2024showcase-32/
    parent:
        type: conference
        title: OHDSI 2024 Global Symposium
        organization: OHDSI
        address: New Brunswick, NJ, USA

Intended use

In its current form, Bouzyges serves as a proof of concept for a novel approach to automating ontology mapping and standardization. In the future, possible applications include:

Mapping of medical terms to SNOMED CT concepts
SNOMED CT authoring support
Automated SNOMED CT quality assurance
Automated creation of custom local Standard concepts in OMOP CDM.

Installation

Bouzyges requires Python 3.12 or later. To install the script, clone the repository, initialize a virtual environment and install the required packages:

git clone https://github.com/OHDSI/Bouzyges.git
cd Bouzyges
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Prerequisites

SNOMED CT

The current implementation of Bouzyges relies on the Snowstorm REST API to interface with SNOMED CT. To use the API, you need to provide the endpoint of the Snowstorm service in the default_config.json file or in the GUI configuration.

Bouzyges was tested against Snowstorm version 10 loaded with the SNOMED International release (July 2024). We recommend running Snowstorm locally with the Docker image provided by SNOMED International and loading the SNOMED RF2 release archive via the Swagger UI.
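Before pointing Bouzyges at a Snowstorm endpoint, it can help to confirm that the server answers concept lookups. The following is a minimal sketch using only the standard library, not code taken from Bouzyges. It assumes a local Snowstorm instance on port 8080 and the default MAIN branch; both are assumptions, so adjust them to your deployment. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def concept_url(endpoint: str, concept_id: str, branch: str = "MAIN") -> str:
    """Build the URL of a single-concept lookup on a Snowstorm server."""
    base = endpoint.rstrip("/") + "/"
    # quote() keeps "/" by default, so nested branch paths stay intact.
    return f"{base}{quote(branch)}/concepts/{quote(concept_id)}"


def fetch_concept(endpoint: str, concept_id: str, branch: str = "MAIN",
                  timeout: float = 10.0) -> dict:
    """Fetch one concept as a dictionary; raises on HTTP or network errors."""
    request = Request(concept_url(endpoint, concept_id, branch),
                      headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # 138875005 is the SNOMED CT root concept; the endpoint is an assumption.
    print(fetch_concept("http://localhost:8080", "138875005"))
```

If the lookup fails, check that the RF2 import has finished and that the endpoint matches the one configured for Bouzyges.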

External links:

Snowstorm GitHub repository
Using Snowstorm with Docker
SNOMED International release in RF2 format (hosted by NLM)

LLM interface

Bouzyges works by emitting LLM prompts and parsing the responses; currently, three options are supported:

Manual input: the user is shown each LLM prompt and is expected to provide the answer manually. This can be used to debug the script or to test different LLMs interactively.
OpenAI: to use this API, ensure that a valid OPENAI_API_KEY is set either as an environment variable or (recommended) in the .env file (see below). You can also set environment variables in the Bouzyges GUI.
Azure: the Azure OpenAI API is also supported. To use this API, provide the API information either by explicitly setting environment variables or (preferred) inside the .env file.

Implementing new interfaces

Additional API interfaces (e.g. to locally available models) can be implemented by inheriting from the PromptFormat class, which generates prompts in the correct format, and from the Prompter class, which sends prompts to the LLM.
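The subclassing pattern might look roughly like the sketch below. Only the class names PromptFormat and Prompter come from Bouzyges; the stand-in base classes defined here, their method names (build, send, ask) and the two subclasses are hypothetical and exist only to keep the example self-contained. Consult the actual base classes in the source for the methods that must be overridden.

```python
from abc import ABC, abstractmethod
from collections.abc import Callable


class PromptFormat(ABC):
    """Stand-in only, NOT the Bouzyges class; the method name is hypothetical."""

    @abstractmethod
    def build(self, term: str, options: list[str]) -> str:
        """Turn a term and candidate answers into one prompt string."""


class Prompter(ABC):
    """Stand-in only, NOT the Bouzyges class; method names are hypothetical."""

    def __init__(self, prompt_format: PromptFormat) -> None:
        self.prompt_format = prompt_format

    @abstractmethod
    def send(self, prompt: str) -> str:
        """Deliver a prompt to the model and return its raw reply."""

    def ask(self, term: str, options: list[str]) -> str:
        # Format the question, send it, and trim whitespace from the reply.
        return self.send(self.prompt_format.build(term, options)).strip()


class NumberedListFormat(PromptFormat):
    """Renders the candidate answers as a numbered list."""

    def build(self, term: str, options: list[str]) -> str:
        lines = [f"{i}. {option}" for i, option in enumerate(options, start=1)]
        return f"Which option best describes '{term}'?\n" + "\n".join(lines)


class LocalModelPrompter(Prompter):
    """Wraps any callable that maps a prompt string to a reply string."""

    def __init__(self, prompt_format: PromptFormat,
                 generate: Callable[[str], str]) -> None:
        super().__init__(prompt_format)
        self._generate = generate

    def send(self, prompt: str) -> str:
        return self._generate(prompt)
```

Passing the model as a plain callable keeps the prompter independent of any particular local inference library.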

`.env` file

To avoid accidental exposure of API keys, we strongly recommend using an .env file to manage environment variables. Bouzyges will try to automatically load the .env file in the working directory using the python-dotenv library. You can also manually paste API keys into the Bouzyges interface via the Edit > Override environment variables menu.

Example content of the .env file:

# OpenAI requirements
# Project API key created at https://platform.openai.com/api-keys
export OPENAI_API_KEY="sk-abc...def"

# Azure OpenAI interface requirements
# Obtainable from your organization's infrastructure team
export AZURE_OPENAI_API_KEY="123abcd...789"
export AZURE_OPENAI_API_VERSION="2024-06-01"  # Most recent version
export AZURE_OPENAI_ENDPOINT="https://example.openai.azure.com/"
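A quick way to confirm that the variables listed above are visible to a Python process is sketched here. This is an illustrative check, not part of Bouzyges; the backend labels "openai" and "azure" and the function name are hypothetical. Run it from a shell where the .env file has been sourced, or after the variables have been loaded in some other way.

```python
import os
from collections.abc import Mapping

# Variable names are taken from the example .env file above;
# the dictionary keys are hypothetical labels for this sketch.
REQUIRED_VARIABLES = {
    "openai": ("OPENAI_API_KEY",),
    "azure": (
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_API_VERSION",
        "AZURE_OPENAI_ENDPOINT",
    ),
}


def missing_variables(backend: str,
                      env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty for a backend."""
    return [name for name in REQUIRED_VARIABLES[backend] if not env.get(name)]


if __name__ == "__main__":
    for backend in REQUIRED_VARIABLES:
        missing = missing_variables(backend)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{backend}: {status}")
```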

Caching of results

Bouzyges caches all calls to LLM APIs in an SQLite database, prompt_cache.db. Prompts sent to the same model with the same API options are reused across runs. The database file can be read and analyzed by any tool that supports SQLite. The schema DDL is stored in the init_prompt_cache.sql file.
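As a starting point for inspecting the cache, the sketch below lists every table in the database with its row count. It deliberately makes no assumption about the schema (see init_prompt_cache.sql for the actual table definitions) and opens the file read-only so the cache cannot be modified by accident. The function name is hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def cache_overview(path: str = "prompt_cache.db") -> dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    # mode=ro opens the file read-only and fails if it does not exist.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        return {
            table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            for table in tables
        }


if __name__ == "__main__":
    for table, rows in cache_overview().items():
        print(f"{table}: {rows} rows")
```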

Usage

Warning

Bouzyges makes many API calls and may consume a LOT of tokens. Currently, processing one concept consumes on the order of 150,000 tokens (about 3 cents with gpt-4o-mini). More capable models such as GPT-4o or o1 will be more expensive by orders of magnitude.

To run the script, execute the following command:

python main.py

This will start the GUI.

Bouzyges can be configured either interactively in the GUI or by editing the default_config.json file.

`default_config.json` file

Note that sensitive data should not be stored in the configuration file. API keys and Azure endpoints should be stored in the .env file.

The configuration file is divided into several blocks:

API block

The API options configure the connection to the OpenAI API. They also set the number of prompt repeats (each query is repeated several times to get a "best of N" answer) and the number of concurrent threads used for parallel processing.
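The idea behind prompt repeats and concurrent threads can be illustrated with a short sketch. It treats "best of N" as a majority vote over N replies, which is an assumption about the selection rule and not a description of the Bouzyges implementation. The function and parameter names are hypothetical, and ask stands for any function that sends a prompt and returns the reply.

```python
from collections import Counter
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor


def best_of_n(ask: Callable[[str], str], prompt: str,
              repeats: int = 3, max_workers: int = 4) -> str:
    """Send the same prompt several times in parallel; return the modal reply."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns replies in submission order, whatever the finish order.
        answers = list(pool.map(ask, [prompt] * repeats))
    # Ties are resolved in favor of the reply that appears first in the list.
    return Counter(answers).most_common(1)[0][0]
```

An odd number of repeats avoids ties for yes/no questions, and every extra repeat multiplies token consumption accordingly.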

Log block

Contains options to enable logging to a file (defaults to true) and to set the logging level (defaults to DEBUG, i.e. 10).

Profile block

Contains options to enable profiling of the script with the standard cProfile module. This block should be considered deprecated, as cProfile is not informative about multithreaded performance.

Read block

Contains options to enable reading from a file and to set the file name to read from, plus fields that are passed to the pandas.read_csv function.

Write block

Contains options to enable writing to a file and to set the file name to write to, plus fields that are passed to the DataFrame.to_csv method. If the JSON format is chosen for output, these fields are ignored.

Format field

Can be one of "JSON", "CSV" or "SCG". The "JSON" option is recommended.
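Assuming "SCG" refers to SNOMED CT Compositional Grammar, a result made of supertypes and attribute-value pairs could be written as a single expression along the lines of the sketch below. This illustrates the grammar's basic shape (focus concepts joined by "+", then a colon and comma-separated refinements); it is not the serializer used by Bouzyges, and the function name is hypothetical.

```python
Concept = tuple[str, str]  # (SNOMED CT identifier, human-readable term)


def to_scg(parents: list[Concept],
           attributes: list[tuple[Concept, Concept]]) -> str:
    """Render supertypes and attribute-value pairs as one SCG-style expression."""
    focus = " + ".join(f"{cid} |{term}|" for cid, term in parents)
    if not attributes:
        return focus
    refinement = ", ".join(
        f"{a_id} |{a_term}| = {v_id} |{v_term}|"
        for (a_id, a_term), (v_id, v_term) in attributes
    )
    return f"{focus} : {refinement}"


if __name__ == "__main__":
    print(to_scg(
        [("64572001", "Disease")],
        [(("363698007", "Finding site"),
          ("39057004", "Pulmonary valve structure"))],
    ))
```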

License

Because of the license requirements of the Qt graphical library, Bouzyges is licensed under GNU GPL v3.0. Please refer to the LICENSE file for more information. GNU GPL v3.0 is a strong copyleft license: any derivative works must also be licensed under GNU GPL v3.0, which may not be suitable for commercial use cases.

We intend to release an embeddable version of Bouzyges under a more permissive and derivation-friendly license in the near future.
