Bouzyges (pronounced boo-zee-jes) is a Python program to interactively generate semantic graphs of medical terms utilizing the SNOMED CT attribute-value pairs. The script can be interfaced with a LLM model to generate graphs in automated fashion. End result of the script is a set of SNOMED CT concepts, that serve as the closest possible strict supertypes that together fully capture the meaning of the input term.
Authors of Bouzyges kindly ask you to cite the following conference publication if you use Bouzyges in your research or publish a derived work: https://www.ohdsi.org/2024showcase-32/
We include a BibTeX and a modern Hayagriva citation snippets for your convenience:
@inproceedings{Bouzyges,
title = "Automating data standardization through ad hoc SNOMED modeling with LLM: proof of concept",
author = "Eduard Korchmar and Vojtech Huser and Christian Reich and Alexander Davydov",
howpublished = "\url{https://www.ohdsi.org/2024showcase-32/}",
organization = "OHDSI",
type = "Collaboration Showcase",
booktitle = "OHDSI 2024 Global Symposium",
conference = "OHDSI 2024 Global Symposium",
year = "2024",
month = "October",
day = "20",
address = "New Brunswick, NJ, USA"
}
bouzyges:
type: article
title: "Automating data standardization through ad hoc SNOMED modeling with LLM: proof of concept"
author:
- Eduard Korchmar
- Vojtech Huser
- Christian Reich
- Alexander Davydov
date: 2024-10-20
url: https://www.ohdsi.org/2024showcase-32/
parent:
type: conference
title: OHDSI 2024 Global Symposium
organization: OHDSI
address: New Brunswick, NJ, USA
In current form, Bouzyges serves as a proof-of-concept of a novel approach to automating ontology mapping and standardization. In the future, possible applications include:
- Mapping of medical terms to SNOMED CT concepts
- SNOMED CT authoring support
- Automated SNOMED CT quality assurance
- Automated creation of custom local Standard concepts in OMOP CDM.
Bouzyges requires Python 3.12 or later. To install the script, clone the repository, initialize a virtual environment and install the required packages:
git clone https://github.com/OHDSI/Bouzyges.git
cd Bouzyges
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Current implementation of Bouzyges relies on Snowstorm REST API to interface with SNOMED CT. To use the API, you need to provide the endpoint for the Snowstorm service in default_config.json
file or in GUI configuraiton.
Snowstorm version 10 with SNOMED International (July 2024 release) was tested. We recommend using the Docker image provided by SNOMED International to run Snowstorm locally and loading the SNOMED RF2 release archive via the Swagger UI.
- Snowstorm GitHub repository
- Using Snowstorm with Docker
- SNOMED International release in RF2 format (hosted by NLM)
Bouzyges relies on outputting LLM prompts and parsing their input; currently, three options are supported:
- Manual input: the user is prompted to input the desired LLM prompt and is expected to provide the input manually. This can be used to debug the script or test different LLMs interactively.
- OpenAI: to use this API, you will need to ensure that a valid
OPENAI_API_KEY
is set either as environment variable or (recommended) inenv
file (see below). You can also set environment variables in Bouzyges GUI. - Azure: Azure OpenAI API is also supported. To use this API, you will need to provide the API information either an by explicitly setting environment variables or (preferred way) inside
.env
file.
It is possible to implement additional API interfaces (e.g. to locally available models) by inheriting from PromptFormat
class to generate prompts in the correct format and inheriting from Prompter
to provide interface to send prompts to the LLM.
To avoid accidental exposure of API keys, we strongly recommend using an .env
file to manage environment variables. Bouzyges will try to automatically load the .env
file in the working directory using the python-dotenv library. You can also manully paste API keys in Bouzyges interface in Edit > Override environment variables
menu.
Example content of the .env
file:
# OpenAI requirements
# Project API key created at https://platform.openai.com/api-keys
export OPENAI_API_KEY="sk-abc...def"
# Azure OpenAI interface requirements
# Attainable at your organization's infrastructure team
export AZURE_OPENAI_API_KEY="123abcd...789"
export AZURE_OPENAI_API_VERSION="2024-06-01" # Most recent version
export AZURE_OPENAI_ENDPOINT="https://example.openai.azure.com/
Bouzyges will cache all calls to LLM APIs in an SQLite database prompt_cache.db
. Prompts to the same model with the same API options will be reused across runs. Database file can be read and analyzed by any tool supporting sqlite3 APIs. Schema DDL is stored in init_prompt_cache.sql
file.
Warning
Bouzyges makes a lot of API calls and may consume a LOT of tokens. Currently, processing one concept consumes tokens on magnitude of 150,000 (3 cents with gpt-4o-mini). Performant models like GPT-4o or GPT-1o will be more expensive by orders of magnitude.
To run the script, execute the following command:
$ python bouzyges.py
This will start the GUI interface.
Bouzyges can be configured either interactively in GUI or by editing default_config.json
file.
Note that sensitive data should not be stored in the configuration file. API keys and Azure endpoints should be stored in .env
file.
The configuration file is divided into several blocks:
API options contain fields to configure connections to OpenAI API. There is also an option to set the number of prompt repeats (each query is repeated several times to get "best of N" answer) and the number of concurrent threads to allow for parallel processing.
Contatains options to enable logging to a file (default to true) and logging level (default to DEBUG -- 10).
Contains options to enable profiling of the script with default cProfile module. Should be considered deprecated, as it is not informative about multithreaded performance.
Contains options to enable reading from a file, and the file name to read from. Contains fields that will be passed to pandas.read_csv
function.
Contains options to enable writing to a file, and the file name to write to. Contains fields that will be passed to DataFrame.to_csv
method. If JSON format is chosen for output, options will be ignored.
Can be either "JSON", "CSV" or "SCG". "JSON" option is recommended.
By necessity of license requirements of QT graphical library, Bouzyges is licensed under GNU GPL v3.0. Please refer to the LICENSE
file for more information. GNU GPL v3.0 is a strong copyleft license, any derivative works must also be licensed under GNU GPL v3.0, which may not be suitable for commercial use-cases.
We intend to release an embeddable version of Bouzyges under a more permissive and derivation-friendly license in the near future.
- Refactoring into submodules
- Headless library mode re-release under a softer license
- Switch to OpenAI batch processing API
- Benchmarking framework and iteration on prompts
- RAG support with SNOMED authoring documentation
- Batch processing interface
- Reproducible run instructions
- Licensing and release preparation
- SNOMED CT API optimization
- OpenAI token consumption profiling
- OpenAI API token consumption optimization
- Automated LLM interface
- SNOMED CT API interface
- SNOMED CT hierarchy traversal