docs: Advanced-use-case (#199)

* start use case * use case details * add repo links to vignette * add links * spelling, phrasing, formatting * add questions * add mermaid of schema * info boxes
biocypher · Aug 21, 2024 · 6bb6072 · 6bb6072
1 parent daf2629
commit 6bb6072
Show file tree

Hide file tree

Showing 27 changed files with 262 additions and 12 deletions.
diff --git a/docs/benchmark/overview.md b/docs/benchmark/overview.md
@@ -6,7 +6,7 @@ explanation, see the [benchmarking documentation](../features/benchmark.md) and
 
 ## Scores per model
 
-Table sorted by mean score in descending order.
+Table sorted by median score in descending order.
 Click the column names to reorder.
 
 {{ read_csv('benchmark/results/processed/overview-model.csv', colalign=("left","right")) }}
@@ -16,7 +16,7 @@ Click the column names to reorder.
 
 ## Scores per quantisation
 
-Table sorted by mean score in descending order.
+Table sorted by median score in descending order.
 Click the column names to reorder.
 
 {{ read_csv('benchmark/results/processed/overview-quantisation.csv', colalign=("left","right")) }}
@@ -26,7 +26,7 @@ Click the column names to reorder.
 ## Scores of all tasks
 
 Wide table; you may need to scroll horizontally to see all columns.
-Table sorted by mean score in descending order.
+Table sorted by median score in descending order.
 Click the column names to reorder.
 
 {{ read_csv('benchmark/results/processed/overview.csv', colalign=("left","right")) }}
diff --git a/docs/images/boxplot-medical-exam-domain.png b/docs/images/boxplot-medical-exam-domain.png
diff --git a/docs/images/boxplot-medical-exam-language-domain.png b/docs/images/boxplot-medical-exam-language-domain.png
diff --git a/docs/images/boxplot-medical-exam-language.png b/docs/images/boxplot-medical-exam-language.png
diff --git a/docs/images/boxplot-medical-exam-task.png b/docs/images/boxplot-medical-exam-task.png
diff --git a/docs/images/boxplot-naive-vs-biochatter.pdf b/docs/images/boxplot-naive-vs-biochatter.pdf
diff --git a/docs/images/boxplot-naive-vs-biochatter.png b/docs/images/boxplot-naive-vs-biochatter.png
diff --git a/docs/images/boxplot-per-quantisation.png b/docs/images/boxplot-per-quantisation.png
diff --git a/docs/images/boxplot-tasks.png b/docs/images/boxplot-tasks.png
diff --git a/docs/images/boxplot-text2cypher.png b/docs/images/boxplot-text2cypher.png
diff --git a/docs/images/dotplot-per-task.pdf b/docs/images/dotplot-per-task.pdf
diff --git a/docs/images/dotplot-per-task.png b/docs/images/dotplot-per-task.png
diff --git a/docs/images/histogram-image-caption-confidence.png b/docs/images/histogram-image-caption-confidence.png
diff --git a/docs/images/scatter-per-quantisation-name.pdf b/docs/images/scatter-per-quantisation-name.pdf
diff --git a/docs/images/scatter-per-quantisation-name.png b/docs/images/scatter-per-quantisation-name.png
diff --git a/docs/images/scatter-quantisation-accuracy.pdf b/docs/images/scatter-quantisation-accuracy.pdf
diff --git a/docs/images/scatter-quantisation-accuracy.png b/docs/images/scatter-quantisation-accuracy.png
diff --git a/docs/images/scatter-size-accuracy.pdf b/docs/images/scatter-size-accuracy.pdf
diff --git a/docs/images/scatter-size-accuracy.png b/docs/images/scatter-size-accuracy.png
diff --git a/docs/images/stripplot-extraction-tasks.png b/docs/images/stripplot-extraction-tasks.png
diff --git a/docs/images/stripplot-per-model.png b/docs/images/stripplot-per-model.png
diff --git a/docs/images/stripplot-rag-tasks.pdf b/docs/images/stripplot-rag-tasks.pdf
diff --git a/docs/images/stripplot-rag-tasks.png b/docs/images/stripplot-rag-tasks.png
diff --git a/docs/vignettes/custom-bclight-advanced.md b/docs/vignettes/custom-bclight-advanced.md
@@ -32,6 +32,11 @@ BioChatter Light web app is available
 
 ## Build the KG
 
+!!! info inline end "GitHub Project access token"
+    Be aware that running this script will require a GitHub token with access to
+    the project board. This token should be stored in the environment variable
+    `BIOCYPHER_GITHUB_PROJECT_TOKEN`.
+
 We modified an existing adapter for the GitHub GraphQL API to pull data from the
 GitHub Project board. Thus, the time investment to build the KG was minimal
 (~3h); this is one central principle of BioCypher. We adapted the code
@@ -58,10 +63,6 @@ such as `Priority` and `Size`, are properties of the project item in our current
 implementation. These assignments, including the schmema of the graph, can be
 flexibly adapted by using BioCypher mechanisms.
 
-Be aware that running this script will require a GitHub token with access to the
-project board. This token should be stored in the environment variable
-`BIOCYPHER_GITHUB_PROJECT_TOKEN`.
-
 ## Add the additional tabs to BioChatter Light
 
 BioChatter Light has a modular architecture to accommodate flexible layout
@@ -124,22 +125,25 @@ services:
       - TASK_SETTINGS_PANEL_TAB=true
 ```
 
+!!! info inline end "Authentication"
+    For using the app with the standard OpenAI LLM, we need to provide the
+    `OPENAI_API_KEY` environment variable. This key can be obtained from the
+    OpenAI website.
+
 You can see the full configuration in the `docker-compose.yml` file of the
 [project-planning](https://github.com/biocypher/project-planning) repository.
 For public deployment, we also added a password-protected version of the KG,
 which only requires a few additional lines in the `docker-compose-password.yml`
-file. To deploy the tool on a cloud VM, we now only need to run the following
+file.
+
+To deploy the tool on a cloud VM, we now only need to run the following
 commands:
 
 ```bash
 git clone https://github.com/biocypher/project-planning.git
 docker-compose -f project-planning/docker-compose-password.yml up -d
 ```
 
-We just need to make sure to provide an `OPENAI_API_KEY` and a
-`BIOCYPHER_GITHUB_PROJECT_TOKEN` in the VM's environment to be accessed by the
-Docker workflow.
-
 ## Useful tips for deployment
 
 Many vendors offer cloud VMs with pre-installed Docker and Docker Compose, as
@@ -156,10 +160,14 @@ server {
 
     location / {
         proxy_pass http://localhost:8501;
+        proxy_http_version 1.1;
         proxy_set_header Host $host;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
         proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_set_header Upgrade $http_upgrade;
+        proxy_set_header Connection "upgrade";
+        proxy_read_timeout 86400;
     }
 }
 ```

diff --git a/docs/vignettes/custom-bclight-simple.md b/docs/vignettes/custom-bclight-simple.md
@@ -52,6 +52,11 @@ to `true`, which tells BioChatter Light to connect to the KG on the Docker
 network, which uses the service name as the hostname, so in this case, `deploy`
 instead of the default `localhost`.
 
+!!! info inline end "Authentication"
+    For using the app with the standard OpenAI LLM, we need to provide the
+    `OPENAI_API_KEY` environment variable. This key can be obtained from the
+    OpenAI website.
+
 We then turn off all default tabs (chatting, prompt engineering, RAG, and the
 correcting agent) and turn on the KG tab. Running the docker compose with these
 settings will build and deploy the KG and the BioChatter Light web app with only

diff --git a/docs/vignettes/custom-decider-use-case.md b/docs/vignettes/custom-decider-use-case.md
@@ -0,0 +1,236 @@
+# Custom BioChatter Light and Next: Cancer Genetics Use Case
+
+This example is part of the BioChatter manuscript supplement.
+
+<!-- TODO DOI -->
+
+## Background
+
+Personalised medicine tailors treatment to a patient's unique genetic makeup.
+In cancer care, this approach helps categorize patients and assign them to specific treatment groups in clinical trials.
+However, interpreting and making decisions based on this data is challenging due to the complexity of genetic variations, the interaction between genes and environmental factors, tumor diversity, patient histories, and the vast amount of data produced by advanced technologies.
+
+In the [DECIDER consortium](https://deciderproject.eu), we aim to improve clinical decisions by providing support systems, for instance for the geneticists working on these cases.
+The code for the use case lives at [https://github.com/biocypher/decider-genetics](https://github.com/biocypher/decider-genetics).
+
+Below, we show how we build a support application for this use case.
+
+## Sources of knowledge
+
+We integrate knowledge from diverse resources, using [BioCypher](https://biocypher.org) to build a knowledge graph of:
+
+1. Processed whole genome sequencing data of ovarian cancer patients (synthetic data)
+
+    - genomic changes
+
+        - classified by consequence (protein truncation, amino acid change)
+
+        - algorithmic prediction of deleteriousness
+
+        - variant identifiers 
+
+    - allele dosages
+
+        - gene allele copy number (amplifications, deletions, loss-of-heterogeneity)
+
+        - mutation pervasiveness (estimate of number of affected alleles, or suspected subclonality)
+
+    - proportion of cancer cells in the sample (tumour purity)
+
+
+2. the patients' clinical history (synthetic data)
+
+    - personal information (age at diagnosis, BMI, etc.)
+
+    - treatment history, known side effects, clinical response
+
+    - lab test results (blood, imaging, histopathology)
+
+    - common treatment-relevant mutations (BRCA), HR deficiency, PARP-inhibitor maintenance
+
+3. data from open resources (real data)
+
+    - variant annotations (as provided by the genetics pipeline of the DECIDER consortium)
+
+    - gene annotations (as provided by the genetics pipeline of the DECIDER consortium)
+
+    - pathway / process annotations (from public databases such as [Gene Ontology](http://geneontology.org))
+
+    - drug annotations (from [OncoKB](https://www.oncokb.org))
+
+In addition, we provide access to more resources via the RAG and API agents:
+
+1. relevant publications from
+[PubMed](https://pubmed.ncbi.nlm.nih.gov/?term=high%20grade%20serous%20ovarian%20cancer&filter=simsearch2.ffrft&filter=pubt.review&filter=pubt.systematicreview)
+(real data) embedded in a vector database
+
+2. relevant knowledge streamed live from OncoKB (see below) via API access through BioChatter's API agent
+
+## The geneticist's workflow
+
+Personalised cancer therapy is guided by identifying somatic genomic driver events in specific genes, particularly when these involve well-known hotspot mutations.
+However, unique somatic events in the same genes or pathways can create a "grey zone" that requires manual geneticist analysis to determine their clinical significance.
+
+To address this, a comprehensive BioCypher backend processes whole-genome sequencing data to catalog somatic changes, annotating their consequences and potential actionability.
+These data can then be linked to external resources for clinical interpretation.
+For example, certain mutations in the BRCA1 or ERBB2 genes can indicate sensitivity to specific treatments such as PARP inhibitors or trastuzumab.
+
+To fully leverage actionable data, the integration of patient-specific information with literature on drug targets and mechanisms of action or resistance is essential.
+[OncoKB](https://www.oncokb.org/actionable-genes#sections=Tx) is the primary resource for this information, accessible through drug annotations added to the knowledge graph (KG) and via the BioChatter API calling mechanism.
+
+Additionally, semantic search tools facilitate access to relevant biomedical literature, enabling geneticists to quickly verify findings against established treatments or resistance mechanisms.
+
+In summary, the main contributions of our use case to the productivity of this workflow are:
+
+- making processed and analysed genomic data locally available in a centralised resource by building a custom KG
+
+- allowing comparison to literature via semantic search inside a vector database with relevant publications
+
+- providing live access to external resources via the API agent
+
+<!-- OncoKB annotated - drug, cancer, resistance
+
+TODO add some to welcome page
+
+Questions:
+# meta level
+How many patients do we have on record?
+what was patient1's response to previous treatment, and which treatment did they receive?
+which patients have hr deficiency but have not received parp inhibitors?
+how many patients had severe adverse reactions, and to which drugs
+
+# genetics
+Does patient1 have a sequence variant in a gene that is druggable? Which drug, and what evidence level has the association?
+Does patient1 have a sequence variant in a gene that is druggable with evidence level "1"? Which drug? Return unique values.
+Does patient1 have a copy number variant in a gene that is druggable with evidence level "1"? Which drug? Return unique values.
+Does patient1 have a sequence variant in a gene that is druggable with evidence level "1"? Which drug? Return unique values and the variant information for each drug. Only select variants with CADD_phred above 5.
+What is the variant with the highest CADD_phred of the samples of the patient with id "patient1"
+How many clinically significant (CLNSIG = Pathogenic) variants does each patient have
+- used to distinguish BRCA mutations (there are benign ones, so don't benefit from PARP-I)
+How many clinically significant (CLNSIG = Pathogenic or Likely_pathogenic) variants does each patient have
+How many variants of unclear clinical significance (CLNSIG = Uncertain_significance or Conflicting_interpretations_of_pathogenicity) does each patient have
+which clinically significant (CLNSIG = Pathogenic) sequence variants do the samples of patient5 have?
+which patients have sequence and copy number variants in the same gene?
+What is the sequence variant with the highest CADD_phred, and which patient has it
+which copy number alterations are exclusive to patient1
+is there a patient with overlapping variants compared to patient1
+
+# biology
+what are the biological functions of the gene SETBP1 (??)
+
+Non-funtional:
+which genes of patient2 have more than 2 nMajor copies
+
+Taru - Geneticist: ideal to have all data and evidence in the same place; if it’s easy case, make already a recommendation, give standard interpretation.
+
+Create prompt with explanation of the thought process and important parameters regarding the variants etc? -->
+
+## Building the application
+
+We will explain how to use the BioCypher ecosystem, specifically, BioCypher and BioChatter, to build a decision support application for a cancer geneticist.
+The code base for this use case, including all details on how to set up the KG and the applications, is available at [https://github.com/biocypher/decider-genetics](https://github.com/biocypher/decider-genetics).
+You can find live demonstrations of the application at links provided in the README of the repository.
+The build procedures can be reproduced by cloning the repository and running `docker-compose up -d` (or the equivalent for the Next app) in the root directory (note that the default configuration requires authentication with OpenAI services).
+The process involves the following steps:
+
+1. Identifying data sources and creating a knowledge graph schema
+
+2. Building the KG with BioCypher from the identified sources
+
+3. Using BioChatter Light to develop and troubleshoot the KG application
+
+4. Customising BioChatter Next to yield an integrated conversational interface
+
+5. Deploying the applications
+
+### Identifying data sources and creating a knowledge graph schema
+
+We examine the data sources described above and design a KG schema that can accommodate the data.
+The configuration file, [schema_config.yaml](https://github.com/biocypher/decider-genetics/blob/main/config/schema_config.yaml), can be seen in the `config` directory of the repository.
+The schema should also be designed with LLM access in mind; performance in generating specific queries can be adjusted for in step three (troubleshooting using BioChatter Light).
+We created a bespoke adapters for the genetics data of the DECIDER cohort according to the output format of the genetics pipeline, and reused existing adapters for the open resources.
+They can be found in the [decider_genetics/adapters](https://github.com/biocypher/decider-genetics/tree/main/decider_genetics/adapters) directory of the repository.
+For this use case, we created synthetic data to stand in for the real data for privacy reasons; the synthetic data are available in the `data` directory.
+
+This is the schema of our KG:
+
+```mermaid
+graph TD;
+    Patient[Patient] -->|PatientToSequenceVariantAssociation| SequenceVariant[SequenceVariant]
+    Patient[Patient] -->|PatientToCopyNumberAlterationAssociation| CopyNumberAlteration[CopyNumberAlteration]
+    SequenceVariant[SequenceVariant] -->|SequenceVariantToGeneAssociation| Gene[Gene]
+    CopyNumberAlteration[CopyNumberAlteration] -->|CopyNumberAlterationToGeneAssociation| Gene[Gene]
+    Gene[Gene] -->|GeneToBiologicalProcessAssociation| BiologicalProcess[BiologicalProcess]
+    Gene[Gene] -->|GeneDruggabilityAssociation| Drug[Drug]
+```
+
+### Building the KG with BioCypher
+
+In the dedicated adapters for the DECIDER genetics data, we pull the data from the synthetic data files and build the KG.
+We perform simplifying computations, as described above, to facilitate standard workflows (such as counting alleles, identifying pathogenic variants, and calculating tumour purity).
+We mold the data into the specified schema in a transparent and reproducible manner by configuring the adapters (see the [decider_genetics/adapters](https://github.com/biocypher/decider-genetics/tree/main/decider_genetics/adapters) directory).
+
+After creating the schema and adapters, we run the build script to populate the KG.
+BioCypher is configured using the [biocypher_config.yaml](https://github.com/biocypher/decider-genetics/blob/main/config/biocypher_config.yaml) file in the `config` directory.
+Using the Docker Compose workflow included in the BioCypher template repository, we build a containerised version of the KG.
+We can inspect the KG in the Neo4j browser at `http://localhost:7474` after running the build script.
+Any changes, if needed, can be made to the configuration of schema and adapters.
+
+### Using BioChatter Light to develop and troubleshoot the KG application
+
+Upon deploying the KG via Docker, we can use a custom BioChatter Light application to interact with the KG.
+Briefly, we remove all components except the KG interaction panel via environment variables in the [docker-compose.yml](https://github.com/biocypher/decider-genetics/blob/main/docker-compose.yml) file (see also the corresponding [vignette](custom-bclight-simple.md)).
+This allows us to start the KG and interact with it using an LLM in a reproducible manner with just one command.
+We can then test the LLM-KG interaction by asking questions and examining the generated queries and its results from the KG.
+Once we are satisfied with the KG schema and LLM performance, we can advance to the next step.
+
+!!! info inline end "OpenAI API key needed"
+    In the standard configuration, we use the OpenAI API to generate queries.
+    Provide your `OPENAI_API_KEY` in the shell environment, or modify the
+    application to call a different LLM.
+
+The BioChatter Light application, including the KG creation, can be built using `docker compose up -d` in the root directory of the [repository](https://github.com/biocypher/decider-genetics).
+
+An online demonstration of this application can be found at [https://decider-light.biochatter.org](https://decider-light.biochatter.org).
+You can use this demonstration to test the KG - LLM interaction, asking questions such as:
+
+- How many patients do we have on record, and what are their names?
+
+- What was patient1's response to previous treatment, and which treatment did they receive?
+
+- Which patients have HR deficiency but have not received PARP inhibitors?
+
+- How many patients had severe adverse reactions, and to which drugs?
+
+- Does patient1 have a sequence variant in a gene that is druggable? Which drug, and what evidence level has the association?
+
+- Does patient1 have a sequence variant in a gene that is druggable with evidence level "1"? Which drug? Return unique values.
+
+- Does patient1 have a copy number variant in a gene that is druggable with evidence level "1"? Which drug? Return unique values.
+
+The query returned by the model can also be modified and rerun without an additional call to the LLM, allowing for easy troubleshooting and exploration of the KG.
+The schema information of the KG is displayed in the lower section of the page for reference.
+
+### Customising BioChatter Next to yield an integrated conversational interface
+
+We can further customise the Docker workflow to start the BioChatter Next application, including its REST API middleware `biochatter-server`.
+In addition to deploying all software components, we can also customise its appearance and functionality.
+Using the [biochatter-next.yaml](https://github.com/biocypher/decider-genetics/blob/main/config/biochatter-next.yaml) configuration file (in `config`, as all other configuration files), we can adjust the welcome message, how-to-use section, the system prompts for the LLM, which tools can be used by the LLM agent, the connection details of externally hosted KG or vectorstore, and other parameters.
+We then start BioChatter Next using a [dedicated Docker Compose file](https://github.com/biocypher/decider-genetics/blob/main/docker-compose-next.yml), which includes the `biochatter-server` middleware and the BioChatter Next application.
+
+!!! info inline end "OpenAI API key needed"
+    In the standard configuration, we use the OpenAI API to generate queries.
+    Provide your `OPENAI_API_KEY` in the `.bioserver.env` file, or modify the
+    application to call a different LLM.
+
+The BioChatter Next application, including the customisation of the LLM and the integration of the KG, can be built using `docker compose -f docker-compose-next.yml up -d` in the root directory of the [repository](https://github.com/biocypher/decider-genetics).
+An online demonstration of this application can be found at [https://decider-next.biochatter.org](https://decider-next.biochatter.org).
+
+### Deploying the applications
+
+The final step is to deploy one or both applications on a server.
+Using the Docker Compose workflow, we can deploy the applications in many different environments, from local servers to cloud-based solutions.
+The environment supplied by the Docker software allows for high reproducibility and easy scaling.
+The BioChatter Light app can be used for testing, but also to provide a simple one-way interface to the KG for users who do not need the full conversational interface.
+The BioChatter Next app can be configured to connect to KG and vectorstore deployments on different servers, allowing for a distributed architecture and dedicated maintenance of components; but it can also be deployed in tandem from one Docker Compose, for smaller setups or local use.
+