Integrate uniprot info into initial prompt #396
base: development
Commits: d8f6903, 8f18627, e2a148a, 04b85a2, 23e8b39, 221be40
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,7 @@ | |
import streamlit as st | ||
|
||
from alphastats.gui.utils.ui_helper import DefaultStates, StateKeys | ||
from alphastats.llm.llm_integration import LLMIntegration | ||
from alphastats.llm.llm_integration import LLMIntegration, MessageKeys, Models | ||
from alphastats.llm.uniprot_utils import ( | ||
ExtractedUniprotFields, | ||
format_uniprot_annotation, | ||
|
@@ -24,6 +24,9 @@ def protein_selector(df: pd.DataFrame, title: str, state_key: str) -> List[str]: | |
selected_proteins (List[str]): A list of selected proteins. | ||
""" | ||
st.write(title) | ||
if len(df) == 0: | ||
st.markdown("No significant proteins.") | ||
return [] | ||
c1, c2 = st.columns([1, 1]) | ||
if c1.button("Select all", help=f"Select all {title} for analysis"): | ||
st.session_state[state_key] = df["Protein"].tolist() | ||
|
@@ -183,10 +186,19 @@ def get_display_available_uniprot_info(regulated_features: list) -> dict: | |
|
||
# TODO: Write test for this display | ||
@st.fragment | ||
def display_uniprot(regulated_genes_dict, feature_to_repr_map, disabled=False): | ||
def display_uniprot( | ||
regulated_genes_dict, | ||
feature_to_repr_map, | ||
model_name: str = Models.OLLAMA_31_70B, | ||
Review comment: please don't specify a default here |
||
*, | ||
disabled=False, | ||
): | ||
"""Display the interface for selecting fields from UniProt information, including a preview of the selected fields.""" | ||
all_fields = ExtractedUniprotFields.get_values() | ||
c1, c2, c3, c4 = st.columns((1, 1, 3, 1)) | ||
st.markdown( | ||
"We recommend providing at least limited information from UniProt for all proteins as part of the initial prompt to avoid misinterpretation of gene names or IDs by the LLM. You can edit the selection of fields while chatting to request more information on the fly." | ||
) | ||
c1, c2, c3, c4, c5, c6 = st.columns((1, 1, 1, 1, 1, 1)) | ||
if c1.button("Select all"): | ||
st.session_state[StateKeys.SELECTED_UNIPROT_FIELDS] = all_fields | ||
st.rerun(scope="fragment") | ||
|
@@ -198,21 +210,35 @@ def display_uniprot(regulated_genes_dict, feature_to_repr_map, disabled=False): | |
DefaultStates.SELECTED_UNIPROT_FIELDS.copy() | ||
) | ||
st.rerun(scope="fragment") | ||
if c4.button( | ||
"Integrate into initial prompt", | ||
type="primary", | ||
help="Not implemented yet, but will adjust the initial prompt to include the output from Uniprot already and the system message to avoid calling the tool function again for the genes included.", | ||
): | ||
st.toast("Not implemented yet.", icon="⚠️") | ||
# TODO: Implement this | ||
with c4: | ||
texts = [ | ||
format_uniprot_annotation( | ||
st.session_state[StateKeys.ANNOTATION_STORE].get(feature, {}), | ||
fields=st.session_state[StateKeys.SELECTED_UNIPROT_FIELDS], | ||
) | ||
for feature in regulated_genes_dict | ||
] | ||
dummy_model = LLMIntegration(model_name, api_key="lorem", load_tools=False) | ||
tokens = dummy_model.estimate_tokens( | ||
Review comment on lines +221 to +222: a bit hacky ;-)
|
||
[{MessageKeys.CONTENT: text} for text in texts] | ||
) | ||
st.markdown(f"Total tokens: {tokens:.0f}") | ||
with c5: | ||
st.checkbox( | ||
Review comment: why do we need the extra checkbox? Can I not just click "update prompt"? (set
Reply: You need a way to also revert to an initial prompt that does not contain the UniProt information. Uncheck the box, then update. |
||
"Integrate into initial prompt", | ||
help="If this is ticked and the initial prompt is updated, the UniProt information will be included in the prompt and the instructions regarding UniProt will change to only look up more information if explicitly asked to do so. Make sure that the total tokens are below the message limit of your LLM.", | ||
key=StateKeys.INTEGRATE_UNIPROT, | ||
disabled=disabled, | ||
) | ||
if c6.button("Update prompt", disabled=disabled): | ||
st.rerun(scope="app") | ||
c1, c2 = st.columns((1, 3)) | ||
with c1, st.expander("Show options", expanded=True): | ||
selected_fields = [] | ||
for field in all_fields: | ||
if st.checkbox( | ||
field, | ||
value=field in st.session_state[StateKeys.SELECTED_UNIPROT_FIELDS], | ||
disabled=disabled, | ||
): | ||
selected_fields.append(field) | ||
if set(selected_fields) != set( | ||
|
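Regarding the "a bit hacky" remark on the token count: one possible way to avoid constructing a throwaway `LLMIntegration` only to count tokens is a standalone estimator. The following is a minimal sketch, not the behavior of `LLMIntegration.estimate_tokens`. The helper name `estimate_tokens_from_texts` and the ratio of about four characters per token are assumptions made for illustration (a rough rule of thumb for English text, not an exact tokenizer).

```python
from typing import Iterable


def estimate_tokens_from_texts(
    texts: Iterable[str], chars_per_token: float = 4.0
) -> float:
    """Roughly estimate the token count of several texts.

    Hypothetical helper: the characters-per-token ratio is a heuristic,
    so the result is only a ballpark figure, not an exact count.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Sum the lengths of all texts and convert characters to tokens.
    total_chars = sum(len(text) for text in texts)
    return total_chars / chars_per_token


# Usage: two annotation snippets with 8 and 12 characters.
tokens = estimate_tokens_from_texts(["abcdefgh", "abcdefghijkl"])
print(f"Total tokens: {tokens:.0f}")
```

Because the function needs no model name or API key, it would also sidestep the question of which default model to pass to `display_uniprot`, at the cost of being less accurate than a model-specific tokenizer.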
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,8 +15,7 @@ def get_system_message(dataset: DataSet) -> str: | |
|
||
return ( | ||
f"You are an expert biologist and have extensive experience in molecular biology, medicine and biochemistry.{os.linesep}" | ||
"A user will present you with data regarding proteins upregulated in certain cells " | ||
"sourced from UniProt and abstracts from scientific publications. They seek your " | ||
"A user will present you with data regarding proteins upregulated. They seek your " | ||
"expertise in understanding the connections between these proteins and their potential role " | ||
f"in disease genesis. {os.linesep}" | ||
f"Provide a detailed and insightful, yet concise response based on the given information. Use formatting to make your response more human readable." | ||
|
@@ -30,16 +29,29 @@ def get_initial_prompt( | |
parameter_dict: Dict[str, Any], | ||
upregulated_genes: List[str], | ||
downregulated_genes: List[str], | ||
uniprot_info: str, | ||
): | ||
"""Get the initial prompt for the LLM model.""" | ||
group1 = parameter_dict["group1"] | ||
group2 = parameter_dict["group2"] | ||
column = parameter_dict["column"] | ||
if uniprot_info: | ||
uniprot_instructions = ( | ||
f"We have already retrieved relevant information from Uniprot for these proteins:{os.linesep}{os.linesep}{uniprot_info}{os.linesep}{os.linesep}" | ||
Review comment: This (and other prompts) uses punctuation quite sparingly. My knowledge may be outdated, but I learned that the more structured a prompt, the better. Shall we add some backticks or quotation marks? Just an example:
instead of
(maybe a different PR)
Reply: I will consider this when I start the next PR on prompt engineering.
Review comment: "proteins" or "genes"? :) |
||
"This contains curated information you may not have encountered before, value it highly. " | ||
"Only retrieve additional information from Uniprot if explicitly asked to do so." | ||
) | ||
else: | ||
uniprot_instructions = ( | ||
"You have the ability to retrieve curated information from Uniprot about these proteins. " | ||
"Please do so for individual proteins if you have little information about a protein or find a protein particularly important in the specific context." | ||
) | ||
return ( | ||
f"We've recently identified several proteins that appear to be differently regulated in cells " | ||
f"when comparing {group1} and {group2} in the {column} group. " | ||
f"From our proteomics experiments, we know that the following ones are upregulated: {', '.join(upregulated_genes)}.{os.linesep}{os.linesep}" | ||
f"Here is the list of proteins that are downregulated: {', '.join(downregulated_genes)}.{os.linesep}{os.linesep}" | ||
Review comment: also prompt engineering:
Reply: I will consider this when I start the next PR on prompt engineering. |
||
f"{uniprot_instructions}{os.linesep}{os.linesep}" | ||
f"Help us understand the potential connections between these proteins and how they might be contributing " | ||
f"to the differences. After that provide a high level summary" | ||
) | ||
|
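To make the prompt-structure discussion above more concrete, here is a hedged sketch of how explicit delimiters could look. It is not the reviewer's original example (which is not preserved here) and not the code of `get_initial_prompt`; the function name `build_initial_prompt_sketch`, its wording, and the choice of backticks and triple quotes as delimiters are assumptions for illustration. It also shows the revert path discussed for the checkbox: an empty `uniprot_info` falls back to the tool-use instructions.

```python
import os
from typing import List


def build_initial_prompt_sketch(
    group1: str,
    group2: str,
    column: str,
    upregulated: List[str],
    downregulated: List[str],
    uniprot_info: str = "",
) -> str:
    """Hypothetical variant of the initial prompt with explicit delimiters."""
    if uniprot_info:
        # Information was integrated: fence it off so data and
        # instructions are clearly separated.
        uniprot_part = (
            f"UniProt information already retrieved for these proteins:{os.linesep}"
            f'"""{os.linesep}{uniprot_info}{os.linesep}"""{os.linesep}'
            "Only retrieve additional information from UniProt if explicitly asked to do so."
        )
    else:
        # Nothing integrated (e.g. the checkbox was unticked): fall back to tool use.
        uniprot_part = (
            "You can retrieve curated information from UniProt about these proteins "
            "if you know little about one of them."
        )
    paragraphs = [
        f"Comparison: `{group1}` vs. `{group2}` in the `{column}` group.",
        "Upregulated proteins: " + ", ".join(f"`{gene}`" for gene in upregulated),
        "Downregulated proteins: " + ", ".join(f"`{gene}`" for gene in downregulated),
        uniprot_part,
    ]
    # Separate paragraphs with a blank line, as the existing prompt does.
    return (os.linesep * 2).join(paragraphs)


# Usage with made-up group and gene names.
print(build_initial_prompt_sketch("healthy", "disease", "condition", ["TP53"], ["ALB"]))
```

Whether delimiters of this kind actually improve model responses would still need to be checked in the follow-up PR on prompt engineering.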
Review comment: shall we move this check to 06_LLM.py? E.g. right after st.markdown("##### Genes of interest")? I feel here it's a bit hidden.