
feat: add React text #86

Merged · 22 commits · Oct 29, 2024 · Changes from 18 commits
1 change: 1 addition & 0 deletions src/tex/acronymns.tex
@@ -16,3 +16,4 @@
\newacronym{json}{JSON}{JavaScript object notation}
\newacronym{rag}{RAG}{retrieval augmented generation}
\newacronym{ece}{ECE}{expected calibration error}
\newacronym{cot}{CoT}{chain of thought}
21 changes: 19 additions & 2 deletions src/tex/appendix.tex
@@ -343,7 +343,7 @@ \subsection{Model performance} \label{sec:model_performance_app}
However, since some subjects are composed of questions from different sources, the ranking of the models is, in some instances, different from the one on \chembenchmini.

\begin{table}
\caption{\textbf{Performance of the models on the \chembench corpus.} The table shows the fraction of questions answered correctly by the models for different skills and difficulty levels.}
\caption{\textbf{Performance of the models on the \chembench corpus.} The table shows the fraction of questions answered correctly by the models for different skills and difficulty levels. Models with \enquote{T-one} in the name were run at a temperature of 1, which allows us to study the effect of temperature in the benchmark. Systems with \enquote{ReAct} in the name are tool-augmented, i.e., they can call external tools such as web search or a calculator to better answer the questions. However, we limit those systems to ten calls to the \gls{llm}. This constraint often prevented the systems from finding the correct answer within the specified number of calls, in which case we count the answer as incorrect (see \Cref{sec:react-environment}).}
\resizebox{\textwidth}{!}{
\variable{output/performance_table.tex}
}
@@ -481,7 +481,6 @@ \subsection{Influence of model scale}
\label{fig:model_size_plot}
\end{figure}


\clearpage
\subsection{Refusal detection}
\Glspl{llm} typically undergo refusal training to prevent harmful or undesirable outputs. As a result, models may decline to answer questions perceived as potentially adversarial prompts.
@@ -588,6 +587,24 @@ \subsection{Human baseline} \label{sec:human_baseline}
\label{fig:tool_use}
\end{figure}

\clearpage
\subsection{Tool-augmented models} \label{sec:react-environment}
In addition to directly prompting \glspl{llm}, we also investigated the performance of tool-augmented systems.
For this, we focused on the two models that showed the best overall results, \GPTFourO and \ClaudeThreeFiveSonnet (\oone outperformed both models, but it is not recommended to use this model with \enquote{reasoning} prompts such as ReAct\autocite{yao2022react} or \gls{cot}\autocite{wei2023cot}).
For both models, we created a ReAct-style tool-augmentation environment in which the models had access to WolframAlpha, the arXiv \gls{api}, Wikipedia, and web search (using the Brave Search \gls{api}).
We based the selection of these tools on the tools most frequently used by the human participants (see \Cref{fig:tool_use}).
Additionally, we added two chemistry-specific tools to convert \gls{iupac} names to \gls{smiles} and \gls{smiles} to \gls{iupac} names.
These conversion tools allow us to better understand how the agents perform on specific question types in this environment.
We implemented the systems using LangChain with the default ReAct prompt and constrained each system to a maximum of ten \gls{llm} calls.
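For illustration, the following minimal sketch shows how such a constrained ReAct agent could be assembled with LangChain. It is a sketch under assumptions: the tool identifiers follow LangChain's \texttt{load\_tools} conventions, the Brave web-search tool is omitted for brevity, and the \texttt{iupac\_to\_smiles}/\texttt{smiles\_to\_iupac} helpers are hypothetical placeholders rather than the exact implementation used in this work.
\begin{verbatim}
# Minimal sketch of a ReAct agent limited to ten LLM calls (illustrative only,
# not the exact configuration used for the benchmark runs).
from langchain.agents import AgentType, Tool, initialize_agent, load_tools
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Standard tools mirroring those most used by the human participants
# (the Brave web-search tool is omitted here for brevity).
tools = load_tools(["wolfram-alpha", "arxiv", "wikipedia"], llm=llm)

# Hypothetical chemistry-specific conversion helpers (assumed interfaces).
def iupac_to_smiles(name: str) -> str:
    """Convert an IUPAC name to a SMILES string (placeholder)."""
    raise NotImplementedError

def smiles_to_iupac(smiles: str) -> str:
    """Convert a SMILES string to an IUPAC name (placeholder)."""
    raise NotImplementedError

tools += [
    Tool(name="IUPAC2SMILES", func=iupac_to_smiles,
         description="Convert an IUPAC name to a SMILES string."),
    Tool(name="SMILES2IUPAC", func=smiles_to_iupac,
         description="Convert a SMILES string to an IUPAC name."),
]

# Default ReAct prompt; at most ten LLM calls per question.
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=10,
)
\end{verbatim}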

While for \ClaudeThreeFiveSonnet the overall performance increased (by less than 1\%), we observe a decrease in performance for \GPTFourO compared with the \gls{llm} without tools (\Cref{tab:performance_table}).
If we study the results by question type, we observe an improvement for questions regarding electron counts or point groups of compounds.
However, the scores decreased for questions about the number of isomers or \gls{ghs} pictograms.
For the specific questions about converting \gls{iupac} names to \gls{smiles}, the results decreased notably, despite the models having access to dedicated tools for those questions.
By studying the reasoning paths for these cases, we found that the errors result from the models responding in a format that the LangChain framework with the default ReAct loop cannot parse.
This indicates that agent frameworks need further optimization to become more robust: building reliable systems involves not only equipping \glspl{llm} with tools but also engineering effort to handle such failure modes.
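To make this failure mode concrete, the sketch below shows one possible mitigation: LangChain's default ReAct output parser raises an error whenever the model's reply does not follow the expected \enquote{Thought/Action/Final Answer} format, and runs that exhaust the call budget can simply be scored as incorrect. This is an assumed mitigation for illustration; the example question and the \texttt{grade} helper are hypothetical and not part of the configuration used for the reported results.
\begin{verbatim}
# Sketch: tolerate malformed ReAct output and score budget-exhausted runs as
# incorrect. Assumes `llm` and `tools` from the previous sketch; `grade` is a
# hypothetical scoring helper, not part of this work's released code.
from langchain.agents import AgentType, initialize_agent

def grade(answer: str) -> bool:
    """Hypothetical scorer comparing the answer against the reference."""
    return "3" in answer

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=10,
    handle_parsing_errors=True,      # feed parsing errors back to the model
    early_stopping_method="force",   # stop gracefully once the budget is hit
)

result = agent.invoke(
    {"input": "How many signals appear in the 1H NMR spectrum of ethanol?"}
)
answer = result["output"]

# A run that stops at the iteration limit never returns a usable final answer,
# so it is counted as incorrect.
is_correct = "Agent stopped" not in answer and grade(answer)
\end{verbatim}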


\clearpage
\subsection{Confidence estimates} \label{sec:confidence_estimates}

2 changes: 1 addition & 1 deletion src/tex/ms.tex
@@ -249,7 +249,7 @@ \section{Conclusions}
Our work shows that carefully curated benchmarks can provide a more nuanced understanding of the capabilities of \glspl{llm} in the chemical sciences.
Importantly, our findings also illustrate that more focus is required in developing better human-model interaction frameworks, given that models cannot estimate their limitations.

While our findings indicate many areas for further improvement of \gls{llm}-based systems, it is also important to realize that clearly defined metrics have been the key to the progress of many fields of \gls{ml}, such as computer vision.
While our findings indicate many areas for further improvement of \gls{llm}-based systems, such as the ReAct-based systems (see \Cref{sec:react-environment} for more discussion), it is also important to realize that clearly defined metrics have been key to the progress of many fields of \gls{ml}, such as computer vision.
Although current systems might be far from reasoning like a chemist, our \chembench framework will be a stepping stone for developing systems that might come closer to this goal.

\clearpage
7 changes: 7 additions & 0 deletions src/tex/references.bib
@@ -1260,6 +1260,13 @@ @article{urbina2022teachable
journaltitle = {Nat. Mach. Intell.},
}

@article{wei2023cot,
  title = {Chain-of-Thought Prompting Elicits Reasoning in Large Language Models},
  author = {Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny},
  journaltitle = {arXiv preprint arXiv:2201.11903},
  year = {2023},
}

@misc{wei2021chemistryqa,
title = {ChemistryQA: A Complex Question Answering Dataset from Chemistry},
author = {Wei, Z. and Ji, W. and Geng, X. and Chen, Y. and Chen, B. and Qin,