
feat: add React text #86

Merged · 22 commits · Oct 29, 2024 · Changes from 18 commits
1 change: 1 addition & 0 deletions src/tex/acronymns.tex
@@ -16,3 +16,4 @@
\newacronym{json}{JSON}{JavaScript object notation}
\newacronym{rag}{RAG}{retrieval augmented generation}
\newacronym{ece}{ECE}{expected calibration error}
\newacronym{cot}{CoT}{chain of thought}
21 changes: 19 additions & 2 deletions src/tex/appendix.tex
@@ -343,7 +343,7 @@ \subsection{Model performance} \label{sec:model_performance_app}
However, since some subjects are composed of questions from different sources, the ranking of the models is, in some instances, different from the one on \chembenchmini.

\begin{table}
\caption{\textbf{Performance of the models on the \chembench corpus.} The table shows the fraction of questions answered correctly by the models for different skills and difficulty levels.}
\caption{\textbf{Performance of the models on the \chembench corpus.} The table shows the fraction of questions answered correctly by the models for different skills and difficulty levels. Models with \enquote{T-one} in the name were run at a temperature of 1, which allows us to study the effect of temperature in the benchmark. Systems with \enquote{ReAct} in the name are tool-augmented, i.e., they can call external tools such as web search or a calculator to better answer the questions. However, we limit those systems to ten calls to the \gls{llm}. This constraint often prevented the systems from finding the correct answer within the specified number of calls, in which case we count the answer as incorrect (see \Cref{sec:react-environment}).}
\resizebox{\textwidth}{!}{
\variable{output/performance_table.tex}
}
@@ -481,7 +481,6 @@ \subsection{Influence of model scale}
\label{fig:model_size_plot}
\end{figure}


\clearpage
\subsection{Refusal detection}
\Glspl{llm} typically undergo refusal training to prevent harmful or undesirable outputs. As a result, models may decline to answer questions perceived as potentially adversarial prompts.
@@ -588,6 +587,24 @@ \subsection{Human baseline} \label{sec:human_baseline}
\label{fig:tool_use}
\end{figure}

\clearpage
\subsection{Tool-augmented models} \label{sec:react-environment}
In addition to directly prompting \glspl{llm}, we also investigated the performance of tool-augmented systems.
For this, we focused on the two models that showed the best overall results, \GPTFourO and \ClaudeThreeFiveSonnet (\oone outperformed both models, but it is not recommended to use this model with \enquote{reasoning} prompts such as ReAct\autocite{yao2022react} or \gls{cot}\autocite{wei2023cot}).
For both models, we created a ReAct-style tool-augmentation environment in which the models had access to WolframAlpha, the arXiv \gls{api}, Wikipedia, and web search (using the Brave Search \gls{api}).
We based the selection of these tools on the tools most frequently used by the human participants (see \Cref{fig:tool_use}).
Additionally, we added two chemistry-specific tools to convert \gls{iupac} names to \gls{smiles} and \gls{smiles} to \gls{iupac} names.
These conversion tools allow us to better understand how the agents perform on specific question types in this environment.
We implemented the systems using LangChain with the default ReAct prompt and constrained each system to a maximum of ten \gls{llm} calls.
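For illustration, the following minimal sketch shows how such a constrained ReAct agent could be assembled with LangChain. It is a sketch under assumptions: the tool identifiers follow LangChain's \texttt{load\_tools} conventions, the Brave web-search tool is omitted for brevity, and the \texttt{iupac\_to\_smiles}/\texttt{smiles\_to\_iupac} helpers are hypothetical placeholders rather than the exact implementation used in this work.
\begin{verbatim}
# Minimal sketch of a ReAct agent limited to ten LLM calls (illustrative only,
# not the exact configuration used for the benchmark runs).
from langchain.agents import AgentType, Tool, initialize_agent, load_tools
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Standard tools mirroring those most used by the human participants
# (the Brave web-search tool is omitted here for brevity).
tools = load_tools(["wolfram-alpha", "arxiv", "wikipedia"], llm=llm)

# Hypothetical chemistry-specific conversion helpers (assumed interfaces).
def iupac_to_smiles(name: str) -> str:
    """Convert an IUPAC name to a SMILES string (placeholder)."""
    raise NotImplementedError

def smiles_to_iupac(smiles: str) -> str:
    """Convert a SMILES string to an IUPAC name (placeholder)."""
    raise NotImplementedError

tools += [
    Tool(name="IUPAC2SMILES", func=iupac_to_smiles,
         description="Convert an IUPAC name to a SMILES string."),
    Tool(name="SMILES2IUPAC", func=smiles_to_iupac,
         description="Convert a SMILES string to an IUPAC name."),
]

# Default ReAct prompt; at most ten LLM calls per question.
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=10,
)
\end{verbatim}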

While for \ClaudeThreeFiveSonnet the overall performance increased (by less than 1\%), we observe a decrease in performance for \GPTFourO compared with the \gls{llm} without tools (\Cref{tab:performance_table}).
If we study the results by question type, we observe an improvement for questions regarding electron counts or point groups of compounds.
However, the scores decreased for questions about the number of isomers or \gls{ghs} pictograms.
For the specific questions about converting \gls{iupac} names to \gls{smiles}, the results decreased notably, despite the models having access to dedicated tools for those questions.
By studying the reasoning paths for these cases, we found that the errors result from the models responding in a format that the LangChain framework with the default ReAct loop cannot parse.
This indicates that agent frameworks need further optimization to become more robust: building reliable systems involves not only equipping \glspl{llm} with tools but also engineering effort to handle such failure modes.
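To make this failure mode concrete, the sketch below shows one possible mitigation: LangChain's default ReAct output parser raises an error whenever the model's reply does not follow the expected \enquote{Thought/Action/Final Answer} format, and runs that exhaust the call budget can simply be scored as incorrect. This is an assumed mitigation for illustration; the example question and the \texttt{grade} helper are hypothetical and not part of the configuration used for the reported results.
\begin{verbatim}
# Sketch: tolerate malformed ReAct output and score budget-exhausted runs as
# incorrect. Assumes `llm` and `tools` from the previous sketch; `grade` is a
# hypothetical scoring helper, not part of this work's released code.
from langchain.agents import AgentType, initialize_agent

def grade(answer: str) -> bool:
    """Hypothetical scorer comparing the answer against the reference."""
    return "3" in answer

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=10,
    handle_parsing_errors=True,      # feed parsing errors back to the model
    early_stopping_method="force",   # stop gracefully once the budget is hit
)

result = agent.invoke(
    {"input": "How many signals appear in the 1H NMR spectrum of ethanol?"}
)
answer = result["output"]

# A run that stops at the iteration limit never returns a usable final answer,
# so it is counted as incorrect.
is_correct = "Agent stopped" not in answer and grade(answer)
\end{verbatim}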


\clearpage
\subsection{Confidence estimates} \label{sec:confidence_estimates}

2 changes: 1 addition & 1 deletion src/tex/ms.tex
@@ -249,7 +249,7 @@ \section{Conclusions}
Our work shows that carefully curated benchmarks can provide a more nuanced understanding of the capabilities of \glspl{llm} in the chemical sciences.
Importantly, our findings also illustrate that more focus is required in developing better human-model interaction frameworks, given that models cannot estimate their limitations.

While our findings indicate many areas for further improvement of \gls{llm}-based systems, it is also important to realize that clearly defined metrics have been the key to the progress of many fields of \gls{ml}, such as computer vision.
While our findings indicate many areas for further improvement of \gls{llm}-based systems, such as the ReAct-based systems (see \Cref{sec:react-environment} for more discussion), it is also important to realize that clearly defined metrics have been key to the progress of many fields of \gls{ml}, such as computer vision.
Although current systems might be far from reasoning like a chemist, our \chembench framework will be a stepping stone for developing systems that might come closer to this goal.

\clearpage
7 changes: 7 additions & 0 deletions src/tex/references.bib
@@ -1260,6 +1260,13 @@ @article{urbina2022teachable
journaltitle = {Nat. Mach. Intell.},
}

@article{wei2023cot,
  title = {Chain-of-Thought Prompting Elicits Reasoning in Large Language Models},
  author = {Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny},
  journaltitle = {arXiv preprint arXiv:2201.11903},
  year = {2023},
}

@misc{wei2021chemistryqa,
title = {ChemistryQA: A Complex Question Answering Dataset from Chemistry},
author = {Wei, Z. and Ji, W. and Geng, X. and Chen, Y. and Chen, B. and Qin,