# GPT-4 does acid (base chemistry)

A recent paper (Clark et al., "Comparing the Performance of College Chemistry Students with ChatGPT for Calculations Involving Acids and Bases") reports that GPT-3.5 performs poorly on acid/base equilibrium calculations, in many cases worse than the college chemistry students it was compared against.

Here is a sample problem, for which GPT-3.5 only gets 10% correct:

**Q3 (acid):** *Calculate the pH of 0.25 M NH4Cl. Kb for NH3 = 1.8 * 10^-5.*

**ANSWER:** Set up an ICE table, etc.; recall that NH4+ is the conjugate acid of NH3, so Ka = Kw/Kb. See [OpenStax](https://openstax.org/books/chemistry-2e/pages/14-4-hydrolysis-of-salts) if you are rusty on acid/base equilibrium calculations.

```mathematica
With[ (*solve for the change*)
 {Kb = 1.8*10^-5,
  Kw = 10^-14},
 NSolveValues[Kw/Kb == (10^-7 + x)*x/(0.25 - x), x, PositiveReals]]
-Log[10, % + 10^-7] (* evaluate pH *)

(*{0.0000117349}*)
(*{4.92684}*)
```
So the approximation is a good one. If you are a hardass on significant figures...
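
For reference, the usual textbook shortcut (take x ≈ Sqrt[Ka*c] and neglect the 10^-7 M of hydronium already present) lands within a couple thousandths of a pH unit of the exact answer. A minimal check:

```mathematica
With[{Ka = 10^-14/(1.8*10^-5), c = 0.25},
 -Log[10, Sqrt[Ka*c]]] (* approximate pH, neglecting water autoionization *)

(*4.92867*)
```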

### Evaluation with default GPT-4

We are going to do this programmatically, because I do not like the tedium of grading. We define a base configuration using `gpt-4-1106-preview` and return the result in a formatted way. By default, [LLMSynthesize](http://reference.wolfram.com/language/ref/LLMSynthesize.html) uses T = 1, but we will set that explicitly for reproducibility:

```mathematica
baseConfig = LLMConfiguration[
  <|"Model" -> "gpt-4-1106-preview", "Temperature" -> 1|>] (* set T = 1 explicitly for reproducibility *)
```

```mathematica
answers = parseAnswer /@ results
(*{4.929, 5.43, 4.93, 4.93, 5.43, 4.93, 5.43, 5.071, 4.929, 4.929, 4.9281, 5.43, 4.93, 5.43, 4.93, 4.93, 4.93, 5.43, 5.43, 4.929}*)
```
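
The `parseAnswer` helper just pulls the number out of the model's reply. A minimal sketch, assuming (as in the transcripts below) that the reply ends with an `ANSWER: <number>` line:

```mathematica
parseAnswer[s_String] :=
  First[
   StringCases[s, "ANSWER:" ~~ Whitespace ~~ n : NumberString :> ToExpression[n]],
   Missing["NoAnswer"]] (* returns Missing[...] if no numerical answer is found *)
```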

Amazing! We get it right 12/20 = 60% (as compared to only 10% reported in Clark et al.). Maybe [GPT-4's dreams of attending medical school](https://doctortanc.substack.com/p/ai-llm-outperforms-physicians-in#:~:text=Key%20Findings,(59.1%25%20vs%2033.6%25).) have not been crushed by the gen chem exam.... Perhaps also worth noting is that in their paper, only 32% of Gen Chem students and 38% of Analytical Chemistry students got this correct, even after instruction.... so GPT-4 out of the box is doing better than the typical large Midwestern university student.
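
For the record, here is a quick way to tally that count, taking anything within 0.01 pH units of 4.93 as correct (the tolerance is a judgment call):

```mathematica
Count[answers, a_ /; Abs[a - 4.93] < 0.01]

(*12*)
```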

Take a closer look at one of the results that GPT-4 got wrong:

```
...
The pH of the 0.25 M NH4Cl solution is approximately 5.43.
ANSWER: 5.43
```
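
Where does 5.43 come from? A bit of forensics (hedged, since the model's scratch work varies from run to run) suggests a dropped power of ten rather than a conceptual error:

```mathematica
10^-5.43 (* the [H3O+] implied by the reported pH *)
(*3.71535*10^-6*)

%^2/0.25 (* the Ka this implies, via the x = Sqrt[Ka*c] shortcut *)
(*5.52154*10^-11*)
```

The implied Ka sits a factor of ten below the true Ka = Kw/Kb = 5.6 * 10^-10.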

The strategy looks OK, but it looks like some of the arithmetic went awry. But...we wouldn't have a student do these calculations without a calculator (or at least a slide rule...) and neither should we expect a large *language* model to perform precise numerical calculations. Fortunately, [in a previous post we showed how to give LLMs a calculator]({{ site.baseurl }}{% post_url 2024-01-09-Accurate-and-safe-LLM-numerical-calculations-using-Interpreter-and-LLMTools %}). (As an aside, it would be functionally easier to just let the LLM use the [Wolfram Alpha tool](https://resources.wolframcloud.com/LLMToolRepository/resources/WolframAlpha/), but we don't want to be accused of cheating by having this side tool that is stronger than a calculator. However, in a practical sense, the ChatGPT web app easily lets you [include Wolfram Alpha as a plugin](https://www.wolfram.com/wolfram-plugin-chatgpt/).)

### If you give an LLM a calculator...
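
The wiring follows the recipe from that previous post. A sketch of the idea (the `calculator` tool body is abbreviated, `q3` and `calcConfig` are stand-in names, and `resultsQ3aCalc` is the variable examined below):

```mathematica
calculator = LLMTool[{"calculator", "Evaluate a numerical math expression."},
   {"expression" -> "String"},
   Interpreter["ComputedNumber"][#expression] &]; (* safe numerical evaluation *)

calcConfig = LLMConfiguration[
   <|"Model" -> "gpt-4-1106-preview", "Temperature" -> 1,
    "Tools" -> {calculator}|>];

resultsQ3aCalc = ParallelTable[
   LLMSynthesize[q3, LLMEvaluator -> calcConfig], {20}]
```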

```mathematica
parseAnswer /@ %
```

![14p7ctuxlw0k6](/blog/images/2024/1/16/14p7ctuxlw0k6.png)

**Comment:** Results are just about the same (12/20 right answers), and we are getting a few cases where we do not get a numerical result returned. Let's start by looking at the first error:

```mathematica
resultsQ3aCalc[[2]]
```

This result (17/20) is only marginally better than our first attempt...

### Reflect and calculate

Now give the bot a calculator and also ask it to reflect on its answer:

```mathematica
reflectAndCalculate = LLMConfiguration[
  <|"Model" -> "gpt-4-1106-preview", "Temperature" -> 1,
   "Prompts" -> {reflectionPrompt}, (* stand-in for the reflection instruction defined earlier *)
   "Tools" -> {calculator}|>];

ParallelTable[
 LLMSynthesize[q3, LLMEvaluator -> reflectAndCalculate], {20}] (* q3: stand-in for the question string *)
parseAnswer /@ %
```

## Another sample problem: Titration of a weak base by a strong acid

The Clark et al. study reports that GPT-3.5 (and analytical chemistry undergraduate students) had a *complete* failure rate on the following problem:

**Q5a:** *If 25.0 mL of 0.25 M HNO3 is combined with 15.0 mL of 0.25 M CH3NH2, what is the pH? Kb for CH3NH2 = 4.38 * 10^-4*

**Answer:** There is a stoichiometric excess of 2.5 mmol of the strong acid titrant, so one can ignore the base ([see OpenStax](https://openstax.org/books/chemistry-2e/pages/14-7-acid-base-titrations)). The total volume of analyte and titrant is 40.0 mL, so just form the ratio to determine [H3O+] and calculate the pH:
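
A minimal version of that arithmetic:

```mathematica
With[
 {excess = 25.0*0.25 - 15.0*0.25, (* mmol of excess HNO3 after neutralizing the CH3NH2 *)
  totalVolume = 25.0 + 15.0}, (* mL of analyte + titrant *)
 -Log[10, excess/totalVolume]] (* [H3O+] = 2.5 mmol / 40.0 mL = 0.0625 M *)

(*1.20412*)
```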

```mathematica
ParallelTable[
 LLMSynthesize[q5a, LLMEvaluator -> baseConfig], {20}] (* q5a: stand-in for the question string; base GPT-4 config from above *)
parseAnswer /@ %
```

![0lu451ca6cmey](/blog/images/2024/1/16/0lu451ca6cmey.png)

We get 15/20 = 75% correct just using GPT-4 out of the box. (The gateway error is a technical problem, probably because OpenAI doesn't like us banging on their door or the WiFi is dodgy at my house.) Remember for comparison, Clark *et al.* reported that only 18% of Gen Chem students and 0% of Analytical Chem students at the large Midwestern university answered this correctly, even after instruction, and GPT-3.5 was also 0%. So again we get a big jump just by using a more recent version of GPT-4....but can we do better?

### How well does GPT-4 perform (*with* calculators and reflection)?

This strategy worked well before...will it work well here too?

```mathematica
resultQ5aReflectAndCalculate = ParallelTable[
  LLMSynthesize[q5a, LLMEvaluator -> reflectAndCalculate], (* q5a: stand-in for the question string defined above *)
  {20}]
parseAnswer /@ %
```

![0xe20bgn6b661](/blog/images/2024/1/16/0xe20bgn6b661.png)

**Comment:** 18/20 ran without a hitch! Let's take a look at the two failures:
```mathematica
parseAnswer /@ ParallelTable[
  LLMSynthesize[q5a, LLMEvaluator -> reflectAndCalculate], {2}] (* rerun the two failed cases; stand-in arguments *)
(*{1.20412, 1.20412}*)
```

These next 2 results are also perfect answers. So once again we have a perfect (20/20 = 100%) response (compared to only 0% for GPT-3.5 and Analytical Chem students, and 18% of Gen Chem students at the large Midwestern university studied by Clark et al.). Again, GPT-4 is at the head of its class.

```mathematica
ToJekyll["GPT-4 does acid (base chemistry)", "science teaching gpt4 llm mathematica"]
```
