Local LLM Message Cleanup : Code Execution in code_utils.py #399

robzsaunders · 2023-10-24T14:21:46Z

Why are these changes needed?

I've noticed that there have been reports of Local LLM's models not properly executing generated code despite appearing correct. It turns out that some models are outputting "\r\n" instead of just "\n"

Some people use \r\n instead of just \n because they may be working in a context where CRLF line endings are expected or required, so it makes sense that subsequent models are trained the same way.

This is a simple fix that cleans up message text to remove /r so the autogen detector/compiler can detect and run the received code.

Example "CRLF" formatted output:
\r\n```python\r\nprint(\"Hello, World!\")\r\n```

Example cleaned output:
\n```python\nprint(\"Hello, World!\")\n```

Related issue number

I think this is related to #279 where no code would execute and subsequently no response would be fed back into the Local LLM causing a empty string response. Just part of the picture.

Checks

I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

I've noticed that there have been reports of Local LLM's models not properly executing generated code despite appearing correct. It turns out that some models are outputting "\r\n" instead of just "\n" Some people use \r\n instead of just \n because they may be working in a context where CRLF line endings are expected or required, so it makes sense that subsequent models are trained the same way. This is a simple fix that cleans up message text to remove /r so the autogen detector/compiler can detect and run the received code.

victordibia

Thanks for this. Makes sense.
Given that the \r\n itself does not modify the behavior of the code (atleast I dont see immediate examples of this apart from perhaps string matching logic of \r\n which should be rare...), I can see how this improves reliability for some local models.

Minor comment
Can you share some pointers to any models that exhitit the \r\n behavior to enable reproducibility?

codecov-commenter · 2023-10-24T19:26:46Z

Codecov Report

Merging #399 (5a634b0) into main (50ac547) will decrease coverage by 26.12%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##             main     #399       +/-   ##
===========================================
- Coverage   41.29%   15.18%   -26.12%     
===========================================
  Files          20       19        -1     
  Lines        2448     2450        +2     
  Branches      548      552        +4     
===========================================
- Hits         1011      372      -639     
- Misses       1359     2077      +718     
+ Partials       78        1       -77

Flag	Coverage Δ
unittests	`15.18% <0.00%> (-26.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
autogen/agentchat/contrib/math_user_proxy_agent.py	`15.87% <0.00%> (-28.58%)`	⬇️
autogen/agentchat/conversable_agent.py	`16.48% <0.00%> (-57.03%)`	⬇️
autogen/code_utils.py	`19.16% <0.00%> (-29.32%)`	⬇️

... and 8 files with indirect coverage changes

robzsaunders · 2023-10-24T20:52:48Z

Thanks for this. Makes sense. Given that the \r\n itself does not modify the behavior of the code (atleast I dont see immediate examples of this apart from perhaps string matching logic of \r\n which should be rare...), I can see how this improves reliability for some local models.

Minor comment Can you share some pointers to any models that exhitit the \r\n behavior to enable reproducibility?

Thanks Victor, one of the models I tested with was This one. The code would output perfectly in the console but wouldn't compile and came back as "unknown".

I figured it's a common enough possibility that it may encompass other rogue local LLM's out there.

No major change here. Only formatting updates based on the autogen precommit hook. pre-commit run --all-files

gagb · 2023-10-24T21:54:59Z

Thanks for this. Makes sense. Given that the \r\n itself does not modify the behavior of the code (atleast I dont see immediate examples of this apart from perhaps string matching logic of \r\n which should be rare...), I can see how this improves reliability for some local models.

Minor comment Can you share some pointers to any models that exhitit the \r\n behavior to enable reproducibility?

@victordibia do you think we should be modifying the core logic of code_utils for this or this should addressed by the end user via a custom reply function? Are there case when the code block suggests \r\n and this suggestion would overwrite them. For example, a degenerate case is I purposefully asked the agents to suggest to write via \r\n.

with open("my_file.txt", "w", newline="\r\n") as f:
    f.write("Line 1\r\n")
    f.write("Line 2\r\n")
    f.write("Line 3\r\n")

LMK, if I misunderstood the issue?

victordibia · 2023-10-24T22:41:14Z

@gagb You raise a good point here.
There are tradeoffs to implementing this for all in code_utils.py () or having each user implement their own version.

Current pr/scenario

local model is used by agent to generate code
code contains \r\n, so extract_code fails to return anything
agent response is empty or leads to an error
current pr does some sanitization in extract_code where occurrences of \r\n are replaced with \n enabling correct code extraction.

Can you provide a similar walkthrough of how custom_reply would be used to address the situation? Also, what errors will your degenerate use case lead to. Do you mean that in the scenario where the code is meant to write to a file with \r\n formating, this formatting will be changed?

Overall, we might need some benchmark that indicates how often the \r\n issue occurs and any associated side effects.

robzsaunders · 2023-10-25T00:28:01Z

You know @victordibia and @gagb I may have stumbled across a solution by accident when I was exploring regex options.

Originally I thought that the main problem was how the pattern selection detected the beginning of the code, where it decided what compiler to use. This ties into another similar problem I've been facing with Local LLM's which I'll get to in a second.

Line 21: CODE_BLOCK_PATTERN = r"```(\w*)\n(.*?)\n```" (pattern is set to this later)
Line 68:    if not detect_single_line_code:
Line 69:      match = re.findall(pattern, text, flags=re.DOTALL)
Line 70:      return match if match else [(UNKNOWN, text)]

Related to this PR directly and the above code snippet:
I recall coming across a scenario where only the first part of the code snippet was sterilized. For example, generated code that was say 50 lines, but because I was using /b/r/n/b for my regex, a segment of code that was /r/n/r/n/r/n wouldn't sterilize and then everything beyond it stayed un-sterilized. However, despite this, the code still executed.

At the time I was thinking of a full sterilization for consistency but now that I think about it, the real problem may actually be this section of regex that detects what compiler to use. We could just sterilize everything outside of the code blocks and leave the rest untouched?

Perhaps we scan and find where the compiler is mentioned as the key for the start of the code block? Example: Look for

```python

instead of just the code block fence?

Sort of related to this PR and ties into my proposed solution

Another problem I've been running into which in my opinion is in the same vein here is when a local model fails a snippet of code and rewrites it, or rewrites really anything. It tends to output the start of a code block with nothing throwing off the fixed code block.

I'm thinking we can use the same solution/detection for sanitizing outside the code blocks to detect and sanitize this problem too.

Example:

```\n\n```python\n (Bunch of python code) \n```

As you can imagine, what happens is the block :

```\n\n```

is detected and fails to execute when the

```python\n (Bunch of python code) \n```

is never referenced because it's actually detected as

python\n (Bunch of python code) \n```

This has happened with a few models, the one I mentioned earlier along with mistral's dolphin 2.1 quantified by bloke is the latest one I tried, and minstral instruct did it too. To reproduce, let a model run for awhile and you'll quickly see this happen when some code didn't generate correctly

tl;dr:

I think we can sanitize everything outside of the coding blocks and things will still run. We should look at scanning and using the compiler detection logic

eg; ```python

to detect chunks of code blocks and sterilize everything we need to sterilize outside of them, including rogue fences.

thinkall · 2023-10-25T04:50:24Z

autogen/code_utils.py

    """
+
+    # Some Local LLM models/servers output \r\n insteaf or just \n. Let's clean it up before continuing
+    text = re.sub(r"\r\n", "\n", text)


Could we add a parameter or environment variable to control enable or disable this replacement?

Was testing this morning using detect_single_line_code=True and it worked fine with and without \r\n since it's not looking for \n's.

Perhaps we should have it set to default since it seemingly works much better overall.

Thanks @robzsaunders . My concern is that if in some case there is a line of code like print("something\r\n"), the replacement will modify the code itself while maybe the user wants to keep it.

afourney · 2023-10-25T16:46:02Z

Here is another similar situation where code extraction fails:

#414 (comment)

gagb · 2023-10-25T17:42:27Z

@gagb You raise a good point here. There are tradeoffs to implementing this for all in code_utils.py () or having each user implement their own version.

Current pr/scenario

local model is used by agent to generate code

code contains \r\n, so extract_code fails to return anything

agent response is empty or leads to an error

current pr does some sanitization in extract_code where occurrences of \r\n are replaced with \n enabling correct code extraction.

Can you provide a similar walkthrough of how custom_reply would be used to address the situation? Also, what errors will your degenerate use case lead to. Do you mean that in the scenario where the code is meant to write to a file with \r\n formating, this formatting will be changed?

Overall, we might need some benchmark that indicates how often the \r\n issue occurs and any associated side effects.

@victordibia, Yes that custom_reply will look similar to the preconfigured version but modified with whatever new extraction/code detection logic we need. See https://github.com/microsoft/autogen/blob/main/autogen/agentchat/conversable_agent.py#L127

@afourney's test suite will come in handy for the testing!

JHamburg430 · 2023-10-25T22:47:09Z

Question: I just tried Mistral and Dolphin2.1 from TheBloke, added """ text = re.sub(r"\r\n", "\n", text) """ to code_utils.py as well as set detect_single_line_code = True.

I AM using LMStudio to host the model instead of using the Transformers library, but I believe that only affects the caching. Could this be the reason for my error?
Just to note: I've test both of these models with the math examples from the notebook without issue.
.
.
.
"role": "system"
},
{
"content": "What date is today? Compare the year-to-date gain for META and TESLA.",
"role": "user"
},
{
"content": "\n```python\n# filename: today_gain_comparison.py\nimport datetime\nimport requests\nfrom bs4 import BeautifulSoup\n\ntoday = datetime.date.today()\nprint(f"Today's date is {today}")\n\nmeta_url = "https://finance.yahoo.com/quote/FB?p=FB&.tsrc=fin-srch\"\ntesla_url = "https://finance.yahoo.com/quote/TSLA?p=TSLA&.tsrc=fin-srch\"\n\ndef get_year_to_date_gain(ticker):\n response = requests.get(ticker)\n soup = BeautifulSoup(response.content, 'html.parser')\n \n gain_element = soup.find('span', class_='Trsdu(0.3s",
"role": "assistant"
},
{
"content": "",
"role": "user"
}
],
"model": "Mistral",
"temperature": 0,
"detect_single_line_code": true
}
[2023-10-25 18:36:00.629] [ERROR] Error: 'messages' array must only contain objects with a 'content' field that is not empty.

robzsaunders · 2023-10-25T23:23:17Z

@jhamburger

I think your issue is not related to this.

Notice how your python script doesn't close with ```. You should check your max tokens to ensure that you're not being capped causing premature responses that aren't complete. You have incomplete code there, and without a closing code block fence, The code extractor doesn't match.

JHamburg430 · 2023-10-26T03:01:33Z

@robzsaunders you are correct, I see that now. Unfortunately, I confirmed token count settings and didn't resolve. I'll have to keep digging.

Related to my comment here: microsoft#399 (comment) This is further sterilization of the incoming code blocks, filtering out bad code block fences and possible bad code block language flags. I also flipped the detect_single_line_code to True since it just works better, and moved the /r/n/ sterilization to inside the detect_single_line_code=False mode since that's the only place it matters.

Small addition to support new sterilization in code_utils.py

Small update to support changes made in code_utils.py

robzsaunders · 2023-10-26T13:19:25Z

@robzsaunders please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

@microsoft-github-policy-service agree

Fixed a small overlooked variable bug

ekzhu · 2023-12-25T21:52:22Z

@robzsaunders thanks for the PR, please take a look at the current conflict with the main branch.

robzsaunders had a problem deploying to openai October 24, 2023 14:21 — with GitHub Actions Failure

gagb requested a review from victordibia October 24, 2023 17:44

victordibia approved these changes Oct 24, 2023

View reviewed changes

Merge branch 'main' into Local-LLM-Response-Cleaning

9f76618

victordibia had a problem deploying to openai October 24, 2023 19:22 — with GitHub Actions Failure

Update code_utils.py

bba8b59

No major change here. Only formatting updates based on the autogen precommit hook. pre-commit run --all-files

victordibia had a problem deploying to openai October 24, 2023 21:31 — with GitHub Actions Failure

thinkall reviewed Oct 25, 2023

View reviewed changes

victordibia mentioned this pull request Oct 25, 2023

When is the code executed? #414

Closed

robzsaunders had a problem deploying to openai October 26, 2023 13:12 — with GitHub Actions Failure

Update conversable_agent.py

ad401a6

Small addition to support new sterilization in code_utils.py

robzsaunders had a problem deploying to openai October 26, 2023 13:15 — with GitHub Actions Failure

Update math_user_proxy_agent.py

c2d9fe7

Small update to support changes made in code_utils.py

robzsaunders had a problem deploying to openai October 26, 2023 13:15 — with GitHub Actions Failure

robzsaunders mentioned this pull request Oct 26, 2023

AttributeError: 'str' object has no attribute 'get' after the task is finished #279

Closed

Merge branch 'main' into Local-LLM-Response-Cleaning

5a634b0

sonichi had a problem deploying to openai October 26, 2023 17:57 — with GitHub Actions Failure

Update code_utils.py

d96d41a

Fixed a small overlooked variable bug

robzsaunders had a problem deploying to openai October 26, 2023 21:14 — with GitHub Actions Failure

robzsaunders requested a review from victordibia October 30, 2023 18:43

robzsaunders mentioned this pull request Oct 31, 2023

Finding solution for select_speaker and agent_by_name functions: Sterilization or ?? #489

Closed

afourney mentioned this pull request Nov 7, 2023

improve CODE_BLOCK_PATTERN for a more robust code match #571

Merged

3 tasks

gagb closed this Aug 26, 2024

Local LLM Message Cleanup : Code Execution in code_utils.py #399

Local LLM Message Cleanup : Code Execution in code_utils.py #399

Uh oh!

Conversation

robzsaunders commented Oct 24, 2023

Why are these changes needed?

Related issue number

Checks

Uh oh!

victordibia left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Oct 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

robzsaunders commented Oct 24, 2023

Uh oh!

gagb commented Oct 24, 2023

Uh oh!

victordibia commented Oct 24, 2023

Uh oh!

robzsaunders commented Oct 25, 2023

tl;dr:

Uh oh!

thinkall Oct 25, 2023

Choose a reason for hiding this comment

Uh oh!

robzsaunders Oct 25, 2023

Choose a reason for hiding this comment

Uh oh!

thinkall Oct 27, 2023

Choose a reason for hiding this comment

Uh oh!

afourney commented Oct 25, 2023

Uh oh!

gagb commented Oct 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JHamburg430 commented Oct 25, 2023

Uh oh!

robzsaunders commented Oct 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JHamburg430 commented Oct 26, 2023

Uh oh!

robzsaunders commented Oct 26, 2023

Uh oh!

ekzhu commented Dec 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

codecov-commenter commented Oct 24, 2023 •

edited

Loading

gagb commented Oct 25, 2023 •

edited

Loading

robzsaunders commented Oct 25, 2023 •

edited

Loading