
[Build] Various fixes #936

Merged
merged 9 commits into main on Jul 3, 2024
Conversation

@riedgar-ms (Collaborator) commented Jul 2, 2024:

The builds have broken in a number of interesting ways:

  • The required CMake configuration for llama-cpp-python has changed slightly, so update the various workflow files (not to mention the README).
  • It also appears that a new version of Phi3 started giving trouble with another test, so fix that as well.
  • One of the LLamaCpp tests is consistently failing on Windows (and only Windows), so add a conditional skip (a sketch of this kind of skip is shown below).
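For context, the conditional skip mentioned in the last item is the kind of thing pytest's skipif marker handles. The snippet below is only an illustration of that pattern; the test name and reason string are placeholders, not the code actually changed in this PR.

import platform

import pytest

@pytest.mark.skipif(
    platform.system() == "Windows",
    reason="consistently fails on Windows runners; skipped pending investigation",
)
def test_llamacpp_example():  # placeholder name, not the real test
    ...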

@@ -116,11 +112,10 @@ def test_phi3_transformers_orig():
def test_phi3_chat_basic(phi3_model: models.Model):
riedgar-ms (Collaborator, Author) commented:

The change makes this test match the following 'unrolled' one.

Comment on lines +116 to +118
lm += "You are a counting bot. Just keep counting numbers."
with assistant():
    lm += gen(name="five", max_tokens=10)
    lm += "1,2,3,4," + gen(name="five", max_tokens=20)
Collaborator commented:

Reasonable fix. That being said, tests of this sort feel inherently flaky... I'm not sure we should be testing the actual behaviors of models that we don't have control over..?

riedgar-ms (Collaborator, Author) commented:

I agree, and in general, we don't look at the model output (unless it's supposed to be constrained). This is quite an old test, and I'm not sure if there was originally some other goal here.

With the separation of Model and Engine, we should make sure that going forward we test the interface carefully, and then focus testing on the Guidance parser/grammar parts (with mocks, like we do for JSON). This is what I was gesticulating towards in the test doc I wrote up.
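
As a rough illustration of the mock-driven style referred to here (the mock model class and its usage below are assumptions about guidance's test utilities, not code quoted from the repository), a constrained generation can be exercised without any real LLM:

from guidance import gen, models

def test_constrained_count_with_mock():
    # A mock engine stands in for a real model; the regex constraint,
    # not the model, guarantees the captured text is digits.
    lm = models.Mock()
    lm += "1,2,3,4," + gen(name="five", regex=r"[0-9]+", max_tokens=5)
    assert lm["five"].isdigit()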

@Harsha-Nori (Collaborator) commented Jul 3, 2024:

Sadly, I don't think we can get away with solely compliance-based testing on the interface and grammars. There's an entire class of issues where the code will function, but where we're knocking the model off-distribution and causing poor-quality generations at test time (e.g. a malformed chat template, bad token healing, etc.). This particular test is by no means the perfect way to model this behavior, but it was consistently failing, with Phi-3 generating nonsense, until we fixed up the chat template to align with pre-training.

ML projects deviate from traditional software in that you will often have code that executes but has the wrong functional behavior, and it can be challenging to fit that into traditional software testing workflows. An alternative to writing flaky one-off tests like this is to, e.g., maintain a benchmark suite which runs nightly and monitor model performance over time. Subtle bugs are still hard to catch in both worlds, of course, but running on many examples is almost certainly better than one-offs like this one?

riedgar-ms (Collaborator, Author) commented:

There's definitely a lot more 'unit' testing which could and should be done of the guidance Model and parser.

That said, point taken that problems with 'knocking the model off-distribution' aren't going to show up with just 'testing the model interfaces'. Perhaps we do need a class of model-specific tests which are known to be flaky (these should be carefully split from the 'interface'-type tests, which should always pass), and monitor them over time. We would need to investigate test monitoring options, or be prepared to build our own (fun with security!), though. Once you start tolerating flakiness, there are two huge risks:

  1. Getting used to things 'only' getting 0.1% worse each day
  2. Persistently failing tests hiding in the 'acceptable failure rate'

The test analysis infrastructure would need to be able to address both of these. (One possible shape for the interface/behavior split is sketched below.)
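
Purely for illustration, such a split could be expressed with a pytest marker that the PR gate excludes and a nightly job includes. The marker name and test names below are invented, and registering the marker in pytest configuration is assumed.

import pytest

@pytest.mark.model_behavior  # hypothetical marker, registered in pytest config
def test_phi3_counts_sensibly():
    """Known-flaky: inspects unconstrained model output."""
    ...

def test_parser_roundtrip_with_mock():
    """Deterministic: exercises the grammar/parser against a mocked engine."""
    ...

The PR gate could then run pytest -m "not model_behavior", while a nightly job runs the flaky set and tracks its pass rate over time, which is where the two risks above would have to be watched.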

@riedgar-ms changed the title from "[Build] Fix llama-cpp-python" to "[Build] Various fixes" on Jul 3, 2024
@hudson-ai (Collaborator) commented:

Thanks a bunch for the fixes!

I'm noticing that the vast majority of the time, these breakages are due to external changes rather than internal ones. I am glad that we get fast warnings in these cases, but it would be awesome to disentangle them from the requirements of PRs...

What do you think? I don't think my sentiments are terribly out of line with your general testing proposal..?

@riedgar-ms (Collaborator, Author) commented:

I think that we should be writing more unit tests for sure, and cutting back on some of the sort of tests which are failing here (as I said, focus on the interface with Guidance, not performance on individual examples). However, given that we also want to keep up with releases of Transformers, LlamaCpp etc., I think that broken gates like this are things we're going to have to live with to a certain extent.

What I could do is run the 'unit' tests in a subworkflow, so we can see that go through, and if necessary override the gate if PR changes are confined to things which don't need a real LLM.

@riedgar-ms (Collaborator, Author) commented:

I know we're having an interesting discussion about test strategy, but are there any objections to merging this PR, and at least clearing the gate?

@hudson-ai (Collaborator) commented:

Glad to keep this conversation ongoing... I really appreciate you putting so much thought into our tests. But please go ahead and merge!

@riedgar-ms merged commit 44af848 into main on Jul 3, 2024
106 checks passed
@riedgar-ms deleted the riedgar-ms/ci-fix-02 branch on July 3, 2024 at 18:43