[Build] Various fixes #936
Conversation
@@ -116,11 +112,10 @@ def test_phi3_transformers_orig():
def test_phi3_chat_basic(phi3_model: models.Model):
The change makes this test match the following 'unrolled' one.
lm += "You are a counting bot. Just keep counting numbers." | ||
with assistant(): | ||
lm += gen(name="five", max_tokens=10) | ||
lm += "1,2,3,4," + gen(name="five", max_tokens=20) |
Reasonable fix. That being said, tests of this sort feel inherently flaky... I'm not sure we should be testing the actual behaviors of models that we don't have control over..?
I agree, and in general, we don't look at the model output (unless it's supposed to be constrained). This is quite an old test, and I'm not sure if there was originally some other goal here.
With the separation of Model and Engine, we should make sure that going forward we test the interface carefully, and then focus testing on the Guidance parser/grammar parts (with mocks, like we do for JSON). This is what I was gesticulating towards in the test doc I wrote up.
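To illustrate the mock-based style mentioned above (a sketch under assumptions, not code from the repository; guidance's `models.Mock` is a deterministic stand-in engine, so the test exercises the grammar/parser without any real LLM):

```python
# Sketch of a grammar-focused test that needs no real model.
from guidance import models, select

def test_select_constrains_output():
    lm = models.Mock()  # deterministic stand-in for a real engine
    lm += "Pick a colour: " + select(["red", "green", "blue"], name="colour")
    # Whatever the mock emits, the grammar only admits one of the valid options.
    assert lm["colour"] in {"red", "green", "blue"}
```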
Sadly, I don't think we can get away with solely compliance-based testing on the interface and grammars. There's an entire class of issues where the code will function, but where we knock the model off-distribution and cause poor-quality generations at test time (e.g. a malformed chat template, bad token healing, etc.). This particular test is by no means the perfect way to model that behavior, but it was consistently failing, with Phi-3 generating nonsense, until we fixed up the chat template to align with pre-training.
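One cheap guard against the malformed-chat-template failure mode might be to compare guidance's rendered prompt against the tokenizer's own template (a sketch; the model id and messages are illustrative, and the guidance-side string to diff against is left abstract):

```python
# Hypothetical check: render a conversation through the model's own chat
# template and diff it against the prompt string guidance builds for the
# same turns.
from transformers import AutoTokenizer

def reference_chat_rendering(model_id: str = "microsoft/Phi-3-mini-4k-instruct") -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    messages = [
        {"role": "user", "content": "Count to five."},
        {"role": "assistant", "content": "1, 2, 3, 4, 5"},
    ]
    # tokenize=False returns the raw templated string rather than token ids.
    return tok.apply_chat_template(messages, tokenize=False)
```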
ML projects deviate from traditional software in that you will often have code that executes but has the wrong functional behavior, and it can be challenging to fit that into traditional software testing workflows. An alternative to writing flaky one-off tests like this is, for example, to maintain a benchmark suite that runs nightly and monitor model performance over time. Subtle bugs are still hard to catch in both worlds, of course, but running on many examples is almost certainly better than one-offs like this one?
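The aggregate-benchmark idea could start as small as the following (names are purely illustrative; real infrastructure would persist the scores and chart them over time rather than asserting on them):

```python
# Toy sketch: score many prompts and report a pass rate, instead of failing a
# PR gate on a single example.
from guidance import gen, user, assistant

def counting_pass_rate(model, prompts, expected_substrings, max_tokens=20):
    passes = 0
    for prompt, expected in zip(prompts, expected_substrings):
        lm = model
        with user():
            lm += prompt
        with assistant():
            lm += gen(name="answer", max_tokens=max_tokens)
        passes += expected in lm["answer"]
    return passes / len(prompts)
```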
There's definitely a lot more 'unit' testing which could and should be done on the guidance Model and parser.
That said, point taken that problems with 'knocking off the distribution' aren't going to show up just by 'testing the model interfaces'. Perhaps we do need a class of model-specific tests which are known to be flaky (these should be carefully split from the 'interface' type tests, which should always pass; a rough sketch of such a split follows below), and monitor them over time. We would need to investigate test-monitoring options, or be prepared to build our own (fun with security!), though. Once you start tolerating flakiness, there are two huge risks:
- Getting used to things 'only' getting 0.1% worse each day
- Persistently failing tests hiding in the 'acceptable failure rate'
The test analysis infrastructure would need to be able to address these.
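One shape such a split could take (the marker name and tests are hypothetical, not the repository's configuration): tag the known-flaky, model-quality tests with a custom pytest marker, run them in the nightly monitoring job, and keep the PR gate on the deterministic interface tests only.

```python
# Hypothetical split: the nightly job runs `pytest -m needs_model`,
# while the PR gate runs `pytest -m "not needs_model"`.
import pytest

@pytest.mark.needs_model  # custom marker, registered in pytest.ini / pyproject.toml
def test_phi3_counting_quality(phi3_model):
    """Known-flaky quality check against a real model; monitored over time."""
    ...

def test_grammar_interface():
    """Deterministic parser/grammar test (e.g. against a Mock engine); must always pass."""
    ...
```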
Thanks a bunch for the fixes! I'm noticing that the vast majority of the time, these breakages are due to external changes rather than internal ones. I'm glad that we get fast warnings in these cases, but it would be awesome to disentangle them from the requirements of PRs... What do you think? I don't think my sentiments are terribly out of line with your general testing proposal..?
I think that we should be writing more unit tests for sure, and cutting back on some of the sort of tests which are failing here (as I said, focus on the interface with Guidance, not performance on individual examples). However, given that we also want to keep up with releases of Transformers, LlamaCpp, etc., I think that broken gates like this are something we're going to have to live with to a certain extent. What I could do is run the 'unit' tests in a subworkflow, so we can see those go through, and if necessary override the gate when PR changes are confined to things which don't need a real LLM.
I know we're having an interesting discussion about test strategy, but are there any objections to merging this PR, and at least clearing the gate?
Glad to keep this conversation going... I really appreciate you putting so much thought into our tests. But please go ahead and merge!
The builds have broken in a number of interesting ways: llama-cpp-python has changed slightly, so update the various workflow files (not to mention the README).