[Build] Various fixes #936
Conversation
@@ -116,11 +112,10 @@ def test_phi3_transformers_orig():
def test_phi3_chat_basic(phi3_model: models.Model):
The change makes this test match the following 'unrolled' one.
lm += "You are a counting bot. Just keep counting numbers." | ||
with assistant(): | ||
lm += gen(name="five", max_tokens=10) | ||
lm += "1,2,3,4," + gen(name="five", max_tokens=20) |
Reasonable fix. That being said, tests of this sort feel inherently flaky... I'm not sure we should be testing the actual behaviors of models that we don't have control over..?
I agree, and in general, we don't look at the model output (unless it's supposed to be constrained). This is quite an old test, and I'm not sure if there was originally some other goal here.
With the separation of Model and Engine, we should make sure that going forward we test the interface carefully, and then focus testing on the Guidance parser/grammar parts (with mocks, like we do for JSON). This is what I was gesticulating towards in the test doc I wrote up.
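To illustrate the mock-based style mentioned above (a sketch under assumptions, not code from the repository; guidance's `models.Mock` is a deterministic stand-in engine, so the test exercises the grammar/parser without any real LLM):

```python
# Sketch of a grammar-focused test that needs no real model.
from guidance import models, select

def test_select_constrains_output():
    lm = models.Mock()  # deterministic stand-in for a real engine
    lm += "Pick a colour: " + select(["red", "green", "blue"], name="colour")
    # Whatever the mock emits, the grammar only admits one of the valid options.
    assert lm["colour"] in {"red", "green", "blue"}
```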
Sadly, I don't think we can get away with solely compliance-based testing on the interface and grammars. There's an entire class of issues where the code will function, but where we knock the model off-distribution and cause poor-quality generations at test time (e.g. a malformed chat template, bad token healing, etc.). This particular test is by no means the perfect way to model that behavior, but it was consistently failing, with Phi-3 generating nonsense, until we fixed up the chat template to align with pre-training.
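One cheap guard against the malformed-chat-template failure mode might be to compare guidance's rendered prompt against the tokenizer's own template (a sketch; the model id and messages are illustrative, and the guidance-side string to diff against is left abstract):

```python
# Hypothetical check: render a conversation through the model's own chat
# template and diff it against the prompt string guidance builds for the
# same turns.
from transformers import AutoTokenizer

def reference_chat_rendering(model_id: str = "microsoft/Phi-3-mini-4k-instruct") -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    messages = [
        {"role": "user", "content": "Count to five."},
        {"role": "assistant", "content": "1, 2, 3, 4, 5"},
    ]
    # tokenize=False returns the raw templated string rather than token ids.
    return tok.apply_chat_template(messages, tokenize=False)
```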
ML projects deviate from traditional software in that you will often have code that executes but has the wrong functional behavior, and it can be challenging to fit that into traditional software testing workflows. An alternative to writing flaky one-off tests like this is, for example, to maintain a benchmark suite that runs nightly and monitor model performance over time. Subtle bugs are still hard to catch in both worlds, of course, but running on many examples is almost certainly better than one-offs like this one?
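The aggregate-benchmark idea could start as small as the following (names are purely illustrative; real infrastructure would persist the scores and chart them over time rather than asserting on them):

```python
# Toy sketch: score many prompts and report a pass rate, instead of failing a
# PR gate on a single example.
from guidance import gen, user, assistant

def counting_pass_rate(model, prompts, expected_substrings, max_tokens=20):
    passes = 0
    for prompt, expected in zip(prompts, expected_substrings):
        lm = model
        with user():
            lm += prompt
        with assistant():
            lm += gen(name="answer", max_tokens=max_tokens)
        passes += expected in lm["answer"]
    return passes / len(prompts)
```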
There's definitely a lot more 'unit' testing which could and should be done on the guidance Model and parser.
That said, point taken that problems with 'knocking off the distribution' aren't going to show up just by 'testing the model interfaces'. Perhaps we do need a class of model-specific tests which are known to be flaky (these should be carefully split from the 'interface' type tests, which should always pass; a rough sketch of such a split follows below), and monitor them over time. We would need to investigate test-monitoring options, or be prepared to build our own (fun with security!), though. Once you start tolerating flakiness, there are two huge risks:
- Getting used to things 'only' getting 0.1% worse each day
- Persistently failing tests hiding in the 'acceptable failure rate'
The test analysis infrastructure would need to be able to address these.
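One shape such a split could take (the marker name and tests are hypothetical, not the repository's configuration): tag the known-flaky, model-quality tests with a custom pytest marker, run them in the nightly monitoring job, and keep the PR gate on the deterministic interface tests only.

```python
# Hypothetical split: the nightly job runs `pytest -m needs_model`,
# while the PR gate runs `pytest -m "not needs_model"`.
import pytest

@pytest.mark.needs_model  # custom marker, registered in pytest.ini / pyproject.toml
def test_phi3_counting_quality(phi3_model):
    """Known-flaky quality check against a real model; monitored over time."""
    ...

def test_grammar_interface():
    """Deterministic parser/grammar test (e.g. against a Mock engine); must always pass."""
    ...
```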
Thanks a bunch for the fixes! I'm noticing that the vast majority of the time, these breakages are due to external changes rather than internal ones. I'm glad that we get fast warnings in these cases, but it would be awesome to disentangle them from the requirements of PRs... What do you think? I don't think my sentiments are terribly out of line with your general testing proposal..?
I think that we should be writing more unit tests for sure, and cutting back on some of the sort of tests which are failing here (as I said, focus on the interface with Guidance, not performance on individual examples). However, given that we also want to keep up with releases of Transformers, LlamaCpp, etc., I think that broken gates like this are something we're going to have to live with to a certain extent. What I could do is run the 'unit' tests in a subworkflow, so we can see those go through, and if necessary override the gate when PR changes are confined to things which don't need a real LLM.
I know we're having an interesting discussion about test strategy, but are there any objections to merging this PR, and at least clearing the gate?
Glad to keep this conversation going... I really appreciate you putting so much thought into our tests. But please go ahead and merge!
The builds have broken in a number of interesting ways: llama-cpp-python has changed slightly, so update the various workflow files (not to mention the README).