A/B test reports #309
Conversation
@crusaderky it looks like you were hitting some permissions issues. Is it the same as #293? I've been unable to reproduce, but you might try reverting #240 to see if that helps.
for test in tests:
    if (func := getattr(mod, test, None)) and callable(func):
        # FIXME missing decorators, namely @pytest.mark.parametrize
        source[fname[len("tests/") :] + "::" + test] = inspect.getsource(func)
I'd rather continue to be defensive here with a try/except.
Why? It makes a lot more sense to me to have the whole test suite crash if someone adds a new import in a test file and doesn't add it to ci/environment-dashboard.yml.
Because I view this code-scraping as a fairly fragile hack, and weird stuff can happen on import, especially when pytest is involved. For example, a pytest.skip at the module level raises an exception, and I don't want to force everyone to do non-idiomatic pytest stuff on account of this static report. To me, having a missing code snippet for a test is not a big deal.
I'm also unhappy about having to add stuff to ci/environment-dashboard.yml in the first place just to be able to use inspect. If you have ideas about how to get snippets without relying on importing, I'm certainly open to them. But to me this seemed preferable to trying to parse .py files.
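A hedged sketch (not this PR's implementation) of the alternative mentioned above: pulling test snippets out of the .py files with ast instead of importing them, so a module-level pytest.skip or a missing dependency in ci/environment-dashboard.yml can't break snippet collection. The function name and the "tests/" layout are assumptions for illustration.

import ast
from pathlib import Path


def collect_test_sources(tests_dir="tests"):
    """Map 'path/to/test_file.py::test_name' to the test's source text."""
    sources = {}
    for fname in Path(tests_dir).glob("**/test_*.py"):
        text = fname.read_text()
        for node in ast.parse(text).body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name.startswith("test_"):
                # Note: node.lineno points at the "def" line, so decorators such as
                # @pytest.mark.parametrize are still not captured without extra work.
                sources[f"{fname.as_posix()}::{node.name}"] = ast.get_source_segment(text, node)
    return sources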
Partially reverted
dashboard.py (Outdated)
df = df.sort_values("runtime", key=runtime_sort_key_pd)
# Altair will re-sort alphabetically, unless it's explicitly asked to sort by
# something else. Could not find an option to NOT sort.
y = altair.Y("runtime", sort=altair.EncodingSortField(field="order"))
Does sort=None work?
it works
No, the issue was that
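For reference, a minimal self-contained sketch (made-up runtime names and numbers) of the two options discussed here: the explicit EncodingSortField from the diff above, and sort=None, which tells Altair to keep the DataFrame's row order instead of re-sorting the axis alphabetically.

import altair
import pandas as pd

df = pd.DataFrame(
    {
        "runtime": ["coiled-0.0.4", "coiled-latest", "coiled-upstream"],
        "order": [2, 1, 0],
        "wall_time": [35.0, 31.5, 30.2],
    }
)

# Option 1: sort the y axis by an explicit helper column
y = altair.Y("runtime", sort=altair.EncodingSortField(field="order"))

# Option 2: pre-sort the DataFrame and disable Altair's alphabetical re-sort
df = df.sort_values("order")
y = altair.Y("runtime", sort=None)

chart = altair.Chart(df).mark_bar().encode(x="wall_time", y=y)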
- name: Download software environment assets
  uses: actions/download-artifact@v3
  with:
    name: software-environment-py3.9
Where are you using this? Hopefully these assets will be going away soon
In the following change (line 471)
Ready for review (assuming CI passes).
@crusaderky it looks like there is a legitimate error in the static site generation script
Discovered 36 tests
Generating static/coiled-upstream-py3.10.html
Generating static/coiled-upstream-py3.9.html
Generating static/coiled-upstream-py3.8.html
Generating static/coiled-latest-py3.10.html
Generating static/coiled-latest-py3.9.html
Generating static/coiled-latest-py3.8.html
Generating static/coiled-0.1.0-py3.10.html
Generating static/coiled-0.1.0-py3.9.html
Generating static/coiled-0.1.0-py3.8.html
Generating static/coiled-0.0.4-py3.10.html
Generating static/coiled-0.0.4-py3.9.html
Generating static/coiled-0.0.4-py3.8.html
Generating static/coiled-0.0.3-py3.9.html
Generating static/coiled-0.0.3-py3.8.html
Generating static/coiled-0.0.3-py3.7.html
Generating static/AB_by_test.html
Generating static/AB_by_runtime.html
Traceback (most recent call last):
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 507, in <module>
    main()
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 497, in main
    align_to_baseline(latest_run, baseline),
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 80, in align_to_baseline
    df_baseline.set_index("fullname")
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 961, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1147, in _getitem_tuple
    return self._multi_take(tup)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1098, in _multi_take
    d = {
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1099, in <dictcomp>
    axis: self._get_listlike_indexer(key, axis)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1330, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5796, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5859, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['stability/test_shuffle.py::test_shuffle_simple', 'stability/test_array.py::test_rechunk_in_memory', 'stability/test_shuffle.py::test_shuffle_parquet'] not in index"
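For context, a hedged illustration (made-up data, not the actual align_to_baseline code) of the failure mode in the traceback above: label-based indexing with .loc raises KeyError if any requested label is missing from the index, whereas .reindex returns NaN rows for the missing tests instead.

import pandas as pd

df_baseline = pd.DataFrame(
    {"fullname": ["tests/a.py::test_a", "tests/b.py::test_b"], "duration": [1.0, 2.0]}
).set_index("fullname")

wanted = ["tests/a.py::test_a", "stability/test_shuffle.py::test_shuffle_simple"]

# df_baseline.loc[wanted]  # raises KeyError: labels not found in the index
aligned = df_baseline.reindex(wanted)  # missing tests become NaN rows instead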
@jrbourbeau fixed now
@ncclementi @ian-r-rose I called my latest commit
I think things are working as expected. The
@crusaderky this is a bit of a bad design in the naming. What happened is that when you triggered test-upstream, "latest" ended up with the nightlies - you can see that the nightlies were installed here: https://github.com/coiled/coiled-runtime/runs/8237542627?check_suite_focus=true#step:5:20. That logic is here: https://github.com/coiled/coiled-runtime/blob/1614ab58128e3e3dcfa7d213db2c88e7c55ac6d4/ci/create_latest_runtime_meta.py#L41-L49
Re: noise of the samples - we can also look into running some tests multiple times. Particularly the ones that have very small runtimes (<30s) are likely more prone to noise, and we can afford to run them a couple of times. Obviously, all of this is a follow-up topic. I think this is already a great first step.
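A hypothetical sketch of what repeating a fast test could look like (illustrative test name; the dummy parametrize argument exists only to collect several runtime samples per CI run):

import pytest


@pytest.mark.parametrize("repeat", range(3))
def test_small_shuffle(repeat):
    # run the same small workload three times so its runtime sample is less noisy
    ...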
A t-test is reasonable if your null hypothesis is that the data are normally distributed; if you don't have that guarantee (for example, a long-tailed distribution due to outliers), then you're better off with a non-parametric test such as Kolmogorov-Smirnov (available as scipy.stats.kstest).
Data is definitely not normally distributed: #315 |
I think we should discuss this in a dedicated issue. In most situations the normal distribution is a fair assumption. My worry about more complex tests is that we don't have a lot of measurements to start with. We're not actually comparing two distributions, since we often only have a single measurement.
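For illustration, a small sketch with made-up runtime samples (not project data) of the two approaches mentioned: a Welch t-test versus the non-parametric two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp is the two-sample counterpart of scipy.stats.kstest).

from scipy import stats

baseline = [12.1, 11.8, 12.4, 11.9, 12.2]   # hypothetical baseline runtimes (s)
candidate = [12.0, 12.3, 11.7, 19.5, 12.1]  # hypothetical A/B runtimes with one outlier

t_stat, t_p = stats.ttest_ind(baseline, candidate, equal_var=False)  # Welch's t-test
ks_stat, ks_p = stats.ks_2samp(baseline, candidate)                  # distribution-free
print(f"t-test p={t_p:.3f}, KS p={ks_p:.3f}")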
From what I can tell this looks like it's ready for a final review. @ian-r-rose do you think you'll have time to take a look at the dashboard.py changes? Or maybe you already have.
There's been some discussion around merge conflicts with #235, but from what I can tell they will be relatively minimal, so I'm okay merging this PR first if needed.
I only reviewed the code on a high level. If the plots turn out this way, I'm good with this.
In scope
Out of scope
The plots highlight that the performance metrics are heavily affected by noise.
In some cases this looks like high variance between runs (which can be seen in the historical plots); in others, the difference is consistent throughout the history but is not justified by a difference in runtime, so I suspect there's some hidden warm-up cost that differs between the runs.

For example, latest and 0.1.0 only differ by the addition of pynvml, but the A/B charts comparing the two (top right plot above) show major differences all over the place - which are all noise. The memory usage plots look just as jittery. Ideally this chart should be used to make the informed decision to finalize a new release (all bars negative or almost zero), and it is also twitter gold - but at the moment it's not usable due to the underlying data.

On one side, I can incorporate historical data into these plots to show brackets of variance for the baseline; I will in a successive PR. However, such brackets will just tell us what we can already see in the historical graphs: that most tests are swinging wildly and that a +/- 50% runtime difference is nothing to be surprised about.
We need to sit down and find the best way to actually reduce variance in runtime and memory usage.