
Snapshot discovery and reading takes quadratic time #539

Closed
huonw opened this issue Aug 16, 2021 · 9 comments

Comments

@huonw
Contributor

huonw commented Aug 16, 2021

Describe the bug

We're using syrupy, and it works well. Thank you!

Unfortunately we have nearly 500 snapshots, and our test runs are starting to get quite slow. It seems like syrupy makes testing take quadratic time with respect to the number of snapshots.

To reproduce

Create this file

# test_performance.py
import pytest
import os

SIZE = int(os.environ.get("SIZE", 1000))

@pytest.mark.parametrize("x", range(SIZE))
def test_performance(x, snapshot):
    assert x == snapshot
    # assert x == x

Run, for instance:

for s in 100 500 1000 2000; do
    echo "size = $s"
    # create the snapshots
    SIZE=$s pytest test_performance.py --snapshot-update
    # just check them
    SIZE=$s pytest test_performance.py
done

The times reported by pytest scale quadratically with the number of tests/snapshots (O(size**2)). I think this is because the number of read_file/_read_snapshot_fossil calls and discover_snapshots calls, as reported by python -m cProfile -m py.test test_performance.py, scales linearly (O(size)) with the number of tests/snapshots, and the work required for each call also scales linearly (because, for the snapshots, the files contain O(size) data).

The times and numbers of calls (for the invocation that just checks the snapshots, not the one that updates them) are something like:

| size | time (seconds) | discover_snapshots calls | read_file calls |
| ---: | ---: | ---: | ---: |
| 100 | 0.15 | 200 | 300 |
| 500 | 1.73 | 1000 | 1500 |
| 1000 | 6.24 | 2000 | 3000 |
| 2000 | 21.84 | 4000 | 6000 |

Things of note:

  • each doubling from 500 -> 1000 -> 2000 multiplies the time by approximately 4 (a classic marker of quadratic performance)
  • the number of calls seems rather large: 2 discover_snapshots calls and 3 read_file calls per test/snapshot

Expected behavior

The test runs should be linear in the number of tests/snapshots. For instance, if the assert x == x line is used (and the snapshot fixture removed) instead of assert x == snapshot, the test run is linear: even SIZE=10000 finishes in < 4s on my machine.

It seems like this could be handled by discovering the snapshots once (or once per file) and reading each snapshot file once too.

Environment (please complete the following information):

  • OS: macOS
  • Syrupy Version: 1.4
  • Python Version: 3.8

@noahnu
Collaborator

noahnu commented Aug 16, 2021

It definitely shouldn't be quadratic. I'll take a look this week. I suspect there's a few easy optimizations we can do. We also have benchmarking setup in the repo.

@noahnu
Collaborator

noahnu commented Aug 18, 2021

This was a really good catch. As I mentioned above, there are a number of optimizations that can be made. I'll leave this GitHub issue open as I merge in performance improvements over the next couple of weeks.

The first improvement is a change to only discover snapshots once per test file rather than once per assertion. This should have a significant impact on your use case, especially if using parameterization. See https://github.com/tophat/syrupy/releases/tag/v1.4.2

I'll have to think about some options around hoisting the read/write snapshot logic out of the assertions, and into the session. Perhaps we batch the writes, and memoize the reads? There might be a memory concern for sufficiently large snapshots, however I think we can optimize for performance here.
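
A minimal sketch of the write-batching half of that idea (hypothetical code, not syrupy's actual API: the SnapshotWriteBatch class and the write_file callback are made up here, just to illustrate doing one write per file per session instead of one per assertion):

```python
# Hypothetical sketch: queue snapshot writes during the session and flush
# each snapshot file exactly once at the end, instead of once per assertion.
from collections import defaultdict


class SnapshotWriteBatch:
    def __init__(self):
        # filepath -> {snapshot_name: serialized_data}
        self._pending = defaultdict(dict)

    def queue_write(self, filepath, name, data):
        # Called once per assertion; no file I/O happens here.
        self._pending[filepath][name] = data

    def flush(self, write_file):
        # Called once at session finish: one merge-and-write per file.
        for filepath, snapshots in self._pending.items():
            write_file(filepath, snapshots)
        self._pending.clear()
```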

@huonw
Contributor Author

huonw commented Aug 20, 2021

> The first improvement is a change to only discover snapshots once per test file rather than once per assertion. This should have a significant impact on your use case, especially if using parameterization. See https://github.com/tophat/syrupy/releases/tag/v1.4.2

Thanks for the quick improvement! It leads to a 33% improvement in the test above and a 25% improvement in our real test suite (68s -> 51s), which is definitely a useful start.

However, I'm not entirely sure it's working as intended: I installed the new version:

$ pip freeze | grep syrupy
syrupy==1.4.2

I reran the test in the issue, and the time and the number of calls to discover_snapshots/read_file have definitely improved (by 50% and 33% respectively).

| size | time (seconds) | discover_snapshots calls | read_file calls |
| ---: | ---: | ---: | ---: |
| 100 | 0.12 | 101 | 201 |
| 500 | 1.24 | 501 | 1001 |
| 1000 | 4.20 | 1001 | 2001 |
| 2000 | 15.70 | 2001 | 4001 |
  1. discover_snapshots is still called more than once per file (i.e., more than one call in this example, which has only a single test file): pstats's callers functionality suggests the culprit might be a register_request function. For instance, for SIZE=2000:
    % callers discover_snapshots
    ...
    Function                                              was called by...
                                                              ncalls  tottime  cumtime
    .../syrupy/extensions/base.py:91(discover_snapshots)  <-       1    0.000    0.005  .../syrupy/report.py:83(__attrs_post_init__)
                                                                2000    0.009   11.034  .../syrupy/session.py:81(register_request)
    
  2. read_file is by far the dominant factor, so the improvement is close to the reduction in read_file calls (33%). For instance, for SIZE=2000, 21.5s out of the 24.2s test run time is spent in read_file (running under cProfile seems to add significant overhead here, which is why this total is higher than the 15.7s in the table above).

Assuming the extra calls in point 1 can be eliminated (leaving 1 or 2 discover_snapshots calls per file, and number of snapshots + number of files × (1 or 2) read_file calls), this still looks like it could be a ~50% reduction in our real-world test time, which would be awesome (even without the potentially harder work of resolving the overall quadratic behaviour)!

> I'll have to think about some options around hoisting the read/write snapshot logic out of the assertions, and into the session. Perhaps we batch the writes, and memoize the reads? There might be a memory concern for sufficiently large snapshots, however I think we can optimize for performance here.

That seems reasonable to me. I wonder if throwing @functools.lru_cache on the discovery/read functions might resolve this with little effort?
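
For illustration, a minimal sketch of what that could look like; read_snapshot_file here is a made-up stand-in, not syrupy's actual function:

```python
# Sketch only: read_snapshot_file is a hypothetical stand-in for the real
# snapshot-file read, just to illustrate the lru_cache idea.
import functools


@functools.lru_cache(maxsize=None)
def read_snapshot_file(filepath: str) -> str:
    # The expensive part (reading/parsing the whole file) now runs once per
    # unique path; later assertions against the same file hit the cache.
    with open(filepath) as f:
        return f.read()


# Any code path that rewrites a snapshot file (e.g. --snapshot-update) would
# need to call read_snapshot_file.cache_clear() to avoid serving stale data.
```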

In terms of memory concerns, I think the memory impact should be reasonably controlled, as the cached data would be roughly proportional to the total snapshot file size, which is generally relatively small.

For instance, in our real-world test suite, we have 7.5MB of .ambr files in total, and the largest individual one is 2.6MB, which would be totally fine to hold in memory for us. We're likely to be reducing some of that size in future too, because a human can't validate that snapshots that large look right, or even that a diff is correct.

Also, my feeling is that read performance is a far more important focus than write performance for now, as I'd expect read-only test runs to happen far more often than writing ones (e.g. CI is read-only, and most tests even in a --snapshot-update run would only read, without writing, since in my experience only a few snapshots change at a time).

@noahnu
Collaborator

noahnu commented Aug 20, 2021

I didn't notice any impact from caching the read calls in the ambr serializer. If anything, performance was slightly worse from the additional cache overhead. Ended up repeating the logic around caching "discovery" across "report" and "session". When I tried to memoize the discover_snapshot method, I didn't notice any improvements, which leads me to believe the time spent is actually in the looping + dict merging.

With my latest change, cProfile reports 2 discover_snapshot calls rather than 1001.

https://github.com/tophat/syrupy/releases/tag/v1.4.3

It should at least be linear now

@huonw
Contributor Author

huonw commented Aug 20, 2021

That's awesome, thank you! 👏

1.4.3 takes our test suite to 35s (-49% overall), way better!

Here's the updated table; it's much faster, but still quadratic:

| size | time (seconds) | discover_snapshots calls | read_file calls |
| ---: | ---: | ---: | ---: |
| 100 | 0.09 | 2 | 102 |
| 500 | 0.75 | 2 | 502 |
| 1000 | 2.42 | 2 | 1002 |
| 2000 | 8.24 | 2 | 2002 |

> I didn't notice any impact from caching the read calls in the ambr serializer. If anything, performance was slightly worse from the additional cache overhead. Ended up repeating the logic around caching "discovery" across "report" and "session". When I tried to memoize the discover_snapshot method, I didn't notice any improvements, which leads me to believe the time spent is actually in the looping + dict merging.

I experimented a little bit and found that putting caching into the DataSerializer.read_file method made a big difference, seemingly resolving the quadratic behaviour when reading: #543.
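
Roughly, the shape of such a class-level read cache (a simplified illustrative sketch, not the actual diff in #543; _file_cache and _parse_file are made-up names):

```python
# Simplified sketch of a read cache keyed by file path; not the actual
# implementation in #543.
class DataSerializer:
    _file_cache = {}  # filepath -> parsed snapshot data

    @classmethod
    def read_file(cls, filepath):
        cached = cls._file_cache.get(filepath)
        if cached is None:
            cached = cls._parse_file(filepath)  # the formerly repeated O(size) work
            cls._file_cache[filepath] = cached
        return cached

    @classmethod
    def _parse_file(cls, filepath):
        # Stand-in for the real .ambr parsing.
        with open(filepath) as f:
            return f.read()
```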

@noahnu
Collaborator

noahnu commented Aug 20, 2021

If we're saying the read call is the issue, then there's a linear relationship between the number of reads and the number of assertions; in fact, there's one read per assertion.

I'm not sure where the quadratic relationship is coming from. Have you tried with substantially more test cases? With and without syrupy?

Also feel free to join our discord: https://discord.gg/kZYy8agD

@huonw
Contributor Author

huonw commented Aug 20, 2021

Yeah, unfortunately each file is generally O(number of assertions) in size, because it stores info for each assertion, so when syrupy reads the whole file again it's doing O(number of assertions) work (since it has to at least touch every byte in the file to find the names). That is, O(number of assertions) assertions each doing O(number of assertions) parsing work, leading to quadratic O(number of assertions**2) behaviour for each file.

The simple test in the issue is an extreme example, with up to 2000 assertions in a single file, but it has an impact even for our real-world ~450 snapshots spread across 25 files.

> Have you tried with substantially more test cases? With and without syrupy?

Here's the test with SIZE=10000:

| version | time (s) |
| --- | ---: |
| syrupy==1.4.3 | 201 |
| syrupy #543 | 4.64 |
| no syrupy | 3.57 |

(syrupy 1.4.3 takes approximately 5² = 25 × the time it takes for SIZE=2000 (8.24s), matching expectations for quadratic behaviour. The no-syrupy version is the one described in the issue, with the snapshot fixture removed and the assertion changed to assert x == x.)
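
As a quick back-of-the-envelope check of that scaling (numbers taken from the tables above):

```python
# If time grows like size**2, going from SIZE=2000 to SIZE=10000 (5x more
# tests) should multiply the time by 5**2 = 25.
t_2000 = 8.24                                    # seconds, syrupy 1.4.3, from the table above
predicted_10000 = t_2000 * (10000 / 2000) ** 2   # 8.24 * 25 = 206 s
print(predicted_10000)                           # ~206 s, close to the measured 201 s
```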

> Also feel free to join our discord: https://discord.gg/kZYy8agD

Sorry, I'd prefer not to do so for now, but thanks for the invitation! 😄

@noahnu
Collaborator

noahnu commented Aug 20, 2021

Based on your metrics, it seems performance is under control now, or at a minimum it's no longer quadratic, so I'll close this issue. If you have other ideas/requests, we're always open to contributors.

@noahnu noahnu closed this as completed Aug 20, 2021
@huonw
Contributor Author

huonw commented Aug 20, 2021

Sounds good. I think writing snapshots is still quadratic (in the number of written snapshots), but that's not something we're struggling with, unlike the reading performance, which we've now resolved here 🎉

Thanks for being so responsive!
