filter: Improve speed of checking duplicates #1466

victorlin · 2024-05-17T19:21:01Z

Description of proposed changes

pandas's isin(set) converts the set into a list, making each value check scale linearly with the size of the set. That is costly here because the set grows with each metadata chunk processed, so the entire operation scales with (size of chunk * size of metadata).

Do the equivalent, but much faster, by first making a set from the current chunk's index, then intersecting it with the set of previously seen IDs. Both steps scale linearly with the size of the chunk, not the size of the metadata, so the entire operation scales with (2 * size of chunk).
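As a rough illustration of the set-based check described above, here is a minimal sketch in plain Python. It is not the code from this PR: `find_cross_chunk_duplicates` is a hypothetical helper, and each chunk's index is modeled as a simple list of ID strings rather than a pandas Index. It only detects IDs repeated across chunks; duplicates within a single chunk are out of scope for the sketch.

```python
from typing import Iterable, List, Set


def find_cross_chunk_duplicates(chunks: Iterable[List[str]]) -> Set[str]:
    """Return IDs that appear in more than one chunk.

    Hypothetical helper for illustration; each chunk is a list of IDs
    standing in for the index of one metadata chunk.
    """
    seen: Set[str] = set()        # IDs from all previously processed chunks
    duplicates: Set[str] = set()  # IDs found in an earlier chunk as well

    for chunk_ids in chunks:
        # Step 1: build a set from the current chunk's IDs.
        # Cost grows with the chunk size only.
        chunk_set = set(chunk_ids)

        # Step 2: intersect with the previously seen IDs.
        # CPython iterates over the smaller of the two sets, so the cost
        # is bounded by the chunk size, not by len(seen).
        duplicates |= chunk_set & seen

        # Record this chunk's IDs for later chunks.
        seen |= chunk_set

    return duplicates


if __name__ == "__main__":
    example_chunks = [["a", "b"], ["c", "d"], ["b", "e"], ["f", "a"]]
    print(sorted(find_cross_chunk_duplicates(example_chunks)))
```

The point of the design is that no step walks the full set of seen IDs, so the per-chunk cost stays roughly constant as more metadata is processed.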

Related issue(s)

Prompted by Slack discussion
Follow-up to Properly error on duplicates in parse and filter, handle AugurError globally #918

Checklist

Checks pass (notably filter-metadata-duplicates-error.t)
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

pandas's isin(set) converts the set into a list, making each value check scale linearly with the size of the set.¹ In the usage here, it's unfortunate because the size of the set grows with each metadata chunk processed. That means the entire operation scales with (size of chunk * size of metadata).

Do the equivalent but much faster by first making a set from the current chunk's index then intersecting that with the set of previously seen IDs. This is faster because both of these steps scale linearly² with the size of the chunk, not the size of the metadata, meaning the entire operation scales with (2 * size of chunk).

¹ <pandas-dev/pandas#25507>
² <https://wiki.python.org/moin/TimeComplexity#set>

codecov · 2024-05-17T19:40:00Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.85%. Comparing base (e0f88ca) to head (4923408).
Report is 240 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1466      +/-   ##
==========================================
+ Coverage   68.82%   68.85%   +0.03%     
==========================================
  Files          69       69              
  Lines        7595     7607      +12     
  Branches     1860     1861       +1     
==========================================
+ Hits         5227     5238      +11     
  Misses       2086     2086              
- Partials      282      283       +1


huddlej · 2024-05-17T20:57:33Z

Nice one, @victorlin!

trvrb · 2024-05-18T18:19:55Z

Very clever improvement. Thanks @victorlin!

victorlin self-assigned this May 17, 2024

Update changelog

4923408

victorlin marked this pull request as ready for review May 17, 2024 19:30

victorlin requested a review from a team May 17, 2024 19:30

genehack approved these changes May 17, 2024


victorlin merged commit c241569 into master May 17, 2024
20 checks passed

victorlin deleted the victorlin/faster-duplicate-check branch May 17, 2024 22:08

victorlin mentioned this pull request Aug 9, 2024

Speed up augur filter without replacing Pandas #1573

Open

5 tasks
