Skip to content
This repository was archived by the owner on Feb 12, 2025. It is now read-only.

snapshotter: use wcmatch.glob.globmatch function #154

Merged
merged 2 commits into from
Nov 14, 2023

Conversation

joelynch
Copy link
Contributor

@joelynch joelynch commented Nov 3, 2023

All existing python glob match functions equivalent to match(path: Path, glob: str) -> bool have serious issues.

  1. Path.match does not accept ** until python 3.13, see here. We need this for ClickHouse.
  2. fnmatch accepts *, but this character is translated to regex .* and matches a / character, which is not what we want.

Neither of these functions passed the test I wrote.

To fix this mess, I've introduced a new 3rd party library wcmatch to handle the glob matching. I did not want to have to use a 3rd party library, but the matching logic is pretty complex and I did not want to write it myself. In the (very distant) future when our minimum supported python is python3.13 we can strip this out.

@joelynch joelynch force-pushed the joelynch/glob-pain branch 2 times, most recently from 5fbbe41 to 341351f Compare November 6, 2023 08:47
@kmichel-aiven
Copy link
Contributor

kmichel-aiven commented Nov 6, 2023

If we have to add a dependency, then so be it :(

I'm worried about two things :

  • wcmatch.glob.globmatch functions does a lot of work to build its patterns every time it's called, and we call that for every file, using and storing https://facelessuser.github.io/wcmatch/glob/#translate before iterating on all files could be good.
  • Once we have that lib, we have the risk of relying on it (adding patterns only supported by that lib and not by Python 3.13).

I'm wondering if we could just replace the f"shadow/{escaped_freeze_name}/store/**/*" group of ClickHouse with two groups ?
f"shadow/{escaped_freeze_name}/store/*/*/*/*" and f"shadow/{escaped_freeze_name}/store/*/*/*/*/*" and avoid the extra dependency.
The groups handle two paths:

  • whatever/disk/store/3-first-char-of-table-uuid/table-uuid/part-name/file-name
  • whatever/disk/store/3-first-char-of-table-uuid/table-uuid/part-name/projection-name/file-name

As far as I know, the only case where we have subfolder in frozen parts is if we have projections in the table:
https://github.com/ClickHouse/ClickHouse/blob/af469cecb6d5577e960446eb3766318ec83843de/src/Storages/MergeTree/DataPartStorageOnDiskFull.cpp#L32-L40

In any case, we should have some projection in setup_cluster_content and a related test to make sure they are restored. The bug you uncovered probably means they are lost during backup.

(m3db also uses ** but at the beginning of the path, if I understand the bug report, that case is correctly supported by the existing code ?)

@joelynch
Copy link
Contributor Author

joelynch commented Nov 6, 2023

Once we have that lib, we have the risk of relying on it (adding patterns only supported by that lib and not by Python 3.13).

Yes. I have tried to quarantine its use to the utils package so that we do not rely on it too much. This is a worry though.

wcmatch.glob.globmatch functions does a lot of work to build its patterns every time it's called, and we call that for every file, using and storing facelessuser.github.io/wcmatch/glob/#translate before iterating on all files could be good.

This is a good point, it is like 20x faster to compile it first and this shouldn't be too much of a change. path_matches_glob here is not too much slower than Path.match though interestingly.

In [6]: %timeit Path('hello/asdf/sdf/world').match('hello/*/*/world')
3.26 µs ± 34.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit path_matches_glob('hello/asdf/sdf/world', 'hello/*/*/world')
3.63 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit path_matches_glob('hello/asdf/sdf/world', 'hello/**/world')
3.96 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [13]: compiled = re.compile(wcmatch.glob.translate('hello/**/world')[0][0])

In [14]: %timeit compiled.match('hello/asdf/sdf/world') is not None
196 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

(m3db also uses ** but at the beginning of the path, if I understand the bug report, that case is correctly supported by the existing code ?)

Path.match is very unusual IMO and very different from bash globs which I think is the intuition everyone has for what the globs in the config file are supposed to mean. The behavior of '*' at the prefix also acts the same as '**' which is really weird. I think I would prefer not to use this function until it is fixed.

In [15]: Path('abc/123').match('**')
Out[15]: True

In [16]: Path('abc/123').match('*')
Out[16]: True

In [20]: Path('abc/123/456').match('abc/**')
Out[20]: False

In [21]: Path('abc/123/456').match('abc/*')
Out[21]: False

In [23]: Path('abc/123/456').match('**/456')
Out[23]: True

In [24]: Path('abc/123/456').match('*/456')
Out[24]: True

In [25]: Path('abc/123/456/456').match('abc/*/456')
Out[25]: False

@kmichel-aiven
Copy link
Contributor

Once we have that lib, we have the risk of relying on it (adding patterns only supported by that lib and not by Python 3.13).

Yes. I have tried to quarantine its use to the utils package so that we do not rely on it too much. This is a worry though.

wcmatch.glob.globmatch functions does a lot of work to build its patterns every time it's called, and we call that for every file, using and storing facelessuser.github.io/wcmatch/glob/#translate before iterating on all files could be good.

This is a good point, it is like 20x faster to compile it first and this shouldn't be too much of a change. path_matches_glob here is not too much slower than Path.match though interestingly.

In [6]: %timeit Path('hello/asdf/sdf/world').match('hello/*/*/world')
3.26 µs ± 34.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit path_matches_glob('hello/asdf/sdf/world', 'hello/*/*/world')
3.63 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit path_matches_glob('hello/asdf/sdf/world', 'hello/**/world')
3.96 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [13]: compiled = re.compile(wcmatch.glob.translate('hello/**/world')[0][0])

In [14]: %timeit compiled.match('hello/asdf/sdf/world') is not None
196 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

👍

In addition to pure speed that is visible in microbenchmark, that will avoid thrashing memory with thousand of short lived objects.

(m3db also uses ** but at the beginning of the path, if I understand the bug report, that case is correctly supported by the existing code ?)

Path.match is very unusual IMO and very different from bash globs which I think is the intuition everyone has for what the globs in the config file are supposed to mean. The behavior of '*' at the prefix also acts the same as '**' which is really weird. I think I would prefer not to use this function until it is fixed.

In [15]: Path('abc/123').match('**')
Out[15]: True

In [16]: Path('abc/123').match('*')
Out[16]: True

In [20]: Path('abc/123/456').match('abc/**')
Out[20]: False

In [21]: Path('abc/123/456').match('abc/*')
Out[21]: False

In [23]: Path('abc/123/456').match('**/456')
Out[23]: True

In [24]: Path('abc/123/456').match('*/456')
Out[24]: True

In [25]: Path('abc/123/456/456').match('abc/*/456')
Out[25]: False

That's fair, let's use the library then.

@joelynch joelynch force-pushed the joelynch/glob-pain branch 6 times, most recently from 6a66cbf to 1db69cf Compare November 6, 2023 13:11
@joelynch joelynch marked this pull request as ready for review November 6, 2023 13:12
@joelynch
Copy link
Contributor Author

joelynch commented Nov 6, 2023

Okay this one is ready now I think, I have put all the SnapshotGroup glob stuff into a little library astacus/node/snapshot_groups.py.

@joelynch joelynch force-pushed the joelynch/glob-pain branch 3 times, most recently from 605ed0e to 31c89d1 Compare November 7, 2023 14:15
@joelynch
Copy link
Contributor Author

joelynch commented Nov 8, 2023

Verified

This commit was signed with the committer’s verified signature.
joelynch Joe Lynch
All existing python glob match functions seem to have issues.
Path.match does not accept '**'.  When using fnmatch, '*' matches a /
character which is not what we want.  To fix this mess, I've
introduced a new 3rd party library wcmatch to handle the globmatching.

Verified

This commit was signed with the committer’s verified signature.
joelynch Joe Lynch
* also fix bug with not reporting progress "done"
@codecov-commenter
Copy link

Codecov Report

Attention: 516 lines in your changes are missing coverage. Please review.

Comparison is base (7477bfd) 87.02% compared to head (63b5063) 87.25%.
Report is 184 commits behind head on master.

Files Patch % Lines
...tion/coordinator/plugins/clickhouse/test_plugin.py 17.04% 73 Missing ⚠️
...gration/coordinator/plugins/clickhouse/conftest.py 56.09% 54 Missing ⚠️
astacus/coordinator/plugins/base.py 79.11% 47 Missing ⚠️
...cus/coordinator/plugins/cassandra/restore_steps.py 67.59% 35 Missing ⚠️
astacus/node/api.py 66.34% 35 Missing ⚠️
astacus/node/download.py 54.00% 23 Missing ⚠️
...dinator/plugins/clickhouse/async_object_storage.py 75.29% 21 Missing ⚠️
astacus/node/memory_snapshot.py 90.55% 17 Missing ⚠️
astacus/client.py 38.46% 16 Missing ⚠️
astacus/node/cassandra.py 81.60% 16 Missing ⚠️
... and 39 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #154      +/-   ##
==========================================
+ Coverage   87.02%   87.25%   +0.22%     
==========================================
  Files         122      141      +19     
  Lines        7648     9980    +2332     
==========================================
+ Hits         6656     8708    +2052     
- Misses        992     1272     +280     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kmichel-aiven kmichel-aiven merged commit f788129 into master Nov 14, 2023
@kmichel-aiven kmichel-aiven deleted the joelynch/glob-pain branch November 14, 2023 17:07
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants