-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write sharded repodata #161
base: main
Are you sure you want to change the base?
Conversation
…hards for patching
@@ -91,6 +92,23 @@ | |||
repodata_version=2 which is supported in conda 24.5.0 or later. | |||
""", | |||
) | |||
@click.option( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could replace this with an "add-only" or "no-remove" option, that would keep packages in the index even if they are not found in the filesystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two options are more about testing from a backup of the conda-forge database, they may not survive into the main branch or we could make them easier to use.
Now that the CEP is approved, this branch should be completed and become the way to run conda-index. |
Combining the sharded CLI into the main CLI
Something about this breaks CondaIndex subclasses that want to override upstream_stage, not sure what. (it has to pass upstream_stage to CondaIndex instead of as a CacheClass property) |
Co-authored-by: Travis Hathaway <travis.j.hathaway@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
@@ -91,6 +92,23 @@ | |||
repodata_version=2 which is supported in conda 24.5.0 or later. | |||
""", | |||
) | |||
@click.option( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two options are more about testing from a backup of the conda-forge database, they may not survive into the main branch or we could make them easier to use.
@click.option( | ||
"--sharded", | ||
help=""" | ||
Write index using shards | ||
""", | ||
default=False, | ||
is_flag=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@click.option( | |
"--sharded", | |
help=""" | |
Write index using shards | |
""", | |
default=False, | |
is_flag=True, | |
) | |
@click.option( | |
"--write-shards/--no-write-shards", | |
help=""" | |
Write a repodata.msgpack.zst index and many smaller files per CEP-16. | |
""", | |
default=False, | |
is_flag=True, | |
) |
current_repodata=True, | ||
sharded=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sharded=False, | |
write_shards=False, |
@@ -135,7 +166,10 @@ def cli( | |||
if output: | |||
output = os.path.expanduser(output) | |||
|
|||
channel_index = ChannelIndex( | |||
channel_index_class = ChannelIndexShards if sharded else ChannelIndex | |||
cache_class = ShardedIndexCache if sharded else CondaIndexCache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to write both with a single CLI invocation. Possibly the grouped query (additional method in ShardedIndexCache) becomes the only query we use from now on. We could maintain the subclass to distinguish when conda-index is being extended by non-shard-aware embedders or we could merge them into a single class.
|
||
if new_pkg_fixes is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new_pkg_fixes argument is part of an effort to make repodata patching faster in the case of shards; the way it works now is not as efficient for shards compared to patching all repodata in one go.
CREATE TABLE IF NOT EXISTS index_json ( | ||
path TEXT PRIMARY KEY, index_json BLOB, | ||
name AS (json_extract(index_json, '$.name')), | ||
sha256 AS (json_extract(index_json, '$.sha256')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worthwhile to index name; but it might not since we will typically fetch everything.
def __init__( | ||
self, | ||
channel_root: Path | str, | ||
subdir: str, | ||
*, | ||
fs: MinimalFS | None = None, | ||
channel_url: str | None = None, | ||
upstream_stage: str = "fs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this but it's awkward to pass both the class and its parameters separately to class ChannelIndex()
@@ -174,7 +174,21 @@ def merge_or_update_dict( | |||
if base == new: | |||
return base | |||
|
|||
for key, value in new.items(): | |||
if not add_missing_keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another part of the effort to make repodata patching more efficient for shards; not entirely successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dholth you are not currently storing the patches in the sqlite database, correct? Maybe that could help with the efficiency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tend to execute code to generate patches every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generate patches as in JLAP or apply patches as in repodata patchesd. I understood this code was doing the latter, so I might have misunderstood :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conda-forge downloads repodata, generates hotfixes and then stores those patches in a json file, but in general conda-index executes code to generate hotfixes (unfortunately).
### Enhancements | ||
|
||
* Add `--channeldata/--no-channeldata` flag to toggle generating channeldata. | ||
* Add sharded repodata (repodata split into separate files per package name). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Add sharded repodata (repodata split into separate files per package name). | |
* Add `--write-shards/--no-write-shards` sharded repodata (repodata split into separate files per package name). |
Description
What would it look like to generate sharded repodata per conda/ceps#75
Interested in seeing whether we can efficiently generate shards; how repodata patching should work; and whether we could generate shards as the primary artifact and then derive repodata.json from shards [at the same time in "processes a sequence of package names" code]
How to test
Check out this repository and https://github.com/dholth/conda-test-data
Decompress
conda_test_data/conda-forge/*/.cache/cache.db.zst
Download
conda-forge-repodata-patches-<version>.conda
python3 -m conda_index.index.shards --upstream-stage clone --no-save-fs-state --patch-generator ~/miniconda3/pkgs/conda-forge-repodata-patches-20240401.20.33.07-hd8ed1ab_1.conda --output /tmp/shards ~/prog/conda-test-data/conda-forge
Examine output in
/tmp/shards
I've begun trying to apply the patches to individual shards. This is slow; should compare against applying the many patches against a whole repodata.json.
Checklist - did you ...
news
directory (using the template) for the next release's release notes?