Write sharded repodata #161

Draft · dholth wants to merge 28 commits into main from sharded-repodata
Conversation

@dholth (Contributor) commented May 10, 2024

Description

What would it look like to generate sharded repodata per conda/ceps#75?

Interested in seeing whether we can efficiently generate shards; how repodata patching should work; and whether we could generate shards as the primary artifact and then derive repodata.json from them, in the same code path that processes a sequence of package names.
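To make that last point concrete, here is a minimal sketch, not this branch's code, of deriving repodata.json from an on-disk shard layout like the one described in CEP 16. It assumes the msgpack and zstandard packages, an index file named repodata_shards.msgpack.zst whose "shards" map package names to raw sha256 digests, and shard files stored under shards/<hex digest>.msgpack.zst with "packages" / "packages.conda" keys; the real layout may differ.

import json
from pathlib import Path

import msgpack
import zstandard


def load_msgpack_zst(path: Path):
    # Decompress and unpack one .msgpack.zst file.
    return msgpack.unpackb(zstandard.decompress(path.read_bytes()))


def shards_to_repodata(subdir_path: Path) -> dict:
    # Assumed index/shard layout per CEP 16; adjust names if the branch differs.
    index = load_msgpack_zst(subdir_path / "repodata_shards.msgpack.zst")
    repodata = {
        "info": dict(index.get("info", {})),
        "packages": {},
        "packages.conda": {},
        "repodata_version": 2,
    }
    for name, digest in sorted(index["shards"].items()):
        shard = load_msgpack_zst(subdir_path / "shards" / f"{digest.hex()}.msgpack.zst")
        repodata["packages"].update(shard.get("packages", {}))
        repodata["packages.conda"].update(shard.get("packages.conda", {}))
    return repodata


if __name__ == "__main__":
    # Assumes a noarch subdir under the --output directory used below.
    print(json.dumps(shards_to_repodata(Path("/tmp/shards/noarch")), indent=2)[:1000])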

How to test

Check out this repository and https://github.com/dholth/conda-test-data

Decompress conda-test-data/conda-forge/*/.cache/cache.db.zst (a decompression sketch follows these steps)

Download conda-forge-repodata-patches-<version>.conda

python3 -m conda_index.index.shards --upstream-stage clone --no-save-fs-state --patch-generator ~/miniconda3/pkgs/conda-forge-repodata-patches-20240401.20.33.07-hd8ed1ab_1.conda --output /tmp/shards ~/prog/conda-test-data/conda-forge

Examine output in /tmp/shards
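For the decompression step, a minimal sketch assuming the zstandard Python package is installed and the test data lives at ~/prog/conda-test-data (the zstd command-line tool would work just as well):

from pathlib import Path

import zstandard

# Decompress every per-subdir cache database next to its .zst original.
for compressed in Path("~/prog/conda-test-data/conda-forge").expanduser().glob("*/.cache/cache.db.zst"):
    target = compressed.with_suffix("")  # .../.cache/cache.db
    target.write_bytes(zstandard.decompress(compressed.read_bytes()))
    print("wrote", target)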

I've begun applying the patches to individual shards. This is slow; it should be compared against applying the many patches to a whole repodata.json in one pass.
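For reference, a minimal sketch, not the code in this branch, of what per-shard patching amounts to, assuming pre-generated patch instructions in the usual packages / packages.conda / remove shape:

def patch_shard(shard: dict, instructions: dict) -> dict:
    # Apply only the instructions that mention filenames present in this shard.
    for key in ("packages", "packages.conda"):
        records = shard.get(key, {})
        for filename, fixes in instructions.get(key, {}).items():
            if filename in records:
                records[filename].update(fixes)
        for filename in instructions.get("remove", ()):
            records.pop(filename, None)
    return shard

Scanning the full instruction dict once per shard, rather than once per repodata.json, is the cost being compared here.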

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@conda-bot added the cla-signed label ([bot] added once the contributor has signed the CLA) May 10, 2024
@dholth force-pushed the sharded-repodata branch from 62eb37f to 3cac14a on June 7, 2024 18:23
@@ -91,6 +92,23 @@
repodata_version=2 which is supported in conda 24.5.0 or later.
""",
)
@click.option(
dholth (author):
We could replace this with an "add-only" or "no-remove" option, that would keep packages in the index even if they are not found in the filesystem.
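A hypothetical sketch of what such an option could look like; the flag name, help text, and placeholder command are illustrations, not part of this PR:

import click


@click.command()
@click.option(
    "--add-only/--no-add-only",
    help="""
    Keep existing packages in the index even if they are no longer found on
    the filesystem; only add or update records, never remove them.
    """,
    default=False,
)
def cli(add_only):
    # Placeholder body; real wiring would thread add_only into the index run.
    click.echo(f"add_only={add_only}")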

dholth (author):
These two options are more about testing from a backup of the conda-forge database, they may not survive into the main branch or we could make them easier to use.

@dholth (author) commented Aug 19, 2024

Now that the CEP is approved, this branch should be completed and become the way to run conda-index.

@dholth mentioned this pull request Aug 23, 2024
@dholth changed the title from "Sharded repodata experiment" to "Write sharded repodata" Aug 30, 2024
@dholth (author) commented Aug 30, 2024

Something about this breaks CondaIndex subclasses that want to override upstream_stage; not sure what. (upstream_stage now has to be passed to CondaIndex directly instead of being set as a CacheClass property.)

@dholth (author) left a comment:
Thanks

Comment on lines +133 to +140
@click.option(
    "--sharded",
    help="""
    Write index using shards
    """,
    default=False,
    is_flag=True,
)
dholth (author):
Suggested change
-@click.option(
-    "--sharded",
-    help="""
-    Write index using shards
-    """,
-    default=False,
-    is_flag=True,
-)
+@click.option(
+    "--write-shards/--no-write-shards",
+    help="""
+    Write a repodata.msgpack.zst index and many smaller files per CEP-16.
+    """,
+    default=False,
+    is_flag=True,
+)

current_repodata=True,
sharded=False,
dholth (author):
Suggested change
-sharded=False,
+write_shards=False,

@@ -135,7 +166,10 @@ def cli(
if output:
output = os.path.expanduser(output)

channel_index = ChannelIndex(
channel_index_class = ChannelIndexShards if sharded else ChannelIndex
cache_class = ShardedIndexCache if sharded else CondaIndexCache
dholth (author):
We should be able to write both repodata.json and shards with a single CLI invocation. Possibly the grouped query (an additional method on ShardedIndexCache) becomes the only query we use from now on. We could keep the subclass to distinguish when conda-index is being extended by non-shard-aware embedders, or we could merge them into a single class.
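A hypothetical sketch of the single-pass idea; the names here are placeholders, not conda-index API. Iterate records grouped by package name once, write a shard per name, and accumulate the full repodata along the way:

def write_both(grouped_records, write_shard):
    # grouped_records: iterable of (package_name, {filename: record}) pairs,
    # e.g. produced by a grouped query; write_shard: callable(name, shard_dict).
    repodata = {"packages": {}, "packages.conda": {}, "repodata_version": 2}
    for name, records in grouped_records:
        shard = {"packages": {}, "packages.conda": {}}
        for filename, record in records.items():
            key = "packages.conda" if filename.endswith(".conda") else "packages"
            shard[key][filename] = record
            repodata[key][filename] = record
        write_shard(name, shard)
    return repodata


# Usage sketch: collect shards in memory and keep the merged repodata.
shards = {}
repodata = write_both([("zlib", {"zlib-1.3-h0.conda": {"name": "zlib"}})], shards.__setitem__)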


if new_pkg_fixes is None:
dholth (author):
The new_pkg_fixes argument is part of an effort to make repodata patching faster in the case of shards; the way it works now is not as efficient for shards compared to patching all repodata in one go.

CREATE TABLE IF NOT EXISTS index_json (
    path TEXT PRIMARY KEY, index_json BLOB,
    name AS (json_extract(index_json, '$.name')),
    sha256 AS (json_extract(index_json, '$.sha256'))
dholth (author):
It might be worthwhile to index name, but it might not be, since we will typically fetch everything.
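A sketch of that trade-off using sqlite3 directly; the index name is made up, and the table statement is completed with a closing parenthesis only to make the example runnable:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS index_json (
        path TEXT PRIMARY KEY, index_json BLOB,
        name AS (json_extract(index_json, '$.name')),
        sha256 AS (json_extract(index_json, '$.sha256')))"""
)
# An index on the generated column helps single-name lookups:
conn.execute("CREATE INDEX IF NOT EXISTS idx_index_json_name ON index_json(name)")
conn.execute("SELECT path, index_json FROM index_json WHERE name = ?", ("numpy",))
# ...but the shard-building query typically reads every row anyway, so the
# index mostly helps point lookups rather than the bulk fetch:
conn.execute("SELECT name, path, index_json FROM index_json ORDER BY name")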

def __init__(
    self,
    channel_root: Path | str,
    subdir: str,
    *,
    fs: MinimalFS | None = None,
    channel_url: str | None = None,
    upstream_stage: str = "fs",
dholth (author):
We need this, but it's awkward to pass both the class and its parameters separately to ChannelIndex().
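One possible way around that awkwardness, purely a sketch of a design option and not what this branch does: bind the cache parameters up front with functools.partial so only a single cache factory has to be handed to ChannelIndex.

import functools


def make_cache_factory(cache_class, **cache_kwargs):
    # Returns a callable with upstream_stage, fs, channel_url, ... pre-bound,
    # so callers construct the cache with just (channel_root, subdir).
    return functools.partial(cache_class, **cache_kwargs)


# e.g. cache_factory = make_cache_factory(ShardedIndexCache, upstream_stage="clone")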

@@ -174,7 +174,21 @@ def merge_or_update_dict(
    if base == new:
        return base

    for key, value in new.items():
        if not add_missing_keys:
dholth (author):
Another part of the effort to make repodata patching more efficient for shards; not entirely successful.
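For orientation, a minimal sketch, far simpler than the real recursive merge_or_update_dict, of what an update-only merge means here: patch values only for keys the base dict already has, so patch entries for packages absent from a shard are skipped.

def merge_update_only(base: dict, new: dict) -> dict:
    # Update keys that already exist in base; ignore the rest
    # (the add_missing_keys=False idea, without nested-dict handling).
    if base == new:
        return base
    for key, value in new.items():
        if key in base:
            base[key] = value
    return base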

A reviewer:
@dholth you are not currently storing the patches in the sqlite database, correct? Maybe that could help with the efficiency?

dholth (author):
We tend to execute code to generate patches every time.

A reviewer:
Generate patches as in JLAP, or apply patches as in repodata patches? I understood this code was doing the latter, so I might have misunderstood :)

dholth (author):
conda-forge downloads repodata, generates hotfixes, and then stores those patches in a JSON file, but in general conda-index executes code to generate hotfixes (unfortunately).

### Enhancements

* Add `--channeldata/--no-channeldata` flag to toggle generating channeldata.
* Add sharded repodata (repodata split into separate files per package name).
dholth (author):
Suggested change
-* Add sharded repodata (repodata split into separate files per package name).
+* Add `--write-shards/--no-write-shards` flag to write sharded repodata (repodata split into separate files per package name).
