Write sharded repodata #161

Draft · dholth wants to merge 28 commits into main from sharded-repodata
Conversation

@dholth (Contributor) commented May 10, 2024

Description

What would it look like to generate sharded repodata per conda/ceps#75?

Interested in seeing whether we can efficiently generate shards; how repodata patching should work; and whether we could generate shards as the primary artifact and then derive repodata.json from them, in the same code path that processes a sequence of package names.
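To make that last point concrete, here is a minimal sketch, not this branch's code, of deriving repodata.json from an on-disk shard layout like the one described in CEP 16. It assumes the msgpack and zstandard packages, an index file named repodata_shards.msgpack.zst whose "shards" map package names to raw sha256 digests, and shard files stored under shards/<hex digest>.msgpack.zst with "packages" / "packages.conda" keys; the real layout may differ.

import json
from pathlib import Path

import msgpack
import zstandard


def load_msgpack_zst(path: Path):
    # Decompress and unpack one .msgpack.zst file.
    return msgpack.unpackb(zstandard.decompress(path.read_bytes()))


def shards_to_repodata(subdir_path: Path) -> dict:
    # Assumed index/shard layout per CEP 16; adjust names if the branch differs.
    index = load_msgpack_zst(subdir_path / "repodata_shards.msgpack.zst")
    repodata = {
        "info": dict(index.get("info", {})),
        "packages": {},
        "packages.conda": {},
        "repodata_version": 2,
    }
    for name, digest in sorted(index["shards"].items()):
        shard = load_msgpack_zst(subdir_path / "shards" / f"{digest.hex()}.msgpack.zst")
        repodata["packages"].update(shard.get("packages", {}))
        repodata["packages.conda"].update(shard.get("packages.conda", {}))
    return repodata


if __name__ == "__main__":
    # Assumes a noarch subdir under the --output directory used below.
    print(json.dumps(shards_to_repodata(Path("/tmp/shards/noarch")), indent=2)[:1000])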

How to test

Check out this repository and https://github.com/dholth/conda-test-data

Decompress conda-test-data/conda-forge/*/.cache/cache.db.zst (a decompression sketch follows these steps)

Download conda-forge-repodata-patches-<version>.conda

python3 -m conda_index.index.shards --upstream-stage clone --no-save-fs-state --patch-generator ~/miniconda3/pkgs/conda-forge-repodata-patches-20240401.20.33.07-hd8ed1ab_1.conda --output /tmp/shards ~/prog/conda-test-data/conda-forge

Examine output in /tmp/shards
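For the decompression step, a minimal sketch assuming the zstandard Python package is installed and the test data lives at ~/prog/conda-test-data (the zstd command-line tool would work just as well):

from pathlib import Path

import zstandard

# Decompress every per-subdir cache database next to its .zst original.
for compressed in Path("~/prog/conda-test-data/conda-forge").expanduser().glob("*/.cache/cache.db.zst"):
    target = compressed.with_suffix("")  # .../.cache/cache.db
    target.write_bytes(zstandard.decompress(compressed.read_bytes()))
    print("wrote", target)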

I've begun applying the patches to individual shards. This is slow; it should be compared against applying the many patches to a whole repodata.json in one pass.
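For reference, a minimal sketch, not the code in this branch, of what per-shard patching amounts to, assuming pre-generated patch instructions in the usual packages / packages.conda / remove shape:

def patch_shard(shard: dict, instructions: dict) -> dict:
    # Apply only the instructions that mention filenames present in this shard.
    for key in ("packages", "packages.conda"):
        records = shard.get(key, {})
        for filename, fixes in instructions.get(key, {}).items():
            if filename in records:
                records[filename].update(fixes)
        for filename in instructions.get("remove", ()):
            records.pop(filename, None)
    return shard

Scanning the full instruction dict once per shard, rather than once per repodata.json, is the cost being compared here.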

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@conda-bot added the cla-signed label ([bot] added once the contributor has signed the CLA) May 10, 2024
@dholth force-pushed the sharded-repodata branch from 62eb37f to 3cac14a on June 7, 2024 18:23
@@ -91,6 +92,23 @@
repodata_version=2 which is supported in conda 24.5.0 or later.
""",
)
@click.option(
dholth (author):
We could replace this with an "add-only" or "no-remove" option, that would keep packages in the index even if they are not found in the filesystem.
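A hypothetical sketch of what such an option could look like; the flag name, help text, and placeholder command are illustrations, not part of this PR:

import click


@click.command()
@click.option(
    "--add-only/--no-add-only",
    help="""
    Keep existing packages in the index even if they are no longer found on
    the filesystem; only add or update records, never remove them.
    """,
    default=False,
)
def cli(add_only):
    # Placeholder body; real wiring would thread add_only into the index run.
    click.echo(f"add_only={add_only}")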

dholth (author):
These two options are more about testing from a backup of the conda-forge database, they may not survive into the main branch or we could make them easier to use.

@dholth (author) commented Aug 19, 2024

Now that the CEP is approved, this branch should be completed and become the way to run conda-index.

@dholth mentioned this pull request Aug 23, 2024
@dholth changed the title from "Sharded repodata experiment" to "Write sharded repodata" Aug 30, 2024
@dholth (author) commented Aug 30, 2024

Something about this breaks CondaIndex subclasses that want to override upstream_stage; not sure what. (upstream_stage now has to be passed to CondaIndex directly instead of being set as a CacheClass property.)

@dholth (author) left a comment:
Thanks

Comment on lines +133 to +140
@click.option(
    "--sharded",
    help="""
    Write index using shards
    """,
    default=False,
    is_flag=True,
)
dholth (author):
Suggested change
-@click.option(
-    "--sharded",
-    help="""
-    Write index using shards
-    """,
-    default=False,
-    is_flag=True,
-)
+@click.option(
+    "--write-shards/--no-write-shards",
+    help="""
+    Write a repodata.msgpack.zst index and many smaller files per CEP-16.
+    """,
+    default=False,
+    is_flag=True,
+)

current_repodata=True,
sharded=False,
dholth (author):
Suggested change
-sharded=False,
+write_shards=False,

@@ -135,7 +166,10 @@ def cli(
if output:
output = os.path.expanduser(output)

channel_index = ChannelIndex(
channel_index_class = ChannelIndexShards if sharded else ChannelIndex
cache_class = ShardedIndexCache if sharded else CondaIndexCache
dholth (author):
We should be able to write both repodata.json and shards with a single CLI invocation. Possibly the grouped query (an additional method on ShardedIndexCache) becomes the only query we use from now on. We could keep the subclass to distinguish when conda-index is being extended by non-shard-aware embedders, or we could merge them into a single class.
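A hypothetical sketch of the single-pass idea; the names here are placeholders, not conda-index API. Iterate records grouped by package name once, write a shard per name, and accumulate the full repodata along the way:

def write_both(grouped_records, write_shard):
    # grouped_records: iterable of (package_name, {filename: record}) pairs,
    # e.g. produced by a grouped query; write_shard: callable(name, shard_dict).
    repodata = {"packages": {}, "packages.conda": {}, "repodata_version": 2}
    for name, records in grouped_records:
        shard = {"packages": {}, "packages.conda": {}}
        for filename, record in records.items():
            key = "packages.conda" if filename.endswith(".conda") else "packages"
            shard[key][filename] = record
            repodata[key][filename] = record
        write_shard(name, shard)
    return repodata


# Usage sketch: collect shards in memory and keep the merged repodata.
shards = {}
repodata = write_both([("zlib", {"zlib-1.3-h0.conda": {"name": "zlib"}})], shards.__setitem__)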


if new_pkg_fixes is None:
dholth (author):
The new_pkg_fixes argument is part of an effort to make repodata patching faster in the case of shards; the way it works now is not as efficient for shards compared to patching all repodata in one go.

CREATE TABLE IF NOT EXISTS index_json (
    path TEXT PRIMARY KEY, index_json BLOB,
    name AS (json_extract(index_json, '$.name')),
    sha256 AS (json_extract(index_json, '$.sha256'))
dholth (author):
It might be worthwhile to index name, but it might not be, since we will typically fetch everything.
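A sketch of that trade-off using sqlite3 directly; the index name is made up, and the table statement is completed with a closing parenthesis only to make the example runnable:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS index_json (
        path TEXT PRIMARY KEY, index_json BLOB,
        name AS (json_extract(index_json, '$.name')),
        sha256 AS (json_extract(index_json, '$.sha256')))"""
)
# An index on the generated column helps single-name lookups:
conn.execute("CREATE INDEX IF NOT EXISTS idx_index_json_name ON index_json(name)")
conn.execute("SELECT path, index_json FROM index_json WHERE name = ?", ("numpy",))
# ...but the shard-building query typically reads every row anyway, so the
# index mostly helps point lookups rather than the bulk fetch:
conn.execute("SELECT name, path, index_json FROM index_json ORDER BY name")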

def __init__(
    self,
    channel_root: Path | str,
    subdir: str,
    *,
    fs: MinimalFS | None = None,
    channel_url: str | None = None,
    upstream_stage: str = "fs",
dholth (author):
We need this, but it's awkward to pass both the class and its parameters separately to ChannelIndex().
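One possible way around that awkwardness, purely a sketch of a design option and not what this branch does: bind the cache parameters up front with functools.partial so only a single cache factory has to be handed to ChannelIndex.

import functools


def make_cache_factory(cache_class, **cache_kwargs):
    # Returns a callable with upstream_stage, fs, channel_url, ... pre-bound,
    # so callers construct the cache with just (channel_root, subdir).
    return functools.partial(cache_class, **cache_kwargs)


# e.g. cache_factory = make_cache_factory(ShardedIndexCache, upstream_stage="clone")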

@@ -174,7 +174,21 @@ def merge_or_update_dict(
    if base == new:
        return base

    for key, value in new.items():
        if not add_missing_keys:
dholth (author):
Another part of the effort to make repodata patching more efficient for shards; not entirely successful.
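For orientation, a minimal sketch, far simpler than the real recursive merge_or_update_dict, of what an update-only merge means here: patch values only for keys the base dict already has, so patch entries for packages absent from a shard are skipped.

def merge_update_only(base: dict, new: dict) -> dict:
    # Update keys that already exist in base; ignore the rest
    # (the add_missing_keys=False idea, without nested-dict handling).
    if base == new:
        return base
    for key, value in new.items():
        if key in base:
            base[key] = value
    return base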

A reviewer:
@dholth you are not currently storing the patches in the sqlite database, correct? Maybe that could help with the efficiency?

dholth (author):
We tend to execute code to generate patches every time.

A reviewer:
Generate patches as in JLAP, or apply patches as in repodata patches? I understood this code was doing the latter, so I might have misunderstood :)

dholth (author):
conda-forge downloads repodata, generates hotfixes, and then stores those patches in a JSON file, but in general conda-index executes code to generate hotfixes (unfortunately).

### Enhancements

* Add `--channeldata/--no-channeldata` flag to toggle generating channeldata.
* Add sharded repodata (repodata split into separate files per package name).
dholth (author):
Suggested change
-* Add sharded repodata (repodata split into separate files per package name).
+* Add `--write-shards/--no-write-shards` flag to write sharded repodata (repodata split into separate files per package name).
