build and process .conda artifacts #1586

Closed · 20 tasks done
beckermr opened this issue Jan 20, 2022 · 25 comments

@beckermr (Member) commented Jan 20, 2022

This issue is to track to-do items for building and handling .conda artifacts.

to dos:

@jakirkham (Member)

cc @conda-forge/core (as Anaconda.org & CDN can now handle .conda)

@beckermr (Member Author) commented Nov 16, 2022

edit: all of these items are done

notes from core call:

- [x] make sure the announcement mentions the minimum conda version (4.7)
- [x] check that CI services do not do duplicate uploads
- [x] set the compression level for big packages (see the sketch below)
  - the flag is [`--zstd-compression-level`](https://github.com/conda/conda-build/blob/3baa21e0af022b3f971068566831c812497545f1/conda_build/cli/main_build.py#L159-L165)
  - the default is 22, set [here](https://github.com/conda/conda-build/blob/3baa21e0af022b3f971068566831c812497545f1/conda_build/config.py#L53)

@beckermr (Member Author)

use

conda_build:
  zstd_compression_level: 16  # any value in the 16 to 19 range

in the condarc

@beckermr (Member Author)

@conda-forge/core I went with compression level 16. LMK if you have any issues with that.

cc @mbargull @mariusvniekerk

@jakirkham (Member) commented Nov 16, 2022

How long does it take to run 16 (or 19)? How much more compression does one see between the two? Understand we may not have benchmarks, but any info that can help guide us would be useful.

Should we allow this to be overridable? For example if compression takes too long on a feedstock close to a CI time limit and we want to dial it down.

Edit: Just realized PR ( #1852 ) shows this being configurable. So I think that answers the last question.

@beckermr (Member Author)

> How long does it take to run 16 (or 19)? How much more compression does one see between the two? Understand we may not have benchmarks, but any info that can help guide us would be useful.

I don't have any of this info. I think we ship this PR and then figure out how it is working as things run.

@jakirkham (Member)

Yeah being able to configure it is more important I think.

From past experience with compressors, the last little bit tends to take a lot longer for minimal gain. So I was just trying to get a sense of how "flat" the curve was getting, to aid in decision making.

@beckermr (Member Author) commented Nov 17, 2022

Here is a benchmark for numpy using the following script on my (old Intel) Mac:

#!/usr/bin/env bash
# Benchmark .tar.bz2 -> .conda transmutation at several zstd compression levels.
# Usage: ./bench.sh <package>.tar.bz2 (requires conda-package-handling's `cph`)

in_pkg=$1
out_pkg=${in_pkg/.tar.bz2/.conda}
bak_pkg=${in_pkg}.bak

# keep a pristine copy of the input artifact
cp ${in_pkg} ${bak_pkg}

for level in 1 4 10 16 17 18 19 20 21; do
    # restore the original .tar.bz2 and clear outputs from the previous iteration
    cp ${bak_pkg} ${in_pkg}
    rm -f ${out_pkg}
    rm -rf ${out_pkg/.conda//}

    # time the .tar.bz2 -> .conda transmutation at this level
    start=`python -c "import time; print(time.time())"`
    cph transmute --zstd-compression-level=${level} ${in_pkg} .conda
    end=`python -c "import time; print(time.time())"`
    ttime=$( echo "$end - $start" | bc -l )

    # time extraction of the resulting .conda
    start=`python -c "import time; print(time.time())"`
    cph x ${out_pkg}
    end=`python -c "import time; print(time.time())"`
    runtime=$( echo "$end - $start" | bc -l )

    # `cut -w` is BSD/macOS cut; on GNU systems use e.g. awk '{print $5}'
    size=$(ls -lah ${out_pkg} | cut -w -f 5)

    echo "${level} ${size} ${runtime} ${ttime}"
done

# clean up: restore the input and remove all benchmark outputs
cp ${bak_pkg} ${in_pkg}
rm -f ${bak_pkg}
rm -f ${out_pkg}
rm -rf ${out_pkg/.conda//}

results (columns: zstd level, .conda size, extraction time in seconds, transmute time in seconds)

$ ./bench.sh numpy-1.23.4-py39h9e3402d_1.tar.bz2 
1 8.2M 2.182568 7.1936909                                                                                                                                                 
4 7.2M 1.803828 7.8562648                                                                                                                                                 
10 6.4M 1.9773452 8.359201                                                                                                                                                
16 5.9M 1.975351 16.997171                                                                                                                                                
17 5.8M 3.171298 20.3572858                                                                                                                                               
18 5.7M 2.3847492 23.3421962                                                                                                                                              
19 5.7M 2.237947 36.101651                                                                                                                                                
20 5.2M 3.756540 35.1239249                                                                                                                                               
21 5.2M 3.2139912 40.8598119           

Things flatten for this size around 10-16. This package is ~32M uncompressed.

@beckermr (Member Author) commented Nov 17, 2022

Here is the start of a benchmark for a much bigger file (compressed around 450 MB)

$ ./bench.sh stackvana-afw-0.2022.46-py310hff52083_0.tar.bz2 
1 464M 17.158145 161.082254                                                                                                                                               
4 419M 14.839792 157.401964                                                                                                                                               
10 375M 16.084401 199.193263         
16 338M 13.825499 711.3774772

@beckermr (Member Author)

I think 16 will be fine for now. We can lower it as needed for big packages and we only take a small hit on small ones.

@beckermr (Member Author)

We really could use an adaptive option in conda build.

@jakirkham (Member) commented Nov 17, 2022

Thanks Matt! 🙏

Agreed 16 seems like plenty.

Also notably better than their .tar.bz2 equivalents:

| package name | .tar.bz2 (MB) | .conda @ 16 (MB) |
| --- | --- | --- |
| numpy-1.23.4-py39h9e3402d_1 | 6.6 | 5.9 |
| stackvana-afw-0.2022.46-py310hff52083_0 | 435.0 | 338.0 |

@jakirkham (Member) commented Nov 17, 2022

On a different note (kind of related to adaptive), we may in the future want to leverage Zstandard's dictionary support by pretraining a dictionary on the content of many packages. We could then ship this constructed dictionary and use it to improve overall compression and cut down compression/decompression time.

One question here is how compressible packages are in aggregate. There may be some things (like text) that compress really well and other things (like dynamic libraries) that do less well. A somewhat related question is whether it is worth creating per-file-format dictionaries (though this would be a modification to the format). Given that other packagers, filesystems, etc. have already gone down the path of using Zstandard, we may be able to glean results from their efforts.
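As a rough sketch of what dictionary training could look like with the stock zstd CLI (this is not part of any conda tooling; the file names and dictionary size below are illustrative only):

# train a shared dictionary on files sampled from many extracted packages
zstd --train extracted-pkgs/*/* -o conda-pkgs.dict --maxdict=112640

# compress and decompress a package payload tarball with that dictionary
zstd -16 -D conda-pkgs.dict pkg-numpy-1.23.4-py39h9e3402d_1.tar -o pkg.tar.zst
zstd -d -D conda-pkgs.dict pkg.tar.zst -o pkg-roundtrip.tar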

@dhirschfeld (Member)

Great to see this happening - I'm excited about the performance improvements this might bring! ❤️

I've been building my own .conda packages for a bit now and one usability issue I've run into is not being able to see inside the package.

With the old .tar.bz2 packages I could open them in 7-Zip and browse the contents / folder-structure. That was often invaluable in debugging broken builds.

With the .conda format the contents appear as a pkg-*.tar.zst binary blob:

[screenshot: 7-Zip listing the .conda archive, showing the pkg-*.tar.zst blob]

Is there an easy way to browse the contents of a .conda package?

@jakirkham (Member)

We have made the same observation ( conda/conda-package-handling#5 ) 🙂

@chrisburr (Member)

> Is there an easy way to browse the contents of a .conda package?

All of this information is in https://github.com/regro/libcfgraph and I have a local clone which I regularly use with rg.

Longer term I was already thinking this would be a great feature for https://prefix.dev/ if @wolfv is interested.

@dhirschfeld (Member) commented Nov 18, 2022

> All of this information is in https://github.com/regro/libcfgraph

The idea is that this is useful for debugging broken builds - i.e. the build fails because of missing files in the package, so the new package version never gets published outside of CI or my local desktop. As a dev, I want to know what the internal file/folder structure of the newly built (broken) package was, so I can compare it with my expectations.

I don't know much about prefix.dev but doesn't that just report on dependencies between published packages?

beckermr unpinned this issue Nov 18, 2022

@jakirkham (Member)

Related to this, I have filed an issue to create a CEP spelling out the .conda spec more fully (conda/ceps#42).

@jaimergp (Member)

@dhirschfeld - you can use conda_package_handling (via cph extract) for your .conda needs!
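For reference, a sketch of both routes (the package name is just the numpy build from the benchmarks above; the manual route relies on a .conda being a zip archive whose payload member follows the pkg-*.tar.zst pattern seen in the screenshot):

# with conda-package-handling
cph extract numpy-1.23.4-py39h9e3402d_1.conda

# or by hand: list the zip members, pull out the payload, and inspect it
unzip -l numpy-1.23.4-py39h9e3402d_1.conda
unzip numpy-1.23.4-py39h9e3402d_1.conda "pkg-*.tar.zst"
zstd -d pkg-numpy-1.23.4-py39h9e3402d_1.tar.zst
tar -tvf pkg-numpy-1.23.4-py39h9e3402d_1.tar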

@dholth (Contributor) commented Nov 20, 2022

zstd has built-in benchmarking

@beckermr (Member Author)

What does this mean?

@dholth (Contributor) commented Nov 21, 2022

If you run `zstd -b1 -e19 somefile` it will tell you how long each level took.
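For instance (a hypothetical invocation, reusing the uncompressed payload tarball from the earlier benchmark):

# measure compression levels 1 through 19 in one run
zstd -b1 -e19 pkg-numpy-1.23.4-py39h9e3402d_1.tar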

@mbargull (Member)

" copy-pasta" of some comments of mine from internal chat:

Re: dictionary:
Those can make sense in some cases, but we're also then treading in the "over-optimizing and leaving utilities behind" area. Meaning, we get diminishing returns with those minor optimizations and also I wouldn't be able to bsdtar -x all-the-things then ;).

Re: memory usage:
This is also why I recommend not using the --ultra settings (e.g., -22), but limiting ourselves to -19 at most. Yes, compressing would take an unreasonable amount of resources in some cases, but more importantly for me, decompression would also be affected! zstd by default can use much more memory on decompression (i.e., on our users' side) than ye olde gzip/bzip2 anyway. If you use --ultra settings, then its window sizes can surpass the 128MB mark and it may need as much memory on decompression as well.

A thing orthogonal to tweaking the compression level would be to give the compressor a better-arranged stream of data. Previously, Ray did some experimentation with binsort, but it was rarely used and slow as hell in the chosen configuration. But in some cases it could yield notably improved compression. I'm not sure how much of an impact it would have for zstd because it already uses much bigger window sizes (128MB vs the 900KB block size of bzip2 -9, IIRC). Nowadays, one would probably look at how, e.g., https://github.com/mhx/dwarfs/tree/v0.6.2#overview (see "similarity hashing") arranges its input instead of trying the binsort approach.

@dholth (Contributor) commented Nov 23, 2022

Thankfully the window size is also limited by the uncompressed archive size.

kou pushed a commit to apache/arrow that referenced this issue Dec 12, 2022
Synching after conda-forge/arrow-cpp-feedstock#875, which does quite a lot of things, see this [summary](conda-forge/arrow-cpp-feedstock#875 (review)). I'm not keeping the commit history here, but it might be instructive to check the commits there to see why certain changes came about.

It also fixes the CI that was broken by a3ef64b (undoing the changes of #14102 in `tasks.yml`).

Finally, it adapts to conda making a long-planned [switch](conda-forge/conda-forge.github.io#1586) w.r.t. the format / extension of the artefacts it produces.

I'm very likely going to need some help (or at least pointers) for the R-stuff. CC @ xhochy
(for context, I never got a response to conda-forge/r-arrow-feedstock#55, but I'll open a PR to build against libarrow 10).

Once this is done, I can open issues to tackle the tests that shouldn't be failing, resp. the segfaults on PPC resp. in conjunction with `sparse`.
* Closes: #14828

Authored-by: H. Vetinari <h.vetinari@gmx.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@michaelosthege

Even though I have conda 4.12.0, my mamba 0.27.0 has trouble finding packages that are only available as .conda artifacts: mamba-org/mamba#2172
