Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File size check when detecting changed inputs (Provide a mechanism to force cache rebuild) #108

Closed
2 tasks done
scdub opened this issue Aug 2, 2023 · 5 comments · Fixed by #129 or #135
Closed
2 tasks done
Labels
locked [bot] locked due to inactivity type::bug describes erroneous operation, use severity::* to classify the type

Comments

@scdub
Copy link

scdub commented Aug 2, 2023

Checklist

  • I added a descriptive title
  • I searched open reports and couldn't find a duplicate

What happened?

If builds are not deduplicated with the hash suffix, replacing package archives will not be detected by conda_index and it will provide out of date metadata. Checking on file size would be a relatively inexpensive way to detect this case, but an option to force cache recreation may be simpler to implement and not impose runtime costs for environments where hash suffixes are exclusively used.

Conda Info

active environment : base
    active env location : /home/user/miniconda
            shell level : 1
       user config file : /home/user/.condarc
 populated config files :
          conda version : 23.7.2
    conda-build version : 3.26.0
         python version : 3.9.16.final.0
       virtual packages : __archspec=1=x86_64
                          __glibc=2.27=0
                          __linux=6.2.9=0
                          __unix=0=0
       base environment : /home/user/miniconda  (writable)
      conda av data dir : /home/user/miniconda/etc/conda
  conda av metadata url : None
          package cache : /home/user/miniconda/pkgs
                          /home/user/.conda/pkgs
       envs directories : /home/user/miniconda/envs
                          /home/user/.conda/envs
               platform : linux-64
             user-agent : conda/23.7.2 requests/2.31.0 CPython/3.9.16 Linux/6.2.9-x86_64-linode160 ubuntu/18.04.6 glibc/2.27
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

Conda Config

==> envvars <==
allow_softlinks: False

==> cmd_line <==
debug: False
json: False

Conda list

No response

Additional Context

Removing the .cache directories forces a refresh, this wasn't immediately obvious to me migrating from the historical conda index

@scdub scdub added the type::bug describes erroneous operation, use severity::* to classify the type label Aug 2, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in 🧭 Planning Aug 2, 2023
@dholth
Copy link
Contributor

dholth commented Aug 24, 2023

I'm surprised we don't already detect that kind of change. In these lines we store the path, mtime, and size. When we retrieve it we only check mtime though.

https://github.com/conda/conda-index/blob/main/conda_index/index/sqlitecache.py#L420-L489

We also try to copy those values over from the older many-small-json cache. https://github.com/conda/conda-index/blob/main/conda_index/index/convert_cache.py#L246-L251

@scdub
Copy link
Author

scdub commented Aug 25, 2023

Thanks for the details. Do you think we could safely add a check on the size during the comparison with fs.mtime != cached.mtime OR fs.size != cached.size OR cached.path IS NULL)? I rebuilt the index manually after I encountered this so I can't directly reproduce it currently, but I can try to do so and breakpoint into the code if that'd be helpful.

@dholth
Copy link
Contributor

dholth commented Aug 25, 2023 via email

@dholth dholth changed the title Provide a mechanism to force cache rebuild Provide a mechanism to force cache rebuild (add file size check when detecting changed inputs) Sep 19, 2023
@scdub
Copy link
Author

scdub commented Oct 11, 2023

OK, I was able to observe this again: if you replace a package with an exact matching name to one that is already cached, e.g. graphviz-8.1.0-0.conda, the metadata will not be rewritten. I believe the suggested change will address that for this particular case of not using hash strings.

@dholth dholth changed the title Provide a mechanism to force cache rebuild (add file size check when detecting changed inputs) File size check when detecting changed inputs (Provide a mechanism to force cache rebuild) Oct 13, 2023
@dholth
Copy link
Contributor

dholth commented Oct 13, 2023

This excerpt from a changed-packages query checks md5 or sha256 sums, only when the upstream or 'fs' state has a non-empty checksum.

            WHERE fs.path LIKE :path_like AND
                (cached.path IS NULL OR
                    (LENGTH(fs.sha256) > 0 AND fs.sha256 != cached.sha256)
                    OR
                    (LENGTH(fs.md5) > 0 AND fs.md5 != cached.md5)
                )

@github-project-automation github-project-automation bot moved this from 🆕 New to 🏁 Done in 🧭 Planning Oct 16, 2023
@dholth dholth reopened this Jan 24, 2024
@github-project-automation github-project-automation bot moved this from 🏁 Done to 🏗️ In Progress in 🧭 Planning Jan 24, 2024
@github-project-automation github-project-automation bot moved this from 🏗️ In Progress to 🏁 Done in 🧭 Planning Jan 24, 2024
@github-actions github-actions bot added the locked [bot] locked due to inactivity label Aug 27, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
locked [bot] locked due to inactivity type::bug describes erroneous operation, use severity::* to classify the type
Projects
Archived in project
2 participants