Introduce scope feature of shasum similar to modtime #53

borestad · 2024-08-08T10:17:46Z

borestad
Aug 8, 2024

In some of my pipelines, I can't rely on the modtime, since the files may be updated at any time (with either the same or different content/shasum).

My proposal is to create an alternative to --modtime to be used within a scope.

Example of how today's mechanism works:

bkt --ttl=1h --modtime ./run-something -- ./run-something  # Runs first time
bkt --ttl=1h --modtime ./run-something -- ./run-something  # Cached
touch ./run-something
bkt --ttl=1h --modtime ./run-something -- ./run-something  # Runs again

I'd like the last command not to run unless the shasum has actually changed (without resorting to something "subshell hackish" like the discussion below).

Similar? #26

dimo414 · 2024-09-02T19:53:33Z

dimo414
Sep 2, 2024
Maintainer

Hey @borestad, thanks for bringing this up, and sorry for the delayed response. The use case you describe makes total sense, but I'm honestly unsure how I feel about adding support for it. First off, I worry that we could keep coming up with use cases for all sorts of different signals and end up with a heavily polluted CLI with lots of knobs most callers don't care about (and presumably a larger / slower bkt binary as well). Obviously we're just talking about one more knob right now, and maybe the risk I'm describing is overblown, but that is at least on my mind. I would feel more comfortable treating --modtime as a single special case, unless there's a significant and clear need for additional invalidation mechanisms.

To the specific question of file contents / hashes as a caching signal, one concrete concern I have is overhead. Checking a file's modtime is (fairly) cheap and independent of the file's size or other metadata, whereas computing a hash is O(n) in the file size and, depending on the chosen algorithm, potentially quite slow. Granted, it'd be faster to do it in-process than to shell out, but it's plausible that computing the hashes could overshadow the subprocess overhead for sufficiently large files. Even if we added an optimization like only recomputing on modtime changes, presumably in practice such files would be expected to be modified-but-not-changed fairly frequently (otherwise why bother checking the contents), so I expect this would require an O(n) pass over the contents on many / most invocations.

I wonder if it would be possible in your use case to either a) not update the file unless it's actually (expected to be) changing, or b) recompute the hash on write and update a separate file only when the hash changes, then have bkt watch that other file? Either of those approaches would avoid the O(n) overhead on bkt invocations and pass the responsibility to the writer, where I'd argue it's more appropriate.
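To make the two options concrete, here is a rough writer-side sketch. It is not part of bkt: the helper names write_if_changed and update_stamp and the stamp file name run-something.sha256 are made up for illustration, and it assumes sha256sum (coreutils), cmp and mktemp are available.

```sh
#!/bin/sh
# Hypothetical writer-side helpers; nothing here is provided by bkt.

# Option a) Replace the target only if the new content (read from stdin)
# differs, so the target's modtime stays put when nothing changed.
# Note: a replaced file takes on the temp file's permissions (mktemp
# creates it private), so chmod afterwards if the target must be executable.
write_if_changed() {
  _wic_target=$1
  _wic_tmp=$(mktemp "${_wic_target}.XXXXXX") || return 1
  cat > "$_wic_tmp"
  if [ -f "$_wic_target" ] && cmp -s "$_wic_tmp" "$_wic_target"; then
    rm -f "$_wic_tmp"            # identical content: leave the target untouched
  else
    mv "$_wic_tmp" "$_wic_target"
  fi
}

# Option b) Keep a small stamp file holding the target's hash and rewrite
# it only when the hash changes, so the stamp's modtime tracks content.
update_stamp() {
  _us_target=$1
  _us_stamp=$2
  _us_new=$(sha256sum "$_us_target" | cut -d' ' -f1) || return 1
  _us_old=$(cat "$_us_stamp" 2>/dev/null)
  if [ "$_us_new" != "$_us_old" ]; then
    printf '%s\n' "$_us_new" > "$_us_stamp"
  fi
}

# The writer would call update_stamp after each write, e.g.
#   update_stamp ./run-something ./run-something.sha256
# and bkt would then watch the stamp instead of the file itself:
#   bkt --ttl=1h --modtime ./run-something.sha256 -- ./run-something
```

Either way the O(n) comparison or hash happens once per write rather than on every bkt invocation.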

Curious for your thoughts 🙂
