Tweak archive construction and compactification tolerance interfaces #601

Open
brookslogan wants to merge 10 commits into dev from lcb/tweak-archive-tol-interface

brookslogan
Contributor

@brookslogan brookslogan commented Jan 28, 2025

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Make sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

  • This brings archive construction in line with the usual constructor convention: new_epi_archive() does only very basic checks, validate_epi_archive() takes its output and does the heavier checks, and other functions handle the remaining heavy operations.
  • Compactification tolerance naming has been tweaked. Compactification is now non-lossy by default, and the tolerance comparison is now nonstrict so that compactify_abs_tol = 0 works as the default. The tolerance setting is also now available in epix_merge().

This helps set up for, and maintain compatibility with, some more performant archive construction & sliding utilities that are in the works.
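
The construction convention described above can be sketched in base R. This is a minimal illustration only: `new_toy_archive()`, `validate_toy_archive()`, `as_toy_archive()`, and the specific checks are hypothetical stand-ins, not the package's actual code.

```r
# Hypothetical toy class; names and checks are illustrative assumptions only.

# Low-level constructor: cheap structural checks, returns an unvalidated object.
new_toy_archive <- function(DT, versions_end) {
  stopifnot(is.data.frame(DT), length(versions_end) == 1L)
  structure(list(DT = DT, versions_end = versions_end), class = "toy_archive")
}

# Validator: heavier, data-dependent checks; assumes basic checks already passed.
validate_toy_archive <- function(x) {
  stopifnot(inherits(x, "toy_archive"))
  key <- x$DT[c("geo_value", "time_value", "version")]
  if (anyDuplicated(key) > 0L) {
    stop("duplicate (geo_value, time_value, version) rows")
  }
  if (any(x$DT$version > x$versions_end)) {
    stop("found a version later than versions_end")
  }
  invisible(x)
}

# User-facing entry point: construct, then validate. Other heavy operations
# (such as compactification) would be applied at this level as well.
as_toy_archive <- function(DT, versions_end = max(DT$version)) {
  validate_toy_archive(new_toy_archive(DT, versions_end))
}

DT <- data.frame(
  geo_value = c("ca", "ca"),
  time_value = c(1, 1),
  version = c(1, 2),
  value = c(10, 11)
)
arch <- as_toy_archive(DT)
```

Keeping the constructor cheap means code that already knows its data is well-formed can call it repeatedly without paying for the heavier checks each time.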

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

- Make `new_epi_archive()` perform only basic checks and construction, and
  output an "unvalidated" `epi_archive`.
- Make `validate_epi_archive()` operate on (unvalidated or already-validated)
  `epi_archive`s (rather than `as_epi_archive` arguments) and perform the more
  expensive checks omitted by `new_epi_archive`. (But not necessarily all
  required checks; it may be assumed that basic checks have already passed.)
- Move compactification tolerance option to `as_epi_archive()`.
- Standardize to names & defaults of `compactify = TRUE, compactify_abs_tol = 0`
  in most places; `abs_tol` for compactify-specific functions. Add missing
  tolerance setting in `epix_merge()`. Keep `should_compactify` as-is in
  `revision_summary()` for now.
- Change `compactify = NULL` possibility to `compactify = "message"`, and
  message instead of warn.
- Make `compactify_abs_tol = 0` still compactify exactly equal values by using
  `<= compactify_abs_tol` rather than the strict `<` of `dplyr::near()`.
- Update examples and vignettes to not have unnecessary `compactify = TRUE`.
- Use compactification-with-tolerance also on "bare"/"unclassed" integer
  columns.
- Don't use locf-with-tolerance on key columns, to avoid dropping epikeytimes
  entirely in some situations.
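
To illustrate why the nonstrict comparison matters for a zero tolerance, here is a small base-R sketch. The helpers `near_strict()`, `near_nonstrict()`, and `is_redundant()` are hypothetical and written only for this example; as a simplification, each update is compared against the immediately preceding version's value for a single key, which may differ from how the package carries values forward.

```r
# Hypothetical helpers, not the package's internals.
near_strict <- function(x, y, tol) abs(x - y) < tol      # strict, like dplyr::near()
near_nonstrict <- function(x, y, tol) abs(x - y) <= tol  # nonstrict

# Toy version history for one key: values ordered by version.
values <- c(1.0, 1.0, 1.0 + 1e-12, 2.0)

# An update is treated as redundant if it matches the previous version's value
# within tol. The first version is never redundant.
is_redundant <- function(values, tol, near_fn) {
  c(FALSE, near_fn(values[-1], values[-length(values)], tol))
}

# With tol = 0, the strict comparison never flags anything, even exact repeats.
redundant_strict_0 <- is_redundant(values, 0, near_strict)

# With tol = 0, the nonstrict comparison flags only the exact repeat (non-lossy).
redundant_nonstrict_0 <- is_redundant(values, 0, near_nonstrict)

# A positive tolerance additionally flags the near-repeat (lossy).
redundant_nonstrict_tol <- is_redundant(values, 1e-8, near_nonstrict)
```

Under these assumptions, a default of 0 removes only updates that change nothing, while a positive tolerance trades exactness for a smaller archive.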
@brookslogan brookslogan requested a review from nmdefries January 29, 2025 00:19
@brookslogan brookslogan force-pushed the lcb/tweak-archive-tol-interface branch from 6b24dc5 to 889dbee Compare January 29, 2025 00:20
@brookslogan
Contributor Author

brookslogan commented Jan 31, 2025

The extra validation in apply_compactify() actually slows it down significantly in some downstream code I'm writing that calls it repeatedly. We might need to check for a performance regression here or for redundant/slow validation, and/or write an apply_compactify0() variant; maybe here, maybe in the downstream code. [The fancy data-masking filter & potential data.table over-copying might also be a significant slowdown for the rapid case.]

  • todo: check if it's significant slowdown for "normal" case or just go ahead and put perf improvements/options here
    • The filter stuff seems somewhat significant. The validation hit seems minor for "normal" case (and now on some smaller chunks, still fairly minor, like 10% rather than the 100% I was thinking it was...) I've pushed the filter change since it still looks pretty similar.

clobberable_versions_start,
versions_end,
compactify_tol = .Machine$double.eps^0.5) {
versions_end) {

question: I might've started in on this review too soon, but we should have the compactify_abs_tol and compactify args here, right?


Oh, I was confused by the shared arg documentation. So because new_epi_archive is low-level, it never does compactification?

2 participants