Skip to content

Clarify documentation for and sanity-check nafill_buffer usage #320

Closed
@brookslogan

Description

@brookslogan

The args list help entries for the canned forecasters, including ones that don't use lags, is:

nafill_buffer: At predict time, recent values of the training data are
          used to create a forecast. However, these can be 'NA' due to,
          e.g., data latency issues. By default, any missing values
          will get filled with less recent data. Setting this value to
          'NULL' will result in 1 extra recent row (beyond those
          required for lag creation) to be used. Note that we require
          at least 'min(lags)' rows of recent data per 'geo_value' to
          create a prediction. For this reason, setting 'nafill_buffer
          < min(lags)' will be treated as _additional_ allowed recent
          data rather than the total amount of recent data to examine.

This confused me, and I think there may be some implementation bug(s).

Requests:

  • State the default value, Inf, and what it does.
  • State what "regular" values (non-NULL, not < min(lags)) do.
  • "Setting this value to 'NULL' will result in 1 extra recent row (beyond those required for lag creation) to be used." ---
    • An extra row from where, for what? --- I think I understand now after a while of thinking about the name and looking at the implementation, but things could be clearer.
    • Why would this be expected to work so much that it's a special value?
    • It looks like "those required for lag creation" corresponds to max_lags + max_horizon, but factors into keep in x <- dplyr::filter(x, forecast_date - time_value <= keep), which
      • if forecast_date is the default max time_value, seems like more than is required for lag creation.
      • if forecast_date has been passed as the actual forecast date, makes it seem that max_horizon should be something around the reporting latency instead, although maybe this is off by one since keep is at minimum min_required + 1 not min_required.
      • (Also, this type of filter calls into question the naming n_recent.)
    • Actually, trying to use this, I get
out <- arx_forecaster(
  case_death_rate_subset, "death_rate",
  "death_rate",
  args_list = arx_args_list(
    ahead = 16,
    lags = 7,
    n_training = 30,
    nafill_buffer = NULL
  )
)
#> Error in if (is.finite(nafill_buffer)) arg_is_pos_int(nafill_buffer, allow_null = TRUE) : 
#>   argument is of length zero
  • ("For this reason, setting 'nafill_buffer < min(lags)' will be treated as additional allowed recent data rather than the total amount of recent data to examine" --- still sounds like a footgun to have < min(lags) and >= min(lags) do different things.)
  • Note/remind that this NA filling is not used in the training process?
  • For forecasters that don't use lags, provide an explanation that doesn't mention lags?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions