[Backfill corrections] Account for differences in fields in daily input files; convert date fields #1758
Description
Daily and rollup (covering ~4 weeks) input files are formatted slightly differently. Daily input files don't contain the `lag` or `issue_date` fields that are necessary for data filtering and modeling. In the pipeline, we combine rollup and daily files using `bind_rows`, whose output includes the union of all fields seen in the component dfs. If a given field is missing from one of the component dfs, those entries are filled with `NA`. This happens to the `lag` and `issue_date` fields for daily files, as the small sketch below illustrates.
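A toy illustration of that `bind_rows` behavior (the data here is made up, not taken from the real input files):

```r
library(dplyr)

rollup <- tibble(
  geo_value = "01001",
  time_value = as.Date("2022-01-01"),
  issue_date = as.Date("2022-01-03"),
  lag = 2L
)
daily <- tibble(
  geo_value = "01001",
  time_value = as.Date("2022-01-02")
)

# Daily rows get NA for the fields they are missing.
bind_rows(rollup, daily)
#> # A tibble: 2 x 4
#>   geo_value time_value issue_date   lag
#>   <chr>     <date>     <date>     <int>
#> 1 01001     2022-01-01 2022-01-03     2
#> 2 01001     2022-01-02 NA            NA
```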
In the current version of the pipeline, we check whether the `lag` and `issue_date` fields are entirely missing. However, even if this check passes, the missing values from daily files cause problems later in the pipeline.

Fix: add the `issue_date` field to daily dfs on read and, from it, derive the `lag` field (a sketch follows).
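A minimal sketch of that fix, assuming daily files are named for their issue date and that `lag` is the number of days between `time_value` and `issue_date`; the file-naming scheme, the `read_daily_file` helper, and the exact `lag` definition are illustrative, not the pipeline's actual code:

```r
library(dplyr)

# Hypothetical helper: read one daily input file and fill in the missing fields.
read_daily_file <- function(path) {
  arrow::read_parquet(path) %>%
    mutate(
      # Attach the issue date that daily files lack, taken here from the filename
      # (e.g. ".../2022-01-02.parquet" -> 2022-01-02).
      issue_date = as.Date(gsub("\\.parquet$", "", basename(path))),
      # Derive lag as the number of days between the reference date and the issue date.
      lag = as.integer(issue_date - as.Date(time_value))
    )
}
```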
Date fields
`time_value` and `issue_date`, when available, are read in as `datetime` class with the local timezone. We expect these to be dates. All of the datetimes correspond to UTC midnight, so when converted to a timezone behind UTC they fall on the previous local day (the intended UTC date is one day later than the local date). In the Python pipelines that produce the input files, these date fields are actually formatted as `datetime64[ns]` (timezone-naive datetimes). It appears that R's `arrow::read_parquet` assumes these are in UTC and converts them to the host's timezone. Convert these back to UTC and then to dates for appropriate handling (sketched below).
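A hedged sketch of that conversion, assuming the columns arrive as POSIXct in the host timezone; the file path and pipe structure are illustrative, not the pipeline's exact code:

```r
library(dplyr)
library(lubridate)

df <- arrow::read_parquet("input/rollup.parquet") %>%  # path is illustrative
  mutate(
    # Reinterpret each instant in UTC, then keep only the date part.
    time_value = as.Date(with_tz(time_value, tzone = "UTC")),
    issue_date = as.Date(with_tz(issue_date, tzone = "UTC"))
  )
```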
The pipeline expects a field called `geo_value`; input files call it `fips` instead, so rename it (a one-line sketch follows).
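For illustration, the rename could look like this (the df here is a toy stand-in for any freshly read input file):

```r
library(dplyr)

# Toy stand-in for a freshly read input file.
df <- tibble(fips = "01001", value = 0.5)

# The pipeline expects `geo_value`, so rename the `fips` column.
df <- rename(df, geo_value = fips)
```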
Changelog
- io.R
- main.R
- utils.R