[Backfill corrections] Align daily and rollup file formats; make dates portable #1760
Description
Make sure `lag` and `issue_date` fields are included in both daily and combined files. Store dates and location info as strings for better portability. These were previously `datetime64` (causing the timezone issue) and `object` types.

`claims_hosp` backfill file generation was never merged, so changes to those functions are not included here.

Changelog
- `quidel_covidtest`'s `backfill.py`
- `changehc`'s `backfill.py`

Fixes
The original `time_value`s are meant to be plain dates ("2020-01-01") with no timestamp or timezone info.

The `parquet` format uses a typed schema, so if Python writes the `time_value`s as either pure dates (no timestamp info) or strings, R will read them in using the same types. The implicit timezone conversion (from no timezone, which R's `arrow::read_parquet` interprets as UTC, to the local timezone) only happens when the `time_value`s are saved into `parquet` form as `datetime64`s.

pandas doesn't seem to have a "pure" date column type (no time/timezone info). The `datetime64` type assumes the time is 00:00 even if none is given. To drop the time info, we can convert to a pure date, but this changes the column type to `object`. The `object` type can be a little dangerous to use. In this case, it's not clear what type `parquet` will assign to such a column or how R will read it in, and within Python `object`s can behave in unusual ways. Saving the dates as strings is safer.

R will have to do an extra step to convert from string to a date, but it should avoid any weird time/timezone issues.
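
As a rough illustration of the type behavior described above, the sketch below shows the three representations side by side in pandas. The data frame and its values are made up for the example; only the column names `time_value` and `issue_date` come from this description, and this is not the actual code in `backfill.py`.

```python
import pandas as pd

# Hypothetical data; column names follow the PR description, values are invented.
df = pd.DataFrame({
    "time_value": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "issue_date": pd.to_datetime(["2020-01-03", "2020-01-03"]),
})

# datetime64 columns carry an implicit 00:00 time component.
midnight = df["time_value"].iloc[0]  # a Timestamp at 2020-01-01 00:00:00

# .dt.date drops the time, but the column dtype falls back to object.
as_date = df["time_value"].dt.date

# Formatting as ISO strings keeps the value a plain date with no time or timezone.
for col in ["time_value", "issue_date"]:
    df[col] = df[col].dt.strftime("%Y-%m-%d")
```

After the string conversion, the frame could be written out with `df.to_parquet(...)` as usual, and on the R side the columns would arrive as character vectors that can be passed to `as.Date()`.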