Plug hole in MDS type system: add arbitrary-precision decimal #390
Problem
Streaming PR 363 seeks to add delta-to-MDS conversion to Streaming. This means MDS must have an answer for each of the types in `pyspark.sql.types`.
Non-solutions
Currently, we don't: arbitrary-precision decimals (and for that matter, ints) are not natively supported. Several workarounds are available using our current type system:
- `pkl` encoding, which will correctly encode and decode, but with explosively bad data usage.
- `json` encoding to dump the digits as str, then custom-decode the field(s) in your subclass of StreamingDataset back to your arbitrary-precision int/float/Decimal type.
- `bytes` encoding, rolling your own custom encoding and decoding.
Using Pickle will make me look incompetent, you say? How bad can it be?
Oh. That is, if you were expecting a serialized 0 to take 4 bytes, pickle is 5x worse for floats and 10x worse for decimals. At scale, that really adds up.
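The overhead is easy to check by hand. The following is a minimal sketch, not part of this PR: the helper names `pickled_size` and `digits_size` are made up for illustration, and the exact byte counts depend on the Python version and pickle protocol, so none are hard-coded here.

```python
import pickle
from decimal import Decimal


def pickled_size(value) -> int:
    """Number of bytes pickle needs for a single value (default protocol)."""
    return len(pickle.dumps(value))


def digits_size(value) -> int:
    """Number of bytes needed if we just dump the digits as text."""
    return len(str(value).encode("utf-8"))


# Compare both encodings for a zero of each numeric type.
sizes = {
    type(v).__name__: (pickled_size(v), digits_size(v))
    for v in (0, 0.0, Decimal("0"))
}

for name, (pkl, txt) in sizes.items():
    print(f"{name:8s} pickle={pkl:3d} bytes  digits={txt:3d} bytes")
```

The Decimal case is the worst because pickle has to record the module and class name alongside the value on every sample.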
Solution
This PR adds three new MDS types that are small, simple, and general: `str_int`, `str_float`, and `str_decimal`. We take the JSON approach of just dumping digits to str and decoding it back, providing arbitrary precision without any hassle or complexity.
Of course, ASCII digits are not the most efficient way to serialize. To support possible future work, we also include a script analyzing what the most useful configurations of fixed-size binary decimal types would be. The actual fixed-size decimal binary type(s) are not in this PR, as I would like to study the problem a bit more before committing to a certain API for them, and that work is off the critical path.
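To illustrate the digits-as-str idea, here is a rough sketch of the round trip. It is not the code in this PR: `encode_as_digits`, `decode_from_digits`, and the `_PARSERS` table are hypothetical names, and the real MDS encoding classes may be structured differently.

```python
from decimal import Decimal

# Hypothetical mapping from the new MDS type names to Python parsers.
_PARSERS = {
    "str_int": int,
    "str_float": float,
    "str_decimal": Decimal,
}


def encode_as_digits(value) -> bytes:
    """Serialize a number by dumping its textual digits as UTF-8."""
    return str(value).encode("utf-8")


def decode_from_digits(data: bytes, type_name: str):
    """Parse the digits back into the Python type named by type_name."""
    return _PARSERS[type_name](data.decode("utf-8"))


samples = [
    ("str_int", 2**200),  # far beyond 64 bits
    ("str_float", 0.1),  # str() of a float round-trips exactly in Python 3
    ("str_decimal", Decimal("3.14159265358979323846264338327950288")),
]

for type_name, value in samples:
    restored = decode_from_digits(encode_as_digits(value), type_name)
    assert restored == value
    print(type_name, len(encode_as_digits(value)), "bytes")
```

The cost is one byte per digit, which is the inefficiency the fixed-size binary types would later address.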