fix(csv): Ensure df_to_escaped_csv does not coerce integer columns to float #20151

Merged

john-bodley
Member

@john-bodley john-bodley commented May 20, 2022

SUMMARY

When exporting a SQL Lab result set with an integer column containing NULL values, NumPy/Pandas coerces the integers to floats during the pd.DataFrame.applymap(...) call, because NaN is itself a float, i.e.,

>>> import pyarrow as pa
>>>
>>> df = pa.array([1, 2, None]).to_pandas(integer_object_nulls=True).to_frame()
>>> df
      0
0     1
1     2
2  None
>>>
>>> df.applymap(lambda x: x)
     0
0  1.0
1  2.0
2  NaN

Type coercion in Pandas is overly magical and often undesirable. The fix is somewhat yuck as well, iterating over the columns and rows of the DataFrame and escaping only those cells which are actually str.
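
For illustration, the cell-by-cell approach can be sketched roughly as below. This is a minimal sketch and not the code merged in this PR: `escape_cell` and `escape_str_cells` are hypothetical names, and the escaping rule (prefixing a single quote to values that start with `=`, `+`, `-` or `@`) is an assumption standing in for whatever the real escaper does. The input frame is built by hand with `dtype=object` rather than via PyArrow.

```python
import pandas as pd


def escape_cell(value: str) -> str:
    """Hypothetical escaper: quote-prefix values that look like formulas."""
    if value and value[0] in ("=", "+", "-", "@"):
        return "'" + value
    return value


def escape_str_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Escape only the cells that are actually str, leaving other cells untouched."""
    out = df.copy()
    for name in out.columns:
        column = out[name]
        # Only object columns can mix strings, integers and None.
        if column.dtype != object:
            continue
        escaped = [
            escape_cell(value) if isinstance(value, str) else value
            for value in column.tolist()
        ]
        # Re-declare dtype=object so Pandas does not re-infer float64.
        out[name] = pd.Series(escaped, index=out.index, dtype=object)
    return out


df = pd.DataFrame(
    {
        "n": pd.Series([1, 2, None], dtype=object),
        "s": pd.Series(["=1+1", "ok", None], dtype=object),
    }
)
result = escape_str_cells(df)
csv_text = result.to_csv(index=False)
```

Rebuilding each column as an explicit object Series is what keeps the integers from being promoted; the cost is a Python-level loop over every cell in object columns.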

I tried a number of more performant/vectorized solutions, but at the end of the day they rely on numpy.arrays, which require a priori declared types, otherwise coercion is performed. Note the numpy.dtype(object) data type is used to handle both strings and integers (a byproduct of how data is exported from PyArrow), given that special handling is required for dealing with missing values in integer columns, per here. The TL;DR is this is all very unpleasant.
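
To make the point about declared types concrete, here is a small sketch of the inference behaviour. It only uses standard NumPy/Pandas constructors; the variable names are illustrative and it is not taken from the PR itself.

```python
import numpy as np
import pandas as pd

# Without a declared type, a missing value forces the integers to float64.
inferred = pd.Series([1, 2, None])

# Declaring dtype=object keeps the Python ints and None exactly as given.
as_object = pd.Series([1, 2, None], dtype=object)

# The same promotion happens in plain NumPy once NaN is in the data.
np_inferred = np.array([1, 2, np.nan])

print(inferred.dtype)     # float64
print(as_object.dtype)    # object
print(np_inferred.dtype)  # float64
```

Any vectorized path that rebuilds the array without `dtype=object` falls back to the first behaviour, which is why the integers end up as `1.0` and `2.0` in the exported CSV.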

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Added unit tests.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley changed the title from "fix(csv): Ensure df_to_escaped_csv handles NULL" to "fix(csv): Ensure df_to_escaped_csv does not coerce integer columns to float" May 20, 2022

codecov bot commented May 20, 2022

Codecov Report

Merging #20151 (b7f2197) into master (56e9695) will decrease coverage by 0.16%.
The diff coverage is 100.00%.

❗ Current head b7f2197 differs from pull request most recent head c1ac817. Consider uploading reports for the commit c1ac817 to get more accurate results

@@            Coverage Diff             @@
##           master   #20151      +/-   ##
==========================================
- Coverage   66.45%   66.29%   -0.17%     
==========================================
  Files        1721     1721              
  Lines       64513    64518       +5     
  Branches     6806     6806              
==========================================
- Hits        42875    42772     -103     
- Misses      19906    20014     +108     
  Partials     1732     1732              
Flag      Coverage Δ
hive      ?
mysql     82.15% <100.00%> (+<0.01%) ⬆️
postgres  82.21% <100.00%> (+<0.01%) ⬆️
presto    ?
python    82.29% <100.00%> (-0.35%) ⬇️
sqlite    81.95% <100.00%> (+<0.01%) ⬆️
unit      ?

Flags with carried forward coverage won't be shown.

Impacted Files                         Coverage Δ
superset/utils/csv.py                  97.91% <100.00%> (+0.24%) ⬆️
superset/db_engines/hive.py            0.00% <0.00%> (-85.19%) ⬇️
superset/db_engine_specs/hive.py       70.22% <0.00%> (-15.65%) ⬇️
superset/db_engine_specs/presto.py     83.36% <0.00%> (-5.34%) ⬇️
superset/connectors/sqla/models.py     88.41% <0.00%> (-1.19%) ⬇️
superset/initialization/__init__.py    91.13% <0.00%> (-0.36%) ⬇️
superset/db_engine_specs/base.py       88.08% <0.00%> (-0.34%) ⬇️
superset/models/core.py                88.91% <0.00%> (-0.25%) ⬇️
superset/utils/core.py                 90.02% <0.00%> (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 56e9695...c1ac817.

Member

@ktmud ktmud left a comment

LGTM

@john-bodley john-bodley merged commit 97ce920 into apache:master May 31, 2022
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
Co-authored-by: John Bodley <john.bodley@airbnb.com>
michael-s-molina pushed a commit that referenced this pull request Aug 30, 2022
Co-authored-by: John Bodley <john.bodley@airbnb.com>
(cherry picked from commit 97ce920)
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 and removed 🚢 2.0.1 labels Mar 13, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/S 🍒 1.5.2 🍒 1.5.3 🚢 2.0.0