API: Deprecate compact_ints and use_unsigned in read_csv #13323

Closed

gfyoung
Member

@gfyoung gfyoung commented May 30, 2016

Title is self-explanatory.

xref #12686 - I don't quite understand why these are marked (if at all) as internal to the C engine only, as the benefits of having these options accepted for the Python engine are quite clear based on the documentation I added as well.

The implementation simply calls the already-written function in pandas/parsers.pyx. As it isn't specific to the TextReader class, crossing over to grab this function from Cython (instead of duplicating it in pure Python) seems reasonable while maintaining the separation between the C and Python engines.

@gfyoung gfyoung force-pushed the python-engine-compact-ints branch 3 times, most recently from 4924f90 to 0d56c09 Compare May 30, 2016 01:58
@gfyoung
Member Author

gfyoung commented May 30, 2016

FYI: the test_compact_ints test that I deleted was a duplicate of test_compact_ints_as_recarray and also did not accurately reflect the test data.

@codecov-io

codecov-io commented May 30, 2016

Current coverage is 84.22%

Merging #13323 into master will decrease coverage by <.01%

@@             master     #13323   diff @@
==========================================
  Files           138        138          
  Lines         50713      50671    -42   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          42715      42676    -39   
+ Misses         7998       7995     -3   
  Partials          0          0          

Powered by Codecov. Last updated by 99e78da...0d56c09

@jreback
Contributor

jreback commented May 30, 2016

It is fine to document these. In reality the parser should not be doing this, and these options just add to an already huge API. Better to have this as an option to to_numeric() or something. Or the user can simply pass in dtypes as appropriate.

I get that this makes things more 'automatic'.

So I think that we should deprecate these for now.

@jorisvandenbossche ?
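
The alternative mentioned above, passing dtypes explicitly, can be sketched as follows. This is only an illustration: the sample data and the column names `a` and `b` are made up, and the chosen dtypes are assumptions about what a user would pick for such values.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are invented for illustration.
data = "a,b\n1,200\n2,201\n3,202\n"

# Request compact integer dtypes per column instead of relying on
# automatic compaction by the parser.
df = pd.read_csv(io.StringIO(data), dtype={"a": "int8", "b": "uint8"})

# Collect the resulting dtypes as plain strings for easy inspection.
dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(dtypes)
```

The trade-off raised later in the thread still applies: with many small-integer columns, spelling out every dtype by hand is less convenient than a single flag.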

@jreback jreback added IO CSV read_csv, to_csv API Design labels May 30, 2016
@gfyoung
Member Author

gfyoung commented May 30, 2016

@jreback : No issue with simplifying the API, though it did seem like a fairly decent memory-conscious option. If it is deprecated, telling users to refer to dtype might make more sense, because I don't see where we would be going by adding these to to_numeric, though that would be more inconvenient if there was a large table of small integers.

@jreback
Contributor

jreback commented May 30, 2016

@gfyoung you could simply add them as options to to_numeric, which is basically a smart .astype (one that handles errors and such).

@gfyoung
Member Author

gfyoung commented May 30, 2016

How would they be toggled from the parser API exactly?

@jreback
Contributor

jreback commented May 30, 2016

@gfyoung they wouldn't; that's why we deprecate :>

Though you could very easily move these to a post-processing step and simply call something like

if compact_ints:
    x = pd.to_numeric(x, errors='ignore', compact_ints=compact_ints, use_unsigned=use_unsigned)

and simply move the downcast_int64 code into a call inside pd.to_numeric (and move the code to inference.pyx just for cleanliness).

This is independent of deprecation or not though (e.g. you can fix this and separately deprecate).
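
The kind of downcasting being discussed can be sketched in pure Python. This is not the Cython downcast_int64 code from pandas: `smallest_int_dtype` and the two range tables are hypothetical names, and the fallback to signed types when negative values are present is an assumption made for the sketch.

```python
# Hypothetical tables of (dtype name, min, max); not pandas internals.
SIGNED = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]
UNSIGNED = [
    ("uint8", 0, 2**8 - 1),
    ("uint16", 0, 2**16 - 1),
    ("uint32", 0, 2**32 - 1),
    ("uint64", 0, 2**64 - 1),
]


def smallest_int_dtype(values, use_unsigned=False):
    """Return the name of the smallest integer dtype that holds all values.

    Assumes a non-empty sequence of Python ints.
    """
    lo, hi = min(values), max(values)
    # Assumption: unsigned types are only considered when nothing is negative.
    candidates = UNSIGNED if (use_unsigned and lo >= 0) else SIGNED
    for name, dtype_min, dtype_max in candidates:
        if dtype_min <= lo and hi <= dtype_max:
            return name
    raise ValueError("values do not fit in a 64-bit integer dtype")


print(smallest_int_dtype([1, 2, 3]))
print(smallest_int_dtype([0, 200], use_unsigned=True))
```

Only the minimum and maximum need to be inspected, which is why such a step is cheap enough to run as post-processing after parsing.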

@gfyoung
Member Author

gfyoung commented May 30, 2016

Ah, I see. Will do that then.

@gfyoung gfyoung force-pushed the python-engine-compact-ints branch from 0d56c09 to dff6a69 Compare May 30, 2016 21:56
@gfyoung
Member Author

gfyoung commented May 30, 2016

Actually, it's not as straightforward as it seems. parser.pyx creates its own list of na_values (see here) that it uses when casting inside _downcast_to_int64, and that list of values is used in other functions within the file. If I move it to inference.pyx to be called via to_numeric (which I'm no longer sure about since that is a further API change), I would have to duplicate the code that creates na_values - how does the refactoring sound now?

@gfyoung gfyoung force-pushed the python-engine-compact-ints branch from dff6a69 to a42eb7c Compare May 30, 2016 22:34
@jreback
Contributor

jreback commented May 30, 2016

you can just do this in maybe_convert_numeric or equivalent

@gfyoung gfyoung force-pushed the python-engine-compact-ints branch 2 times, most recently from a753b45 to c0ff67e Compare May 31, 2016 01:16
@gfyoung
Member Author

gfyoung commented May 31, 2016

@jreback : Refactored downcast_to_int64 as a function in inference.pyx, and Travis is happy. So just waiting to see what @jorisvandenbossche has to say about deprecation.

@jreback
Contributor

jreback commented May 31, 2016

I don't think this works if the arr has any na_values in it
(not tested either)

@gfyoung
Member Author

gfyoung commented May 31, 2016

@jreback : I literally moved the function from one location to another. If that's an issue, then it would have appeared in the parser as well. Will add a test though for that.

@jreback
Contributor

jreback commented May 31, 2016

@gfyoung no, if it's by definition int64 coming in, then na_values are superfluous here. I know you copied it. That's not to say it was completely correct before.

@gfyoung
Member Author

gfyoung commented May 31, 2016

@jreback : na_values is actually used in several places in the implementation, so it's not as superfluous as you make it sound (unless you would like to hard-code values). Also, the function does work when the na_value for np.int64 is included in the array. Added test to make sure.

@gfyoung gfyoung force-pushed the python-engine-compact-ints branch 2 times, most recently from 1d19e36 to a1263d4 Compare May 31, 2016 12:35
@jreback
Contributor

jreback commented May 31, 2016

@gfyoung you are missing my point. int arrays don't support missing values (currently), so if you actually have a missing value then this must be converted to float which is impossible in the current impl. so they ARE superfluous.
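
The claim above, that integer arrays cannot hold a missing value and must be converted to float first, can be checked with NumPy. This is a minimal sketch of the general NumPy behaviour, not of the parser code under review; the variable names are made up.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)

# Writing NaN into an integer array is rejected by NumPy.
try:
    ints[0] = np.nan
    stored_nan = True
except ValueError:
    stored_nan = False

# Converting to float first makes room for the missing value.
floats = ints.astype(np.float64)
floats[0] = np.nan

print(stored_nan, floats.dtype)
```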

@gfyoung
Member Author

gfyoung commented May 31, 2016

@jreback : That's not the point of this function at all. If it has missing values (i.e. nan), the array won't go through this path. That na_values parameter is hardly superfluous if you look at what it actually has stored in it.

@jreback
Contributor

jreback commented May 31, 2016

@gfyoung then prove it by showing an example

@gfyoung
Member Author

gfyoung commented May 31, 2016

@jreback : I think my test cases speak for themselves.

@jreback
Contributor

jreback commented May 31, 2016

it's kind of like a fancy astype

@gfyoung gfyoung force-pushed the python-engine-compact-ints branch from 7467dda to 48ab4db Compare May 31, 2016 21:24
@gfyoung
Member Author

gfyoung commented Jun 1, 2016

Hmm...not sure why my test for the deprecation is failing on just one machine. What am I doing wrong here? It shouldn't be this difficult to add, should it?

@@ -76,6 +76,7 @@ Other enhancements

- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``decimal`` option (:issue:`12933`)
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``na_filter`` option (:issue:`13321`)
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``compact_ints`` and ``use_unsigned`` options (:issue:`13323`)
Member

It's a bit strange to announce this as a new feature, while it has been deprecated at the same time ...

Member Author

@gfyoung gfyoung Jun 1, 2016

That is a fair point, but I did add support though. The exact direction of this PR is still somewhat in flux, and I'll adjust the whatsnew once things have settled down a bit. More importantly, I would like to figure out why Travis is unhappy with me at the moment, as the tests pass on my machine.

Member

yes, certainly fine to change things here in the end (and not sure why travis is failing, I also do not see something obvious)

Contributor

hmm, I agree here. Since we are deprecating this, I think that no need to announce it for python engine support.

Member Author

Fair enough. Will change as soon as I placate the Travis.

@gfyoung
Member Author

gfyoung commented Jun 1, 2016

@jreback : any thoughts on why Travis is unhappy on one machine? Seems like @jorisvandenbossche and I are stumped at the moment.

@jorisvandenbossche
Member

Regarding the deprecation, I would be fine with that. I don't think many people have heard of these keywords (I certainly hadn't), although it would be nice if we could check this somewhere. It would also be nice to have an alternative available (like the pd.to_numeric idea).

I am only wondering whether we need to add the documentation to the docstring when we are deprecating it at the same time (we don't actually want to encourage people to use it, and the docstring is already really long).

@gfyoung
Member Author

gfyoung commented Jun 1, 2016

@jorisvandenbossche : yeah, me neither. I hadn't really heard of it until I started looking into what it did in the CParser. Surprising just how many options are undocumented for a function that works quite well for most everyday purposes such as my own. 😄

The pd.to_numeric idea will need some work to flesh out, but I don't think that it is appropriate for this PR as @jreback has already mentioned. I think the doc-string is still useful because comprehensive documentation is always a good idea IMO just so that there is no confusion about all of these options. Otherwise, how would you know that it was deprecated?

})

# default behaviour for 'use_unsigned'
out = self.read_csv(StringIO(data), compact_ints=True)
Contributor

Every one of these needs a tm.assert_produces_warning block. The failure on Travis arises because, further below where you DO try to catch the warning, it has already been triggered (and hence doesn't show). These are somewhat hard to isolate. Usually you just keep running smaller and smaller blocks of code to figure out where the uncaught warning is triggered, then inspect the code.

Member Author

Ah! I did not know that. Fixed.
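
The point about a warning that "has already been triggered" can be illustrated with the standard library alone. This is a minimal sketch, not the pandas test code: `read_with_deprecated_option` is a hypothetical stand-in for a read_csv call that uses a deprecated keyword, and the suppression shown is the once-per-location behaviour of Python's "default" warning filter.

```python
import warnings


def read_with_deprecated_option():
    # Hypothetical stand-in for a call that emits a deprecation warning.
    warnings.warn("compact_ints is deprecated", FutureWarning)
    return 42


def count_caught(filter_action):
    """Call the function twice and count how many warnings are recorded."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(filter_action)
        read_with_deprecated_option()
        read_with_deprecated_option()
    return len(caught)


# With "always", every call is recorded.
always_count = count_caught("always")
# With "default", a repeat from the same code location is suppressed.
default_count = count_caught("default")

print(always_count, default_count)
```

Wrapping each call in its own catching block, as requested above, avoids depending on whether an earlier, uncaught call already marked the warning as shown.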

@jreback
Contributor

jreback commented Jun 1, 2016

I am ok with doc-stringing this (always helpful), and deprecating is fine with me. Whenever we add this to pd.to_numeric (which I think is the best place), we can simply update the deprecation warnings to recommend the new usage (even if it's done later). Maybe make a placeholder issue for that enhancement.

@gfyoung gfyoung force-pushed the python-engine-compact-ints branch from c51c5b6 to 95f7ba8 Compare June 2, 2016 17:58
@gfyoung
Member Author

gfyoung commented Jun 2, 2016

@jreback @jorisvandenbossche : Travis is happy now, and I made the requested doc changes. Ready to merge if there are no other concerns.

@jorisvandenbossche
Member

It still feels a bit strange to add (to the python parser) and doc a feature that we are deprecating directly :-), but OK to merge for me

@jreback
Contributor

jreback commented Jun 2, 2016

@jorisvandenbossche though this will prob be deprecated for a while...

@jreback
Contributor

jreback commented Jun 2, 2016

@gfyoung ty, pls also create an issue to add to pd.to_numeric
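
For readers on current pandas: pd.to_numeric accepts a `downcast` argument that covers this use case. Whether that argument came out of the issue requested above is not stated in this thread, so treat the connection as an assumption. A minimal sketch with made-up data:

```python
import pandas as pd

# A small int64 column, as the parser would produce by default.
s = pd.Series([1, 2, 3], dtype="int64")

# Downcast to the smallest signed and unsigned integer dtypes that fit.
signed = pd.to_numeric(s, downcast="integer")
unsigned = pd.to_numeric(s, downcast="unsigned")

print(signed.dtype, unsigned.dtype)
```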

@gfyoung gfyoung deleted the python-engine-compact-ints branch June 2, 2016 23:19
@gfyoung gfyoung changed the title from "ENH: Add support for compact_ints and use_unsigned in Python engine" to "API: Deprecate compact_ints and use_unsigned in read_csv" Jun 2, 2016
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 19, 2017
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 20, 2017
jreback pushed a commit that referenced this pull request Dec 21, 2017
* CLN: Drop compact_ints/use_unsigned from read_csv

Deprecated in v0.19.0

xref gh-13323

* CLN: Remove downcast_int64 from inference.pyx

It was only being used for the compact_ints
and use_unsigned parameters in read_csv.