-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Add decimal parameter to to_numeric #4674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
there is a |
The idea of factoring out this functionality is so that it can be used for parsing numbers in other locations without having to go through the |
why would you not simply use |
How would you use replace for the parenthesis? Bear in mind that the value then needs to be negated. |
I suppose I could replace the parenthesis with a leading negative sign and then convert that string to an int/ float. |
That being said I was thinking of places where I wanted to be able to take an arbitrary string which might not be in a |
that would work; replace should infer the return types correctly |
@jreback That addresses using replace for a DataFrame or Series but not in the general sense where we have a string to parse. |
out of scope for pandas |
Fair enough. I was just trying to factor out code that could be used both for pandas and for what is out of scope for pandas. |
@cancan101 close this then? or do you want to try to do some of this in the actual parse itself? (which depending on the API/scope may of may not be useful). As its a small edge case and if you did have parens/dollar signs in your numbers you can just replace after you read it in, because there is a perf impact in even checking for it on more general parsing |
I was going to / could write this such that there is no performance impact. I would instantiate a parser at the following location: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1505 based on parsing options. Theoretically are a few more branches per FILE (not per cell or line). |
Well, so you want to parse this is JUST for PythonParser, right? (if so I am -1 on this); the c-parser is most often the default/used |
What do you mean -1 on this? Like you don't like the change? Or you care less about the performance hit because this code path isn't the default? |
And yes, what I was thinking was just the |
no, doing a change just for the |
This does not actually have to change the API for the |
for what purpose? |
There are cases in the HTML parser that I want to be able to use this functionality. |
Can you show an example of input where this is necessary? |
@cancan101 I would not be averse to seeing a suite of |
Having a set of converters seems reasonable. Do you think it makes sense then for the |
when I mean converters, I really mean essentially a set of regular expressions/functions that you just pass to I mean you could argue that the parents/ignore dollar signs are general purpose too, but I have never seen people actually write those to csv's (and only use them as formatting). |
Far more often I see these in HTML tables, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI I was just hoping to centralize the number parsing logic for CSV and for HTML in one place. It looks like logic for parsing HTML will need to be richer. |
|
@jreback We still have not really answered the question as to whether we should copy and paste the following code from
|
@cancan101 also, keep in mind that it's often much simpler to change things around after converting into a |
@jtratner That makes sense. With that line of thinking, why even have this |
@cancan101 its all a trade-off. More complicated parsing code by MUCH faster for in-line conversions, (e.g. having thousands separators) is not so uncommon, and once you take them out you are left with a float. Doing after conversions is much more memory intensive (and time intensive), you have to scan twice (or more). Getting good/great performance is hard. Always solve the problem first, then optimize. That said it is keeping in mind that certain bottlenecks can be dealt with by more complicated code. |
@jreback Fair enough. I was going to say there is there is something to be said about keeping the parsers about extracting structure from the info (column, indexes, etc) and leaving extracting meaning (i.e. number parsing) to the post processing. This would make writing new parsers simpler. |
@cancan101 you have a fair point; in this case it was all about performance when @wesm wrote the new parsers late last year. I am not sure of any gains in anew 'new' parsers thought. What 'new' parsers are needed? |
At this point the only one on my horizon would be the XBRL parser (#4407), although I don't think that one will have issues with string data. |
It also is not all about "new" parsers but maintaining and improving old parsers. This does contract jtratner's previous points about keeping the APIs simple. Admittedly there might be a valid performance reason here to do so. |
a lot of times people want different API's for things but the implementation can be a single one (with say options) to support these API's the parsers are the reverse there is bascially a single API, with different implementations (e.g. most use read_csv machinery, but excel puts a layer on top, and HTML is completely different, as is the stata and JSON implementations for that matter) |
Here the number of rows is small (<100), so for me, I can definitely go the "read as string and then apply more complicated parsing logic" without worrying about performance. |
Did a decision get reached on this? |
I think it would be nice to add a |
following onto this. should to_numeric support cases like thousand seperator? Or that is better handled outside to_numeric. The issue I am having is sometimes the data comes in as "1,200" instead of "1200" or 1200. So when I call to_numeric
When I would want "1,200" to be coverted to 1200 when I call to_numeric. |
take |
I see that to_numeric() uses precise_xstrtod() from tokenizer.c to do the string to numeric conversion, this function already has functionality to handle different characters for decimal and thousands, but the normal separators are hardcoded in pd_parser.c where the call to precise_xstrtod() occurs. I could add parameters to to_numeric() for decimal and thousand separators like read_csv() has, and then pass these down to the underlying converter? |
Create a class / method for number parsing. This would be a generalization of the code that remove the thousands separators from numbers (see: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1502).
Other items include:
I can submit PR for this. Do people have other suggestions for this?
The text was updated successfully, but these errors were encountered: