Speed up tokenizing of a row in csv and xstrtod parsing #25784
Can you run the asv benchmarks for csv and report the changes? We may not have benchmarks that specifically hit the code you changed; if so, can you add one?
doc/source/whatsnew/v0.25.0.rst
@@ -132,7 +132,8 @@ Performance Improvements
- Improved performance of :meth:`Series.searchsorted`. The speedup is especially large when the dtype is
int8/int16/int32 and the searched key is within the integer bounds for the dtype (:issue:`22034`)
- Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)
- Improved performance of `tokenize_bytes` in `tokenizer.c`
Also say :meth:`read_csv`; a user has no idea what any of the other things are.
So this basically gives a 5% to 10% speedup on mini-benchmarks. Note that this scales quite well: on CSV files that are a million lines long it still gives the same 5-10%.
a4f6dcd to 12a6df9
The results are even more promising if I allow more warmup and more sampling time, so that Turbo Boost and frequency scaling don't skew the measurements too much. Running
@jreback I've fixed the whatsnew entry per your comment and rebased onto the latest master for a clean history.
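For readers who want to reproduce the warmup-then-sample idea mentioned above, here is a minimal sketch. It is not the asv setup used in this PR: `make_csv`, `parse_with_stdlib` and `bench` are hypothetical helper names, the data is synthetic, and the standard-library `csv` module stands in for the parser so the snippet runs without pandas. To time pandas instead, pass a function that calls `pandas.read_csv(io.StringIO(text))`.

```python
import csv
import io
import statistics
import time


def make_csv(n_rows, n_cols=5):
    """Build an in-memory CSV of floats (synthetic data, for illustration only)."""
    lines = [
        ",".join(f"{r * n_cols + c}.5" for c in range(n_cols))
        for r in range(n_rows)
    ]
    return "\n".join(lines) + "\n"


def parse_with_stdlib(text):
    """Stand-in parser (assumption): tokenizes rows and converts fields to float."""
    return [[float(x) for x in row] for row in csv.reader(io.StringIO(text))]


def bench(func, arg, warmup=3, repeats=10):
    """Return per-call timings in seconds, measured after discarding warm-up runs."""
    # Warm-up calls give caches and CPU frequency scaling time to settle.
    for _ in range(warmup):
        func(arg)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    data = make_csv(20_000)
    t = bench(parse_with_stdlib, data)
    print(f"min {min(t):.4f}s  median {statistics.median(t):.4f}s")
```

Reporting the minimum alongside the median is a common choice here, since the minimum is least affected by background noise while the median shows what a typical run looks like.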
Codecov Report
@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
- Coverage   91.26%   91.26%   -0.01%
==========================================
  Files         173      173
  Lines       52982    52982
==========================================
- Hits        48356    48355       -1
- Misses       4626     4627       +1
Codecov Report
@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
+ Coverage   91.26%   91.27%   +<.01%
==========================================
  Files         173      173
  Lines       52982    53002      +20
==========================================
+ Hits        48356    48375      +19
- Misses       4626     4627       +1
comment on whatsnew, ping on green.
lgtm.
@jreback all tests are green
thanks @vnlitvin nice patch!
* upstream/master: (55 commits)
  PERF: Improve performance of StataReader (pandas-dev#25780)
  Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)
  BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588)
  BUG-24971 copying blocks also considers ndim (pandas-dev#25521)
  CLN: Panel reference from documentation (pandas-dev#25649)
  ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955)
  BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage (pandas-dev#25770)
  ...
git diff upstream/master -u -- "*.py" | flake8 --diff
I will update the PR when CI finishes running, since I only tested io.parser.test_common locally.