Speed up tokenizing of a row in csv and xstrtod parsing #25784

Merged
merged 7 commits into from
Mar 20, 2019

@vnlitvinov vnlitvinov commented Mar 19, 2019

  • closes: N/A
  • tests passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I will update the PR when CI finishes running, since I have only tested io.parser.test_common locally.

@jreback jreback left a comment

Can you run the asv benchmarks for csv and report the changes? We may not have benchmarks that specifically hit the code you changed; if so, can you add one?
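
A benchmark of the kind requested here might look like the following minimal sketch. asv collects plain classes whose `setup` method prepares state and whose `time_*` methods are timed. The class name `ReadCSVManyFloats`, the helper `make_float_csv`, and the parameter values are hypothetical; they are not taken from this PR or from the existing pandas benchmark suite.

```python
import io
import random


def make_float_csv(nrows, ncols=5, seed=0):
    """Build CSV text with a header row and `nrows` rows of float fields."""
    rng = random.Random(seed)
    header = ",".join(f"col{i}" for i in range(ncols))
    rows = (
        ",".join(f"{rng.uniform(-1e6, 1e6):.6f}" for _ in range(ncols))
        for _ in range(nrows)
    )
    return "\n".join([header, *rows]) + "\n"


class ReadCSVManyFloats:
    # asv runs each time_* method once per value listed in `params`.
    params = [10_000, 100_000]
    param_names = ["nrows"]

    def setup(self, nrows):
        # setup() is not timed. pandas is imported here (an illustration
        # choice) so the data helper above stays usable without pandas.
        import pandas as pd

        self.read_csv = pd.read_csv
        self.data = make_float_csv(nrows)

    def time_read_float_rows(self, nrows):
        # Float-heavy input is intended to exercise both row tokenizing
        # and string-to-double conversion inside read_csv.
        self.read_csv(io.StringIO(self.data))
```

Keeping data generation in `setup` means only the `read_csv` call itself contributes to the reported time.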

@@ -132,7 +132,8 @@ Performance Improvements
- Improved performance of :meth:`Series.searchsorted`. The speedup is especially large when the dtype is
int8/int16/int32 and the searched key is within the integer bounds for the dtype (:issue:`22034`)
- Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)

- Improved performance of `tokenize_bytes` in `tokenizer.c`

Also mention :meth:`read_csv`; a user has no idea what any of the other things are.

@jreback jreback added the Performance (memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Mar 19, 2019
@vnlitvinov

Results of `asv continuous -f 1.05 origin/master HEAD -b io.csv`:

before        after         ratio  test name
[e8d951d]     [a4f6dcd]
master        speed-up-tokenizer
1.58±0.02ms   1.49±0.02ms   0.94   io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
5.23±0.1ms    4.87±0.03ms   0.93   io.csv.ReadUint64Integers.time_read_uint64_na_values
10.1±0.2ms    9.15±0.3ms    0.91   io.csv.ReadCSVSkipRows.time_skipprows(10000)
13.0±0.3ms    11.6±0.2ms    0.89   io.csv.ReadCSVThousands.time_thousands('|', None)

So this gives a speedup of roughly 5% to 10% on the mini-benchmarks. Note that this scales quite well: on CSV files that are a million lines long it still gives the same 5-10%.
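
The ratio column is the after time divided by the before time, so the quoted speedup can be checked by hand. The sketch below recomputes it from the mean timings in the table above, ignoring the ± spread. The function names are illustrative, and the `1 / factor` cutoff is an assumption about how `-f 1.05` filters results rather than something stated in this thread.

```python
# Mean timings in ms copied from the first asv run (± spread ignored).
TIMINGS_MS = {
    "ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')": (1.58, 1.49),
    "ReadUint64Integers.time_read_uint64_na_values": (5.23, 4.87),
    "ReadCSVSkipRows.time_skipprows(10000)": (10.1, 9.15),
    "ReadCSVThousands.time_thousands('|', None)": (13.0, 11.6),
}


def ratio(before, after):
    """after / before; values below 1.0 mean the new code is faster."""
    return after / before


def time_saved_percent(before, after):
    """Percentage of the original run time that was removed."""
    return (1.0 - after / before) * 100.0


def is_improvement(before, after, factor=1.05):
    # Assumed threshold: count a change as an improvement only when the
    # ratio drops below 1 / factor (about 0.952 for factor=1.05).
    return ratio(before, after) < 1.0 / factor


for name, (before, after) in TIMINGS_MS.items():
    saved = time_saved_percent(before, after)
    print(f"{ratio(before, after):.2f}  {saved:4.1f}%  {name}")
```

On these four rows the time saved works out to roughly 6% to 11%, in line with the 5% to 10% quoted.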

@vnlitvinov

The results are even more promising if I allow more warmup and more sampling time, so that Turbo Boost and frequency scaling don't distort the measurements as much.

Running `asv continuous -f 1.05 origin/master HEAD -b io.csv -a sample_time=2 -a warmup_time=2` yields:

before        after         ratio  test name
[e8d951d]     [a4f6dcd]
master        speed-up-tokenizer
34.9±0.2ms    32.9±1ms      0.94   io.csv.ReadCSVCategorical.time_convert_direct
13.5±0.02ms   12.6±0.08ms   0.93   io.csv.ReadCSVThousands.time_thousands(',', ',')
5.10±0.07ms   4.71±0.09ms   0.92   io.csv.ReadUint64Integers.time_read_uint64_neg_values
14.6±0.06ms   13.0±0.04ms   0.89   io.csv.ReadCSVThousands.time_thousands('|', ',')
16.0±0.3ms    13.8±0.09ms   0.86   io.csv.ReadCSVSkipRows.time_skipprows(None)
10.3±0.1ms    8.81±0.1ms    0.86   io.csv.ReadCSVSkipRows.time_skipprows(10000)
12.9±0.1ms    10.7±0.09ms   0.84   io.csv.ReadCSVThousands.time_thousands('|', None)
13.0±0.05ms   10.8±0.08ms   0.83   io.csv.ReadCSVThousands.time_thousands(',', None)
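
The effect of a longer warmup and sampling window can be illustrated outside asv with the standard library alone. This is a rough sketch of the idea, not asv's actual measurement loop; `measure` and its parameters are hypothetical names that merely echo the `warmup_time` and `sample_time` attributes passed on the command line above.

```python
import statistics
import time


def measure(func, warmup_time=2.0, sample_time=2.0):
    """Run `func` untimed for `warmup_time` seconds, then collect
    per-call timings for about `sample_time` seconds."""
    # Warmup: results are discarded; the goal is to let caches fill and
    # the CPU clock settle before anything is recorded.
    deadline = time.perf_counter() + warmup_time
    while time.perf_counter() < deadline:
        func()

    # Sampling: record each call until the sampling budget is used up.
    samples = []
    deadline = time.perf_counter() + sample_time
    while True:
        start = time.perf_counter()
        func()
        end = time.perf_counter()
        samples.append(end - start)
        if end >= deadline:
            break

    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.median(samples), spread, len(samples)


if __name__ == "__main__":
    median, spread, n = measure(lambda: sum(range(10_000)), 0.2, 0.2)
    print(f"{median * 1e6:.1f}±{spread * 1e6:.1f}us over {n} samples")
```

A longer sampling window yields more samples, which tends to make the reported spread a more reliable guide to whether a small ratio change is real.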

@vnlitvinov

@jreback I've fixed the whatsnew entry per your comment and rebased onto the latest master for a clean history.

@codecov

codecov bot commented Mar 20, 2019

Codecov Report

Merging #25784 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
- Coverage   91.26%   91.26%   -0.01%     
==========================================
  Files         173      173              
  Lines       52982    52982              
==========================================
- Hits        48356    48355       -1     
- Misses       4626     4627       +1
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.76% <ø> (ø) ⬆️

Impacted Files           Coverage Δ
pandas/util/testing.py   89.3% <0%> (-0.11%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4663951...12a6df9.

@codecov

codecov bot commented Mar 20, 2019

Codecov Report

Merging #25784 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
+ Coverage   91.26%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       52982    53002      +20     
==========================================
+ Hits        48356    48375      +19     
- Misses       4626     4627       +1
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.77% <ø> (+0.01%) ⬆️

Impacted Files                     Coverage Δ
pandas/util/testing.py             89.3% <0%> (-0.11%) ⬇️
pandas/core/series.py              93.67% <0%> (-0.01%) ⬇️
pandas/core/ops.py                 91.74% <0%> (ø) ⬆️
pandas/core/frame.py               96.79% <0%> (ø) ⬆️
pandas/core/generic.py             93.52% <0%> (ø) ⬆️
pandas/core/computation/expr.py    88.52% <0%> (+0.35%) ⬆️
pandas/core/computation/common.py  89.47% <0%> (+3.75%) ⬆️

Last update 4663951...72c7570.

@jreback jreback added this to the 0.25.0 milestone Mar 20, 2019

@jreback jreback left a comment

comment on whatsnew, ping on green.

@jreback

jreback commented Mar 20, 2019

lgtm.

@vnlitvinov

> ping on green

@jreback all tests are green

@jreback jreback merged commit 4c21e5c into pandas-dev:master Mar 20, 2019
@jreback

jreback commented Mar 20, 2019

Thanks @vnlitvinov, nice patch!

thoo added a commit to thoo/pandas that referenced this pull request Mar 20, 2019
* upstream/master: (55 commits)
  PERF: Improve performance of StataReader (pandas-dev#25780)
  Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)
  BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588)
  BUG-24971 copying blocks also considers ndim (pandas-dev#25521)
  CLN: Panel reference from documentation (pandas-dev#25649)
  ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955)
  BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage  (pandas-dev#25770)
  ...
@vnlitvinov vnlitvinov deleted the speed-up-tokenizer branch March 28, 2019 14:30
anmyachev pushed a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019