Speed up tokenizing of a row in csv and xstrtod parsing #25784
Can you run the asv benchmarks for csv and report the changes? We may not have benchmarks that specifically hit the code you changed; if so, can you add one?
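For reference, a benchmark of the kind requested here might look like the sketch below. It is only an illustration under assumptions: the helper `make_csv_text`, the class name `ReadCSVFloatRows`, and the row counts are hypothetical and not taken from this PR. The all-float data is meant to exercise row tokenizing and float parsing in `read_csv`; whether it reaches exactly the changed code paths would need to be checked.

```python
import io


def make_csv_text(nrows, ncols):
    """Build an all-float CSV document in memory (hypothetical helper)."""
    header = ",".join(f"col{i}" for i in range(ncols))
    rows = (
        ",".join(f"{r}.{c:03d}" for c in range(ncols))
        for r in range(nrows)
    )
    return header + "\n" + "\n".join(rows) + "\n"


class ReadCSVFloatRows:
    # asv runs each time_* method once per value in params.
    params = [1_000, 100_000]
    param_names = ["nrows"]

    def setup(self, nrows):
        # Deferred import so the helper above stays usable without pandas.
        from pandas import read_csv

        self.read_csv = read_csv
        self.data = make_csv_text(nrows, ncols=10)

    def time_read_csv_floats(self, nrows):
        # A fresh StringIO per call, because read_csv consumes the buffer.
        # Creating the buffer is included in the measured time.
        self.read_csv(io.StringIO(self.data))
```

asv calls `setup` before timing, so building the CSV text is not part of the reported number.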
        
          
doc/source/whatsnew/v0.25.0.rst (Outdated)
int8/int16/int32 and the searched key is within the integer bounds for the dtype (:issue:`22034`)
- Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)

- Improved performance of `tokenize_bytes` in `tokenizer.c`
Also mention :meth:`read_csv`; a user has no idea what any of the other things are.
So this gives a 5% to 10% improvement on mini-benchmarks. Note that this scales quite well: on CSV files a million lines long it still gives the same 5-10%.
a4f6dcd to 12a6df9
The results are even more promising if I allow more warmup and more sampling time, so that Turbo Boost and frequency scaling don't affect the measurements too much. Running
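As an illustration of the warmup-and-longer-sampling idea (not the command the author ran), the sketch below times a callable after a few untimed warmup calls and keeps the best of several repeats. The function names and default counts are assumptions chosen for the example.

```python
import timeit


def best_time(func, warmup=3, repeat=7, number=10):
    """Return the best per-call time in seconds after untimed warmup calls."""
    for _ in range(warmup):
        func()  # let caches and CPU frequency settle before measuring
    timings = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other system activity.
    return min(timings) / number


def percent_faster(baseline_seconds, candidate_seconds):
    """Reduction in run time relative to the baseline, in percent."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds
```

With a baseline and a patched build, `percent_faster(best_time(old_parse), best_time(new_parse))` would give a figure comparable to the percentages quoted above, where `old_parse` and `new_parse` are placeholder callables.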
@jreback I've fixed the whatsnew entry per your comment and rebased onto the latest master for a clean history.
Codecov Report
 @@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
- Coverage   91.26%   91.26%   -0.01%     
==========================================
  Files         173      173              
  Lines       52982    52982              
==========================================
- Hits        48356    48355       -1     
- Misses       4626     4627       +1
 
Codecov Report
 @@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
+ Coverage   91.26%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       52982    53002      +20     
==========================================
+ Hits        48356    48375      +19     
- Misses       4626     4627       +1
 
Comment on the whatsnew; ping when green.
lgtm.
@jreback all tests are green
thanks @vnlitvin nice patch!
git diff upstream/master -u -- "*.py" | flake8 --diff
I will update the PR when CI finishes running, as I locally tested io.parser.test_common only.