
PARQUET-1463: [C++] Utilize common hashing machinery for dictionary encoding #3036

Closed

Conversation

@pitrou (Member) commented Nov 26, 2018

No description provided.
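(Context, since the PR carries no description: per the title, this change reuses Arrow's common hashing machinery for Parquet's dictionary encoder. For readers unfamiliar with the technique, here is a minimal Python sketch of dictionary encoding built on a hash-based memo table; it is illustrative only, not the actual C++ implementation in this PR.)

def dictionary_encode(values):
    # Hash-based "memo table": maps each distinct value to its
    # dictionary index, assigning indices in order of first appearance.
    memo = {}
    dictionary = []   # index -> value
    indices = []      # one small integer per input value
    for v in values:
        idx = memo.get(v)
        if idx is None:
            idx = len(dictionary)
            memo[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices

# Repeated values collapse into a short dictionary plus small indices:
# dictionary_encode(["N", "Y", "N", "N"]) -> (["N", "Y"], [0, 1, 0, 0])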

@pitrou force-pushed the PARQUET-1463-hashing-refactor branch from 84a88f3 to f909f0a on November 26, 2018 16:42
@pitrou (Member, Author) commented Nov 26, 2018

A crude micro-benchmark doesn't show any significant performance difference. I don't know if we have more significant benchmarks.

@wesm (Member) commented Nov 27, 2018

cc @xhochy @majetideepak

@pitrou thanks so much for doing this. I will review the code and also run some before/after write perf benchmarks to kick the tires.

@xhochy (Member) left a review comment

LGTM. In the absence of benchmarks, I often use http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml as the basis for real-world data in performance testing.

@pitrou (Member, Author) commented Nov 27, 2018

Ok, I tested on a subset of taxi data. It's a bit faster with this PR. Actually, dictionary encoding does not seem to slow things down (anymore?):
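(The session below assumes the standard pyarrow imports, which the transcript does not show:)

import pyarrow as pa
from pyarrow import csv
import pyarrow.parquet as pq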

>>> %time table = csv.read_csv("yellow.csv")                                                                                                                      
CPU times: user 1.11 s, sys: 309 ms, total: 1.41 s
Wall time: 105 ms
>>> table                                                                                                                                                         
pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[s]
tpep_dropoff_datetime: timestamp[s]
passenger_count: int64
trip_distance: double
RatecodeID: int64
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
>>> table.num_rows                                                                                                                                                
999998
>>> %time pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True)                                                             
CPU times: user 337 ms, sys: 3 ms, total: 340 ms
Wall time: 339 ms
>>> %time pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False)                                                            
CPU times: user 334 ms, sys: 11.4 ms, total: 345 ms
Wall time: 345 ms

(note how reading the data from CSV is faster in wall-clock time than saving it to Parquet ;-))

@pitrou (Member, Author) commented Nov 27, 2018

Oops, I should have disabled compression. Updated results:

• before:

>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False, compression=None)
203 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True, compression=None)
338 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

• after:

>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False, compression=None)
204 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True, compression=None)
297 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it looks like roughly a 12% improvement (338 ms → 297 ms) when dictionary encoding is enabled, and no change when it is disabled.

@majetideepak (Contributor) commented

+1 LGTM. This is a nice improvement!

@xhochy (Member) commented Nov 27, 2018

Thanks @pitrou for the optimization. The Parquet write path is something we have never looked into optimizing for performance, since it was already quite fast in its basic form. So there is probably still some low-hanging fruit.

@pitrou (Member, Author) commented Nov 28, 2018

Will merge soon if nobody objects.

@wesm (Member) left a review comment

+1. I'll run some ad hoc perf tests out of my own curiosity when I can, and post the results here.
