Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB | PyData Global 2023 #207

lexarrow · 2024-03-08T15:13:56Z

Finished timestamps for this video: https://www.youtube.com/watch?v=vy8VrhaYR2M
Title: Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB | PyData Global 2023
Timestamps:
00:00 - General introduction
03:33 - About Matt
06:50 - Pandas 2 introduction
10:08 - Pandas 2 main feature no. 1: pyarrow as the dtype backend instead of numpy
12:28 - Pandas 2 main feature no. 2: copy-on-write
13:07 - Start of Pandas 2 with pyarrow example in Jupyter Notebook
15:16 - Dealing with columns for which pyarrow did not detect dtype by default
18:28 - Presenting the actions on the dataset implemented with numpy
19:11 - Inefficiencies of .apply function in pandas
20:40 - Presenting the actions on the dataset implemented with a vectorized function
21:38 - Processing time benchmark between the .apply and the vectorized solutions
24:09 - Audience question: Are there any backwards compatibility issues between Pandas 2 and Pandas 1?
26:55 - Audience question: Are there any reasons not to use pyarrow?
27:40 - Audience question: How can I easily migrate to Polars or handle the missing index?
29:06 - Polars introduction
36:04 - Start of Polars example in Jupyter Notebook
36:11 - Audience question: Can Polars run in a distributed way?
36:34 - Polars example with the eager implementation
38:30 - Polars eager example - convert column dtypes to dates where auto-detection didn't work
40:18 - Polars eager example - implementation of the Pandas numpy .apply in Polars
42:40 - Polars eager example - processing time benchmark
43:14 - Considerations of Pandas vs Polars speed
45:16 - Polars example with the lazy implementation
47:20 - Answer to the question: Can Polars run in a distributed way?
48:50 - Audience question: Is there an advantage to using Polars over pyspark?
52:33 - Audience question: Is there an advantage to using Polars over Daft?
53:50 - Introduction to DuckDB in the context of dataframes and tabular data
55:48 - DuckDB background and main features
58:08 - Start of DuckDB example in Jupyter Notebook using SQL
58:56 - DuckDB how to load data
1:01:30 - Audience question: What is a median-sized dataset?
1:02:20 - DuckDB complicated query example
1:03:07 - DuckDB Arrow integration
1:04:48 - Audience question: Where can I get a copy of temp bill file?
1:05:26 - Main conclusions and aspects related to switching from Pandas to Polars
1:09:21 - Audience comment: The Pandas pyarrow integration is incomplete (ref. the .dt accessor)
1:11:10 - Audience question: How do you deal with reading variables as strings in DuckDB?
1:12:16 - Audience question: What tool do you recommend to start learning as a beginner?
1:12:32 - Presentation of Tabular Tools (API & Scale) chart
1:16:12 - Answer to the question: What tool do you recommend to start learning as a beginner?
1:16:43 - Audience question: Will 'Effective Pandas 2' book have the same datasets as 'Effective Pandas' original edition?
1:18:06 - Audience question about mass renaming variables
1:19:54 - Which of the presented tools to use?
1:21:48 - Matt contact details and areas of expertise
Resources:
- Github repo of the notebook used in the presentation: https://github.com/mattharrison/talks/tree/2023-12-pydata
- Books:
  - Effective Pandas - Metasnake https://store.metasnake.com/effective-pandas1-book
  - Effective Pandas - Amazon https://www.amazon.com/Effective-Pandas-Patterns-Manipulation-Treading/dp/B09MYXXSFM
  - Effective Pandas 2 - Metasnake https://store.metasnake.com/effective-pandas-book
  - Effective Pandas 2 - Amazon https://www.amazon.com/gp/product/B0CSRGH8R3?ref_=dbs_m_mng_rwt_calw_tpbk_3&storeType=ebooks
