Potential performance Issue: Slow read_csv() Function with pandas 2.0.0 #224

TendouArisu · 2024-03-02T08:26:50Z

Issue Description:

Hello.
I have found a performance regression in the read_csv function in pandas versions below 2.0.1. I also noticed that parts of this repository depend on pandas 2.0.0 in environments/minimal_requires.txt, and that some other dependencies require pandas below 2.0.1. I am not sure whether this pandas performance problem affects this repository. There are related discussions on the pandas GitHub, including #52546 and #52548.
I also found that app.py and demos/data_process_loop/app.py use the affected API. There may be more files that use it.
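
One way to look for further uses is to scan the repository's Python files for calls named read_csv. The following is a minimal sketch using only the standard library. The helper name find_read_csv_calls is hypothetical, and matching on the call name alone is an assumption: it may also flag non-pandas functions that happen to be called read_csv.

```python
import ast
from pathlib import Path


def find_read_csv_calls(root):
    """Return sorted (relative path, line number) pairs for calls named read_csv."""
    root = Path(root)
    hits = []
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be read or parsed as Python
        for node in ast.walk(tree):
            if not isinstance(node, ast.Call):
                continue
            func = node.func
            # Covers both `pd.read_csv(...)` (Attribute) and `read_csv(...)` (Name).
            if isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = getattr(func, "id", None)
            if name == "read_csv":
                hits.append((path.relative_to(root).as_posix(), node.lineno))
    return sorted(hits)
```

Calling find_read_csv_calls(".") from the repository root returns the matching file paths and line numbers, which can then be reviewed by hand.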

Suggestion

I recommend upgrading to pandas >= 2.0.1, or exploring other ways to improve read_csv performance.
Any other workarounds or solutions would be greatly appreciated.
Thank you!
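
To see whether a given environment is on a release older than 2.0.1, something like the following sketch could be used. It relies only on the standard library. The helper names (parse_release, predates_fix, installed_pandas_version) are hypothetical, and the "below 2.0.1" threshold is taken from the report above rather than verified independently. Pre-release suffixes such as rc0 are ignored, which is a simplification.

```python
from importlib import metadata


def parse_release(version):
    """Turn a version string such as '2.0.1' into a 3-tuple of ints, e.g. (2, 0, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # drop suffixes like 'rc0' (simplification)
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def predates_fix(version, fixed_in="2.0.1"):
    """True if `version` is older than `fixed_in` (threshold assumed from the report)."""
    return parse_release(version) < parse_release(fixed_in)


def installed_pandas_version():
    """Return the installed pandas version string, or None if pandas is absent."""
    try:
        return metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return None


version = installed_pandas_version()
if version is None:
    print("pandas is not installed in this environment")
elif predates_fix(version):
    print(f"pandas {version} is older than 2.0.1; consider upgrading")
else:
    print(f"pandas {version} is 2.0.1 or newer")
```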


github-actions · 2024-03-25T09:31:54Z

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions · 2024-04-16T09:32:04Z

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

HYLcool · 2024-04-23T11:21:28Z

Hi @TendouArisu, thanks for your attention and suggestions!

We ran a few experiments and confirmed what you reported. We pinned pandas to 2.0.0 mainly because:

1. pandas >= 2.1.x and datasets==2.11.0 might raise a ValueError when exporting a dataset to a JSON file:

   ValueError: 'index=True' is only valid when 'orient' is 'split', 'table', 'index', or 'columns'.

2. pandas >= 2.1.x requires Python >= 3.9, but we want to support Python 3.7/3.8 as well.
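
The error message in the first point can be explored with pandas alone. The sketch below assumes the failure comes from calling DataFrame.to_json with orient="records" together with index=True. How datasets 2.11.0 actually calls to_json is not shown in this thread, so treat that as an assumption. The helper export_records and the sample data are hypothetical.

```python
import pandas as pd


def export_records(df, index=None):
    """Serialize df as a JSON array of records.

    Returns the JSON string, or the error text if pandas rejects the arguments.
    """
    # Only forward `index` when the caller set it, so the pandas default applies otherwise.
    kwargs = {} if index is None else {"index": index}
    try:
        return df.to_json(orient="records", **kwargs)
    except ValueError as exc:
        return f"ValueError: {exc}"


df = pd.DataFrame({"text": ["a", "b"], "score": [0.1, 0.9]})
print("pandas", pd.__version__)
# Leaving `index` unset avoids the check described above.
print(export_records(df))
# Passing index=True with orient="records" is the combination the message complains about;
# per the report above it is rejected on pandas >= 2.1.x.
print(export_records(df, index=True))
```

Running this under different pandas versions shows whether a given release accepts the index=True call or rejects it.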

However, we found that pandas 2.0.1 to 2.0.3 work well both in terms of performance and with respect to the two problems above, so we updated pandas to 2.0.3 in the latest PR #303.

Thanks for your suggestion again! Feel free to discuss with us if you have any further suggestions~

github-actions · 2024-05-15T09:31:55Z

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions · 2024-05-18T09:32:00Z

Close this stale issue.

yxdyc assigned HYLcool and zhijianma Mar 4, 2024

github-actions bot added the stale-issue label Mar 25, 2024

HYLcool removed the stale-issue label Mar 25, 2024

github-actions bot added the stale-issue label Apr 16, 2024

HYLcool removed the stale-issue label Apr 16, 2024

github-actions bot added the stale-issue label May 15, 2024

github-actions bot closed this as completed May 18, 2024
