Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Performance issue with fillna() after merging DataFrames #61180

Open
2 of 3 tasks
sjfakharian opened this issue Mar 26, 2025 · 2 comments
Open
2 of 3 tasks

BUG: Performance issue with fillna() after merging DataFrames #61180

sjfakharian opened this issue Mar 26, 2025 · 2 comments
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Info Clarification about behavior needed to assess issue Performance Memory or execution speed performance

Comments

@sjfakharian
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import time

# Create two large DataFrames with missing data
np.random.seed(0)
size = 1_000_000

df1 = pd.DataFrame({
    'ID': range(size),
    'Name': np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eve', None], size)
})

df2 = pd.DataFrame({
    'ID': range(size // 2, size * 3 // 2),  # Overlapping and new IDs
    'Age': np.random.choice([None, 20, 30, 40, 50, 60], size)
})

# Measure time for merge operation
start_time = time.time()
merged_df = pd.merge(df1, df2, on='ID', how='outer')
merge_time = time.time() - start_time
print(f"Merge time: {merge_time:.2f} seconds")

# Measure time for fillna operation
start_time = time.time()
merged_df['Name'].fillna('Unknown', inplace=True)
merged_df['Age'].fillna(0, inplace=True)
fillna_time = time.time() - start_time
print(f"Fillna time: {fillna_time:.2f} seconds")

# Print some statistics
print(f"Total rows after merge: {len(merged_df)}")
print(f"Null values in 'Name' after fillna: {merged_df['Name'].isnull().sum()}")
print(f"Null values in 'Age' after fillna: {merged_df['Age'].isnull().sum()}")

Issue Description

Bug Description

When using fillna() after merging DataFrames, unexpected behavior and performance issues occur.

Reproducible Code Example

Expected Behavior

Expected Behavior

The fillna() operation should efficiently fill missing values after merging, without unexpected behavior or significant performance degradation.

Actual Behavior

The fillna() operation may exhibit unexpected behavior or poor performance, especially with larger datasets.

Additional Context

This issue becomes more apparent when working with larger datasets and complex merge operations. Improving the performance and reliability of fillna() after merging would greatly benefit data processing workflows.

Environment

  • pandas version: 3.0.0
  • Python version: 3.13.2
  • Operating System: Linux

Installed Versions

INSTALLED VERSIONS

commit : None
python : 3.13.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
pip : 24.0
setuptools : 69.0.2
Cython : 3.0.8
pytest : 8.0.0
hypothesis : 6.98.3
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.21.0
pandas_datareader: None

[other dependencies ...]

@sjfakharian sjfakharian added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 26, 2025
@rhshadrach
Copy link
Member

When using fillna() after merging DataFrames, unexpected behavior and performance issues occur.

Thanks for the report. You should be seeing a ChainedAssignmentError due to the use of inplace=True. Changing the code to not use this:

merged_df['Name'] = merged_df['Name'].fillna('Unknown')
merged_df['Age'] = merged_df['Age'].fillna(0)

gives me the proper behavior.

If you believe there are performance issues, can you detail why it is you think that?

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Mar 26, 2025
@sjfakharian
Copy link
Author

Thank you for addressing the performance issue with fillna() after merging large DataFrames.

Upon testing, I observed that the slowdown occurs when applying fillna() immediately after a merge operation that results in a DataFrame with a non-standard index and scattered missing data. It seems that merging might alter the DataFrame’s internal structure, leading to inefficiencies in how fillna() processes and locates missing values. Additionally, the performance hit could be related to the following factors:

Index Misalignment: The merge operation may produce a DataFrame with an irregular index, causing additional overhead in the alignment process during fillna().

Memory Layout Changes: Merging can lead to non-contiguous memory blocks, which might result in less efficient operations when fillna() is applied.

Data Type Conversions: There could be implicit type conversions after a merge that delay processing or require extra computations.

Caching Effects: The reorganization of data post-merge might impact cache locality, slowing down subsequent operations like fillna().

Would it be helpful if I provided more detailed benchmarks or a profiling summary of these operations? I am happy to contribute further to diagnosing the root cause and exploring potential optimizations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Info Clarification about behavior needed to assess issue Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

2 participants