fix: Fix excel merge cells header #2265

Merged (1 commit) on Feb 14, 2025


shaohuzhang1
Contributor

fix: Fix excel merge cells header


f2c-ci-robot bot commented Feb 13, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


f2c-ci-robot bot commented Feb 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

if cell.value is None:
    headers.append(' ' * (idx + 1))
else:
    headers.append(cell.value)

# Iterate over each row, starting from the second row
for row in sheet.iter_rows(min_row=2, values_only=False):
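
The header handling in the excerpt above can be illustrated in isolation. This is a minimal sketch, not the project's actual code: `build_headers` is a hypothetical helper, and it assumes the first row has already been read into a plain list of cell values. An empty header at position `idx` is replaced by a string of `idx + 1` spaces, so the placeholders differ from one another and each column still gets its own key.

```python
def build_headers(first_row_values):
    """Return header names, substituting a blank placeholder for empty cells."""
    headers = []
    for idx, value in enumerate(first_row_values):
        if value is None:
            # idx + 1 spaces: each empty header gets a placeholder of a different length
            headers.append(' ' * (idx + 1))
        else:
            headers.append(value)
    return headers


# Hypothetical first row with two empty header cells
headers = build_headers(['Name', None, 'Score', None])
```

Because the placeholders have different lengths, using them as dictionary keys for row data does not cause one empty-header column to overwrite another.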
Contributor Author


The provided Python function fill_merged_cells for filling merged cells in an Excel spreadsheet contains a few areas that could be optimized or improved:

  1. Handling Merged Cells: The current implementation does not explicitly handle merged cells inside the loop over rows and columns. Depending on the requirements, each cell's value may need to be resolved correctly even when the cell belongs to a merged range.

  2. Empty Header Handling: Header initialization should ensure that every column gets a header even when some header cells are empty. Padding with spaces, however, is a manual workaround that may cause confusion.

  3. Optimization:

    • Consider using pandas to manipulate the data as a DataFrame instead of working directly with openpyxl sheets, since it offers more robust functionality.
    • To handle multiple sheets or tables at once, consider refactoring the function to accept several sheets as input rather than relying on a single sheet object.
  4. Validation: Check that all inputs are valid objects before proceeding, particularly worksheets and dictionaries.

Here’s an enhanced version of the function incorporating these considerations:

import pandas as pd


def fill_merged_cells(sheets_data):
    """
    Fills the gaps that merged cells leave behind in one or more sheets.

    :param sheets_data: A dictionary where keys are sheet names and values are DataFrames containing the data.
    :return: A new dictionary mapping each sheet name to its cleaned DataFrame.
    """
    merged_data = {}

    for sheet_name, df in sheets_data.items():
        # Validation: skip anything that is not a DataFrame
        if not isinstance(df, pd.DataFrame):
            continue

        try:
            # Replace empty headers with explicit placeholder names.
            # pandas labels missing headers "Unnamed: N" when reading Excel files.
            headers = []
            for idx, col in enumerate(df.columns):
                label = str(col).strip()
                if label in ('', 'nan', 'None') or label.startswith('Unnamed:'):
                    headers.append(f'column_{idx + 1}')
                else:
                    headers.append(label)

            clean_df = df.copy()
            clean_df.columns = headers

            # Assumption: only the top-left cell of a merged range holds a value,
            # so the rest of the range is read as NaN. Forward-fill down each column.
            # Note that this also fills cells that were genuinely empty.
            clean_df = clean_df.ffill()

            # Drop rows that are still completely empty and renumber the index
            clean_df = clean_df.dropna(how='all').reset_index(drop=True)

            merged_data[sheet_name] = clean_df

        except Exception as e:
            print(f"Error processing {sheet_name}: {e}")

    return merged_data

This code handles both regular DataFrames and those that may contain merged cells by extracting headers manually and adjusting for empty cells. It also suggests a general approach to integrating multiple sheets into a unified DataFrame structure. Adjustments will likely be needed for the specific use case.
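
For the first point of the review (resolving values inside merged ranges), a dependency-free sketch may help. It is an illustration under stated assumptions, not code from this PR: `fill_merged_ranges` is a hypothetical helper, the sheet is assumed to be already loaded into a list of rows, and each merged range is assumed to be a 1-based `(min_col, min_row, max_col, max_row)` tuple, the same order openpyxl uses for a cell range's bounds. Only the top-left cell of a merged range is assumed to hold the value.

```python
def fill_merged_ranges(grid, merged_bounds):
    """Copy the top-left value of each merged range into every cell of that range.

    grid: list of rows (lists), indexed from 0.
    merged_bounds: iterable of 1-based (min_col, min_row, max_col, max_row) tuples.
    """
    filled = [list(row) for row in grid]  # copy so the caller's data is not mutated
    for min_col, min_row, max_col, max_row in merged_bounds:
        anchor = filled[min_row - 1][min_col - 1]  # value of the top-left cell
        for r in range(min_row - 1, max_row):
            for c in range(min_col - 1, max_col):
                filled[r][c] = anchor
    return filled


# Hypothetical sheet: A1:B1 merged horizontally, A2:A3 merged vertically
grid = [
    ['Region', None, 'Total'],
    ['North', 'Q1', 10],
    [None, 'Q2', 12],
]
filled = fill_merged_ranges(grid, [(1, 1, 2, 1), (1, 2, 1, 3)])
```

Unlike a forward-fill, this approach only touches cells that belong to a merged range, so cells that are genuinely empty stay empty.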

@liuruibin liuruibin merged commit c524fbc into main Feb 14, 2025
4 of 5 checks passed
@liuruibin liuruibin deleted the pr@main@fix_excel_merge_cell branch February 14, 2025 02:26
2 participants