fix: Fix excel merge cells header #2265

Merged 1 commit on Feb 14, 2025
7 changes: 6 additions & 1 deletion apps/common/handle/impl/table/xlsx_parse_table_handle.py
@@ -21,7 +21,12 @@ def fill_merged_cells(self, sheet, image_dict):
         data = []

         # Get the first row as the header row
-        headers = [cell.value for cell in sheet[1]]
+        headers = []
+        for idx, cell in enumerate(sheet[1]):
+            if cell.value is None:
+                headers.append(' ' * (idx + 1))
+            else:
+                headers.append(cell.value)

         # Iterate over every row starting from the second row
         for row in sheet.iter_rows(min_row=2, values_only=False):
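
A minimal sketch of why the patched loop gives each empty header cell its own placeholder. It assumes that rows are later keyed by header (for example with dict(zip(headers, row))), which this hunk does not show; build_headers and the sample rows are hypothetical names and made-up values used only for illustration.

```python
def build_headers(first_row_values):
    """Mirror the patched loop: an empty header cell becomes idx + 1 spaces."""
    headers = []
    for idx, value in enumerate(first_row_values):
        if value is None:
            headers.append(' ' * (idx + 1))
        else:
            headers.append(value)
    return headers


# Hypothetical first row: two header cells are empty (e.g. covered by a merge).
sample_first_row = ['Name', None, None, 'Score']
sample_data_row = ['Alice', 'x', 'y', 90]

old_headers = list(sample_first_row)            # behaviour before the patch
new_headers = build_headers(sample_first_row)   # behaviour after the patch

# Keying a row by header: duplicate None keys overwrite each other.
old_record = dict(zip(old_headers, sample_data_row))
new_record = dict(zip(new_headers, sample_data_row))

print(len(old_record))  # 3: the two None columns collapsed into one key
print(len(new_record))  # 4: every column keeps its own key
```

The placeholders differ only in length, so they are unique but easy to confuse when printed; a visible name such as a column index would serve the same purpose.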

The fill_merged_cells function, which fills merged cells in an Excel spreadsheet, has a few areas that could be improved:

  1. Handling Merged Cells: The implementation shown does not explicitly handle merged cells in the loop over rows and columns. Only the top-left cell of a merged range holds a value, so the remaining cells in the range read as empty unless the value is copied across. Whether this matters depends on the specific requirements.

  2. Empty Header Handling: Populating the headers even when some cells are empty is the right goal. However, each placeholder is a run of idx + 1 spaces, so the empty headers become whitespace-only strings that differ only in length, which may lead to confusion.

  3. Optimization:

    • Consider loading the sheet into a pandas DataFrame instead of working directly with openpyxl sheet objects, since pandas has built-in operations for renaming columns and filling empty cells.
    • To handle several sheets or tables at once, consider refactoring the function to accept the sheets as input parameters rather than relying on a single sheet object.

  4. Validation: Check that all inputs are valid objects before proceeding, particularly the worksheet and the dictionary arguments.
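
The merged-cell point in the list above can be sketched without a workbook. The example fills every cell of a merged range with the value of the range's top-left cell. The grid, the range tuples, and the name fill_merged_ranges are hypothetical; with openpyxl the ranges would presumably be read from sheet.merged_cells.ranges (which uses 1-based rows and columns), but that wiring is an assumption and is left out here.

```python
def fill_merged_ranges(grid, merged_ranges):
    """Return a copy of grid with each merged range filled from its top-left cell.

    grid: list of rows (lists of values), 0-indexed.
    merged_ranges: iterable of (min_row, min_col, max_row, max_col) tuples,
        inclusive and 0-indexed (hypothetical format chosen for this sketch).
    """
    filled = [list(row) for row in grid]  # copy so the input is not mutated
    for min_row, min_col, max_row, max_col in merged_ranges:
        anchor = filled[min_row][min_col]  # only the top-left cell holds a value
        for r in range(min_row, max_row + 1):
            for c in range(min_col, max_col + 1):
                filled[r][c] = anchor
    return filled


# Made-up sheet: 'North' spans two rows in the first column.
grid = [
    ['Region', 'City', 'Sales'],
    ['North', 'Oslo', 10],
    [None, 'Bergen', 7],
    ['South', 'Rome', 12],
]
merged = [(1, 0, 2, 0)]

filled = fill_merged_ranges(grid, merged)
print(filled[2])
```

Working from explicit ranges fills only cells that belong to a merge, so cells that were genuinely empty in the sheet stay empty.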

Here’s an enhanced version of the function incorporating these considerations:

import pandas as pd


def fill_merged_cells(sheets_data):
    """
    Fill empty headers and merged-cell gaps in one or more sheets.

    :param sheets_data: A dictionary where keys are sheet names and values are DataFrames containing the data.
    :return: A new dictionary with the same keys and cleaned DataFrames.
    """
    cleaned = {}

    for sheet_name, df in sheets_data.items():
        # Validate the input before touching it
        if not isinstance(df, pd.DataFrame):
            raise TypeError(f"Sheet {sheet_name!r} is not a DataFrame")

        # Replace empty headers with explicit, unique placeholder names.
        # pandas.read_excel labels blank headers "Unnamed: N"; None or NaN
        # can appear when the DataFrame was built by hand.
        headers = []
        for idx, col in enumerate(df.columns):
            is_blank = (
                pd.isna(col)
                or str(col).strip() == ''
                or str(col).startswith('Unnamed:')
            )
            headers.append(f'column_{idx + 1}' if is_blank else col)

        result = df.copy()
        result.columns = headers

        # A DataFrame carries no merged-range information: only the top-left
        # cell of a merged range has a value. Forward-filling each column
        # approximates a vertical merge, but it also fills cells that were
        # genuinely empty in the sheet.
        cleaned[sheet_name] = result.ffill()

    return cleaned

This code processes each sheet's DataFrame in turn, naming empty headers explicitly and filling cells left empty by merged ranges. It is a general approach rather than a drop-in replacement, and adjustments will likely be needed for specific use cases.
