Skip to content

ENH: read_excel: Add a callable to access the Cell before getting the value #49872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 of 3 tasks
Polandia94 opened this issue Nov 23, 2022 · 2 comments
Closed
1 of 3 tasks
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@Polandia94
Copy link

Polandia94 commented Nov 23, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

In pandas.read_excel it's not possible to access information about the cell object before converting to dataframe. In some cases as in #47269,#46895, and #49770 it's necessary to acces the cell, especially number_format to change the values

Feature Description

Add a new parameter to read_excel, something like value_of_cell, that receive a cell and return a value. In my case i change _convert_cell in _openpyxl to:

def _convert_cell(self, cell, convert_float: bool, value_of_cell:Union[None, Callable]) -> Scalar:
        from openpyxl.cell.cell import (
            TYPE_ERROR,
            TYPE_NUMERIC,
        )
        if cell.value is None:
            return ""  # compat with xlrd
        elif cell.data_type == TYPE_ERROR:
            return np.nan
        elif value_of_cell is not None: # this is new
            return value_of_cell(cell) # this is new
        elif cell.data_type == TYPE_NUMERIC:
            # GH5394, GH46988
            if convert_float:
                val = int(cell.value)
                if val == cell.value:
                    return val
            else:
                return float(cell.value)
        return cell.value

so in my code i call

def conv(cell):
            if cell.number_format.startswith("0"):
                return "0" * (len(cell.number_format) - len(str(cell.value))) + str(cell.value)
            else:
                return cell.value
 df = pd.read_excel(file, header=None, na_filter=False, dtype=str, value_of_cell:=conv)

of course is neccesary to edit all the calls from read_excel to _convert_cell ( i think is something like 10 functions), but i will create a pull request if can pass the test. I hope this is useful for the community

Alternative Solutions

Another option is to use openpyxl directly as, but loss the extra functions of read_excel and more code is needed:

        document = load_workbook(file)
        sheet = document.active
        for row in sheet:
            for cell in row:
              cell.value = value_of_cell(cell)
        df = pd.DataFrame(sheet.values)
        df = df.fillna('')

Additional Context

No response

@Polandia94 Polandia94 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 23, 2022
@Polandia94
Copy link
Author

Polandia94 commented Nov 23, 2022

I think this could be a provisory solution to:
#20828
If value_of_cell is:

def value_of_cell(cell):
    if len(str(cell.value))> 15:
        return str(cell.value)
    else:
        return cell.value

@mroeschke
Copy link
Member

Thanks for the request but it appears there hasn't been much interest in this issue from the core team and community over the years so closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants