Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new option for table header fallback. #161

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

SomeBottle
Copy link

For the instance that a table does not contain a header defined by <thead> or <th> , markdownify will use the first row of the table as header fallback by default.

However, sometimes I expect the first row to be parsed as a part of the tbody, for example:

<table>
    <tbody>
        <tr>
            <td>John</td>
            <td>123-456-7890</td>
            <td>john@example.com</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>987-654-3210</td>
            <td>bob@example.com</td>
        </tr>
    </tbody>
</table>

I expect it to be converted as follows:

|     |     |     |
| --- | --- | --- |
| John | 123-456-7890 | john@example.com |
| Bob | 987-654-3210 | bob@example.com |

I think I may not be the only one who has encountered this problem, so I added a new option table_header_fallback to control this behavior.

table_header_fallback is set to True by default, and the html above will be converted to:

| John | 123-456-7890 | john@example.com |
| --- | --- | --- |
| Bob | 987-654-3210 | bob@example.com |

If I set table_header_fallback=False, the result will be:

|  |  |  |
| --- | --- | --- |
| John | 123-456-7890 | john@example.com |
| Bob | 987-654-3210 | bob@example.com |

I've added this option to the ArgumentParser in main.py (alongwith another --escape-misc that may be forgotten), and I also created corresponding test cases.

Hope this will help, thank you for your work!

@chrispy-snps
Copy link
Collaborator

@SomeBottle - thanks for the pull request! I was not aware of this behavior, and I would want your proposed behavior in our own application. Thanks for catching this and proposing a solution.

I have some suggestions:

  1. Could you consider renaming table_header_fallback to table_infer_header?
  2. Could you move the missing --escape-misc argument to a separate pull request?

@AlexVonB, @matthewwithanm - my preference would be to not infer <th> header rows by default so that round-trips between HTML and Markdown are symmetrical. What do you think?

@matthewwithanm
Copy link
Owner

@AlexVonB, @matthewwithanm - my preference would be to not infer <th> header rows by default so that round-trips between HTML and Markdown are symmetrical. What do you think?

yeah, that behavior is surprising to me

@SomeBottle
Copy link
Author

SomeBottle commented Jan 2, 2025

Thank you for your response! I've renamed the option and removed --escape-misc from argument parser in the recent commit.

Should I change the default value of the option to False? Looking forward to your reply.

@chrispy-snps
Copy link
Collaborator

@SomeBottle - yes, please change the default to not infer header rows from data rows unless specified by the user.

@SomeBottle
Copy link
Author

OK, I just set the default value of the option to False and changed corresponding test cases and docs.

@chrispy-snps
Copy link
Collaborator

@SomeBottle - thanks!

Looking through your code, I see this:

        is_headrow = (
            all([cell.name == 'th' for cell in cells])
            or (el.parent.name == 'thead'
                # avoid multiple tr in thead             # <----
                and len(el.parent.find_all('tr')) == 1)  # <----
        )

What is the thinking behind the check for multiple <tr> elements? Is the thinking that if a <thead> contains multiple <tr> rows, they are probably truly data rows instead of header rows?

@SomeBottle
Copy link
Author

SomeBottle commented Jan 2, 2025

@SomeBottle - thanks!

Looking through your code, I see this:

        is_headrow = (
            all([cell.name == 'th' for cell in cells])
            or (el.parent.name == 'thead'
                # avoid multiple tr in thead             # <----
                and len(el.parent.find_all('tr')) == 1)  # <----
        )

What is the thinking behind the check for multiple <tr> elements? Is the thinking that if a <thead> contains multiple <tr> rows, they are probably truly data rows instead of header rows?

Yes, considering the possibility of such an unusual situation, which is actually an invalid header for markdown table, I think it should be treated as a header missing case, and the rows should be regarded as data rows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants