Version 1.7.3

@Belval Belval released this 26 Feb 12:39
· 141 commits to master since this release
c5120b0

What's Changed

  • Table linearization improvements by @Belval in #313

    • Add .get_text(), .to_html() and .to_markdown() methods to Linearizable, which is now implemented by Document, Page, DocumentEntity and EntityList
    • Add HTMLLinearizationConfig and MarkdownLinearizationConfig as pre-configured versions of TextLinearizationConfig
    • Add the following parameters to TextLinearizationConfig:
      • duplicate_text_in_merged_cells duplicates the text in merged cells to preserve row-level alignment
      • table_flatten_headers combines multi-row headers into a single row, duplicating the merged cells horizontally as needed
      • table_tabulate_remove_extra_hyphens removes extra hyphens '-' in markdown tables to reduce context length
      • max_number_of_consecutive_spaces defines the maximum number of contiguous whitespace characters, similar to max_number_of_consecutive_new_lines
  • Fixes:

    • Fix trailing whitespace in cell text
    • Fix table_column_separator being hardcoded as '\t'
    • Fix table_row_separator being hardcoded as '\n'
    • Reset BytesIO buffer to position 0 by @abest0 in #310
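
To make two of the TextLinearizationConfig parameters above more concrete, here is a minimal pure-Python sketch of the behavior they describe. It is not the library's implementation: the `Cell` type and the `expand_cells` and `collapse_spaces` helpers are hypothetical names invented for this illustration, and only the parameter names in the comments come from the release notes.

```python
import re
from typing import List, NamedTuple


class Cell(NamedTuple):
    """Hypothetical table cell: top-left position, span, and text."""
    row: int
    col: int
    row_span: int
    col_span: int
    text: str


def expand_cells(cells: List[Cell], n_rows: int, n_cols: int,
                 duplicate: bool = True) -> List[List[str]]:
    """Lay cells out on a grid.

    With duplicate=True, every position covered by a merged cell repeats
    its text (the idea behind duplicate_text_in_merged_cells). Otherwise
    only the top-left position carries the text.
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                is_anchor = r == cell.row and c == cell.col
                if duplicate or is_anchor:
                    grid[r][c] = cell.text
    return grid


def collapse_spaces(text: str, max_spaces: int = 1) -> str:
    """Cap runs of spaces at max_spaces (the idea behind
    max_number_of_consecutive_spaces)."""
    pattern = " {%d,}" % (max_spaces + 1)
    return re.sub(pattern, " " * max_spaces, text)


# "EMEA" spans two rows in the first column.
cells = [
    Cell(0, 0, 1, 1, "Region"), Cell(0, 1, 1, 1, "Sales"),
    Cell(1, 0, 2, 1, "EMEA"), Cell(1, 1, 1, 1, "10"),
    Cell(2, 1, 1, 1, "12"),
]
with_duplicates = expand_cells(cells, n_rows=3, n_cols=2, duplicate=True)
without_duplicates = expand_cells(cells, n_rows=3, n_cols=2, duplicate=False)
```

With duplication, the third row reads `EMEA, 12`, so each row can be interpreted on its own; without it, the first column of that row is empty.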

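The BytesIO fix listed above concerns a common pitfall with in-memory buffers. The details of the change in #310 are not shown in these notes, so the following is only a generic standard-library illustration of why resetting the position matters, with made-up example bytes.

```python
from io import BytesIO

payload = b"example document bytes"  # placeholder content for illustration

buffer = BytesIO()
buffer.write(payload)

# After writing, the position sits at the end, so a read returns nothing.
unread = buffer.read()

# Rewinding to position 0 makes the full content readable again.
buffer.seek(0)
data = buffer.read()
```

Any consumer that reads from the current position, such as an image or PDF loader, sees an empty stream unless the buffer is rewound first.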
New Contributors

Full Changelog: v1.7.2...v1.7.3