Currently, the textractor pipeline is rather basic. It simply calls Apache Tika (if available) and returns the raw text. If Tika isn't available, the textractor converts HTML to raw text.
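The HTML fallback described above can be pictured with a small standard-library sketch. This is only an illustration of converting HTML to raw text, not the textractor's actual code; the `TextOnly` class and `html_to_text` function are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        # Entering a tag whose content should not appear in the output
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if not self.skipping and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Note that this flattens the document: headings, paragraphs and page boundaries all become plain lines, which is the limitation the rest of this issue is concerned with.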
Tika is a mature and stable project that supports a large number of file formats. It can also extract content as XHTML. The following improvements should be made to better support downstream retrieval-augmented generation (RAG) use cases.
Support section parsing. This will add a new flag called `sections`. When enabled, it will split the text by section or page breaks. This will better organize content into related sections.
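As a rough sketch of what splitting by section or page breaks could look like, the following uses only the standard library. It is not the proposed implementation: the `split_sections` name is hypothetical, and it assumes that a form feed character marks a page break and that one or more blank lines mark a section break.

```python
import re

# Assumption: a form feed (\f) marks a page break and a run of blank lines
# marks a section break. Real extracted text may use other markers.
BREAK = re.compile(r"\f|\n\s*\n")


def split_sections(text):
    """Split raw text into non-empty sections (hypothetical helper)."""
    sections = (part.strip() for part in BREAK.split(text))
    return [section for section in sections if section]
```

For example, `split_sections("Intro text\n\nSection one\nmore\fPage two")` returns three sections, with the line break inside the second section preserved. Each section can then be indexed as its own unit for retrieval.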