ADD strikethrough to Markdown and HTML export #3810
Comments
@arisjr This seems like a reasonable feature request to me!
This is very complex to implement. I am afraid we won't be able to come up with a solution in the foreseeable future.
Thinking of a way to do it: perhaps with a flag, detect_strikethrough, that, when enabled:
This linear algebra may add a lot of processing to the parsing (I don't know yet), but if the programmer opts in, there must be a reason for it. Also, to keep the scope focused, it should only handle horizontal strikethrough. And there are many algorithmic improvements that could be made: for example, if a char is not within the line's range, don't even test it. I don't know much about PDF parsing, and I haven't studied linear algebra in a long time, but maybe it's feasible. What do you think, @JorjMcKie? I also don't know whether the project has made other attempts at this, but after reading your message I saw that there is a recurring need on the internet for this solution/feature.
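The per-character test and the range pre-filter mentioned above can be sketched in plain Python. This is only an illustration under stated assumptions: the `Box` type, the `is_struck` and `struck_chars` helpers, and the `band` tolerance are hypothetical names, not part of PyMuPDF, and boxes use PDF-style coordinates with y growing downward. Character boxes usually include ascender and descender space, so the "middle band" is an approximation.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle (hypothetical helper type, y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float


def is_struck(char: Box, line: Box, band: float = 0.25) -> bool:
    """Return True if a horizontal line crosses the middle band of a char box.

    `band` is an assumed tolerance: the line's centre may deviate from the
    char's vertical centre by at most `band` times the char height.
    """
    line_y = (line.y0 + line.y1) / 2
    # Cheap rejection first: the line is not in the char's vertical range.
    if not (char.y0 <= line_y <= char.y1):
        return False
    # The line must also overlap the char horizontally.
    if line.x1 <= char.x0 or line.x0 >= char.x1:
        return False
    height = char.y1 - char.y0
    mid = (char.y0 + char.y1) / 2
    return abs(line_y - mid) <= band * height


def struck_chars(chars: List[Tuple[str, Box]], lines: List[Box]) -> str:
    """Concatenate the characters that are crossed by any of the lines."""
    return "".join(text for text, box in chars
                   if any(is_struck(box, ln) for ln in lines))
```

With three characters side by side and a line through the first two, `struck_chars` returns only those two; an underline at the bottom edge is rejected by the band check.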
@arisjr yes, thanks for your thoughts. The problem is how to differentiate vector graphics intended as strikethrough from others - as I think I mentioned.
It needs to start somewhere and can evolve from there. (MS Word and LibreOffice are very good starting points indeed!) Maybe if this horizontal rectangle
Then it is a strikethrough. Any other horizontal rectangle is something else (a highlight, a text box, or even an attempt to redact the text), and we should not bother with those for now. I don't know whether the data needed for this comparison with the font height is available in the PDF structure or somewhere else; I'm just brainstorming here. Sorry.
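A rough sketch of such a font-height comparison follows. The exact conditions from the comment above are not reproduced here; the thresholds (`max_thickness_ratio`, the 0.35 to 0.70 vertical band) and the function name `looks_like_strikethrough` are assumptions chosen for illustration. In PyMuPDF, the inputs could plausibly come from the `"rect"` entry of each `page.get_drawings()` item and from the `"bbox"` and `"size"` entries of each span in `page.get_text("dict")`, but the function itself works on plain tuples.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def looks_like_strikethrough(rect: Rect, span_bbox: Rect, font_size: float,
                             max_thickness_ratio: float = 0.15) -> bool:
    """Heuristic: a thin, wide rectangle near the vertical middle of a span.

    All thresholds are illustrative assumptions, not values from any library.
    """
    x0, y0, x1, y1 = rect
    sx0, sy0, sx1, sy1 = span_bbox
    width, height = x1 - x0, y1 - y0
    span_height = sy1 - sy0
    if width <= 0 or span_height <= 0:
        return False
    # Must overlap the text span horizontally.
    if min(x1, sx1) - max(x0, sx0) <= 0:
        return False
    # Too thick relative to the font: highlight, text box, redaction, ...
    if height > max_thickness_ratio * font_size:
        return False
    # Must be a horizontal stroke, i.e. wider than it is tall.
    if width <= height:
        return False
    # Vertical centre of the rectangle, relative to the span (0 = top, 1 = bottom).
    rel = ((y0 + y1) / 2 - sy0) / span_height
    return 0.35 <= rel <= 0.70
```

Anything that fails the test is simply treated as "something else", matching the idea of ignoring highlights, boxes, and redactions for now.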
@JorjMcKie take a look at the code snippet I wrote. I used code you posted some time ago (on Stack Overflow) that shows how to get the rectangles and lines in a document. You can correct it (if I made a mistake) or add more logic to it, such as checking the color of the rectangle (whether it is the same as the font's); I didn't know how to do that part. For now it checks and filters for strikethrough text on a page. But I also tested it with a PDF printed from Mozilla Firefox on Linux, and it found neither lines nor rectangles on the strikethrough text... What could the strikethrough be? Office of Information Policy _ The Freedom of Information Act, 5 U.S.C. § 552.pdf Regards
I am currently testing an algorithm that successfully matches horizontal "lines" with overlapping words. Just to prevent unfounded hopes: Another comment regarding HTML output: There is no way to achieve strike-out output! To confirm, please discuss this in the MuPDF Discord channel.
Is your feature request related to a problem? Please describe.
YES. I'm building a RAG application on a group of Brazilian laws, and I think the problem applies to the whole RAG/LLM community.
(I'm new to RAG)
Laws, general legislation publications, and other documents that need to keep track of changes (history) normally don't simply erase text; they strike it through, as in the examples below:
https://www.planalto.gov.br/ccivil_03/_ato2004-2006/2006/decreto/d5948.htm
https://www.justice.gov/oip/freedom-information-act-5-usc-552
These were HTML examples, but PDFs of these documents follow the same procedure.
~~This is a markdown example that should not be counted.~~
When document parsers and loaders such as pymupdf4llm (when generating Markdown) and langchain's PyMuPDFLoader extract the text, they extract all of it as if it were the same. For RAG applications, though, I think that including strikethrough text in the data may lead the AI to false assumptions and give the analyst wrong results.
Describe the solution you'd like
I would like a strikethrough text type added to Markdown and HTML export, so that the document loader can ignore strikethrough text if the programmer chooses to.
Describe alternatives you've considered
First, add strikethrough to the text exports (Markdown and HTML) of the PyMuPDF Python libraries; second, and also important, add the ability for PyMuPDF loaders to ignore strikethrough text if the programmer chooses to do so.
The second part (the loaders) is, I think, up to other projects, like langchain.
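To illustrate the second part: if the Markdown export marked struck text with the GitHub-style `~~...~~` syntax, a loader could drop it in a small post-processing step. The sketch below assumes that output format (which is the feature being requested, not something the libraries are confirmed to emit), and `drop_strikethrough` is a hypothetical helper name. It does not special-case code fences or inline code.

```python
import re

# Matches "~~text~~" where the text starts and ends with a non-space
# character, plus one optional space before it so no double space remains.
_STRIKE = re.compile(r" ?~~(?=\S)(.+?)(?<=\S)~~")


def drop_strikethrough(markdown: str) -> str:
    """Remove `~~...~~` spans (assumed export format) from Markdown text."""
    return _STRIKE.sub("", markdown)


if __name__ == "__main__":
    print(drop_strikethrough("Art. 1 ~~old wording~~ new wording."))
```

Keeping this as an opt-in step leaves the full text available to users who do want the revision history.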
Additional context
None