.replace_text does not work as intended. #223

Open
jymchng opened this issue Mar 24, 2023 · 5 comments

Comments


jymchng commented Mar 24, 2023

L105    pdf.replace_text(2, "jdoe123@mycompany.net", "hello WORLD").unwrap();
L106    dbg!(pdf.extract_text(&[2]).unwrap());

Logs

[src\redact.rs:106] pdf.extract_text(&[2]).unwrap() = "For example, john.doe@example.com, jdoe123@mycompany.net, \nalice_123+test@gmail.co.uk, and jane\n-\ndoe@my\n-\nuniversity.edu all match this pattern, \nand are therefore considered valid email addresses.\n \n \n"

The extracted text still contains jdoe123@mycompany.net, so apparently replacing text directly in a page doesn't work?

Author

jymchng commented Mar 27, 2023

@J-F-Liu Hi J-F-Liu, I have been thinking about the replace_text method, which returns a Result<()>. That return type implies a contract between caller and callee: if replace_text does replace the text in the PDF it returns Ok(()), and otherwise it returns an Err variant.

For this function, particularly on Line 138, it seems that nothing happens when the content-stream operator is not one of the two the function handles ("Tf" or "Tj"): the match arm _ => {} is an empty block. Would it be better to return an Err so that the caller knows the function could not do what it promises, because it does not handle any other operator?
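
To make the suggestion above concrete, here is a minimal sketch of a replace function that reports failure instead of silently succeeding. It does not use lopdf: the `Op` enum, the `ReplaceError` type, and this `replace_text` signature are hypothetical stand-ins invented for illustration. The sketch also assumes that erroring on every unhandled operator would be too strict, since most operators on a page are unrelated to text, so it counts replacements and returns an error only when nothing was replaced.

```rust
// Hypothetical, simplified model of a page content stream; these are NOT lopdf's types.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    /// Stand-in for the "Tf" operator: selects a font by name.
    SetFont(String),
    /// Stand-in for the "Tj" operator: shows one string.
    ShowText(String),
    /// Any operator this sketch does not interpret.
    Other(String),
}

#[derive(Debug, PartialEq)]
pub enum ReplaceError {
    /// No text-showing operand was equal to the search text.
    TextNotFound,
}

/// Replaces every `ShowText` operand that equals `needle` with `replacement`.
/// Returns the number of operands changed, or an error if none were changed.
pub fn replace_text(
    ops: &mut [Op],
    needle: &str,
    replacement: &str,
) -> Result<usize, ReplaceError> {
    let mut replaced = 0;
    for op in ops.iter_mut() {
        match op {
            Op::ShowText(text) if text.as_str() == needle => {
                *text = replacement.to_string();
                replaced += 1;
            }
            // Skipping an unrelated operator is not an error by itself.
            _ => {}
        }
    }
    if replaced == 0 {
        // The caller learns that the promised replacement did not happen.
        Err(ReplaceError::TextNotFound)
    } else {
        Ok(replaced)
    }
}
```

With this shape, a caller that writes `replace_text(&mut ops, "jdoe123@mycompany.net", "hello WORLD").unwrap()` would panic on a page where nothing matched, rather than continuing as if the text had been replaced.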

Author

jymchng commented Mar 27, 2023

#217

Owner

J-F-Liu commented Mar 27, 2023

Yes, text processing is not fully implemented.

jymchng referenced this issue Apr 15, 2023
Co-authored-by: Lukáš Tyrychtr <ltyrycht@redhat.com>
Author

jymchng commented Mar 31, 2024

@J-F-Liu Hi Liu, do you think this issue can be fixed?

Collaborator

Heinenen commented Aug 9, 2024

Theoretically this can be fixed, but sadly, extracting text from a PDF is hard.
The solution for #125 may lay a first foundation for solving this issue, as it will (sometimes) allow the text to be extracted from the PDF.
However, this is only half of the solution; we would still need to implement putting the replacement text back into the PDF.
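
The two halves described above (finding the text, then writing the replacement back) can be sketched on a toy model. The assumption here, which is not a diagnosis of this issue, is that a phrase may be split across several shown strings, so an exact match against any single string fails. `replace_across_fragments` is a hypothetical helper working on plain Rust strings; it ignores font encodings, glyph availability, and text positioning, all of which a real PDF implementation would have to handle.

```rust
/// Hypothetical helper: `fragments` stands in for the strings shown by
/// consecutive text operators on a page. Replaces the first occurrence of
/// `needle`, even if it spans several fragments. Returns true on success.
pub fn replace_across_fragments(
    fragments: &mut [String],
    needle: &str,
    replacement: &str,
) -> bool {
    if needle.is_empty() {
        return false;
    }

    // Half 1: extraction. Join the fragments so the search can cross boundaries.
    let joined: String = fragments.concat();
    let start = match joined.find(needle) {
        Some(index) => index,
        None => return false,
    };
    let end = start + needle.len();

    // Half 2: writing back. Map the matched byte range onto each fragment.
    let mut offset = 0;
    let mut inserted = false;
    for fragment in fragments.iter_mut() {
        let frag_start = offset;
        let frag_end = offset + fragment.len();
        offset = frag_end;

        // Overlap between the match and this fragment, in joined-string offsets.
        let lo = start.max(frag_start);
        let hi = end.min(frag_end);
        if lo >= hi {
            continue;
        }

        // The first overlapping fragment receives the replacement; later ones
        // only lose their part of the matched text.
        let insert = if inserted { "" } else { replacement };
        fragment.replace_range(lo - frag_start..hi - frag_start, insert);
        inserted = true;
    }
    true
}
```

Putting the whole replacement into the first overlapping fragment is only one possible choice; it keeps the number of fragments unchanged, but it can leave empty fragments behind.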
