[Bug] pdf2parquet: identical PDF files have different contents #812

Open
1 of 2 tasks
sujee opened this issue Nov 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@sujee
Contributor

sujee commented Nov 19, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

I ran pdf2parquet on two identical files. However, the extracted content is slightly different.

earth.pdf

Here is the diff in content extracted...

 "name": "earth",
 "origin": {"binary_hash": 17915699055171962696,
            "filename": "earth.pdf",
            "mimetype": "application/pdf"},

earth-copy.pdf

 "name": "earth-copy",
 "origin": {"binary_hash": 17915699055171962696,
            "filename": "earth-copy.pdf",
            "mimetype": "application/pdf"},

The document_hash is calculated correctly (same).

I think we should strip the metadata out of contents so that identical files produce identical contents.

This matters because the difference may have implications for ededupe / fdedupe.
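
To make the proposal concrete, here is a minimal sketch of what stripping the metadata could look like. It assumes that `name` and `origin.filename` (the two fields that differ in the diff above) are the only filename-dependent fields and that they sit at the top level of the contents JSON. The helpers `strip_file_metadata` and `canonical_contents` are hypothetical names used for illustration, not part of pdf2parquet.

```python
import copy
import json


def strip_file_metadata(doc: dict) -> dict:
    """Return a copy of the document without filename-dependent fields.

    Assumption: only "name" and "origin.filename" depend on the file name,
    as in the diff shown in this issue.
    """
    cleaned = copy.deepcopy(doc)
    cleaned.pop("name", None)
    origin = cleaned.get("origin")
    if isinstance(origin, dict):
        origin.pop("filename", None)
    return cleaned


def canonical_contents(contents: str) -> str:
    """Parse the contents JSON, strip metadata, and re-serialize with sorted keys."""
    return json.dumps(strip_file_metadata(json.loads(contents)), sort_keys=True)


# Example built from the two snippets in this issue.
earth = json.dumps({
    "name": "earth",
    "origin": {"binary_hash": 17915699055171962696,
               "filename": "earth.pdf",
               "mimetype": "application/pdf"},
})
earth_copy = json.dumps({
    "name": "earth-copy",
    "origin": {"binary_hash": 17915699055171962696,
               "filename": "earth-copy.pdf",
               "mimetype": "application/pdf"},
})

same_after_stripping = canonical_contents(earth) == canonical_contents(earth_copy)
```

With the fields removed, the two strings compare equal, so an exact-match dedup over the normalized contents would treat the files as duplicates.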

Related to #605

Reproduction script

https://github.com/sujee/data-prep-kit/blob/dpk-intro-example-v2/examples/notebooks/intro/dpk_intro_1_python.ipynb

See step 3.4

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

@sujee added the bug label Nov 19, 2024

@dolfim-ibm
Member

This is the expected behavior. I don't think it makes sense to run a text dedup on a JSON document. If anything, it should be a specialization that knows about the JSON format (and in this case it could simply compare the binary_hash).
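
One possible reading of the JSON-aware specialization mentioned above is sketched below. It assumes each contents value is a JSON string with an `origin.binary_hash` field, as in the snippets in this issue. `dedupe_by_binary_hash` is a hypothetical helper, not an existing ededupe or fdedupe API.

```python
import json


def dedupe_by_binary_hash(contents_list: list[str]) -> list[str]:
    """Keep the first document seen for each origin.binary_hash.

    Documents without a binary_hash are always kept, since there is
    nothing to compare them on.
    """
    seen = set()
    unique = []
    for contents in contents_list:
        doc = json.loads(contents)
        key = (doc.get("origin") or {}).get("binary_hash")
        if key is not None and key in seen:
            continue  # duplicate of an earlier document
        if key is not None:
            seen.add(key)
        unique.append(contents)
    return unique


# Example using the two documents from this issue.
docs = [
    json.dumps({"name": "earth",
                "origin": {"binary_hash": 17915699055171962696,
                           "filename": "earth.pdf"}}),
    json.dumps({"name": "earth-copy",
                "origin": {"binary_hash": 17915699055171962696,
                           "filename": "earth-copy.pdf"}}),
]
unique_docs = dedupe_by_binary_hash(docs)
```

Comparing on the hash leaves the stored contents untouched, which fits the view that the differing `name` and `filename` fields are expected.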
