[Bug] pdf2parquet: identical PDF files have different `contents` #812

sujee · 2024-11-19T08:18:19Z

Search before asking

I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

I ran pdf2parquet on two identical files. However, the extracted content is slightly different.

Here is the diff in the extracted content:

earth.pdf

 "name": "earth",
 "origin": {"binary_hash": 17915699055171962696,
            "filename": "earth.pdf",
            "mimetype": "application/pdf"},

earth-copy.pdf

 "name": "earth-copy",
 "origin": {"binary_hash": 17915699055171962696,
            "filename": "earth-copy.pdf",
            "mimetype": "application/pdf"},

The document_hash is calculated correctly (it is the same for both files).

I think we should strip the file-specific metadata (such as `name` and `origin.filename`) from `contents` so that identical PDFs produce identical `contents`, because the current difference may have implications for ededupe / fdedupe.
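To make the proposal concrete, here is a minimal sketch of what stripping the metadata could look like. It assumes `contents` is a JSON string and that `name` and `origin.filename` are the only fields that differ, as in the diff above. `normalize_contents` is a hypothetical helper, not part of pdf2parquet.

```python
import json


def normalize_contents(contents: str) -> str:
    """Hypothetical helper: drop fields that depend on the file name, not the file bytes."""
    doc = json.loads(contents)
    # "name" is derived from the file name (assumption based on the diff above).
    doc.pop("name", None)
    origin = doc.get("origin")
    if isinstance(origin, dict):
        # "filename" differs between copies; binary_hash and mimetype do not.
        origin.pop("filename", None)
    # sort_keys gives a stable serialization, so equal documents compare equal as text.
    return json.dumps(doc, sort_keys=True)


earth = json.dumps({
    "name": "earth",
    "origin": {"binary_hash": 17915699055171962696,
               "filename": "earth.pdf",
               "mimetype": "application/pdf"},
})
earth_copy = json.dumps({
    "name": "earth-copy",
    "origin": {"binary_hash": 17915699055171962696,
               "filename": "earth-copy.pdf",
               "mimetype": "application/pdf"},
})

same_after_normalizing = normalize_contents(earth) == normalize_contents(earth_copy)
```

With the two sample records, the raw strings differ but the normalized strings match, which is the property an exact-match dedup would need.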

Related to #605

Reproduction script

https://github.com/sujee/data-prep-kit/blob/dpk-intro-example-v2/examples/notebooks/intro/dpk_intro_1_python.ipynb

See step 3.4

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!


dolfim-ibm · 2024-11-19T08:50:48Z

This is the expected behavior. I don't think it makes sense to run a text dedup on a JSON document. If anything, it should be a specialization that knows about the JSON format (and in this case it could simply compare the binary_hash).
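A JSON-aware dedup of the kind suggested here might look roughly like the sketch below. It assumes each `contents` value is a JSON string with an `origin.binary_hash` field, as in the records shown in the issue. `dedupe_by_binary_hash` is a hypothetical name, not an existing ededupe / fdedupe API.

```python
import json


def dedupe_by_binary_hash(contents_list):
    """Hypothetical sketch: keep the first document seen for each origin.binary_hash."""
    seen = set()
    unique = []
    for contents in contents_list:
        # Parse the JSON instead of comparing the raw text, so file-name
        # differences in other fields do not matter.
        key = json.loads(contents)["origin"]["binary_hash"]
        if key not in seen:
            seen.add(key)
            unique.append(contents)
    return unique


docs = [
    json.dumps({"name": "earth",
                "origin": {"binary_hash": 17915699055171962696,
                           "filename": "earth.pdf",
                           "mimetype": "application/pdf"}}),
    json.dumps({"name": "earth-copy",
                "origin": {"binary_hash": 17915699055171962696,
                           "filename": "earth-copy.pdf",
                           "mimetype": "application/pdf"}}),
]

unique_docs = dedupe_by_binary_hash(docs)
```

Here `earth-copy.pdf` is dropped as a duplicate of `earth.pdf` because both records carry the same `binary_hash`, even though their `contents` strings differ.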

sujee added the `bug` label on Nov 19, 2024
