This is the expected behavior. I don't think it makes sense to run a text dedup on a JSON document. If anything, it should be a specialization that knows about the JSON format (and in that case it could simply compare the binary_hash).
Search before asking
Component
Tools/ingest2parquet
What happened + What you expected to happen
I ran pdf2parquet on identical files. However, the extracted content is slightly different.
earth.pdf
Here is the diff in content extracted...
earth-copy.pdf
The document_hash is calculated correctly (same). I think we should strip out the 'meta' data from contents so they can be the same. This may have implications for ededupe / fdedupe.
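To illustrate the idea, here is a minimal sketch of comparing two extracted documents after dropping the metadata. It assumes that contents is a JSON string and that the varying part sits under a key named "meta". Both the key name and the sample documents are hypothetical, since the actual diff is not reproduced here. The helper content_hash is not part of data-prep-kit.

```python
import hashlib
import json


def content_hash(contents: str, volatile_keys=("meta",)) -> str:
    """Hash a JSON document after dropping keys expected to vary between runs.

    `volatile_keys` is an assumption: adjust it to whatever fields actually differ.
    """
    doc = json.loads(contents)

    def strip(node):
        # Recursively remove the volatile keys from nested dicts and lists.
        if isinstance(node, dict):
            return {k: strip(v) for k, v in node.items() if k not in volatile_keys}
        if isinstance(node, list):
            return [strip(v) for v in node]
        return node

    # Serialize with sorted keys so key order does not affect the hash.
    canonical = json.dumps(strip(doc), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical extracted documents that differ only in their metadata.
a = json.dumps({"meta": {"filename": "earth.pdf"}, "text": "Our solar system"})
b = json.dumps({"meta": {"filename": "earth-copy.pdf"}, "text": "Our solar system"})

same_raw = a == b
same_after_strip = content_hash(a) == content_hash(b)
print(same_raw, same_after_strip)
```

With a hash like this, an exact-dedup step could treat the two files as duplicates even though the raw contents differ. Whether that belongs in the extractor or in a JSON-aware dedup step is a separate design question.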
Related to #605
Reproduction script
https://github.com/sujee/data-prep-kit/blob/dpk-intro-example-v2/examples/notebooks/intro/dpk_intro_1_python.ipynb
See step 3.4
Anything else
No response
OS
Ubuntu
Python
3.11.x
Are you willing to submit a PR?