[Bug] pdf2parquet must calculate hash and size on the file #605
Comments
At the moment the [...]. Internally, the JSON has a property [...]. It could indeed make sense to expose that one as well. Where? Should it be the [...]?
@dolfim-ibm With the new Docling integration, will this be addressed as well?
Reading the thread above again, there were some open questions about which fields to expose and under which names. Exposing both is certainly a good idea, since they serve different purposes.
Should be fixed in #756.
@sujee Can you test and see if this can be closed?
pdf2pq now blocked on #767
Search before asking
Component
Tools/ingest2parquet
What happened + What you expected to happen
I had duplicate documents (see attached).
I expected exact duplicates of a file to have the same size and hash.
However, the hash appears to be calculated on 'contents', which is the actual content plus metadata (such as the file name).
I think the hash and size should be calculated on the actual file/document, not on the metadata.
Expected Behaviour
- hash should be identical for identical files
- size should be the physical file size in bytes
- file_hash and file_size
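The expected behaviour above can be sketched as follows. This is only an illustration, not the pdf2parquet implementation: the helper name `file_hash_and_size` is hypothetical, and SHA-256 is an assumption, since the issue does not say which hash algorithm the transform uses. The point is that both values are derived from the raw bytes on disk, so the file name and any other metadata cannot influence them.

```python
import hashlib
from pathlib import Path


def file_hash_and_size(path, chunk_size=1 << 20):
    """Return (hex digest, size in bytes) computed from the raw file bytes only.

    Hypothetical helper for illustration; SHA-256 is an assumed choice.
    """
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

With a helper like this, a file and a byte-for-byte copy stored under a different name yield the same pair of values, which is the property the issue asks for.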
Reproduction script
earth.pdf
Create a copy of the above file.
Execute the pdf2parquet section here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb
Anything else
No response
OS
Ubuntu
Python
3.11.x
Are you willing to submit a PR?