MD5 at DataLake is not available for ACL Solution #1742

Open
singloudly90 opened this issue Jun 18, 2024 · 2 comments
Labels
auth, open issue (A validated issue that should be tackled. Comment if you'd like it assigned to you.)

Comments

@singloudly90

Please provide us with the following information:

I understand that A added checks to see what's been uploaded before. The prepdocs script now writes an .md5 file with an MD5 hash of each file that gets uploaded. Whenever the prepdocs script is re-run, that hash is checked against the current hash and the file is skipped if it hasn't changed.
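
For readers unfamiliar with the check described above, here is a minimal sketch of the idea, not the actual prepdocs code. The function names `file_md5` and `needs_upload` and the sidecar naming (`<filename>.md5` next to the source file) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of the file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def needs_upload(path: Path) -> bool:
    """Return True if the file is new or changed, and refresh its .md5 sidecar.

    Hypothetical helper: the sidecar file lives next to the source file
    and holds the hash recorded on the previous run.
    """
    sidecar = path.with_name(path.name + ".md5")
    current = file_md5(path)
    if sidecar.exists() and sidecar.read_text(encoding="utf-8").strip() == current:
        return False  # unchanged since the last run, so skip it
    sidecar.write_text(current, encoding="utf-8")
    return True
```

Because the sidecar sits in the local folder, a check like this only sees files that pass through that folder, which may be why it does not carry over to files that already live in the data lake.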

However, when I tried the ACL solution, I realised that the MD5 file was not created as expected, unlike in the solution without ACL.
Correct me if I am wrong:
Without ACL solution: Upload files from local folder, MD5 generated at local folder, files uploaded to blob storage and to AI Search Index.
With ACL solution: Upload files from local folder to datalake, datalake to AI Search.

These solutions differ in how they process files...

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Any log messages given by the failure

Expected/desired behavior

With ACL solution: Upload files from local folder to datalake, MD5 generated in datalake, datalake to blob storage and to AI Search.

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

@pamelafox
Collaborator

cc @mattgotteiner

So is your goal to be able to repeatedly re-run prepdocs to pick up new files in ADLS2, without having to re-index existing files? I think we'd probably want to implement #942 for both normal Blob storage and ADLS2, which would mean the MD5 would be stored in the blob itself, and we'd check against that.
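
A rough sketch of what "stored in the blob itself" could look like, using an in-memory stand-in rather than the Azure SDK. `FakeBlobStore`, `StoredBlob`, and `upload_if_changed` are hypothetical names for illustration only; a real implementation would keep the hash in the blob's properties or metadata and is not shown here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredBlob:
    data: bytes
    content_md5: str  # hash kept with the blob, not in a local .md5 file


@dataclass
class FakeBlobStore:
    """Hypothetical in-memory stand-in for a Blob or ADLS Gen2 container."""
    blobs: dict[str, StoredBlob] = field(default_factory=dict)


def upload_if_changed(store: FakeBlobStore, name: str, data: bytes) -> bool:
    """Upload data unless the hash stored with the blob matches.

    Returns True when the blob was written, False when it was skipped.
    """
    digest = hashlib.md5(data).hexdigest()
    existing = store.blobs.get(name)
    if existing is not None and existing.content_md5 == digest:
        return False  # same content already stored, skip re-upload and re-index
    store.blobs[name] = StoredBlob(data=data, content_md5=digest)
    return True
```

With the hash living next to the data, the check no longer depends on which machine or folder the file was uploaded from, so the same logic could apply to both storage types.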

@pamelafox added the open issue and auth labels on Jun 21, 2024
@RCGEnableBigDataDeveloper

@pamelafox this could be a great feature, since in production the docs are sitting somewhere on the lake where other systems may be able to drop files.
