How to add custom compression / decompression? #1784

Open
dotphoton-ziad opened this issue Jul 15, 2022 · 15 comments
@dotphoton-ziad

Hi all! I have all my data stored in an S3 bucket and I would like to use Hub to load my data from S3, yet my data on the cloud is compressed using Jetraw by Dotphoton and I would like it to decompress the images as I am pulling them from the cloud. I would be ready to write code to make this happen, but as I am new to the code base I would like to know where this would fit in the most and where I should start. Thank you all in advance!

@davidbuniat
Member

Hey @ziadomalik, we would love to accept a contribution that adds support for your compression method. Most of our compression code is written here: https://github.com/activeloopai/Hub/blob/main/hub/core/compression.py.

Feel free to join our Slack community at https://slack.activeloop.ai (#develop channel) to discuss in more detail how to complete the contribution.

Looking forward to it.

@FayazRahman
Contributor

Hey @ziadomalik. The list of supported compressions can be found in hub/compression.py. Make sure to add your new format there. The decompression code can be found at hub/core/compression.py. Import the required libraries and write your decompression function (something like _decompress_jetraw). Also see the decompress_array function in hub/core/compression.py - that's where your function will be called. Let me know if you need more help.
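For readers following along, the steps above can be sketched roughly as below. This is a minimal, self-contained illustration and not Hub's actual code: `SUPPORTED_COMPRESSIONS`, `_DECOMPRESSORS`, and `register_compression` are hypothetical names, `decompress_array` here is a simplified stand-in that does not mirror the real function's signature, and `zlib` stands in for the Jetraw decoder so the sketch runs without Jetraw installed.

```python
import zlib
from typing import Callable, Dict, List

# Hypothetical stand-in for the list of supported compressions
# (per the thread, the real list lives in hub/compression.py).
SUPPORTED_COMPRESSIONS: List[str] = ["zlib"]

# Hypothetical lookup table: compression name -> decoder function.
_DECOMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.decompress,
}


def _decompress_jetraw(buffer: bytes) -> bytes:
    """Placeholder decoder for the new format.

    A real implementation would call into the Jetraw library here;
    zlib is used only as an assumption so that the sketch is runnable.
    """
    return zlib.decompress(buffer)


def register_compression(name: str, decoder: Callable[[bytes], bytes]) -> None:
    """Add a format to the supported list and remember its decoder."""
    if name not in SUPPORTED_COMPRESSIONS:
        SUPPORTED_COMPRESSIONS.append(name)
    _DECOMPRESSORS[name] = decoder


def decompress_array(buffer: bytes, compression: str) -> bytes:
    """Simplified dispatcher: pick the decoder that matches the format name."""
    try:
        decoder = _DECOMPRESSORS[compression]
    except KeyError:
        raise ValueError(f"Unsupported compression: {compression!r}") from None
    return decoder(buffer)


# Step 1: declare the new format. Step 2: wire its decoder into the dispatcher.
register_compression("jetraw", _decompress_jetraw)
```

With this in place, a call such as `decompress_array(data, "jetraw")` is routed to `_decompress_jetraw`, while an unregistered name raises `ValueError`. The actual Hub functions may take additional arguments (shape, dtype, and so on), so treat this only as a picture of where the new function plugs in.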

@h20200051

@ziadomalik I would like to work on this. Kindly assign it to me.

@mikayelh
Collaborator

Hey @h20200051, we can only assign one issue per contributor. Which one would you like to take on?

@Hussain0520

Hey @mikayelh @davidbuniat is this issue still open? I would like to try and contribute to this issue.

@dotphoton-ziad
Author

Hi all, I've been assigned to work on other projects, so this issue is on hold (for now). It's still something we are actively discussing, and if it's appropriate, we could close this issue and reopen it once it becomes relevant again. I still need the green light from my Project Manager. I hope you understand, and thank you for your patience.

@mikayelh
Collaborator

mikayelh commented Sep 15, 2022

@ziadomalik if you'd like, we can assign this issue to @Hussain0520 to work on it in the meantime, but if you want to be the one who writes this particular code, I'm not against putting this on hold. Maybe the best solution could be allowing someone else to take a stab and then improving on their contribution later on?

@dotphoton-ziad
Author

@mikayelh Sounds like a plan! As a starting point, you guys could check out our Python documentation here and learn about the technology itself here. We also have a C++ API. Whenever you have questions, code review requests or anything I could help with, feel free to ping me!
cc: @Hussain0520

@mikayelh
Collaborator

That's awesome!

hey @Hussain0520! I've just assigned you this issue - feel free to check in with us and @ziadomalik in case you need any help! Thanks for following up, @ziadomalik :)

@dotphoton-ziad
Author

Hi, so I spoke with my project manager. Normally, we wanted to postpone this all the way to January because internally, we are still experimenting with the cloud and how our compression fits best into that context. If you guys would like to discuss, we could hop on a call so we could figure out the best way we can integrate Jetraw into the Activeloop Hub.
cc: @mikayelh

@Hussain0520

Thank you @mikayelh @ziadomalik . I'll surely contact you guys for help.

@St3V0Bay

St3V0Bay commented Nov 22, 2022

Please allow me to sneak in here... I was looking for a way to compress my 1 million nifti stacks. On one hand, "nifti" is not yet supported by deeplake (but dicom is). On the other hand, I was looking for a way to use the general dtype and add a custom compression on top. I was even more surprised to see someone from dotphoton here (@ziadomalik). You guys have been on my list for more than a year. The stars seem to align :-)

@istranic
Contributor

Hi @St3V0Bay. Thx for following up on this thread! Adding custom compression is quite tricky, because even if it's implemented in Deep Lake OSS, it won't work in our visualizer or the optimized C++ dataloader, because they are not in the OSS repo.

We're also happy to add support for your nifti data directly. Are you working with dicom files that are combined into nifti stacks? If you're able to provide us with example data, we can implement support for it across our stacks.

Regarding dotphoton, are you using any of their compressions currently, or is this something you're excited about for future work?

@St3V0Bay

Hi @istranic,
thanks for the swift response. I see - so custom compressions are a bit tricky to handle.

Re nifti: In the medical imaging domain most open-source data is offered as nifti (bioimaging has their own preference however). That's the best format for data scientists to get started. However, DICOM is the true standard that is actually used in the clinic (you have that already integrated, which is great). To pool DICOM data with open-sourced nifti files, the dicom files are converted (e.g. https://github.com/rordenlab/dcm2niix). The other way (from nifti > dicom) is a lot more complicated.

Example nifti files can be pulled using this repo (https://github.com/neheller/kits19). After installation it is just a one-liner. You can look at the data using ITKSnap (for example; http://www.itksnap.org/pmwiki/pmwiki.php) and it can be opened in Python with the nibabel library (https://nipy.org/nibabel/). Another huge nifti repository is here: http://medicaldecathlon.com/

Re dotphoton: we are not using it. But their value proposition is really charming: lower storage costs and faster data transfer. In projects of a certain size this really starts to matter, because things add up quickly if you have literally millions of data points.

@istranic
Contributor

Thanks for the info @St3V0Bay. We'll keep you in the loop regarding our decision making around nifti support.

@Hussain0520 Hussain0520 removed their assignment Feb 11, 2023
9 participants