Support more formats #109
Comments
Will it support git? |
@BingoKingo What part of git should ratarmount support? git-archive creates .tar.gz or zip archives, so that is already supported. Git packfiles? Not sure what the use case for that would be with ratarmount except that it would be nice to simply inspect all kinds of different archives manually for whatever reason. Or do you want to mount a remote git repository without fully cloning it? I'm not sure the protocol allows for that. Using ratarmount to show a checked out view without actually checking it out? |
I mean something like gitfs. |
Hm, I think this is out of scope for ratarmount, especially as a solution already exists, even if it hasn't seen many commits in the last 3-4 years. |
Hello @mxmlnkn, First, I would like to thank you for this amazing project. I'm very interested in it, and I would love to do a PR to add at least the 7z format. However, before doing that, I have four questions:
Regarding libarchive, you said that there is currently no proper binding in Python allowing you to use it. I think you were referring to this project when you said that: https://github.com/smartfile/python-libarchive (it uses SWIG internally; I had issues with SWIG in past projects, and it felt unreliable back then). I have found this project, which had a big update last July: https://github.com/Changaco/python-libarchive-c. It only uses ctypes for the C binding, like fusepy.
Last but not least, I'm a little confused by a sentence in the main description of
Have a great evening, |
PRs would be very welcome! The old libarchive branch contains my earlier try with failing tests because of issues with the libarchive Python bindings. The tests could be reused / cherry-picked for your PR.
It's probably fine to use it. In the worst case, the dependency could be made optional at first. I think I didn't consider it much because it only seems to solve one format issue, 7z, while libarchive would solve many formats. But one more format is better than none. Another issue might be the LGPL license. Then again, there are other dependent projects mentioned in its own ReadMe that are MIT-licensed: https://github.com/miurahr/aqtinstall . I think LGPL is definitely better than GPL-3, and the manner in which Python packages are distributed and installed might even make them compatible (it allows "relinking" / switching out the LGPL dependency).
The issue wasn't with the binding itself as far as I recall. The issue was with the provided file object, which did not reliably allow arbitrary seeks. Seeking backwards in particular would fail in some scenarios, meaning random access didn't work at all. It might be possible to fix this upstream with a PR or to roll out a custom file object abstraction if the lower-level bindings are sufficiently exposed.
I was checking libarchive bindings again recently and I saw that this project got active again. I never looked deeper into it though. It could be worth a try.
I'm pretty sure that random access to formats such as gzip and bzip2 is not supported, I would have to check the source code to be completely sure, or do some benchmarks. The benchmark in the bottom left "Time Required to Get Contents of One File" is the most indicative. It shows that alternatives can take minutes for compressed archives to seek to a requested file because they presumably have to start decompression from the very beginning of the archive. Providing access to these heavily stream-based formats is difficult, but there are solutions like Other formats, such as 7z, zip, rar will definitely allow seeking to file members, but seeking inside files, when compressed, might also not be possible performantly, i.e., seeking backwards would require reopening that file member and reading from the beginning up to the desired position. For small files, it would not be an issue, only for larger ones. Again, I'm speculating and would have to check the source code. However, random access to bz2 and gzip would always require a kind of index, and such an index is not written by libarchive. So at least subsequent archive mounts would require analyzing the archives again, while ratarmount can simply load the exported indexes.
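To make the cost of backward seeks concrete, here is a minimal sketch of a seekable wrapper around a forward-only stream. It is not ratarmount or libarchive code. The class name `ReopeningReader`, the `opener` callable, and the `reopen_count` attribute are assumptions made up for illustration. A backward seek reopens the stream and reads again from the beginning, which is the behavior speculated about above for compressed archive members.

```python
import io
from typing import BinaryIO, Callable


class ReopeningReader(io.RawIOBase):
    """Seekable view over a stream that can only be read forward (hypothetical helper)."""

    def __init__(self, opener: Callable[[], BinaryIO]) -> None:
        super().__init__()
        self._opener = opener      # returns a fresh stream positioned at offset 0
        self._stream = opener()
        self._pos = 0
        self.reopen_count = 0      # how often we had to start over

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def _skip(self, count: int) -> None:
        # Forward seeks are emulated by reading and discarding data.
        while count > 0:
            chunk = self._stream.read(min(count, 1 << 20))
            if not chunk:
                break
            count -= len(chunk)
            self._pos += len(chunk)

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_CUR:
            offset += self._pos
        elif whence != io.SEEK_SET:
            raise ValueError("this sketch only supports SEEK_SET and SEEK_CUR")
        if offset < 0:
            raise ValueError("negative seek position")
        if offset < self._pos:
            # Backward seek: reopen and read from the beginning again.
            self._stream.close()
            self._stream = self._opener()
            self._pos = 0
            self.reopen_count += 1
        self._skip(offset - self._pos)
        return self._pos

    def readinto(self, buffer) -> int:
        data = self._stream.read(len(buffer))
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)

    def close(self) -> None:
        self._stream.close()
        super().close()
```

For small members the repeated reading is cheap. For a member of several gigabytes, every backward seek costs a full re-read up to the target offset, which is why an index or a cache matters there.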
In principle yes. In practice, there are requirements on zstd and xz files, so random access will not work performantly with all of these files. The general rule would be, if it was compressed in parallel, then random access is highly likely possible. Lastly, I would mention fsspec again. It seems to provide a filesystem abstraction interface akin to my There also are some performance results for 7z access in this issue. py7zr seems to be 7x slower than 7z.exe and libarchive-c seems to be 20x slower. It looks bad for Python bindings. And, if Windows-support is intended, then python-libarchive would also be disqualified. |
First, thank you for your lengthy answer! :)
Benchmark of libarchive-c (I got good results 🤷)
First, I would like to start with the benchmark issue. I decided to give it a go and to benchmark on some data that I have (sadly, I cannot share the archives, but if needed, we could create a benchmark archive with open-source data). Here are my results, on 2 files, on Linux with 32 GB of RAM and 8 cores (i7-7820HQ):
As you can see, on this setup,
Regarding libarchive
I noticed that libarchive was able to open a zip archive that zipfile was not able to open (sadly, I cannot share it, but maybe I could create a similar one). So I decided to investigate libarchive more and disregard for now
Libarchive doesn't support random seeks within a file, and the underlying C structs (different for each format) inside libarchive are not accessible. They did, however, implement:
Regarding fsspec
I didn't know that fsspec had an implementation of libarchive! This is great! I think a Proof of Concept could be done with a variant of what fsspec did for libarchive, either by maintaining a special version for ratarmount or by submitting a PR to fsspec. Here is what I thought:
The issue here is that putting a file into a MemoryFile implies that every file accessed through it fits into memory. On the data that I have used in different projects so far, sometimes you can make that assumption, sometimes you cannot. You can have a 4 GB file that needs to be accessed (and it could even be a nested archive). Some of their other implementations, such as the FTP one, handle files as a custom object that inherits their caching attribute (see here). There are different kinds of caching in fsspec:
For kinds 2 and 3, you can choose different caching strategies (mmap, read ahead, etc.).
My conclusion
I think for ratarmount to be able to support libarchive and fsspec, we should use a caching mechanism with an expiration date (and make it conditional on the files that we would like to cache being small enough). It would allow random access for files inside an archive, and ratarmount could continue to support recursive mounting of archives without trouble. What do you think? PS: I'm not yet sure that I will be able to do this PR time-wise. I think I could maybe do a PoC. But I'm not sure yet; it would largely depend on your vision of the project and what you think about adding libarchive through an fsspec implementation using a (potentially optional) caching mechanism. |
I have a variety of larger files for benchmarking purposes, which I test on only locally because it would be too expensive for the CI:
All of these have different desirable properties and compressibilities.
This looks much more promising than the linked issue. Thank you for doing some benchmarks yourself. Weird, though, how much it differs, but there are so many factors that could play a role as you already stated.
ZIP supports many things, which are not supported on all decoders. Some of these are:
It's probably infeasible to support 100% of the ZIP format specification.
I think this would be a standalone project, although the backend could be reused from a ratarmount implementation. Personally, I want to get around to implementing some kind of
Regarding fsspec
Yes, it would be better to upstream support for passwords.
This does sound less performant than ratarmount because the runtime adds up if there are millions of files, even with skipping over files.
The workaround of using MemoryFile (BytesIO would also work) is not optimal because of possible out-of-memory errors. I'm not entirely sure how the caching is connected to libarchive support.
Ideally, I would want to keep the direct and indirect dependencies to a minimum, so a solution directly with libarchive sounds more amenable. Fun fact: python-libarchive and libarchive-c can be installed at the same time without a conflict warning even though they have the same module name, which did cause errors such as |
I have given the libarchive backend another try in the equally named branch. It only uses simple libarchive function calls without any possibly bugged higher-level Python layers as was the case before. This currently has the mentioned memory usage problem though as opened files are wholly extracted into memory. This should be fixed before merging. There are also some conceptual problems with the index. The index would be incompatible with the TAR backend. But it might be possible under current circumstances to create an index with the libarchive backend and then try to use the index from another ratarmount installation that uses the custom TAR backend. I think I had some heuristic metadata verification but I'll have to check that. Furthermore, it would be nice if the backend priority command line option would also affect libarchive vs. custom TAR backend / ZIP and others. Currently, it only affects the stream compression backends (bz2, gz, ...) for the custom TAR backend. This option could then be used to easily test the problem with index backend compatibility. It would also make handling of chimera files (think a ZIP file appended to a TAR file) more reliable as it could force the ZIP backend being tried first. It would be another step to somehow tell this priority to libarchive, though. |
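The idea of letting a priority list decide which backend is tried first can be sketched with the standard library alone. This is only an illustration under assumptions: the `DETECTORS` registry and `pick_backend` are hypothetical names, and ratarmount's real backend selection is more involved. The sketch shows why ordering matters for a chimera file, because both detectors accept a ZIP appended to a TAR.

```python
import tarfile
import zipfile
from typing import Callable, Dict, Optional, Sequence

# Hypothetical registry mapping a backend name to a format detector.
DETECTORS: Dict[str, Callable[[str], bool]] = {
    "tar": tarfile.is_tarfile,
    "zip": zipfile.is_zipfile,
}


def pick_backend(path: str, priority: Sequence[str]) -> Optional[str]:
    """Return the first backend in `priority` whose detector accepts the file."""
    for name in priority:
        detector = DETECTORS.get(name)
        if detector is not None and detector(path):
            return name
    return None
```

With a TAR file that has a ZIP appended, `pick_backend(path, ["tar", "zip"])` and `pick_backend(path, ["zip", "tar"])` are expected to return different backends, because the TAR header sits at the start of the file while the ZIP end-of-central-directory record sits at the end.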
Awesome! Could you create a draft PR maybe? It would help to see what changes were made, and we would have a place to discuss them. Regarding the memory usage, I think we should let the user define a temp directory where we could keep extracted files. I could help with that if you want. I'm not sure I grasp the entire issue regarding the index problem. As I mentioned before, it seems that you cannot get the real offset information of files using libarchive; you can only use the entryCount (as you did in your branch). So, I think that this index should only be used by the libarchive backend. I think it's fine if it's not cross-compatible with the tar backend. But again, maybe I have not understood the issue here |
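One way to combine in-memory extraction for small files with a user-defined temp directory for large ones is Python's `tempfile.SpooledTemporaryFile`, which keeps data in memory until it exceeds `max_size` and then moves it to a file on disk. The following is a minimal sketch, not code from the branch. The function name `extract_to_spool` and its parameters are assumptions for illustration.

```python
import shutil
import tempfile
from typing import BinaryIO, Optional


def extract_to_spool(
    source: BinaryIO,
    max_memory: int = 1 << 20,
    temp_dir: Optional[str] = None,
):
    """Copy `source` into a seekable file object.

    Up to `max_memory` bytes stay in memory. Larger contents are written
    to a temporary file inside `temp_dir` (system default if None).
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_memory, dir=temp_dir)
    shutil.copyfileobj(source, spool)
    spool.seek(0)
    return spool
```

The returned object supports `read`, `seek`, and `tell`, so it can stand in for a fully extracted member without the out-of-memory risk for multi-gigabyte files, at the cost of temporary disk space.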
It is fine, but it should be detected. Currently, older versions would simply load an index created by the libarchive backend and then return Input/Output errors when trying to access a file. I have now added lots of checks to prevent this in the future, but that does not fix compatibility with older ratarmount versions. For now, I have decided not to write the index out because it doesn't help performance that much. Each file seek currently has to reparse / go over all archive entries and would therefore be almost as slow as index creation. This cannot be solved. That's why the custom tar-, bzip2-, gzip-, ... parsing in ratarmount exists in the first place.
I think the better approach would be to implement seeking in
I have opened a PR. There are some open todos:
Blockers:
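A check of this kind can be sketched by storing the name of the backend that created the index and comparing it on load. This is an assumption-laden illustration using an SQLite database: the `metadata` table, the `backend` key, and both function names are hypothetical and do not describe ratarmount's actual index schema or its actual checks.

```python
import sqlite3


def write_backend_tag(connection: sqlite3.Connection, backend: str) -> None:
    """Record which backend created this index (hypothetical schema)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS metadata (key TEXT PRIMARY KEY, value TEXT)"
    )
    connection.execute(
        "INSERT OR REPLACE INTO metadata VALUES ('backend', ?)", (backend,)
    )
    connection.commit()


def index_is_usable(connection: sqlite3.Connection, expected_backend: str) -> bool:
    """Accept the index only if it was tagged by the backend that wants to load it."""
    try:
        row = connection.execute(
            "SELECT value FROM metadata WHERE key = 'backend'"
        ).fetchone()
    except sqlite3.OperationalError:
        return False  # no metadata table: treat the index as unusable
    return row is not None and row[0] == expected_backend
```

A check like this only protects versions that know about the tag. As noted above, older versions that never look at it would still load an incompatible index.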
Nice to have:
If you want to give one of these a try, then let me know, else I'll hopefully slowly but steadily fix these open issues. |
Merged |
@mxmlnkn Thank you so much for your work! I thought I would have some time to actually help you on this... Do you think there is a need for a cache system for libarchive? If so, I could do a PR later on. |
To cache small files? I'm not sure that is necessary, and some benchmarks would be needed to (dis)prove it. The LibarchiveFile implementation is buffered. It reads in 1 MiB chunks and, if possible, avoids full reparsing via libarchive and only seeks inside the buffer. This also means that small files <= 1 MiB are fully cached. Currently, I'm working on a PySquashfsImage backend. The simple implementation in the squashfs branch already works, but it has some major performance issues that I'd like to try to fix before merging it. Maybe some of that could even be merged upstream into PySquashfsImage. The two other projects I arrived at in the linked issue from 3 days ago would be:
I have not started with either of these two. You could also still review the already merged PR/commits for libarchive or simply test them. Maybe there are still (performance) bugs there. |
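The buffering described above for LibarchiveFile can be illustrated with a small sketch. This is not the actual LibarchiveFile implementation. The class `ChunkBufferedFile`, the `fetch` callable (a stand-in for the expensive reparse-and-read step), and the `fetch_count` attribute are assumptions made up for illustration. Only the 1 MiB chunk size and the idea of serving seeks from the buffer come from the comment.

```python
from typing import Callable

CHUNK_SIZE = 1024 * 1024  # 1 MiB, the chunk size mentioned in the comment above


class ChunkBufferedFile:
    """Serve reads from one cached chunk and call `fetch` only on a buffer miss."""

    def __init__(
        self,
        fetch: Callable[[int, int], bytes],  # hypothetical: fetch(offset, size) -> bytes
        size: int,
        chunk_size: int = CHUNK_SIZE,
    ) -> None:
        self._fetch = fetch
        self._size = size
        self._chunk_size = chunk_size
        self._buffer = b""
        self._buffer_start = 0
        self._pos = 0
        self.fetch_count = 0  # number of expensive fetches so far

    def seek(self, offset: int) -> int:
        # Seeking only moves the position. No data is fetched here.
        self._pos = max(0, min(offset, self._size))
        return self._pos

    def read(self, count: int) -> bytes:
        result = bytearray()
        while count > 0 and self._pos < self._size:
            buffer_end = self._buffer_start + len(self._buffer)
            if not (self._buffer_start <= self._pos < buffer_end):
                # Buffer miss: load the chunk-aligned block containing the position.
                start = self._pos - self._pos % self._chunk_size
                self._buffer = self._fetch(start, self._chunk_size)
                self._buffer_start = start
                self.fetch_count += 1
            begin = self._pos - self._buffer_start
            piece = self._buffer[begin:begin + count]
            if not piece:
                break  # fetch returned less data than expected
            result += piece
            self._pos += len(piece)
            count -= len(piece)
        return bytes(result)
```

A file no larger than one chunk is fetched once and then served entirely from memory, whichever way the reader seeks. That matches the observation that small files are effectively fully cached.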
After a myriad of smaller and larger papercuts and over a dozen opened issues in other projects, I finally have some fsspec backends up and running. Unfortunately, they differ quite a lot from each other implementation-wise, and I experienced different issues with each of them. This is why I would not call it "generic" fsspec support. Other fsspec backends may or may not work, but I have tested these:
I have also added FAT and SquashFS support.
I would be very happy if anyone wants to test these before I do the next release. If other backends, such as Google Cloud, Azure, Dropbox, ... are needed, I also happily accept fixes if they do not work out of the box. I cannot test them right now because I do not have any cloud account and I am running out of steam with all these small issues I've had.
Unfortunately, all these dependencies mean that the AppImage has more than tripled in size :/ I'm not sure how important that is to people, but I liked the earlier 11 MB size, and the 37 MB now are barely passing. Anything more than 50 MB seems excessive to me. I cannot even upload the AppImage in this post because the limit is 25 MB. I'll upload it in the Telegram channel as linked in the ReadMe badges. It can also be installed via:
I tried to make the installation as out-of-the-box as it gets by testing with all versions and using Python version specifiers if some dependencies do not yet support e.g. Python 3.12. So, if there are any issues with this installation, please mention them. The recently merged ZIP decryption 100x performance fix is also included. I feel like this issue can soon be closed. |
Some archive formats I would particularly like to have access to:
Probably unneeded or out of scope:
Done:
Single file compressions:
Other frameworks have very similar goals to ratarmount and might even be further along:
I need to benchmark those but I hope that at least for very big data, ratarmount should still have an edge. If not, that's a bug.