Use tree search instead of linear search on traversed tree for more efficient extraction of single files #34
Comments
Seems like this is something we can implement ourselves on top of the "traversal" features in squashfuse: https://github.com/vasi/squashfuse/blob/0b48352ed7a89d920bb6792ac59f9f6775088f02/traverse.c#L39
Why not do it upstream in squashfuse? This would save others from implementing it themselves, too.
The issue is in our own workflow. Currently, we traverse the entire AppImage and look for matching entries, even if we're only interested in a single file. Likewise, if we just want to extract files from a subdirectory, we still traverse the entire tree and then match all paths against a given pattern. Which algorithm we use to traverse the tree (it seems to be a variant of DFS) doesn't really make a difference. We should not traverse the entire tree at all if we only need to extract a single file, for instance.

Instead, we should make use of the path the user provides: split it into components and search step by step, i.e., "search for the node matching the current path component, descend into it, search for the next component, descend, ...". This skips all the nodes we don't care about and lets us descend into the tree as far as we can (e.g., for ...).

We don't have full access to the tree and are somewhat bound to what squashfuse provides. In an ideal world where we had access to the underlying tree, we could increase performance even further (e.g., for directories with large numbers of files, by using a binary search, since most filesystems store the entries in order), but that's peanuts compared to what we could do with what squashfuse already provides. It's quite a custom workflow, nothing to be implemented in squashfuse.

TL;DR: We don't make use of the features provided by squashfuse, but use a very simple and highly inefficient way to extract files. If we made use of some context information we have anyway, we could greatly increase the efficiency of some of our algorithms.
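To make the component-wise search described above concrete, here is a minimal sketch in Python. It does not use the squashfuse API: the nested list `tree` is a hypothetical in-memory stand-in for an image, and `find_entry`, `lookup`, `count_entries` and the example paths are names invented for illustration only. Each directory is scanned linearly, which is roughly what is possible when entries can only be iterated, and the scan stops at the first match because names within a directory are unique.

```python
# Hypothetical stand-in for a directory tree (NOT the squashfuse API):
# a directory is a list of (name, child) pairs, a file is a bytes object.
tree = [
    ("usr", [
        ("bin", [("app", b"ELF")]),
        ("lib", [("liba.so", b""), ("libb.so", b"")]),
        ("share", [
            ("applications", [("app.desktop", b"[Desktop Entry]")]),
            ("icons", [("app.png", b"PNG")]),
        ]),
    ]),
    ("AppRun", b"#!/bin/sh"),
    ("app.desktop", b"[Desktop Entry]"),
]


def find_entry(directory, name):
    """Scan one directory linearly; stop at the first match (names are unique)."""
    for examined, (entry_name, child) in enumerate(directory, start=1):
        if entry_name == name:
            return child, examined
    return None, len(directory)


def lookup(root, path):
    """Resolve `path` component by component, descending only into matches.

    Returns (node or None, number of directory entries examined).
    """
    node, examined_total = root, 0
    for component in filter(None, path.split("/")):
        if not isinstance(node, list):
            return None, examined_total  # cannot descend into a file
        node, examined = find_entry(node, component)
        examined_total += examined
        if node is None:
            return None, examined_total  # component does not exist
    return node, examined_total


def count_entries(directory):
    """Number of entries a full traversal of the tree would visit."""
    total = 0
    for _, child in directory:
        total += 1
        if isinstance(child, list):
            total += count_entries(child)
    return total


content, examined = lookup(tree, "usr/share/applications/app.desktop")
print(examined, "entries examined vs", count_entries(tree), "in a full traversal")
# prints: 6 entries examined vs 13 in a full traversal
```

The gap between the two numbers grows with the size of the directories that are skipped, since whole subtrees such as `usr/lib` are never entered.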
Thanks for explaining @TheAssassin. Can you point to some online resource about the efficiency of the various tree searching strategies, and recommend the one we should be implementing?
I already explained the most efficient strategy I came up with. It's a modified dfs (depth-first search, "Tiefensuche"), which skips paths which are known not to lead to a result. It makes use of the uniqueness of names in a directory. I've been looking this up myself a bit, but I couldn't find any good references.
Thanks for explaining this. Looks like we should implement it as per your description then. (I only found the non-modified one: https://en.wikipedia.org/wiki/Depth-first_search)
Depth-first search is what most tools use to completely traverse a tree structure, as it is more economical with regard to memory (paid for with performance compared to other algorithms like BFS, "No Free Lunch"-like). I said "modified DFS" because we'd recurse similarly to DFS, but try to skip all the paths in the tree we don't need.

The following graph illustrates the proposal:

[Figure: tree diagram in which red arrows mark the nodes that are visited]

The task here is: "find all icons (for desktop integration)". We know the files will be in ... First of all, we split the path into components: ... We start at ... We only visit nodes with a red arrow instead of traversing the entire tree. There will be binaries in ...

The performance benefit here is that we only have to visit a very small subset of all nodes instead of traversing the entire tree and then applying globs or regular expressions to the full paths. The number of "full traversals" is limited to the child nodes of a single node on the 5th level. This should reduce the number of nodes visited to ...
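The pruned descent for a pattern-based task such as the icon search can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: the nested dict `appdir`, the pattern `usr/share/icons/hicolor/*/apps/*` and the helpers `find_matching` and `walk_all` are all hypothetical (the concrete paths from the comment above were not preserved), and nothing here calls squashfuse. The idea shown is to match one glob per path component with the standard `fnmatch` module and to descend only into children whose name matches the component for the current depth.

```python
from fnmatch import fnmatchcase

# Hypothetical AppDir-like tree: a dict is a directory, bytes is a file.
appdir = {
    "AppRun": b"",
    "usr": {
        "bin": {"app": b""},
        "lib": {"liba.so": b"", "libb.so": b""},
        "share": {
            "icons": {
                "hicolor": {
                    "64x64": {"apps": {"app.png": b""}},
                    "scalable": {"apps": {"app.svg": b""}},
                },
            },
        },
    },
}


def find_matching(tree, pattern):
    """Pruned depth-first search: one glob per path component.

    Returns (sorted matching paths, number of nodes matched or entered).
    """
    components = [c for c in pattern.split("/") if c]
    matches = []
    visited = 0

    def descend(node, depth, prefix):
        nonlocal visited
        for name, child in node.items():
            if not fnmatchcase(name, components[depth]):
                continue  # prune: this subtree cannot contain a match
            visited += 1
            path = f"{prefix}/{name}" if prefix else name
            if depth == len(components) - 1:
                matches.append(path)
            elif isinstance(child, dict):
                descend(child, depth + 1, path)

    if components:
        descend(tree, 0, "")
    return sorted(matches), visited


def walk_all(tree, prefix=""):
    """Full traversal, as a baseline: yields every path in the tree."""
    for name, child in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        yield path
        if isinstance(child, dict):
            yield from walk_all(child, path)


icons, visited = find_matching(appdir, "usr/share/icons/hicolor/*/apps/*")
print(icons)
print(visited, "nodes visited vs", len(list(walk_all(appdir))), "in a full traversal")
# prints: 10 nodes visited vs 16 in a full traversal
```

Sibling names are still compared at each level, but subtrees such as `usr/bin` and `usr/lib` are never entered. Only the wildcard components trigger a scan over all children of a node.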
A linear algorithm was used to provide a traversal method generic enough to ensure that every filesystem we might pick to pack the AppImage payload is supported. An algorithm like the one described above is not possible if the filesystem doesn't support it (I'm thinking of ISO 9660 here).
Nobody's questioning the existence of a generic traversal algorithm. The idea is not to use that when looking for specific files, as that's quite inefficient. The suggested algorithm might not work for ISO 9660, but I'm talking about squashfs (via squashfuse) in this case. You may want to check the squashfuse API I linked to.
Replace sqfs traversal by efficient search for finding the files that need to be integrated.
Ideally do this upstream in https://github.com/vasi/squashfuse.