
Use tree search instead of linear search on traversed tree for more efficient extraction of single files #34

Open
probonopd opened this issue Dec 8, 2018 · 9 comments
Labels
performance Enhancements that increase performance

Comments

@probonopd
Member

Replace sqfs traversal by efficient search for finding the files that need to be integrated.

Ideally do this upstream in https://github.com/vasi/squashfuse.

@probonopd probonopd added the performance Enhancements that increase performance label Dec 8, 2018
@TheAssassin TheAssassin changed the title Replace sqfs traversal by efficient search for finding the files that need to be integrated Use tree search instead of linear search on traversed tree for more efficient extraction of single files Dec 9, 2018
@TheAssassin
Member

Seems like this is something we can implement ourselves on top of the "traversal" features in squashfuse: https://github.com/vasi/squashfuse/blob/0b48352ed7a89d920bb6792ac59f9f6775088f02/traverse.c#L39.

@probonopd
Member Author

Why not do it upstream in squashfuse? This would save others from implementing it themselves, too.

@TheAssassin
Member

The issue is in our own workflow. Currently, we traverse the entire AppImage and look for matching entries, even if we're only interested in a single file. Likewise, if we just want to extract files from a subdirectory, we still traverse the entire tree and then match all paths against a given pattern. Which algorithm we use to traverse the tree (it seems to be a variant of DFS) doesn't really make a difference. We should stop traversing the entire tree altogether when we only need to extract a single file, for instance.

What we should do is make use of the path the user provides and perform a search that splits the path into components and then descends step by step: search for the node matching the current path component, descend into it, search for the next component, descend, and so on, skipping all the nodes we don't care about so that we descend into the tree as directly as possible (e.g., for usr/share/icons/*, we can first try to descend down to icons and only then perform any pattern matching).
To further improve performance in the case of wildcards, we should add a parameter specifying the maximum number of results. That allows the search to abort immediately once e.g. 1 of 1 entries has been found, preventing unnecessary operations.
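To make the idea concrete, here is a minimal sketch in Python over a plain nested dict standing in for the squashfs tree (dirs map names to sub-dicts, files to None). This is not squashfuse's actual API; `find_paths`, `fs`, and the tree layout are hypothetical, and the recursive `**` glob is deliberately not handled here:

```python
from fnmatch import fnmatch

def find_paths(tree, components, max_results=None, prefix=""):
    """Descend component by component, skipping non-matching siblings.

    Stops early once `max_results` matches have been collected.
    """
    if not components:
        return [prefix.rstrip("/")]
    results = []
    head, rest = components[0], components[1:]
    for name, child in tree.items():
        if max_results is not None and len(results) >= max_results:
            break  # early exit: enough matches found
        if not fnmatch(name, head):
            continue  # skip this entire subtree
        if not rest:
            results.append(prefix + name)  # final component matched
        elif isinstance(child, dict):  # only directories can be descended into
            remaining = None if max_results is None else max_results - len(results)
            results.extend(find_paths(child, rest, remaining, prefix + name + "/"))
    return results

# hypothetical AppImage payload
fs = {
    "AppRun": None,
    "app.desktop": None,
    "usr": {
        "bin": {"app": None},
        "share": {"icons": {"hicolor": {"256x256": {"apps": {"app.png": None}}}}},
    },
}
```

For example, `find_paths(fs, ["usr", "share", "icons", "hicolor"])` never looks inside usr/bin, and `max_results=1` aborts as soon as the first icon is found.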

We don't have full access to the tree and are somewhat bound to what squashfuse provides. In an ideal world where we had access to the underlying tree, we could squeeze out even more performance (e.g., for directories with large numbers of files, using a binary search, as most filesystems store directory entries in order), but that's peanuts compared to what we can already do with what squashfuse provides.

It's quite a custom workflow, nothing to be implemented in squashfuse.

TL;DR: We don't make use of the features provided by squashfuse, but use a very simple and highly inefficient way to extract files. If we made use of some context information we have anyway, we could greatly increase the efficiency of some of our algorithms.

@probonopd
Member Author

Thanks for explaining @TheAssassin. Can you point to some online resource about the efficiency of the various tree searching strategies, and recommend the one we should be implementing?

@TheAssassin
Member

I already explained the most efficient strategy I came up with. It's a modified DFS (depth-first search) that skips paths known not to lead to a result, making use of the fact that names within a directory are unique. I've looked this up a bit myself, but couldn't find any good references.

@probonopd
Member Author

Thanks for explaining this. Looks like we should implement it as per your description then. (I only found the non-modified one: https://en.wikipedia.org/wiki/Depth-first_search)

@TheAssassin
Member

Depth-first search is what most tools use to completely traverse a tree structure, as it's more economical with memory (paid for with performance compared to other algorithms like BFS, in a "No Free Lunch"-like trade-off). I said "modified DFS" because we'd recurse similarly to a DFS, but try to skip all the paths in the tree we don't need.

The following graph illustrates the proposal:

[Attachment: 2018-12-09-note-14-41.pdf — graph illustrating the proposal]

The task here is: "find all icons (for desktop integration)". We know the files will be in <root>/usr/share/icons/hicolor/**/*.png. The only "complete search" to perform here starts after having found hicolor; therefore the algorithm can first descend to that path and only then perform a normal DFS. This reduces the number of nodes to check significantly; only the nodes with red arrows are "visited" by the algorithm (assuming directory entries are sorted in this specific example).

First of all, we split the path into components: {<root>, usr, share, icons, hicolor, **, *.png}.

We start at / and list its child nodes. We skip all nodes that don't match usr, i.e., app.desktop and AppRun. We find usr; it's a directory, so we descend (the easiest way is to call the current algorithm recursively). Same steps again: we skip nodes until we find share, descend, then icons, then hicolor. Now the next path component is a glob-style wildcard (**), which means we perform a complete tree traversal from here and collect all "leaves" (all files, and directories without entries). We then apply the next component, *.png, to filter the list for matching entries (the filter doesn't differentiate between directories and files, but that can be handled at a higher level, i.e., in the calling method). Now we can return those nodes and let the caller do whatever they want with them.
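The `**` step above can be sketched as a plain DFS below the node we descended to, followed by a basename filter. Again a toy model over a nested dict, not squashfuse's API; `walk`, `expand_glob`, and the `hicolor` subtree are hypothetical names for illustration:

```python
from fnmatch import fnmatch

def walk(tree, prefix=""):
    """Plain DFS below the node we already descended to: the '**' step."""
    for name, child in tree.items():
        path = prefix + name
        yield path
        if isinstance(child, dict):
            yield from walk(child, path + "/")

def expand_glob(tree, pattern):
    """Expand '**/<pattern>' under `tree`: traverse everything below it,
    then keep entries whose basename matches the final component."""
    return [p for p in walk(tree) if fnmatch(p.rsplit("/", 1)[-1], pattern)]

# hypothetical hicolor subtree, matching the example above
hicolor = {
    "256x256": {"apps": {"app.png": None}},
    "scalable": {"apps": {"app.svg": None}},
}
```

`expand_glob(hicolor, "*.png")` keeps only app.png; directory/file discrimination is left to the caller, as described above.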

We only visit nodes with a red arrow instead of traversing the entire tree. There will be binaries in usr/bin/, libraries etc. in usr/lib/, and maybe even other resources in usr/share, but we never get to visit them. We only focus on what is interesting for us.

The performance benefit is that we only visit a very small subset of all nodes instead of traversing the entire tree and then applying globs or regular expressions to the full paths. The number of "full traversals" is limited to the subtree below a single node on the 5th level. This should reduce the number of nodes visited to O(log n) in the best case, approaching O(n) only in the worst case (e.g., a wildcard right at the root), compared to Θ(n) in every case right now (n being the number of entries in the filesystem).
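The node-count argument can be made concrete with a small instrumented comparison. This is a toy model under the same assumptions as before (nested dicts, hypothetical `count_visits`/`count_all` helpers), counting how many directory entries each strategy inspects:

```python
from fnmatch import fnmatch

def count_visits(tree, components):
    """Count entries inspected by the skip-ahead search."""
    visited = 0

    def search(node, comps):
        nonlocal visited
        if not comps:
            return  # target found, nothing more to inspect here
        head, rest = comps[0], comps[1:]
        for name, child in node.items():
            visited += 1  # each sibling is inspected exactly once
            if fnmatch(name, head) and isinstance(child, dict):
                search(child, rest)

    search(tree, components)
    return visited

def count_all(tree):
    """Entries a full traversal would visit (the current approach)."""
    return sum(1 + (count_all(c) if isinstance(c, dict) else 0)
               for c in tree.values())

# hypothetical payload: 50 binaries in usr/bin that the search never touches
fs = {
    "AppRun": None,
    "usr": {
        "bin": {f"tool{i}": None for i in range(50)},
        "share": {"icons": {"hicolor": None}},
    },
}
```

With this tree, a full traversal inspects all 56 entries, while the skip-ahead search for usr/share/icons/hicolor inspects only 6 (it never enters usr/bin).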

@azubieta
Contributor

azubieta commented Dec 9, 2018

A linear algorithm was used to provide a traversal method generic enough to ensure that every filesystem we pick to pack the AppImage payload is supported. An algorithm like the one described above may not be possible if the filesystem doesn't support it (I'm thinking of the ISO 9660 fs right now).

@TheAssassin
Member

Nobody's questioning the existence of a generic traversal algorithm. The idea is not to use that when looking for specific files, as that's quite inefficient.

The suggested algorithm might not work for ISO 9660, but I'm talking about squashfs (via squashfuse) in this case. You may want to check the squashfuse API I linked to.


3 participants