
Determine untracked instead of empty directories #10

Open
johntconklin opened this issue Mar 5, 2014 · 9 comments

@johntconklin
Contributor

Hi Pascal,

I have a proof of concept for you to consider in my feature/untracked_dirs branch. This change computes the set of untracked directories in the workspace, and treats those as a single thing, rather than a set of untracked files. In quick testing this morning, it's about 10 to 15% faster in dry run mode. Perhaps a bit faster when actually removing files, but that's harder for me to test since it takes more time to prepare my workspace again.

This is similar to git's behavior, so it may even be worth adding a similar -d flag to control whether untracked directories are removed.

@okcompute
Owner

I like your idea, but it seems there is one problem. What if, inside the entire folder tree being deleted, there is a match for an exclusion pattern set by the user? Let's say I have a dot file somewhere in the tree hierarchy. Will it be deleted, or will it persist as it does right now?

@johntconklin
Contributor Author

Yes, this would be a change in behavior. The entire directory hierarchy will be removed (unless the untracked directory itself is in the exclude list), including any files that are in the exclude list.

On the other hand, the behavior matches git's when its -d option is used, so it isn't completely unreasonable.

Since I work in a shop that uses both git and perforce, and clean performance is hugely important for our use case, I'll very likely be integrating this into my master branch even if you pass for the official release.

@okcompute
Owner

I would gladly integrate this feature into the official release, but there is one use case on my side that I want to make sure will work. I have a git depot inside the Perforce depot. At the root, there is a .git folder. In my config file I add this:
exclude = /.git;

I think it will work in the proposed implementation, because all subfolders will be matched against the pattern, but please make sure this use case works and I will merge your pull request.
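One way an anchored exclusion like `/.git` can protect the whole subtree is to test each candidate path and all of its ancestors against the patterns. The helper below is a hedged sketch (hypothetical name and signature, not p4clean's actual matcher):

```python
import fnmatch
import os

def is_excluded(path, root, patterns):
    """Return True if `path` or any ancestor matches an exclusion
    pattern. Patterns such as '/.git' are treated as anchored at the
    workspace root. This is an illustrative sketch only."""
    rel = '/' + os.path.relpath(path, root).replace(os.sep, '/')
    while rel and rel != '/':
        if any(fnmatch.fnmatch(rel, pat) for pat in patterns):
            return True
        # Walk up one level so '/.git/objects/ab' eventually tests '/.git'.
        rel = rel.rsplit('/', 1)[0]
    return False
```

With this kind of ancestor check, anything under an excluded directory survives even when the surrounding tree is removed as a single untracked unit.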

Also, if you can make sure the tests still pass after your changes, that would be great. I want to use Travis, but I just haven't had time to set it up :)

Last thing: I still don't have a contributor list set up. I will create one shortly and add your name. Thank you for your collaboration on this project! :)

@johntconklin
Contributor Author

This (preserving a .git directory) is an important use case for our users too. I'll verify today.

@johntconklin
Contributor Author

I've rebased (but not yet squashed) my untracked_dirs branch against the latest master. The changes include taking your optimization to compute the set of depot directories in the same loop as computing the set of depot files; reworking the way I update the directory set returned by os.walk() (to avoid descending into untracked directories); and moving the handling of excluded files and directories to get_untracked_files(). This is needed to avoid descending into an excluded directory (i.e., the ".git" use case).

All in all, I'm pretty happy with this (assuming I didn't make any errors in the rebase). It's unfortunate that os.walk() doesn't handle symlinks to directories "correctly", which requires some duplicated code in both the files and the directories loops. I'm not sure it's ugly enough to justify a private variant of os.walk() that does the right thing, though.

Please take a look. Once we're both happy with it, I'll do a final rebase squashing the commits together, and send you a pull request.

@okcompute
Owner

Hi JT,

What performance increase do you see with this update? Do you have before/after numbers?

Thanks!

Pascal


@johntconklin
Contributor Author

I fixed a small bug in my branch (not yet pushed), and ran some performance benchmarks today.

The average of 5 "p4clean -n" runs on a typical dirty client workspace I had at work was about 2 minutes 30 seconds for revision 8bcb9 from your master branch, and about 55 seconds with my branch.

While it's much better, "git clean -xdn" with the same dirty workspace took about 10 seconds. Even factoring out C vs. Python and the time to fetch the list of depot files from p4 vs. git, I think there is still room for improvement. I may have been wrong earlier when I discounted using a custom variant of os.walk(), but that's an experiment for another day.

@johntconklin
Contributor Author

It's been a while since I've spent some time working on p4clean, but I got inspired after returning from the Perforce Merge conference this past week.

At work, we've been running my fork of p4clean for the last ~6 months and it's been working very well. That being said, I've always thought the performance gap between p4clean and "git clean" could be narrowed.

I did some profiling and found that most of the time was spent in the lstat() call used to determine whether each file returned by os.listdir() is a directory (or some other file type). Unfortunately, Python's os.listdir() doesn't include the file type, even though the underlying OS primitives (at least on Windows, Linux, and BSD) provide it.

I did a quick prototype (aka hack) and wrote an enhanced listdir() function (using Python's ctypes module to call opendir()/readdir()/closedir()) which returned a list of file name/type tuples. When I replaced p4clean's use of os.listdir() with this version, the time to execute my benchmark use case dropped from ~40 seconds to ~10 seconds. Now we're finally in "git clean" territory.

Wondering whether someone had done this before, I ran across Ben Hoyt's scandir module (https://github.com/benhoyt/scandir) that he's been shepherding to get into Python 3.5. It provides a more polished implementation (for both Windows and Linux) and I've updated my fork's feature/untracked_dirs branch to use it.

I hope to get some time to bring the unit tests into line with the code on the branch...

@okcompute
Owner

Hi JT,

Indeed it’s been a while :)

Wow! I'm impressed on two counts.

First, you got a really huge source tree! :P

Secondly, you have a really massive optimisation on your hands! Don't hesitate to push this to the master depot anytime you want (once you have the unit tests updated)!

