
Determine untracked instead of empty directories #10

Open
johntconklin opened this issue Mar 5, 2014 · 9 comments

@johntconklin
Contributor

Hi Pascal,

I have a proof of concept for you to consider in my feature/untracked_dirs branch. This change computes the set of untracked directories in the workspace, and treats those as a single thing, rather than a set of untracked files. In quick testing this morning, it's about 10 to 15% faster in dry run mode. Perhaps a bit faster when actually removing files, but that's harder for me to test since it takes more time to prepare my workspace again.

This is similar to git's behavior, so it may even be worth adding a similar -d flag to control whether untracked directories are removed.

@okcompute
Owner

I like your idea, but it seems there is one problem. What if, inside the entire folder tree being deleted, there is a match for an exclusion pattern set by the user? Let's say I have a dot file somewhere in the tree hierarchy. Will it be deleted, or will it persist as it does right now?

@johntconklin
Contributor Author

Yes, this would be a change in behavior. The entire directory hierarchy will be removed (unless the untracked directory itself is in the exclude list), including any files that are in the exclude list.

On the other hand, the behavior matches git's when its -d option is used, so it isn't completely unreasonable.

Since I work in a shop that uses both git and perforce, and clean performance is hugely important for our use case, I'll very likely be integrating this into my master branch even if you pass for the official release.

@okcompute
Owner

I would gladly integrate this feature into the official release, but there is one use case on my side that I want to make sure will work. I have a git depot inside the Perforce depot. At the root, there is a .git folder. In my config file I add this:
exclude = /.git;

I think it will work in the proposed implementation, because all subfolders will be matched against the pattern, but please make sure this use case works and I will merge your pull request.
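One way an anchored exclusion like `/.git` can protect the whole subtree is to test each candidate path and all of its ancestors against the patterns. The helper below is a hedged sketch (hypothetical name and signature, not p4clean's actual matcher):

```python
import fnmatch
import os

def is_excluded(path, root, patterns):
    """Return True if `path` or any ancestor matches an exclusion
    pattern. Patterns such as '/.git' are treated as anchored at the
    workspace root. This is an illustrative sketch only."""
    rel = '/' + os.path.relpath(path, root).replace(os.sep, '/')
    while rel and rel != '/':
        if any(fnmatch.fnmatch(rel, pat) for pat in patterns):
            return True
        # Walk up one level so '/.git/objects/ab' eventually tests '/.git'.
        rel = rel.rsplit('/', 1)[0]
    return False
```

With this kind of ancestor check, anything under an excluded directory survives even when the surrounding tree is removed as a single untracked unit.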

Also, if you can make sure the tests still pass after your changes, that would be great. I want to use Travis, but I just haven't had time to set it up :)

Last thing: I still don't have a contributor list set up. I will create one shortly and add your name. Thank you for your collaboration on this project! :)

@johntconklin
Contributor Author

This (preserving a .git directory) is an important use case for our users too. I'll verify today.

@johntconklin
Contributor Author

I've rebased (but not yet squashed) my untracked_dirs branch against the latest master. The changes include taking your optimization to compute the set of depot directories in the same loop as computing the set of depot files; reworking the way I update the directory set returned by os.walk() (to avoid descending into untracked directories); and moving the handling of excluded files and directories to get_untracked_files(). This is needed to avoid descending into an excluded directory (i.e., the ".git" use case).

All in all, I'm pretty happy with this (assuming I didn't make any errors in the rebase). It's unfortunate that os.walk() doesn't handle symlinks to directories "correctly", which requires some duplicated code in both the files and the directories loops. I'm not sure it's ugly enough to justify a private variant of os.walk() that does the right thing, though.

Please take a look. Once we're both happy with it, I'll do a final rebase squashing the commits together, and send you a pull request.

@okcompute
Owner

Hi JT,

What performance increase do you see with this update? Do you have before/after numbers?

Thanks!

Pascal


@johntconklin
Contributor Author

I fixed a small bug in my branch (not yet pushed), and ran some performance benchmarks today.

The average of 5 "p4clean -n" runs on a typical dirty client workspace I had at work was about 2 minutes 30 seconds for revision 8bcb9 from your master branch, and about 55 seconds with my branch.

While it's much better, "git clean -xdn" with the same dirty workspace took about 10 seconds. Even factoring out C vs. Python and the time to fetch the list of depot files from p4 vs. git, I think there is still room for improvement. I may have been wrong earlier when I discounted using a custom variant of os.walk(), but that's an experiment for another day.

@johntconklin
Contributor Author

It's been a while since I've spent some time working on p4clean, but I got inspired after returning from the Perforce Merge conference this past week.

At work, we've been running my fork of p4clean for the last ~6 months and it's been working very well. That being said, I've always thought the performance gap between p4clean and "git clean" could be narrowed.

I did some profiling and found that most of the time was spent in the lstat() call used to determine whether each file returned by os.listdir() is a directory (or some other file type). Unfortunately, Python's os.listdir() doesn't include the file type, even though the underlying OS primitives (at least on Windows, Linux, and BSD) provide it.

I did a quick prototype (aka hack) and wrote an enhanced listdir() function (using Python's ctypes module to call opendir()/readdir()/closedir()) which returned a list of file name/type tuples. When I replaced p4clean's use of os.listdir() with this version, the time to execute my benchmark use case dropped from ~40 seconds to ~10 seconds. Now we're finally in "git clean" territory.

Wondering whether someone had done this before, I ran across Ben Hoyt's scandir module (https://github.com/benhoyt/scandir) that he's been shepherding to get into Python 3.5. It provides a more polished implementation (for both Windows and Linux) and I've updated my fork's feature/untracked_dirs branch to use it.

I hope to get some time to bring the unit tests into line with the code on the branch...

@okcompute
Owner

Hi JT,

Indeed it’s been a while :)

Wow! I'm impressed on two counts.

First, you got a really huge source tree! :P

Secondly, you have a really massive optimisation on your hands! Don't hesitate to push this to the master depot anytime you want (once you have the unit tests updated)!

