Faster GitIgnore directory check #3007

Merged on Dec 17, 2024 (2 commits)

fellhorn
Contributor

Why are the changes needed?

flytekit's GitIgnore performs a recursive file check to detect empty or ignored folders instead of checking the folder status directly. For folders with many files (e.g. a Python .venv), this can be unnecessarily slow.

An extreme example with 1M files in an ignored folder:

Old ignore folder: 5.39s
New ignore folder: 0.000158s
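
The gap between the two timings comes from how much work each check does. The sketch below is not flytekit's code; it only contrasts the two strategies described above, using hypothetical helper names (`is_dir_ignored_recursive`, `is_dir_ignored_direct`) and plain in-memory sets standing in for git's answers.

```python
import os
import tempfile
from pathlib import Path
from typing import Set


def is_dir_ignored_recursive(root: Path, rel_dir: str, ignored_files: Set[str]) -> bool:
    """Recursive strategy: the folder counts as ignored only if every file below it is ignored.

    The cost grows with the number of files in the folder.
    """
    for dirpath, _dirnames, filenames in os.walk(root / rel_dir):
        for name in filenames:
            rel = Path(dirpath, name).relative_to(root).as_posix()
            if rel not in ignored_files:
                return False
    return True  # an empty folder also ends up here


def is_dir_ignored_direct(rel_dir: str, ignored_dirs: Set[str]) -> bool:
    """Direct strategy: one set lookup, independent of the folder's size.

    Assumes directory entries are stored with a trailing slash.
    """
    return rel_dir.rstrip("/") + "/" in ignored_dirs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "big").mkdir()
        names = [f"{i}.txt" for i in range(1000)]
        for name in names:
            (root / "big" / name).touch()
        files = {f"big/{name}" for name in names}
        print(is_dir_ignored_recursive(root, "big", files))
        print(is_dir_ignored_direct("big", {"big/"}))
```

The recursive variant has to visit every file before it can answer, while the direct variant answers from a precomputed set of ignored directories.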
Code

The measurements were created using this script:

import subprocess
import shutil
from pathlib import Path
import os
import logging

from flytekit.tools.ignore import (
    GitIgnore as GitIgnoreOld,
    Ignore,
)

class GitIgnoreNew(Ignore):
  # The implementation from this PR
  ...

import time

start = time.perf_counter()
newIgnore = GitIgnoreNew(Path.cwd())
print(f"New ignore setup: {time.perf_counter() - start}")

start = time.perf_counter()
oldIgnore = GitIgnoreOld(Path.cwd())
print(f"Old ignore setup: {time.perf_counter() - start}")

start = time.perf_counter()
assert newIgnore.is_ignored("large-file-collection")
print(f"New ignore folder: {time.perf_counter() - start}")

start = time.perf_counter()
assert newIgnore.is_ignored("large-file-collection/1.txt")
print(f"New ignore file: {time.perf_counter() - start}")

start = time.perf_counter()
assert oldIgnore.is_ignored("large-file-collection")
print(f"Old ignore folder: {time.perf_counter() - start}")


start = time.perf_counter()
assert oldIgnore.is_ignored("large-file-collection/1.txt")
print(f"Old ignore file: {time.perf_counter() - start}")
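
The script assumes that an ignored folder named large-file-collection already exists in the current working directory. How that folder was created is not shown, so the following is only one possible way to build such a fixture. The helper name `make_fixture` and the `.gitignore` pattern are assumptions; the file names follow the `1.txt` path used in the script.

```python
import tempfile
from pathlib import Path


def make_fixture(root: Path, n_files: int, folder: str = "large-file-collection") -> Path:
    """Create `folder` with n_files empty files (1.txt .. n.txt) and ignore it via .gitignore.

    Assumption: `root` is (or will become) a git repository. `git init` is not run here.
    """
    target = root / folder
    target.mkdir(parents=True, exist_ok=True)
    for i in range(1, n_files + 1):
        (target / f"{i}.txt").touch()
    # A trailing slash restricts the pattern to directories.
    (root / ".gitignore").write_text(f"{folder}/\n")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = make_fixture(Path(tmp), n_files=1000)
        print(sum(1 for _ in created.iterdir()))
```

For a run comparable to the numbers above, `n_files` would be set to 1,000,000 and the benchmark script started from the fixture's root directory.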

What changes were proposed in this pull request?

  1. Use git ls-files to also check against the list of ignored directories
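
As a rough illustration of this idea, and not the code merged in this PR, the sketch below asks git for ignored entries and keeps the ones that are directories. The function names are hypothetical. The flags are standard `git ls-files` options: with `--directory`, git reports a fully ignored directory as a single entry with a trailing slash instead of listing its contents.

```python
import subprocess
from pathlib import Path
from typing import List, Set


def parse_ls_files(stdout: str) -> Set[str]:
    """Turn newline-separated `git ls-files` output into a set of paths."""
    return {line for line in stdout.splitlines() if line}


def list_ignored(root: Path, extra_args: List[str]) -> Set[str]:
    """Run `git ls-files` for ignored, untracked entries below `root`.

    Assumes git is installed and `root` is inside a git repository.
    """
    out = subprocess.run(
        ["git", "ls-files", "--ignored", "--others", "--exclude-standard", *extra_args],
        cwd=root,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ls_files(out.stdout)


def only_dirs(entries: Set[str]) -> Set[str]:
    """Keep the entries that git marked as directories (trailing slash)."""
    return {entry for entry in entries if entry.endswith("/")}


def list_ignored_dirs(root: Path) -> Set[str]:
    """Ignored directories, each reported once regardless of how many files it holds."""
    return only_dirs(list_ignored(root, ["--directory"]))
```

Because the directory list is computed once up front, a later check for a folder becomes a set lookup rather than a walk over its files. Paths with unusual characters may be quoted in git's output; this sketch does not handle that case.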

How was this patch tested?

Existing unit tests seem to already cover the changed code. Performance benchmarks were run manually, as described above.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
codecov bot commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 64.70588% with 6 lines in your changes missing coverage. Please review.

Project coverage is 50.91%. Comparing base (f99d50e) to head (4fa0c40).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
flytekit/tools/ignore.py 64.70% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3007      +/-   ##
==========================================
- Coverage   51.08%   50.91%   -0.18%     
==========================================
  Files         201      201              
  Lines       21231    21173      -58     
  Branches     2731     2728       -3     
==========================================
- Hits        10846    10780      -66     
- Misses       9787     9797      +10     
+ Partials      598      596       -2     


Collaborator

eapolinario left a comment

Thank you! Correct algorithmic improvements are always welcome!

eapolinario merged commit e5c2f41 into flyteorg:master on Dec 17, 2024
103 of 104 checks passed
shuyingliang pushed a commit to shuyingliang/flytekit that referenced this pull request Dec 20, 2024
* Faster GitIgnore directory check

Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>

* Remove code duplication

Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>

---------

Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>