Improve speed of shutil.copytree #124117
Oh, you didn't create the issue in the template format, which might cause some inconvenience 😟
What is the template format?
The "Feature or enhancement" issue template will automatically add the type-feature label, and it has an area for describing the issue and previous discussions. An example: #124111
Thank you.
I've read your article, and I think that the time consumption of Lines 520 to 528 in 453da53
AFAICT, no core developer has expressed support for your suggested feature, but thank you for asking whether you could open an issue. I will not close the issue, but it would be good if your improvement existed in a fork so that we can compare branches (and possibly review them). If you feel that your improvement is ready, you can open a PR and we will look at what you suggest (it might be easier than reviewing it on Discourse).

However, the benchmarks that were shown seem to cover only a flat directory. It would be very interesting if we had benchmarks (using …).

By the way, I always thought that copying files is an I/O-bound task, and hence that multi-threading would not necessarily be useful. Is there something that I am actually missing?

cc @barneygale, who could be interested in this topic.
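(For reference, a minimal sketch of the kind of nested-tree benchmark being asked for above. The helper `make_nested_tree` and its parameters are illustrative, not from this thread, and the numbers it produces would depend entirely on the machine and filesystem.)

```python
import os
import shutil
import time

def make_nested_tree(root, depth=4, dirs_per_level=4, files_per_dir=50):
    """Build a nested directory tree of small files for benchmarking."""
    os.makedirs(root, exist_ok=True)
    level = [root]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for i in range(dirs_per_level):
                sub = os.path.join(parent, f"dir_{i}")
                os.makedirs(sub, exist_ok=True)
                for j in range(files_per_dir):
                    with open(os.path.join(sub, f"file_{j}.txt"), "w") as f:
                        f.write("x" * 64)
                next_level.append(sub)
        level = next_level

if __name__ == "__main__":
    make_nested_tree("bench_src")
    start = time.perf_counter()
    shutil.copytree("bench_src", "bench_dst")
    print(f"shutil.copytree (nested tree): {time.perf_counter() - start:.3f} s")
    # Clean up so the benchmark can be re-run.
    shutil.rmtree("bench_src")
    shutil.rmtree("bench_dst")
```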
OK, I wrote a test:

```python
import os
import shutil
import time


def copytree_iterative(src, dst, symlinks=False, ignore=None,
                       copy_function=shutil.copy2, ignore_dangling_symlinks=False,
                       dirs_exist_ok=False):
    """Copy a directory recursively using an iterative (explicit stack) approach."""
    if os.path.exists(dst) and not dirs_exist_ok:
        raise FileExistsError(f"Destination path '{dst}' already exists")
    if not os.path.exists(dst):
        os.makedirs(dst)
    names = os.listdir(src)
    if ignore is not None:
        ignored_names = ignore(src, names)
    else:
        ignored_names = set()
    # Work list of (source dir, destination dir, entry name) tuples.
    entries = [(src, dst, name) for name in names if name not in ignored_names]
    while entries:
        srcobj, dstname, srcentry_name = entries.pop()
        srcentry_path = os.path.join(srcobj, srcentry_name)
        dstentry_path = os.path.join(dstname, srcentry_name)
        if os.path.isdir(srcentry_path):
            if not os.path.exists(dstentry_path):
                os.makedirs(dstentry_path)
            entries.extend([(srcentry_path, dstentry_path, name)
                            for name in os.listdir(srcentry_path)
                            if name not in ignored_names])
        else:
            copy_function(srcentry_path, dstentry_path)


def create_source_directory(directory_path):
    try:
        os.makedirs(directory_path)
        print(f"Directory '{directory_path}' created successfully.")
    except FileExistsError:
        print(f"Directory '{directory_path}' already exists.")
    except Exception as e:
        print(f"Error creating directory: {e}")


def write_files(directory_path, num_files):
    for i in range(num_files):
        file_path = os.path.join(directory_path, f"file_{i}.txt")
        with open(file_path, "w") as file:
            file.write(f"This is file {i}.")


def benchmark_copytree(func, *args):
    start_time = time.perf_counter()
    func(*args)
    end_time = time.perf_counter()
    print(f"{func.__name__} : {end_time - start_time:.6f} seconds")


def remove_directory(directory_path):
    try:
        shutil.rmtree(directory_path)
        print(f"Directory '{directory_path}' removed successfully.")
    except FileNotFoundError:
        print(f"Directory '{directory_path}' does not exist.")
    except Exception as e:
        print(f"Error removing directory: {e}")


# Example usage
if __name__ == "__main__":
    create_source_directory("source")
    write_files("source", 50000)
    benchmark_copytree(copytree_iterative, "source", "destination")
    remove_directory("source")
    remove_directory("destination")  # remove folders for the next timed run

    create_source_directory("source")
    write_files("source", 50000)
    benchmark_copytree(shutil.copytree, "source", "destination")
    remove_directory("source")
    remove_directory("destination")
```

Results here. I used the newest Python 3.12.6 on an 8th Gen Core i5.
Oh, I don't want it.
I do not believe we should complicate shutil.copytree with any form of multithreading. Our simple implementation works. As others noted on the discussion thread, a copy (tree or not) is typically bound by a single "spindle" of IOPS (hard drive) or a single IO operation queue. Even on fancier storage where multiple parallel operations can happen at once, the minor gains come mostly from allowing the OS kernel to rearrange some of the IO being done. It just isn't worth it. Adding multithreading where there was none before, for something that isn't CPU bound and is contending for the same input and output IO resources, isn't something we should be doing in the standard library.

Also, on the discussion thread, that supposed "async" solution really isn't async. scandir is not doing async IO, nor is the file copy, and the stats that attempt to load sizes up front will be high overhead on many OSes, filesystems, and storage media. Putting shutil.copy on the async event loop doesn't magically make it do async file IO. I don't think you're measuring what you think you are measuring.

Fancier versions attempting to eke out performance in specific situations where interleaved or parallel IO can help are better left to modules on PyPI.
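(To illustrate the point above with a minimal sketch of my own, not code from the thread: wrapping a blocking copy in a coroutine does not make the file IO asynchronous; at best the still-blocking call can be offloaded to a worker thread.)

```python
import asyncio
import shutil

async def fake_async_copy(src, dst):
    # Looks "async", but shutil.copy blocks the event loop for the whole copy.
    shutil.copy(src, dst)

async def threaded_copy(src, dst):
    # Offloads the blocking copy to a worker thread; the event loop stays
    # responsive, but the underlying file IO is still ordinary blocking IO.
    await asyncio.to_thread(shutil.copy, src, dst)
```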
I wrote a discussion on discuss.python.org, link here. Do you think it is good? What do I need to improve?

You should mention the comments, because they are more useful than the article, I think.