
Improve speed of shutil.copytree #124117


Closed
mengqinyuan opened this issue Sep 16, 2024 · 9 comments
Labels
performance Performance or resource usage type-feature A feature request or enhancement

Comments

@mengqinyuan

mengqinyuan commented Sep 16, 2024

I wrote a discussion on discuss.python.org, link here. Do you think it is good? What do I need to improve? You should look at the comments, because I think they are more useful than the article itself.

@Wulian233
Contributor

Oh, you didn't create the issue using the template format, which might cause some inconvenience 😟

@mengqinyuan
Author

Oh, you didn't create the issue using the template format, which might cause some inconvenience 😟

What is the template format?

@Wulian233
Contributor

https://github.com/python/cpython/issues/new?assignees=&labels=type-feature&projects=&template=feature.yml

The "Feature or enhancement" issue template automatically adds the type-feature label, and it has an area for describing the issue and linking previous discussions.

An example: #124111

@Eclips4 Eclips4 added type-feature A feature request or enhancement performance Performance or resource usage labels Sep 16, 2024
@mengqinyuan
Author

Thank you.

@rruuaanng
Contributor

rruuaanng commented Sep 16, 2024

I wrote a discussion on discuss.python.org, link here. Do you think it is good? What do I need to improve? You should look at the comments, because I think they are more useful than the article itself.

I've read your article, and I think the time consumption of _copytree comes from the recursive backtracking. If you could eliminate that, the speed would increase considerably. Of course, this idea is based on not using any external libraries.

cpython/Lib/shutil.py

Lines 520 to 528 in 453da53

if srcentry.is_dir():
copytree(srcobj, dstname, symlinks, ignore,
copy_function, ignore_dangling_symlinks,
dirs_exist_ok)
else:
copy_function(srcobj, dstname)
elif srcentry.is_dir():
copytree(srcobj, dstname, symlinks, ignore, copy_function,
ignore_dangling_symlinks, dirs_exist_ok)
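The recursion above can be flattened with an explicit stack. A minimal sketch of that idea (a hypothetical helper, not the stdlib implementation; symlink and ignore handling are deliberately omitted):

```python
import os
import shutil

def copytree_stack(src, dst):
    # Depth-first copy driven by an explicit stack of (src_dir, dst_dir)
    # pairs instead of recursive calls. Symlink/ignore handling omitted.
    stack = [(src, dst)]
    while stack:
        cur_src, cur_dst = stack.pop()
        os.makedirs(cur_dst, exist_ok=True)
        with os.scandir(cur_src) as it:
            for entry in it:
                target = os.path.join(cur_dst, entry.name)
                if entry.is_dir(follow_symlinks=False):
                    stack.append((entry.path, target))
                else:
                    shutil.copy2(entry.path, target)
```

Note this only removes the Python call-stack overhead; the dominant cost per entry (the actual file copies) is unchanged.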

@picnixz
Member

picnixz commented Sep 16, 2024

AFAICT, no core developer has expressed support for your suggested feature, but thank you for asking whether you can open an issue or not. I will not close the issue, but it would be good if your improvement existed in a fork so that we can compare branches (and possibly review them). If you feel that your improvement is ready, you can open a PR and we will look at what you suggest (it might be easier than reviewing it on Discourse).

However, the benchmarks that were shown seem to be only for a flat directory. It would be very interesting if we had benchmarks (using pyperf or hyperfine) that considered nested directory structures.
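A nested-tree benchmark along those lines could be sketched as follows (plain `time.perf_counter` here for brevity; `pyperf.Runner.bench_func` would give calibrated, repeated runs — tree shape parameters are illustrative, not from the original post):

```python
import os
import shutil
import tempfile
import time

def make_tree(root, depth, dirs, files):
    # Build a nested directory tree: `files` small files at each level,
    # `dirs` subdirectories, recursing `depth` levels down.
    for i in range(files):
        with open(os.path.join(root, f"f{i}.txt"), "w") as fh:
            fh.write("x" * 64)
    if depth > 0:
        for d in range(dirs):
            sub = os.path.join(root, f"d{d}")
            os.makedirs(sub)
            make_tree(sub, depth - 1, dirs, files)

def bench_copytree(depth=4, dirs=3, files=10):
    # Time a single shutil.copytree over a freshly built nested tree.
    base = tempfile.mkdtemp()
    try:
        src = os.path.join(base, "src")
        os.makedirs(src)
        make_tree(src, depth, dirs, files)
        start = time.perf_counter()
        shutil.copytree(src, os.path.join(base, "dst"))
        return time.perf_counter() - start
    finally:
        shutil.rmtree(base)

if __name__ == "__main__":
    print(f"copytree on nested tree: {bench_copytree():.6f} s")
```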

By the way, I always thought that copying files is an I/O-bound task, and hence that multi-threading would not necessarily be useful. Is there something I am actually missing?

cc @barneygale who could be interested in this topic.

@mengqinyuan
Author

OK, I wrote a test:

import os
import shutil
import time

def copytree_iterative(src, dst, symlinks=False, ignore=None,
                       copy_function=shutil.copy2, ignore_dangling_symlinks=False,
                       dirs_exist_ok=False):
    """Copy a directory tree using an explicit stack instead of recursion.

    Note: symlinks and ignore_dangling_symlinks are accepted for signature
    compatibility with shutil.copytree but are not implemented here.
    """
    if os.path.exists(dst) and not dirs_exist_ok:
        raise FileExistsError(f"Destination path '{dst}' already exists")
    os.makedirs(dst, exist_ok=True)

    def filtered_names(directory):
        # Call `ignore` once per directory, as shutil.copytree does.
        names = os.listdir(directory)
        ignored = ignore(directory, names) if ignore is not None else set()
        return [name for name in names if name not in ignored]

    entries = [(src, dst, name) for name in filtered_names(src)]
    while entries:
        srcdir, dstdir, name = entries.pop()
        srcpath = os.path.join(srcdir, name)
        dstpath = os.path.join(dstdir, name)
        if os.path.isdir(srcpath):
            os.makedirs(dstpath, exist_ok=True)
            entries.extend((srcpath, dstpath, n)
                           for n in filtered_names(srcpath))
        else:
            copy_function(srcpath, dstpath)

def create_source_directory(directory_path):
    try:
        os.makedirs(directory_path)
        print(f"Directory '{directory_path}' created successfully.")
    except FileExistsError:
        print(f"Directory '{directory_path}' already exists.")
    except OSError as e:
        print(f"Error creating directory: {e}")

def write_files(directory_path, num_files):
    for i in range(num_files):
        file_path = os.path.join(directory_path, f"file_{i}.txt")
        with open(file_path, "w") as file:
            file.write(f"This is file {i}.")

def benchmark_copytree(func, *args):
    start_time = time.perf_counter()
    func(*args)
    end_time = time.perf_counter()
    print(f"{func.__name__} : {end_time - start_time:.6f} seconds")

def remove_directory(directory_path):
    try:
        shutil.rmtree(directory_path)
        print(f"Directory '{directory_path}' removed successfully.")
    except FileNotFoundError:
        print(f"Directory '{directory_path}' does not exist.")
    except OSError as e:
        print(f"Error removing directory: {e}")

# Example usage
if __name__ == "__main__":
    create_source_directory("source")
    write_files("source", 50000)

    benchmark_copytree(copytree_iterative, "source", "destination")
    remove_directory("source")
    remove_directory("destination")  # remove folders for the next run

    create_source_directory("source")
    write_files("source", 50000)
    benchmark_copytree(shutil.copytree, "source", "destination")

    remove_directory("source")
    remove_directory("destination")

Result here:
copytree : 140.506028 seconds
copytree_iterative : 147.229518 seconds

I used the latest Python, 3.12.6, on an 8th-gen Core i5.

@rruuaanng
Contributor

> OK, I wrote a test: […full test script and results quoted above…]

Oh, that's not what I wanted.

@gpshead
Member

gpshead commented Sep 17, 2024

I do not believe we should complicate shutil.copytree with any form of multithreading. Our simple implementation works. As others noted on the discussion thread, a copy (tree or not) is typically bound by a single "spindle" of IOPS (hard drive) or IO operation queue. Even on fancier storage where multiple parallel operations can happen at once, the modest gains come mostly from allowing the OS kernel to rearrange some of the IO being done. It just isn't worth it.

Adding multithreading where there was none before for something that isn't CPU bound and is contending for the same input and output IO resources isn't something we should be doing in the standard library.

Also, on the discussion thread, that supposed "async" solution really isn't async. scandir is not doing async IO, nor is the file copy, and the stat calls that attempt to load sizes up front will add high overhead on many OSes, filesystems, and storage media. And putting shutil.copy on the async event loop doesn't magically make it do async file IO. I don't think you're measuring what you think you are measuring.
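To illustrate the point (a minimal sketch, not the code from the Discourse thread): scheduling a blocking copy on the event loop just hands it to a worker thread, where the file IO remains fully synchronous.

```python
import asyncio
import shutil

async def copy_via_executor(src, dst):
    # run_in_executor offloads the *blocking* shutil.copy call to a
    # thread from the default executor pool. No asynchronous file IO
    # happens anywhere here -- this is thread offloading, not async IO.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, shutil.copy, src, dst)
```

Awaiting this coroutine does copy the file, but each in-flight copy still occupies an OS thread doing ordinary blocking reads and writes, all contending for the same storage device.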

Fancier versions attempting to eke out performance in specific situations where interleaved or parallel IO can help are better left to modules on PyPI.

@gpshead gpshead closed this as not planned Won't fix, can't repro, duplicate, stale Sep 17, 2024