
Improve speed of shutil.copytree #124117


Closed
mengqinyuan opened this issue Sep 16, 2024 · 9 comments
Labels
performance Performance or resource usage type-feature A feature request or enhancement

Comments

@mengqinyuan

mengqinyuan commented Sep 16, 2024

I wrote a discussion on discuss.python.org, link here. Do you think it is good? What do I need to improve? You should look at the comments, because I think they are more useful than the article itself.

@Wulian233
Contributor

Oh, you didn't create the issue using the template format, which might cause some inconvenience 😟

@mengqinyuan
Author

Oh, you didn't create the issue using the template format, which might cause some inconvenience 😟

What is the template format?

@Wulian233
Contributor

https://github.com/python/cpython/issues/new?assignees=&labels=type-feature&projects=&template=feature.yml

The "Feature or enhancement" issue template automatically adds the type-feature label, and it has an area for describing the issue and linking previous discussions.

An example: #124111

@Eclips4 Eclips4 added type-feature A feature request or enhancement performance Performance or resource usage labels Sep 16, 2024
@mengqinyuan
Author

Thank you.

@rruuaanng
Contributor

rruuaanng commented Sep 16, 2024

I wrote a discussion on discuss.python.org, link here. Do you think it is good? What do I need to improve? You should look at the comments, because I think they are more useful than the article itself.

I've read your article, and I think the time consumption of _copytree comes from the recursive backtracking. If you could eliminate that, the speed would increase considerably. Of course, this idea is based on not using any external libraries.

cpython/Lib/shutil.py

Lines 520 to 528 in 453da53

if srcentry.is_dir():
copytree(srcobj, dstname, symlinks, ignore,
copy_function, ignore_dangling_symlinks,
dirs_exist_ok)
else:
copy_function(srcobj, dstname)
elif srcentry.is_dir():
copytree(srcobj, dstname, symlinks, ignore, copy_function,
ignore_dangling_symlinks, dirs_exist_ok)
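The recursion above can be flattened with an explicit stack. A minimal sketch of that idea (a hypothetical helper, not the stdlib implementation; symlink and ignore handling are deliberately omitted):

```python
import os
import shutil

def copytree_stack(src, dst):
    # Depth-first copy driven by an explicit stack of (src_dir, dst_dir)
    # pairs instead of recursive calls. Symlink/ignore handling omitted.
    stack = [(src, dst)]
    while stack:
        cur_src, cur_dst = stack.pop()
        os.makedirs(cur_dst, exist_ok=True)
        with os.scandir(cur_src) as it:
            for entry in it:
                target = os.path.join(cur_dst, entry.name)
                if entry.is_dir(follow_symlinks=False):
                    stack.append((entry.path, target))
                else:
                    shutil.copy2(entry.path, target)
```

Note this only removes the Python call-stack overhead; the dominant cost per entry (the actual file copies) is unchanged.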

@picnixz
Member

picnixz commented Sep 16, 2024

AFAICT, no core developer has expressed support for your suggested feature, but thank you for asking whether you can open an issue or not. I will not close the issue, but it would be good if your improvement existed in a fork so that we can compare branches (and possibly review them). If you feel that your improvement is ready, you can open a PR and we will look at what you suggest (it might be easier than reviewing it on Discourse).

However, the benchmarks that were shown seem to be only for a flat directory. It would be very interesting if we had benchmarks (using pyperf or hyperfine) that considered nested directory structures.
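A nested-tree benchmark along those lines could be sketched as follows (plain `time.perf_counter` here for brevity; `pyperf.Runner.bench_func` would give calibrated, repeated runs — tree shape parameters are illustrative, not from the original post):

```python
import os
import shutil
import tempfile
import time

def make_tree(root, depth, dirs, files):
    # Build a nested directory tree: `files` small files at each level,
    # `dirs` subdirectories, recursing `depth` levels down.
    for i in range(files):
        with open(os.path.join(root, f"f{i}.txt"), "w") as fh:
            fh.write("x" * 64)
    if depth > 0:
        for d in range(dirs):
            sub = os.path.join(root, f"d{d}")
            os.makedirs(sub)
            make_tree(sub, depth - 1, dirs, files)

def bench_copytree(depth=4, dirs=3, files=10):
    # Time a single shutil.copytree over a freshly built nested tree.
    base = tempfile.mkdtemp()
    try:
        src = os.path.join(base, "src")
        os.makedirs(src)
        make_tree(src, depth, dirs, files)
        start = time.perf_counter()
        shutil.copytree(src, os.path.join(base, "dst"))
        return time.perf_counter() - start
    finally:
        shutil.rmtree(base)

if __name__ == "__main__":
    print(f"copytree on nested tree: {bench_copytree():.6f} s")
```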

By the way, I always thought that copying files is an I/O-bound task, and hence that multi-threading would not necessarily be useful. Is there something I am actually missing?

cc @barneygale who could be interested in this topic.

@mengqinyuan
Author

OK, I wrote a test:

import os
import shutil
import time

def copytree_iterative(src, dst, symlinks=False, ignore=None,
                       copy_function=shutil.copy2, ignore_dangling_symlinks=False,
                       dirs_exist_ok=False):
    """Copy a directory tree using an explicit stack instead of recursion.

    Note: symlinks and ignore_dangling_symlinks are accepted for signature
    compatibility with shutil.copytree but are not implemented here.
    """
    if os.path.exists(dst) and not dirs_exist_ok:
        raise FileExistsError(f"Destination path '{dst}' already exists")
    os.makedirs(dst, exist_ok=True)

    def filtered_names(directory):
        # Call `ignore` once per directory, as shutil.copytree does.
        names = os.listdir(directory)
        ignored = ignore(directory, names) if ignore is not None else set()
        return [name for name in names if name not in ignored]

    entries = [(src, dst, name) for name in filtered_names(src)]
    while entries:
        srcdir, dstdir, name = entries.pop()
        srcpath = os.path.join(srcdir, name)
        dstpath = os.path.join(dstdir, name)
        if os.path.isdir(srcpath):
            os.makedirs(dstpath, exist_ok=True)
            entries.extend((srcpath, dstpath, n)
                           for n in filtered_names(srcpath))
        else:
            copy_function(srcpath, dstpath)

def create_source_directory(directory_path):
    try:
        os.makedirs(directory_path)
        print(f"Directory '{directory_path}' created successfully.")
    except FileExistsError:
        print(f"Directory '{directory_path}' already exists.")
    except OSError as e:
        print(f"Error creating directory: {e}")

def write_files(directory_path, num_files):
    for i in range(num_files):
        file_path = os.path.join(directory_path, f"file_{i}.txt")
        with open(file_path, "w") as file:
            file.write(f"This is file {i}.")

def benchmark_copytree(func, *args):
    start_time = time.perf_counter()
    func(*args)
    end_time = time.perf_counter()
    print(f"{func.__name__} : {end_time - start_time:.6f} seconds")

def remove_directory(directory_path):
    try:
        shutil.rmtree(directory_path)
        print(f"Directory '{directory_path}' removed successfully.")
    except FileNotFoundError:
        print(f"Directory '{directory_path}' does not exist.")
    except OSError as e:
        print(f"Error removing directory: {e}")

# Example usage
if __name__ == "__main__":
    create_source_directory("source")
    write_files("source", 50000)

    benchmark_copytree(copytree_iterative, "source", "destination")
    remove_directory("source")
    remove_directory("destination")  # remove folders for the next run

    create_source_directory("source")
    write_files("source", 50000)
    benchmark_copytree(shutil.copytree, "source", "destination")

    remove_directory("source")
    remove_directory("destination")

Result here:
copytree : 140.506028 seconds
copytree_iterative : 147.229518 seconds

I used the latest Python, 3.12.6, on an 8th-gen Core i5.

@rruuaanng
Contributor

> OK, I wrote a test: […full test script and results quoted above…]

Oh, that's not what I wanted.

@gpshead
Member

gpshead commented Sep 17, 2024

I do not believe we should complicate shutil.copytree with any form of multithreading. Our simple implementation works. As others noted on the discussion thread, a copy (tree or not) is typically bound by a single "spindle" of IOPS (hard drive) or IO operation queue. Even on fancier storage where multiple parallel operations can happen at once, the modest gains come mostly from allowing the OS kernel to rearrange some of the IO being done. It just isn't worth it.

Adding multithreading where there was none before for something that isn't CPU bound and is contending for the same input and output IO resources isn't something we should be doing in the standard library.

Also, on the discussion thread, that supposed "async" solution really isn't async. scandir is not doing async IO, nor is the file copy, and the stat calls that attempt to load sizes up front will add high overhead on many OSes, filesystems, and storage media. And putting shutil.copy on the async event loop doesn't magically make it do async file IO. I don't think you're measuring what you think you are measuring.
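To illustrate the point (a minimal sketch, not the code from the Discourse thread): scheduling a blocking copy on the event loop just hands it to a worker thread, where the file IO remains fully synchronous.

```python
import asyncio
import shutil

async def copy_via_executor(src, dst):
    # run_in_executor offloads the *blocking* shutil.copy call to a
    # thread from the default executor pool. No asynchronous file IO
    # happens anywhere here -- this is thread offloading, not async IO.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, shutil.copy, src, dst)
```

Awaiting this coroutine does copy the file, but each in-flight copy still occupies an OS thread doing ordinary blocking reads and writes, all contending for the same storage device.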

Fancier versions attempting to eke out performance in specific situations where interleaved or parallel IO can help are better left to modules on PyPI.

@gpshead gpshead closed this as not planned Won't fix, can't repro, duplicate, stale Sep 17, 2024