-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zipfile.Path.open is slow when opening for writing in a ZIP with many entries #126565
Comments
I'd say such PR would be welcomed and you could also check if other tests need that kind of optimization. Let me confirm with a core dev: cc @jaraco |
janhicken
added a commit
to janhicken/cpython
that referenced
this issue
Nov 8, 2024
When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members.
janhicken
added a commit
to janhicken/cpython
that referenced
this issue
Nov 8, 2024
When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members.
jaraco
pushed a commit
that referenced
this issue
Nov 10, 2024
When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members.
miss-islington
pushed a commit
to miss-islington/cpython
that referenced
this issue
Nov 10, 2024
…onGH-126576) When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members. (cherry picked from commit 160758a) Co-authored-by: Jan Hicken <janhicken@users.noreply.github.com>
miss-islington
pushed a commit
to miss-islington/cpython
that referenced
this issue
Nov 10, 2024
…onGH-126576) When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members. (cherry picked from commit 160758a) Co-authored-by: Jan Hicken <janhicken@users.noreply.github.com>
This was referenced Nov 10, 2024
picnixz
added
3.12
bugs and security fixes
3.13
bugs and security fixes
3.14
new features, bugs and security fixes
labels
Nov 10, 2024
This fix is also available on older Pythons with zipp 3.21.0. |
jaraco
pushed a commit
that referenced
this issue
Nov 10, 2024
…126576) (#126643) gh-126565: Skip `zipfile.Path.exists` check in write mode (GH-126576) When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members. (cherry picked from commit 160758a) Co-authored-by: Jan Hicken <janhicken@users.noreply.github.com>
jaraco
pushed a commit
that referenced
this issue
Nov 10, 2024
…126576) (#126642) gh-126565: Skip `zipfile.Path.exists` check in write mode (GH-126576) When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members. (cherry picked from commit 160758a) Co-authored-by: Jan Hicken <janhicken@users.noreply.github.com>
Closing since merged and backported. Thank you all. |
thanks for finding and fixing this! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Bug report
Bug description:
I'm using the
zipfile
module to create a ZIP file with thousands of members. Each file is created by creating a correspondingzipfile.Path
object first and then calling.open()
on it.The implementation of
open()
contains a check whether the file already exists when opening a file in read mode:However,
self.exists()
is called even in write mode, because it is theand
operator's first argument.The call to
self.exists()
is quite slow however, because this requires the ZIP file's.namelist()
to be computed, which in turn requires to compute all the implied directories. After all, this is the reason why theFastLookup
optimization exists for ZIP files in read mode.I found this issue after profiling my application and was surprised that more than half of its execution time was spent in computing implied ZIP directories.
I would propose to swap the
and
arguments so the check looks like this:This will cause the
self.exists()
check to only be run in read mode, where the check is fast because ofFastLookup
anyway.I'm happy to provide a pull request with this change.
CPython versions tested on:
3.11, 3.12
Operating systems tested on:
Linux, macOS
Linked PRs
zipfile.Path.exists
check in write mode #126576zipfile.Path.exists
check in write mode (GH-126576) #126642zipfile.Path.exists
check in write mode (GH-126576) #126643The text was updated successfully, but these errors were encountered: