[performance] Skip data offset collection for zip files: 100x speedup
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip took:

    Before : 1061 s
    After  : 9.05 s

https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection
mxmlnkn committed Sep 28, 2023
1 parent ea43572 commit fb058eb
Showing 1 changed file with 8 additions and 3 deletions.
core/ratarmountcore/ZipMountSource.py: 8 additions & 3 deletions
@@ -286,10 +286,15 @@ def _convertToRow(self, info: "zipfile.ZipInfo") -> Tuple:
 
         path, name = SQLiteIndex.normpath(info.filename).rsplit("/", 1)
 
-        # Currently, this is unused. The index only is used for getting metadata. But the data offset
+        # Currently, this is unused. The index only is used for getting metadata. (The data offset
         # is already determined and written out in order to possibly speed up reading of encrypted
-        # files by implementing the decryption ourselves.
-        dataOffset = self._findDataOffset(info.header_offset)
+        # files by implementing the decryption ourselves.)
+        # The data offset is deprecated again! Collecting it can add a huge overhead for large zip files
+        # because we have to seek to every position and read a few bytes from it. Furthermore, it is useless
+        # by itself anyway. We don't even store yet how the data is compressed or encrypted, so we would
+        # have to read the local header again anyway!
+        # dataOffset = self._findDataOffset(info.header_offset)
+        dataOffset = 0
 
         # fmt: off
         fileInfo : Tuple = (
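For context on the overhead described in the new comment: in the ZIP format, the compressed data of a member starts only after its local file header, which contains a variable-length file name and extra field. So computing a data offset from header_offset requires seeking to that header and reading a few bytes from it for every single member. Below is a minimal sketch of such a lookup; findDataOffset is a hypothetical stand-in written from the ZIP specification, not the project's actual _findDataOffset implementation, which is not shown in this diff.

import struct

def findDataOffset(fileObject, headerOffset: int) -> int:
    # Hypothetical illustration: the ZIP local file header is 30 bytes and is
    # followed by the member's file name and an optional extra field; the
    # compressed data starts only after those variable-length parts.
    fileObject.seek(headerOffset)   # one seek per archive member
    header = fileObject.read(30)    # plus one small read per archive member
    if header[:4] != b"PK\x03\x04":
        raise ValueError("Expected a local file header signature")
    fileNameLength, extraFieldLength = struct.unpack("<HH", header[26:30])
    return headerOffset + 30 + fileNameLength + extraFieldLength

Doing this for every member of an archive with hundreds of thousands of files means hundreds of thousands of random-access seeks during index creation, which is why this commit stops collecting the offsets up front.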
