[performance] Skip data offset collection for zip files: 100x speedup
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip took:

    Before : 1061 s
    After  : 9.05 s

https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection
mxmlnkn committed Sep 28, 2023
1 parent ea43572 commit fb058eb
Showing 1 changed file with 8 additions and 3 deletions.
core/ratarmountcore/ZipMountSource.py: 8 additions & 3 deletions
@@ -286,10 +286,15 @@ def _convertToRow(self, info: "zipfile.ZipInfo") -> Tuple:
 
         path, name = SQLiteIndex.normpath(info.filename).rsplit("/", 1)
 
-        # Currently, this is unused. The index only is used for getting metadata. But the data offset
+        # Currently, this is unused. The index only is used for getting metadata. (The data offset
         # is already determined and written out in order to possibly speed up reading of encrypted
-        # files by implementing the decryption ourselves.
-        dataOffset = self._findDataOffset(info.header_offset)
+        # files by implementing the decryption ourselves.)
+        # The data offset is deprecated again! Collecting it can add a huge overhead for large zip files
+        # because we have to seek to every position and read a few bytes from it. Furthermore, it is useless
+        # by itself anyway. We don't even store yet how the data is compressed or encrypted, so we would
+        # have to read the local header again anyway!
+        # dataOffset = self._findDataOffset(info.header_offset)
+        dataOffset = 0
 
         # fmt: off
         fileInfo : Tuple = (
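For context on the overhead described in the new comment: in the ZIP format, the compressed data of a member starts only after its local file header, which contains a variable-length file name and extra field. So computing a data offset from header_offset requires seeking to that header and reading a few bytes from it for every single member. Below is a minimal sketch of such a lookup; findDataOffset is a hypothetical stand-in written from the ZIP specification, not the project's actual _findDataOffset implementation, which is not shown in this diff.

import struct

def findDataOffset(fileObject, headerOffset: int) -> int:
    # Hypothetical illustration: the ZIP local file header is 30 bytes and is
    # followed by the member's file name and an optional extra field; the
    # compressed data starts only after those variable-length parts.
    fileObject.seek(headerOffset)   # one seek per archive member
    header = fileObject.read(30)    # plus one small read per archive member
    if header[:4] != b"PK\x03\x04":
        raise ValueError("Expected a local file header signature")
    fileNameLength, extraFieldLength = struct.unpack("<HH", header[26:30])
    return headerOffset + 30 + fileNameLength + extraFieldLength

Doing this for every member of an archive with hundreds of thousands of files means hundreds of thousands of random-access seeks during index creation, which is why this commit stops collecting the offsets up front.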
