Performance comparison between tifffile and iohub's custom OME-TIFF implementation #66
A note on interpreting the results (why no error bars), from:

> In our case, if the file is small, FS caching might decrease that lower bound, but the general idea holds true.
Repeated the test with a much larger (1036 GB) dataset, and this time iohub's custom reader is faster by a large margin.

Test script (Environment: Python 3.10.8, Linux 4.18, x86_64, AMD EPYC 7H12 @ 2.6 GHz):

```python
# %%
import os
from timeit import timeit

from tqdm import tqdm
import zarr
import pandas as pd

# readers tested
from tifffile import TiffSequence  # 2023.2.3
from iohub.multipagetiff import MicromanagerOmeTiffReader  # 0.1.dev368+g3d62e6f

# %%
# 1036 GB total
DATASET = (
    "/hpc/projects/comp_micro/rawdata/hummingbird/Janie/"
    "2022_03_15_orgs_nuc_mem_63x_04NA/all_21_3"
)
POSITIONS = (50, 100, 150, 200, 250)

# %%
def read_tifffile():
    sequence = TiffSequence(os.scandir(DATASET))
    data = zarr.open(sequence.aszarr(), mode="r")
    for p in POSITIONS:
        _ = data[p]
    sequence.close()

# %%
def read_custom():
    reader = MicromanagerOmeTiffReader(DATASET)
    for p in POSITIONS:
        _ = reader.get_array(p)

# %%
def repeat(n=5):
    tf_times = []
    wo_times = []
    for _ in tqdm(range(n)):
        tf_times.append(
            timeit(
                "read_tifffile()", number=1, setup="from __main__ import read_tifffile"
            )
        )
        wo_times.append(
            timeit("read_custom()", number=1, setup="from __main__ import read_custom")
        )
    return pd.DataFrame({"tifffile": tf_times, "waveorder": wo_times})

# %%
def main():
    timings = repeat()
    print(timings)
    timings.to_csv("large_tiff_time.csv")

# %%
if __name__ == "__main__":
    main()
```
Given these tests, the inflection point of tifffile/iohub performance appears to be in the vicinity of 500 GB of dataset size. Since our users frequently work with ~TB OME-TIFF datasets, it makes sense to keep investing in a custom solution.
Allow me to chime in. 350 vs 50 seconds to read 25 files seems excessive. What is the number of files, the size of files, and the number of pages in each file? If this is a single, multi-file MicroManager dataset, the use of TiffSequence is counterproductive. Tifffile uses the OME-XML metadata in the first file, which describes the whole multi-file dataset and triggers indexing all TIFF pages in all files of the dataset...
@cgohlke Thanks for letting us know of the improvement we can make in benchmarking! We are very much looking forward to switching to a more widely used and tested implementation with comparable performance.
There are 260 files. Can you point us to the recommended entry point for such a dataset with tifffile?
Unfortunately it is currently not possible to get a single Zarr store of a multi-position MicroManager dataset. Tifffile uses the OME-XML metadata, which describes positions as distinct OME series:

```python
with TiffFile(FIRST_FILE) as tif:
    series = tif.series  # this parses OME-XML and indexes all pages in all files
    for position in POSITIONS:
        # read series as numpy array ...
        im = series[position].asarray()
        # ... or via zarr store
        store = series[position].aszarr()
        z = zarr.open(store, mode='r')
        im = z[:]
        store.close()
```

A major bottleneck when only accessing a few positions is that tifffile needs to index all TIFF pages/IFDs in the dataset when creating the series. That requires a lot of seeks and small reads over many files. I have a patch that uses the MicroManager IndexMap, which speeds up indexing about 10x for a smaller 8 GB dataset. It would require changes in tifffile to return a single series for the whole MicroManager datastore: either special-casing the OME-XML parser, or adding a dedicated MicroManager parser similar to NDTiff. A special parser would be preferable, but (1) the format seems poorly documented compared to OME-TIFF, (2) there are variations between MicroManager versions, (3) there is no public set of test files, and (4) MicroManager frequently produces corrupted files (at least on my system).
Thanks for showing this! This way is much faster in most runs. Changing `read_tifffile` to:

```python
def read_tifffile():
    with TiffFile(os.path.join(DATASET, FIRST_FILE)) as tif:
        series = tif.series
        for p in POSITIONS:
            _ = series[p].asarray()
```

while keeping the rest of the script the same gives:

The 1000+ second outlier happened in the first run. I'll have to run it a couple more times to see if that's a consistent pattern or just a network glitch.
Tested again with iohub running first in each loop:
Interesting. I guess that "glitch" is due to caching and the abysmal performance of parsing the TIFF IFD structures on a network drive. I have just released tifffile 2023.2.27, which uses the Micro-Manager indexmap instead of parsing the IFD chain. That should improve timing, especially when files are on a network drive. Another thought: are the image data returned by iohub and tifffile the same, i.e. do the arrays compare equal?
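On the equality question, a minimal sketch of such a check, assuming both readers return array-likes for a given position; `same_data` is a hypothetical helper, not part of either library:

```python
import numpy as np

def same_data(a, b):
    """Return True when two arrays match in both shape and voxel values."""
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and np.array_equal(a, b)

# toy check with stand-in arrays; in real use, pass the two readers'
# outputs for the same position, e.g. same_data(iohub_arr, tifffile_arr)
x = np.arange(24, dtype=np.uint16).reshape(2, 3, 4)
assert same_data(x, x.copy())
assert not same_data(x, x + 1)
```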
A special parser in tifffile would be a welcome improvement. cc: @edyoshikun, @talonchandler.
Yes, they are the same.
Great work! The timing did improve a lot. I just tested the 1 TB dataset on other compute nodes linked to the same storage server, and the initial delay seems to be an effect of network storage caching. Tifffile running first:
iohub running first (2 different nodes):
In later iterations tifffile is now consistently faster, and its performance on 'fresh' nodes is now usable. A remaining question is the difference in first-run performance between the two.
Great. Thanks for re-running the benchmark! Are there any warnings from tifffile in the log output?
I think that is the overhead of reading all the Micro-Manager metadata from all files initially. The metadata are distributed across the files: the indexmap is at the beginning, while the OME-XML (in the first file only), the display settings, and the comments are towards the end. It might be worth only reading the indexmap and the OME-XML required for parsing the series to minimize cache misses.
There were warnings both before and after the update. iohub has the same issue since it uses tifffile to get the headers. I think this has always been the case, and iohub even has (non-working) code to suppress tifffile warnings:
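As an aside, a sketch of how such suppression could work, assuming (as in recent tifffile versions) that the warnings are emitted through a standard-library logger named "tifffile"; whether this catches a given warning depends on the tifffile version:

```python
import logging

# Hedged assumption: tifffile routes its warnings through the standard
# logging module under the logger name "tifffile". Raising that logger's
# threshold to ERROR silences WARNING-level messages from the library.
logging.getLogger("tifffile").setLevel(logging.ERROR)
```

Note this does not silence warnings issued via the `warnings` module; those would need `warnings.filterwarnings` instead.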
Thanks. Those are the same kind of warnings I get on files produced by MicroManager 2.0.0. I'll double-check my code against the file format spec. The "coercing invalid ASCII to byte" warning comes from a second ImageDescription tag, which usually contains ImageJ metadata, but is clearly corrupted for these files. |
Good to know.
Turns out that some files written by Micro-Manager are > 4 GB, while the offsets in classic TIFF and the Micro-Manager header are 32-bit (< 4 GB). Hence the offsets to ImageJ metadata, comments, and display settings stored at the end of MicroManager files are frequently wrong (32-bit overflow). OME-XML could be affected too, but I don't have such a sample.
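The wrap-around can be sketched with plain arithmetic: any byte offset past 4 GiB loses its high bit when stored in an unsigned 32-bit field, so the stored value points somewhere inside the first 4 GiB of the file. The specific offsets below are illustrative, not taken from a real file:

```python
# hypothetical metadata location near the end of a ~4.3 GB file
true_offset = 4_339_252_149 - 1_000

# what actually fits in an unsigned 32-bit field: the value wraps
stored_offset = true_offset & 0xFFFFFFFF
print(stored_offset)  # 44283853 -- a small, wrong offset

assert stored_offset != true_offset
# for offsets below 8 GiB, adding 2**32 back recovers the true position
assert stored_offset + 2**32 == true_offset
```

This is also why seeking to `valueoffset + 2**32` can recover such metadata from an affected file.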
Moreover, the display settings, which should be UTF-8 JSON strings, are always (?) truncated and invalid JSON.
ImageJ metadata are also wrong for multi-file or combined multi-position datasets... Tifffile 2023.2.28 contains potential speed improvements and fixes for reading some corrupted metadata.
This is probably because MM determines the maximum number of images to write in a TIFF file only from the size of the pixel data. And since the MM metadata can be large (> 100 MB; it dumps the state of the entire system in a JSON string for every frame, see micro-manager/micro-manager#1563), this can break things at times. For example, the first file (and many others) in the 1 TB dataset I'm using in the benchmark is 4.3 GB (4,339,252,149 bytes). Even if they fix it in the future, we will still have to be compatible with existing broken data. Edit: it does seem to check whether there's space to write the OME metadata. I couldn't find the check for the image plane metadata (the larger one), though.
Just tried it out. Same 1 TB benchmark on the same node:

```
# 2023.2.27
    tifffile      iohub
0  17.227365  30.499268
1  17.311548  29.198703
2  16.966164  29.202692
3  16.967853  29.073899
4  16.956701  29.024661

# 2023.2.28
    tifffile      iohub
0  15.639089  48.219170
1  15.678544  34.795313
2  15.628425  34.761851
3  15.679976  34.806345
4  15.602959  34.863524
```

Now the header warnings are silenced. Only the tag 270 encoding warning persists. I tried loading the tag with PIL, but that gives me a corrupted UTF-8 string (no luck with chardet either):
Those timings make sense. The tifffile series interface is a little faster because it now only reads the indexmaps from the beginning of the files and the OME-XML from the end of the first file. Iohub is a little slower because the read_micromanager_metadata function now actually reads and (tries to) decode the Micro-Manager metadata from the end of the files instead of failing early. I did not special-case the TIFF parser for the wrong second ImageDescription tag. If you really need to recover the (wrong) ImageJ metadata from files > 4 GB, try:

```python
with tifffile.TiffFile(FILENAME) as tif:
    tag = tif.pages[0].tags.get('ImageDescription', index=1)
    tif.filehandle.seek(tag.valueoffset + 2**32)
    data = tif.filehandle.read(tag.count)
    imagej_description = tifffile.stripnull(data).decode('cp1252')
    print(imagej_description)
```
Tifffile v2023.3.15 includes a new parser for MMStack series and reverts the MicroManager-specific optimizations for OME series. Positions are now returned as a dimension in the series rather than as separate series. The dimension order is parsed from the MM metadata and might differ from OME. I only have a limited number of test files, many of which are corrupted in one way or another. Hope it works for you.
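A sketch of what reading one position through the new MMStack series might look like. `read_position` is a hypothetical helper, `first_file` stands for the first file of the dataset, and the exact axis letters and their order are hedged assumptions here, not confirmed API behavior:

```python
from tifffile import TiffFile

def read_position(first_file, p):
    """Read one position from an MMStack dataset (tifffile >= 2023.3.15)."""
    with TiffFile(first_file) as tif:
        series = tif.series[0]   # one series spanning the whole multi-file dataset
        # positions appear as a dimension of this series rather than as
        # separate series; inspect series.axes to find which axis it is
        return series.asarray()[p]
```

Compared with the earlier per-position OME series loop, this indexes a single array instead of selecting among `tif.series` entries.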
@cgohlke Thanks for letting us know! I have yet to change the related code here, so I have to pin the tifffile version for now.
Just found this thread. It would be great to open issues on the micro-manager repository (https://github.com/micro-manager/micro-manager). I see at least two different issues here (file size sometimes > 4 GB, and "the display settings, which should be UTF-8 JSON strings, are always (?) truncated and invalid JSON"), but there may be more.
A custom OME-TIFF reader (`iohub.multipagetiff.MicromanagerOmeTiffReader`) was implemented because, historically, tifffile and AICSImageIO were slow when reading large OME-TIFF series generated by Micro-Manager acquisitions. While debugging #65, I found that this implementation does not guarantee data integrity during reading. Before investing more time in fixing it, I think it is worth revisiting whether maintaining a custom OME-TIFF reader is worthwhile, given that the more widely adopted solutions have evolved since waveorder.io's design. Here is a simple read speed benchmark of tifffile and iohub's custom reader:
The test was done on a 123 GB dataset with TCZYX=(8, 9, 3, 81, 2048, 2048) dimensions. Voxels from 2 non-sequential positions were read into RAM in each iteration (N=5).
Test script:
Environment: Python 3.10.8, Linux 4.18 (x86_64, AMD EPYC 7H12@2.6GHz)
At least in this test, the latest tifffile consistently outperforms the iohub implementation. While a comprehensive benchmark will take more time (#57), I think that as long as a widely used library is not significantly slower, the reduced maintenance overhead and increased user testing make a strong case for reconsidering whether to maintain the custom code in iohub.