Metadata handling #1563

nicost · 2022-11-17T17:14:43Z

Currently, each "tagged" image taken from MMCoreJ has all the metadata available in the SystemCache attached to it. In many cases, this is uninteresting, since none or only very few of these metadata change between images. Processing of these metadata can take up to 1 ms (data needed), potentially being the biggest hurdle to go to speeds higher than 1kfps.

The SystemCache is added in the funtion "CreateTaggedImage() in MMCoreJ.i. Would it be possible to take this code out, and provide a separate function to add the system cache to the image metadata? We can find all instances in our code where CreateTaggedImage is called, and take appropriate action (i.e. evaluate whether or not to add the system cache metadata). This change seems relatively low impact, and confined to the issue at hand. I guess this would mean that we need to bump the MMCoreJ version. Curious about your thoughts.

marktsuchida · 2022-11-17T18:15:43Z

I think we should have a new function that replaces popNextTaggedImage() and family (which can be deprecated), with the following properties (at least):

The obtained result is not TaggedImage but something new that does not depend on a specific JSON library and has no public fields
- But it should include the pixel format, which is too complicated to access in TaggedImage
Metadata added by the camera and Core is included (perhaps optionally?)
Contents of the system state cache are not included (or are included optionally, but that seems the same as calling a separate function)
- It would also be good to keep this segregated from camera- and core-generated metadata instead of dumping it into the same bag as we currently do
It would be good for this to have a basic implementation in C++, with the SWIG wrapper only doing straightforward conversion

Some details may be further influenced by @henrypinkard's various plans (especially around the sequence buffer).

What to use instead of JSONObject is a good question. We don't need JSON-like recursive structures, only a single-level map (and I like this simplicity and think it best to preserve it). So it could be as simple as a C++ std::map<std::string, std::string> (or std::unordered_map, i.e., a hash table), which can be wrapped as something compatible to a Java Map or Python mapping. We can talk about using tuples for the keys (device, property), and/or using a variant type (string/number/etc) for the values.

(Although the primary way we remove overhead might be to avoid copying the system state cache, the state cache itself can also be optimized for copying, e.g., by storing it as a pair of (sorted) std::vectors. But this is a next-level thing.)

nicost · 2022-11-17T18:48:13Z

I do understand that is the better long term solution. In the interim, we are running more and more into performance issues that seem to have their roots in the huge number of metadata. My proposal here seems relatively easy to implement, and allow projects that need performance to move forward. Maybe I am wrong and your proposal is also easy to implement soon.

marktsuchida · 2022-11-18T14:56:17Z

If we just want a short-term solution, it could be as simple as a flag in MMCoreJ that disables the addition of the system state cache data.

The only downside is that such a flag will be global state, so there could be issues if one plugin disables it and another plugin expects it to not be disabled (for example; not sure if this is a real concern). To prevent this, we could make the call to disable device metadata apply only to the current acquisition (and ignored if no acquisition is running). Then application code can start an acquisition, disable metadata addition, optionally get the system state cache contents once, and then keep popping images with no metadata added. If another plugin starts an acquisition later, it will get metadata as expected.

But one could also argue that it would be simpler to just make it a global setting and give users control (e.g., put it in Tools > Options). This might make more sense because the need for speed probably doesn't change from acquisition to acquisition on the same microscope. Maybe I prefer this just for its simplicity.

henrypinkard · 2022-11-18T17:43:58Z

I do understand that is the better long term solution. In the interim, we are running more and more into performance issues that seem to have their roots in the huge number of metadata.

I'm not sure that this is true. Pycromanager + NDTiff can save at multiple GB/s seemingly indefinitely on nvme drives (see here). This is with full system state cache in each image. While I'm sure there are situations where metadata can be a problem (e.g. tiny ROIs), its worth having more data on how problematic this actually is in practice before trying to solve it. Maybe a short term solution is not needed and this can be deferred to the larger remaking of the bufffers.

The obtained result is not TaggedImage but something new that does not depend on a specific JSON library and has no public fields
* But it should include the pixel format, which is too complicated to access in `TaggedImage`

Yes, agreed. I think there a couple fields of "essential metadata" that should always be present with an image: pixel size, width, height. Will write some more about this in the buffer proposal

What to use instead of JSONObject is a good question. We don't need JSON-like recursive structures, only a single-level map (and I like this simplicity and think it best to preserve it).

I think the JSON schema is helpful for being able to organize and group metadata. AcqEngJ makes use of recursive structure at present, and it seems like the might be other backwards compatibility issues if we tried to switch from a recursive schema to a a flat one (though maybe not within the core itself)

nicost · 2022-11-18T17:55:38Z

I do understand that is the better long term solution. In the interim, we are running more and more into performance issues that seem to have their roots in the huge number of metadata.

I'm not sure that this is true.

Count me confused. I remember you saying during the LIghtSheetManager meeting that taggedImages can not go in a performant way into the MM Datastore because of the slowness in metadata handling. Reducing the metadata (significantly) should take care of that. What am I missing?

henrypinkard · 2022-11-18T18:18:57Z

mmcorej.json is relatively slow at serializing a JSON object. So in order to be performant that serializing needs to be threaded intelligently. Using Pycromanager to go directly from AcqEngJ --> NDTiff has this optimized and achieved the high performance. When going through studio, the json metadata of TaggedImage gets converted to a Metadata object and then back to JSON before going into NDTiff via the DataStore API. This involves serialization/deserialization somewhere along the way. I don't know that this is causing a performance hit, but it is introducing at least one extra serialization, so it could be. It should be fairly easy to test with fast HDs

nicost · 2022-11-19T01:17:38Z

Then does it not make sense to have a way of getting images without the overload of metadata the we do not have any use for? Seems a cheap way to speed things up, and possibly could be a bottleneck for using the Micro-Manager Datastore and Viewer. Seems worth figuring out especially since it is easy to do.

marktsuchida · 2022-11-19T16:10:50Z

I thought we were talking about the performance of popNextTaggedImage() itself here, which I think was found to be a bottleneck in at least some situations, although I can't remember a reference. It may have been higher frame rates (and smaller images) than what you are working with.

This is not about GB/s of data transfer or saving, because the MMCoreJ overhead we are talking about here does not scale with GB. It needs to be tested with small image sizes (e.g. 64x64) at high frame rate (≥ 1 kHz) and a reasonably large (and known, standardized) number of device properties (demo config might be on the small side compared to many real microscopes).

Perhaps it would be good to measure what is actually the bottleneck(s) if the goal is to support a particular use case with particular downstream handling. We know that JSONObject is slow, but it is not surprising that the downstream metadata handling via Metadata is also very slow (there is in fact an extra ser/des because decoupling from JSONObject for future-proofing was a priority -- we anticipated having a replacement for TaggedImage).

I do agree with @nicost that simply avoiding the per-frame copy of system state cache does seem like a valid and simple fix. It has the huge advantage of not requiring optimization of all downstream handling. The fact that downstream code is handed a huge JSONObject with every frame and has to do something about it, being efficient every step of the way, is a problem in itself.

This is especially true when you realize that these device property values are not temporally linked to the image frame with any accuracy: they are the cached values at the time of popping the image (and the cached values are not always valid, but that's a whole nother story). So any application that wants to monitor device property changes during a sequence acquisition could just as well monitor the system state cache separately from the image popping, at whatever desired interval. (Although then there is no longer an expedient place to save the values that are not tied to image frames. I suspect that was a motivation for artificially bundling device state with image frames.)

henrypinkard · 2022-11-19T22:09:14Z

Then does it not make sense to have a way of getting images without the overload of metadata the we do not have any use for? Seems a cheap way to speed things up, and possibly could be a bottleneck for using the Micro-Manager Datastore and Viewer. Seems worth figuring out especially since it is easy to do.

Yes, absolutely. I've never doubted this. And all the reasons @marktsuchida mentions support not being especially attached to the current way things are done.

Separately, the issue being discussed re LightSheetManager was how can we get a prototype plugin with performant file saving as soon as possible. Given the testing that has been done thus far, there doesn't seem to be any evidence that this change to metadata needs to happen first.

nicost · 2022-11-25T00:23:38Z

See micro-manager/mmCoreAndDevices#305.

nicost assigned marktsuchida, nicost and henrypinkard Nov 17, 2022

ziw-liu mentioned this issue Mar 1, 2023

Performance comparison between tifffile and iohub's custom OME-TIFF implementation czbiohub-sf/iohub#66

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metadata handling #1563

Metadata handling #1563

nicost commented Nov 17, 2022

marktsuchida commented Nov 17, 2022

nicost commented Nov 17, 2022

marktsuchida commented Nov 18, 2022

henrypinkard commented Nov 18, 2022

nicost commented Nov 18, 2022

henrypinkard commented Nov 18, 2022

nicost commented Nov 19, 2022

marktsuchida commented Nov 19, 2022

henrypinkard commented Nov 19, 2022

nicost commented Nov 25, 2022

Metadata handling #1563

Metadata handling #1563

Comments

nicost commented Nov 17, 2022

marktsuchida commented Nov 17, 2022

nicost commented Nov 17, 2022

marktsuchida commented Nov 18, 2022

henrypinkard commented Nov 18, 2022

nicost commented Nov 18, 2022

henrypinkard commented Nov 18, 2022

nicost commented Nov 19, 2022

marktsuchida commented Nov 19, 2022

henrypinkard commented Nov 19, 2022

nicost commented Nov 25, 2022