Reading a DL-NIRSP ASDF is very slow #500

Cadair · 2025-01-21T12:52:22Z

Reading a test file on my local system with the profiler enabled took about 600 s, which is insane (it does take significantly less without the profiler).

Here are some excerpts from the profile (which is insanely large):

612.109 <module>  <ipython-input-10-8499de7ed805>:1
└─ 612.109 wrapper  functools.py:927
   └─ 612.109 _load_from_string  /home/stuart/Git/DKIST/dkist/dkist/dataset/loader.py:116
      └─ 612.109 _load_from_path  /home/stuart/Git/DKIST/dkist/dkist/dataset/loader.py:125
         ├─ 612.109 _load_from_asdf  /home/stuart/Git/DKIST/dkist/dkist/dataset/loader.py:158
         │  ├─ 611.866 open_asdf  asdf/_asdf.py:1622
         │  │  ├─ 611.861 AsdfFile._open_impl  asdf/_asdf.py:1006
         │  │  │  └─ 611.861 AsdfFile._open_asdf  asdf/_asdf.py:890
         │  │  │     ├─ 360.544 AsdfFile._validate  asdf/_asdf.py:670

         │  │  │     ├─ 114.634 tagged_tree_to_custom_tree  asdf/yamlutil.py:329

         │  │  │     ├─ 88.697 load_tree  asdf/yamlutil.py:373

         │  │  │     ├─ 39.880 find_references  asdf/reference.py:108

         │  │  │     ├─ 7.834 Manager.read  asdf/_block/manager.py:337

So a significant amount of time goes to validating the file on read, followed by converting the tree to high-level objects, with a good chunk also spent parsing the YAML and finding all the references in it.
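To put rough numbers on that breakdown, the share of each step can be worked out from the timings in the excerpt. This is only a small arithmetic sketch: the values are copied from the profile above, and the dictionary and variable names are made up for illustration.

```python
# Timings in seconds, copied from the profile excerpt above.
total = 612.109
steps = {
    "AsdfFile._validate": 360.544,
    "tagged_tree_to_custom_tree": 114.634,
    "load_tree": 88.697,
    "find_references": 39.880,
    "Manager.read": 7.834,
}

# Percentage of the total wall time spent in each step.
shares = {name: 100 * seconds / total for name, seconds in steps.items()}

for name, pct in shares.items():
    print(f"{name:<28} {steps[name]:8.3f} s  {pct:5.1f} %")
```

Validation alone comes out at roughly 59 % of the total, with tree conversion and YAML parsing at roughly 19 % and 14 %.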

The obvious win would be to disable validation on read, but we should think about the trade-off more.
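For reference, a minimal sketch of what switching validation off for a single read could look like, using asdf's `config_context` and its `validate_on_read` setting. The helper name `open_without_validation` is hypothetical and nothing like it exists in this repo; the sketch also skips the trade-off mentioned above, namely that a malformed file would no longer be caught when it is opened.

```python
def open_without_validation(path):
    """Open an ASDF file with schema validation on read switched off.

    Hypothetical helper, for illustration only.
    """
    # Imported here so that defining the helper does not require asdf.
    import asdf

    # config_context() gives a temporary copy of the asdf configuration;
    # the previous settings are restored when the block exits.
    with asdf.config_context() as cfg:
        cfg.validate_on_read = False
        return asdf.open(path)
```

The returned `AsdfFile` stays open after the config block exits, so the caller is still responsible for closing it.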


Cadair · 2025-01-21T13:02:59Z

As I mentioned to @SolarDrew out of band, I think one of the biggest wins could be serialising only one header table for the whole TiledDataset object and then storing slices into that big table for each tile. That would mean going from 726 tables to 1, which, given that they have a lot of columns, would massively simplify the file.

We should be able to do all of that in the Converter, i.e. this shouldn't require changes outside of this repo.
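A rough sketch of the idea, using plain Python lists as a stand-in for the real header tables. The function name `merge_header_tables` and the toy rows are made up for illustration; the actual change would live in the Converter and operate on the real table objects.

```python
def merge_header_tables(per_tile_tables):
    """Concatenate per-tile tables into one table plus one slice per tile."""
    combined = []
    slices = []
    for table in per_tile_tables:
        start = len(combined)
        combined.extend(table)
        # Each tile only needs to remember where its rows live in the big table.
        slices.append(slice(start, len(combined)))
    return combined, slices


# Toy data: 4 tiles with 3 header rows each.
tiles = [[{"tile": t, "frame": f} for f in range(3)] for t in range(4)]
combined, slices = merge_header_tables(tiles)

# One table is stored; each tile's headers are recovered by slicing it.
tile_2_headers = combined[slices[2]]
```

On write, only `combined` and the small `slices` list would need serialising; on read, each tile would get back its rows by applying its slice.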

braingram · 2025-01-21T13:28:55Z

With "lazy tree" and no validation on read (and cProfile running), the file takes 87 seconds to open and 70 seconds to load the dataset.
Most of the open time (52 s) is spent by libyaml parsing the file, then 23 s finding references, and the remaining time (11 s) reading the block index. The test file has 245025 blocks, which is contributing to the slow load time.
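For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of collecting a cumulative-time report with the standard library's `cProfile` and `pstats`. The helper `profile_call` and the stand-in function `work` are hypothetical; in practice the callable would be whatever opens the file or loads the dataset.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Write the report to a string, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def work(n):
    """Stand-in for the slow call being measured."""
    return sum(range(n))


result, report = profile_call(work, 1000)
print(report)
```

Sorting by cumulative time makes the expensive top-level steps (parsing, reference finding, block index reading) show up first rather than the many small leaf calls.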

Cadair mentioned this issue Jan 21, 2025

Add support to TiledDataset for missing, irregular or overlapping tiles #487

Open

SolarDrew linked a pull request Jan 30, 2025 that will close this issue

Asdf read speed #514

Open
