Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading a DL-NIRSP ASDF is very slow #500

Open
Cadair opened this issue Jan 21, 2025 · 2 comments · May be fixed by #514
Open

Reading a DL-NIRSP ASDF is very slow #500

Cadair opened this issue Jan 21, 2025 · 2 comments · May be fixed by #514

Comments

@Cadair
Copy link
Member

Cadair commented Jan 21, 2025

With a test file on my local system (with the profiler enabled) it took 600s, which is insane (it does take significantly less without the profiler).

Here are some excerpts from the profile (which is insanely large)

612.109 <module>  <ipython-input-10-8499de7ed805>:1
└─ 612.109 wrapper  functools.py:927
   └─ 612.109 _load_from_string  /home/stuart/Git/DKIST/dkist/dkist/dataset/loader.py:116
      └─ 612.109 _load_from_path  /home/stuart/Git/DKIST/dkist/dkist/dataset/loader.py:125
         ├─ 612.109 _load_from_asdf  /home/stuart/Git/DKIST/dkist/dkist/dataset/loader.py:158
         │  ├─ 611.866 open_asdf  asdf/_asdf.py:1622
         │  │  ├─ 611.861 AsdfFile._open_impl  asdf/_asdf.py:1006
         │  │  │  └─ 611.861 AsdfFile._open_asdf  asdf/_asdf.py:890
         │  │  │     ├─ 360.544 AsdfFile._validate  asdf/_asdf.py:670
         │  │  │     ├─ 114.634 tagged_tree_to_custom_tree  asdf/yamlutil.py:329
         │  │  │     ├─ 88.697 load_tree  asdf/yamlutil.py:373
         │  │  │     ├─ 39.880 find_references  asdf/reference.py:108
         │  │  │     ├─ 7.834 Manager.read  asdf/_block/manager.py:337

So a significant amount of time is in the validation of the file on read, followed by the conversion of the tree to high-level objects and a good chunk in parsing the yaml and finding all the references in the yaml.

The obvious win would be to disable validation on read, but we should think about the trade off more.

@Cadair
Copy link
Member Author

Cadair commented Jan 21, 2025

As I mentioned to @SolarDrew out of band, I think one of the biggest wins could be only serialising one header table for the whole TiledDataset object and then storing slices into that big table for each tile. That would mean going from 726 tables to 1 which given they have a lot of columns would massively simplify the file.

We should be able to do all of that in the Converter, i.e this shouldn't require changes outside of this repo.

@braingram
Copy link
Contributor

With "lazy tree" and no validate on read (and cprofile running) the file takes 87 seconds to open and 70 seconds to load the dataset.
Most of the open time (52s) is spent by libyaml parsing the file, then 23s finding references and the remaining time reading the block index (11s). The test file has 245025 blocks which is contributing to the slow load time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants