Possible enhancement: hashmap for fast dim and var query by name #234

Closed
gsjaardema opened this issue Mar 9, 2016 · 4 comments

Comments

@gsjaardema
Contributor

This issue is to continue some discussion started in pull request #229 related to the addition of a hashmap to speed up dim and var query by name.

A few more observations:

  • Need two types of queries to be fast
    • query by name which returns either var or varid
    • query by varid which returns var (similar for dimid/dim)
  • It looks like the current nc4 implementation assigns dimids globally, based on the next_dimid field in the NC_HDF5_FILE_INFO_T struct.
    • dimids are unique at the file scope.
    • To provide fast lookup of a dim from a dimid, it might be possible to use the dimarray concept used in the nc3 implementation to store dims off of the NC_HDF5_FILE_INFO_T struct instead of storing them using the doubly-linked list stored on the group.
  • The nc4 implementation assigns varids locally within a group, using the nvars field in the NC_GRP_INFO_T struct.
    • This means that vars in different groups can share the same varid; varid scope is the group.
    • Could possibly use the nc3 vararray concept to store vars at the group level instead of using the doubly-linked list currently used. This would give fast lookup of var via varid.
  • Need fast lookup of varid and dimid via a name query. This could be provided with a hashmap.
    • This is relatively easy for nc3 files since there is a single namespace.
    • In nc4 files, there can be multiple dims and vars with the same name; names are unique only within a group.
    • One implementation is a hashmap per type (var, dim) per group, but this could result in lots of overhead if there are many groups with not very many dims or vars per group.
    • Other possibility is to have a single hashmap at the file level for dims and another for vars.
      • The key would be the hash of the var/dim name concatenated with the full group path name.
      • This adds the overhead of building the full group path on var/dim creation and on inquiry, but reduces overall overhead since there are only 2 hashmaps instead of 2 per group.
    • Could also use a hash key based on the hash of the name combined with the group id instead of the group name.

I have a prototype hashmap for nc3 files that currently passes all tests. It would need some cleanup for general use, but I wanted to see how doable it was. It provides a quick lookup of a dimid or varid from a name; going from the dimid or varid to the dim or var is then a quick lookup in the dimarray and vararray that nc3 files already use. I hope to extend this to nc4 files, but I am not sure when I will get a chance.
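
To make the group-id variant above concrete, here is a minimal sketch of a single file-level map keyed on (group id, name). It is not the prototype mentioned in this issue and is not netCDF library code: `name_map_t`, `name_map_insert`, `name_map_find`, the fixed `TABLE_SIZE`, and the FNV-1a-style hash are all assumptions chosen for illustration. The returned id would then index an array such as the nc3 vararray or dimarray.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64   /* hypothetical fixed capacity; must be a power of two */

/* One slot maps a (group id, name) pair to an object id (varid or dimid). */
typedef struct {
    int         used;
    int         grpid;
    const char *name;   /* borrowed pointer; the caller keeps the string alive */
    int         id;
} slot_t;

typedef struct {
    slot_t slots[TABLE_SIZE];
    size_t count;
} name_map_t;

/* FNV-1a-style hash: mix in the group id first, then each byte of the name. */
uint32_t hash_key(int grpid, const char *name)
{
    uint32_t h = 2166136261u;
    h = (h ^ (uint32_t)grpid) * 16777619u;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++)
        h = (h ^ *p) * 16777619u;
    return h;
}

void name_map_init(name_map_t *m)
{
    memset(m, 0, sizeof *m);
}

/* Returns 0 on success, -1 if (grpid, name) is already present. */
int name_map_insert(name_map_t *m, int grpid, const char *name, int id)
{
    /* Sketch only: no resizing, so always keep at least one slot empty. */
    assert(m->count < TABLE_SIZE - 1);
    size_t i = hash_key(grpid, name) & (TABLE_SIZE - 1);
    while (m->slots[i].used) {
        if (m->slots[i].grpid == grpid && strcmp(m->slots[i].name, name) == 0)
            return -1;
        i = (i + 1) & (TABLE_SIZE - 1);   /* linear probing */
    }
    m->slots[i] = (slot_t){1, grpid, name, id};
    m->count++;
    return 0;
}

/* Returns the stored id, or -1 if (grpid, name) is not in the map. */
int name_map_find(const name_map_t *m, int grpid, const char *name)
{
    size_t i = hash_key(grpid, name) & (TABLE_SIZE - 1);
    while (m->slots[i].used) {
        if (m->slots[i].grpid == grpid && strcmp(m->slots[i].name, name) == 0)
            return m->slots[i].id;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return -1;
}
```

Because the group id is part of the key, the same name can be inserted once per group without collision at the logical level, and no group path string has to be built on creation or inquiry. A real implementation would also need resizing, deletion, and rename handling, none of which is shown here.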
@WardF
Member

WardF commented Apr 4, 2016

This was addressed in the recent pull request I believe; closing out unless I hear different.

@WardF WardF closed this as completed Apr 4, 2016
@gsjaardema
Contributor Author

The recent pull request was for nc3 files; the above discussion is for nc4 (netcdf-4) files.

@WardF
Member

WardF commented Apr 4, 2016

So it is; I glanced over it when I should have read it more carefully.

@edhartnett
Contributor

I think this is a good idea and I like the changes in your PR.

I am amazed that linked lists should be so much slower, especially for small numbers of variables.

@WardF WardF closed this as completed Jun 27, 2017