Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convention for HEALPix grid parameters #433

Open
uweschulzweida opened this issue Mar 18, 2023 · 55 comments
Open

Convention for HEALPix grid parameters #433

uweschulzweida opened this issue Mar 18, 2023 · 55 comments
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@uweschulzweida
Copy link

We will soon be producing a lot of data on a HEALPix (Hierarchical Equal Area isoLatitude Pixelation) grid. The grid coordinates can be calculated easily from 2 parameters. The NSIDE parameter controls the resolution of the pixellization and the ORDER parameter sets the index ordering convention of the pixels (ring or nested).
I think it would be good to have a convention for storing these parameters in NetCDF. My idea is to define these parameters via the grid_mapping attribute, for example:

dimensions:
	cells = 49152 ;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
		healpix:healpix_nside = 64 ;
		healpix:healpix_order = "nested" ;
	float var(cells) ;
		var:grid_mapping = "healpix" ;

This defines a HEALPix grid with nside=64 and a nested index ordering.
Can this be taken over into the CF Convention or do you have a better suggestion?

@JonathanGregory
Copy link
Contributor

@uweschulzweida, thanks for your question. Maybe it can be treated as grid mapping. Does it have two horizontal coordinate variables, like the existing grid mappings of Appendix F? All of those are methods for converting between the (X,Y) coordinates of the mapping and (longitude,latitude) coordinates. Please could you describe HEALPix in the form of the other entries of Appendix F?

@uweschulzweida
Copy link
Author

I have not found anything suitable in Appendix F. In any case, only the two parameters NSIDE/ORDER are needed to calculate the grid coordinates. For a HEALPix grid it does not make sense to store the coordinates in a NetCDF file. I think I will save the parameters as described above for the time being. If someone has a better idea I can change it.

@JonathanGregory
Copy link
Contributor

You're right, HEALPix is not in Appendix F at the moment. I see from wikipedia that it is a map projection, or a class of map projections. If you could provide a definition of its parameters in the form of the other entries of Appendix F, we could certainly consider adding it.

You remark, "For a HEALPix grid it does not make sense to store the coordinates in a NetCDF file." Sect 5 of CF expects latitude and longitude coordinates to be provided as 2D auxiliary coordinate variables if the horizontal coordinates aren't latitude and longitude (as in this case). This is mandatory, whereas the grid mapping is an optional extra. The reason for it is to make the data self-explanatory and useful (the aim of the CF convention, in general), by enabling generic applications to geolocate the data. Most applications will not be aware of the HEALPix grid, for instance, but will understand latitude and longitude.

Best wishes

Jonathan

@bnlawrence
Copy link

I'm missing something here @JonathanGregory. When we use a map projection we provide coordinates which can only be turned into lat/lon by using information about the map projection itself.

In Healpix, the dimension index (singular) plus the nside and order are sufficient to generate the lat/lon coordinates. So it's the same as any other coordinate system (cordinates in lat/lon) = function(coordinates in projected grid) where the function you need is defined outside the CF file itself.

Section 5 says:

If the coordinate variables for a horizontal grid are not longitude and latitude, then a grid_mapping variable provides the information required to derive longitude and latitude values for each grid location. If no grid mapping variable is referenced by a data variable, then longitude and latitude coordinate values shall be supplied in addition to the required coordinates.''

Which I take to mean that given Healpix is a well known formalism for whch a grid mapping variable can be provided, and so we don't need latitude and longitude.

@JonathanGregory
Copy link
Contributor

Dear Bryan @bnlawrence

Actually it was me who was missing something! I didn't know that the convention had been changed, such that the 2D lat and lon coordinates are no longer mandatory if the grid mapping is provided. This change was introduced in version 1.8 by issue 179, which I had never seen before this morning! I'm not disputing it, and I understand the reasons for changing it. I suspect I didn't notice it because it was proposed during the 2019 CF meeting in Tacoma, which I didn't attend in person. There may have been several things initiated then, which were of course properly followed up in issues, but since I'm quite attentive to the CF discussions, my oversight shows there's a danger when a large number of things are done at once that they may not receive the same scrutiny as usual. That's something to keep in mind.

So you're right, Healpix can be included without 2D aux lat and lon coordinates if it can be described by a grid mapping in the format of Appendix F, as discussed with @uweschulzweida. Can that be done?

Best wishes

Jonathan

@davidhassell
Copy link
Contributor

davidhassell commented Oct 17, 2023

Hello,

It's interesting to note that the 5.6 text implicitly assumes that horizontal coordinates of some type are always present. What if there are no projection coordinates?

This is potentially relevant because it's not obvious to me what auxiliary coordinates, if any, would be for the data variable's cell dimension (from the CDL in the original post). It would be great if someone clarify this point.

Thanks,
David

@bnlawrence
Copy link

bnlawrence commented Oct 17, 2023

I think it's crucial to understand that the choice of order defines the domain filling curve to be used:
Screenshot 2023-10-17 at 15 06 02
This is not as bad as it seems as there is a clear functional relationship between the pixels in both schemes and their latitude and longitudes. Cell bounds might be more fun.

(Figure from Górski et al, 2004, https://iopscience.iop.org/article/10.1086/427976/pdf, shows pixelation for nside=2)

@bnlawrence
Copy link

bnlawrence commented Oct 17, 2023

(Incidentally there is a lot of confusion in people talking about healpix in our community because at the same time as introducing healpix, people are introducing hierarchical datasets, that is, storing lots of versions of the same variable at different pixelations, so that folks wanting low resolution data can read a lower resolution pixelation. E.g. in practice folks talking about storing multiple zoom levels where $\mathrm{cells} = 12 \times 4^z$, $z$ is the zoom level, and $\mathrm{nside}=2^z$)

@davidhassell
Copy link
Contributor

Thanks, @bnlawrence, I now understand the suggested healpix:healpix_order = "nested" in the CDL.

So there are no auxiliary coordinates for the cell dimension, then. I suppose that's OK, as it's not really any different to, say, having easting and northing coordinates for a transverse mercator projection - they're not much use unless you go through the laborious mathematical procedure to work out the longitudes and latitudes.

@bnlawrence
Copy link

@uweschulzweida I see that the zoom level seems pretty fundamental to how DKRZ are using healpix. Do you think the zoom level should be in the attributes directly, rather than say, nside?

@davidhassell
Copy link
Contributor

So there are no auxiliary coordinates for the cell dimension, then.

... or would you store the latitudes and longitudes as auxiliary coordinates, with the grid_mapping there to show provenance (and to define the cell edges)

@uweschulzweida
Copy link
Author

@uweschulzweida I see that the zoom level seems pretty fundamental to how DKRZ are using healpix. Do you think the zoom level should be in the attributes directly, rather than say, nside?

The zoom level can be easily calculated from the nside parameter. I don't think this should be stored as an attribute, since the zoom level is only applied to nested ordering.

@bnlawrence
Copy link

Well, zoom level and nside are intimately related right: $\mathrm{nside} = 2^{z}$?

Why do you say it is only applied to nested ordering? My reading of the DKRZ documentation suggests it is being used as a query parameter into a dataset with a number of variables each with a different zoom level. It seems to me that the "interesting" information from the metadata point of view is the zoom level, not the nside parameter as it's not directly related to the way variables are chosen.

@uweschulzweida
Copy link
Author

Yes, the zoom level is important for us! We store the data in zarr archives. Each zoom level is a separate dataset with all variables. The name of the zarr archive contains the zoom level that is accessed via the query request.

@bnlawrence
Copy link

Given that, it feels like it would be more useful to expose the zoom level in the CF metadata, which means that tools which harvest that metadata can use it, and of course, the tools which read the data can trivially calculate nside from the zoom level when attempting to use code that needs that.

(Obviously it's trivial both ways, but I figure the representation which needs least conversion should be the one that is harvested to catalogs and/or visible when lazily loading.)

@bnlawrence
Copy link

I said earlier that cell bounds might be fun. The advantage of Healpix is that the cells are equal area, which is obviously useful for regridding (zooming) in a healpix grid, but not so good if we wanted a conservative regrid onto another coordinate system (or indeed conservative regridding to a healpix projection). But it's not an impossible calculation and we presume there are libraries that do it. This figure is helpful:
Screenshot 2023-10-18 at 12 45 16

@bnlawrence
Copy link

Just been discussing what we might put in Appendix F with @davidhassell; suffice to say we'd need a reasonably complete description of what is going on with HealPix. We'd also want, we think, to recommend (but not mandate) the use of two 1D auxiliary coordinate variables with the lat/lon of the pixel centres so as to simplify data usage by those that don't want to utilise healpy or equivalent.

@sebvi
Copy link

sebvi commented Oct 18, 2023

just discovering this discussion now.

We actually did the same exercise to add the HEALPix grids into WMO GRIB2 standard (I suspect for the same project than the one reported by @uweschulzweida ): GRIB2 issue

I have 2 comments:

  1. the restriction nside = 2^k is only valid for the nested ordering because of the way this ordering works. But it is not a requirement for ring ordering and in fact we use a nside that is not a power of 2 at ECMWF. It is perfectly fine to have 12 diamonds area that are 5 by 5, 9 by 9, 12345 by 12345, etc. It is particularly useful when you go to high resolution because the number of points (pixels) can quickly explode and one might not want to go from nside=1024 to nside=2048 or nside=4096 as a model resolution could increase. Sure you loose the "zooming" feature but it is not always a requirement of your workflow or application. This is why we decided to go with encoding nside in the GRIB2 metadata rather than the zoom level k.
  2. Strictly speaking, nsides and the ordering type are not sufficient to decode the data. nside will tell you how many iso latitudes and how many points on each iso latitude you have but you will need the longitude of at least one point to relatively compute the other points wrt it. Sure, you could impose one like it is done in many tools implementing HEALPix grids but you loose flexibility. That said choosing or not a reference longitude is conceptually just a rotation around the north/south pole rotation axis and could possibly be handled using extra rotation metadata (but I'm not sure).

In the GRIB2 proposal we also added extra keys to specify if the observable is valid at the center of the pixel, at edges, etc. You may want to do that as well (although I must admit I would need to read again how grids and projections are handled in CF).
At the end I would welcome something as close as possible to what is done in GRIB2 for the sake of interoperability and format conversion, in the limits of what is usually done in CF of course.

@bnlawrence
Copy link

Thanks @sebvi, really glad you've jumped in here!

  1. My reading of "the" HealPix paper is that footnote 11 is intended to be part of "the" definition. However, given in practice ECMWF doesn't feel bound by that, and given as a consequence GRIB2 doesn't, it'd be silly for CF to hold to it, and hence we had better use nside too.
  2. It's pretty clear that the default expectation is for the longitude to be 0, but we've been burnt with defaults, so I think we too should require that.

There is a lot of other material in your various templates, is any of that relevant here? I note you also have table entries for edges and vertices. Is that mandatory or have you added that so that the information can be provided if desired?

@sebvi
Copy link

sebvi commented Oct 18, 2023

ECMWF has an (indirect) interest in seeing this going forward as many of our end users might want to retrieve the data in HEALPix GRIB2 and then convert to HEALPix netCDF to then use with their preferred tools. The closer the metadata the easier the conversion. :)

  1. one of the reasons not to conform to the rule nside=2^k is mostly its lack of flexibility. basically nside can only take the values 1, 2, 4, 8, 16, ... which is a limited set of resolutions not necessarily close to the sort of resolutions we usually run our models now and in the future. This is why we use mostly the ring ordering because then we can choose a resolution as close as possible to the resolution we are running when not using HEALPix. When interpolating from/to HEALPix to/from a non HEALPix grid, not bound by the 2^k rule, it has some implications as well.
  2. We actually use a default of 45 degrees for the reference longitude.... The reasoning behind is that if you look at the first picture you posted of the ordering schemes, you will notice that pixel 0 of diamond 0, in both cases, is at 45 degrees. I am not sure it is the default in popular tools and in particular in healpy but it was the one that made sense to us. But as long as you define an optional longitude reference with a sensible default value if not present, you are covered. The other reason why we included the possibility to specify that reference longitude is that we know by experience that it is only a matter of time until someone request the feature, likely because their area of interest falls between 2 or more diamonds and they would rather define the points to have their area in only one diamond. I could also see countries in diamonds 4 to shift everything so that diamond 4 is not split, i.e. to have the points contiguous (this would only matter in ring ordering obviously)

The extra keys in GRIB2 were added for several reasons but mostly to be covered for future requests:

  • in traditional grid like regular lat/lon it is not uncommon to have institution encoding points bottom right to top left rather than the traditional top left to bottom right. Typically south hemisphere folks would want to encode the south hemisphere first for faster access (these access consideration might not be relevant for netCDF). For this we have "scanning modes" to indicate in which order the points are encoded: if we do left to right or right to left, top to bottom or bottom to top, if the points come by rows then columns or by columns then rows, etc. This is just to keep maximum flexibility. We are planning to use this in the "traditional top left to bottom right" way but if someone wants to do it differently they can do it. Note that this is mostly useful for ring ordering. It can also be used in nested ordering but I would recommend to stay away of this!

  • We have a table for edge and vertices because similar things were done for similar grids. GRIB2 supports an "icosahedron" grid and they have defined this sort of things.

EDIT: I 've hit the button too quickly

@larsbarring larsbarring added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Oct 18, 2023
@bnlawrence
Copy link

bnlawrence commented Oct 19, 2023

The closer the metadata the easier the conversion. :)

I'm pretty keen that we don't introduce unnecessary semantic mismatches. I don't see any reason for doing so here.

I think we can build a proposal based on this. I think the important point is that we would only need one mandatory additions to the original CDL (i.e. something like a mandatory healpix:healpix_Lo = Float).

I am in two minds as to the necessity for scanning order in netCDF, what do others think? If we did it, we could do it with optional additional parameters?

I think we can optionally use auxiliary coordinates for the vertices etc. Will think more about that.

@bnlawrence
Copy link

Incidentally the document produced for the GRIB appendix provides a useful intro to all this for newcomers!

@taylor13
Copy link

eyes=thanks!

@larsbarring
Copy link
Contributor

larsbarring commented Oct 20, 2023

@bnlawrence asks

I am in two minds as to the necessity for scanning order in netCDF, what do others think? If we did it, we could do it with optional additional parameters?

From my outside perspective I think that transferring the concept of [different] scanning orders from GRIB to netCDF is not helpful.

In the GRIB "world" I do see the need and underlying reasons (as @sebvi hinted at) where GRIB is essentially a machine-to-machine format where scanning order is a well established fact. This is not the case for netCDF, which is a data format widely used across rather different communities. And scanning order would be a new concept in the netCDF "world" (where I, like @sebvi, have questions regarding the actual need/use case, e.g. with respect to efficiency). Across the different communities using netCDF there is already widespread confusion regarding X, Y, Z vs. latitude, longitude (see here for a rundown of anecdotal evidence).

Hence, I think that unifying GRIB different scanning orders into what is established in netCDF is a perfect task for GRIB-to-netCDF converter tools.

@JonathanGregory
Copy link
Contributor

Dear Sebastien @sebvi, Bryan @bnlawrence et al.

Thanks for the discussion on this issue, on a convention for HEALPix in CF. Is someone in a position to make a definite proposal, to take it forward?

Best wishes

Jonathan

@davidhassell
Copy link
Contributor

Hello,

@sebvi wrote:

In the GRIB2 proposal we also added extra keys to specify if the observable is valid at the center of the pixel, at edges, etc. You may want to do that as well (although I must admit I would need to read again how grids and projections are handled in CF).

I was wondering what the use cases might be for storing data on edges and vertices. It would seems to me that you'd only want to do this if there are well defined mappings between the edges/vertices and the pixels (faces) themselves - is that the case? You could of course use UGRID to to store such information, but then the structure is lost, which seems counterproductive!

Thanks,
David

@davidhassell
Copy link
Contributor

davidhassell commented Sep 4, 2024

Hello,

I've been thinking about this again, as there is going to be Hackathon on trying to progress this issue at the 2024 CF workshop (currently scheduled for 08:00 UTC on Thursday 19th September).

There was (I think) a consensus to use a grid_mapping variable to identify and define a HEALPix grid, e.g. (from the initial post):

dimensions:
	cells = 49152 ;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
		healpix:healpix_nside = 64 ;
		healpix:healpix_order = "nested" ;
	float var(cells) ;
		var:grid_mapping = "healpix" ;

but we also need to identify which data dimension the grid mapping applies to.

In the above example there is only one data dimension, but there could be many, and there could be more than one that is the same size as the HEALPix dimension. For instance, let's introduce a time dimension:

dimensions:
        time = 12 ;
	cells = 12 ;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
		healpix:healpix_nside = 1 ;
		healpix:healpix_order = "nested" ;
        float time(time) ;
                time.units = "days since 2024-09-04" ;
	float var(time, cells) ;
		var:grid_mapping = "healpix" ;  // We don't know which is the X-Y dimension

Normally we associate the grid mapping with the appropriate dimensions via coordinate variables (section 5.6) - either implicitly via their standard names, or else with an explicit format that references their netCDF variable names. This is not possible when there are no coordinate variables!

Perhaps the obvious solution is to allow the explicit grid_mapping format to also reference netCDF dimension names, which would only be allowed - and be mandatory - for particular grid mappings, i.e. only HEALPix at the current time. This would look like:

dimensions:
        time = 12 ;
	cells = 12 ;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
		healpix:healpix_nside = 1 ;
		healpix:healpix_order = "nested" ;
        float time(time) ;
                time.units = "days since 2024-09-04" ;
	float var(time, cells) ;
		var:grid_mapping = "healpix: cells" ; // Now we do know which one is the X-Y dimension

Are there any problems with this approach, I wonder?

Thanks,
David

@JonathanGregory
Copy link
Contributor

Dear @davidhassell

Your proposal for HEALPix looks sensible to me. You are proposing that the second syntax in Sect 5.6 for grid_mapping should be changed from "gridMappingVariable: coordinate_variable [ coordinate_variable …​] [ gridMappingVariable: ... ]" to "gridMappingVariable: coordinate_variable_or_dimension [ coordinate_variable_or_dimension …​] [ gridMappingVariable: ... ]", I believe - is that right?

Earlier, you asked whether the HEALPix dimension could have optional auxiliary coordinates of latitude and longitude. I think Yes. Even though they're not mandatory any more, they could be recommended, because it makes the data useful to analysts and programs which don't know how to work out the coordinates.

Best wishes

Jonathan

@davidhassell
Copy link
Contributor

Dear Jonathan,

You are proposing that the second syntax in Sect 5.6 for grid_mapping should be changed from "gridMappingVariable: coordinate_variable [ coordinate_variable …​] [ gridMappingVariable: ... ]" to "gridMappingVariable: coordinate_variable_or_dimension [ coordinate_variable_or_dimension …​] [ gridMappingVariable: ... ]", I believe - is that right?

Indeed.

Earlier, you asked whether the HEALPix dimension could have optional auxiliary coordinates of latitude and longitude. I think Yes. Even though they're not mandatory any more, they could be recommended, because it makes the data useful to analysts and programs which don't know how to work out the coordinates.

Ye, I think that's right. I don't we need to recommend it though, as we don't for any other types of grd mapping - we say instead in 5.6: "Such coordinates [lats and lons] may be provided in addition to the provision of a grid mapping variable, but that is not required.". Not being willing or able to use the grid mapping to create missing lats and lons is equally problematic for any projection, including HEALPix - possibly less to for HEALPIx, because at least you know that it has global coverage.

@JonathanGregory
Copy link
Contributor

JonathanGregory commented Sep 5, 2024 via email

@davidhassell
Copy link
Contributor

@JonathanGregory wrote:

I am unclear why the former requirement to produce the 2D lat and lon aux coord vars, when it was abolished, wasn't replaced with a recommendation to do so. Perhaps this should be revisited, because it can certainly help with the usability of data.

That's fine by me.

@davidhassell
Copy link
Contributor

davidhassell commented Sep 5, 2024

This discussion has made me think that there may be an implication for the CF data model, here.

The Coordinate Reference construct "relates the coordinate values of the coordinate system to locations in a planetary reference frame".

What we'll have in the HEALPix case is a relationship between "a Domain Axis construct and locations in a planetary reference frame", and the ordering of the domain axis construct must be consistent with that defined by the Coordinate Conversion.

@JonathanGregory
Copy link
Contributor

You could argue for HEALPix that there is an implied coordinate value, which is an integer index starting from zero for the first element along the dimension. Perhaps the extended syntax you suggested above grid_mapping = "grid-mapping-variable: name" could be interpreted to mean the coordinate variable name if it exists, otherwise implying a discrete axis with coordinate values 0, 1, ..., name-1, where name must be a dimension in the file.

Thinking again about this, I wonder whether in the HEALPix case we ought to require latitude and longitude auxiliary coordinate variables as well, or at least strongly recommend them, because without them a non-HEALPix-aware software has no good way to make a plot of the data. In other cases of non-latitude-longitude horizontal coordinates, you can still plot the data with the supplied 1D coordinate variables.

@davidhassell
Copy link
Contributor

Dear Jonathan,

Thanks for these thoughts ... I'd like to argue the other side, to see where we get to :)

You could argue for HEALPix that there is an implied coordinate value, which is an integer index starting from zero for the first element along the dimension.

I would instead say that such indices are not a coordinates, as they serve no geolocation purpose and cannot be converted to coordinates in another CRS. So I still think we need to extend the data model.

Another application of this extension could be for tripolar grids. For these, we currently have no way of encoding which is the X and which the Y axis of the horizontal grid, information which is essential for operations such as regridding. Perhaps we could use the "grid mapping + dimensions" idea to make this connection, and also provide a place to describe how the particular tripolar grid was created. I.e. something like data_variable:grid_mapping = "tripolar: xdim ydim" (ignoring for now the obvious question of whether X or Y comes first!). This may seem like a diversion, but if the data model were to be extended we should think about what implications that would have on other area, I suppose.

In other cases of non-latitude-longitude horizontal coordinates, you can still plot the data with the supplied 1D coordinate variables.

In the traditional grid mapping case (e.g. transverse_mercator), if there were no 2-d latitudes and longitudes and you were not able to decipher the grid_mapping, then what exactly could you plot? Surely just a rectangle of values for which you have no idea where on Earth it lies. That seems to me not much different to the HEALPix case of plotting the 1-d array of values - just like in the transverse mercator case, you don't know where each value is.

There is of course more to it than that. In the transverse mercator case you at least know that adjacent cells in the array are also physically contiguous (assuming that there are no "gaps" in the coordinate variables). Not so in the HEALPix case, although you do know that the HEALPix grid has global coverage.

@davidhassell
Copy link
Contributor

Hello,

Just to let you know that we'll be working on this issue in the 2024 CF workshop during a hackathon session on Thursday 19th September (10:00–12:30 UTC+2).

Thanks,
David

@carloshorn
Copy link

Hello,

I am experimenting with HEALPix for earth observation data and was wondering how to store such data on NetCDF.
Thanks for taking the initiative starting this discussion.

Reading the conversation, I came across GRIB2's design choice to allow nside != 2^k, which definitely has tremendous value.

the restriction nside = 2^k is only valid for the nested ordering because of the way this ordering works. But it is not a requirement for ring ordering and in fact we use a nside that is not a power of 2 at ECMWF. It is perfectly fine to have 12 diamonds area that are 5 by 5, 9 by 9, 12345 by 12345, etc. It is particularly useful when you go to high resolution because the number of points (pixels) can quickly explode and one might not want to go from nside=1024 to nside=2048 or nside=4096 as a model resolution could increase. Sure you loose the "zooming" feature but it is not always a requirement of your workflow or application. This is why we decided to go with encoding nside in the GRIB2 metadata rather than the zoom level k.

I would like to mention the xy-HEALPix plane coordinates which I have not seen mentioned here. There, each pixel is given as the triple of face index and xy-coordinate within the face.
The transformation from xy-coordinates to ring index is given by equations 15 to 18 in https://arxiv.org/pdf/astro-ph/0409513.
The transformation to nested index (if nside = 2^k) is the inverse of equation 13 and 14, which is technicallly speaking the application of the pdep (parallel bit deposit) intrinsic of modern architecture.

The reason I want to mention this is the limitation I see in using only the ring index for nside != 2^k. The advantage of the nested index is data proximity for continuous data blocks. However, we can also achive this in netCDF using array chunks in the xy-coordinates. Another point is the fact, that we do not loose the "zooming" feature, because we just need to lower the resolution of the face images, e.g. average 3x3 cells in a 192x192 face to get a 64x64 face.

Hope this might be of interest to this discussion,
Carlos

@pelusanchez
Copy link

Hello,

We are making some experiments with Healpix and end up in this thread.

Only for the information of this topic, if this has any value for you:

In the easy.gems project they are using crs for storing the information:

https://easy.gems.dkrz.de/Processing/healpix/healpix_starter.html

To be honest, this has been the only real and open example I have found. This might be an interesting thing to know in this discussion.

I will really appreciate if someone can add in this thread others examples

Nonetheless, according to https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#grid-mappings-and-projections, the approach being commented in this thread seems to be more in agreement with the CF conventions. But on the other hand (I do not know if for Healpix fits in this case, as I am not an expert on this) but CF-1.8 seems to allow also this way for defining grid mapping at crs:

https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#grid-mappings-and-projections

My two cents about what is commented here:

Thanks to all of you who participate in this discussion,

@d70-t
Copy link
Contributor

d70-t commented Nov 13, 2024

I just discovered this thread recently and am thankful for all of you bringing these forward. I can't really add much other examples, as I am the author of the mentioned article at easy.gems though.

I agree with @davidhassell that there has to be a way to mark the "cell"-dimension. However, I also sympathize a lot with @JonathanGregory's interpretation of the implied coordinate:

You could argue for HEALPix that there is an implied coordinate value, which is an integer index starting from zero for the first element along the dimension. Perhaps the extended syntax you suggested above grid_mapping = "grid-mapping-variable: name" could be interpreted to mean the coordinate variable name if it exists, otherwise implying a discrete axis with coordinate values 0, 1, ..., name-1, where name must be a dimension in the file.

We've recently experimented with regional variants of the healpix grid. This seems to be odd at first place, but turns out (at least for us) as quite useful, when comparing with global data. We use regional healpix grids in the sense of defining a grid with high nside-values, but keeping cells which are not covered by the model with missing values, this works for storage (especially when using compression and skipping of empty chunks) but client libraries tend to do some extra work as they don't have a way of quickly skipping empty regions.

Thus we started adding the "cell"-coordinate (as an integer index) explicitly, which allowed to only keep valid cells along the "cell"-dimension. This way, client libraries can easily skip empty areas. This approach also nicely integrates with xarray's behaviour where coordinate-based selection (.sel()) will fall back to index selection (.isel()) whenever there's no explicit coordinate. So while this coordinate might not be a coordinate in the classical sense, it's still a quite useful one.

cc @lkluft

@pelusanchez
Copy link

pelusanchez commented Nov 14, 2024

Dear @d70-t,

What you have explained seems very interesting, in particular for some of our use cases. If I have understood, you are using something like this, isn't it?:

dimensions:
        time = 12 ;
        cell = 6 ;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
		healpix:healpix_nside = 2;
		healpix:healpix_order = "nested" ;
        int time(time) ;
                ...
        int cells(x) ;                    // For example: [9, 10, 11, 12, 26, 28]
	float var(time, cells) ;
		var:grid_mapping = "healpix: cells" ; 

Best regards,

@d70-t
Copy link
Contributor

d70-t commented Nov 14, 2024

Yes, roughly like that. But it would be more like int cells(cells), e.g.:


dimensions:
        time = 12 ;
        cells = 6 ;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
		healpix:healpix_nside = 2;
		healpix:healpix_order = "nested" ;
        int time(time) ;
                ...
        int cells(cells) ;                    // For example: [9, 10, 11, 12, 26, 28]
	float var(time, cells) ;
		var:grid_mapping = "healpix: cells" ;

@davidhassell
Copy link
Contributor

Hello,

Thanks for these recent comments! This topic was discussed in a hackathon durgni the CF meeting in September, a brief writeup of which can be found on pages 40 and 48 of

Bärring, L., Fisher, E., Pamment, A., Eggleton, F., Lee, D., Hassell, D., Davis, E., O'Brien, K., & Cofiño, A. S. (2024, December 4). Proceedings of the 2024 CF Workshop, 17-20 September 2024, Norrköping, Sweden. 2024 CF Workshop, Swedish Meteorological and Hydrological Institute (SMHI), Norrköping, Sweden. https://doi.org/10.5281/zenodo.14271028

In brief, we came to two main conclusions:

  1. We should not restrict the HEALPix parameters that can be provided, in terms of which parameters are available and the values they may take. Any expected constraints on the values will apply, of course - such as the data type of a parameter value - but we won't insist on a subset of valid values, e.g. we would allow any integer for nside.
  1. We would like a new standardised grid_reference (suggested name) attribute to a grid mapping variable that specifies whether or not the grid_mapping applies to coordinates, or only dimensions. The attribute would be only be allowed the value coordinate or dimension where:
  • coordinate (the default if not set) means that the coordinate reference system is defined by the coordinates given by (or implied by) the grid_mapping attribute of a data variable, as well its datum and coordinate conversion (these are well-defined entities from the CF data model, and are both derived from a grid mapping variable's attributes). This default value correctly applies to all existing grid mappings, so there are no backwards compatibility issues.

  • dimension means that the coordinate reference system is defined solely by its datum and coordinate conversion, and applies to the dimensions given explicitly by the grid_mapping attribute of a data variable. This is the case for HELAPix, and could have application to other CRSs (such as tripolar grids).

Example:

dimensions:
        time = 12 ;
	cells = 108 ;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
                healpix:grid_reference = "dimension" ;  // new grid_reference attribute
		healpix:healpix_nside = 3 ; // not a power of 2, in this case
		healpix:healpix_order = "nested" ;
                healpix:earth_radius = 6378000 ;
        float time(time) ;
                time.units = "days since 2024-09-04" ;
	float data(time, cells) ;
		data:grid_mapping = "healpix: cells" ;  // grid mapping explicity associated with "cells",
                                                        // and because healpix:grid_reference=dimension
                                                        // we know that "cells" is a dimension, rather than
                                                        // a variable with that name. 

We didn't consider the idea of only defining data on a subset of the HEALPix cells (e.g. #433 (comment)), as it hadn't been raised back then! Thinking about it now, it seems like it might be a case for the existing functionality on lossless compression by gathering - which is all about not storing data where it has missing values. This would look like:

dimensions:
        time = 12 ;
        cells = 49152 ;  // we include the size of the uncompressed dimension
	data_cells = 15234 ;  // size of the compressed dimension
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
                healpix:grid_reference = "dimension" ;  // new grid_reference attribute
		healpix:healpix_nside = 64 ;
		healpix:healpix_order = "nested" ;
                healpix:earth_radius = 6378000 ;
        float time(time) ;
                time.units = "days since 2024-09-04" ;
        int data_cells(data_cells) ;
                cell:compress = "cells" ;
	float data(time, data_cells) ;  // Data stored on a subset of the HEALPix grid
		data:grid_mapping = "healpix: data_cells" ;  
data:
    data_cells = 9, 10, 11, 12, 26, 28, ... 45784, 48811 ; // 15234 integer values

This compression by gathering example is especially nice, because it highlights that there can be no confusion between the "data_cells" dimension and the "data_cells" coordinate variable, because we have healpix:grid_reference = "dimension".

@davidhassell
Copy link
Contributor

Oh, I forgot to say that

  1. We would like a new standardised grid_reference (suggested name) attribute to a grid mapping variable that specifies whether or not the grid_mapping applies to coordinates, or only dimensions.

has implications for the CF data model, because up to now CRSs can only be defined by coordinates, and not abstract dimensions. This is not a problem, however, and can be easily $^{*}$ fixed by the drawing of a new line between the green "CoordinateReference" and "DomainAxis" boxes in the main data model diagram, with an accompanying sentence or two.

$^{*}$ ... but doing so without crossing any other lines might be a challenge :)

@JonathanGregory
Copy link
Contributor

Dear @davidhassell et al.

Well done on the HEALPix progress!

About David's last point, though I know that David disagrees with this suggestion, I think that we do not necessarily need to link the DomainAxis to the CoordinateReference in the data model. I suggest that if a CoordinateReference needs a DimensionCoordinate for a particular DomainAxis (of size n), but there is no such construct, it can be taken to imply a DimensionCoordinate with values 0,1,...n-1. This "default" kind of DimensionCoordinate could be interpreted as something provided by the "Generic Coordinate Construct" box on the data model diagram. In CF-netCDF, that behaviour would be indicated by the grid_reference = "dimension" attribute which you propose.

Best wishes

Jonathan

@davidhassell
Copy link
Contributor

Thanks, Jonathan.

You're right that some weeks ago I didn't agree with your suggestion ... but I'm going to think about it again with an open mind :-) Starting by re-reading the comments from September ...

@davidhassell
Copy link
Contributor

davidhassell commented Dec 11, 2024

Edited 2025-01-07

I think I may have made a mistake in the above compression by gathering example.

I gave the data variable as

	float data(time, data_cells) ;  // Data stored on a subset of the HEALPix grid
		data:grid_mapping = "healpix: data_cells" ; 

but I feel that it should it have been:

	float data(time, data_cells) ;  // Data stored on a subset of the HEALPix grid
		data:grid_mapping = "healpix: cells" ;  // reference the *uncompressed* dimension

I think this makes more sense. Data variable attributes should not ("not" added on 2025-01-07) be affected by the encoding of the data array, and if the data were not compressed by gathering we'd certainly write healpix: cells.

Does that make sense?

@tinaok
Copy link

tinaok commented Dec 12, 2024

Hello,
Thanks to AGU24, I’m very happy to find this thread on HEALPix. We are currently implementing HEALPix as a part of a Discrete Global Grid System (DGGS), following the OGC-DGGS specifications, as detailed in our recent publication.

Would it be possible to consider following certain terminology defined in the OGC-DGGS specifications into the implementation of HEALPix in the CF-Conventions? This could help ensure alignment and interoperability across different communities (Earth Observation and Earth system modelling .. ).

For example re-considering also the usage of term 'level'
xarray-contrib/xdggs#64
I’d appreciate your thoughts, particularly from @strobpr, @keewis, and @allixender

@keewis
Copy link

keewis commented Dec 13, 2024

for the healpix-related attributes, I'd like to point out that healpix_order is somewhat confusingly defined: In healpy (one of the more widely used packages) has a reorder function to translate between "ring" and "nested", but also nside2order which defines order as the log of the nside. I've also seen "ordering", which is sufficiently different from "order". The healpix paper calls this "numbering scheme", so I'd follow that (in xdggs I went with "indexing_scheme" to be a bit more explicit).

I don't see any immediate issues with healpix_nside, but should you wish to extend this to other DGGS, it might make more sense to follow the OGC terminology, which would split nside into level (or refinement_level) and aperture (or refinement_ratio) – the relation being $N_{\mathrm{side}} = \mathrm{aperture}^\mathrm{level}$

@carloshorn
Copy link

Hi @tinaok and @keewis, thanks for introducing me to the concept of DGGS. Sounds similar to the concept of Multi-Order Coverage Maps (MOCs) in astronomy, see https://cds-astro.github.io/mocpy/ for an implementation.
To my understanding the DGGS concept exploits the hierarchy of some tesselations, e.g. HEALPix with nside = 2^k, where the hierarchy is introduced by using a Z-Order curve for indexing the cells within a face, or as another example the S2 geometry, which uses a Hilbert curve to index the cells within a face.
If I understand it correctly, the "zone indexing reference system" within a Discrete Global Reference System (DGGRS) should encode the cell index together with the level. So the "Unique Identifier scheme", see section 3.2 in https://healpix.sourceforge.io/pdf/intro.pdf would be your cell identifier in HEALPix, or something like the S2 Cell Id (which I find a bit more intuitive), see https://s2geometry.io/devguide/s2cell_hierarchy.
However, allowing nside != 2^k, will create tesselations which are not hierarchical. Therefore, the concept of level will not apply in general.
Note, I do not want to exclude you from any discussion. I could imagine that a DGGS convention would allow HEALPix data as described in this issue with the restriction of using nside = 2^k and nested ordering / numbering scheme, and indicating the DGGS compatibility by providing some special attribute like dggs_level, but IMHO this is out of scope for this issue.

Regarding the healpix_order attribute, I agree with @keewis that the name has an ambiguity and support the change to healpix_indexing_scheme.

@keewis
Copy link

keewis commented Dec 13, 2024

allowing nside != 2^k, will create tesselations which are not hierarchical

the effective nside would still have to be nside = n^k, with n constant for the entire hierarchy of the grid. I think I saw that healpix supports that, but I may be mistaken, and either way most implementations require n = 2. FWIW, I personally don't need the non-base 2 nsides.

I only brought up level because it would make use of the same terminology for all DGGS, but if having non-standardized <dggs>_<dggs-specific attribute> is fine then indeed this discussion can be left to a separate issue.

Sounds similar to the concept of Multi-Order Coverage Maps (MOCs) in astronomy

As far as I understand it, MOC is a separate concept also covered by the OGC spec which allows varying the nside / level within the same dataset, and instead each cell id also contains the nside / level (for healpix, this implies that indexing_scheme = "unique" is required). So for MOCs you'd either have to disallow specifying nside / level separately, or require healpix_nside = "varying" or healpix_nside = null (or something similar).

@florianziemen
Copy link

florianziemen commented Dec 13, 2024

I think we should keep the concept of the (zoom) levels in the game for HEALPix. Otherwise we'd be talking about an EALPix grid. Also, the numbering ( zero to something in the 10s ) is much more intuitive than n-side.

One way to allow for other partitioning would be to (optionally) denote the base of the partitioning. Usually two, but if somebody wants to create their hierarchy by first splitting each of the 12 primary cells into 1234x1234 secondary cells and repeating that process over and over - so be it. These people would then need to decide how to number these cells in an nested ordering, and probably live with the fact that (virtually) no client library will support this without their intervention.

The nested ordering is not only helpful for the hierarchies, but also crucial for getting regionally coherent indices, and thus minimize data loads when working with regions in the output.

@carloshorn
Copy link

I think we should keep the concept of the (zoom) levels in the game for HEALPix. Otherwise we'd be talking about an EALPix grid. Also, the numbering ( zero to something in the 10s ) is much more intuitive than n-side.

@florianziemen, I understand your emphasis on the hierarchy which is part of the name HEALPix, but let's not discourage the EALPix use-cases. I am sure such data providers (e.g. ECMWF) will enable their users to read the data.

One way to allow for other partitioning would be to (optionally) denote the base of the partitioning. Usually two, but if somebody wants to create their hierarchy by first splitting each of the 12 primary cells into 1234x1234 secondary cells and repeating that process over and over - so be it. These people would then need to decide how to number these cells in an nested ordering, and probably live with the fact that (virtually) no client library will support this without their intervention.

The nested ordering is not only helpful for the hierarchies, but also crucial for getting regionally coherent indices, and thus minimize data loads when working with regions in the output.

Regarding the data proximity, I suggested the xy-HEALPix plane coordinates to address this point, but did not see any reaction. I guess most people prefer a flat index for HEALPix. However, the flattened xy-HEALPix plane coordinates of the first refinement level can still serve as the default way to number cells in a hierarchical way. Note, that for a base 2x2 partitioning it results in the Z-order curve, so I would call it a very natural way to generalise it. Furthermore, from the previous comment, #433 (comment), we see that only the ring indexing scheme has been used in these cases. IMO this theoretical definition of a nested indexing scheme for nside != 2 is sufficient, because EALPix users know that their hierarchy will never leave the first level, so they will likely not use any nested ordering. This is more about keeping your concept of levels in the standard.

Taking your suggestions @florianziemen, @keewis, and @tinaok, I extract the following proposal from your comments:
You would like to have something similar to the following new standard attributes for the grid_mapping:

  • indexing_scheme: A string identifier which for a given tesselation provides a known way to number each tile (or depending on the context also called cell or zone). Example values: "nested", "ring", "unique" in the case of HEALPix.
  • refinement_level: The level within the hierarchy of a DGG (e.g. HEALPix, HTM, S2, etc.) starting at 0 with the base tesselation. Example values: 0, 1, 2, ...
  • refinement_ratio: The number of sub-tiles (cells, zones) each tile divides into as the DGG moves up the hierarchy. Example values: 4, 9, 16, 25, ... In case of HEALPix, this needs to be a square number. Other DGGs may have other restrictions like only 4 is allowed.

A complete example similar to #433 (comment) would look like this:

dimensions:
        time = 12 ;
	cells = 192;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
                healpix:grid_reference = "dimension" ;
		healpix:refinement_level = 2;  // number of cells = 12*4^level = 192
                healpix:refinement_ratio = 4;  // optional, should default to 4 in case of HEALPix
		healpix:indexing_scheme= "nested" ;
                healpix:earth_radius = 6378000 ;
        float time(time) ;
                time.units = "days since 2024-09-04" ;
	float data(time, cells) ;
		data:grid_mapping = "healpix: cells" ;  

Is this correct @florianziemen, @keewis, and @tinaok?

Side note: according to the definition I found in the OGC API - DGGS about the refinement_ratio: "The ratio of the area of zones between two consecutive hierarchy level (the ratio of child zones to parent zones, also called the aperture).", it should be nside = sqrt(aperture^level) in HEALPix. Maybe the reason you mentioned that level is more intuitive than nside.

An EALPix use-case would look as follows:

dimensions:
        time = 12 ;
	cells = 588;
variables:
	int healpix ;
		healpix:grid_mapping_name = "healpix" ;
                healpix:grid_reference = "dimension" ;
		healpix:refinement_level = 1;  // number of cells = 12*49^1 = 588
                healpix:refinement_ratio = 49;  // sqrt(49) = 7 = nside
		healpix:indexing_scheme= "ring" ;  // likely not using anything else but "ring" for nside != 2
                healpix:semi_major_axis = 6377563.396 ;
                healpix:inverse_flattening = 299.3249646 ;
        float time(time) ;
                time.units = "days since 2024-09-04" ;
	float data(time, cells) ;
		data:grid_mapping = "healpix: cells" ;  

I would be fine with this instead of healpix_nside and healpix_order. @davidhassell, @sebvi and others, what do you think about it?

@d70-t
Copy link
Contributor

d70-t commented Jan 6, 2025

I think I get the logic behind the compression by gathering example by @davidhassell. I'm wondering though, if this will lead to a situation where we have dimensions which are not used by any variable or coordinate. This seems odd to me, although I guess it would be representable in netCDF. However, I don't think we could have that in e.g. xarray, which might become a problem.

@davidhassell
Copy link
Contributor

Hi Tobias,

I think I get the logic behind the #433 (comment) by @davidhassell. I'm wondering though, if this will lead to a situation where we have dimensions which are not used by any variable or coordinate. This seems odd to me, although I guess it would be representable in netCDF.

Yes - it's possible that there could be a netCDF dimension that is not spanned by any variable's data array. That dimension would still be referenced by, at least, the list variable's compress attribute, so the trail can still be followed by an application.

However, I don't think we could have that in e.g. xarray, which might become a problem.

I have much empathy for that (having previously written code that deals with compression by gathering!), but if we're agreed that the compression-by-gathering approach meets the use case, then CF design principle 10 moves us towards not looking for alternative solutions.

10. Because all previous versions must generally continue to be supported in software for the sake of archived datasets, and in order to limit the complexity of the conventions, there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose (even if a different method would arguably be better than the existing one).

@davidhassell
Copy link
Contributor

Hi,

Thanks to everyone for contributing to this discussion! I think it would be very useful at this stage to collate the actual use cases that already exist. Then we can work towards satisfying these, and not worry about other cases yet. We will worry about extensibility as part of this process, but not at this stage.

So if you have, or plan to produce, data on a one of these grids, could you please describe it in these comments?

Thank you again,
David

@d70-t
Copy link
Contributor

d70-t commented Jan 7, 2025

We currently have produced (and will do so in future) data on HEALPix grids with:

  • nested order
  • nside=2^z (we call z the "zoom" level)
  • we create dense (i.e. global) datasets and sparse datasets (e.g. as in Convention for HEALPix grid parameters #433 (comment))
  • we routinely create hierarchies of datasets with different zoom levels (e.g. for z values 0...12)
    • likely out of scope for this PR: we still search for ways to encode the relation between multiple similar dataset (this is what we do now)
  • we only use the default reference longitude of 45°
  • we do not rotate the grid
  • we have data on spheres and in WGS84

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format
Projects
None yet
Development

No branches or pull requests