Standard way to define subsampled coordinates #37
@oceandatalab This has been discussed, but we haven't come to any agreement to move forward with any mapping scheme. We've also talked about allowing one netCDF file to reference coordinate variables in another file as a way to address redundancy in the use case where many files cover the same swath. The basic CF approach of connecting coordinates to data via dimensions makes it difficult to figure out how to solve this problem. If you have a robust and concrete suggestion of a way to handle the issues, I'd love to hear it.
Well, in order to make our code as reusable as possible, our strategy so far has been to homogenize the EO data we use as input files, which implies a pre-processing step to perform the conversion to a common format. Basically, for each data model we have:
For example, for gridded data with axes aligned with meridians and parallels, we have:
The dimensions of a data variable are (time, lat, lon):
Each tuple of ( This is the simplest case, but you can already see that it adds quite a lot of non-standard dimensions and variables. It also makes the file impossible to use with other mapping software because there are no
We are currently revisiting this approach to support more data models and are trying to remove rules that are specific to our tools by adapting our format specifications so that they follow well-defined standards as much as possible. In the example above, one possibility would be to rename This is only a minor improvement over the previous version, but it would avoid using a non-standard naming scheme for coordinate variables and allow software to detect that coordinates are not provided at full resolution.
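For illustration, here is a minimal netCDF4-python sketch of the kind of layout described above, with coordinates stored only at ground control points. All dimension and variable names are illustrative (the lat_gcp/lon_gcp naming is borrowed from later in this thread) and are not part of any convention.

```python
from netCDF4 import Dataset

with Dataset("swath_subsampled_sketch.nc", "w") as nc:
    # Full-resolution dimensions of the data itself.
    nc.createDimension("row", 1000)
    nc.createDimension("cell", 500)
    # Coarser dimensions at which the coordinates are actually stored (GCPs).
    nc.createDimension("row_gcp", 11)
    nc.createDimension("cell_gcp", 6)

    lat_gcp = nc.createVariable("lat_gcp", "f4", ("row_gcp", "cell_gcp"))
    lon_gcp = nc.createVariable("lon_gcp", "f4", ("row_gcp", "cell_gcp"))
    lat_gcp.standard_name = "latitude"
    lat_gcp.units = "degrees_north"
    lon_gcp.standard_name = "longitude"
    lon_gcp.units = "degrees_east"

    data = nc.createVariable("sigma0", "f4", ("row", "cell"))
    # This is exactly the sticking point: the coordinate variables do not
    # share the data variable's dimensions, so CF-aware software cannot
    # relate them to the data without additional, non-standard rules.
    data.coordinates = "lat_gcp lon_gcp"
```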
In my understanding the variable names themselves are of less importance than the attributes attached to them and providing metadata, so choosing a different name for This ties in with work driven by Aleksandar Jelenak - specifically the situation described in this figure. However, in that case the coordinates are still provided in full. I'd also be happy to have a well-described way of doing this as we'd all save a lot of bandwidth! In your example, how do you relate the data points in
Yes, Aleksandar and others also brought this up a number of times outside of the swath proposal. @ajelenak - Was there ever a proposal or discussion in Trac? Or was it mostly conversations?
@oceandatalab I see two weaknesses in the scheme you describe. There is no explicit declaration of how the different coordinate-type variables and dimensions relate to one another. There is also no specification of how to interpolate between GCP points. I think both of these issues would need to be addressed.
It never reached the level of serious discussion that would produce a proposal. NASA's HDF-EOS convention has supported this since its inception, and that's what we mentioned in our swath proposal. I think this issue is another indication of the need.
@erget - Thank you for the link, very interesting! We use GDAL to interpolate geolocated coordinates for the @JimBiardCics - Thank you for the feedback! We are not sure we understand your first comment: "how the different coordinate-type variables and dimensions relate to one another". Can you give us an example? As for the second comment, the interpolation method used to compute coordinates between GCPs depends on the accuracy you need to achieve and on the data model. For grids that can be described with a geotransform, using linear interpolation with the geotransform parameters is the obvious way. For swaths we currently use thin plate splines (via GDAL), but we have tested other methods as well (Akima, bivariate spline approximation over a rectangular mesh, etc.) and noticed that these methods may provide better accuracy or reduce computation time in some cases. So it would be possible to specify a recommended method, but it would not necessarily be optimal or suitable for the reader's constraints (accuracy and processing time).
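As a concrete illustration of one of the methods mentioned above (bivariate spline approximation over a rectangular mesh), here is a short SciPy sketch; the GCP spacing, grid sizes and synthetic coordinate values are made up for the example.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Full-resolution grid and the coarser GCP grid it was subsampled to.
n_rows, n_cells = 1000, 500
gcp_rows = np.arange(0, n_rows, 100)    # a GCP every 100 rows (illustrative)
gcp_cells = np.arange(0, n_cells, 100)  # a GCP every 100 cells (illustrative)

# Synthetic GCP coordinates; in practice these are read from the product file.
lat_gcp = 40.0 + 0.001 * gcp_rows[:, None] + 0.0002 * gcp_cells[None, :]
lon_gcp = -5.0 + 0.0005 * gcp_rows[:, None] + 0.001 * gcp_cells[None, :]

rows = np.arange(n_rows)
cells = np.arange(n_cells)

# Bivariate spline fitted on the rectangular GCP mesh, evaluated at every pixel.
lat_full = RectBivariateSpline(gcp_rows, gcp_cells, lat_gcp)(rows, cells)
lon_full = RectBivariateSpline(gcp_rows, gcp_cells, lon_gcp)(rows, cells)

# Caveat: interpolating longitudes directly misbehaves near the antimeridian
# and the poles - one reason why the best method depends on the data model.
print(lat_full.shape, lon_full.shape)  # (1000, 500) each
```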
@oceandatalab There will need to be an unambiguous way to declare exactly what the relationship is between the different dimensions in your example (lat, lon, lat_gcp, and lon_gcp). There will also need to be a way to specify the recommended interpolation method and a statement of the interpolation accuracy. Using an inappropriate interpolation method with some swath-type lat/lon fields can produce significantly incorrect results. I envision a variable akin to a grid_mapping variable that would hold all the above information in attributes. Such a variable might also hold data values that would correspond, for example, to rational function coefficients.
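To make this suggestion more tangible, here is a hypothetical netCDF4-python sketch of such a container variable; every attribute name below is invented for illustration and is not part of CF.

```python
from netCDF4 import Dataset

with Dataset("interpolation_container_sketch.nc", "w") as nc:
    nc.createDimension("row", 1000)
    nc.createDimension("cell", 500)
    data = nc.createVariable("sigma0", "f4", ("row", "cell"))

    # A scalar container variable, analogous to a grid_mapping variable,
    # declaring how GCP dimensions relate to data dimensions, plus the
    # recommended interpolation method and its accuracy.
    interp = nc.createVariable("coordinate_interpolation", "i4", ())
    interp.dimension_mapping = "row_gcp: row cell_gcp: cell"  # invented attribute
    interp.interpolation_method = "bilinear"                  # invented attribute
    interp.interpolation_accuracy = "1.0 m"                   # invented attribute
    # Such a variable could also hold numeric data, e.g. rational function
    # coefficients, instead of only string attributes.

    # The data variable would then reference the container variable.
    data.setncattr("coordinate_interpolation", "coordinate_interpolation")
```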
I agree with @ajelenak that this is an indication that there's a need to address this in CF. Hopefully next week I can take a look - @oceandatalab would you be interested in contributing to a corresponding proposal?
@erget I'd be up for working on this as well.
We are talking about a proposal to support "sampled" coordinates in general, not specific to satellite observations, correct?
@erget - Sure, we are really interested in participating in a proposal and glad to see that this issue is being pushed forward. Let us know how we can be of any help.
@ajelenak I'd say yes.
Good, that's the right approach.
From earlier discussions I have seen on this topic, it was never clear whether down-sampling lat/lon onto GCPs was clearly advantageous with respect to letting internal compression do the size reduction. Has such a trade-off study been conducted for several types of EO swath data? From a user point of view, having access to the lat/lon in the files is clearly an advantage. Also, I note that we should maybe not limit ourselves to subsampling coordinates. In the satellite world, what if we want to specify Sun/view angles on GCPs, and provide a routine to interpolate to each FoV? In such a case, the angles are not a coordinate but a regular field to interpolate from GCPs. Finally, several space missions already implement different downsampling strategies (e.g. MWI, AMSR2), so we should take a look at what they do. I would also be interested in contributing to such a proposal.
Any acceptable solution to this issue will need to be made with some reflection on the CF data model. If the solution amounts to "fancy compression", then the data model is likely unaffected, but other cases that have been alluded to (as I understand them) could possibly require an extension. At this stage, I think the discussion should carry on without worrying about this - it's too early in the process - but this is just a reminder for later. (Note that whilst extending the data model is certainly possible (as was done for simple geometries), it should be the last resort - unnecessary extensions will move the data model away from its minimalistic design principles and increase the chance of the data model not meeting future needs. If an acceptable solution can be found that fits in with the current model, then that is ideal.)
@TomLav - Several satellite observations already require the use of GCPs (Sentinel-1 SAR, Sentinel-3 SLSTR & OLCI), but the methods and formats all differ as there is no consensus at the moment. The idea here is to come up with recommendations so that good practice can be provided for cases wherein GCPs or subsampled coordinates/data are required.
This is a lot of great input, I'm glad there's so much interest. @TomLav - for our satellites we have indeed done trade-offs of using compression vs. GCPs and found that GCPs do reduce data volume. There might be a way of regularising these data so that they compress better, but in general reduction seems to be more effective than simple compression. @davidhassell - agreed, I think an optimal solution would be to implement this in a way that can be considered "fancy compression" rather than an extension to the data model. I think it might be worthwhile to have a telco on this issue - who's interested and from what time zones?
@erget - we would be interested in participating in such a teleconf; our time zone is CET (we switch to DST on March 29, so GMT + 1 until Saturday, then GMT + 2).
@erget UTC-4 (US Eastern Daylight Time)
@erget - I am the same as @oceandatalab.
@erget I am GMT until Saturday, then GMT + 1
@erget You say Apr 2-6 but the Doodle poll says Apr 9-13.
@erget Did I miss the meeting?
Yes. It didn't allow me to join. I'll try again.
The final "Join Meeting" button is grayed out.
I'll leave things in your collective capable hands and go walk my dog, who is standing here with her legs crossed.
Minutes of first discussion on subsampled coordinates in CF

Executive summary

We have collected different approaches that can be used in order to reduce data volumes in Earth Observation data (and, to some extent, to gridded data in general). There is interest and capacity to try to fit this cleanly into the CF Conventions. A follow-up meeting will be scheduled for the week from 6-10 April 2020. Please mark your availability in this Doodle. @erget will schedule the meeting on Monday, 2020-04-06 at 16 UTC.

Introduction

Purpose: Understand common practices for providing subsampled coordinates in a standardised way.

Present:
The utility of subsampled coordinates

The group agreed that subsampled coordinates can have several benefits:
This means that in some cases, subsampling coordinates can not only provide a description of data products that avoids unnecessary assumptions; it can also have large cost impacts on all users of the data products in question.

Known practices of providing only subsampled coordinates in data products

Three approaches were presented. I list them here in what I consider increasing complexity, not in the order in which they were introduced.

HDF EOS

@ajelenak provided an introduction to HDF EOS, which is specified for HDF, not netCDF data. The approach could however be transferred. In this approach, geolocation data is "thinned" and metadata is provided to let the user know what offsets and increments need to be applied in order to select applicable data from geolocation arrays and apply them to the longer observation data arrays. It is left to the user to interpolate coordinates, if desired, or to use another approach. He also noted that it is important to ensure that, if we adopt such an approach in CF, users are given enough information to recognise that coordinates are subsampled, and that it may be useful to provide guidance on how to interpolate between Ground Control Points (GCPs). GCPs were understood as the points at which geolocation data is provided; in the case of HDF EOS more observations are present than GCPs.

Providing offsets for channels with similar but non-identical view centres

@TomLav presented approaches used for conically-scanning passive microwave imagers. For such imagers, netCDF is not a common format, but this is changing with the advent of new missions. Two approaches were presented, both using the example of the AMSR2 instrument. The L1b products are provided with geolocation data only for one channel, whereas formulae are provided that allow users to reconstruct the view centre of every other channel for the same position in the data array. This is because the instrument observes multiple channels in a consistent geospatial relation to each other. Additionally, an L1R product is available, for which the individual views of each channel have been resampled in order to simulate having the same coordinates as the baseline channel. The individual values are of course changed by the interpolation process. It was also mentioned that the MWI instrument from EUMETSAT's EPS-SG mission will supply only tie-points; users are expected to interpolate between the tie-points in order to recover the coordinates of individual views. The coordinates are provided in Earth-Centred Earth-Fixed coordinates.

Recovery of coordinates via parametrised interpolation within tie-point zones

Anders (EUMETSAT) presented the approach used to compress VIIRS Sensor Data Records (SDRs) within the EUMETSAT Advanced Retransmission Service. As the VIIRS instrument uses an irregular pixel aggregation scheme, linear interpolation of coordinates is not accurate. The specific non-linear aspects of the interpolation change along the instrument's scan, meaning that different terms must be applied at different points in the SDR. To solve this problem, VIIRS SDRs are divided into Tie Point Zones (TPZs), for which the corner coordinates are provided for groups of pixels. The centre coordinates for all pixels within a TPZ can then be interpolated given the interpolation correction factors that are encoded into the product as metadata. These factors are applied to all TPZs within a given TPZ group; there may be several TPZ groups within a product.
Multiple implementations of this approach exist, including in the FOSS world (see PyTroll). Reducing the coordinate data in this way roundly beats classical compression. Compression and data reduction are of course not mutually exclusive. It was also noted that the coordinates are interpreted on the ellipsoid; no terrain correction is applied, which matters because of the parallax effect, relevant for this and other instruments. The full algorithm and the encoding parameters are described in the

Way forward

A follow-up meeting is planned for next week. These topics will be pursued:
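As a rough sketch of the "thinning" idea described for HDF EOS (offsets and increments aligning tie points with the full observation array), the snippet below reconstructs full-resolution latitudes from synthetic tie points with linear interpolation; whether linear interpolation is accurate enough depends on the instrument, as the VIIRS discussion above makes clear.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Geolocation stored only at every 8th row and every 16th cell, starting at
# offset 0 (all values made up; HDF EOS records such offsets/increments as
# metadata so readers can align the thinned arrays with the data arrays).
row_offset, row_increment = 0, 8
cell_offset, cell_increment = 0, 16
n_rows, n_cells = 768, 3200                  # full-resolution data shape

tie_rows = np.arange(row_offset, n_rows, row_increment)
tie_cells = np.arange(cell_offset, n_cells, cell_increment)

# Thinned latitude array as it would be read from the file (synthetic here).
lat_tie = 26.0 + 0.01 * tie_rows[:, None] + 0.001 * tie_cells[None, :]

# One simple reconstruction: (bi)linear interpolation back onto the full grid,
# extrapolating at the edges beyond the last tie point.
interp = RegularGridInterpolator((tie_rows, tie_cells), lat_tie,
                                 bounds_error=False, fill_value=None)
full_rows, full_cells = np.meshgrid(np.arange(n_rows), np.arange(n_cells),
                                    indexing="ij")
lat_full = interp(np.stack([full_rows, full_cells], axis=-1))
print(lat_full.shape)  # (768, 3200)
```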
Just to provide more details about this part of the discussion, there were two questions:
Thank you for clarifying, I agree with your text.

Hi All, just to add to the minutes that interpolation of tie point bounds has been added to the concept. The fact that we remove the tie point offset has made it possible to add the bounds in a simple and elegant way.
Minutes of the 19th discussion on subsampled coordinates in CF - 15 December 2020

Present: @davidhassell @oceandatalab (Sylvain and Lucile) @ajelenak @AndersMS

@AndersMS had prepared the first draft elements of Appendix J, intended to document the details of the standard interpolation methods. It is available here as appj.adoc for review. As not everybody had had a chance to review it, @AndersMS provided a brief overview of the appendix, and it was agreed that everybody would review it before the next meeting in January 2021. Anders will notify everybody when a stable version is available for review.

A number of detailed topics were brought up, helping to consolidate the overall approach:

The topic of missing coordinate data values was discussed. Missing data is addressed here in the CF conventions document. It was agreed to follow the principle that if any data required to reconstitute an (auxiliary) coordinate variable is marked as missing, then that coordinate variable will be marked as missing for the whole interpolation zone. (Post meeting: note that netCDF does not permit coordinate variables with missing data.)

It was agreed that all coordinate variables that are to be interpolated by the same interpolation variable, as defined in the tie_points attribute, must have identical ordering of their dimensions.

It was agreed that in the general case, no mapping between the interpolation dimensions and the internal dimensions of the interpolation method is provided, requiring the interpolation methods to 'be symmetric' and provide identical results for any ordering of the interpolation dimensions. Should there be a future need for interpolation methods not meeting this requirement, the specific method must state a particular ordering of the dimensions as a constraint on its usage.

Based on a review of the Example 8.5 for VIIRS, it was noted that no mapping between the data variable interpolation dimension

where the correct mapping for

Actions from this time
Dear All, when updating Example 8.5, I came to the conclusion that we need to permit multiple mappings for the same interpolation dimensions, both for the

Please let me know if you agree.
Minutes of the 20th discussion on subsampled coordinates in CF

Present: @davidhassell @oceandatalab (Sylvain) @ajelenak @AndersMS @erget

Actions from last time

None.

State of play & way forward

Drafts for the Appendix and chapter 8 are underway; conformance document changes are yet to be proposed. Chapter 8 will be ready for review within the next few days; all present have time and are willing to review before the next meeting in 2 weeks. The appendix and conformance document will be reviewed subsequently. No input has been received from @TomLav regarding supersampling, so we will drop the approach from the proposal. This makes the text simpler to propose and to review. Supersampling could always be re-introduced as a future enhancement if needed. @davidhassell and @ajelenak will collaborate on the drafting of the updates to the conformance document. It was agreed not to include MODIS examples, as these are in HDF4 and thus not straightforward to port to netCDF (and hence CDL), and the format is not expected to change during the MODIS mission. @AndersMS provided an explanation of the interpolation accuracy being a function of the number of tie point zones. It was agreed to provide a short text explaining this as a decision on the part of the data producer.
Minutes of the 21st discussion on subsampled coordinates in CF, 26 January 2021

Present: @davidhassell

The meeting had two agenda points:

Team review of the consolidated text prepared for inclusion as section 8.3 in the CF Conventions document

The comments made in the Review Pull Request erget/cf-conventions#1 were discussed. The comments were generally agreed upon, with two points requiring clarification; see actions 1 and 2 below. It was confirmed that the proposed glossary would be in part additions to the CF Conventions section 1.3, Terminology, and in part additions to Appendix A: Attributes. The Review Pull Request remains open until the next meeting in two weeks for further review and comments.

Introduction to the draft Appendix J under preparation to complement section 8.3 with descriptions of actual interpolation methods

@AndersMS presented the current version of Appendix J, contained in the Review Pull Request referenced above. It was agreed that @AndersMS would notify the team in the coming days when the draft is ready for a first review. It would then be discussed in more detail at the next meeting.

List of Actions:
@AndersMS thanks for these minutes! I've resolved the comment referenced in action (2).
Minutes of the 22nd discussion on subsampled coordinates in CF

Present: @davidhassell @oceandatalab (Sylvain) @ajelenak @AndersMS @erget

The progress on the Appendix was reviewed. We are nearing convergence on the document and can address outstanding (minor) issues at the next meeting. In the meantime, work on the Conformance document will begin. All will review the updated Appendix draft being prepared by @AndersMS.
Minutes of the 23rd discussion on subsampled coordinates in CF, 9 March 2021

Present: @davidhassell

The meeting had three agenda points, all relating to erget/cf-conventions#1:
- Review of improvements to the method descriptions in Appendix J introduced by @AndersMS
- Review of proposed conformance rules as distributed by @davidhassell
- Identification of completed and outstanding tasks

Completed tasks:
Outstanding tasks:
Minutes of the 25th discussion on subsampled coordinates in CF, 9 March 2021

Present: @davidhassell

The meeting had four agenda points:
- Work is progressing with the verification of the Biquadratic Interpolation of geographic coordinates method: coding and checking of the accuracy of reconstituted coordinates. Results will be presented at the next meeting (@AndersMS)
- Issue identified with sharing of tie points between grids of different spatial resolution, see description:
- Standard name for reflectance, see description:
- Figure 3 in Chapter 8:
Minutes of 26th discussion on subsampled coordinates

Present: @davidhassell @ajelenak @oceandatalab @AndersMS @erget

Items from last meeting

Addressed in this meeting

@AndersMS proposes restructuring chapter 8 to introduce the tie point dimensions/indices terms at more appropriate points in the text. Revisions agreed.

Pending for next time

Intent: Next session, send off the first draft to the CF Community!
Dear All, just to update you on the Bug analysis of interpolation errors discussed yesterday. The problem was related to the handling of interpolation zones with an odd number of coordinate points. After changing the formula in Appendix J for Biquadratic Interpolation of geographic coordinates from
Dear All, Just to update you on the Profiling results discussed yesterday. It turned out that the code spent excessive time on memory allocation. After resolving this, the code now executes in
for a single VIIRS M-Band granule, when excluding all file read and write operations from the benchmarking. So it is certainly worth keeping both methods. A VIIRS M-Band granule contains around 85 s of observations, or 2,457,600 coordinate points, so the execution time is quite impressive. The VIIRS M-Band granule used for the benchmark contained observations between 26 and 36 degrees North. The mean position error in the reconstituted data was

which is very promising. I will do some further benchmarking for a data set covering the polar region. Anders
Hi @AndersMS, Great news on your bug finding. I agree that a factor of 6 makes it worth keeping both methods (I can't remember - need to check in Appendix J - if it's now clear that you can bypass some of the calculations if interpolating directly in latitude/longitude. Have I got that right?). I presume that an error of ~1 m is well within the acceptable tolerance? David
Yes, the algorithm in Appendix J has branches for each of the two cases. Interpolating directly in latitude/longitude requires no trigonometric function calls, apart from what is needed to initially convert the interpolation coefficients. That's the main reason for the performance difference, as the trigonometric function calls are expensive. The ~1 m is well within the acceptable tolerance for applications of this type of data. For comparison, the pixel size is around 750 m x 750 m at the sub-satellite point, and the interpolation zone varies from around 12 km x 12 km at the sub-satellite point to around 25 km x 25 km at the swath edge. Another reference is the fact that latitude and longitude are stored in Float32 in both the original and the compact product files. Float32 has 24 bits of significand precision. At the scale of the Earth radius of 6,371 km, that gives a precision of 0.4 m, so we are quite close to that limit. As an experiment, we could see what would happen if we did the compression and expansion calculations in Float64, still with the latitude and longitude stored in Float32 in both the original and the compact product files. That could bring a small improvement, but execution would likely be slower. Anders
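A quick back-of-the-envelope check of the Float32 precision figure quoted above:

```python
import numpy as np

earth_radius_m = 6_371_000.0

# Float32 carries a 24-bit significand, i.e. a relative rounding error of
# roughly 2**-24; at the scale of the Earth's radius that is a few tenths
# of a metre, consistent with the ~0.4 m mentioned above.
print(2.0 ** -24 * earth_radius_m)             # ~0.38 m

# Spacing between adjacent representable float32 values of that magnitude:
print(np.spacing(np.float32(earth_radius_m)))  # 0.5 m
```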
Minutes of 27th discussion on subsampled coordinates

Present: @davidhassell @ajelenak @oceandatalab @AndersMS @erget

Items from last meeting

Intent: Next session, send off the first draft to the CF Community!

Addressed in this meeting

@davidhassell noted that the 1.9 release is impending.

Pending for next time
Minutes of 28th discussion on subsampled coordinates

Present: @davidhassell @AndersMS @erget @ajelenak

Items from last meeting

Addressed in this meeting
Pending for next time
Dear All, I have updated the verification software to do all internal calculations in 64 bit floating point rather than 32 bit floating point, while maintaining all lat/lon in the netCDF files as 32 bit floating point. The change made the structure and the error outliers that we saw during the meeting today disappear; see the error plot below. The errors are now in the range 0.22 m - 2.45 m, with an average value of 0.847 m. The software execution time only went up marginally, on the order of 10%. Anders
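A minimal sketch of the pattern described here - promoting the stored 32-bit values to 64-bit for all intermediate arithmetic and only rounding back to 32-bit on output. The array shapes and the placeholder "interpolation" step are purely illustrative.

```python
import numpy as np

# Tie-point latitudes as stored in the compact file: 32-bit floats.
lat_tp_f32 = (np.linspace(26.0, 36.0, 17, dtype=np.float32)[:, None]
              * np.ones((1, 9), dtype=np.float32))

# Promote to float64 before doing any arithmetic.
lat_tp = lat_tp_f32.astype(np.float64)

# ... the actual interpolation would run here, entirely in float64 ...
lat_full = np.repeat(np.repeat(lat_tp, 16, axis=0), 16, axis=1)  # placeholder

# Only the values written back to the expanded file are rounded to float32.
lat_full_f32 = lat_full.astype(np.float32)
print(lat_full_f32.dtype, lat_full_f32.shape)
```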
Hi @AndersMS, that looks like good confirmation of what we hoped! Thanks.

Hi @davidhassell, yes, I agree :-) I would suggest that we specify in Appendix J that internal calculations must be carried out in 64 bit floating point - either generally, or stated for each interpolation method where it would be required. What do you think? Anders

Hi @AndersMS, I'd be OK with stating that (i.e. internal calculations must be carried out in 64 bit floating point) in Appendix J. I think a general statement at the top would suffice. Since the precision of arithmetic is demonstrably of importance, and we explicitly say that the user should be able to recreate the coordinates as intended by the creator, we really have to say something about this precision. An interpolation variable attribute-based solution that satisfactorily provides this, so that the creator can set it to one of a number of possible values, is nice in principle, but A) I don't think there is a use case for this flexibility and B) any such method I can think of has at least one drawback. So that avenue is probably best left alone at this time. Thanks,
Hi @erget and @davidhassell, I have added the sentence "All floating point calculations defined in this appendix must be carried out in 64 bit floating point, even if the related coordinates and interpolation parameters are stored in the netCDF files with a lower precision." in the first section of Appendix J. I fully agree with @erget that even the precision we obtain with internal calculations in 32 bit floating point is well above what is required for practical purposes. However, to meet the aim stated in our first section of chapter 8 - "The metadata that define the interpolation formula and its inputs are complete, so that the results of the coordinate reconstitution process are well defined and of a predictable accuracy." - it is correct to have the requirement for 64 bit floating point. On a related topic, after the meeting yesterday I exercised the algorithm with different sizes of the interpolation zones. By default I have used 16x16 coordinate points, but now I have also tried 4x4, 8x8, 16x32 and 16x64. With internal calculations in 32 bit floating point, 4x4 and 8x8 failed with a division by zero due to rounding to 32 bit floating point. However, with internal calculations in 64 bit floating point it worked correctly and with high accuracy. Nevertheless, I will consider whether we can improve this behavior: degraded accuracy might be acceptable, division by zero is not. Cheers Anders

I'm not comfortable that CF should specify the precision of data stored or the precision used in manipulating that data. Isn't that a scientific judgement that depends on application? In particular, would 64 bit arithmetic be sufficient under all circumstances? I confess I haven't been following the details of this proposal, but if the original data is 32 bit and rounding of some operation leads to 0 difference, isn't 0 the correct answer (within the precision of the data)? If 0 is subsequently used as the divisor in an operation, rather than some (arbitrary) small number yielded by a 64 bit operation, then I think there must be a flaw in the algorithm.
Hi Karl, Not off base at all - your comment reflects some of the conversations we have been having offline, and it's good to air them here. Where I am (and this may not be the view of others) is that, given that the compression is lossy, the extent of the loss should be able to be determined by the data creator, and that can only be the case if the precision of arithmetic used in the interpolation algorithm is somehow prescribed. I see this as a similar situation to packing, where the accuracy is determined by the data type of the stored values and the scale_factor and add_offset attributes. Ah! Writing about the data type packing attributes reminds me that I said an interpolation variable attribute approach to define the precision would be ideal, but had sufficient drawbacks (hence my endorsement of the general statement on precision). However, I hadn't thought of using the data type of such an attribute's numeric value to set it, just like data type In this case, if the precision is left undefined, the user uses what they like. If the creator has said "use double" precision, the user should do so. A
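A hypothetical sketch of the packing-style idea just described: letting the data type of a numeric attribute on the interpolation variable signal the arithmetic precision, much as the types of scale_factor/add_offset govern unpacking. The attribute name below is invented for the example.

```python
import numpy as np
from netCDF4 import Dataset

# Writer side: the attribute's *data type* (not its value) declares the
# precision the creator expects for the interpolation arithmetic.
with Dataset("precision_sketch.nc", "w") as nc:
    interp = nc.createVariable("interpolation", "i4", ())
    interp.setncattr("computational_precision", np.float64(0))  # invented name

# Reader side: derive the working dtype from the attribute's type, falling
# back to the user's own choice if the attribute is absent.
with Dataset("precision_sketch.nc") as nc:
    var = nc["interpolation"]
    try:
        work_dtype = np.asarray(var.getncattr("computational_precision")).dtype
    except AttributeError:
        work_dtype = np.dtype(np.float64)  # undefined: user's choice
    print("carry out interpolation in", work_dtype)
```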
This issue has now resulted in a full proposal that is being pursued in cf-convention/cf-conventions#327 |
@oceandatalab would you object to closing this issue in favour of cf-convention/cf-conventions#327 in order to avoid confusion? |
No objection, the discussion continues in cf-convention/cf-conventions#327 |
At OceanDataLab we process a wide variety of EO data (satellite, in-situ, model) on a daily basis and develop tools to analyse and visualise these data.
Frame of the question
The resolution of sensors has steadily increased in recent years, with each new generation of instruments producing data at a higher resolution. As finer and finer scales are now observable, more and more storage is also required to save these data.
Most remote sensing data are also in the geometry of the sensor, meaning that coordinates need to be provided along with each acquisition. So, for satellites, data at higher resolution also means more coordinates to store, download, load into memory, etc. In order to reduce the size overhead induced by these coordinates, a solution which seems to be gaining traction consists of providing coordinates only for a subset of cells and relying on interpolation to compute the coordinates of the other cells of the data matrix. This is what has been done for Sentinel-1 (using ground control points) and Sentinel-3 data, for example, and it is very likely that this approach will become prevalent in the future.
Question
NetCDF is the main file format for satellite data, but the CF convention does not provide a standard way to define coordinate arrays or matrices at a resolution that does not match the resolution of the data. This is a major issue because it is impossible to develop generic software for reading these data, as each satellite mission implements this method in a different way: users have to adapt to each satellite product in order to reconstruct the coordinates. This is a barrier for many users and a real issue for the exploitation of satellite data.
Has this problem already been discussed and are there plans to standardize the mapping of data matrices with coordinates provided at a different resolution?