Experimental clean-up release #285
Comments
It all looks good to me! Will gladly help out where necessary.
I've been thinking a bit about xarray as a first class dependency and the fact that xarray doesn't support nan chunks. Here's the xarray issue:

In the case of standard averaging this will require reification of

```python
def row_mapper(time, interval, antenna1, antenna2,
               flag_row=None, time_bin_secs=1):
```

In the more complex BDA case this requires reification of

```python
def bda_mapper(time, interval, ant1, ant2, uvw, chan_width, chan_freq,
               max_uvw_dist, flag_row=None, max_fov=3.0,
               decorrelation=0.98,
               time_bin_secs=None,
               min_nchan=1):
```

I can broadly see two approaches.

**Place reified coordinates on Datasets**

The obvious way of handling this (which I've avoided up until now due to memory concerns) would be to make the following fully reified coordinates on the resultant xarray datasets:
Pros:
Cons:
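As an aside, here is a minimal illustration of the nan-chunk behaviour referenced at the top of this comment. It uses only dask and a toy array, not dask-ms output:

```python
# Boolean selection produces chunks whose sizes are unknown (nan) until
# compute time, which xarray does not support.
import dask.array as da

x = da.arange(10, chunks=5)
y = x[x > 3]                    # row selection of unknown length
print(y.chunks)                 # ((nan, nan),)

# Reifying the chunk sizes requires a compute, after which the array can
# safely back an xarray variable or coordinate.
y = y.compute_chunk_sizes()
print(y.chunks)                 # ((1, 5),)
```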
**Compute chunk sizes in the indexing columns as a pre-compute step**

We tend to do this in our applications in any case (@JSKenyon in QuartiCal, myself in xova, @landmanbester ?). A rough sketch of this step follows the pros/cons below.

Pros:
Cons:
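A rough sketch of the two-step read, assuming a hypothetical MS at "test.ms" with a single partition, rows ordered by TIME, and that an explicit per-row chunk tuple is accepted here (the pattern used downstream):

```python
import numpy as np
from daskms import xds_from_ms

# Step 1: read and reify only the indexing column.
(index_ds,) = xds_from_ms("test.ms", columns=["TIME"])
time = index_ds.TIME.data.compute()

# Step 2: derive row chunks aligned with unique timestep boundaries
# (one timestep per chunk here; real applications aggregate several).
_, counts = np.unique(time, return_counts=True)
row_chunks = tuple(int(c) for c in counts)

# Step 3: re-open the data with the derived chunking.
datasets = xds_from_ms("test.ms", chunks={"row": row_chunks})
```

The obvious cost is an extra pass over the indexing column(s) before the real graph is constructed.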
Yup, I read the indexing columns to set things up as well. It would be nice to have them reified. Maybe don't go as far as reifying uvw's, but having access to time and frequency without calling compute is very handy.
Wholeheartedly agree with making it a first class dependency. The nan chunk problem is irritating but as you mentioned, there are ways around it.
This is certainly possible. As you say, it will make things more memory-intensive in the driver, but also gives us access to a fair amount of powerful xarray functionality.
I think if we add this functionality, we should do it in such a way that a user may optionally specify which elements are coordinates and which ones should be read/exposed. That way you don't get reified UVW unless you need it.
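Purely as a sketch of such an interface; the `coordinates` argument below is hypothetical and does not exist in dask-ms today:

```python
from daskms import xds_from_ms

datasets = xds_from_ms(
    "test.ms",  # hypothetical path
    # Only the requested indexing columns would be eagerly read and attached
    # as reified coordinates; UVW and the data columns stay lazy.
    coordinates=["TIME", "ANTENNA1", "ANTENNA2"],
)
```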
I would say that I want to seriously discourage bolting every indexing column into the coordinates; I think that we need some sort of mechanism for getting what we need in an application-specific way. That said, even with all of the above, it is still probably manageable for current problem sizes (although I agree that it could become a problem). Some of the above could be compressed substantially in memory if there was RLE or something of the kind in the xarray coordinates (not sure if this exists/could exist).
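To illustrate how compressible such coordinates would be, here is a small run-length encoding sketch in plain numpy (as far as I know xarray has no built-in RLE support):

```python
import numpy as np

def rle_encode(values):
    """Return (run_values, run_lengths) for a 1D array."""
    change = np.r_[True, values[1:] != values[:-1]]
    starts = np.flatnonzero(change)
    lengths = np.diff(np.r_[starts, values.size])
    return values[starts], lengths

def rle_decode(run_values, run_lengths):
    return np.repeat(run_values, run_lengths)

# e.g. 3 timesteps x 1000 baselines: 3000 values compress to 3 runs
time = np.repeat([5021.1, 5029.1, 5037.1], 1000)
vals, lengths = rle_encode(time)
assert np.array_equal(rle_decode(vals, lengths), time)
```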
Just reiterating that this gives us a great deal more power in xarray land.
There will still be a disconnect in the chunking problem, i.e. users would still need to use the time and channel information to decide on appropriate chunks and then do another read.
I don't really see the indexing columns ever becoming a problem relative to the actual data. I guess one can argue that the driver may end up with a fair amount of data in memory when doing graph construction. Apologies, I think that this may be a difficult reply to read due to all the quoting. I do think that there are two separate problems to consider here.

The first is that we currently limit xarray functionality due to the absence of true coordinate values. This could be solved by allowing users to optionally request certain data_vars (columns) as coordinates during the read. The tricky part is how we attach those coordinates to the other data_vars, e.g. if we require both CHAN_FREQ and CHAN_WIDTH, which one is the proper coordinate? We could also opt to establish a CHAN_ID which would be a global integer index over all channels in a spectral window (or optionally over all spectral windows?). Then each CHAN_ID could be associated with a CHAN_FREQ and CHAN_WIDTH. This would pave the way for more easily splitting and combining datasets in frequency.

The second problem is the chunking. The reason that many of our applications have this two-step process is that dask-ms is not currently aware of the way in which we typically chunk our data, i.e. in blocks of complete time and channel. If dask-ms had a more elaborate chunking mechanism, it may be possible to skip this step in user code, e.g. in
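Here is a small sketch of the xarray functionality that true coordinates would unlock, using a toy in-memory dataset; the CHAN_ID coordinate is the suggested global channel index, not an existing dask-ms feature:

```python
import numpy as np
import xarray as xr

nchan = 64
chan_freq = np.linspace(856e6, 1712e6, nchan)

ds = xr.Dataset(
    {"DATA": (("row", "chan", "corr"), np.zeros((10, nchan, 4), complex))},
    coords={
        "CHAN_FREQ": ("chan", chan_freq),
        "CHAN_ID": ("chan", np.arange(nchan)),
    },
)

# Selection and splitting in frequency no longer require user-side compute.
low = ds.isel(chan=np.flatnonzero(chan_freq < 1.2e9))
halves = [ds.isel(chan=s) for s in (slice(0, 32), slice(32, None))]
recombined = xr.concat(halves, dim="chan")
```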
Another point in favour of reified coordinates/extending the ROWID behaviour to other common axes is that we could support partial updates (see #93). This would make writing data after selection much easier (at present it requires a fair amount of wrangling downstream).
Description
We've been experimenting with a number of features in dask-ms for a while now. Now that we better understand how these features operate in a distributed environment, it's worth consolidating them into a list for a release. This may involve breaking some experimental functionality and, hopefully, less of the existing non-experimental functionality.
TODO

- … `open_zarr` and `to_zarr` methods
- … `__daskms_metadata__`. Metadata such as the following should be placed within it: … (a rough sketch of the single-attribute idea follows this list)
- … `xds_to_table` handled writes to any CASA table, while `xds_from_ms` existed to interpret a CASA table (and subtables) as an MS-specific …, adding appropriate coordinates to the MS columns. … `xds_{from/to}_ms` accessors.
- … `python-casacore` for https://github.com/ratt-ru/arcae for CASA table access
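Purely illustrative sketch of the single-attribute idea; the keys below are hypothetical and the actual layout is not defined here:

```python
# Consolidate dask-ms specific metadata under one well-known dataset
# attribute instead of many ad hoc attrs. Keys shown are hypothetical.
import xarray as xr

ds = xr.Dataset()
ds.attrs["__daskms_metadata__"] = {
    "format": "casa",                                      # hypothetical key
    "partition": [["FIELD_ID", 0], ["DATA_DESC_ID", 0]],   # hypothetical key
}
```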