Improve multi-ms functionality #121
Comments
I understand that your perspective here is from a performance POV, especially in light of casacore/casacore#1038 and casacore/python-casacore#209. However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MSs as a single monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.
I'm concerned that this request is asking dask-ms to reverse engineer the intended virtual interface provided by CASA Tables in order to resolve performance problems. This would likely involve reproducing a substantial amount of internal CASA Table logic, and reproducing it correctly. Put differently, building a multi-MS is baking a cake, and unbaking the cake would be a Herculean task. I'd be interested in seeing the casacore maintainers' response to casacore/python-casacore#209.
/cc @IanHeywood who was also interested in multi-MSs
I don't 100% agree on this point. The aim was to allow MMS-aware CASA tasks to make use of parallelism, whilst everything else could treat it as a monolithic MS. See this PDF on the subject. Essentially, my request would be to make dask-ms MMS-aware. I have tried an application-end test (very ugly, not perfect, and almost certainly with some threading-related things to work out) and see a decent (roughly 3x) improvement in performance by reading in the sub-MSs directly. I don't really believe that this even constitutes unbaking the cake -- all it means is that a given row range lies in a specific sub-MS. Whether the backing getcol points at the sub-MS or the monolithic MS seems like a relatively minor detail. Of course, I know that it is not that simple, and keeping track of all the sub-MS tables could get difficult. This is by no means a requirement; I just thought it was worth putting my thoughts together in an issue.
The performance factor is best taken with a pinch of salt, as I think there are some discrepancies in my tests, but there is definitely an improvement.
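For illustration, here is a minimal sketch of the kind of application-end, direct sub-MS read described above. It assumes the CASA convention that a multi-MS stores its partitions as ordinary MSs under a SUBMSS/ subdirectory; the xds_from_mms helper name and the example paths are hypothetical, not part of dask-ms.

```python
# Sketch: read the sub-MSs of a multi-MS directly with dask-ms, bypassing
# the concatenated (monolithic) view. Assumes the CASA multi-MS layout of
# a SUBMSS/ subdirectory containing one ordinary MS per partition.
import os
from itertools import chain

from daskms import xds_from_ms


def xds_from_mms(mms_path):
    """Hypothetical helper: one flat list of datasets across all sub-MSs."""
    submss_dir = os.path.join(mms_path, "SUBMSS")
    sub_mss = sorted(
        os.path.join(submss_dir, name)
        for name in os.listdir(submss_dir)
        if name.endswith(".ms")
    )
    # xds_from_ms returns a list of datasets per MS, so flatten the lists
    return list(chain.from_iterable(xds_from_ms(sub) for sub in sub_mss))


datasets = xds_from_mms("obs.mms")  # hypothetical multi-MS path
```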
Ah, thanks for pointing that out -- I haven't worked with an MMS much, so my experience of them is limited. I guess something like the following is currently possible (although python-casacore and the GIL are still an issue...):

```python
from itertools import chain
from daskms import xds_from_ms

mms = ["ms01.ms", "ms02.ms", "ms03.ms"]
datasets = list(chain.from_iterable(map(xds_from_ms, mms)))
```

What exactly distinguishes a normal MS from a multi-MS?
Yes, let's keep the discussion going.
No need to reply, I'm reading the PDF which seems to describe it well.
Hi everyone, is there any news on this? I am also interested in using MMS files with dask-ms. Although, since dask-ms already groups rows by FIELD_ID and DATA_DESC_ID, I think this is not much different from using an MMS file. The "partitioned" dask-ms datasets can also be processed in parallel, so what would be the big difference between reading an MMS file and reading a non-MMS file with dask-ms?
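For context, a minimal sketch of the grouping behaviour this comment refers to; the group_cols argument is part of the dask-ms API, but the column selection and path here are illustrative only.

```python
# Sketch: dask-ms partitions a single monolithic MS into one dataset per
# (FIELD_ID, DATA_DESC_ID) group; each dataset wraps lazy dask arrays that
# can be computed in parallel.
from daskms import xds_from_ms

datasets = xds_from_ms(
    "obs.ms",  # hypothetical path
    columns=["TIME", "ANTENNA1", "ANTENNA2", "DATA"],
    group_cols=["FIELD_ID", "DATA_DESC_ID"],
)

for ds in datasets:
    # One dataset per grouping of rows in the main table
    print(ds)
```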
Hi @miguelcarcamov. I think the issue here is that:
I've done some work on a new pybind11 wrapper for CASA Tables to solve the above issues, but haven't touched it recently: https://github.com/ratt-ru/futurecasapy. In the longer term, we're exploring newer formats for the Measurement Set v{2,3} specification. Earlier this year, zarr and parquet support for CASA Table-like data was added to the dask-ms master branch. These have proven highly performant on a supercomputer -- orders of magnitude faster than the current CASA Table system, although it is possible that the latter wasn't optimally compiled.
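As a rough illustration of the zarr route mentioned above -- the daskms.experimental.zarr module path and function names are my assumption from the dask-ms master branch at the time, so check the current dask-ms docs:

```python
# Sketch: convert a CASA MS to the experimental zarr-backed format and
# read it back. Paths are hypothetical.
import dask

from daskms import xds_from_ms
from daskms.experimental.zarr import xds_from_zarr, xds_to_zarr

datasets = xds_from_ms("obs.ms")            # read the CASA table
writes = xds_to_zarr(datasets, "obs.zarr")  # build a lazy write graph
dask.compute(writes)                        # execute the conversion

# Subsequent reads go through zarr rather than the CASA Table system
zarr_datasets = xds_from_zarr("obs.zarr")
```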
Description
This is very much a non-urgent, wish-list feature. As it stands, dask-ms treats a multi-MS as a single monolithic MS. This means it is interpreted as having only a single main table, which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the multi-MS and handled some of its intricacies for the user. The reason I believe this belongs in dask-ms is that doing it in an application requires an additional layer of pyrap.table calls to disentangle the internal partitioning, which just seems error-prone and messy on the user end. That being said, it is possible to do this on the application side, so if that is going to be the paradigm, I can make it happen.
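To make that "additional layer of pyrap.table calls" concrete, here is one hedged sketch of what an application might have to do itself to disentangle the partitioning: map each sub-MS to the row range it would occupy in the monolithic view. The SUBMSS/ layout and the assumption that sub-MSs are concatenated in lexicographic order are mine, not something dask-ms or casacore guarantees.

```python
# Sketch: per-sub-MS row ranges of a multi-MS, computed with plain
# python-casacore table calls (pyrap.tables is the compatibility alias).
# Assumes the SUBMSS/ layout and lexicographic concatenation order.
import os

from pyrap.tables import table


def subms_row_ranges(mms_path):
    """Hypothetical helper returning (sub_ms_path, start_row, end_row) tuples."""
    submss_dir = os.path.join(mms_path, "SUBMSS")
    ranges = []
    start = 0
    for name in sorted(os.listdir(submss_dir)):
        subms = os.path.join(submss_dir, name)
        t = table(subms, ack=False)
        nrows = t.nrows()
        t.close()
        ranges.append((subms, start, start + nrows))
        start += nrows
    return ranges
```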