Improve multi-ms functionality #121
Comments
I understand that your perspective here is from a performance POV, especially in light of casacore/casacore#1038 and casacore/python-casacore#209. However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MSs as a single monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.
I'm concerned that this request is asking dask-ms to reverse engineer the intended virtual interface provided by CASA Tables in order to resolve performance problems. This would likely involve reproducing a substantial amount of internal CASA Table logic, and reproducing it correctly. Put differently, building a multi-MS is baking a cake, and unbaking the cake would be a Herculean task. I'd be interested in seeing the casacore maintainers' response to casacore/python-casacore#209.
/cc @IanHeywood who was also interested in multi-MSs
I don't 100% agree on this point. The aim was to allow MMS-aware CASA tasks to make use of parallelism, whilst everything else could treat it as a monolithic MS. See this PDF on the subject. Essentially, my request would be to make dask-ms MMS-aware. I have tried an application-end test (very ugly, not perfect, and almost certainly with some threading-related things to work out) and see a decent (roughly 3x) improvement in performance by reading in the sub-MSs directly. I don't really believe that this even constitutes unbaking the cake -- all it means is that a given row range lies in a specific sub-MS. Whether the backing getcol points at the sub-MS or the monolithic MS seems like a relatively minor detail. Of course, I know that it is not that simple, and keeping track of all the sub-MS tables could get difficult. This is by no means a requirement; I just thought it was worth putting my thoughts together in an issue.
The performance factor is best taken with a pinch of salt, as I think there are some discrepancies in my tests, but there is definitely an improvement.
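For illustration, here is a minimal sketch of the kind of application-end, direct sub-MS read described above. It assumes the CASA convention that a multi-MS stores its partitions as ordinary MSs under a SUBMSS/ subdirectory; the xds_from_mms helper name and the example paths are hypothetical, not part of dask-ms.

```python
# Sketch: read the sub-MSs of a multi-MS directly with dask-ms, bypassing
# the concatenated (monolithic) view. Assumes the CASA multi-MS layout of
# a SUBMSS/ subdirectory containing one ordinary MS per partition.
import os
from itertools import chain

from daskms import xds_from_ms


def xds_from_mms(mms_path):
    """Hypothetical helper: one flat list of datasets across all sub-MSs."""
    submss_dir = os.path.join(mms_path, "SUBMSS")
    sub_mss = sorted(
        os.path.join(submss_dir, name)
        for name in os.listdir(submss_dir)
        if name.endswith(".ms")
    )
    # xds_from_ms returns a list of datasets per MS, so flatten the lists
    return list(chain.from_iterable(xds_from_ms(sub) for sub in sub_mss))


datasets = xds_from_mms("obs.mms")  # hypothetical multi-MS path
```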
Ah, thanks for pointing that out -- I haven't worked with an MMS much, so my experience of them is limited. I guess something like the following is currently possible (although python-casacore and the GIL are still an issue...):

```python
from itertools import chain
from daskms import xds_from_ms

mms = ["ms01.ms", "ms02.ms", "ms03.ms"]
datasets = list(chain.from_iterable(map(xds_from_ms, mms)))
```

What exactly distinguishes a normal MS from a multi-MS?
Yes, let's keep the discussion going.
No need to reply, I'm reading the PDF which seems to describe it well.
Hi everyone, is there any news on this? I am also interested in using MMS files with dask-ms. Although, since dask-ms already groups rows by FIELD_ID and DATA_DESC_ID, I think this is not much different from using an MMS file. The "partitioned" dask-ms datasets can also be processed in parallel, so what would be the big difference between reading an MMS file and reading a non-MMS file with dask-ms?
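For context, a minimal sketch of the grouping behaviour this comment refers to; the group_cols argument is part of the dask-ms API, but the column selection and path here are illustrative only.

```python
# Sketch: dask-ms partitions a single monolithic MS into one dataset per
# (FIELD_ID, DATA_DESC_ID) group; each dataset wraps lazy dask arrays that
# can be computed in parallel.
from daskms import xds_from_ms

datasets = xds_from_ms(
    "obs.ms",  # hypothetical path
    columns=["TIME", "ANTENNA1", "ANTENNA2", "DATA"],
    group_cols=["FIELD_ID", "DATA_DESC_ID"],
)

for ds in datasets:
    # One dataset per grouping of rows in the main table
    print(ds)
```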
Hi @miguelcarcamov. I think the issue here is that:
I've done some work on a new pybind11 wrapper for CASA Tables to solve the above issues, but haven't touched it recently: https://github.com/ratt-ru/futurecasapy. In the longer term, we're exploring newer formats for the Measurement Set v{2,3} specification. Earlier this year, zarr and parquet support for CASA Table-like data was added to the dask-ms master branch. These have proven highly performant on a supercomputer -- orders of magnitude faster than the current CASA Table system, although it is possible that the latter wasn't optimally compiled.
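As a rough illustration of the zarr route mentioned above -- the daskms.experimental.zarr module path and function names are my assumption from the dask-ms master branch at the time, so check the current dask-ms docs:

```python
# Sketch: convert a CASA MS to the experimental zarr-backed format and
# read it back. Paths are hypothetical.
import dask

from daskms import xds_from_ms
from daskms.experimental.zarr import xds_from_zarr, xds_to_zarr

datasets = xds_from_ms("obs.ms")            # read the CASA table
writes = xds_to_zarr(datasets, "obs.zarr")  # build a lazy write graph
dask.compute(writes)                        # execute the conversion

# Subsequent reads go through zarr rather than the CASA Table system
zarr_datasets = xds_from_zarr("obs.zarr")
```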
Description
This is very much a non-urgent, wish-list feature. As it stands, dask-ms treats a multi-MS as a single monolithic MS. This means it is interpreted as having only a single main table, which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the multi-MS and handled some of its intricacies for the user. The reason I believe this belongs in dask-ms is that doing it in an application requires an additional layer of pyrap.table calls to disentangle the internal partitioning, which just seems error-prone and messy on the user end. That being said, it is possible to do this on the application side, so if that is going to be the paradigm, I can make it happen.
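To make that "additional layer of pyrap.table calls" concrete, here is one hedged sketch of what an application might have to do itself to disentangle the partitioning: map each sub-MS to the row range it would occupy in the monolithic view. The SUBMSS/ layout and the assumption that sub-MSs are concatenated in lexicographic order are mine, not something dask-ms or casacore guarantees.

```python
# Sketch: per-sub-MS row ranges of a multi-MS, computed with plain
# python-casacore table calls (pyrap.tables is the compatibility alias).
# Assumes the SUBMSS/ layout and lexicographic concatenation order.
import os

from pyrap.tables import table


def subms_row_ranges(mms_path):
    """Hypothetical helper returning (sub_ms_path, start_row, end_row) tuples."""
    submss_dir = os.path.join(mms_path, "SUBMSS")
    ranges = []
    start = 0
    for name in sorted(os.listdir(submss_dir)):
        subms = os.path.join(submss_dir, name)
        t = table(subms, ack=False)
        nrows = t.nrows()
        t.close()
        ranges.append((subms, start, start + nrows))
        start += nrows
    return ranges
```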