Improve multi-ms functionality #121

Open
JSKenyon opened this issue Aug 12, 2020 · 8 comments

@JSKenyon
Collaborator

  • dask-ms version: master
  • Python version: 3.6.9
  • Operating System: Ubuntu 18.04 LTS

Description

This is very much a non-urgent, wish-list feature. As it stands, dask-ms treats a Multi-MS as a single monolithic MS. This means it is interpreted as having only a single main table, which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the Multi-MS and handled some of its intricacies for the user. The reason I believe this belongs in dask-ms is that doing it in an application requires an additional layer of pyrap.table calls to disentangle the internal partitioning, which just seems error-prone/messy on the user end. That said, it is possible to do this on the application side, so if that is going to be the paradigm, I can make it happen.
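
To make the workaround concrete, here is a rough sketch of what the application side currently has to do, using the on-disk layout rather than pyrap.table calls; paths are hypothetical and it assumes the standard Multi-MS layout in which the sub-MSs live under a SUBMSS/ subdirectory:

import os
from daskms import xds_from_ms

def sub_ms_paths(mms_path):
    """List the individual sub-MSs of a Multi-MS, assuming the SUBMSS/ layout."""
    submss = os.path.join(mms_path, "SUBMSS")
    return sorted(os.path.join(submss, name) for name in os.listdir(submss))

# Open each sub-MS as its own set of datasets, rather than the monolithic view.
datasets = [xds_from_ms(path) for path in sub_ms_paths("observation.mms")]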

@sjperkins
Member

As it stands, dask-ms treats a Multi-MS as a single monolithic MS. This means it is interpreted as having only a single main table, which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the Multi-MS and handled some of its intricacies for the user.

I understand that your perspective here is from a performance POV, especially in light of casacore/casacore#1038 and casacore/python-casacore#209.

However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MS's as a monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.

The reason I believe this belongs in dask-ms is that doing it in an application requires an additional layer of pyrap.table calls to disentangle the internal partitioning, which just seems error-prone/messy on the user end.

I'm concerned that this request is asking dask-ms to reverse-engineer the intended virtual interface provided by CASA Tables in order to resolve performance problems. This would likely involve correctly reproducing a substantial amount of internal CASA Table logic.

Put differently, a Multi-MS is a baked cake, and unbaking a cake would be a Herculean task. I'd be interested in seeing the casacore maintainer's response to casacore/python-casacore#209.

@sjperkins
Member

/cc @IanHeywood who was also interested in multi-ms's

@JSKenyon
Collaborator Author

However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MS's as a monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.

I don't 100% agree on this point. The aim was to allow MMS-aware CASA tasks to make use of parallelism, whilst everything else could treat it as a monolithic MS. See this PDF on the subject.

Essentially, my request would be to make dask-ms MMS-aware. I have tried an application-end test (very ugly, not perfect, and almost certainly with some threading-related things to work out) and see a decent (roughly 3x) improvement in performance by reading in the sub-MSs directly.

I don't really believe that this even constitutes unbaking the cake: all it means is that a given row range lies in a specific sub-MS. Whether the backing getcol points at the sub-MS or at the monolithic MS seems like a relatively minor detail. Of course, I know that it is not that simple, and keeping track of all the sub-MS tables could get difficult.
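
To illustrate the bookkeeping (purely a hypothetical sketch, assuming the sub-MS row counts are known in concatenation order), mapping a global row range onto per-sub-MS row ranges looks something like this:

def map_row_range(start, stop, subms_rows):
    """Map global rows [start, stop) onto (sub-MS path, local start, local stop).

    subms_rows: list of (subms_path, nrows) in concatenation order.
    """
    mappings = []
    offset = 0
    for path, nrows in subms_rows:
        lo, hi = max(start, offset), min(stop, offset + nrows)
        if lo < hi:
            mappings.append((path, lo - offset, hi - offset))
        offset += nrows
    return mappings

# e.g. rows [900, 1100) of a Multi-MS whose first two sub-MSs hold 1000 rows each
print(map_row_range(900, 1100, [("SUBMSS/obs.0000.ms", 1000),
                                ("SUBMSS/obs.0001.ms", 1000)]))
# -> [('SUBMSS/obs.0000.ms', 900, 1000), ('SUBMSS/obs.0001.ms', 0, 100)]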

This is by no means a requirement; I just thought it was worth putting my thoughts together in an issue.

@JSKenyon
Collaborator Author

JSKenyon commented Aug 12, 2020

The performance factor is best taken with a pinch of salt, as I think there are some discrepancies in my tests. But there is definitely an improvement.

@sjperkins
Member

I don't 100% agree on this point. The aim was to allow MMS-aware CASA tasks to make use of parallelism, whilst everything else could treat it as a monolithic MS. See this PDF on the subject.

Ah, thanks for pointing that out. I haven't worked with MMSs much, so my experience of them is limited.

I guess something like the following is currently possible (although python-casacore and the GIL are still an issue...):

from itertools import chain
from daskms import xds_from_ms

mms = ["ms01.ms", "ms02.ms", "ms03.ms"]

# xds_from_ms returns a list of datasets per MS, so flatten the per-MS lists
datasets = list(chain.from_iterable(map(xds_from_ms, mms)))

What exactly distinguishes a normal MS from a multi-MS?

This is by no means a requirement; I just thought it was worth putting my thoughts together in an issue.

Yes, let's keep the discussion going.

@sjperkins
Member

What exactly distinguishes a normal MS from a multi-MS?

No need to reply; I'm reading the PDF, which seems to describe it well.

@miguelcarcamov

Hi everyone, is there any news on this? I am also interested in using MMS files with dask-ms. That said, since dask-ms groups the columns by FIELD_ID and DATA_DESC_ID, I think it is not much different from using an MMS file. The "partitioned" dask-ms dataset can also be processed in parallel, so what would be the big difference between reading an MMS file and reading a non-MMS file with dask-ms?

@sjperkins
Member

Hi everyone, is there any news on this? I am also interested in using MMS files with dask-ms. That said, since dask-ms groups the columns by FIELD_ID and DATA_DESC_ID, I think it is not much different from using an MMS file. The "partitioned" dask-ms dataset can also be processed in parallel, so what would be the big difference between reading an MMS file and reading a non-MMS file with dask-ms?

Hi @miguelcarcamov. I think the issue here is that:

  1. CASA Tables aren't thread safe and must therefore be accessed from a single thread (casacore/casacore#1038: "Why can't an "autonoread"-locked table be read from multiple threads?").
  2. python-casacore doesn't drop the Global Interpreter Lock (GIL) when accessing CASA Table data (casacore/python-casacore#209: "Add code to drop the GIL. Demonstrate usefulness on reads/writes."). This is probably the most painful, as it means that no compute happens while waiting for data; a rough process-based workaround is sketched below this list.
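
One partial workaround (just a sketch, assuming dask.distributed is installed and using a hypothetical "observation.ms" path) is to spread the work over multiple single-threaded worker processes, so that table access in different processes doesn't contend for one GIL:

import dask
from dask.distributed import Client
from daskms import xds_from_ms

if __name__ == "__main__":
    # One thread per worker keeps CASA Table access single-threaded per process.
    client = Client(n_workers=4, threads_per_worker=1, processes=True)

    datasets = xds_from_ms("observation.ms")
    # Reads for different partitions can now proceed in separate processes,
    # so a read blocked in one worker doesn't stall compute in another.
    means = [ds.DATA.data.mean() for ds in datasets]
    print(dask.compute(*means))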

I've done some work on a new pybind11 wrapper for CASA Tables to solve the above issues, but haven't touched it recently: https://github.com/ratt-ru/futurecasapy

In the long term, we're exploring newer formats for the Measurement Set v{2,3} spec. Earlier this year, zarr and parquet support for CASA Table-like data was added to the dask-ms master branch. These have proven highly performant on a supercomputer: orders of magnitude faster than the current CASA Table system, although it is possible that the latter hasn't been optimally compiled.
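
For example, something like the following should work (a sketch, assuming the experimental module paths on the master branch and a hypothetical MS name):

import dask
from daskms import xds_from_ms
from daskms.experimental.zarr import xds_to_zarr, xds_from_zarr

# Convert an existing MS into a zarr-backed store, then read it back.
datasets = xds_from_ms("observation.ms")
writes = xds_to_zarr(datasets, "observation.zarr")
dask.compute(writes)

zarr_datasets = xds_from_zarr("observation.zarr")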
