The following objectives were formulated for this collaboration:
- Create and document a new visibility data schema that is scalable and maintainable, satisfying the use cases of the NRAO and SKA. The documentation should be independently reviewed by experts for each of the use cases.
- Implement a prototype in Python using off-the-shelf packages (such as xarray, dask, etc.) that can read and write data in the format of the new schema and convert from measurement set v2.
Our aim is to build a system that is viable long-term, so we must take care to sufficiently discuss and document the design decisions we make. The following table lists some decisions made, indicating the status of the discussion and a short rationale. The statuses are to be understood as follows:
- decided - sufficient consensus has been achieved, we move forward based on the decision
- proposal - we have a reasonably complete proposal, but have not achieved consensus yet
- open - relevant and still under active discussion, no proposal has been made yet
- new - new topic, relevancy has not been decided yet
Decisions
Topic | Status | Decision | Alternatives considered / Rationale |
---|---|---|---|
Language | decided | Python | Previously, C++ has often been seen as the primary implementation language for data exchange. However, there is a substantial and growing trend to use Python for high-level programming in scientific fields, due to its flexibility in working with various data formats and its well-established infrastructure for dealing with numeric data (chiefly based on numpy). On Python versions: for the moment we are aiming to support Python versions that are not currently "end-of-life" according to https://devguide.python.org/versions/ |
License | decided | BSD3 | SKAO preference, has been checked to be okay with NRAO. Note that this means that no dependencies can be GPL - but LGPL is permissible. |
Name | decided | xradio: https://github.com/casangi/xradio | Want the name to be reasonably indicative of what it is - "measurement set" is a bit too general. However, we also want to keep the door open to include data models beyond visibilities in the future, and we don't want to stick an "ng" somewhere to invoke the "next generation CASA" project. xradio won the poll. |
Goal | proposal | Library to allow construction of consistent xarray-based data structures for radio interferometry | We want this library to gather shared assumptions about our radio astronomy data models. This is primarily about the conventions used when interacting with the xarray API, i.e. what datasets and data arrays contain in terms of dimensions, coordinates, data arrays, attributes - their names, dimensions (if applicable), data types and associated semantics (description, units, etc.). Developers and users should be able to use this library to construct such data structures consistently (see the first sketch below the table). |
Abstract datamodel | proposal | acyclic graph of xarray datasets | With "abstract datamodel" we refer to the non-domain-specific data modelling framework that we can use to express the (say) visibility data model. Measurement sets were described as a "set of tables", with associations between tables in the form of (implicit) foreign keys. The proposal is to replace this by an "acyclic graph of xarray datasets", where every dataset can have dimensions, coordinates, data variables and attributes. Multiple xarray datasets can point to a shared dataset (e.g. antenna information). TBD whether this should be observable (e.g. define antenna set IDs?) so that we can quickly decide whether datasets are mergeable. |
Heterogeneous data | proposal | multiple datasets indexed by coordinates (+ attributes?) | There are two separate concerns to do with heterogeneity: datasets of different kinds (e.g. visibilities vs. antenna information), and multiple instances of the same kind of dataset. The proposal is therefore to allow every "set of xarray datasets" to contain multiple instances of the same dataset "type". For instance, we might have multiple "visibility" datasets for different BLDA modes or scans. There are two principal ways in which we can then "select" the appropriate datasets: by attributes (e.g. a dataset "type") or by coordinate values (e.g. scan_id) - see the second sketch below the table. Interestingly enough, we can replace the former by the latter by having coordinates for (say) scan_id or BLDA that have just one entry and no associated data variables. This would have the same general semantics and would simplify the overall model - though regarding (say) "type" as a coordinate might be a bit surprising? |
Dimension variability | open | | In some cases, different applications might want to choose different dimensions for their data variables. For instance, weights and flags might "often" be the same for all polarisations, so having a polarisation axis might mean that large amounts of redundant data get generated and stored (potentially up to the same size as the visibilities, so very significant). Similarly, time centroids might or might not also depend on frequency, depending on the flagging method. Options are still being considered. |
Data type variability | proposal | allow arbitrary floating point precisions supported by numpy (half / single / double IEEE-754) | Especially for visibilities and flags we might be under quite a bit of pressure to represent them in a compact fashion - both for storage as well as processing. At the same time, certain use cases might have high expectations for accuracy, therefore it is hard to define a one-size-fits-all approach to floating point accuracy. So for the moment, the proposal is to allow all floating point precisions supported by numpy for visibilities, which is currently half, single and double precision (16 / 32 / 64 bits respectively, assumed IEEE-754). This extends to complex numbers. Do we need to support all choices for weights, uvw etc. as well? Do we want to enforce just one floating point type per dataset? |
Storage backend implementation | proposal | xradio should support multiple storage backends (ms, zarr, netcdf, asdm_v2, ...) that are independently implemented so that installation can be modular. | |
Nomenclature | proposal | For descriptions of reference frames in keywords, we propose to use the astropy nomenclature rather than the casacore nomenclature. E.g. in 'time', we use 'time_scale' rather than 'reference'. | |
Measures, units and reference frames | open | One of the features of the measurement set format was that all fields could be associated with "measures" - a general representation of "physical quantities within a certain reference frame". This is a pretty powerful system, and in some cases positively required (for instance, to express moving reference frames). To start with, we likely (at this point?) don't want to try to maintain a system for doing conversions between measures ourselves - we want to stick as much as possible to "just" building a data model library. Instead, our goal is to build something that will work easily with existing radio astronomy packages such as astropy for that purpose. This means that when in doubt, what was captured as "measures" will need to be exploded out into separate attributes for measures value, unit, reference frame etcetera. From a data modelling perspective, the main question to decide is therefore granularity: Do we want to allow every single value to potentially have a different reference frame, or would we want to standardise this on some level - e.g. a dataset, data array/coordinate or "info" dictionary? (TODO - this is a pretty important topic to consider carefully, likely needs more investigation) |
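As a concrete illustration of the "Goal", "Abstract datamodel" and "Nomenclature" rows, here is a minimal sketch of what a convention-following dataset could look like. All names used here (`VISIBILITY`, `antenna_xds`, `time_scale`, the dimension names) are hypothetical placeholders rather than settled schema; the shared antenna dataset is attached as an attribute purely to illustrate the "acyclic graph of datasets" idea:

```python
import numpy as np
import xarray as xr

ntime, nbl, nchan, npol = 4, 6, 8, 2

# Shared dataset that several visibility datasets could point to
antenna = xr.Dataset(
    {"POSITION": (("antenna_id", "xyz"), np.zeros((3, 3)))},
    coords={"antenna_id": np.arange(3), "xyz": ["x", "y", "z"]},
)

vis = xr.Dataset(
    {
        # Reduced (single) precision, cf. the "Data type variability" row
        "VISIBILITY": (("time", "baseline_id", "frequency", "polarization"),
                       np.zeros((ntime, nbl, nchan, npol), dtype=np.complex64)),
        "WEIGHT": (("time", "baseline_id", "frequency", "polarization"),
                   np.ones((ntime, nbl, nchan, npol), dtype=np.float32)),
    },
    coords={
        # astropy-style "time_scale" rather than casacore "reference",
        # cf. the "Nomenclature" row
        "time": ("time", np.linspace(0.0, 3.0, ntime),
                 {"units": "s", "time_scale": "utc"}),
        "frequency": ("frequency", np.linspace(1.0e9, 1.7e9, nchan),
                      {"units": "Hz"}),
        "polarization": ["XX", "YY"],
    },
    # An edge in the acyclic graph of datasets (in-memory only; how such
    # links would be serialised by a storage backend is still open)
    attrs={"antenna_xds": antenna},
)
```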
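And a sketch of the "Heterogeneous data" proposal under the same caveats: multiple instances of the same dataset "type", tagged by a single-entry coordinate that selection can then key on (the `make_vis` helper and all names are again hypothetical):

```python
import numpy as np
import xarray as xr

def make_vis(scan_id: int) -> xr.Dataset:
    # Toy visibility dataset tagged with a single-entry scan_id coordinate
    return xr.Dataset(
        {"VISIBILITY": (("time", "frequency"),
                        np.zeros((4, 8), dtype=np.complex64))},
        coords={"scan_id": scan_id},
    )

# A "set of xarray datasets" containing multiple instances of the same type
vis_set = {"vis_scan1": make_vis(1), "vis_scan2": make_vis(2)}

# Selecting by coordinate value, instead of by a separate "type"/scan attribute
scan2 = {name: ds for name, ds in vis_set.items()
         if ds.coords["scan_id"].item() == 2}
```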
3 Comments
Ger Van Diepen
I have a few questions/remarks:
Wortmann, Peter
xarray is just the API used for accessing data, and doesn't really dictate how data is stored or accessed. This is because any "data array" can be represented by a dask array, which means that basically anything that can produce array data (especially in chunks of a certain configurable size) can be used to back xarray. Especially - but not limited to - existing in-memory data, data on disk, or data that still has to be computed (potentially even on a different computer).
While the latter means there's inherent "parallelism support", I don't think we will get too much mileage out of that approach. In my mind, for parallelisation it will be more useful to partition all datasets into chunks - the xarray API is naturally "sliceable" across any dimension. From there, we can hopefully make meaningful decisions about which dataset chunks to load / evaluate / keep in memory, where and when. There's no free lunch here, we need to think about this carefully.
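A small sketch of this point (dimension names and sizes are made up): a dask-backed data variable keeps the dataset lazy, and slicing via the xarray API only ever materialises the selected chunks:

```python
import dask.array as da
import numpy as np
import xarray as xr

# Nothing is loaded or computed here -- dask only records shape and chunking
lazy = da.zeros((1000, 351, 4096, 4), dtype=np.complex64,
                chunks=(100, 351, 512, 4))
vis = xr.Dataset(
    {"VISIBILITY": (("time", "baseline_id", "frequency", "polarization"), lazy)})

# Slicing across any dimension stays lazy...
sub = vis.isel(time=slice(0, 100), frequency=slice(0, 512))
# ...and only .compute() / .load() evaluates the chunks actually needed
result = sub.compute()
```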
We have not made a decision on the actual storage backend yet, but zarr seems to be quite popular. I suspect we will also implement measurement set backends, if only to ease the transition. Ludwig has already confirmed that it's quite straightforward to access MeerKAT's object storage this way as well (through zarr, in fact).
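For illustration, a zarr round-trip already works with stock xarray (`to_zarr` / `open_zarr` are standard xarray calls; the dataset contents here are placeholders). zarr stores can equally be backed by object storage via fsspec-style store mappings, which is presumably what makes the MeerKAT case straightforward:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"VISIBILITY": (("time", "frequency"),
                    np.zeros((10, 64), dtype=np.complex64))})

# Write to a local zarr store (a directory on disk)
ds.to_zarr("vis.zarr", mode="w")

# Reading back gives a lazily loaded, dask-backed dataset
reopened = xr.open_zarr("vis.zarr")
```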
Currently we are not actively aiming to define a standard for how images are represented, but it seems very likely that we will define at least a provisional data model. I would expect that we are going to scope this even more strongly to our immediate use cases. For SKA I am mostly thinking about using it for representing image data in memory using a consistent API.
Wortmann, Peter
Re-reading your comment, I didn't actually address all points: