...

Topic | Status | Decision | Alternatives considered / Rationale

Language | decided | Python

Previously, C++ has often been seen as the primary implementation language for data exchange. However, there is a substantial and growing trend towards using Python for high-level programming in scientific fields. This is due to its flexibility in working with various data formats, and its well-established infrastructure for dealing with numeric data (chiefly based on numpy).

On Python version: For the moment we are aiming to support Python versions that are not currently "end-of-life" according to https://devguide.python.org/versions/

License | decided | BSD3 | SKAO preference, has been checked to be okay with NRAO. Note that this means that no dependencies can be GPL - but LGPL is permissible.
Name | decided | xradio: https://github.com/casangi/xradio

We want the name to be reasonably indicative of what it is - "measurement set" is a bit too general. However, we also want to keep the door open to include data models beyond visibilities in the future. We also don't want to stick an "ng" somewhere to invoke the "next generation CASA" project.

  • pyrat (python radio astronomy tools), trapy (tools for radio astronomy python), pytra (python tools for radio astronomy), astropyrat, astroradius, …
  • radex or radax (radio astronomy data exchange format) - pyradex already taken on PyPI, but radax is still available
  • vidax (visibility data exchange format) - still available on PyPI
  • combination of the two: ravix (radio astronomy visibility exchange format) - still available on PyPI
  • xradio (xarray radio astronomy data I/O library?) - bonus points for semi-recursive acronym!

xradio won the poll.

Goal | proposal

Library to allow construction of consistent xarray-based data structures for radio interferometry

  • Document
  • Construct
  • Convert
  • Check

We want this library to gather shared assumptions about our radio astronomy data models. This is primarily about the conventions used when interacting with the xarray API, i.e. what datasets and data arrays contain in terms of dimensions, coordinates, data variables and attributes - their names, dimensions (if applicable), data types and associated semantics (description, units, etc.).

Developers and users should be able to use this library to:

  • determine whether a certain data structure is compliant - and if so, how it is to be interpreted in detail (a minimal check is sketched after this list)
    • to some degree, this can be achieved simply by having the data structure be self-describing, and leaning on xarray's rules (i.e. coordinate-data variable association, and descriptive attributes)
    • however, for this to work as an exchange format, we will also need to check and document more detailed conventions (completeness, semantic descriptions etc.).
  • easily generate compliant data structures
    • allow generating compliant structures from lower-level APIs (e.g. dask or numpy arrays)
      • We will have to assume that there are going to be a large number of possible producers for this data - especially processing functions. For instance, given that xarray can be a front for dask arrays, it is a valid use case to wrap an entire computation graph behind an xradio-compliant API
    • allow conversion of existing formats
      • Measurement set v2/3 being the obvious first stop
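
As a minimal sketch of what such a compliance check might look like, assuming xarray datasets: the function name and the required coordinate / data variable names below are hypothetical placeholders, not the actual xradio API.

    import xarray as xr

    def check_visibility_xds(xds: xr.Dataset) -> list:
        """Return a list of compliance problems; an empty list means compliant."""
        problems = []
        # Coordinates (lower-case names by convention) that a visibility
        # dataset would be expected to carry.
        for coord in ("time", "baseline_id", "frequency", "polarization"):
            if coord not in xds.coords:
                problems.append(f"missing coordinate '{coord}'")
        # Bulk data variables (upper-case names by convention), which should
        # also describe themselves via attributes.
        for var in ("VISIBILITY",):
            if var not in xds.data_vars:
                problems.append(f"missing data variable '{var}'")
            elif "units" not in xds[var].attrs:
                problems.append(f"data variable '{var}' has no 'units' attribute")
        return problems
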
Abstract data model | proposal | acyclic graph of xarray datasets

With "abstract datamodel" we refer to the non-domain-specific data modelling framework that we can use to express the (say) visibility data model. Measurement sets were described as a "set of tables", each with:

  • a name/type such as "MAIN", "ANTENNA" etcetera. An MS could have at most one table of any given type.
  • a set of named columns of various types, and
  • a series of rows assigning a value to every column.

Associations between tables were in the form of (implicit) foreign keys. The proposal is to replace this by an "acyclic graph of xarray datasets" (a construction sketch is given further below), where every dataset can have:

  • A set of named dimensions, with associated index spaces (replacing "rows" / "row count" with a multi-dimensional approach)
  • A set of named data arrays, each utilising a sub-set of the dimensions (replacing columns). There are two distinct types:
    • Coordinates (naming convention: lower case name) associate dimensions with domain-specific labels, such as frequency or time stamps. These are generally loaded eagerly (i.e. data represented as numpy arrays), and are used for indexing into the data.
    • Data variables (naming convention: upper case name) represent the actual bulk data contained in the dataset. Generally loaded lazily (i.e. data represented as dask arrays by default until applying dask.compute).
  • A set of named attributes, which can be either:
    • Primitive attributes (naming convention: without postfix) of primitive types - strings, int/double numbers and lists
    • Complex attributes (naming convention: with "_info" postfix, e.g. "field_info"), a dictionary of string names to other primitive or complex attributes.
    • Dataset attributes / associations (naming convention: with "_xds" postfix), a link to an associated sub-dataset

[draw.io diagram: xarray data model]

Multiple xarray datasets can point to a shared dataset (e.g. antenna information). TBD whether this should be observable (e.g. define antenna set IDs?) so that we can quickly decide whether datasets are mergeable.
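
As a rough construction sketch of these conventions, assuming xarray and dask: all names below (VISIBILITY, field_info, antenna_xds, the dimension names, shapes and values) are hypothetical placeholders rather than a settled schema.

    import numpy as np
    import dask.array as da
    import xarray as xr

    ntime, nbl, nchan, npol = 10, 6, 64, 4

    # Sub-dataset holding shared antenna information, linked via an "_xds" attribute.
    antenna_xds = xr.Dataset(
        data_vars={"POSITION": (("antenna_id", "xyz"), np.zeros((4, 3)))},
        coords={"antenna_id": np.arange(4), "xyz": ["x", "y", "z"]},
    )

    vis_xds = xr.Dataset(
        data_vars={
            # Bulk data: upper-case name, lazily evaluated dask array.
            "VISIBILITY": (
                ("time", "baseline_id", "frequency", "polarization"),
                da.zeros((ntime, nbl, nchan, npol), dtype="complex64"),
            ),
        },
        coords={
            # Coordinates: lower-case names, eagerly loaded numpy arrays.
            "time": np.linspace(0.0, 9.0, ntime),
            "baseline_id": np.arange(nbl),
            "frequency": 1.4e9 + 1e6 * np.arange(nchan),
            "polarization": ["XX", "XY", "YX", "YY"],
        },
        attrs={
            "telescope_name": "EXAMPLE",      # primitive attribute
            "field_info": {"name": "3C123"},  # complex "_info" attribute
            "antenna_xds": antenna_xds,       # "_xds" dataset association
        },
    )

Note that holding a whole sub-dataset in attrs works in memory, but how such "_xds" associations get serialised by a storage backend would still need to be defined.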

Heterogeneous data | proposal | multiple datasets indexed by coordinates (+ attributes?)

There are two separate concerns to do with heterogeneity:

  • The xarray abstract data model works best on homogeneous data where dimensions are constant throughout. This runs into problems when that assumption is broken - for instance, baseline-dependent averaging might want to change frequency resolution depending on baseline, or different scans might use different channelisations.
  • Furthermore, being able to express arbitrary configuration changes within a dataset would require adding extra data variables such as "scan number" or "field", which would only change very rarely (and in most cases at the same time). In practice our data will often naturally come in "chunks" where such attributes remain constant - something we could not represent directly in the measurement set's table format.

Therefore the proposal is to allow every "set of xarray datasets" to contain multiple instances of the same dataset "type". For instance, we might have multiple "visibility" datasets for different BLDA modes or scans. There are two principal ways in which we can then "select" the appropriate datasets:

  • By attribute: For instance, we could have a scan_id or BLDA mode set on the datasets
  • By coordinates: Similar to how you would use coordinates to index into a dataset; where datasets have non-overlapping coordinate sets, you can select by coordinate. For instance, you could select by time (which might implicitly select a scan), or by baseline ID (which might imply a certain BLDA mode)

Interestingly enough, we can replace the former by the latter by having coordinates for (say) scan_id or BLDA mode that have just one entry, and no associated data variables. This would have the same general semantics, and would simplify the overall model. Though treating (say) "type" as a coordinate might be a bit surprising?
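
As a rough illustration of both selection modes, assuming the datasets are grouped in a plain Python dictionary (all names, shapes and values hypothetical):

    import numpy as np
    import xarray as xr

    def make_scan(scan_id, t0, nchan):
        # Toy single-scan visibility dataset; per-scan channelisation differs.
        return xr.Dataset(
            data_vars={
                "VISIBILITY": (("time", "frequency"),
                               np.zeros((4, nchan), dtype="complex64")),
            },
            coords={
                "time": t0 + np.arange(4.0),
                "frequency": 1.4e9 + 1e6 * np.arange(nchan),
                "scan_id": scan_id,  # single-valued coordinate, attribute-like
            },
        )

    scans = {f"scan_{i}": make_scan(i, t0=100.0 * i, nchan=64 * (i + 1))
             for i in range(3)}

    # Selection "by attribute" (here via the single-valued scan_id coordinate):
    by_scan = [xds for xds in scans.values() if int(xds.scan_id) == 1]

    # Selection "by coordinate", using the non-overlapping time ranges:
    by_time = [xds for xds in scans.values()
               if float(xds.time.min()) <= 102.0 <= float(xds.time.max())]

With single-valued coordinates such as scan_id, the attribute-style and coordinate-style selections collapse into the same mechanism, as suggested above.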

Dimension variability | open

In some cases, different applications might want to choose different dimensions for their data variables. For instance, weights and flags might "often" be the same for all polarisations, so having a polarisation axis might mean that large amounts of redundant data get generated and stored (potentially up to the same size as visibilities, so very significant). Similarly, time centroids might or might not also depend on frequency, depending on flagging method. Options considered:

  • Hide it - e.g. compress on storage, and maybe use a "virtual" numpy axis with a stride of 0 (see here, and the sketch after this list) to make it look like a numpy array with the appropriate shape. Unclear whether this would not cause lots of surprises down the road.
  • Allow data variables to omit coordinates - this means that quite a few processing functions would need to special-case this behaviour. As long as it's just polarisation it might not be too bad, and it would be reasonably in line with how numpy operations automatically broadcast (though notably polarisation is on the "wrong" side of the shape for this to work automatically).
  • Have two different dimensions for "visibility polarisations" and "flag/weight polarisations". Mostly the same as previous option, but might invite more generality than we want?
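
For the first option in the list above, numpy's broadcast_to produces exactly such a stride-0 view; a minimal sketch (array names and shapes hypothetical):

    import numpy as np

    nchan, npol = 64, 4

    # Weights stored once per channel, without a polarisation axis.
    weights = np.ones(nchan, dtype="float32")

    # Read-only view with stride 0 along the new polarisation axis: it looks
    # like shape (nchan, npol) but stores no additional data.
    weights_view = np.broadcast_to(weights[:, np.newaxis], (nchan, npol))

    print(weights_view.shape)    # (64, 4)
    print(weights_view.strides)  # (4, 0)
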
Data type variability | proposal | allow arbitrary floating point precisions supported by numpy (half / single / double IEEE-754)

Especially for visibilities and flags we might be under quite a bit of pressure to represent them in a compact fashion - both for storage as well as processing. At the same time, certain use cases might have high expectations for accuracy, therefore it is hard to define a one-size-fits-all approach to floating point accuracy.

So for the moment, the proposal is to allow all floating point precisions supported by numpy for visibilities, which is currently half, single and double precision (16 / 32 / 64 bits respectively, assumed IEEE-754). This extends to complex numbers.
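
In numpy terms this corresponds to the following dtypes (note that numpy currently provides complex64 and complex128 as complex counterparts, but no native half-precision complex type):

    import numpy as np

    # Real IEEE-754 precisions supported by numpy: half / single / double.
    real_dtypes = [np.float16, np.float32, np.float64]

    # Complex counterparts currently available in numpy.
    complex_dtypes = [np.complex64, np.complex128]

    for dt in real_dtypes + complex_dtypes:
        print(np.dtype(dt).name, 8 * np.dtype(dt).itemsize, "bits per value")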

Do we need to support all choices for weights, uvw etc as well? Do we want to enforce just one floating point type per dataset?

Storage backend implementation | proposal
xradio should support multiple storage backends (ms, zarr, netcdf, asdm_v2, ...) that are independently implemented so that installation can be modular. 
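
For zarr and netCDF, xarray's own writers and readers already do most of the work; a minimal zarr round trip might look like this (requires the zarr package, dataset contents hypothetical):

    import numpy as np
    import xarray as xr

    xds = xr.Dataset(
        data_vars={
            "VISIBILITY": (("time", "frequency"),
                           np.zeros((4, 8), dtype="complex64")),
        },
        coords={"time": np.arange(4.0), "frequency": np.arange(8.0)},
    )

    # The zarr / netCDF backends come with xarray (given the respective
    # packages); measurement set or ASDM readers would be separate,
    # optional components.
    xds.to_zarr("vis.zarr", mode="w")
    roundtrip = xr.open_zarr("vis.zarr")
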
Nomenclature | proposal
For descriptions of reference frames in keywords, we propose to use the astropy nomenclature rather than the casacore nomenclature. E.g. in 'time', we use 'time_scale' rather than 'reference'.
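
A hypothetical illustration of what this would look like on a time coordinate (exact keys and values still to be decided):

    import numpy as np
    import xarray as xr

    time = xr.DataArray(
        np.array([4.9e9, 4.9e9 + 10.0]),
        dims="time",
        attrs={
            "time_scale": "utc",  # astropy-style name, instead of casacore's 'reference'
            "units": "s",         # illustrative only
        },
    )
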
Measures, units and reference frames | open

One of the features of the measurement set format was that all fields could be associated with "measures" - a general representation of "physical quantities within a certain reference frame". This is a pretty powerful system, and in some cases positively required (for instance, to express moving reference frames).

To start with, we likely (at this point?) don't want to try to maintain a system for doing conversions between measures ourselves - we want to stick as much as possible to "just" building a data model library. Instead, our goal is to build something that will work easily with existing astronomy packages such as astropy for that purpose. This means that when in doubt, what was captured as "measures" will need to be exploded out into separate attributes for measure value, unit, reference frame etcetera.
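
A hypothetical example of such an "exploded" measure, and of handing it back to astropy when an actual frame conversion is needed (the key names are illustrative, not a settled convention):

    from astropy.coordinates import SkyCoord
    import astropy.units as u

    # A field direction stored as plain values plus explicit unit and frame
    # attributes.
    field_info = {
        "direction": [2.85, 0.72],
        "units": "rad",
        "frame": "icrs",
    }

    # Reconstruct an astropy object for any actual conversion work:
    coord = SkyCoord(
        ra=field_info["direction"][0] * u.Unit(field_info["units"]),
        dec=field_info["direction"][1] * u.Unit(field_info["units"]),
        frame=field_info["frame"],
    )
    coord_fk5 = coord.transform_to("fk5")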

From a data modelling perspective, the main question to decide is therefore granularity: Do we want to allow every single value to potentially have a different reference frame, or would we want to standardise this on some level - e.g. a dataset, data array/coordinate or "info" dictionary?

(TODO - this is a pretty important topic to consider carefully, likely needs more investigation)

...