
Time / Item / Who / Notes
25 min: XRADIO presentation (15 min) and discussion (10 min)
  • load_processing_set with MSv2 can be costly and cause performance issues.
  • load_processing_set does not use Dask; it interfaces directly with zarr.
  • MSv4 data are held in memory, but on-disk persistence comes "for free" via zarr.
  • Tested with imaging to check data access: ran over four nodes and produced a large spectral cube.
  • Performance data from this will go into the plan.
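The in-memory vs on-disk distinction above can be sketched in a few lines. This is an illustrative stand-in only: it uses numpy's memmap as a simple proxy for a chunked zarr store, not XRADIO's actual load_processing_set API, and the array shape is hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical (time, frequency) visibility-like array persisted to disk.
path = os.path.join(tempfile.mkdtemp(), "vis.dat")
shape = (64, 128)

on_disk = np.memmap(path, dtype=np.float64, mode="w+", shape=shape)
on_disk[:] = np.arange(on_disk.size, dtype=np.float64).reshape(shape)
on_disk.flush()  # data now persisted on disk

# Lazy access: reopening the memmap pulls only the slices you touch.
lazy = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
one_row = np.asarray(lazy[3])  # reads a single row from disk

# Eager access: copying into a plain ndarray loads everything into RAM,
# analogous to holding the full MSv4 in memory.
in_memory = np.array(lazy)

assert np.array_equal(one_row, in_memory[3])
```

The trade-off being discussed is exactly this: eager loading is simple but costly for large data, while the on-disk form supports partial reads at the price of I/O on every access.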
25 min: Schema checker and documentation presentation (15 min) and discussion (10 min)
  • AA2 processing scaling: it is not scaling. Current tests failed at distributed processing; looking into further measures.
  • I/O is an expected bottleneck in scaling.
  • The current tests are not good enough to determine where the bottlenecks are.
  • XRADIO is a Python-based project, but the actual reading of zarr files uses C++ libraries. Reimplementing everything in C++ is probably not feasible; reading the data from C++ in WSClean should be simple.
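One way to make scaling tests better at locating bottlenecks, as raised above, is to instrument the I/O and compute phases separately. A minimal sketch, with an illustrative numpy workload standing in for a real pipeline stage (the file, sizes, and FFT step are assumptions, not the AA2 setup):

```python
import os
import tempfile
import time
import numpy as np

# Write a throwaway data chunk to disk so the read phase is measurable.
path = os.path.join(tempfile.mkdtemp(), "chunk.npy")
np.save(path, np.random.default_rng(0).standard_normal((512, 512)))

# I/O phase: read the chunk from disk.
t0 = time.perf_counter()
data = np.load(path)
t_io = time.perf_counter() - t0

# Compute phase: a stand-in workload (a 2-D FFT and reduction).
t0 = time.perf_counter()
result = np.fft.fft2(data).real.sum()
t_compute = time.perf_counter() - t0

print(f"I/O: {t_io:.4f}s  compute: {t_compute:.4f}s  "
      f"I/O fraction: {t_io / (t_io + t_compute):.2%}")
```

If the I/O fraction dominates as the node count grows, that supports the expectation stated above that I/O, not compute, is where the scaling stalls.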

Recap and next steps - are we progressing towards our goals? Have our goals changed? (see e.g. "Goal" from Data models collaboration - decision log)

Embedded Google Drive file: https://docs.google.com/presentation/d/1uekrD0cyIYB_KoC5u2sRDuzlfrEWRjw-3BrR6cAe3Og/edit?usp=drivesdk

Who: everybody

  • DMS at NRAO plans: continuation of XRADIO; starting the next round of prototyping ("pilot"), looking into different kinds of workflow orchestration (for example, combining Prefect with Dask). Have started writing domain (pure-science) functions that are not parallel: gridflag and fringefit based on XRADIO.
  • ARDG - the algorithm architecture is being tested for scaling. Testing on 100+ GPUs across the US; deployed the architecture to process 2 TB of VLA wide-band data. The architecture scaled quite well, as expected. There were some data-distribution issues not related to the algorithm architecture. Will do another run in a few weeks; expect throughput to improve by a factor of 2. Focussing on throughput metrics, not floating-point operations per second.
    • Current throughput is about 1 TB/hour; expect to go to 2 TB/hour. Using HTCondor.
    • Distribute along an axis the data are stored in (time, frequency); exploring other axes.
    • Imaging is done separately and then brought together.
    • Does not yet include calibration. The next step is self-calibration (DI and pointing); working on deploying that and measuring its scaling.
    • Is the implementation sufficiently decoupled from the parallelization framework? So far, yes. Tested on HTCondor, AWS, and ...?
  • Should build up a list of tools/software using XRADIO
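The distribution pattern described above (split along a storage axis, process chunks independently, then bring the results together) can be sketched as follows. This is a hedged toy example: the array sizes are made up, and grid_chunk is a stand-in reduction, not a real imaging kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
n_time, n_freq = 120, 16
vis = rng.standard_normal((n_time, n_freq))  # hypothetical visibilities

def grid_chunk(chunk):
    # Stand-in for per-chunk imaging: reduce one time-chunk onto a grid.
    return chunk.sum(axis=0)

# Distribute along the time axis: each worker would get one chunk.
chunks = np.array_split(vis, 4, axis=0)
partial_grids = [grid_chunk(c) for c in chunks]

# Combine: because gridding accumulates linearly, summing the per-chunk
# grids reproduces a single-pass reduction over all of the data.
combined = np.sum(partial_grids, axis=0)
assert np.allclose(combined, vis.sum(axis=0))
```

The same pattern applies whether the chunks run under HTCondor, Dask, or another framework, which is why the decoupling question raised above matters: only the split/combine boundary touches the orchestration layer.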

 


  • For the immediate future: are we progressing sufficiently through the prototyping phase?
  • Is the amount of effort right?
  • DMS (Jeff) is happy with the way things are going.
    • As we go forward, we will need more formalism around milestones and deliverables.
    • Eventually we must transition out of exploratory mode and set some deadlines.
  • Should have a review of the schemas and prove that, within the schema, we can deliver on the scalability and performance goals. Starting this exercise on the SKA side, but we need a milestone for the schemas; this should be the basis for v1.0 of the schema.
  • SKA: AA2 pipeline scaling tests by the end of the year; this is likely the next milestone. NRAO: also doing prototype testing around the same time, determining what scaling looks like, exploring different architectures using the data schema, and identifying bottlenecks.

Timeline

  • Schema is largely complete and is awaiting feedback from testing. Will need to incorporate feedback.
  • Sept 2024 - schema documentation complete and prototyping documentation complete.
  • Review by end of the year.

Action

  • Nick/Jeff - who and what are being reviewed.
  • (Jeff visiting in March)
    • Contributing institutes involved in the review process should effectively agree that they are willing to use the schema (does this become an IAU standard?).
    • List of organisations interested in contributing/participating in the review. Jan-Willem has a starting list of the people involved.
  • April - Management steering committee meeting.
  • Revisit potential for non-CALIM 
    • post-review perhaps smaller meeting to discuss what's been done with the testing
  • Describe what kind of more algorithm-focussed meeting we would want; gather an SOC for this.
10 min: Meetings moving forward - another F2F, Cal-IM reboot with focus on algorithms, leveraging existing conferences?
