This space has been made publicly available. No login is required to access this information. Only place information classified as UNRESTRICTED in this space.
AA2 processing scaling - it is not yet scaling. Current tests failed at the distributed-processing stage; further measures are being investigated.
I/O is an expected bottleneck in scaling.
The current tests are not detailed enough to determine where the bottlenecks are.
XRADIO is a Python-based project; the actual reading of zarr files uses C++ libraries. Reimplementing everything in C++ is probably not feasible, but reading WSClean output from C++ should be simple.
DMS at NRAO plans: continuation of XRADIO; starting the next round of prototyping ("pilot"), looking into different kinds of workflow orchestration (for example, combining Prefect with Dask). Have started writing domain (pure-science) functions that are not themselves parallel: gridflag and fringefit based on XRADIO.
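The shape of a framework-agnostic "pure-science" function can be sketched as below. This is a minimal illustration, not XRADIO code: the function name, signature, and the simple delay-phase model are all assumptions made for the example.

```python
import math

def fringe_delay_phase(delay_s, freqs_hz, ref_freq_hz):
    """Hypothetical pure-science function: phase (radians) induced by a
    residual delay at each frequency, relative to a reference frequency.
    It carries no parallel framework; orchestration is layered on top."""
    return [2.0 * math.pi * delay_s * (f - ref_freq_hz) for f in freqs_hz]
```

Because the function is pure (no I/O, no shared state), it can be wrapped unchanged by Prefect tasks, Dask delayed calls, or any other orchestration layer being prototyped.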
ARDG - the algorithm architecture is being tested for scaling. Testing on 100+ GPUs across the US; the architecture was deployed there to process 2 TB of VLA wide-band data. It scaled quite well, as expected. There were some data-distribution issues not related to the algorithm architecture. Another run is planned in a few weeks; throughput is expected to improve by a factor of 2. The focus is on throughput metrics, not FLOPS.
Current throughput is about 1 TB/hour; expect to reach 2 TB/hour. Using HTCondor.
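As a back-of-the-envelope check on these figures (the 2 TB dataset size and both throughput numbers come from the notes above; the helper function is just for illustration):

```python
def wall_time_hours(data_tb, throughput_tb_per_hour):
    """Wall-clock hours to push a dataset through at a given throughput."""
    return data_tb / throughput_tb_per_hour

print(wall_time_hours(2.0, 1.0))  # current run: 2 TB at ~1 TB/hour
print(wall_time_hours(2.0, 2.0))  # after the expected factor-of-2 improvement
```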
Distribute along an axis that the data is stored in - time or frequency; exploring other axes.
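The distribution pattern above can be sketched with the standard library; this is a toy illustration only (the chunk worker and the thread-pool executor are stand-ins, not the HTCondor deployment described above):

```python
from concurrent.futures import ThreadPoolExecutor

def partition(axis_values, n_chunks):
    """Split a stored axis (e.g. channel frequencies) into contiguous chunks."""
    size = -(-len(axis_values) // n_chunks)  # ceiling division
    return [axis_values[i:i + size] for i in range(0, len(axis_values), size)]

def process_chunk(chunk):
    """Hypothetical per-chunk worker standing in for real per-partition science."""
    return sum(chunk)

freqs = list(range(16))  # stand-in for a 16-channel frequency axis
chunks = partition(freqs, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_chunk, chunks))
print(results)  # one partial result per frequency chunk
```

Each chunk is processed independently; in the real pipeline the per-chunk results would then be combined, as with the imaging step noted below.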
Imaging is done separately and then brought together.
Does not yet include calibration. Next step is self-calibration (direction-independent and pointing). Working on deploying that and measuring its scaling.
Is the implementation sufficiently decoupled from the parallelization framework? So far, yes. Tested on HTCondor, AWS, and ...?
Should build up a list of tools/software using XRADIO
For the immediate future - are we progressing sufficiently through the prototyping phase?
Is the amount of effort right?
DMS (Jeff) is happy with the way things are going
As we go forward, we will need more formalism around milestones and deliverables.
Eventually we must transition out of exploratory mode and set some deadlines.
Should have a review of the schemas and demonstrate that, within a given schema, we can deliver on the goals for scalability and performance. An exercise on these activities is starting on the SKA side, but we need a milestone for the schemas. This should be the basis for v1.0 of the schema.
SKA - AA2 pipeline scaling tests by the end of the year; this is likely the next milestone. NRAO - also doing prototyping tests around the same time: determining what scaling looks like, exploring different architectures using the data schema, and identifying bottlenecks.
Timeline
The schema is largely complete and awaiting feedback from testing; that feedback will need to be incorporated.
Sept 2024 - schema documentation and prototyping documentation complete.
Review by end of the year.
Action
Nick/Jeff - determine who and what are being reviewed (Jeff visiting in March).
Contributing institutes should be involved in the review process so that they effectively agree they are willing to use the schema (does this become an IAU standard?).
Compile a list of organisations interested in contributing to or participating in the review; Jan-Willem has a starting list of people involved.
April - Management steering committee meeting.
Revisit potential for non-CALIM
Post-review, perhaps hold a smaller meeting to discuss what has been done with the testing.
Describe what kind of more algorithm-focussed meeting we would want - gather an SOC for this
Meetings moving forward - another F2F, a CALIM reboot with a focus on algorithms, leveraging existing conferences?