
Time / Item / Who / Notes
25 min: XRADIO presentation (15 min) and discussion (10 min)
  • load_processing_set with MSv2 can be costly and cause performance issues.
  • load_processing_set does not use Dask; it interfaces directly with zarr.
  • MSv4 data are held in memory, but on-disk persistence comes "for free" via zarr.
  • Tested with imaging to check data access: ran over four nodes and produced a large spectral cube.
  • Performance data from this will go into the plan.
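The in-memory vs on-disk distinction above can be sketched in a few lines. This is an illustrative stand-in only: it uses numpy's memmap as a simple proxy for a chunked zarr store, not XRADIO's actual load_processing_set API, and the array shape is hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical (time, frequency) visibility-like array persisted to disk.
path = os.path.join(tempfile.mkdtemp(), "vis.dat")
shape = (64, 128)

on_disk = np.memmap(path, dtype=np.float64, mode="w+", shape=shape)
on_disk[:] = np.arange(on_disk.size, dtype=np.float64).reshape(shape)
on_disk.flush()  # data now persisted on disk

# Lazy access: reopening the memmap pulls only the slices you touch.
lazy = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
one_row = np.asarray(lazy[3])  # reads a single row from disk

# Eager access: copying into a plain ndarray loads everything into RAM,
# analogous to holding the full MSv4 in memory.
in_memory = np.array(lazy)

assert np.array_equal(one_row, in_memory[3])
```

The trade-off being discussed is exactly this: eager loading is simple but costly for large data, while the on-disk form supports partial reads at the price of I/O on every access.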
25 min: Schema checker and documentation presentation (15 min) and discussion (10 min)
  • AA2 processing scaling: it is not scaling. Current tests failed at distributed processing; looking into further measures.
  • I/O is an expected bottleneck in scaling.
  • The current tests are not good enough to determine where the bottlenecks are.
  • XRADIO is a Python-based project, but the actual reading of zarr files uses C++ libraries. Reimplementing everything in C++ is probably not feasible; reading the data from C++ in WSClean should be simple.
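One way to make scaling tests better at locating bottlenecks, as raised above, is to instrument the I/O and compute phases separately. A minimal sketch, with an illustrative numpy workload standing in for a real pipeline stage (the file, sizes, and FFT step are assumptions, not the AA2 setup):

```python
import os
import tempfile
import time
import numpy as np

# Write a throwaway data chunk to disk so the read phase is measurable.
path = os.path.join(tempfile.mkdtemp(), "chunk.npy")
np.save(path, np.random.default_rng(0).standard_normal((512, 512)))

# I/O phase: read the chunk from disk.
t0 = time.perf_counter()
data = np.load(path)
t_io = time.perf_counter() - t0

# Compute phase: a stand-in workload (a 2-D FFT and reduction).
t0 = time.perf_counter()
result = np.fft.fft2(data).real.sum()
t_compute = time.perf_counter() - t0

print(f"I/O: {t_io:.4f}s  compute: {t_compute:.4f}s  "
      f"I/O fraction: {t_io / (t_io + t_compute):.2%}")
```

If the I/O fraction dominates as the node count grows, that supports the expectation stated above that I/O, not compute, is where the scaling stalls.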

Recap and next steps - are we progressing towards our goals? Have our goals changed? (see e.g. "Goal" from Data models collaboration - decision log)

Embedded Google Drive file: https://docs.google.com/presentation/d/1uekrD0cyIYB_KoC5u2sRDuzlfrEWRjw-3BrR6cAe3Og/edit?usp=drivesdk

Who: everybody

  • DMS at NRAO plans: continuation of XRADIO; starting the next round of prototyping ("pilot"), looking into different kinds of workflow orchestration (for example, combining Prefect with Dask). Have started writing domain (pure-science) functions that are not parallel: gridflag and fringefit based on XRADIO.
  • ARDG - the algorithm architecture is being tested for scaling. Testing on 100+ GPUs across the US; deployed the architecture to process 2 TB of VLA wide-band data. The architecture scaled quite well, as expected. There were some data-distribution issues not related to the algorithm architecture. Will do another run in a few weeks; expect throughput to improve by a factor of 2. Focussing on throughput metrics, not floating-point operations per second.
    • Current throughput is about 1 TB/hour; expect to go to 2 TB/hour. Using HTCondor.
    • Distribute along an axis the data are stored in (time, frequency); exploring other axes.
    • Imaging is done separately and then brought together.
    • Does not yet include calibration. The next step is self-calibration (DI and pointing); working on deploying that and measuring its scaling.
    • Is the implementation sufficiently decoupled from the parallelization framework? So far, yes. Tested on HTCondor, AWS, and ...?
  • Should build up a list of tools/software using XRADIO
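The distribution pattern described above (split along a storage axis, process chunks independently, then bring the results together) can be sketched as follows. This is a hedged toy example: the array sizes are made up, and grid_chunk is a stand-in reduction, not a real imaging kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
n_time, n_freq = 120, 16
vis = rng.standard_normal((n_time, n_freq))  # hypothetical visibilities

def grid_chunk(chunk):
    # Stand-in for per-chunk imaging: reduce one time-chunk onto a grid.
    return chunk.sum(axis=0)

# Distribute along the time axis: each worker would get one chunk.
chunks = np.array_split(vis, 4, axis=0)
partial_grids = [grid_chunk(c) for c in chunks]

# Combine: because gridding accumulates linearly, summing the per-chunk
# grids reproduces a single-pass reduction over all of the data.
combined = np.sum(partial_grids, axis=0)
assert np.allclose(combined, vis.sum(axis=0))
```

The same pattern applies whether the chunks run under HTCondor, Dask, or another framework, which is why the decoupling question raised above matters: only the split/combine boundary touches the orchestration layer.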

 


  • For the immediate future: are we progressing sufficiently through the prototyping phase?
  • Is the amount of effort right?
  • DMS (Jeff) is happy with the way things are going.
    • As we go forward, we will need more formalism around milestones and deliverables.
    • Eventually we must transition out of exploratory mode and set some deadlines.
  • Should have a review of the schemas and prove that, within the schema, we can deliver on the scalability and performance goals. Starting this exercise on the SKA side, but we need a milestone for the schemas; this should be the basis for v1.0 of the schema.
  • SKA: AA2 pipeline scaling tests by the end of the year; this is likely the next milestone. NRAO: also doing prototype testing around the same time, determining what scaling looks like, exploring different architectures using the data schema, and identifying bottlenecks.

Timeline

  • Schema is largely complete and is awaiting feedback from testing. Will need to incorporate feedback.
  • Sept 2024 - schema documentation complete and prototyping documentation complete.
  • Review by end of the year.

Action

  • Nick/Jeff - who and what are being reviewed.
  • (Jeff visiting in March)
    • Contributing institutes involved in the review process should effectively agree that they are willing to use the schema (does this become an IAU standard?).
    • List of organisations interested in contributing/participating in the review. Jan-Willem has a starting list of the people involved.
  • April - Management steering committee meeting.
  • Revisit potential for non-CALIM 
    • post-review perhaps smaller meeting to discuss what's been done with the testing
  • Describe what kind of more algorithm-focussed meeting we would want; gather an SOC for this.
10 min: Meetings moving forward - another F2F, Cal-IM reboot with focus on algorithms, leveraging existing conferences?
