24th February, 2021

Attendees: Josh Moore, Eric Perlman, Marco Peters, Matt McCormick, Matthias Bussonnier, Sabine Embacher, Stephan Saalfeld, Ward Fisher, Ryan Williams, John Kirkham

  • Introductions

    • Eric: suffers from bridging various worlds

    • Marco (Brockmann Consult Development; @bcdev): remote sensing

    • Matt: working on xarray/zarr interoperation and adding stores (consolidated metadata…). Working towards a single reference zip implementation (range requests).

    • Matthias: did most of the work on the V3 spec; still trying to finish it up.

    • Sabine: zarr will be the basis of the SNAP standard IO format in the near future. zarr is also used by our mass computing system Calvalus, producing data cubes. JZarr is also used to read data cubes produced using xcube. Started JZarr for own needs; it does not yet cover all of v2. Extending it in the future.

    • Stephan: wrote n5 at about the same time as zarr; needed the power of a compute cluster. The formats are very similar, with lots of adoption in both directions.

      • Has written additional backends: https://github.com/saalfeldlab/n5-zarr. It also doesn’t implement everything; Python specifics were left out.
      • Stopped coming to the meetings when some things were left off the table, e.g. the need for arbitrary-size chunks, and reusing chunks as a simple array of arbitrary objects (serialized out from Java).
      • Currently difficult to implement cloud+filesystem backends for n5-zarr; the key-value store needs to be extracted. Would be interested to hear more about the filesystem backend of JZarr.
    • Ward: netcdf team lead; drifted away briefly from these calls. zarr support in the netcdf libraries is there (v2 + S3). Has been trying to mint a release for two months (non-zarr problems, Windows bugs). It is there and feature complete (until we hear otherwise).

      • Sean Arms (?) is responsible for the Java implementation, which is also “there” (*not* wrapping C).
      • Could potentially be available by the end of the week
      • Putting the brakes on V3 until we have a V2 release.
      • Should implement compound types.
    • Ryan: single-cell, mostly Python

    • John: NVIDIA
  • Community: Java implementations: status, V3 plans, points of overlap

    • SS: Read into native arrays? WF: will talk to Dennis.

    • SE: are the compressors implemented, or do they wrap C libraries? WF: no, but will ask for status.

    • SE: uses ucar arrays, but not in the frontend; they are only used in the background for slicing data and moving data to the right chunk. When reading across chunk borders, the array combines the data (see the sketch below). JZarr currently doesn’t worry about signed/unsigned ints, i.e. it only supports pure Java types. SS: can load everything as bytes.
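
      A minimal 1-D sketch of that chunk-border assembly (illustrative only, not JZarr’s actual code; `loadChunk` is a hypothetical function returning one whole decompressed chunk as bytes):

      ```java
      import java.util.function.IntFunction;

      class ChunkedRead {
          // Assemble a read of `length` bytes starting at `start`, copying the
          // overlapping span from each chunk the requested range touches.
          static byte[] read(IntFunction<byte[]> loadChunk, int chunkSize, int start, int length) {
              byte[] out = new byte[length];
              int pos = start;
              while (pos < start + length) {
                  int chunkIndex = pos / chunkSize;
                  int offsetInChunk = pos % chunkSize;
                  int n = Math.min(chunkSize - offsetInChunk, start + length - pos);
                  System.arraycopy(loadChunk.apply(chunkIndex), offsetInChunk, out, pos - start, n);
                  pos += n;
              }
              return out;
          }
      }
      ```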

    • SS: can you load something where the type is 26 bytes? SE: only if there’s a wrapper that converts it to a byte stream. Unclear if it will work with types from Python.

    • SS: can the backend store/load chunks with a number of bytes that isn’t the size of the chunk, e.g. a 3x3 chunk loaded from more than 9 bytes? SE: since JZarr uses the common data model (CDM), it can only handle Unidata’s types, but the compressors are designed such that the input must be a byte stream. The plan is to encapsulate data preparation (e.g. data-to-byte-stream converters; see the sketch below).
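
      A hedged sketch of what such a data-to-byte-stream converter could look like (hypothetical interface, not JZarr’s planned API); the example converts `int[]` to little-endian bytes, matching zarr’s `<i4` dtype:

      ```java
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;

      // Compressors see only byte streams, so typed data is converted
      // before compression and converted back after decompression.
      interface ByteStreamConverter<T> {
          byte[] toBytes(T data);
          T fromBytes(byte[] bytes);
      }

      class LittleEndianIntConverter implements ByteStreamConverter<int[]> {
          public byte[] toBytes(int[] data) {
              ByteBuffer buf = ByteBuffer.allocate(data.length * 4).order(ByteOrder.LITTLE_ENDIAN);
              buf.asIntBuffer().put(data);
              return buf.array();
          }
          public int[] fromBytes(byte[] bytes) {
              int[] out = new int[bytes.length / 4];
              ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer().get(out);
              return out;
          }
      }
      ```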

    • From the netcdf-java team lead: compressors in netcdf-java/zarr are being implemented, but the list of initially supported compressors will be short. The challenge will be finding compressors with native Java support. Wrapping native compressors will be possible, but opens the door to complications when building and distributing.

    • SS: have no intent to be feature complete.

      • Know where we stopped (and why). Have a ByteBuffer interface; currently primitive-Java-array based. A library higher up (imglib2) uses it and is decoupled from n5.

      • Stopped at the default type (fishy in the v2 spec).

      • Stopped at complex C structs.

      • Stopped at the locking. n5 tries to do proper locking on the filesystem, continuing even if exceptions are thrown (see the first sketch after this list). The v2 spec wasn’t too explicit about the locking mechanism.

      • Stopped at the “hacks” for consolidating metadata. SE: JZarr relies on xcube’s optimize (Python) for this.

      • Stopped at filters.

      • Compressors: JZarr uses jblosc (n5-zarr is doing the same). Compressors are specified via an annotation-based system: if the JVM finds one of these compressor classes at start time, it becomes available (see the second sketch after this list). Also: gzip, lz (standard Java things). JZarr also uses a zlib and a null compressor.

        • A difference between n5 & zarr is whether or not there’s a registry. (The Python list is maintained in n5-zarr.)
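
      On the filesystem-locking point above: a minimal sketch of one way to lock a chunk file during a write so the lock is released even if an exception is thrown (illustrative only, not n5’s actual code):

      ```java
      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;
      import java.nio.channels.FileLock;
      import java.nio.file.Path;
      import java.nio.file.StandardOpenOption;

      class LockedChunkWrite {
          static void write(Path chunkPath, byte[] bytes) throws IOException {
              // try-with-resources releases the lock and closes the channel
              // even when the write throws.
              try (FileChannel channel = FileChannel.open(chunkPath,
                      StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                   FileLock lock = channel.lock()) { // exclusive advisory lock
                  channel.truncate(0);
                  channel.write(ByteBuffer.wrap(bytes));
              }
          }
      }
      ```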
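
      And on the annotation-based compressor system: a hedged sketch of how such a registry could work (all names here are hypothetical; real discovery would scan the classpath at start time rather than register classes explicitly):

      ```java
      import java.lang.annotation.*;
      import java.util.HashMap;
      import java.util.Map;

      @Retention(RetentionPolicy.RUNTIME)
      @Target(ElementType.TYPE)
      @interface CompressorId { String value(); }

      interface Compressor {
          byte[] compress(byte[] raw);
          byte[] decompress(byte[] packed);
      }

      @CompressorId("null")
      class NullCompressor implements Compressor {
          public byte[] compress(byte[] raw) { return raw; }
          public byte[] decompress(byte[] packed) { return packed; }
      }

      class CompressorRegistry {
          private final Map<String, Compressor> byId = new HashMap<>();

          // A compressor class found at start time becomes available under its id.
          void register(Class<? extends Compressor> cls) throws ReflectiveOperationException {
              CompressorId id = cls.getAnnotation(CompressorId.class);
              if (id != null) byId.put(id.value(), cls.getDeclaredConstructor().newInstance());
          }

          Compressor get(String id) { return byId.get(id); }
      }
      ```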

    • SS: locking in JZarr? SE: a chunk file name creator creates the name, and thread-safe locking is implemented using this chunk file name as the lock key while writing chunked data (a standalone sketch of the pattern follows this list). See line 208 and lines 263-270 of the class ZarrArray.

      • On the cluster? None implemented; it’s up to the compute team. There’s code which knows how the data should be written (in chunk-size units).
      • SS: locking on attributes is tricky. SE: all attributes are written at one time; no blocking there. (SE: the initializing phase comes from 15 years of prepping XML.)
      • EP: probably ok if it’s only within one library
      • JK: solved it in dask by writing on chunk boundaries
      • SS: still need it for attributes (JK: in dask that’s during `to_zarr()`)
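
      A standalone sketch of per-chunk locking keyed by the chunk file name, as described above (illustrative only; JZarr’s actual implementation lives in ZarrArray):

      ```java
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.locks.ReentrantLock;

      class ChunkLocks {
          // One lock per chunk key: writers to the same chunk serialize,
          // writers to different chunks proceed in parallel.
          private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

          void withChunkLock(String chunkKey, Runnable writeAction) {
              ReentrantLock lock = locks.computeIfAbsent(chunkKey, k -> new ReentrantLock());
              lock.lock();
              try {
                  writeAction.run(); // e.g. compress and write the chunk file
              } finally {
                  lock.unlock();
              }
          }
      }
      ```
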
    • SE: has a software tool for rechunking large volumes (e.g. initially write layer by layer, and then later combine the layers into larger chunks; see the sketch below).
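
      A hedged sketch of that combine step for C-order data (the `ChunkStore` abstraction is hypothetical, not JZarr’s API): runs of consecutive (1, H, W) layer chunks are concatenated into (zPerChunk, H, W) chunks.

      ```java
      interface ChunkStore {
          byte[] read(String key);              // raw bytes of one stored chunk
          void write(String key, byte[] bytes);
      }

      class LayerRechunker {
          // layerBytes = H * W * bytesPerElement; in C order the layers of a
          // (z, H, W) chunk are contiguous, so combining is a concatenation.
          static void rechunk(ChunkStore store, int totalLayers, int zPerChunk, int layerBytes) {
              for (int z0 = 0; z0 < totalLayers; z0 += zPerChunk) {
                  int n = Math.min(zPerChunk, totalLayers - z0);
                  byte[] combined = new byte[n * layerBytes];
                  for (int i = 0; i < n; i++) {
                      byte[] layer = store.read((z0 + i) + ".0.0"); // per-layer source key
                      System.arraycopy(layer, 0, combined, i * layerBytes, layerBytes);
                  }
                  store.write((z0 / zPerChunk) + ".0.0", combined); // combined target key
              }
          }
      }
      ```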

  • Core

    • Nothing
  • Other business (time permitting)

  • Links

  • Implementations discussed

  • Take home

    • Working on a collaborative comparison of features.