zarr-datafusion

An Apache DataFusion extension that brings Arrow-native SQL to Zarr-native array data — query petabyte-scale climate, satellite, and sensor archives directly in object storage, without copying them into a database first.

Try it

zarr> CREATE EXTERNAL TABLE era5 STORED AS ZARR
      LOCATION 'gs://gcp-public-data-arco-era5/...';

zarr> DESCRIBE era5;

zarr> SELECT latitude, longitude, AVG("2m_temperature")
      FROM era5
      WHERE time > '2020-01-01'
      GROUP BY latitude, longitude
      LIMIT 10;

Zarr v2 & v3

Reads both Zarr storage spec versions, with schema inference from array metadata.
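To illustrate what schema inference from array metadata can look like, here is a minimal Python sketch. It is not the extension's actual code: the function name `infer_field` and the two dtype tables are hypothetical, and only a handful of numeric types are covered. It relies on the metadata keys defined by the Zarr specs (`dtype` and `chunks` in a v2 `.zarray` document, `data_type` and `chunk_grid` in a v3 `zarr.json` document).

```python
import json

# Hypothetical lookup tables: Zarr dtype spelling -> Arrow type name.
# Zarr v2 uses NumPy-style strings, Zarr v3 uses named data types.
V2_TO_ARROW = {"<f4": "float32", "<f8": "float64", "<i4": "int32", "<i8": "int64"}
V3_TO_ARROW = {"float32": "float32", "float64": "float64",
               "int32": "int32", "int64": "int64"}


def infer_field(name, metadata_json):
    """Derive a column description from one array's metadata document."""
    meta = json.loads(metadata_json)
    if meta.get("zarr_format") == 3:
        # v3: zarr.json with data_type and a regular chunk grid
        arrow_type = V3_TO_ARROW[meta["data_type"]]
        chunks = meta["chunk_grid"]["configuration"]["chunk_shape"]
    else:
        # v2: .zarray with dtype and chunks
        arrow_type = V2_TO_ARROW[meta["dtype"]]
        chunks = meta["chunks"]
    return {"name": name, "type": arrow_type,
            "shape": meta["shape"], "chunks": chunks}


# Illustrative metadata documents (made up for this example).
v2_doc = '{"zarr_format": 2, "shape": [100, 50], "chunks": [10, 50], "dtype": "<f4"}'
v3_doc = json.dumps({
    "zarr_format": 3,
    "node_type": "array",
    "shape": [100, 50],
    "data_type": "float64",
    "chunk_grid": {"name": "regular",
                   "configuration": {"chunk_shape": [10, 50]}},
})

print(infer_field("temperature", v2_doc))
print(infer_field("pressure", v3_doc))
```

Both versions end up in the same column description, which is why a single table definition can sit on top of either storage format.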

Filter & projection pushdown

DataFusion pushes WHERE clauses and column selection down to chunk-level reads — no full-array scans.
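The idea behind chunk-level filter pushdown can be sketched in a few lines of Python. This is a conceptual illustration under simplifying assumptions (a one-dimensional coordinate array split into fixed-size chunks, and a single `>` predicate), not the extension's implementation; `chunks_to_read` and the sample data are hypothetical.

```python
def chunks_to_read(coord_values, chunk_size, lower_bound):
    """Return indices of chunks that can contain a value > lower_bound."""
    selected = []
    for start in range(0, len(coord_values), chunk_size):
        chunk = coord_values[start:start + chunk_size]
        # A chunk whose maximum is not above the bound cannot match
        # the predicate, so it never needs to be fetched.
        if max(chunk) > lower_bound:
            selected.append(start // chunk_size)
    return selected


# Hypothetical time coordinate: ten yearly steps, stored four per chunk.
years = list(range(2015, 2025))

# Equivalent in spirit to: WHERE time > 2020
print(chunks_to_read(years, chunk_size=4, lower_bound=2020))  # prints [1, 2]
```

Here the first chunk (2015 to 2018) is skipped entirely; only the two chunks that overlap the predicate are read. Projection pushdown is the same idea along the other axis: arrays for columns the query never mentions are not opened at all.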

Cloud-native

Query Zarr stores directly on S3 and GCS, no local copy required.

Full SQL

Filtering, aggregation, GROUP BY, joins, ORDER BY, HAVING — the full DataFusion SQL surface.

Roadmap

Advanced filter and aggregate pushdown, streaming output, additional numeric types, Python bindings, and cloud prefetching. Follow progress and contribute on GitHub.

Cloud — coming soon. A managed, hosted version of zarr-datafusion is in development.