zarr-datafusion
An Apache DataFusion extension that brings Arrow-native SQL to Zarr-native array data. Query petabyte-scale climate, satellite, and sensor archives directly in object storage, without copying them into a database first.
Try it
zarr> CREATE EXTERNAL TABLE era5 STORED AS ZARR
LOCATION 'gs://gcp-public-data-arco-era5/...';
zarr> DESCRIBE era5;
zarr> SELECT latitude, longitude, AVG("2m_temperature")
FROM era5
WHERE time > '2020-01-01'
GROUP BY latitude, longitude
LIMIT 10;
Zarr v2 & v3
Reads both Zarr storage spec versions, with schema inference from array metadata.
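As a rough illustration of what schema inference from array metadata involves, the Python sketch below maps the element type recorded in a Zarr v2 `.zarray` document (`dtype`) or a Zarr v3 `zarr.json` document (`data_type`) to an Arrow type name. The `infer_column` helper, the lookup tables, and the sample metadata values are hypothetical and chosen for illustration; this is not the extension's own code, and the set of types it actually supports may differ.

```python
import json

# Zarr v2 stores NumPy-style type strings in .zarray ("dtype");
# Zarr v3 stores named types in zarr.json ("data_type").
# Both tables are illustrative subsets, not the extension's real mapping.
V2_TO_ARROW = {
    "<f4": "Float32",
    "<f8": "Float64",
    "<i2": "Int16",
    "<i4": "Int32",
    "<i8": "Int64",
    "|u1": "UInt8",
    "|b1": "Boolean",
}
V3_TO_ARROW = {
    "float32": "Float32",
    "float64": "Float64",
    "int16": "Int16",
    "int32": "Int32",
    "int64": "Int64",
    "uint8": "UInt8",
    "bool": "Boolean",
}


def infer_column(name, metadata_json):
    """Return (column name, Arrow type name) from one array's metadata document."""
    meta = json.loads(metadata_json)
    if meta["zarr_format"] == 2:
        arrow_type = V2_TO_ARROW[meta["dtype"]]
    elif meta["zarr_format"] == 3:
        arrow_type = V3_TO_ARROW[meta["data_type"]]
    else:
        raise ValueError(f"unsupported zarr_format: {meta['zarr_format']}")
    return name, arrow_type


# Hand-written sample metadata; shapes and chunk sizes are made up.
v2_meta = '{"zarr_format": 2, "shape": [721, 1440], "chunks": [721, 1440], "dtype": "<f4"}'
v3_meta = '{"zarr_format": 3, "node_type": "array", "shape": [721], "data_type": "float64"}'

print(infer_column("2m_temperature", v2_meta))  # ('2m_temperature', 'Float32')
print(infer_column("latitude", v3_meta))        # ('latitude', 'Float64')
```

The point of the sketch is that both spec versions carry enough type information in their metadata documents to build a table schema without reading any chunk data.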
Filter & projection pushdown
DataFusion pushes WHERE clauses and column selection down to chunk-level reads — no full-array scans.
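To make the idea of chunk-level pruning concrete, here is a minimal Python sketch, assuming a one-dimensional coordinate axis that is sorted ascending and split into fixed-length chunks. Given a predicate of the form `value > lower_bound`, it returns only the chunk indices that can contain matching values. The `chunks_to_read` function and the toy time axis are hypothetical; the extension's actual planning logic is not shown on this page and may work differently.

```python
from bisect import bisect_right


def chunks_to_read(coord_values, chunk_len, lower_bound):
    """Return indices of chunks that can contain values > lower_bound.

    Assumes coord_values is sorted ascending, as a time axis typically is.
    """
    # First position whose value is strictly greater than the bound.
    first_match = bisect_right(coord_values, lower_bound)
    if first_match == len(coord_values):
        return []  # no value satisfies the predicate; nothing to read
    first_chunk = first_match // chunk_len
    last_chunk = (len(coord_values) - 1) // chunk_len
    return list(range(first_chunk, last_chunk + 1))


# Toy time axis: 12 steps (2010..2021) stored in chunks of 4, i.e. chunks 0, 1, 2.
time_axis = list(range(2010, 2022))
print(chunks_to_read(time_axis, chunk_len=4, lower_bound=2019))  # [2]
print(chunks_to_read(time_axis, chunk_len=4, lower_bound=2015))  # [1, 2]
```

With `lower_bound=2019`, only the last chunk is fetched; the first two are never read. Projection pushdown is the analogous saving along the other axis: arrays for columns the query does not select are skipped entirely.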
Cloud-native
Query Zarr stores directly on S3 and GCS, no local copy required.
Full SQL
Filtering, aggregation, GROUP BY, joins, ORDER BY, HAVING — the full DataFusion SQL surface.
Roadmap
Advanced filter and aggregate pushdown, streaming output, additional numeric types, Python bindings, and cloud prefetching. Follow progress and contribute on GitHub.
Cloud — coming soon. A managed, hosted version of zarr-datafusion is in development.