What it costs to copy the planet's data

Climate models, satellite constellations, and environmental sensor networks now produce data at a scale most organizations have no good way to query. The default response is almost always the same: copy a subset out of object storage and load it into a database or warehouse so analysts can finally run SQL against it.

That copy is never free, and it is never a one-time cost.

The duplication tax

Start with storage. Object storage holding Zarr or NetCDF archives typically runs $0.02–$0.03 per GB per month; AWS S3 Standard lists $0.023/GB for the first 50TB/month in US East (N. Virginia). The moment that data is also loaded into a warehouse, you’re paying for it twice, and usually more than twice once dev, staging, and regional copies are counted. At those same per-GB rates, a 50TB archive that “just” gets mirrored into a warehouse adds on the order of $1,000–$1,500 a month in storage alone, before a single query runs against it.
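To make that arithmetic easy to rerun with your own numbers, here is a minimal sketch of the storage side of the tax. It assumes decimal units (1 TB = 1,000 GB) and that every extra copy is billed at the same per-GB rate as the original, which real warehouse pricing may not match. The function name and parameters are illustrative only, not part of any pricing API.

```python
def extra_storage_cost(archive_tb, price_per_gb_month, extra_copies=1):
    """Monthly cost of the copies kept in addition to the original archive."""
    archive_gb = archive_tb * 1000  # assumption: decimal TB (1 TB = 1,000 GB)
    return archive_gb * price_per_gb_month * extra_copies


# One warehouse mirror of a 50 TB archive, at the low and high ends of the range.
low = extra_storage_cost(50, 0.02)
high = extra_storage_cost(50, 0.03)
print(f"One mirror of 50 TB: ${low:,.0f} to ${high:,.0f} per month")

# Hypothetical case: three extra copies (say dev, staging, one more region) at $0.023/GB.
with_envs = extra_storage_cost(50, 0.023, extra_copies=3)
print(f"Three extra copies at $0.023/GB: ${with_envs:,.0f} per month")
```

The point of the `extra_copies` parameter is that the bill scales linearly with every environment that keeps its own mirror.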

Then there’s the move itself. Pulling data out of object storage across the internet or between clouds runs roughly $0.05–$0.09 per GB in egress fees: AWS’s own data transfer pricing starts at $0.09/GB for the first 10TB/month and steps down to $0.05/GB at higher volume tiers. Loading a 10TB subset for one analysis can cost $500–$900 in transfer alone, and that charge recurs every time the upstream archive updates and the copy needs refreshing.

That refresh cycle is where the real cost hides. A climate team re-pulling the latest CMIP6 model run, an insurer re-loading SAR flood-extent tiles after every storm, a platform team running a separate ETL job for each BI tool pointed at the same sensor archive — each one is paying the storage-plus-egress tax again, on a clock set by how often the source data changes. Multiply that by every downstream copy, and “occasional ETL job” turns into a standing pipeline someone has to operate, monitor, and fix when it breaks.
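A rough sketch of how the refresh clock compounds the egress bill follows. It assumes a flat per-GB rate (real pricing is tiered), decimal units, and a full re-pull of the subset on every refresh. The refresh cadence and the number of downstream copies are hypothetical inputs, not figures from any real deployment.

```python
def egress_cost(subset_tb, rate_per_gb):
    """Transfer cost of pulling one subset out of object storage once."""
    return subset_tb * 1000 * rate_per_gb  # assumption: 1 TB = 1,000 GB


def annual_refresh_cost(subset_tb, rate_per_gb, refreshes_per_year, downstream_copies=1):
    """Yearly egress if every downstream copy re-pulls the full subset on each refresh."""
    per_pull = egress_cost(subset_tb, rate_per_gb)
    return per_pull * refreshes_per_year * downstream_copies


# A single 10 TB load at the low and high ends of the egress range.
one_load = (egress_cost(10, 0.05), egress_cost(10, 0.09))
print(f"One 10 TB load: ${one_load[0]:,.0f} to ${one_load[1]:,.0f}")

# Hypothetical: monthly refreshes feeding three separate BI tools at $0.09/GB.
monthly_three_tools = annual_refresh_cost(10, 0.09, refreshes_per_year=12, downstream_copies=3)
print(f"Monthly refresh x 3 copies: ${monthly_three_tools:,.0f} per year")
```

Incremental loads would lower these totals, but they add pipeline logic of their own that someone has to operate and maintain.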

And copies don’t just cost money — they age. The version sitting in the warehouse is correct the moment it lands and increasingly wrong after that. Decisions made against a six-week-old copy of a fast-moving archive are decisions made against the wrong data, and nobody finds out until something doesn’t reconcile.

There’s an organizational cost layered on top of the financial one. The people who produce this data (domain scientists running Xarray/Dask pipelines) and the people who need to act on it (analysts, risk teams, product managers) end up working against different versions of the truth, translated through whatever pipeline happened to run last. The copy doesn’t just cost storage and egress; it costs alignment.

The playbook that doesn’t fit

The instinct to copy into a warehouse isn’t irrational — it’s copying a playbook that has worked. BigQuery, Snowflake, and Databricks have spent over a decade proving that centralizing data into one queryable system, at petabyte scale, is the right move for tabular, relational data. That part of the lakehouse model is genuinely solved.

What breaks is the data model underneath it. Climate, weather, and satellite data is fundamentally multidimensional — array-shaped, not row-shaped. Earthmover’s own benchmarks make the mismatch concrete: a single 4D weather dataset (longitude × latitude × time × forecast step) needs about 3,500 coordinate values in its native array form. Flatten that same dataset into rows and columns for a warehouse, and the coordinate values alone balloon past 964 billion — roughly a 277-million-fold blowup, before a single measurement is stored. Query performance follows the same pattern: Earthmover measured array-native tools (Xarray/Zarr) answering timeseries and spatial queries over weather data more than 10x faster than an equivalent tabular stack (DuckDB/Parquet).
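The mechanism behind that blowup can be sketched in a few lines. In array form, each dimension stores its coordinate values once, so the total is the sum of the dimension lengths. In a flattened table, every row repeats one value per dimension, so the total is the number of cells times the number of dimensions. The dimension sizes below are assumptions chosen to land near the figures quoted above; they are not taken from Earthmover's benchmark.

```python
from math import prod

# Hypothetical sizes for a 4D weather dataset (illustrative, not the benchmark's actual grid).
dims = {"longitude": 1440, "latitude": 721, "time": 1111, "step": 209}

# Array-native: each coordinate axis is stored once.
native_coords = sum(dims.values())

# Tabular: one row per cell, and every row carries a value for each of the 4 dimensions.
n_rows = prod(dims.values())
flat_coords = n_rows * len(dims)

blowup = flat_coords / native_coords

print(f"native coordinate values:    {native_coords:,}")
print(f"flattened coordinate values: {flat_coords:,}")
print(f"blowup factor:               {blowup:,.0f}x")
```

The gap is sum versus product: adding a dimension or extending an axis grows the array's coordinates additively but the table's coordinates multiplicatively.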

This isn’t a new observation. A decade before the current generation of cloud data warehouses existed, a survey of array storage and query systems put it plainly: scientific data is intrinsically ordered, and “the ubiquitous unordered relational model made popular by business database systems cannot handle massive ordered data optimally.” The architectural gap between arrays and tables isn’t a quirk of any one vendor’s product — database researchers have been documenting it since at least the early 1990s, long before “lakehouse” was a word.

The lakehouse giants didn’t get this wrong — they built for a different shape of data. Climate, satellite, and sensor archives need a system that treats the array as the native unit, not one that flattens it into rows first.

The alternative is not duplicating the data

None of this is necessary if the data never has to leave object storage to become queryable. That’s the premise behind zarr-datafusion: a SQL engine that runs directly against Zarr-native arrays where they already live — no warehouse load, no second copy, no refresh job to maintain. One archive, one source of truth, queried in place.

The duplication tax isn’t a fixed cost of working with large array data. It’s a cost of a specific architectural choice — copy first, query second — and it’s optional.

Explore zarr-datafusion →