Chunking of datasets
How to save a data cube with a desired chunking
A DeepESDL example notebook
This notebook demonstrates how to modify the chunking of a dataset before persisting it.
Please also refer to the DeepESDL documentation and visit the platform's website for further information!
Brockmann Consult, 2024
This notebook runs with the Python environment deepesdl-xcube-1.7.0; please check out the documentation for help on changing the environment.
First, let's create a small cube whose chunking we can then modify. We will use ESA CCI data for this. Please head over to 3 - Generate CCI data cubes to get more details about the xcube-cci data store :)
import datetime
import os
from xcube.core.store import new_data_store
store = new_data_store("ccizarr")
Next, we create a cube containing only 4 days of data:
def open_zarrstore(filename, time_range, variables):
    """Open a zarr dataset from the store and return a temporal and variable subset."""
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]
    return subset
dataset = open_zarrstore(
    "ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.zarr",
    time_range=[datetime.datetime(2015, 10, 1), datetime.datetime(2015, 10, 5)],
    variables=["analysed_sst"],
)
dataset
<xarray.Dataset> Size: 33MB
Dimensions:       (time: 4, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 32B 2015-10-01T12:00:00 ... 2015-10-0...
Data variables:
    analysed_sst  (time, lat, lon) float64 33MB dask.array<chunksize=(4, 720, 720), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
In the example above, we can see that the variable analysed_sst is chunked as (4, 720, 720). This means each chunk contains 4 time steps, 720 lat values, and 720 lon values. Variables whose chunks each hold a single time step and a large spatial extent are optimal for visualising/plotting one time stamp.
For analysing long time series, it is beneficial to chunk a dataset the other way around, so that each chunk contains more values along the time dimension and fewer along the spatial dimensions.
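If you prefer to inspect the chunk layout programmatically rather than reading it off the dataset repr, here is a minimal sketch using xarray's chunksizes property:
# Mapping from dimension name to the chunk sizes along that dimension
dataset["analysed_sst"].chunksizes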
# time optimised chunking - please note, this is just an example
time_chunksize = 1
x_chunksize = 120 # or lon
y_chunksize = 120 # or lat
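To get a feeling for how large such a chunk is in memory, here is a quick back-of-the-envelope sketch assuming float64 data, as for analysed_sst above:
import numpy as np

# Rough per-chunk footprint: values per chunk times bytes per float64 value
chunk_bytes = time_chunksize * y_chunksize * x_chunksize * np.dtype("float64").itemsize
print(f"{chunk_bytes / 1024**2:.2f} MB per chunk")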
Now the chunking is applied to all data variables, skipping crs and any bounds variables if present:
dataset.data_vars
Data variables:
    analysed_sst  (time, lat, lon) float64 33MB dask.array<chunksize=(4, 720, 720), meta=np.ndarray>
for var_name in dataset.data_vars:
    # Skip non-data variables such as crs and any *_bounds variables
    if var_name != "crs" and "_bounds" not in var_name:
        print(var_name)
        dataset[var_name] = dataset[var_name].chunk(
            {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize}
        )
analysed_sst
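As an aside, the whole dataset could also be rechunked in a single call with xarray's Dataset.chunk. This is only a sketch: note that it would also rechunk a crs variable if one were present, which is why the loop above skips it explicitly.
# Alternative: rechunk all variables at once (would also touch crs/bounds variables)
dataset_rechunked = dataset.chunk(
    {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize}
)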
To save a copy of a cube with a specific chunking, the encoding must be adjusted accordingly.
encoding_dict = dict()
We want the coordinate variables to be stored in the most performant way, so we make sure they are not chunked, i.e. each coordinate array is written as a single chunk. This can be specified via the encoding:
coords_encoding = {k: dict(chunks=v.shape) for k, v in dataset.coords.items()}
coords_encoding
{'lat': {'chunks': (720,)}, 'lon': {'chunks': (1440,)}, 'time': {'chunks': (4,)}}
Next, specify the chunking in the data variables' encoding and ensure that empty chunks are not written to disk by adding write_empty_chunks=False. This saves space on disk. Again, crs is skipped if present.
vars_encoding = {
    k: dict(chunks=(time_chunksize, y_chunksize, x_chunksize), write_empty_chunks=False)
    for k, v in dataset.data_vars.items()
    if k != "crs"
}
vars_encoding
{'analysed_sst': {'chunks': (1, 120, 120), 'write_empty_chunks': False}}
Next, combine both dictionaries to form the encoding for the entire dataset.
encoding_dict.update(coords_encoding)
encoding_dict.update(vars_encoding)
encoding_dict
{'lat': {'chunks': (720,)}, 'lon': {'chunks': (1440,)}, 'time': {'chunks': (4,)}, 'analysed_sst': {'chunks': (1, 120, 120), 'write_empty_chunks': False}}
Next, save the cube to the team S3 storage.
To store the cube in your team's user space, please first retrieve the access details from your environment variables as follows:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
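If you are unsure whether these variables are set in your environment, a small defensive check like the following sketch, run before the cell above, gives a clearer error than a bare KeyError:
# Fail early with a readable message if a credential is missing
for name in ("S3_USER_STORAGE_KEY", "S3_USER_STORAGE_SECRET", "S3_USER_STORAGE_BUCKET"):
    if not os.environ.get(name):
        raise RuntimeError(f"Environment variable {name} is not set")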
You need to instantiate an S3 data store pointing to the team bucket:
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
    ),
)
If you have not yet stored any data in your user space, the returned list will be empty:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
output_id = "analysed_sst.zarr"
Now let's write the data to the team S3 storage, remembering to specify the encoding while doing so:
team_store.write_data(dataset, output_id, encoding=encoding_dict, replace=True)
'analysed_sst.zarr'
If you list the content of your data store again, you will now see the newly written dataset in the list:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'analysed_sst.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Let's verify that our chunking has been applied:
ds = team_store.open_data(output_id)
ds
<xarray.Dataset> Size: 33MB
Dimensions:       (time: 4, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 32B 2015-10-01T12:00:00 ... 2015-10-0...
Data variables:
    analysed_sst  (time, lat, lon) float64 33MB dask.array<chunksize=(1, 120, 120), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
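The chunk layout can also be checked programmatically instead of reading it off the repr. A minimal sketch: the expected layout follows from the chunk sizes chosen above and the dimension sizes (time: 4, lat: 720, lon: 1440).
# Each chunk should now hold 1 time step and a 120 x 120 spatial tile
assert dict(ds["analysed_sst"].chunksizes) == {
    "time": (1,) * 4,
    "lat": (120,) * 6,
    "lon": (120,) * 12,
}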
Looks good, now let's clean up the example cube :)
team_store.delete_data(output_id)
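Optionally, confirm that the dataset has been removed by listing the store once more; a short sketch reusing get_data_ids:
# The deleted dataset should no longer appear in the store listing
assert output_id not in list(team_store.get_data_ids())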