Chunking of datasets
How to save a data cube with a desired chunking
A DeepESDL example notebook
This notebook demonstrates how to modify the chunking of a dataset before persisting it.
Please also refer to the DeepESDL documentation and visit the platform's website for further information!
Brockmann Consult, 2024
This notebook runs with the Python environment deepesdl-xcube-1.7.0; please check out the documentation for help on changing the environment.
First, let's create a small cube whose chunking we can then modify. We will use ESA CCI data for this. Please head over to 3 - Generate CCI data cubes to get more details about the xcube-cci data store :)
import datetime
import os
from xcube.core.store import new_data_store
store = new_data_store("ccizarr")
Next, we create a cube containing only 4 days of data:
def open_zarrstore(filename, time_range, variables):
    """Open a zarr dataset from the store and return a temporal and variable subset."""
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]
    return subset
dataset = open_zarrstore(
    "ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.zarr",
    time_range=[datetime.datetime(2015, 10, 1), datetime.datetime(2015, 10, 5)],
    variables=["analysed_sst"],
)
dataset
<xarray.Dataset> Size: 33MB
Dimensions:       (time: 4, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 32B 2015-10-01T12:00:00 ... 2015-10-0...
Data variables:
    analysed_sst  (time, lat, lon) float64 33MB dask.array<chunksize=(4, 720, 720), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
In the example above, we can see that the variable analysed_sst is chunked as (4, 720, 720). This means each chunk contains 4 time steps, 720 lat values, and 720 lon values. Variables whose chunks each hold a single time step and a large spatial extent are optimal for visualising/plotting one time stamp.
For analysing long time series, it is beneficial to chunk a dataset the other way around, so that each chunk contains more values along the time dimension and fewer along the spatial dimensions.
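If you prefer to inspect the chunk layout programmatically rather than reading it off the dataset repr, here is a minimal sketch using xarray's chunksizes property:
# Mapping from dimension name to the chunk sizes along that dimension
dataset["analysed_sst"].chunksizes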
# time optimised chunking - please note, this is just an example
time_chunksize = 1
x_chunksize = 120 # or lon
y_chunksize = 120 # or lat
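To get a feeling for how large such a chunk is in memory, here is a quick back-of-the-envelope sketch assuming float64 data, as for analysed_sst above:
import numpy as np

# Rough per-chunk footprint: values per chunk times bytes per float64 value
chunk_bytes = time_chunksize * y_chunksize * x_chunksize * np.dtype("float64").itemsize
print(f"{chunk_bytes / 1024**2:.2f} MB per chunk")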
Now the chunking is applied to all data variables, skipping crs and any bounds variables if present:
dataset.data_vars
Data variables:
    analysed_sst  (time, lat, lon) float64 33MB dask.array<chunksize=(4, 720, 720), meta=np.ndarray>
for var_name in dataset.data_vars:
    # Skip non-data variables such as crs and any *_bounds variables
    if var_name != "crs" and "_bounds" not in var_name:
        print(var_name)
        dataset[var_name] = dataset[var_name].chunk(
            {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize}
        )
analysed_sst
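As an aside, the whole dataset could also be rechunked in a single call with xarray's Dataset.chunk. This is only a sketch: note that it would also rechunk a crs variable if one were present, which is why the loop above skips it explicitly.
# Alternative: rechunk all variables at once (would also touch crs/bounds variables)
dataset_rechunked = dataset.chunk(
    {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize}
)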
To save a copy of a cube with a specific chunking, the encoding must be adjusted accordingly.
encoding_dict = dict()
We want the coordinate variables to be stored in the most performant way, so we make sure they are not chunked, i.e. each coordinate array is written as a single chunk. This can be specified via the encoding:
coords_encoding = {k: dict(chunks=v.shape) for k, v in dataset.coords.items()}
coords_encoding
{'lat': {'chunks': (720,)}, 'lon': {'chunks': (1440,)}, 'time': {'chunks': (4,)}}
Next, specify the chunking in the data variables' encoding and ensure that empty chunks are not written to disk by adding write_empty_chunks=False. This saves space on disk. Again, crs is skipped if present.
vars_encoding = {
    k: dict(chunks=(time_chunksize, y_chunksize, x_chunksize), write_empty_chunks=False)
    for k, v in dataset.data_vars.items()
    if k != "crs"
}
vars_encoding
{'analysed_sst': {'chunks': (1, 120, 120), 'write_empty_chunks': False}}
Next, combine both dictionaries to form the encoding for the entire dataset.
encoding_dict.update(coords_encoding)
encoding_dict.update(vars_encoding)
encoding_dict
{'lat': {'chunks': (720,)}, 'lon': {'chunks': (1440,)}, 'time': {'chunks': (4,)}, 'analysed_sst': {'chunks': (1, 120, 120), 'write_empty_chunks': False}}
Next, save the cube to the team S3 storage.
To store the cube in your team's user space, please first retrieve the access details from your environment variables as follows:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
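If you are unsure whether these variables are set in your environment, a small defensive check like the following sketch, run before the cell above, gives a clearer error than a bare KeyError:
# Fail early with a readable message if a credential is missing
for name in ("S3_USER_STORAGE_KEY", "S3_USER_STORAGE_SECRET", "S3_USER_STORAGE_BUCKET"):
    if not os.environ.get(name):
        raise RuntimeError(f"Environment variable {name} is not set")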
You need to instantiate an S3 data store pointing to the team bucket:
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
    ),
)
If you have not yet stored any data in your user space, the returned list will be empty:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
output_id = "analysed_sst.zarr"
Now let's write the data to the team S3 storage, remembering to specify the encoding while doing so:
team_store.write_data(dataset, output_id, encoding=encoding_dict, replace=True)
'analysed_sst.zarr'
If you list the content of your data store again, you will now see the newly written dataset in the list:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'analysed_sst.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Let's verify that our chunking has been applied:
ds = team_store.open_data(output_id)
ds
<xarray.Dataset> Size: 33MB
Dimensions:       (time: 4, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 32B 2015-10-01T12:00:00 ... 2015-10-0...
Data variables:
    analysed_sst  (time, lat, lon) float64 33MB dask.array<chunksize=(1, 120, 120), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
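The chunk layout can also be checked programmatically instead of reading it off the repr. A minimal sketch: the expected layout follows from the chunk sizes chosen above and the dimension sizes (time: 4, lat: 720, lon: 1440).
# Each chunk should now hold 1 time step and a 120 x 120 spatial tile
assert dict(ds["analysed_sst"].chunksizes) == {
    "time": (1,) * 4,
    "lat": (120,) * 6,
    "lon": (120,) * 12,
}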
Looks good, now let's clean up the example cube :)
team_store.delete_data(output_id)
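Optionally, confirm that the dataset has been removed by listing the store once more; a short sketch reusing get_data_ids:
# The deleted dataset should no longer appear in the store listing
assert output_id not in list(team_store.get_data_ids())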