Append to existing cube
How to append data to an existing datacube stored in team S3 storage
A DeepESDL example notebook
This notebook demonstrates how to append new data to an existing datacube. This cannot be done using xcube directly yet.
Please also refer to the DeepESDL documentation and visit the platform's website for further information!
Brockmann Consult, 2024
This notebook runs with the Python environment deepesdl-xcube-1.7.0; please check out the documentation for help on changing the environment.
First, let's create a small cube to which we can later append data. We will use ESA CCI data for this. Please head over to 3 - Generate CCI data cubes for more details about the xcube-cci data store :)
import datetime
import os
from xcube.core.store import new_data_store
# instantiate the xcube data store for ESA CCI Zarr data
store = new_data_store("ccizarr")
Next, we create a cube containing only 2 months of data:
def open_zarrstore(filename, time_range, variables):
    # open the dataset from the CCI Zarr store and select the requested
    # time range and variables
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]
    return subset
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 3, 1), datetime.datetime(1998, 4, 30)],
    variables=["Rrs_412"],
)
dataset
<xarray.Dataset> Size: 299MB Dimensions: (time: 2, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 16B 1998-03-01 1998-04-01 Data variables: Rrs_412 (time, lat, lon) float32 299MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... cdm_data_type: Grid comment: See summary attribute creation_date: Tue Jan 31 12:15:15 2023 ... ... time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075 catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078...
Next, save it to the team S3 storage:
To store the cube in your team's user space, first retrieve the credentials from your environment variables as follows:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
You need to instantiate an S3 data store pointing to the team bucket:
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
    ),
)
If you have not stored any data in your user space yet, the returned list will be empty:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Appending data currently only works with the .zarr format; it is not yet supported for .levels.
output_id = "ocean_color.zarr"
team_store.write_data(dataset, output_id, replace=True)
'ocean_color.zarr'
If you list the contents of your data store again, you will now see the newly written dataset:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'ocean_color.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Now, to append new time stamps: xcube cannot be used for this directly yet, but there is a workaround :)
# needed for appending data to an existing cube saved in s3 storage
import s3fs
Connect to your team storage in S3
# Connect to AWS S3 storage
fs = s3fs.S3FileSystem(
    anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
)
s3_client_kwargs = {"endpoint_url": "https://s3.eu-central-1.amazonaws.com"}
target_bucket_path = f"s3://{S3_USER_STORAGE_BUCKET}"
We create a new dataset with different time stamps, which we want to append to the existing cube:
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 5, 1), datetime.datetime(1998, 6, 30)],
    variables=["Rrs_412"],
)
dataset
<xarray.Dataset> Size: 299MB Dimensions: (time: 2, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 16B 1998-05-01 1998-06-01 Data variables: Rrs_412 (time, lat, lon) float32 299MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... cdm_data_type: Grid comment: See summary attribute creation_date: Tue Jan 31 12:15:15 2023 ... ... time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075 catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078...
# create a mapper pointing to the existing cube stored in the team S3 storage
mapper = fs.get_mapper(f"{target_bucket_path}/{output_id}")
Now we can append the new dataset to the existing datacube:
dataset.to_zarr(mapper, mode="a", append_dim="time", consolidated=True)
<xarray.backends.zarr.ZarrStore at 0x7f211d077240>
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'ocean_color.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Check if the cube now contains the expected time stamps:
ds = team_store.open_data(output_id)
As expected, we now find all four time stamps in the datacube. Please note: you are responsible for appending the time stamps in the correct order. If you do not, this may cause trouble later on, and you will need to reorder the time dimension (see the sketch after the output below).
ds
<xarray.Dataset> Size: 597MB Dimensions: (time: 4, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 32B 1998-03-01 1998-04-01 ... 1998-06-01 Data variables: Rrs_412 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078... cdm_data_type: Grid comment: See summary attribute ... ... time_coverage_duration: P1M time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075
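If you ever do append out of order, a minimal sketch of a fix is to reopen the cube, sort it along the time dimension with xarray, and write the sorted result back under a new name (the data id "ocean_color_sorted.zarr" below is a hypothetical example, not part of this notebook's workflow):
# Sketch: reorder the time dimension if time stamps were appended out of order.
# "ocean_color_sorted.zarr" is a hypothetical new data id in the team store.
ds_unsorted = team_store.open_data(output_id)
ds_sorted = ds_unsorted.sortby("time")
team_store.write_data(ds_sorted, "ocean_color_sorted.zarr", replace=True)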
It is also possible to append a variable with the same dimensions to an existing datacube:
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 3, 1), datetime.datetime(1998, 6, 30)],
    variables=["Rrs_443"],
)
dataset
<xarray.Dataset> Size: 597MB Dimensions: (time: 4, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 32B 1998-03-01 1998-04-01 ... 1998-06-01 Data variables: Rrs_443 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... cdm_data_type: Grid comment: See summary attribute creation_date: Tue Jan 31 12:15:15 2023 ... ... time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075 catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078...
Now we can append the new dataset with the additional variable to the existing datacube:
dataset.to_zarr(mapper, mode="a", consolidated=True)
<xarray.backends.zarr.ZarrStore at 0x7f211d0aadc0>
Check if the cube now contains the expected new variable:
ds = team_store.open_data(output_id)
ds
<xarray.Dataset> Size: 1GB Dimensions: (time: 4, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 32B 1998-03-01 1998-04-01 ... 1998-06-01 Data variables: Rrs_412 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Rrs_443 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078... cdm_data_type: Grid comment: See summary attribute ... ... time_coverage_duration: P1M time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075
In other use cases the new variable may arrive with a different chunking. You can ensure the desired chunking before writing the data to a cube by specifying it in the encoding. To learn about chunking, head over to the example notebook 10 - Chunking of Datasets.
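As a minimal sketch, the encoding could be passed to to_zarr like this. The chunk sizes are assumptions matching the cube created above, and this only applies while the variable is being written to the store for the first time, so the write call is shown commented out here:
# Sketch: specify the chunking of a new variable via its encoding.
# Chunk sizes are assumptions matching the cube created above; this only
# applies when the variable is written to the store for the first time.
encoding = {"Rrs_443": {"chunks": (1, 2160, 2160)}}
# e.g. dataset.to_zarr(mapper, mode="a", consolidated=True, encoding=encoding)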
Alright, now you know how to append new time stamps or variables to an existing cube - let's clean up our example :)
team_store.delete_data(output_id)
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']