Append to existing cube
How to append data to an existing datacube stored in team S3 storage
A DeepESDL example notebook
This notebook demonstrates how to append new data to an existing datacube. This cannot be done using xcube directly yet.
Please also refer to the DeepESDL documentation and visit the platform's website for further information!
Brockmann Consult, 2024
This notebook runs with the Python environment deepesdl-xcube-1.7.0; please check out the documentation for help on changing the environment.
First, let's create a small cube to which we can later append data. We will use ESA CCI data for this. Please head over to 3 - Generate CCI data cubes for more details about the xcube-cci data store :)
import datetime
import os
from xcube.core.store import new_data_store
# instantiate the xcube data store for ESA CCI Zarr data
store = new_data_store("ccizarr")
Next, we create a cube containing only 2 months of data:
def open_zarrstore(filename, time_range, variables):
    # open the dataset from the CCI Zarr store and select the requested
    # time range and variables
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]
    return subset
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 3, 1), datetime.datetime(1998, 4, 30)],
    variables=["Rrs_412"],
)
dataset
<xarray.Dataset> Size: 299MB Dimensions: (time: 2, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 16B 1998-03-01 1998-04-01 Data variables: Rrs_412 (time, lat, lon) float32 299MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... cdm_data_type: Grid comment: See summary attribute creation_date: Tue Jan 31 12:15:15 2023 ... ... time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075 catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078...
Next, save it to the team S3 storage:
To store the cube in your team's user space, first retrieve the credentials from your environment variables as follows:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
You need to instantiate an S3 data store pointing to the team bucket:
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
    ),
)
If you have not stored any data in your user space yet, the returned list will be empty:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Appending data currently only works with the .zarr format; it is not yet supported for .levels.
output_id = "ocean_color.zarr"
team_store.write_data(dataset, output_id, replace=True)
'ocean_color.zarr'
If you list the contents of your data store again, you will now see the newly written dataset:
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'ocean_color.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Now, to append new time stamps: xcube cannot be used for this directly yet, but there is a workaround :)
# needed for appending data to an existing cube saved in s3 storage
import s3fs
Connect to your team storage in S3
# Connect to AWS S3 storage
fs = s3fs.S3FileSystem(
    anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
)
s3_client_kwargs = {"endpoint_url": "https://s3.eu-central-1.amazonaws.com"}
target_bucket_path = f"s3://{S3_USER_STORAGE_BUCKET}"
We create a new dataset with different time stamps, which we want to append to the existing cube:
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 5, 1), datetime.datetime(1998, 6, 30)],
    variables=["Rrs_412"],
)
dataset
<xarray.Dataset> Size: 299MB Dimensions: (time: 2, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 16B 1998-05-01 1998-06-01 Data variables: Rrs_412 (time, lat, lon) float32 299MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... cdm_data_type: Grid comment: See summary attribute creation_date: Tue Jan 31 12:15:15 2023 ... ... time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075 catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078...
# create a mapper pointing to the existing cube stored in the team S3 storage
mapper = fs.get_mapper(f"{target_bucket_path}/{output_id}")
Now we can append the new dataset to the existing datacube:
dataset.to_zarr(mapper, mode="a", append_dim="time", consolidated=True)
<xarray.backends.zarr.ZarrStore at 0x7f211d077240>
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'ocean_color.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Check if the cube now contains the expected time stamps:
ds = team_store.open_data(output_id)
As expected, we now find all four time stamps in the datacube. Please note: you are responsible for appending the time stamps in the correct order. If you do not, this may cause trouble later on, and you will need to reorder the time dimension (see the sketch after the output below).
ds
<xarray.Dataset> Size: 597MB Dimensions: (time: 4, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 32B 1998-03-01 1998-04-01 ... 1998-06-01 Data variables: Rrs_412 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078... cdm_data_type: Grid comment: See summary attribute ... ... time_coverage_duration: P1M time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075
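If you ever do append out of order, a minimal sketch of a fix is to reopen the cube, sort it along the time dimension with xarray, and write the sorted result back under a new name (the data id "ocean_color_sorted.zarr" below is a hypothetical example, not part of this notebook's workflow):
# Sketch: reorder the time dimension if time stamps were appended out of order.
# "ocean_color_sorted.zarr" is a hypothetical new data id in the team store.
ds_unsorted = team_store.open_data(output_id)
ds_sorted = ds_unsorted.sortby("time")
team_store.write_data(ds_sorted, "ocean_color_sorted.zarr", replace=True)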
It is also possible to append a variable with the same dimensions to an existing datacube:
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 3, 1), datetime.datetime(1998, 6, 30)],
    variables=["Rrs_443"],
)
dataset
<xarray.Dataset> Size: 597MB Dimensions: (time: 4, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 32B 1998-03-01 1998-04-01 ... 1998-06-01 Data variables: Rrs_443 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... cdm_data_type: Grid comment: See summary attribute creation_date: Tue Jan 31 12:15:15 2023 ... ... time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075 catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078...
Now we can append the new dataset with the additional variable to the existing datacube:
dataset.to_zarr(mapper, mode="a", consolidated=True)
<xarray.backends.zarr.ZarrStore at 0x7f211d0aadc0>
Check if the cube now contains the expected new variable:
ds = team_store.open_data(output_id)
ds
<xarray.Dataset> Size: 1GB Dimensions: (time: 4, lat: 4320, lon: 8640) Coordinates: * lat (lat) float64 35kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float64 69kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 * time (time) datetime64[ns] 32B 1998-03-01 1998-04-01 ... 1998-06-01 Data variables: Rrs_412 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Rrs_443 (time, lat, lon) float32 597MB dask.array<chunksize=(1, 2160, 2160), meta=np.ndarray> Attributes: (12/52) Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 NCO: netCDF Operators version 4.7.5 (Homepa... catalogue_url: https://catalogue.ceda.ac.uk/uuid/a078... cdm_data_type: Grid comment: See summary attribute ... ... time_coverage_duration: P1M time_coverage_end: 20221201T000000Z time_coverage_resolution: P1M time_coverage_start: 19970904T000000Z title: ESA CCI Ocean Colour Product tracking_id: f7c46c10-67de-460c-bdce-6de91ad32075
In other use cases the new variable may arrive with a different chunking. You can ensure the desired chunking before writing the data to a cube by specifying it in the encoding. To learn about chunking, head over to the example notebook 10 - Chunking of Datasets.
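As a minimal sketch, the encoding could be passed to to_zarr like this. The chunk sizes are assumptions matching the cube created above, and this only applies while the variable is being written to the store for the first time, so the write call is shown commented out here:
# Sketch: specify the chunking of a new variable via its encoding.
# Chunk sizes are assumptions matching the cube created above; this only
# applies when the variable is written to the store for the first time.
encoding = {"Rrs_443": {"chunks": (1, 2160, 2160)}}
# e.g. dataset.to_zarr(mapper, mode="a", consolidated=True, encoding=encoding)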
Alright, now you know how to append new time stamps or variables to an existing cube - let's clean up our example :)
team_store.delete_data(output_id)
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']