Create a Zarr Datacube with Slices from Sentinel Hub
Please note:
In order to access data from Sentinel Hub, you need Sentinel Hub API credentials. They may be passed as store parameters (see further below) or read from environment variables. If you have not exported them already, you may also set them by uncommenting the cell below and adjusting it to your access credentials. However, we do not recommend this method!
- Within DeepESDL there is the possibility to apply for Sentinel Hub access - please contact the DeepESDL team :)
- This notebook runs with the Python environment deepesdl-xcube-1.7.0; please check out the documentation for help on changing the environment.
import os
import numpy as np
import pandas as pd
import xarray as xr
# mandatory imports
from xcube.core.store import find_data_store_extensions
from xcube.core.store import get_data_store_params_schema
from xcube.core.store import new_data_store
from zappend.api import zappend
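If you do use the in-notebook method, uncomment and adapt the following cell. The environment variable names SH_CLIENT_ID and SH_CLIENT_SECRET are assumed here to be the ones read by xcube-sh; adjust them if your setup uses different names.
# Not recommended: hard-coding credentials in a notebook. If you fill these in,
# make sure you never share or commit the notebook with the values included.
# os.environ["SH_CLIENT_ID"] = "your Sentinel Hub OAuth client id"
# os.environ["SH_CLIENT_SECRET"] = "your Sentinel Hub OAuth client secret"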
Access data from Sentinel Hub
The xcube plugin xcube-sh provides a data store named sentinelhub that can be used to access Sentinel data in the form of 3-d datacubes of type xarray.Dataset.
sh_store = new_data_store('sentinelhub', num_retries=400)
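If you prefer to pass the credentials as store parameters rather than environment variables, you can inspect which parameters the sentinelhub store accepts. A small sketch using the generic helper imported above:
# Show the parameter schema of the 'sentinelhub' data store,
# which includes the credential-related store parameters.
get_data_store_params_schema('sentinelhub')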
Which datasets are supported by this data store?
sh_store.list_data_ids()
['S2L1C', 'S1GRD', 'S2L2A', 'DEM']
Describe the S2L2A dataset. The open_params_schema describes the parameters that can be passed to the data store's open_data() method.
sh_store.describe_data('S2L2A')
<xcube.core.store.descriptor.DatasetDescriptor at 0x7fb102d20a50>
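You can also ask the store directly which parameters its open_data() method accepts for this dataset. A sketch using the store's get_open_data_params_schema() method:
# Show the schema of the open parameters accepted for 'S2L2A'.
sh_store.get_open_data_params_schema('S2L2A')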
Let's open a subset of this dataset to familiarize ourselves with its representation as an xarray.Dataset. The dataset is opened lazily, that is, data is only loaded at the time it is needed:
s2_cube = sh_store.open_data(
    'S2L2A',
    variable_names=['B02', 'B03', 'B04'],
    tile_size=[1024, 1024],
    spatial_res=0.00018,  # = 20.038 meters in degree
    bbox=[10.0, 54.27, 11.0, 54.6],
    time_range=['2023-07-08', '2023-07-12'],
    time_period='1D'
)
s2_cube
<xarray.Dataset> Size: 755MB
Dimensions:    (time: 5, lat: 2048, lon: 6144, bnds: 2)
Coordinates:
  * lat        (lat) float64 16kB 54.64 54.64 54.64 54.64 ... 54.27 54.27 54.27
  * lon        (lon) float64 49kB 10.0 10.0 10.0 10.0 ... 11.11 11.11 11.11
  * time       (time) datetime64[ns] 40B 2023-07-08T12:00:00 ... 2023-07-12T1...
    time_bnds  (time, bnds) datetime64[ns] 80B dask.array<chunksize=(5, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    B02        (time, lat, lon) float32 252MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    B03        (time, lat, lon) float32 252MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    B04        (time, lat, lon) float32 252MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
Attributes: (12/13)
    Conventions:               CF-1.7
    title:                     S2L2A Data Cube Subset
    history:                   [{'program': 'xcube_sh.chunkstore.SentinelHubC...
    date_created:              2024-09-11T08:47:04.855222
    time_coverage_start:       2023-07-08T00:00:00+00:00
    time_coverage_end:         2023-07-13T00:00:00+00:00
    ...                        ...
    time_coverage_resolution:  P1DT0H0M0S
    geospatial_lon_min:        10.0
    geospatial_lat_min:        54.27
    geospatial_lon_max:        11.10592
    geospatial_lat_max:        54.63864
    processing_level:          L2A
s2_cube.B03.isel(time=0).plot.imshow(vmax=0.2, cmap="bone")
<matplotlib.image.AxesImage at 0x7fb10012f690>
The dataset s2_cube could be written to Zarr simply by calling s2_cube.to_zarr(path) (shown for reference after the list below), but that approach is not appropriate for larger datacubes for several reasons:
- Sentinel Hub may throttle bandwidth due to the many concurrent API requests it receives for every chunk of our datacube that is written.
- Your Sentinel Hub subscription may run out of credits.
- Network errors while reading or filesystem errors while writing may occur unexpectedly at any time.
- Potential, yet undiscovered memory leaks in the involved libraries may lead to out-of-memory conditions.
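For reference, the naive approach would be a one-liner. It is shown commented out here, because running it would trigger all Sentinel Hub requests at once (the target path is just a placeholder):
# Naive approach (not recommended for larger cubes): writing the cube directly
# issues a Sentinel Hub request for every chunk while it is being written.
# s2_cube.to_zarr('s2-demo-cube-naive.zarr')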
Therefore, we will introduce the tool zappend, which can avoid or at least mitigate some of the issues above. Before that, let's look at the team storage where you can persist your datacubes.
Access team storage
The DeepESDL team storage is a special writable bucket in AWS S3 object storage that is used by your team for shared access to datasets.
The xcube software provides a data store of type s3 to access datasets in object storage.
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False,
        key=S3_USER_STORAGE_KEY,
        secret=S3_USER_STORAGE_SECRET
    )
)
Which datasets are already in the team store?
team_store.list_data_ids()
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
If you need to delete datasets from team storage, use the has_data() and delete_data() store methods:
if team_store.has_data('ERA5_example.levels'):
    team_store.delete_data('ERA5_example.levels')
Generate a larger Zarr datacube from Sentinel data
As announced, we will utilize the zappend tool for writing larger datacubes into team storage. The objective of zappend is to address recurring memory issues when generating large geospatial datacubes in the Zarr format by successively concatenating data slices along an append dimension, e.g., time (the default) for geospatial satellite observations. Each append step is atomic, that is, the append operation is a transaction that is rolled back if it fails. This ensures the integrity of the target datacube. Find out more in the zappend documentation.
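As a minimal, self-contained sketch of the call pattern (reusing the imports from the top of this notebook and two tiny in-memory slices, so no Sentinel Hub data is involved), zappend simply takes a sequence of slices and a target:
# Minimal zappend sketch: two tiny in-memory slices are appended along the
# default append dimension 'time' into a small local Zarr (safe to delete).
def _tiny_slice(day):
    time = pd.to_datetime([f"2023-06-{day:02d}"])
    data = xr.DataArray(
        np.zeros((1, 4, 4), dtype="float32"),
        dims=("time", "lat", "lon"),
        coords={"time": time},
    )
    return xr.Dataset({"B02": data})

zappend(
    slices=[_tiny_slice(1), _tiny_slice(2)],
    target_dir="zappend-sketch.zarr",
)
The real datacube below is produced in the same way, only with the slices streamed from Sentinel Hub and the target located in team storage.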
zappend can be used in a number of ways. In the following example we read smaller Sentinel datacubes, the dataset slices, from the sh_store described above.
We use a Python generator function that provides a stream of dataset slices. The slice stream is passed to zappend, allowing it to process the slices sequentially and to open a slice dataset only when it is its turn. Another advantage of this approach is that a consumed slice dataset vanishes from memory, as no references to it are kept anywhere.
Our datacube specification:
variables = ['B02', 'B03', 'B04']
x1 = 10.00 # degree
y1 = 54.27 # degree
x2 = 11.00 # degree
y2 = 54.60 # degree
bbox = x1, y1, x2, y2
spatial_res = 0.00018 # = 20.038 meters in degree
time_range = '2023-06-01', '2023-06-05'  # for testing; adjust as needed!
time_period = '1D'
target_zarr = 's2-demo-cube.zarr'
Delete an existing target datacube:
if team_store.has_data(target_zarr):
team_store.delete_data(target_zarr)
This is the aforementioned generator function that provides the stream of slice datasets:
def generate_slices(sh_store, time_ranges, variables, bbox, spatial_res, time_period):
    for time_range in time_ranges:
        print(f'Opening slice dataset for {time_range}')
        slice_ds = sh_store.open_data(
            'S2L2A',
            variable_names=variables,
            tile_size=[1024, 1024],
            spatial_res=spatial_res,
            bbox=bbox,
            time_range=time_range,
            time_period=time_period
        )
        print(f'Opened slice dataset of size {slice_ds.nbytes / (10 ** 9)} GB')
        # Note that you can modify slice_ds in any way you want here,
        # e.g. you could add computed variables or update dataset attributes.
        yield slice_ds
Define a function that creates a list of time ranges, each having the duration given by time_period:
def get_time_ranges(time_range, time_period):
    interval_td = pd.Timedelta(time_period)
    one_sec = pd.Timedelta("1s")
    start_date, stop_date = pd.to_datetime(time_range)
    dates = pd.date_range(start_date, stop_date + interval_td, freq=interval_td)

    def to_date_str(date):
        return date.strftime("%Y-%m-%d")

    return [
        (to_date_str(dates[i]), to_date_str(dates[i + 1] - one_sec))
        for i in range(len(dates) - 1)
    ]
time_ranges = get_time_ranges(time_range, time_period)
# Test it
time_ranges
[('2023-06-01', '2023-06-01'), ('2023-06-02', '2023-06-02'), ('2023-06-03', '2023-06-03'), ('2023-06-04', '2023-06-04'), ('2023-06-05', '2023-06-05')]
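Optionally, you can pull a single slice from a throw-away generator instance to check that everything is wired up correctly; opening is lazy, so no pixel data is transferred yet (the test_slices name is just for this check):
# Optional sanity check: open only the first (lazy) slice from a separate
# generator instance; the generator used by zappend below is created freshly.
test_slices = generate_slices(sh_store, time_ranges, variables, bbox, spatial_res, time_period)
next(test_slices)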
# Remember: no data is read here; the generator produces a lazy slice stream!
slices = generate_slices(sh_store, time_ranges, variables, bbox, spatial_res, time_period)
zappend(
    slices=slices,
    target_dir=f"s3://{S3_USER_STORAGE_BUCKET}/{target_zarr}",
    target_storage_options={
        "anon": False,
        "key": S3_USER_STORAGE_KEY,
        "secret": S3_USER_STORAGE_SECRET
    }
)
Opening slice dataset for ('2023-06-01', '2023-06-01')
Opened slice dataset of size 0.151060504 GB
Opening slice dataset for ('2023-06-02', '2023-06-02')
Opened slice dataset of size 0.151060504 GB
Opening slice dataset for ('2023-06-03', '2023-06-03')
Opened slice dataset of size 0.151060504 GB
Opening slice dataset for ('2023-06-04', '2023-06-04')
Opened slice dataset of size 0.151060504 GB
Opening slice dataset for ('2023-06-05', '2023-06-05')
Opened slice dataset of size 0.151060504 GB
list(team_store.get_data_ids())
['SST.levels', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'noise_trajectory.zarr', 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr', 's2-demo-cube.zarr']
Open generated cube from team storage
s2_cube = team_store.open_data(target_zarr)
s2_cube
<xarray.Dataset> Size: 755MB
Dimensions:    (time: 5, lat: 2048, lon: 6144, bnds: 2)
Coordinates:
  * lat        (lat) float64 16kB 54.64 54.64 54.64 54.64 ... 54.27 54.27 54.27
  * lon        (lon) float64 49kB 10.0 10.0 10.0 10.0 ... 11.11 11.11 11.11
  * time       (time) datetime64[ns] 40B 2023-06-01T12:00:00 ... 2023-06-05T1...
    time_bnds  (time, bnds) datetime64[ns] 80B dask.array<chunksize=(1, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    B02        (time, lat, lon) float32 252MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    B03        (time, lat, lon) float32 252MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    B04        (time, lat, lon) float32 252MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
Attributes: (12/13)
    Conventions:               CF-1.7
    date_created:              2024-09-11T08:47:12.058769
    geospatial_lat_max:        54.63864
    geospatial_lat_min:        54.27
    geospatial_lon_max:        11.10592
    geospatial_lon_min:        10.0
    ...                        ...
    processing_level:          L2A
    time_coverage_duration:    P1DT0H0M0S
    time_coverage_end:         2023-06-02T00:00:00+00:00
    time_coverage_resolution:  P1DT0H0M0S
    time_coverage_start:       2023-06-01T00:00:00+00:00
    title:                     S2L2A Data Cube Subset
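As a quick visual check, the same plotting call as before can be applied to the cube read back from team storage:
# Plot one time step of the cube that was just read back from team storage.
s2_cube.B03.isel(time=0).plot.imshow(vmax=0.2, cmap="bone")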
Looks good! Now let's clean up the example cube :)
team_store.delete_data(target_zarr)