How to ...
This section provides practical examples for common xcube-multistore workflows.
It covers configuring storage, discovering available datasets, adding new data
sources, and adjusting spatial and temporal properties of generated data cubes.
How to Change the Location where Final Data Cubes are Stored
The final storage location for generated data cubes is defined by the user in the
data_stores section, as described in the configuration schema.
The storage location can be changed to a local directory using the file data store:
- identifier: storage
  store_id: file
  store_params:
    root: <relative-path>
where <relative-path> is the path relative to the directory from which the
Multi-Source Data Store is executed.
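As a minimal illustration of what "relative to the execution directory" means, the sketch below resolves a root against the current working directory using only the Python standard library. The function name resolve_storage_root and the value "data" are hypothetical and not part of xcube-multistore; the library's own path handling may differ in detail.

```python
from pathlib import Path


def resolve_storage_root(root: str) -> Path:
    """Resolve a store root against the current working directory.

    Absolute paths are returned unchanged (apart from normalization).
    """
    return (Path.cwd() / root).resolve()


# "data" stands in for <relative-path>; it is a hypothetical value.
print(resolve_storage_root("data"))
```

Running the same configuration from a different directory therefore yields a different absolute storage location.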
Alternatively, the final data cube can be stored in an S3-compatible object storage using:
- identifier: storage
  store_id: s3
  store_params:
    root: <s3-bucket>
    storage_options:
      anon: False
      key: <S3-key>
      secret: <S3-secret>
For additional S3 configuration options (e.g. custom endpoints or other object-storage settings), refer to MultiSourceDataStore.get_data_store_params_schema("s3").
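To avoid writing credentials into a file by hand, the S3 entry above can be assembled in Python from environment variables. The following is a sketch only: the helper s3_store_entry, the variable names S3_KEY and S3_SECRET, and the bucket name are assumptions made for illustration and are not defined by xcube-multistore.

```python
import os


def s3_store_entry(identifier, bucket, env=os.environ):
    """Build a data store entry mirroring the YAML structure shown above.

    The environment variable names S3_KEY and S3_SECRET are hypothetical.
    """
    key = env.get("S3_KEY")
    secret = env.get("S3_SECRET")
    if not key or not secret:
        raise KeyError("S3_KEY and S3_SECRET must both be set")
    return {
        "identifier": identifier,
        "store_id": "s3",
        "store_params": {
            "root": bucket,
            "storage_options": {"anon": False, "key": key, "secret": secret},
        },
    }


# Example with an explicit mapping instead of the real environment.
entry = s3_store_entry(
    "storage", "my-bucket", env={"S3_KEY": "abc", "S3_SECRET": "xyz"}
)
print(entry["store_params"]["root"])
```

The resulting dictionary has the same shape as the YAML entry and could be written into the data_stores section of the configuration.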
How to List All Available Datasets within a Data Store
Use the auxiliary function MultiSourceDataStore.list_data_ids
to retrieve all available dataset identifiers (data_id) from one or more data stores.
In the example below, we list the available data IDs for two different data stores:
store_params = {
    "cds": {
        "endpoint_url": "https://cds.climate.copernicus.eu/api",
        "cds_api_key": "<cds-api-key>"
    },
    "stac": {
        "url": "https://s3.gfz-potsdam.de/dog.atlaseo-glm.eo-gridded-data/collections/public/catalog.json"
    }
}
MultiSourceDataStore.list_data_ids(store_params)
Note that some data stores provide hundreds of datasets, which can make it difficult
to identify the desired one. Therefore, xcube-multistore also provides functionality
to search for specific datasets using MultiSourceDataStore.search_data_ids.
The available search parameters for a specific data store can be inspected with
MultiSourceDataStore.get_search_params_schema.
An example demonstrating the use of these functions is available in setup_config.ipynb.
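When a listing is already at hand, a quick client-side filter can also narrow it down. The sketch below is plain Python and independent of search_data_ids. It assumes, purely for illustration, that the listing is a dictionary mapping each store name to a list of data IDs; the actual return type of list_data_ids is not specified in this section, and the sample IDs are illustrative.

```python
from fnmatch import fnmatchcase


def filter_data_ids(ids_by_store, pattern):
    """Keep only the data IDs that match a case-insensitive glob pattern."""
    pattern = pattern.lower()
    return {
        store: [
            data_id for data_id in ids
            if fnmatchcase(data_id.lower(), pattern)
        ]
        for store, ids in ids_by_store.items()
    }


# Hypothetical listing; the real result of list_data_ids may look different.
listing = {
    "cds": [
        "reanalysis-era5-land",
        "reanalysis-era5-single-levels",
        "satellite-sea-ice-thickness",
    ],
    "stac": [
        "biomass-carbon/SAATCHI_BIOMASS/SAATCHI_BIOMASS_v2.0/SAATCHI_BIOMASS_v2.0.json",
    ],
}

matches = filter_data_ids(listing, "*era5*")
print(matches)
```

Unlike a store-side search, this only matches on the identifier string and does not use any dataset metadata.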
How to Add New Datasets
New datasets can be added in the datasets section of the configuration file. Each
dataset definition must follow the structure of a dataset object.
This section demonstrates two common workflows:
- Opening a dataset using the data store's native opening parameters and capabilities.
- Accessing a static Zarr data cube and resampling it to a user-defined grid using the Multi-Source Data Store's built-in resampling functionality.
A Sentinel-2 L2A Data Cube Using Opening Parameters
The example below shows how to configure access to a Sentinel-2 L2A dataset hosted on the Planetary Computer.
First, add the corresponding data store:
data_stores:
  - identifier: stac-ardc
    store_id: stac-pc-ardc
The identifier is a user-defined name that is referenced throughout the configuration.
The store_id refers to the underlying xcube data store implementation.
All available (installed) data store IDs can be listed using:
MultiSourceDataStore.list_data_store_ids()
The parameters accepted by a given data store can be inspected with:
MultiSourceDataStore.get_data_store_params_schema("<store-id>")
For example, to access Sentinel-2 data via the Copernicus Data Space Ecosystem (CDSE) STAC service instead of the Planetary Computer, the configuration could look as follows:
data_stores:
  - identifier: stac-ardc
    store_id: stac-cdse-ardc
    store_params:
      key: <CDSE-S3-key>
      secret: <CDSE-S3-secret>
Next, add the dataset definition to the datasets section:
datasets:
  - identifier: sen2l2a
    store: stac-ardc
    data_id: sentinel-2-l2a
    open_params:
      bbox:
        - 9.7
        - 53.3
        - 10.3
        - 53.8
      time_range:
        - 2020-07-15
        - 2020-08-01
      spatial_res: 0.00008983  # approx. 10 m in degrees (10 / 111320)
      crs: EPSG:4326
The identifier must be unique among all datasets. It is also used as the output
filename (data ID) when storing the final data cube.
The store parameter references the configured data store identifier, while data_id
specifies the dataset to open within that store.
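The "10 m in degrees" value used for spatial_res above comes from dividing the resolution in metres by the approximate length of one degree (about 111320 m). The sketch below reproduces that arithmetic and estimates the resulting grid size for the example bounding box. The helper name is hypothetical, and the actual grid produced by the data store may differ by a few pixels depending on how it aligns the bounding box.

```python
METERS_PER_DEGREE = 111_320  # approximate length of one degree of latitude


def meters_to_degrees(res_m: float) -> float:
    """Convert a resolution in metres to an approximate resolution in degrees."""
    return res_m / METERS_PER_DEGREE


bbox = (9.7, 53.3, 10.3, 53.8)  # west, south, east, north (from the example)
res_deg = meters_to_degrees(10)

# Approximate number of grid cells along each axis.
width = round((bbox[2] - bbox[0]) / res_deg)
height = round((bbox[3] - bbox[1]) / res_deg)

print(f"spatial_res = {res_deg:.8f} degrees")
print(f"grid size   = {width} x {height} pixels")
```

Note that this conversion holds for latitude; at 53° N one degree of longitude covers a shorter distance, so cells of equal size in degrees are narrower than 10 m in the east-west direction.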
Available dataset identifiers can be listed using:
MultiSourceDataStore.list_data_ids(...)
Supported opening parameters for a specific dataset can be inspected with:
MultiSourceDataStore.get_open_data_params_schema(...)
An example demonstrating the use of auxiliary functions is available in setup_config.ipynb.
A Static Dataset and Resampling It
The following example demonstrates how to access a dataset from the EO Forest STAC and resample it to the grid of the Sentinel-2 data cube defined in the previous section.
First, configure the STAC data store:
data_stores:
  - identifier: store_foresteo
    store_id: stac
    store_params:
      url: https://s3.gfz-potsdam.de/dog.atlaseo-glm.eo-gridded-data/collections/public/catalog.json
Next, add the dataset definition:
datasets:
  - identifier: AU-How/SAATCHI_BIOMASS
    store: store_foresteo
    data_id: biomass-carbon/SAATCHI_BIOMASS/SAATCHI_BIOMASS_v2.0/SAATCHI_BIOMASS_v2.0.json
    grid_mapping: sen2l2a
The available data_id values can be explored using:
MultiSourceDataStore.list_data_ids(...)
The grid_mapping parameter references the identifier of another dataset or grid
mapping definition. In this example, the biomass dataset is reprojected and resampled
to the grid of the sen2l2a dataset.
Alternatively, a dedicated grid mapping can be defined in the grid_mappings section and referenced by its identifier.
Examples of custom grid mappings can be found in the section "How to Modify the Spatial Extent and Resolution".
How to Add Local Data Sources to the Configuration
Local datasets can easily be added from the local file system or from private object
storage by using xcube's filesystem-based data stores,
such as file or s3.
For example, a local Zarr dataset can be added to the configuration as follows:
datasets:
  - identifier: local_dataset
    store: local_store
    grid_mapping: globe
    data_id: "dataset.zarr"
data_stores:
  # Storage data store
  - identifier: storage
    store_id: file
    store_params:
      root: data
  # Local source data store
  - identifier: local_store
    store_id: file
    store_params:
      root: local
grid_mappings:
  - identifier: globe
    bbox: [-180, -90, 180, 90]
    spatial_res: 0.1
    crs: EPSG:4326
    tile_size: [1800, 1800]
The same configuration can be used with an S3-compatible object storage by replacing
the file data store with an s3 data store:
  - identifier: local_store
    store_id: s3
    store_params:
      root: <s3-bucket>
      storage_options:
        anon: False
        key: <S3-key>
        secret: <S3-secret>
An example workflow is provided in Example 4: the LAI dataset from Zenodo is first prepared locally using prepare_laiv3.ipynb, and the resulting local dataset is then incorporated into the cube generation workflow.
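Because datasets, data stores, and grid mappings refer to each other by identifier, a small consistency check can catch typos before a cube generation run. The function below is a hypothetical helper, not part of xcube-multistore. It assumes the configuration has already been loaded into a Python dictionary with the same keys as the YAML file, and it only checks the cross-references described in this guide.

```python
def check_references(config):
    """Return a list of problems found in the identifier cross-references."""
    datasets = config.get("datasets", [])
    store_ids = {s["identifier"] for s in config.get("data_stores", [])}
    grid_ids = {g["identifier"] for g in config.get("grid_mappings", [])}
    dataset_ids = [d["identifier"] for d in datasets]

    problems = []
    # Dataset identifiers must be unique.
    for ident in sorted(set(dataset_ids)):
        if dataset_ids.count(ident) > 1:
            problems.append(f"duplicate dataset identifier: {ident}")
    for d in datasets:
        # Every dataset must point to a configured data store.
        if d["store"] not in store_ids:
            problems.append(f"{d['identifier']}: unknown store {d['store']!r}")
        # grid_mapping may name a grid mapping or another dataset.
        target = d.get("grid_mapping")
        if target is not None and target not in grid_ids and target not in dataset_ids:
            problems.append(f"{d['identifier']}: unknown grid_mapping {target!r}")
    return problems


config = {
    "datasets": [
        {"identifier": "local_dataset", "store": "local_store",
         "grid_mapping": "globe", "data_id": "dataset.zarr"},
    ],
    "data_stores": [
        {"identifier": "storage", "store_id": "file", "store_params": {"root": "data"}},
        {"identifier": "local_store", "store_id": "file", "store_params": {"root": "local"}},
    ],
    "grid_mappings": [
        {"identifier": "globe", "bbox": [-180, -90, 180, 90],
         "spatial_res": 0.1, "crs": "EPSG:4326", "tile_size": [1800, 1800]},
    ],
}

print(check_references(config))
```

An empty list means that every store and grid_mapping reference resolves to a defined identifier.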
How to Adjust the Time Axis
Some data stores allow specifying a time range directly in the opening parameters.
For example, Sentinel-2 and Sentinel-3 retrievals using stac-pc-ardc and
stac-cdse-ardc support time filtering, as shown in Example 1.
To check whether a specific dataset supports time-related opening parameters, use:
MultiSourceDataStore.get_open_data_params_schema(...)
Additionally, datasets can be resampled along the time axis using the
temporal_resample_params option. This allows the temporal resolution of a
dataset to be adjusted to a user-defined frequency and aggregation method.
An example from Example 4 is shown below:
  - identifier: laiv3
    store: local_laiv3
    grid_mapping: globe
    data_id: "GlobMapLAIV3.zarr"
    temporal_resample_params:
      frequency: "1MS"
      agg_methods: "mean"
    format_id: netcdf
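To illustrate what a month-start frequency combined with a mean aggregation does, the sketch below groups a few dated values by the first day of their month and averages each group. It assumes that "1MS" follows the pandas offset alias convention, where MS denotes month start. It uses only the standard library and made-up sample values, and it is not how xcube-multistore implements temporal resampling internally.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean(samples):
    """Average (date, value) pairs per month, labelled by the month start."""
    buckets = defaultdict(list)
    for day, value in samples:
        buckets[day.replace(day=1)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Hypothetical observations; values are for illustration only.
samples = [
    (date(2020, 7, 3), 2.0),
    (date(2020, 7, 19), 4.0),
    (date(2020, 8, 5), 5.0),
]

result = monthly_mean(samples)
for month, value in result.items():
    print(month.isoformat(), value)
```

Each output time step represents one calendar month, so irregular or higher-frequency inputs end up on a regular monthly time axis.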
How to Modify the Spatial Extent and Resolution
There are several ways to specify the spatial extent and resolution of a dataset.
First, users can define grid mappings in the grid_mappings section of the
configuration file. A grid mapping allows specifying the spatial extent, spatial
resolution, coordinate reference system, and tile size. For example:
grid_mappings:
  - identifier: globe
    bbox: [-180, -90, 180, 90]
    spatial_res: 0.1
    crs: EPSG:4326
    tile_size: [1800, 1800]
  - identifier: france
    bbox: [3400000, 2300000, 4150000, 3050000]
    spatial_res: 30
    crs: EPSG:3035
    tile_size: 2048
Multiple grid mappings can be defined within a single configuration. The grid mapping
identifier can then be referenced in the datasets
section, allowing different datasets within the same configuration to use different
spatial grids.
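The size of the grid implied by a grid mapping follows directly from its bounding box, resolution, and tile size. The sketch below works this out for the two grid mappings above. It assumes that tile_size is given as (width, height) and that a single integer applies to both axes; these are assumptions for illustration, and the helper grid_shape is hypothetical.

```python
import math


def grid_shape(bbox, spatial_res, tile_size):
    """Return ((width, height) in pixels, (tiles_x, tiles_y)) for a grid."""
    if isinstance(tile_size, int):
        tile_size = (tile_size, tile_size)  # assumed: scalar applies to both axes
    width = round((bbox[2] - bbox[0]) / spatial_res)
    height = round((bbox[3] - bbox[1]) / spatial_res)
    tiles_x = math.ceil(width / tile_size[0])
    tiles_y = math.ceil(height / tile_size[1])
    return (width, height), (tiles_x, tiles_y)


globe = grid_shape([-180, -90, 180, 90], 0.1, [1800, 1800])
france = grid_shape([3400000, 2300000, 4150000, 3050000], 30, 2048)

print("globe :", globe)
print("france:", france)
```

Such an estimate helps to judge the memory and storage footprint of a cube before choosing a resolution or tile size.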
Second, some data stores support specifying the spatial extent, resolution, and
coordinate reference system directly through opening parameters. Examples include
Sentinel-2 and Sentinel-3 retrievals via stac-pc-ardc and stac-cdse-ardc
(see Example 1), as well as ERA5 retrievals through the cds data store
(see Example 3).
To check whether a specific data store supports adjusting the spatial extent and resolution through opening parameters, use:
MultiSourceDataStore.get_open_data_params_schema(...)