Setup Config YAML File¶

This notebook demonstrates how to use xcube-multistore to set up a configuration YAML file. The file can then be used with a MultiSourceDataStore instance to generate data cubes.

Each auxiliary function used below is documented in detail in the Python API reference.

In [1]:

Copied!

from xcube_multistore import MultiSourceDataStore
from xcube_multistore import MultiSourceDataStore

The full schema can be viewed by execting the following cell.

A full documentation on the configuration is given in the Configuration Guide.

In [2]:

Copied!

MultiSourceDataStore.get_config_schema()
MultiSourceDataStore.get_config_schema()

Out[2]:

<xcube.util.jsonschema.JsonObjectSchema at 0x7e9c1e8911d0>

In this example, we want to resample all datacubes to the same gridmapping. We therefore define a gridmapping covering the full globe with a resolution of 0.1° and crs="EPSG:4326". Note that the grid_mappings is given as a list, as the user may define multiple grid mappings.

In [3]:

Copied!





config = dict()
config["grid_mappings"] = []
config["grid_mappings"].append(
    dict(
        identifier="gm",
        bbox=[-180, -90, 180, 90],
        spatial_res=0.1,
        crs="EPSG:4326",
    )
)
config = dict()
config["grid_mappings"] = []
config["grid_mappings"].append(
    dict(
        identifier="gm",
        bbox=[-180, -90, 180, 90],
        spatial_res=0.1,
        crs="EPSG:4326",
    )
)

Secondly, we set up the data stores. To define them, you can first list all data stores available in your environment.

Note: if a data store identifier is missing, make sure the corresponding xcube plugin is installed in your environment.

In [4]:

Copied!

MultiSourceDataStore.list_data_store_ids()
MultiSourceDataStore.list_data_store_ids()

Out[4]:

['esa-cci',
 'esa-cdc',
 'esa-climate-data-centre',
 'esa-cci-kc',
 'esa-cdc-kc',
 'ccikc',
 'esa-cci-zarr',
 'esa-cdc-zarr',
 'ccizarr',
 'cciodp',
 'abfs',
 'file',
 'ftp',
 'https',
 'memory',
 'reference',
 's3',
 'cds',
 'clms',
 'stac',
 'stac-cdse',
 'stac-cdse-ardc',
 'stac-pc',
 'stac-pc-ardc',
 'stac-xcube',
 'zenodo']

To get the data store parameters needed for initializing a data store, you can use the method get_data_store_params_schema.

You can provide a single data store identifier, a list of identifiers, or None. If None, parameters for all available data stores are returned.

In [5]:

Copied!

MultiSourceDataStore.get_data_store_params_schema(["file", "cciodp", "zenodo"])
MultiSourceDataStore.get_data_store_params_schema(["file", "cciodp", "zenodo"])

Out[5]:

<xcube.util.jsonschema.JsonObjectSchema at 0x7e9c1e857690>

Now we have all the information needed to set up the data_stores field. Also this field is a list.

In [6]:

Copied!

config["data_stores"] = []
config["data_stores"] = []

We first add the storage data store, which is the data store where the final data cubes will be written. The store_id is set to "file", and the root is set to "data", so that the data cubes are stored in a folder named data.

Note: You could also configure a different type of data store, such as s3, if desired.

In [7]:

Copied!





config["data_stores"].append(
    dict(
        identifier="storage",
        store_id="file",
        store_params={"root": "data"}
    )
)
config["data_stores"].append(
    dict(
        identifier="storage",
        store_id="file",
        store_params={"root": "data"}
    )
)

Next, we add the "cciodp" data store to provide access to the ESA CCI datasets, since we are specifically interested in ESA CCI Biomass. This store does not require any store parameters. We set its identifier to "esa_cci" for later reference.

In [8]:

Copied!





config["data_stores"].append(
    dict(
        identifier="esa_cci",
        store_id="cciodp",
    )
)
config["data_stores"].append(
    dict(
        identifier="esa_cci",
        store_id="cciodp",
    )
)

Furthermore, we are interested in the dataset "Changes in Global Terrestrial Live Biomass over the 21st Century", published on Zenodo with record number 4161694. The setup is shown in the following cell.

Note: The identifier zenodo is user-defined and can be changed. It is used only for reference later in the workflow.

In [9]:

Copied!





config["data_stores"].append(
    dict(
        identifier="zenodo",
        store_id="zenodo",
        store_params={"root": "4161694"},
    )
)
config["data_stores"].append(
    dict(
        identifier="zenodo",
        store_id="zenodo",
        store_params={"root": "4161694"},
    )
)

Next, we need to retrieve the correct data IDs within the store to access the desired datasets. This can be done by running the following cell.

In [10]:

Copied!





MultiSourceDataStore.list_data_ids(
    dict(
        cciodp={},
        zenodo={"root": "4161694"},
    )
)
MultiSourceDataStore.list_data_ids(
    dict(
        cciodp={},
        zenodo={"root": "4161694"},
    )
)

Out[10]:

<xcube.util.jsonschema.JsonObjectSchema at 0x7e9c1e12bbd0>

For the zenodo store, we are interested in the data ID "test10a_cd_ab_pred_corr_2000_2019_v2.tif". This dataset will be accessed using the user-defined store identifier "zenodo", and the resulting data cube will be named "biomass_xu", which will also be the name of the final file in the storage. Since we want to reproject the dataset to the previously defined grid mapping, we assign grid_mapping="gm".

In [11]:

Copied!

config["datasets"] = []
config["datasets"] = []

In [12]:

Copied!





config["datasets"].append(
    dict(
        identifier="biomass_xu",
        store="zenodo",
        grid_mapping="gm",
        data_id="test10a_cd_ab_pred_corr_2000_2019_v2.tif",
    )
)
config["datasets"].append(
    dict(
        identifier="biomass_xu",
        store="zenodo",
        grid_mapping="gm",
        data_id="test10a_cd_ab_pred_corr_2000_2019_v2.tif",
    )
)

For the cciodp store, many data IDs are available. We can filter the results by searching for specific data IDs.

To obtain the search parameters for the store, use the method get_search_params_schema, providing the store_id and store_params as a dictionary.

In [13]:

Copied!

MultiSourceDataStore.get_search_params_schema(dict(cciodp={}))
MultiSourceDataStore.get_search_params_schema(dict(cciodp={}))

Out[13]:

<xcube.util.jsonschema.JsonObjectSchema at 0x7e9c1c4d9fd0>

Since we are interested in "BIOMASS", we can filter the data IDs using the "ecv" field within "cci_attrs", as shown below.

In [14]:

Copied!

MultiSourceDataStore.search_data_ids(dict(cciodp=({}, {"cci_attrs": {"ecv": "BIOMASS"}})))
MultiSourceDataStore.search_data_ids(dict(cciodp=({}, {"cci_attrs": {"ecv": "BIOMASS"}})))

Out[14]:

<xcube.util.jsonschema.JsonObjectSchema at 0x7e9c1c36fcb0>

Here we can see the different "BIOMASS" datasets. We are interested in the 10000m resolution. Both version 5 and version 6 are available.
To describe the data, we can use the method describe_data, providing the store_id, store_params, and data_id for both versions, as shown in the following cell.

In [15]:

Copied!





MultiSourceDataStore.describe_data(
    "cciodp",
    {}, 
    "esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.5-0.10000m",
)
MultiSourceDataStore.describe_data(
    "cciodp",
    {}, 
    "esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.5-0.10000m",
)

Out[15]:

<xcube.core.store.descriptor.DatasetDescriptor at 0x7e9bd4e43e30>

In [16]:

Copied!





MultiSourceDataStore.describe_data(
    "cciodp",
    {}, 
    "esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.6-0.10000m",
)
MultiSourceDataStore.describe_data(
    "cciodp",
    {}, 
    "esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.6-0.10000m",
)

Out[16]:

<xcube.core.store.descriptor.DatasetDescriptor at 0x7e9bd4b07e30>

We have decided to use version 6.

Finally, we can inspect the opening parameters, which are passed to the data access method, using the method get_open_data_params_schema.

In [17]:

Copied!





MultiSourceDataStore.get_open_data_params_schema(
    "cciodp",
    {}, 
    "esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.6-0.10000m",
)
MultiSourceDataStore.get_open_data_params_schema(
    "cciodp",
    {}, 
    "esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.6-0.10000m",
)

Out[17]:

<xcube.util.jsonschema.JsonObjectSchema at 0x7e9c1455def0>

Now we have all the necessary information to configure access to the ESA CCI Biomass dataset.

In [18]:

Copied!





config["datasets"].append(
    dict(
        identifier="esa_cci_biomass",
        store="esa_cci",
        grid_mapping="gm",
        data_id="esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.5-0.10000m",
    )
)
config["datasets"].append(
    dict(
        identifier="esa_cci_biomass",
        store="esa_cci",
        grid_mapping="gm",
        data_id="esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.5-0.10000m",
    )
)

Finally, we can display the complete configuration for each data cube that will be generated during the cube generation process.

In [19]:

Copied!

MultiSourceDataStore.display_config(config)
MultiSourceDataStore.display_config(config)

Configuration

User-defined ID	Data Store ID	Data Store Params	Data ID	Open Data Params	Grid-Mapping	Format
biomass_xu	zenodo	root: 4161694	test10a_cd_ab_pred_corr_2000_2019_v2.tif	-	-	Zarr
esa_cci_biomass	cciodp	-	esacci.BIOMASS.yr.L4.AGB.multi-sensor.multi-platform.MERGED.5-0.10000m	-	-	Zarr

This configuration can either be passed directly to the xcube_multistore.MultiSourceDataStore class, or saved as a YAML file for persistence:

with open("config.yml", "w") as file:
    yaml.dump(config, file, sort_keys=False)