earth and related environmental sciences

EOPF zarr Performance Comparison - xcube

Compare the performance of the Sentinel EOPF zarr format to the SAFE format using the xcube readers.

Authors
Affiliations
German Aerospace Center
Eurac Research
Brockmann Consult GmbH
Introduction

This notebook provides a configurable framework for performance tests on the new EOPF zarr format.

Note: Besides the chosen format, the underlying STAC API (EOPF vs. CDSE) and the software implementation used (xcube-stac vs. xcube-eopf) also affect performance.

Setup

Start by importing the necessary libraries.

import requests

from xcube.core.store import new_data_store
from xcube_eopf.utils import reproject_bbox

# for benchmarking
from dataclasses import dataclass
from typing import List
from itertools import product
import pandas as pd
import time

Functions

These functions run the performance tests and provide some convenience helpers.

Create an area of interest from the original bounding box, reduced by a given factor between 0 (centroid) and 1 (full scene).

def create_aoi(bbox, reduction):
    """
    Generate a reduced bounding box or centroid based on a reduction factor.
    Helper function to easily create portions of the original bbox around the centroid.

    Parameters:
    - bbox: [min_lon, min_lat, max_lon, max_lat]
    - reduction: float between 0 and 1, applied to each side of the bbox
        - 0 returns a degenerate bounding box collapsed onto the centroid
        - 0 < reduction < 1 returns a scaled bounding box centered at the centroid
        - 1 returns the original bounding box

    Returns:
    - reduced bounding box list
    """
    if not (0 <= reduction <= 1):
        raise ValueError("Reduction must be between 0 and 1.")

    min_lon, min_lat, max_lon, max_lat = bbox

    # Compute centroid
    centroid_lon = (min_lon + max_lon) / 2
    centroid_lat = (min_lat + max_lat) / 2

    # if reduction == 0:
    #   return (centroid_lon, centroid_lat)

    # Compute reduced bounding box dimensions
    lat_span = (max_lat - min_lat) * reduction
    lon_span = (max_lon - min_lon) * reduction

    return [
        centroid_lon - lon_span / 2,
        centroid_lat - lat_span / 2,
        centroid_lon + lon_span / 2,
        centroid_lat + lat_span / 2,
    ]
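As a quick sanity check of the helper above, the following sketch applies a compact copy of create_aoi to an illustrative bounding box (the coordinates are made up for easy arithmetic, not a real scene footprint). The helper area_fraction is hypothetical and only added here to show that the reduction factor scales each side of the box, so the covered area shrinks with the square of the factor.

```python
def create_aoi(bbox, reduction):
    """Compact copy of the helper above: scale bbox around its centroid."""
    min_lon, min_lat, max_lon, max_lat = bbox
    c_lon, c_lat = (min_lon + max_lon) / 2, (min_lat + max_lat) / 2
    half_lon = (max_lon - min_lon) * reduction / 2
    half_lat = (max_lat - min_lat) * reduction / 2
    return [c_lon - half_lon, c_lat - half_lat, c_lon + half_lon, c_lat + half_lat]


def area_fraction(bbox, aoi):
    """Hypothetical helper: share of the bbox area (in degrees) covered by aoi."""
    full = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    part = (aoi[2] - aoi[0]) * (aoi[3] - aoi[1])
    return part / full


# Illustrative bbox, chosen for easy arithmetic (not a real scene footprint)
scene = [10.0, 45.0, 12.0, 46.0]

quarter = create_aoi(scene, 0.25)
print(quarter)                        # [10.75, 45.375, 11.25, 45.625]
print(area_fraction(scene, quarter))  # 0.0625, i.e. 0.25 ** 2
```

A reduction of 0.25 therefore keeps a quarter of each side length but only one sixteenth of the area, which matters when relating the factor to the number of pixels loaded.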

Dataclass for carrying the defined inputs for data access like bounding box, time range, etc.

# this is used for defining inputs
@dataclass
class BenchmarkConfig:
    data_id: str
    bbox: List[float]
    time_range: List[str]
    spatial_res: int
    crs: str
    variables: List[str]

Function to access data in the EOPF Zarr format via xcube-eopf.

def access_eopf(cfg: BenchmarkConfig):
    """
    Use the dataclass BenchmarkConfig to read the EOPF zarr.

    Parameters:
    - cfg: BenchmarkConfig object, holding the information to read the data.

    Returns:
    - Loaded dataset as xarray
    """
    return store_zarr.open_data(
        data_id=cfg.data_id,
        bbox=reproject_bbox(
            cfg.bbox, "EPSG:4326", cfg.crs
        ),  # has to be done for xcube # TODO: throws error with cfg.crs = EPSG:4326
        time_range=cfg.time_range,
        spatial_res=cfg.spatial_res,
        crs=cfg.crs,
        variables=cfg.variables,
    ).load()

Function to access data in the SAFE format via xcube-stac.

def access_safe(cfg: BenchmarkConfig):
    """
    Use the dataclass BenchmarkConfig to read the SAFE.

    Parameters:
    - cfg: BenchmarkConfig object, holding the information to read the data.

    Returns:
    - Loaded dataset as xarray
    """
    return store_safe.open_data(
        data_id=cfg.data_id,
        bbox=reproject_bbox(
            cfg.bbox, "EPSG:4326", cfg.crs
        ),  # has to be done for xcube # TODO: throws error with cfg.crs = EPSG:4326
        time_range=cfg.time_range,
        spatial_res=cfg.spatial_res,
        crs=cfg.crs,
        asset_names=[v.upper() for v in cfg.variables],
    ).load()

Function to track the benchmarking KPIs like duration, number of pixels, etc.

def benchmark_data_access(configs, access_fn):
    """
    Loops through the given configs using the specified access function.

    Parameters:
    - configs: List of BenchmarkConfigs to loop over
    - access_fn: Access function to use, here either access_eopf or access_safe

    Returns:
    - DataFrame with the access specifics and performance indicators
    """
    results = []

    for cfg in configs:
        print(f"Running: {cfg}")
        start = time.perf_counter()
        ds = access_fn(cfg)
        end = time.perf_counter()

        n_pixels_xy = ds.sizes["x"] * ds.sizes["y"]  # get pixel count
        results.append(
            {
                "data_id": cfg.data_id,
                "bbox": cfg.bbox,
                "time_range": cfg.time_range,
                "spat_res": cfg.spatial_res,
                "crs": cfg.crs,
                "variables": cfg.variables,
                "n_pixels_xy": n_pixels_xy,
                "duration_sec": round(end - start, 4),
            }
        )

    return pd.DataFrame(results)

Define Scenarios

Here we define scenarios for the performance tests.

First we define the dataset to be used.

url = "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2A_MSIL2A_20200607T101031_N0500_R022_T32TPS_20230502T194134"

Here we get the original bbox of the dataset and its native CRS.

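response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
item = response.json()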
bbox = item["bbox"]
print(bbox)

crs_native = item["properties"]["proj:code"]  # "EPSG:32632"
print(crs_native)
[10.290484068727272, 45.93349650488509, 11.755584916064224, 46.94616825397537]
EPSG:32632

Now we’ll define the parameters for the performance test. Options can be added and adjusted to different use case scenarios.

Note: The function benchmark_data_access() can loop through several configs. However, after the first iteration the results are cached, so subsequent runtimes are drastically reduced.

Tip: Execute one setting, then restart the Jupyter notebook kernel before running the next one.
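To see how strongly caching can distort timings within a single session, one option is to time the same call several times and compare the first (cold) run with the later (warm) ones. The sketch below is a minimal illustration under that assumption: time_repeated and fake_access are hypothetical names, and fake_access merely simulates a slow first read followed by cached reads. It is not how xcube caches data.

```python
import time


def time_repeated(fn, repeats=3):
    """Call fn() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload with a simple cache (hypothetical, for illustration only).
_cache = {}


def fake_access():
    if "data" not in _cache:
        time.sleep(0.05)  # simulate the slow first (cold) read
        _cache["data"] = list(range(1000))
    return _cache["data"]


durations = time_repeated(fake_access, repeats=3)
print([round(d, 4) for d in durations])  # first value is much larger than the rest
```

In the notebook, fake_access could be swapped for something like `lambda: access_eopf(configs[0])`. Only the first duration then reflects an uncached read, which is why restarting the kernel between settings gives comparable numbers.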

# define data id
opt_data_id = ["sentinel-2-l2a"]

# define bboxes
# only in lat/lon, reprojection of bbox to chosen crs happens later in code
opt_bbox = [
    # create_aoi(bbox, 256 / 10980), # ML patch, approx. 256 x 256 pixels
    # create_aoi(bbox, 0.125), # an eighth of the scene's side length
    create_aoi(bbox, 0.25),  # a quarter of the scene's side length (1/16 of the area)
]

# define crs
# mandatory in xcube
# if it differs from the native crs, processing is triggered (reprojection, resampling)
opt_crs = [
    crs_native,
]

# define times
opt_time_range = [
    ["2025-05-01", "2025-05-07"],  # week
]

# define spatial resolution
# everything deviating from native resolution enforces processing (resampling)
opt_spatial_res = [
    10,
]

# define band combinations
# choosing bands with different resolutions enforces processing (resampling)
opt_variables = [
    ["b02"],
    # ["b02", "b04"],
]

Create all combinations of parameters specified above

configs = [
    BenchmarkConfig(data_id, bbox, time_range, spatial_res, crs, variables)
    for data_id, bbox, time_range, spatial_res, crs, variables in product(
        opt_data_id, opt_bbox, opt_time_range, opt_spatial_res, opt_crs, opt_variables
    )
]
print(f"Number of configs: {len(configs)}")
print(f"Example config: {configs[0]}")
Number of configs: 1
Example config: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])
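Because the configs are built as a Cartesian product, their number grows multiplicatively with every option that is added. The following sketch illustrates this with hypothetical option lists (the labels are placeholders, not the notebook's actual values):

```python
from itertools import product
from math import prod

# Hypothetical option lists, for illustration only
options = {
    "bbox": ["quarter_scene", "ml_patch"],
    "spatial_res": [10],
    "variables": [["b02"], ["b02", "b04"]],
}

combos = list(product(*options.values()))
expected = prod(len(values) for values in options.values())

print(len(combos), expected)  # 4 4
for combo in combos:
    print(dict(zip(options.keys(), combo)))
```

Two bounding boxes and two band combinations already yield four runs. Given the caching behaviour noted above, it may be preferable to keep the lists short and run one setting per kernel session.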

Performance EOPF

We’ll carry out the performance tests on the EOPF zarr format by accessing the data via the given parameters and tracking their performance.

First we create a data store using the xcube library and its eopf extension.

store_zarr = new_data_store("eopf-zarr")

Then we run the different scenarios.

df_benchm_eopf = benchmark_data_access(configs, access_eopf)
Running: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])

Round the bbox for nicer plotting.

df_benchm_eopf["bbox"] = df_benchm_eopf["bbox"].apply(
    lambda b: [round(x, 3) for x in b]
)

And report the results.

df_benchm_eopf

Performance SAFE

We’ll carry out the performance tests on the SAFE format by accessing the data via the given parameters and tracking their performance.

First we need CDSE credentials to access the data. Get them here: CDSE S3 Access Credentials.

credentials = {
    "key": "XXX",
    "secret": "XXX",
}

Then we create a data store using the xcube library and its xcube-stac extension, which reads the SAFE format.

store_safe = new_data_store("stac-cdse-ardc", **credentials)

Then we run the performance tests.

df_benchm_safe = benchmark_data_access(configs, access_safe)
Running: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])

Round the bbox for nicer display and report the results.

df_benchm_safe["bbox"] = df_benchm_safe["bbox"].apply(
    lambda b: [round(x, 3) for x in b]
)
df_benchm_safe
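To compare the two result tables directly, they can be joined row by row, since both were produced from the same configs list in the same order. The sketch below assumes exactly that. compare_benchmarks is a hypothetical helper, and the durations in the demo frames are placeholder values, not measurements.

```python
import pandas as pd


def compare_benchmarks(df_eopf: pd.DataFrame, df_safe: pd.DataFrame) -> pd.DataFrame:
    """Join two benchmark tables row by row (same config order assumed)."""
    out = df_eopf[["data_id", "spat_res", "crs"]].copy()
    out["duration_eopf_sec"] = df_eopf["duration_sec"].to_numpy()
    out["duration_safe_sec"] = df_safe["duration_sec"].to_numpy()
    # Ratio > 1 means the SAFE access took longer than the EOPF Zarr access
    out["safe_over_eopf"] = out["duration_safe_sec"] / out["duration_eopf_sec"]
    return out


# Placeholder frames with made-up durations (NOT measured results)
demo_eopf = pd.DataFrame(
    [{"data_id": "sentinel-2-l2a", "spat_res": 10, "crs": "EPSG:32632", "duration_sec": 2.0}]
)
demo_safe = pd.DataFrame(
    [{"data_id": "sentinel-2-l2a", "spat_res": 10, "crs": "EPSG:32632", "duration_sec": 4.0}]
)

comparison = compare_benchmarks(demo_eopf, demo_safe)
print(comparison)
```

In the notebook, the call would be `compare_benchmarks(df_benchm_eopf, df_benchm_safe)`. Joining by position avoids merging on the bbox and variables columns, which hold lists and are therefore not hashable merge keys.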