EOPF zarr Performance Comparison - xcube

Table of contents

Introduction
Setup
Functions
Define Scenarios
Performance EOPF
Performance SAFE
References


Introduction

This notebook provides a configurable framework for performance tests on the new EOPF zarr format.

Note: Besides the chosen format, the underlying STAC API (EOPF vs. CDSE) and the software implementation used (xcube-stac vs. xcube-eopf) also affect performance.

Setup

Start by importing the necessary libraries.

import requests

from xcube.core.store import new_data_store
from xcube_eopf.utils import reproject_bbox

# for benchmarking
from dataclasses import dataclass
from typing import List
from itertools import product
import pandas as pd
import time

Functions

These functions run the performance tests and provide a few conveniences.

Create an area of interest from the original bounding box, reduced by a given factor between 0 (centroid pixel) and 1 (full scene).

def create_aoi(bbox, reduction):
    """
    Generate a reduced bounding box or centroid based on a reduction factor.
    Helper function to easily create portions of the original bbox around the centroid.

    Parameters:
    - bbox: [min_lon, min_lat, max_lon, max_lat]
    - reduction: float between 0 and 1
        - 0 returns a degenerate box collapsed onto the centroid
        - 0 < reduction < 1 returns a scaled bounding box centered at the centroid
        - 1 returns the original bounding box

    Returns:
    - reduced bounding box list
    """
    if not (0 <= reduction <= 1):
        raise ValueError("Reduction must be between 0 and 1.")

    min_lon, min_lat, max_lon, max_lat = bbox

    # Compute centroid
    centroid_lon = (min_lon + max_lon) / 2
    centroid_lat = (min_lat + max_lat) / 2

    # if reduction == 0:
    #   return (centroid_lon, centroid_lat)

    # Compute reduced bounding box dimensions
    lat_span = (max_lat - min_lat) * reduction
    lon_span = (max_lon - min_lon) * reduction

    return [
        centroid_lon - lon_span / 2,
        centroid_lat - lat_span / 2,
        centroid_lon + lon_span / 2,
        centroid_lat + lat_span / 2,
    ]
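To make the effect of the reduction factor concrete, the following sketch re-implements the same centroid scaling in a compact, hypothetical helper (`scale_bbox`, which is not part of the notebook) and applies it to the scene bbox printed in the Define Scenarios section. Because both sides are scaled by the factor, the covered area shrinks with its square.

```python
def scale_bbox(bbox, factor):
    """Hypothetical stand-in for create_aoi: scale a bbox about its centroid."""
    min_x, min_y, max_x, max_y = bbox
    cx, cy = (min_x + max_x) / 2, (min_y + max_y) / 2
    half_w = (max_x - min_x) * factor / 2
    half_h = (max_y - min_y) * factor / 2
    return [cx - half_w, cy - half_h, cx + half_w, cy + half_h]


def bbox_area(bbox):
    """Area in squared coordinate units (degrees here, so only ratios matter)."""
    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])


scene = [10.290484068727272, 45.93349650488509, 11.755584916064224, 46.94616825397537]
aoi = scale_bbox(scene, 0.25)
area_ratio = bbox_area(aoi) / bbox_area(scene)
print(aoi)
print(round(area_ratio, 4))  # side factor 0.25 -> area factor 0.25 ** 2
```

A factor of 0.25 therefore selects roughly one sixteenth of the scene area, which is worth keeping in mind when comparing durations across factors.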

Dataclass for carrying the defined inputs for data access like bounding box, time range, etc.

# this is used for defining inputs
@dataclass
class BenchmarkConfig:
    data_id: str
    bbox: List[float]
    time_range: List[str]
    spatial_res: int
    crs: str
    variables: List[str]

Function to access data in the EOPF Zarr format via xcube-eopf.

def access_eopf(cfg: BenchmarkConfig):
    """
    Use the dataclass BenchmarkConfig to read the EOPF zarr.

    Parameters:
    - cfg: BenchmarkConfig object, holding the information to read the data.

    Returns:
    - Loaded dataset as xarray
    """
    return store_zarr.open_data(
        data_id=cfg.data_id,
        bbox=reproject_bbox(
            cfg.bbox, "EPSG:4326", cfg.crs
        ),  # required by xcube; TODO: throws an error with cfg.crs = EPSG:4326
        time_range=cfg.time_range,
        spatial_res=cfg.spatial_res,
        crs=cfg.crs,
        variables=cfg.variables,
    ).load()

Function to access data in the SAFE format via xcube-stac.

def access_safe(cfg: BenchmarkConfig):
    """
    Use the dataclass BenchmarkConfig to read the SAFE.

    Parameters:
    - cfg: BenchmarkConfig object, holding the information to read the data.

    Returns:
    - Loaded dataset as xarray
    """
    return store_safe.open_data(
        data_id=cfg.data_id,
        bbox=reproject_bbox(
            cfg.bbox, "EPSG:4326", cfg.crs
        ),  # required by xcube; TODO: throws an error with cfg.crs = EPSG:4326
        time_range=cfg.time_range,
        spatial_res=cfg.spatial_res,
        crs=cfg.crs,
        asset_names=[v.upper() for v in cfg.variables],
    ).load()

Function to track the benchmarking KPIs like duration, number of pixels, etc.

def benchmark_data_access(configs, access_fn):
    """
    Loops through the given configs using the specified access function.

    Parameters:
    - configs: List of BenchmarkConfigs to loop over
    - access_fn: Access function to use, here either access_eopf or access_safe

    Returns:
    - DataFrame with the access specifics and performance indicators
    """
    results = []

    for cfg in configs:
        print(f"Running: {cfg}")
        start = time.perf_counter()
        ds = access_fn(cfg)
        end = time.perf_counter()

        n_pixels_xy = ds.sizes["x"] * ds.sizes["y"]  # get pixel count
        results.append(
            {
                "data_id": cfg.data_id,
                "bbox": cfg.bbox,
                "time_range": cfg.time_range,
                "spat_res": cfg.spatial_res,
                "crs": cfg.crs,
                "variables": cfg.variables,
                "n_pixels_xy": n_pixels_xy,
                "duration_sec": round(end - start, 4),
            }
        )

    return pd.DataFrame(results)
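Single measurements are noisy, and one failing request would abort the whole loop above. A minimal sketch of a more defensive timing helper follows, assuming a hypothetical `timed_call` wrapper (not part of the notebook) that repeats a call and records the error instead of raising. `fake_access` only sleeps and stands in for `access_eopf` or `access_safe`.

```python
import statistics
import time


def timed_call(fn, *args, repeats=3, **kwargs):
    """Hypothetical helper: time fn(*args, **kwargs) up to `repeats` times.

    Returns a dict with the individual durations, their median, and the
    repr of the first error (or None if every call succeeded).
    """
    durations = []
    error = None
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            fn(*args, **kwargs)
        except Exception as exc:  # keep the benchmark loop alive
            error = repr(exc)
            break
        durations.append(time.perf_counter() - start)
    median = statistics.median(durations) if durations else None
    return {"durations_sec": durations, "median_sec": median, "error": error}


def fake_access(delay):
    """Stand-in for a data access function; it only sleeps."""
    time.sleep(delay)


ok = timed_call(fake_access, 0.01, repeats=3)
failed = timed_call(lambda: 1 / 0)
print(ok["median_sec"], failed["error"])
```

Repeated calls within one process will likely hit warm caches, so the first entry of `durations_sec` is the closest thing to a cold-start figure.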

Define Scenarios

Here we define scenarios for the performance tests.

First we define the dataset to be used.

url = "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2A_MSIL2A_20200607T101031_N0500_R022_T32TPS_20230502T194134"

Here we get the original bbox of the dataset and its native CRS.

response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
item = response.json()
bbox = item["bbox"]
print(bbox)

crs_native = item["properties"]["proj:code"]  # "EPSG:32632"
print(crs_native)

[10.290484068727272, 45.93349650488509, 11.755584916064224, 46.94616825397537]
EPSG:32632
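The lookup above assumes the item carries `proj:code`. Older versions of the STAC projection extension used an integer `proj:epsg` field instead. A small sketch of a tolerant lookup, using a hypothetical helper `native_crs` that is not part of the notebook, could look like this. Whether the EOPF STAC API ever returns the older field is not stated in the notebook, so the fallback is only a precaution.

```python
def native_crs(item):
    """Hypothetical helper: read the CRS from a STAC item dict.

    Prefers "proj:code" and falls back to the older integer field
    "proj:epsg". Returns None if neither is present in the properties.
    """
    props = item.get("properties", {})
    code = props.get("proj:code")
    if code:
        return code
    epsg = props.get("proj:epsg")
    return f"EPSG:{epsg}" if epsg is not None else None


# Minimal hand-written items for illustration, not real API responses.
new_style = {"properties": {"proj:code": "EPSG:32632"}}
old_style = {"properties": {"proj:epsg": 32632}}
print(native_crs(new_style), native_crs(old_style), native_crs({}))
```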

Now we’ll define the parameters for the performance test. Options can be added and adjusted to different use case scenarios.

Note: The function benchmark_data_access() can loop through several configs. However, after the first iteration the results are cached, so subsequent runtimes are drastically reduced.

Tip: Execute one configuration at a time, then restart the Jupyter notebook kernel before running the next one.
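Restarting the kernel by hand can be approximated by running each measurement in a fresh Python interpreter. The sketch below assumes a hypothetical helper `time_in_fresh_process` (not part of the notebook) that takes source strings and returns the wall-clock duration measured in the child process. It only resets in-process state; any on-disk or server-side caching would be unaffected.

```python
import json
import subprocess
import sys


def time_in_fresh_process(setup, stmt):
    """Hypothetical helper: run `stmt` once in a new Python interpreter.

    `setup` and `stmt` are source strings. The child prints its duration
    as JSON on the last line of stdout, which the parent parses.
    """
    child_code = "\n".join(
        [
            "import json, time",
            setup,
            "start = time.perf_counter()",
            stmt,
            'print(json.dumps({"duration_sec": time.perf_counter() - start}))',
        ]
    )
    proc = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout.strip().splitlines()[-1])["duration_sec"]


# A sleep stands in for the actual data access call.
duration = time_in_fresh_process("import time", "time.sleep(0.05)")
print(f"{duration:.3f} s")
```

In the notebook's setting, `setup` would create the data store and config, and `stmt` would be the access call; both would have to be supplied as source text, which is the main drawback of this approach.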

# define data id
opt_data_id = ["sentinel-2-l2a"]

# define bboxes
# only in lat/lon, reprojection of bbox to chosen crs happens later in code
opt_bbox = [
    # create_aoi(bbox, 256 / 10980), # ml patch approx 256*256
    # create_aoi(bbox, 0.125), # an eighth of the scene's side length (1/64 of the area)
    create_aoi(bbox, 0.25),  # a quarter of the scene's side length (1/16 of the area)
]

# define crs
# mandatory in xcube
# if it differs from native crs processing is enforced (reprojection, resampling)
opt_crs = [
    crs_native,
]

# define times
opt_time_range = [
    ["2025-05-01", "2025-05-07"],  # week
]

# define spatial resolution
# everything deviating from native resolution enforces processing (resampling)
opt_spatial_res = [
    10,
]

# define band combinations
# choosing bands with different resolutions enforces processing (resampling)
opt_variables = [
    ["b02"],
    # ["b02", "b04"],
]
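The commented-out option `create_aoi(bbox, 256 / 10980)` relies on a Sentinel-2 tile being 10980 pixels wide at 10 m resolution. The sketch below generalizes that ratio with a hypothetical helper `reduction_for_patch` (not part of the notebook). The result is approximate, since the lat/lon bbox of an item may be somewhat larger than the tile footprint in its UTM grid.

```python
SCENE_PX_10M = 10980  # Sentinel-2 tile width/height in pixels at 10 m resolution


def reduction_for_patch(patch_px, scene_px=SCENE_PX_10M):
    """Hypothetical helper: side-length factor for a patch of patch_px pixels."""
    if not 0 < patch_px <= scene_px:
        raise ValueError("patch_px must be in (0, scene_px].")
    return patch_px / scene_px


for patch_px in (256, 512, 1024):
    factor = reduction_for_patch(patch_px)
    # Print the patch size, the factor, and the pixel count it maps back to.
    print(patch_px, round(factor, 4), round(factor * SCENE_PX_10M))
```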

Create all combinations of the parameters specified above.

configs = [
    BenchmarkConfig(data_id, bbox, time_range, spatial_res, crs, variables)
    for data_id, bbox, time_range, spatial_res, crs, variables in product(
        opt_data_id, opt_bbox, opt_time_range, opt_spatial_res, opt_crs, opt_variables
    )
]
print(f"Number of configs: {len(configs)}")
print(f"Example config: {configs[0]}")

Number of configs: 1
Example config: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])
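The number of configs is the product of the lengths of the option lists, so it grows quickly once several options are enabled. A small sketch with illustrative placeholder options (the names and values below are assumptions, not the notebook's settings):

```python
from itertools import product
from math import prod

# Illustrative option lists; the values are placeholders, not recommendations.
options = {
    "bbox_reduction": [0.125, 0.25],
    "spatial_res": [10, 20, 60],
    "variables": [["b02"], ["b02", "b04"]],
}

# Expected count: product of the list lengths.
expected = prod(len(values) for values in options.values())

# One dict per combination, keyed by option name.
combos = [dict(zip(options, values)) for values in product(*options.values())]

print(expected, len(combos))
print(combos[0])
```

Since each combination ideally runs against cold caches, the count also indicates how many separate runs a full sweep would need.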

Performance EOPF

We run the performance tests on the EOPF Zarr format by accessing the data with the parameters defined above and timing each access.

First we create a data store using the xcube library and its eopf extension.

store_zarr = new_data_store("eopf-zarr")

Then we run the different scenarios.

df_benchm_eopf = benchmark_data_access(configs, access_eopf)

Running: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])

Round the bbox coordinates for a more readable table.

df_benchm_eopf["bbox"] = df_benchm_eopf["bbox"].apply(
    lambda b: [round(x, 3) for x in b]
)

And report the results.

df_benchm_eopf

Performance SAFE

We run the performance tests on the SAFE format by accessing the data with the same parameters and timing each access.

First we need CDSE credentials to access the data. Get them here: CDSE S3 Access Credentials.

credentials = {
    "key": "XXX",
    "secret": "XXX",
}

Then we create a data store using the xcube library and its xcube-stac plugin, which reads the SAFE format.

store_safe = new_data_store("stac-cdse-ardc", **credentials)

Then we run the performance tests.

df_benchm_safe = benchmark_data_access(configs, access_safe)

Running: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])

Round the bbox coordinates and report the results.

df_benchm_safe["bbox"] = df_benchm_safe["bbox"].apply(
    lambda b: [round(x, 3) for x in b]
)

df_benchm_safe

References

Notebook References: