EOPF zarr Performance Comparison - xcube
Compare the performance of the Sentinel EOPF zarr format to the SAFE format using the xcube readers.

Introduction¶
This notebook provides a configurable framework for performance tests on the new EOPF zarr format.
Note: Besides the chosen format, the underlying STAC API (EOPF vs. CDSE) and the software implementation (xcube-stac vs. xcube-eopf) also affect performance.
Setup¶
Start by importing the necessary libraries.
import requests
from xcube.core.store import new_data_store
from xcube_eopf.utils import reproject_bbox
# for benchmarking
from dataclasses import dataclass
from typing import List
from itertools import product
import pandas as pd
import time
Functions¶
These are the functions that are needed for performance tests and some convenience.
Create an area of interest from the original bounding box, reduced by a given factor between 0 (collapsed onto the centroid) and 1 (full scene). The factor scales each side of the box, so the covered area shrinks with its square.
def create_aoi(bbox, reduction):
    """
    Generate a reduced bounding box based on a reduction factor.
    Helper function to easily create portions of the original bbox around the centroid.
    Parameters:
    - bbox: [min_lon, min_lat, max_lon, max_lat]
    - reduction: float between 0 and 1
        - 0 returns a zero-size box collapsed onto the centroid
        - 0 < reduction <= 1 returns a scaled bounding box centered at the centroid
    Returns:
    - reduced bounding box list
    """
    if not (0 <= reduction <= 1):
        raise ValueError("Reduction must be between 0 and 1.")
    min_lon, min_lat, max_lon, max_lat = bbox
    # Compute centroid
    centroid_lon = (min_lon + max_lon) / 2
    centroid_lat = (min_lat + max_lat) / 2
    # if reduction == 0:
    #     return (centroid_lon, centroid_lat)
    # Compute reduced bounding box dimensions
    lat_span = (max_lat - min_lat) * reduction
    lon_span = (max_lon - min_lon) * reduction
    return [
        centroid_lon - lon_span / 2,
        centroid_lat - lat_span / 2,
        centroid_lon + lon_span / 2,
        centroid_lat + lat_span / 2,
    ]
Dataclass for carrying the defined inputs for data access, such as bounding box and time range.
# this is used for defining inputs
@dataclass
class BenchmarkConfig:
    data_id: str
    bbox: List[float]
    time_range: List[str]
    spatial_res: int
    crs: str
    variables: List[str]
Function to access data in the EOPF Zarr format via xcube-eopf.
def access_eopf(cfg: BenchmarkConfig):
    """
    Use the dataclass BenchmarkConfig to read the EOPF zarr.
    Parameters:
    - cfg: BenchmarkConfig object, holding the information to read the data.
    Returns:
    - Loaded dataset as xarray
    """
    return store_zarr.open_data(
        data_id=cfg.data_id,
        bbox=reproject_bbox(
            cfg.bbox, "EPSG:4326", cfg.crs
        ),  # has to be done for xcube  # TODO: throws error with cfg.crs = EPSG:4326
        time_range=cfg.time_range,
        spatial_res=cfg.spatial_res,
        crs=cfg.crs,
        variables=cfg.variables,
    ).load()
Function to access data in the SAFE format via xcube-stac.
def access_safe(cfg: BenchmarkConfig):
    """
    Use the dataclass BenchmarkConfig to read the SAFE.
    Parameters:
    - cfg: BenchmarkConfig object, holding the information to read the data.
    Returns:
    - Loaded dataset as xarray
    """
    return store_safe.open_data(
        data_id=cfg.data_id,
        bbox=reproject_bbox(
            cfg.bbox, "EPSG:4326", cfg.crs
        ),  # has to be done for xcube  # TODO: throws error with cfg.crs = EPSG:4326
        time_range=cfg.time_range,
        spatial_res=cfg.spatial_res,
        crs=cfg.crs,
        asset_names=[v.upper() for v in cfg.variables],
    ).load()
Function to track the benchmarking KPIs, such as duration and number of pixels.
def benchmark_data_access(configs, access_fn):
    """
    Loops through the given configs using the specified access function.
    Parameters:
    - configs: List of BenchmarkConfigs to loop over
    - access_fn: Access function to use, here either access_eopf or access_safe
    Returns:
    - DataFrame with the access specifics and performance indicators
    """
    results = []
    for cfg in configs:
        print(f"Running: {cfg}")
        start = time.perf_counter()
        ds = access_fn(cfg)
        end = time.perf_counter()
        n_pixels_xy = ds.sizes["x"] * ds.sizes["y"]  # get pixel count
        results.append(
            {
                "data_id": cfg.data_id,
                "bbox": cfg.bbox,
                "time_range": cfg.time_range,
                "spat_res": cfg.spatial_res,
                "crs": cfg.crs,
                "variables": cfg.variables,
                "n_pixels_xy": n_pixels_xy,
                "duration_sec": round(end - start, 4),
            }
        )
    return pd.DataFrame(results)
Define Scenarios¶
Here we define scenarios for the performance tests.
First we define the dataset to be used.
url = "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2A_MSIL2A_20200607T101031_N0500_R022_T32TPS_20230502T194134"
Here we get the original bbox of the dataset and its native CRS.
response = requests.get(url)
item = response.json()
bbox = item["bbox"]
print(bbox)
crs_native = item["properties"]["proj:code"]  # "EPSG:32632"
print(crs_native)
[10.290484068727272, 45.93349650488509, 11.755584916064224, 46.94616825397537]
EPSG:32632
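As a quick check of how the reduction factor acts on this bounding box, the sketch below applies a factor of 0.25 to the values printed above. It repeats a condensed copy of the create_aoi helper so that it runs on its own; the names side_fraction and area_fraction are introduced here for illustration only. The fractions are computed in lon/lat degrees, not in projected metres.

```python
def create_aoi(bbox, reduction):
    """Condensed copy of the notebook helper: scale bbox around its centroid."""
    min_lon, min_lat, max_lon, max_lat = bbox
    centroid_lon = (min_lon + max_lon) / 2
    centroid_lat = (min_lat + max_lat) / 2
    lon_span = (max_lon - min_lon) * reduction
    lat_span = (max_lat - min_lat) * reduction
    return [
        centroid_lon - lon_span / 2,
        centroid_lat - lat_span / 2,
        centroid_lon + lon_span / 2,
        centroid_lat + lat_span / 2,
    ]


# bbox as printed from the STAC item above
bbox = [10.290484068727272, 45.93349650488509, 11.755584916064224, 46.94616825397537]
aoi = create_aoi(bbox, 0.25)

# fraction of the original extent along one axis, and of the covered area
side_fraction = (aoi[2] - aoi[0]) / (bbox[2] - bbox[0])
area_fraction = side_fraction * (aoi[3] - aoi[1]) / (bbox[3] - bbox[1])

print(aoi)
print(side_fraction, area_fraction)
```

A factor of 0.25 therefore keeps a quarter of each side but only about one sixteenth of the area, which matters when relating the measured duration to the number of pixels read.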
Now we’ll define the parameters for the performance test. Options can be added and adjusted to different use case scenarios.
Note: The function benchmark_data_access() can loop over several configs. However, results are cached after the first iteration, so subsequent runtimes are drastically reduced.
Tip: Execute one setting at a time, then restart the kernel of the Jupyter notebook.
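To illustrate why a warm cache distorts timings, here is a minimal sketch that times the same call twice. The function fake_access is a hypothetical stand-in, not part of xcube: it uses functools.lru_cache and a short sleep to mimic slow I/O on the first call. Where the caching actually happens in the real data access is not stated in this notebook, so this only demonstrates the measurement effect.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def fake_access(data_id):
    """Hypothetical stand-in for a data access call; the sleep mimics I/O."""
    time.sleep(0.05)
    return data_id


def time_call(fn, *args):
    """Return the wall-clock duration of a single call in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


cold = time_call(fake_access, "sentinel-2-l2a")  # first call does the "I/O"
warm = time_call(fake_access, "sentinel-2-l2a")  # second call hits the cache
print(f"cold: {cold:.4f} s, warm: {warm:.4f} s")
```

Only the cold measurement reflects the cost of fetching the data, which is why a fresh kernel per setting gives comparable numbers.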
# define data id
opt_data_id = ["sentinel-2-l2a"]
# define bboxes
# only in lat/lon, reprojection of bbox to chosen crs happens later in code
opt_bbox = [
    # create_aoi(bbox, 256 / 10980),  # ml patch approx 256*256
    # create_aoi(bbox, 0.125),  # an eighth of the scene extent per side
    create_aoi(bbox, 0.25),  # a quarter of the scene extent per side (1/16 of the area)
]
# define crs
# mandatory in xcube
# if it differs from native crs processing is enforced (reprojection, resampling)
opt_crs = [
    crs_native,
]
# define times
opt_time_range = [
    ["2025-05-01", "2025-05-07"],  # week
]
# define spatial resolution
# everything deviating from native resolution enforces processing (resampling)
opt_spatial_res = [
    10,
]
# define band combinations
# choosing bands with different resolutions enforces processing (resampling)
opt_variables = [
    ["b02"],
    # ["b02", "b04"],
]
Create all combinations of the parameters specified above.
configs = [
    BenchmarkConfig(data_id, bbox, time_range, spatial_res, crs, variables)
    for data_id, bbox, time_range, spatial_res, crs, variables in product(
        opt_data_id, opt_bbox, opt_time_range, opt_spatial_res, opt_crs, opt_variables
    )
]
print(f"Number of configs: {len(configs)}")
print(f"Example config: {configs[0]}")
Number of configs: 1
Example config: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])
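Enabling more options multiplies the number of runs, because itertools.product builds every combination. The sketch below shows this with a reduced, hypothetical DemoConfig dataclass (two fields only, not the BenchmarkConfig used in this notebook) and made-up option lists.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass
class DemoConfig:
    """Hypothetical, reduced config used only for this illustration."""
    reduction: float
    variables: List[str]


opt_reduction = [0.125, 0.25]
opt_variables = [["b02"], ["b02", "b04"]]

# every reduction factor is paired with every band combination
demo_configs = [
    DemoConfig(reduction, variables)
    for reduction, variables in product(opt_reduction, opt_variables)
]
print(f"Number of configs: {len(demo_configs)}")
```

Two options in each of two lists already yield four runs, so it helps to vary one parameter at a time when each run needs a kernel restart.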
Performance EOPF¶
We’ll carry out the performance tests on the EOPF zarr format by accessing the data via the given parameters and tracking their performance.
First we create a data store using the xcube library and its eopf extension.
store_zarr = new_data_store("eopf-zarr")
Then we run the different scenarios.
df_benchm_eopf = benchmark_data_access(configs, access_eopf)
Running: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])
Round the bbox for nicer plotting.
df_benchm_eopf["bbox"] = df_benchm_eopf["bbox"].apply(
    lambda b: [round(x, 3) for x in b]
)
And report the results.
df_benchm_eopf
Performance SAFE¶
We’ll carry out the performance tests on the SAFE format by accessing the data via the given parameters and tracking their performance.
First we need CDSE credentials to access the data. Get them here: CDSE S3 Access Credentials.
credentials = {
    "key": "XXX",
    "secret": "XXX",
}
Then we create a data store using the xcube library and its SAFE extension.
store_safe = new_data_store("stac-cdse-ardc", **credentials)
Then we run the performance tests.
df_benchm_safe = benchmark_data_access(configs, access_safe)
Running: BenchmarkConfig(data_id='sentinel-2-l2a', bbox=[10.839896886478629, 46.313248410793946, 11.206172098312868, 46.56641634806651], time_range=['2025-05-01', '2025-05-07'], spatial_res=10, crs='EPSG:32632', variables=['b02'])
And report the results.
df_benchm_safe["bbox"] = df_benchm_safe["bbox"].apply(
    lambda b: [round(x, 3) for x in b]
)
df_benchm_safe
References¶
Notebook References:
