earth and related environmental sciences

EOPF zarr Performance Comparison - direct read

Compare the performance of the Sentinel EOPF Zarr format with the SAFE format when reading the files directly.

Authors
Affiliations
German Aerospace Center
Eurac Research
Brockmann Consult GmbH

Introduction

This notebook provides a framework for performance testing and resource monitoring of the new EOPF Zarr format when the files are read directly, with the SAFE format as the point of comparison.

Setup

Start by importing the necessary libraries.

import xarray as xr
import time
import psutil
import logging
import pandas as pd

# for direct loading
import rioxarray
import fsspec
import s3fs

Define paths to files

path_eodc_zarr = "https://objectstore.eodc.eu:2222/e05ab01a9d56408d82ac32d69a5aae2a:202505-s02msil2a/03/products/cpm_v256/S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.zarr"
path_eodc_safe = "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/SAFE/S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.SAFE"
path_cdse_safe = "s3://eodata/Sentinel-2/MSI/L2A/2025/05/03/S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.SAFE"

Functions

This sets up a handler on the fsspec.http logger that captures the debug messages fsspec emits for HTTP requests, so that the requests made when accessing the data can be counted. Zarr typically issues many small requests, whereas SAFE issues few large ones.

# Silent in-memory log capture
fsspec_logs = []


class ListHandler(logging.Handler):
    def __init__(self, storage):
        super().__init__()
        self.storage = storage

    def emit(self, record):
        self.storage.append(self.format(record))


# Disable root logger output to notebook
logging.getLogger().handlers.clear()

# Configure only fsspec.http logger
logger_fsspec = logging.getLogger("fsspec.http")
logger_fsspec.handlers.clear()
logger_fsspec.propagate = False  # <-- important! stops logs bubbling up
logger_fsspec.setLevel(logging.DEBUG)
logger_fsspec.addHandler(ListHandler(fsspec_logs))
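
The capture mechanism above can be tried without any network access. The following is a minimal sketch that reuses the same handler pattern on a hypothetical logger named "demo.http" (an assumption made for illustration; it is not part of fsspec). Each debug call stands in for one logged request, and the request count is simply the length of the list.

```python
import logging


class ListHandler(logging.Handler):
    """Append every formatted log record to a plain Python list."""

    def __init__(self, storage):
        super().__init__()
        self.storage = storage

    def emit(self, record):
        self.storage.append(self.format(record))


# "demo.http" is a hypothetical logger name used only for this illustration.
demo_logs = []
demo_logger = logging.getLogger("demo.http")
demo_logger.handlers.clear()
demo_logger.propagate = False  # keep records out of the root logger
demo_logger.setLevel(logging.DEBUG)
demo_logger.addHandler(ListHandler(demo_logs))

# Each debug call stands in for one logged request.
demo_logger.debug("GET chunk 0")
demo_logger.debug("GET chunk 1")

request_count = len(demo_logs)
print(request_count)
```

Because the list is cleared before every benchmark run, the count always refers to a single run.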

This sets up the tracking of CPU and memory usage for the current process.

# cpu tracking
process = psutil.Process()

This function takes an arbitrary reader function and adds the benchmarking around it.

def benchmark_run(func, repeats=1):
    """
    Benchmark any callable `func` and measure:
      - Wall time
      - CPU time (user + system)
      - Memory delta (MB)
      - HTTP requests (from global `fsspec_logs`)
    Returns a pandas DataFrame with all results.
    """
    process = psutil.Process()
    results = []

    for i in range(repeats):
        fsspec_logs.clear()

        mem_before = process.memory_info().rss
        cpu_before = process.cpu_times()
        t0 = time.perf_counter()

        func()  # run the actual workload

        t1 = time.perf_counter()
        cpu_after = process.cpu_times()
        mem_after = process.memory_info().rss

        results.append(
            {
                "run": i + 1,
                "time_s": t1 - t0,
                "cpu_user_s": cpu_after.user - cpu_before.user,
                "cpu_sys_s": cpu_after.system - cpu_before.system,
                "mem_delta_MB": (mem_after - mem_before) / (1024**2),
                "http_requests": len(fsspec_logs),
            }
        )

    df = pd.DataFrame(results)
    display(df)
    return df
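
To illustrate why wall time and CPU time are recorded separately, here is a minimal sketch that uses only the standard library. It swaps psutil for time.process_time, and the sleeping workload is a hypothetical stand-in for a network-bound read, so the numbers are not comparable to the benchmarks in this notebook.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of `func`."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    func()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return wall_used, cpu_used


def waiting_workload():
    # Stand-in for a network-bound read: the process waits but does little work.
    time.sleep(0.2)


wall_s, cpu_s = measure(waiting_workload)
print(f"wall: {wall_s:.3f} s, cpu: {cpu_s:.3f} s")
```

A workload that mostly waits for data shows a wall time well above its CPU time, whereas a decode-heavy workload would narrow or reverse that gap.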

Benchmarking

EOPF Zarr on EODC

Define the reader function for opening the zarr datatree.

def open_zarr():
    xr.open_datatree(path_eodc_zarr, engine="zarr", mask_and_scale=False, chunks={})

Execute benchmarking.

bm_zarr_datatree = benchmark_run(open_zarr, repeats=3)

Assign the data to a variable to check whether the data has been accessed correctly.

dt = xr.open_datatree(path_eodc_zarr, engine="zarr", mask_and_scale=False, chunks={})

Define the reader function to actually load a band of the scene.

def load_band():
    _ = dt["measurements/reflectance/r10m"]["b04"].load()

Execute the benchmarking for loading a band. Note: after the first run, cached results are used, so only the first run reflects the cost of fetching the data.

bm_zarr_band = benchmark_run(load_band, repeats=3)
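
Since only the first run pays for the data transfer, it can help to report the first (cold) run separately from the later (warm) runs. This is a minimal sketch of such a summary; the helper name split_cold_warm and the example timings are hypothetical and are not measured results.

```python
from statistics import mean


def split_cold_warm(times_s):
    """Separate the first (cold) run from the remaining (warm) runs."""
    if not times_s:
        raise ValueError("need at least one timing")
    cold = times_s[0]
    warm = times_s[1:]
    warm_mean = mean(warm) if warm else None
    return {"cold_s": cold, "warm_mean_s": warm_mean}


# Hypothetical timings in seconds, for illustration only (not measured results).
example_times = [4.0, 0.5, 0.25]
summary = split_cold_warm(example_times)
print(summary)
```

With the DataFrame returned by benchmark_run, the timings could be passed as bm_zarr_band["time_s"].tolist().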

Check that the data is loaded correctly.

band_eodc_zarr = dt["measurements/reflectance/r10m"]["b04"].load()
band_eodc_zarr

SAFE on EODC

Define the path to the band in the SAFE file.

# Full URL to the B04 10m band file, taken from f"{path_eodc_safe}/manifest.safe"
b04_url = (
    f"{path_eodc_safe}/GRANULE/L2A_T32UNE_A051514_20250503T103937/IMG_DATA/R10m/"
    "T32UNE_20250503T103701_B04_10m.jp2"
)
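
The band path follows a regular pattern, so it can be assembled from its parts. The helper below is a minimal sketch; the function name band_jp2_path and its parameters are hypothetical, and the layout is inferred only from the B04 URL above. The granule ID still has to be looked up in manifest.safe.

```python
def band_jp2_path(product_root, granule_id, tile_stamp, band, resolution_m):
    """Build the path of one band image inside a Sentinel-2 L2A SAFE product."""
    return (
        f"{product_root}/GRANULE/{granule_id}/IMG_DATA/R{resolution_m}m/"
        f"{tile_stamp}_{band}_{resolution_m}m.jp2"
    )


# Values copied from the B04 URL in this notebook; the product root is a placeholder.
demo_url = band_jp2_path(
    product_root="PRODUCT.SAFE",  # placeholder standing in for path_eodc_safe
    granule_id="L2A_T32UNE_A051514_20250503T103937",
    tile_stamp="T32UNE_20250503T103701",
    band="B04",
    resolution_m=10,
)
print(demo_url)
```

Passing path_eodc_safe as product_root would reproduce the URL defined above.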

Define the function to load a band from the SAFE format.

def load_safe_band():
    fs = fsspec.filesystem("http")
    with fs.open(b04_url, mode="rb") as f:
        _ = rioxarray.open_rasterio(f, masked=False).load()

Execute the benchmarking on loading a SAFE band.

bm_safe_eodc = benchmark_run(load_safe_band, repeats=3)

Load the band and check the values.

fs = fsspec.filesystem("http")
with fs.open(b04_url) as f:
    band_eodc_safe = rioxarray.open_rasterio(f, masked=False).load()

band_eodc_safe

SAFE on CDSE S3

Note: Counting requests with the fsspec.http logger is not possible on S3. A separate logger would have to be set up to count S3 requests.
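
One possible way to set up such a logger is sketched below. It assumes that s3fs writes one debug message starting with "CALL:" to a logger named "s3fs" for each S3 API call; both the logger name and the prefix are assumptions about s3fs internals and should be checked against the installed version. The two debug calls at the end are simulated records so that the sketch runs without S3 access.

```python
import logging


class CountingHandler(logging.Handler):
    """Count log records whose message starts with a given prefix."""

    def __init__(self, prefix=""):
        super().__init__()
        self.prefix = prefix
        self.count = 0

    def emit(self, record):
        if record.getMessage().startswith(self.prefix):
            self.count += 1


# Assumption: s3fs logs each S3 API call at DEBUG level on the "s3fs" logger
# with a message starting with "CALL:". Verify this for your s3fs version.
s3_counter = CountingHandler(prefix="CALL:")
logger_s3 = logging.getLogger("s3fs")
logger_s3.propagate = False  # keep records out of the notebook output
logger_s3.setLevel(logging.DEBUG)
logger_s3.addHandler(s3_counter)

# Simulated records for illustration; in the notebook s3fs would emit these.
logger_s3.debug("CALL: %s", "get_object")
logger_s3.debug("other debug message")

print(s3_counter.count)
```

The counter would need to be reset to zero before each measured run, in the same way the fsspec log list is cleared.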

Insert your CDSE S3 credentials. They are only valid for a limited time. Here’s a guide on how to generate them.

# Only valid for a limited time. See the guide above for how to create your own.
credentials = {
    "key": "XXX",
    "secret": "XXX",
}

Set up the S3 file system.

fs = s3fs.S3FileSystem(
    key=credentials["key"],
    secret=credentials["secret"],
    client_kwargs={
        "region_name": "eu-central-1",
        "endpoint_url": "https://s3.dataspace.copernicus.eu",
    },
)

Define the path to the band within the SAFE file.

# Correct path from manifest.safe
band_path = (
    "eodata/Sentinel-2/MSI/L2A/2025/05/03/"
    "S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.SAFE/"
    "GRANULE/L2A_T32UNE_A051514_20250503T103937/IMG_DATA/R10m/"
    "T32UNE_20250503T103701_B04_10m.jp2"
)

Measure timing of loading one band.

%%time
# Open the file from S3
with fs.open(band_path, mode="rb") as f:
    band_cdse_safe = rioxarray.open_rasterio(f, masked=False).load()
CPU times: user 13.7 s, sys: 142 ms, total: 13.8 s
Wall time: 4.89 s

Check the values.

band_cdse_safe