EOPF zarr Performance Comparison - direct read
Compare the performance of the Sentinel EOPF zarr format with that of the SAFE format when reading the files directly.

Introduction¶
This notebook provides a framework for performance tests and resource monitoring on the new EOPF zarr format by reading the files directly.
Setup¶
Start by importing the necessary libraries.
import xarray as xr
import time
import psutil
import logging
import pandas as pd
# for direct loading
import rioxarray
import fsspec
import s3fs
Define the paths to the files.
path_eodc_zarr = "https://objectstore.eodc.eu:2222/e05ab01a9d56408d82ac32d69a5aae2a:202505-s02msil2a/03/products/cpm_v256/S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.zarr"
path_eodc_safe = "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/SAFE/S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.SAFE"
path_cdse_safe = "s3://eodata/Sentinel-2/MSI/L2A/2025/05/03/S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.SAFE"
Functions¶
This sets up the fsspec logger, which is used to count the HTTP requests made when accessing the data. Zarr typically issues many small requests, whereas SAFE issues a few large ones.
# Silent in-memory log capture
fsspec_logs = []
class ListHandler(logging.Handler):
    def __init__(self, storage):
        super().__init__()
        self.storage = storage
    def emit(self, record):
        self.storage.append(self.format(record))
# Disable root logger output to notebook
logging.getLogger().handlers.clear()
# Configure only fsspec.http logger
logger_fsspec = logging.getLogger("fsspec.http")
logger_fsspec.handlers.clear()
logger_fsspec.propagate = False # <-- important! stops logs bubbling up
logger_fsspec.setLevel(logging.DEBUG)
logger_fsspec.addHandler(ListHandler(fsspec_logs))
This sets up the tracking of CPU and memory usage.
# cpu tracking
process = psutil.Process()
This function takes an arbitrary reader function and adds the benchmarking around it.
def benchmark_run(func, repeats=1):
    """
    Benchmark any callable `func` and measure:
    - Wall time
    - CPU time (user + system)
    - Memory delta (MB)
    - HTTP requests (from global `fsspec_logs`)
    Returns a pandas DataFrame with all results.
    """
    process = psutil.Process()
    results = []
    for i in range(repeats):
        fsspec_logs.clear()
        mem_before = process.memory_info().rss
        cpu_before = process.cpu_times()
        t0 = time.perf_counter()
        func()  # run the actual workload
        t1 = time.perf_counter()
        cpu_after = process.cpu_times()
        mem_after = process.memory_info().rss
        results.append(
            {
                "run": i + 1,
                "time_s": t1 - t0,
                "cpu_user_s": cpu_after.user - cpu_before.user,
                "cpu_sys_s": cpu_after.system - cpu_before.system,
                "mem_delta_MB": (mem_after - mem_before) / (1024**2),
                "http_requests": len(fsspec_logs),
            }
        )
    df = pd.DataFrame(results)
    display(df)
    return df
Benchmarking¶
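Because `benchmark_run` calls `func()` with no arguments, a reader that needs parameters (a URL, a band name) has to be wrapped first. The following is a minimal sketch of that pattern, using `functools.partial` and a simplified stand-in harness that records only wall and CPU time with the standard library. `timed_run`, `fake_reader` and the example URL are hypothetical names for illustration and are not part of the notebook.

```python
import functools
import time


def timed_run(func, repeats=1):
    """Simplified stand-in for benchmark_run: wall and CPU time only."""
    results = []
    for i in range(repeats):
        cpu_before = time.process_time()
        t0 = time.perf_counter()
        func()  # zero-argument callable, as in benchmark_run
        t1 = time.perf_counter()
        cpu_after = time.process_time()
        results.append(
            {"run": i + 1, "time_s": t1 - t0, "cpu_s": cpu_after - cpu_before}
        )
    return results


def fake_reader(url, band):
    """Hypothetical reader; sleeps briefly to imitate waiting on I/O."""
    time.sleep(0.01)
    return f"{url}/{band}"


# Bind the arguments up front so the harness can call the reader without any.
reader = functools.partial(
    fake_reader, "https://example.invalid/product.zarr", band="b04"
)
runs = timed_run(reader, repeats=3)
for row in runs:
    print(row)
```

The same wrapping should work with `benchmark_run` itself, since it only requires a callable that takes no arguments.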
EOPF Zarr on EODC¶
Define the reader function for opening the zarr datatree.
def open_zarr():
    xr.open_datatree(path_eodc_zarr, engine="zarr", mask_and_scale=False, chunks={})
Execute the benchmarking.
bm_zarr_datatree = benchmark_run(open_zarr, repeats=3)
Assign the datatree to a variable to check that the data has been accessed correctly.
dt = xr.open_datatree(path_eodc_zarr, engine="zarr", mask_and_scale=False, chunks={})
Define the reader function that actually loads a band of the scene.
def load_band():
    _ = dt["measurements/reflectance/r10m"]["b04"].load()
Execute the benchmarking for loading a band. Note: after the first run, cached results are used.
bm_zarr_band = benchmark_run(load_band, repeats=3)
Check that the data is loaded correctly.
band_eodc_zarr = dt["measurements/reflectance/r10m"]["b04"].load()
band_eodc_zarr
SAFE on EODC¶
Define the path to the band in the SAFE file.
# Full URL to the B04 10 m band file; the relative path is listed in f"{path_eodc_safe}/manifest.safe"
b04_url = (
    f"{path_eodc_safe}/GRANULE/L2A_T32UNE_A051514_20250503T103937/IMG_DATA/R10m/"
    "T32UNE_20250503T103701_B04_10m.jp2"
)
Define the function to load a band from the SAFE format.
def load_safe_band():
    fs = fsspec.filesystem("http")
    with fs.open(b04_url, mode="rb") as f:
        _ = rioxarray.open_rasterio(f, masked=False).load()
Execute the benchmarking for loading a SAFE band.
bm_safe_eodc = benchmark_run(load_safe_band, repeats=3)
Load the band and check the values.
fs = fsspec.filesystem("http")
with fs.open(b04_url) as f:
    band_eodc_safe = rioxarray.open_rasterio(f, masked=False).load()
band_eodc_safe
SAFE on CDSE S3¶
Note: Counting HTTP requests is not possible for S3 access with the logger configured above, which only captures fsspec.http. A separate logger would have to be set up to count S3 requests.
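One way to set up such a logger is to attach the same kind of list handler to the logger that s3fs writes to. The sketch below assumes that s3fs emits its debug records on a logger named `s3fs` and that each S3 API call produces one record starting with `CALL:`. Both are assumptions to check against the installed s3fs version. `ListHandler` is redefined here so the block is self-contained, `count_s3_calls` is a hypothetical helper, and the final lines emit synthetic records purely to show the counting.

```python
import logging

s3_logs = []


class ListHandler(logging.Handler):
    """Collect formatted log records in a plain list."""

    def __init__(self, storage):
        super().__init__()
        self.storage = storage

    def emit(self, record):
        self.storage.append(self.format(record))


# ASSUMPTION: s3fs logs through a logger named "s3fs".
logger_s3 = logging.getLogger("s3fs")
logger_s3.handlers.clear()
logger_s3.propagate = False  # keep the records out of the notebook output
logger_s3.setLevel(logging.DEBUG)
logger_s3.addHandler(ListHandler(s3_logs))


def count_s3_calls(logs, prefix="CALL:"):
    """Count records that look like API calls (the prefix is an assumption)."""
    return sum(1 for line in logs if line.startswith(prefix))


# Synthetic records standing in for real s3fs output, for illustration only.
logger_s3.debug("CALL: %s - %s", "get_object", {"Key": "example"})
logger_s3.debug("Some other debug message")
print(count_s3_calls(s3_logs))
```

Filtering on a prefix rather than counting every record matters here, because a library logger at DEBUG level may also emit messages that do not correspond to a request.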
Insert your CDSE S3 credentials. They are only valid for a limited time. Here’s a guide on how to generate them.
# Only valid for a limited time. See the guide above for how to create your own.
credentials = {
    "key": "XXX",
    "secret": "XXX",
}
Set up the S3 file system.
fs = s3fs.S3FileSystem(
    key=credentials["key"],
    secret=credentials["secret"],
    client_kwargs={
        "region_name": "eu-central-1",
        "endpoint_url": "https://s3.dataspace.copernicus.eu",
    },
)
Define the path to the band within the SAFE file.
# Correct path from manifest.safe
band_path = (
    "eodata/Sentinel-2/MSI/L2A/2025/05/03/"
    "S2A_MSIL2A_20250503T103701_N0511_R008_T32UNE_20250503T173316.SAFE/"
    "GRANULE/L2A_T32UNE_A051514_20250503T103937/IMG_DATA/R10m/"
    "T32UNE_20250503T103701_B04_10m.jp2"
)
Measure the time taken to load one band.
%%time
# Open the file from S3
with fs.open(band_path, mode="rb") as f:
    band_cdse_safe = rioxarray.open_rasterio(f, masked=False).load()
CPU times: user 13.7 s, sys: 142 ms, total: 13.8 s
Wall time: 4.89 s
Check the values.
band_cdse_safe
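To compare the formats, the per-run results can be reduced to one row per benchmark. The following is a minimal sketch, assuming each benchmark is available as a list of row dictionaries (for the DataFrames returned by `benchmark_run`, `df.to_dict("records")` produces that shape). `summarise` is a hypothetical helper, and the numbers below are made-up placeholders, not measurements.

```python
from statistics import mean


def summarise(benchmarks, columns=("time_s", "http_requests")):
    """Average the chosen columns over all runs of each benchmark."""
    summary = {}
    for name, rows in benchmarks.items():
        summary[name] = {col: mean(row[col] for row in rows) for col in columns}
    return summary


# Placeholder rows for illustration only; real values come from benchmark_run.
example = {
    "zarr_band": [
        {"run": 1, "time_s": 2.0, "http_requests": 40},
        {"run": 2, "time_s": 1.0, "http_requests": 20},
    ],
    "safe_band": [
        {"run": 1, "time_s": 3.0, "http_requests": 4},
        {"run": 2, "time_s": 5.0, "http_requests": 4},
    ],
}

for name, stats in summarise(example).items():
    print(name, stats)
```

Because runs after the first may hit caches, as noted for the Zarr band, it can be more informative to report the first run separately from the later ones rather than averaging them together.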