Zarr vs SAFE Metadata
Learn how to extract Metadata from EOPF Zarr, in comparison with SAFE (Standard Archive Format for Europe) using GDAL & Xarray (Sentinel-2 L2A)

Table of Contents¶
Introduction¶
This notebook explores the most efficient techniques for extracting metadata from Sentinel-2 L2A data, comparing the EOPF Zarr and the legacy SAFE (Standard Archive Format for Europe) formats. It demonstrates how to use GDAL to extract metadata from both formats and investigates the use of Xarray for metadata extraction from Zarr. Additionally, the notebook highlights key differences between the two formats. Through practical examples, users will gain insights into optimizing metadata retrieval workflows for Earth observation data.
Setup¶
Start importing the necessary libraries
from osgeo import gdal
import xarray as xr
EOPF Zarr metadata extraction and interpretation with GDAL¶
We will demonstrate how to use GDAL to extract the main metadata fields for Sentinel-2 L2A.
The source file is now a Zarr file, which is the result of the original SAFE product converted using the eopf library.
zarr_path = 'ZARR:"/vsicurl/https://objectstore.eodc.eu:2222/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/tutorial_data/cpm_v253/S2A_MSIL2A_20240101T102431_N0510_R065_T32TNT_20240101T144052.zarr"'
try:
dataset = gdal.OpenEx(
zarr_path,
gdal.OF_MULTIDIM_RASTER,
)
root_group = dataset.GetRootGroup()
attributes = {}
for attribute in root_group.GetAttributes():
attributes[attribute.GetName()] = attribute.Read()
except RuntimeError as e:
print(f"GDAL Error: {e}")
attributes.keys()
General Metadata¶
- Mission ID: Sentinel-2A, Sentinel-2B or Sentinel-2C
- Product ID: Unique identifier for the data product
- Product Type: (e.g., S2MSI1C -> Level-1C, S2MSI2A -> Level-2A)
- Processing Baseline: The software version used for processing
- Acquisition Start Time: Time when the sensor started acquiring data
- Acquisition Stop Time: Time when the sensor stopped acquiring data
- Processing Time: Time when the data was processed
- Cloud Cover Percentage: Estimated cloud coverage in the scene
mission_id = attributes["stac_discovery"]["properties"]["platform"]
product_id = attributes["stac_discovery"]["id"]
product_type = attributes["stac_discovery"]["properties"]["processing:level"]
processing_baseline = attributes["stac_discovery"]["properties"]["eopf:baseline"]
product_start_time = attributes["stac_discovery"]["properties"]["start_datetime"]
product_stop_time = attributes["stac_discovery"]["properties"]["end_datetime"]
processing_time = attributes["stac_discovery"]["properties"]["created"]
cloud_cover = attributes["stac_discovery"]["properties"]["eo:cloud_cover"]
print("General Metadata")
print(f"Mission ID: {mission_id}")
print(f"Product ID: {product_id}")
print(f"Product Type: {product_type}")
print(f"Processing Baseline: {processing_baseline}")
print(f"Acquisition Start Time: {product_start_time}")
print(f"Acquisition Stop Time: {product_stop_time}")
print(f"Processing Time: {processing_time}")
print(f"Cloud Cover Percentage: {cloud_cover} %")
Acquisition & Geolocation¶
- Tile ID: Sentinel-2 tiling grid reference (e.g., T31TCH)
- Relative Orbit Number: Orbit number within the 10-day repeat cycle
- Latitude & Longitude Bounds: Bounding box coordinates of the scene
- Mean Sun Zenith Angle: Average sun elevation during acquisition
- Mean Sun Azimuth Angle: Average sun azimuth angle during acquisition
tile_id = attributes["stac_discovery"]["id"].split("_")[-2]
relative_orbit_number = attributes["stac_discovery"]["properties"]["sat:relative_orbit"]
bbox = attributes["stac_discovery"]["bbox"]
mean_sun_zenith_angle = attributes["other_metadata"][
"mean_sun_zenith_angle_in_deg_for_all_bands_all_detectors"
]
mean_sun_azimuth_angle = attributes["other_metadata"][
"mean_sun_azimuth_angle_in_deg_for_all_bands_all_detectors"
]
print("\nAcquisition & Geolocation")
print(f"Tile ID: {tile_id}")
print(f"Relative Orbit Number: {relative_orbit_number}")
print(f"BBOX: {bbox}")
print(f"Mean Sun Zenith Angle: {mean_sun_zenith_angle}")
print(f"Mean Sun Azimuth Angle: {mean_sun_azimuth_angle}")
Spectral Bands Information¶
- Bands List: Names of available bands (e.g., B01–B12)
- Wavelength Range: Central wavelength and bandwidth for each band
- Spatial Resolution: 10m, 20m, or 60m depending on the band
print("\nSpectral & Band Information")
bands = attributes["stac_discovery"]["properties"]["bands"]
resolution_mapping = attributes["stac_discovery"]["properties"]["eopf:resolutions"]
# Process the resolution mapping into a dictionary where band names are the keys
formatted_resolution_mapping = {}
for key, value in resolution_mapping.items():
band_names = key.replace("bands ", "").split(", ")
for band in band_names:
band = band.strip().lower() # Standardize case and remove spaces
if (
band.isdigit() or band == "8a"
): # Ensure "b" prefix for numerical bands & "8a"
band = f"b{band}"
formatted_resolution_mapping[band] = value + "m"
# Loop through bands and print with resolution
for band in bands:
band_name = band["name"].strip().lower() # Ensure matching format
central_wavelength = band["center_wavelength"]
resolution = formatted_resolution_mapping.get(band_name, "Unknown")
print(
f"Band Name: {band_name}, Central Wavelength: {central_wavelength} nm, Resolution: {resolution}"
)
SAFE metadata extraction and interpretation with GDAL¶
We will demonstrate how to use GDAL to extract the main metadata fields for Sentinel-2 L2A.
The source file is a zip file hosted on an S3 object storage and we can get the information we need pointing to the specific .xml
metadata file.
This avoids downloading the full zip, which would take several GBs on your disk and some time to be downloaded. It is particulalry useful if we are only interested in the metadata.
# Open the Sentinel-2 SAFE metadata file using GDAL
file_path = "/vsizip//vsicurl/https://objectstore.eodc.eu:2222/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/safe-data/s2_msil2a/S2A_MSIL2A_20240101T102431_N0510_R065_T32TNT_20240101T144052.SAFE.zip/eodc/private/jrc_gfm/gfm_scratch/work/eopf/work/safe4eurac/data/S2A_MSIL2A_20240101T102431_N0510_R065_T32TNT_20240101T144052.SAFE/MTD_MSIL2A.xml"
dataset = gdal.Open(file_path)
# Extract metadata
metadata = dataset.GetMetadata()
print("Available metadata fields: ", metadata)
General Metadata¶
- Mission ID: Sentinel-2A, Sentinel-2B or Sentinel-2C
- Product ID: Unique identifier for the data product
- Product Type: (e.g., S2MSI1C -> Level-1C, S2MSI2A -> Level-2A)
- Processing Baseline: The software version used for processing
- Acquisition Start Time: Time when the sensor started acquiring data
- Acquisition Stop Time: Time when the sensor stopped acquiring data
- Processing Time: Time when the data was processed
- Cloud Cover Percentage: Estimated cloud coverage in the scene
mission_id = metadata.get("DATATAKE_1_SPACECRAFT_NAME", "N/A")
product_id = metadata.get("PRODUCT_URI", "N/A")
product_type = metadata.get("PRODUCT_TYPE", "N/A")
processing_baseline = metadata.get("PROCESSING_BASELINE", "N/A")
product_start_time = metadata.get("PRODUCT_START_TIME", "N/A")
product_stop_time = metadata.get("PRODUCT_STOP_TIME", "N/A")
processing_time = metadata.get("GENERATION_TIME", "N/A")
cloud_cover_percentage = metadata.get("CLOUD_COVERAGE_ASSESSMENT", "N/A")
print("General Metadata")
print(f"Mission ID: {mission_id}")
print(f"Product ID: {product_id}")
print(f"Product Type: {product_type}")
print(f"Processing Baseline: {processing_baseline}")
print(f"Acquisition Start Time: {product_start_time}")
print(f"Acquisition Stop Time: {product_stop_time}")
print(f"Processing Time: {processing_time}")
print(f"Cloud Cover Percentage: {cloud_cover_percentage}%")
Acquisition & Geolocation¶
- Tile ID: Sentinel-2 tiling grid reference (e.g., T31TCH)
- Relative Orbit Number: Orbit number within the 10-day repeat cycle
- Latitude & Longitude Bounds: Bounding box coordinates of the scene
tile_id = metadata.get("PRODUCT_URI", "N/A").split("_")[-2]
relative_orbit_number = metadata.get("DATATAKE_1_SENSING_ORBIT_NUMBER", "N/A")
footprint = metadata.get("FOOTPRINT", "N/A")
print("\nAcquisition & Geolocation")
print(f"Tile ID: {tile_id}")
print(f"Relative Orbit Number: {relative_orbit_number}")
print(f"Latitude & Longitude Bounds: {footprint}")
Radiometric & Atmospheric Metadata¶
- Quantification Value: Scale factor to convert raw digital numbers (DN) to reflectance
- Radiometric Offset (introduced with processing baseline 5):
- GDAL can’t currently extract the metadata field
BOA_ADD_OFFSET
, which in the originalMTD_MSIL2A.xml
metadata file is a value specified for each band like this:<BOA_ADD_OFFSET band_id="0">-1000</BOA_ADD_OFFSET>
- GDAL can’t currently extract the metadata field
The mentioned values are necessary to compute the Bottom-Of-Atmosphere (BOA) values from the Digital Numbers (DN) (values in the JPEG2000 files) with the following formula:
Acronyms:
BOA: Bottom-Of-Atmosphere
AOT: Aerosol Optical Thickness
WVP: Scene-average Water Vapour
# Extract values safely
quantification_value_BOA = metadata.get("BOA_QUANTIFICATION_VALUE", "")
quantification_value_AOT = metadata.get("AOT_QUANTIFICATION_VALUE", "")
quantification_value_WVP = metadata.get("WVP_QUANTIFICATION_VALUE", "")
# Define units if available
boa_unit = metadata.get("BOA_QUANTIFICATION_VALUE_UNIT", "").replace("none", "")
aot_unit = metadata.get("AOT_QUANTIFICATION_VALUE_UNIT", "").replace("none", "")
wvp_unit = metadata.get("WVP_QUANTIFICATION_VALUE_UNIT", "").replace("none", "")
print("\nRadiometric & Athmospheric Metadata")
print(f"BOA Quantification Value: {quantification_value_BOA} {boa_unit}")
print(f"AOT Quantification Value: {quantification_value_AOT} {aot_unit}")
print(f"WVP Quantification Value: {quantification_value_WVP} {wvp_unit}")
Subdatasets¶
Here we can get a hint about the product structure, where different bands are available in different resolutions:
subdatasets = dataset.GetSubDatasets()
subdatasets
Zarr metadata extraction and interpretation with Xarray¶
We finally access the metadata using the Xarray Python library, which support opening Zarr products and can also be used to process the data easily.
ds = xr.open_datatree(
"https://objectstore.eodc.eu:2222/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/tutorial_data/cpm_v253/S2A_MSIL2A_20240101T102431_N0510_R065_T32TNT_20240101T144052.zarr",
engine="zarr",
chunks={},
)
General Metadata¶
- Mission ID: Sentinel-2A, Sentinel-2B or Sentinel-2C
- Product ID: Unique identifier for the data product
- Product Type: (e.g., S2MSI1C -> Level-1C, S2MSI2A -> Level-2A)
- Processing Baseline: The software version used for processing
- Acquisition Start Time: Time when the sensor started acquiring data
- Acquisition Stop Time: Time when the sensor stopped acquiring data
- Processing Time: Time when the data was processed
- Cloud Cover Percentage: Estimated cloud coverage in the scene
mission_id = ds.attrs["stac_discovery"]["properties"]["platform"]
product_id = ds.attrs["stac_discovery"]["id"]
product_type = ds.attrs["stac_discovery"]["properties"]["processing:level"]
processing_baseline = ds.attrs["stac_discovery"]["properties"]["eopf:baseline"]
product_start_time = ds.attrs["stac_discovery"]["properties"]["start_datetime"]
product_stop_time = ds.attrs["stac_discovery"]["properties"]["end_datetime"]
processing_time = ds.attrs["stac_discovery"]["properties"]["created"]
cloud_cover = ds.attrs["stac_discovery"]["properties"]["eo:cloud_cover"]
print("General Metadata")
print(f"Mission ID: {mission_id}")
print(f"Product ID: {product_id}")
print(f"Product Type: {product_type}")
print(f"Processing Baseline: {processing_baseline}")
print(f"Acquisition Start Time: {product_start_time}")
print(f"Acquisition Stop Time: {product_stop_time}")
print(f"Processing Time: {processing_time}")
print(f"Cloud Cover Percentage: {cloud_cover} %")
Acquisition & Geolocation¶
- Tile ID: Sentinel-2 tiling grid reference (e.g., T31TCH)
- Relative Orbit Number: Orbit number within the 10-day repeat cycle
- Latitude & Longitude Bounds: Bounding box coordinates of the scene
- Mean Sun Zenith Angle: Average sun elevation during acquisition
- Mean Sun Azimuth Angle: Average sun azimuth angle during acquisition
tile_id = ds.attrs["stac_discovery"]["id"].split("_")[-2]
relative_orbit_number = ds.attrs["stac_discovery"]["properties"]["sat:relative_orbit"]
bbox = ds.attrs["stac_discovery"]["bbox"]
mean_sun_zenith_angle = ds.other_metadata[
"mean_sun_zenith_angle_in_deg_for_all_bands_all_detectors"
]
mean_sun_azimuth_angle = ds.other_metadata[
"mean_sun_azimuth_angle_in_deg_for_all_bands_all_detectors"
]
print("\nAcquisition & Geolocation")
print(f"Tile ID: {tile_id}")
print(f"Relative Orbit Number: {relative_orbit_number}")
print(f"BBOX: {bbox}")
print(f"Mean Sun Zenith Angle: {mean_sun_zenith_angle}")
print(f"Mean Sun Azimuth Angle: {mean_sun_azimuth_angle}")
Spectral Bands Information¶
- Bands List: Names of available bands (e.g., B01–B12)
- Wavelength Range: Central wavelength and bandwidth for each band
- Spatial Resolution: 10m, 20m, or 60m depending on the band
bands = ds.attrs["stac_discovery"]["properties"]["bands"]
resolution_mapping = ds.attrs["stac_discovery"]["properties"]["eopf:resolutions"]
# Process the resolution mapping into a dictionary where band names are the keys
formatted_resolution_mapping = {}
for key, value in resolution_mapping.items():
band_names = key.replace("bands ", "").split(", ")
for band in band_names:
band = band.strip().lower() # Standardize case and remove spaces
if (
band.isdigit() or band == "8a"
): # Ensure "b" prefix for numerical bands & "8a"
band = f"b{band}"
formatted_resolution_mapping[band] = value + "m"
print("Spectral Bands Information")
# Loop through bands and print with resolution
for band in bands:
band_name = band["name"].strip().lower() # Ensure matching format
central_wavelength = band["center_wavelength"]
resolution = formatted_resolution_mapping.get(band_name, "Unknown")
print(
f"Band Name: {band_name}, Central Wavelength: {central_wavelength} nm, Resolution: {resolution}"
)
