
Random Forest Based Land-Cover Mapping with Sentinel-2

Demonstrate how to perform basic land cover mapping using a random forest and Sentinel-2 data in EOPF Zarr format

Eurac Research


Introduction

In this notebook, we will explore land cover mapping, a practice that has been integral to remote sensing since its inception. Land cover products are essential for studying the functional and morphological changes in Earth’s ecosystems and the environment, playing a crucial role in understanding climate change and carbon circulation (Congalton et al., 2014; Feddema et al., 2005; Sellers et al., 1997). Additionally, land cover mapping provides valuable information for policy development and a wide range of applications within the natural and life sciences, making it one of the most extensively studied applications in remote sensing (Yu et al., 2014; Tucker et al., 1985; Running, 2008; Yang et al., 2013). The diverse applications of land cover mapping lead to varied user needs. Depending on the specific use case, there can be significant differences in the desired target labels, the requested target year(s), the required output resolution, the feature set used, the stratification strategy employed, and more.

This exercise demonstrates how to efficiently use the Sentinel-2 EOPF Zarr products to implement a basic land cover classification.

Install the missing libraries

pip install xgboost contextily

Training Dataset Generation

In this exercise, we will focus on 4 main classes:

  1. artificial surfaces
  2. agricultural areas
  3. forests and semi-natural areas
  4. water bodies

The names of these classes come from the Corine Land Cover Level 0 classification scheme: https://wiki.openstreetmap.org/wiki/Corine_Land_Cover

Sample points selection

Generate four geoJSON files, each containing at least 10 points for one class, using the website: https://geojson.io/#map=12/46.48882/11.33004

Follow this procedure for each land cover class:

  1. Open the link: https://geojson.io/#map=12/46.48882/11.33004
  2. Switch to satellite view by selecting “Standard Satellite” as the base layer in the bottom-left corner
  3. Select the Draw Point tool and place a point in an area corresponding to the class
  4. Repeat until you have at least 10 points
  5. Save the geoJSON file using the top-left toolbar
  6. Repeat for the other classes

Once you have your four .geojson files, upload them to JupyterLab via drag & drop.

If you prefer to skip this small mapping exercise and go directly to the next part, we have already prepared the necessary files for the area of Bolzano, Italy.

Read the points in Python

Modify the file paths to match the files that you generated

urban_points_file = "./geojson/urban.geojson"
crops_points_file = "./geojson/crop.geojson"
forest_points_file = "./geojson/forest.geojson"
water_points_file = "./geojson/water.geojson"

Use GeoPandas to read the vector data. We also assign the corresponding Corine Land Cover code.

import geopandas as gpd
import numpy as np

df_urban = gpd.read_file(urban_points_file).assign(label=100)
df_crops = gpd.read_file(crops_points_file).assign(label=200)
df_forest = gpd.read_file(forest_points_file).assign(label=300)
df_water = gpd.read_file(water_points_file).assign(label=500)

Inspect a GeoDataFrame object containing the sample points

df_urban.head(3)

Visualize a set of points to check if they are placed correctly on the corresponding class:

import contextily as cx

df_wm = df_forest.to_crs(epsg=3857)  # Points placed on forests
ax = df_wm.plot(figsize=(10, 10), alpha=0.5, edgecolor="k")
cx.add_basemap(ax)

Satellite data retrieval

We will use optical Sentinel-2 data from the months of June, July and August 2025.

The data is available via the public EOPF Zarr STAC Catalog at https://stac.core.eopf.eodc.eu.

The xcube-eopf package will handle the STAC API requests and the datacube construction.

from xcube.core.store import new_data_store
from xcube_eopf.utils import reproject_bbox
store = new_data_store("eopf-zarr")

We would like to extract the bounds of the area containing our training points.

To do that, we can combine the different GeoDataFrames into a single one and use the total_bounds attribute.

import pandas as pd

sample_points = pd.concat([df_urban, df_crops, df_forest, df_water])
bbox = sample_points.total_bounds
crs_utm = "EPSG:32632"  # If you select the points in a different part of the World, you might need to use a different CRS
bbox_utm = reproject_bbox(bbox, "EPSG:4326", crs_utm)
bbox_utm
(676214.7067171357, 5147669.36589368, 682626.7086097527, 5153320.106294904)
ds = store.open_data(
    data_id="sentinel-2-l2a",
    bbox=bbox_utm,
    time_range=["2025-06-01", "2025-08-31"],
    spatial_res=10,
    crs=crs_utm,
    variables=[
        "b04",
        "b03",
        "b02",
        "b08",
        # "scl"
    ],  # b04 = red, b03 = green, b02 = blue, b08 = near infrared
)
ds

Perform a temporal aggregation to get a monthly median of the data.

The median can filter out data extremes, such as clouds.
However, to get the best result it is advisable to create and apply a cloud mask using the Scene Classification Layer (SCL). Can you add this step before computing the median?

You can have a look at the example provided here: https://eopf-sample-service.github.io/eopf-sample-notebooks/sentinel-2-ndsi-snow-mapping/#cloud-snow

Remember to add the “scl” band when creating the datacube and to remove it before performing the feature extraction.
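
As a starting point, here is a minimal sketch of such a mask, assuming the datacube was opened with "scl" included in variables; the set of excluded SCL classes below is one reasonable choice, not the only one:

# Minimal sketch of an SCL-based cloud mask; assumes "scl" was requested
# when opening the datacube. SCL classes: 3 = cloud shadow,
# 8/9 = cloud medium/high probability, 10 = thin cirrus
cloud_mask = ds["scl"].isin([3, 8, 9, 10])
ds_masked = ds.where(~cloud_mask)       # cloudy pixels become NaN
ds_masked = ds_masked.drop_vars("scl")  # drop SCL before feature extraction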

# Apply the cloud mask to the data here:
monthly_median = ds.groupby("time.month").median().to_dataarray(dim="bands")
monthly_median

The default area over Bolzano corresponds to a manageable datacube that fits easily in memory, so we now load it into memory to speed up the next steps.

monthly_median = monthly_median.compute()

Visualize an RGB composite of one of the selected months:

monthly_median.loc[dict(bands=["b04", "b03", "b02"])][:, 0].plot.imshow(
    vmin=0, vmax=0.25
)

Predictors/Features extraction

Given the selected sample points and the Sentinel-2 data, we can now extract the data values corresponding to each sample point.

Each extracted data array becomes a so-called predictor, or feature vector.

Points - Data projection alignment

The geoJSON files contain points with coordinates expressed in lat/lon (WGS84 / EPSG:4326 coordinate system).

However, the data doesn’t necessarily have the same projection. Therefore, we reproject the points to match the data projection.

monthly_median.rio.crs
CRS.from_wkt('PROJCS["WGS 84 / UTM zone 32N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",9],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32632"]]')
sample_points_reproj = sample_points.to_crs(monthly_median.rio.crs)

Feature Vector extraction

Given the reprojected points, we can get the corresponding values and create the feature vectors, which will be used for training.

We store the feature vectors in the same GeoDataFrame as the sample points:

predictors = []
for _, row in sample_points_reproj.iterrows():
    point = row["geometry"]
    data_values = monthly_median.sel(x=point.x, y=point.y, method="nearest").values
    feature_vector = data_values.flatten()
    predictors.append(feature_vector)
sample_points_reproj = sample_points_reproj.assign(**{"features": predictors})
sample_points_reproj

We perform some operations to:

  1. Get the feature matrix in the desired shape: N×M, with N the number of points and M the number of features (in this case bands × months)
  2. Get the corresponding label vector in the correct format. The ML library we are using expects labels with values starting from 0, so we have to map them.
# Map each 0-based index to the original Corine label
dict_mapping = {
    i: lbl for i, lbl in enumerate(np.unique(sample_points_reproj["label"].values))
}
dict_mapping
{0: 100, 1: 200, 2: 300, 3: 500}
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train = le.fit_transform(sample_points_reproj["label"].values)
x_train = np.asarray([x for x in sample_points_reproj["features"].values])
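
For reference, the fitted encoder can also map predicted class indices back to the original Corine codes; a minimal sketch, where the encoded array is just an illustrative example:

# Minimal sketch: decode class indices back to Corine codes.
# The `encoded` values are an illustrative example.
encoded = np.array([0, 2, 3])
le.inverse_transform(encoded)  # -> array([100, 300, 500])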

Random Forest

We use the xgboost ML library with its Random Forest implementation.

For more info please read the documentation here:

https://xgboost.readthedocs.io/en/stable/tutorials/rf.html#standalone-random-forest-with-scikit-learn-like-api

and for an explanation of the colsample_by* hyperparameters, check this nice article:

https://medium.com/analytics-vidhya/xgboost-colsample-by-hyperparameters-explained-6c0bac1bdc1

Model training

from xgboost import XGBRFClassifier

model = XGBRFClassifier(n_estimators=100, subsample=0.9)
model.fit(x_train, y_train)
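Since we have no held-out data, a quick look at the accuracy on the training samples is the only sanity check available; keep in mind it says nothing about generalization. A minimal sketch, using the scikit-learn-like API:

# Minimal sketch: accuracy on the training samples only.
# This does not measure generalization; there is no held-out set.
train_accuracy = model.score(x_train, y_train)
print(f"Training accuracy: {train_accuracy:.2f}")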

Model prediction

To speed up the prediction process, we parallelize using xarray’s apply_ufunc, which allows us to apply the same function to multiple data chunks in parallel using Dask.

import xarray as xr


def predict_rf(x):
    # apply_ufunc moves the core dims ("bands", "month") to the end, so x has
    # shape (y, x, bands, month); flatten it to (n_pixels, n_features)
    d = np.reshape(x, (x.shape[0] * x.shape[1], x.shape[2] * x.shape[3]))
    pred = model.predict(d)
    # restore the spatial shape (y, x)
    return np.reshape(pred, (x.shape[0], x.shape[1]))


def predict(obj, dims):
    return xr.apply_ufunc(
        predict_rf,
        obj,
        input_core_dims=[dims],
        dask="parallelized",
        output_dtypes=[int],
    )


predicted = predict(monthly_median, ["bands", "month"])
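
Note that monthly_median was loaded into memory above, so apply_ufunc runs eagerly here. To actually exploit the Dask parallelism, the cube needs to be chunked along the spatial dimensions first; a minimal sketch, where the chunk sizes are an arbitrary choice:

# Minimal sketch: chunk spatially so apply_ufunc can dispatch blocks to Dask.
# The core dims ("bands", "month") must remain in a single chunk each.
monthly_median_chunked = monthly_median.chunk({"x": 512, "y": 512})
predicted = predict(monthly_median_chunked, ["bands", "month"]).compute()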

Prediction Evaluation

Now that our model has been trained and the prediction has been generated, we can evaluate the result.

We didn’t split the training set because of the limited number of samples, so we have no test or validation set to use.

For this simple example, please evaluate the result using QGIS (or equivalent GIS software) and a satellite layer as a reference.
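
To open the result in QGIS, one option is to export the prediction as a GeoTIFF; a minimal sketch, assuming rioxarray is available (the output filename is arbitrary):

# Minimal sketch: export the prediction as a GeoTIFF for QGIS.
# Assumes rioxarray is installed; the filename is arbitrary.
import rioxarray  # noqa: F401  (registers the .rio accessor)

predicted.astype("uint8").rio.write_crs(crs_utm).rio.to_raster(
    "predicted_land_cover.tif"
)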

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.patches as mpatches
import numpy as np

class_labels = {0: "Urban", 1: "Crops", 2: "Forest", 3: "Water"}

# Define a colormap with as many colors as classes
cmap = mcolors.ListedColormap(["red", "orange", "green", "blue"])

# Define normalization: boundaries between class values
bounds = [-0.5, 0.5, 1.5, 2.5, 3.5]  # half-way between class values
norm = mcolors.BoundaryNorm(bounds, cmap.N)

# Show the image
plt.imshow(predicted.values, cmap=cmap, norm=norm)

# Build custom legend
patches = [
    mpatches.Patch(color=cmap(i), label=label)
    for i, label in enumerate(class_labels.values())
]
plt.legend(handles=patches, bbox_to_anchor=(1.05, 1), loc="upper left")

plt.show()

What’s next?

Now that you have learned the basics, you can try to:

  • Create and apply a cloud mask to the data in the Satellite data retrieval section.
  • Apply the model to a new, unseen location and evaluate the results (see the sketch after this list).
  • Apply the model to a different year compared to the training set.
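
As a starting point for the last two ideas, here is a minimal sketch that reuses the trained model over a new bounding box; the coordinates are an arbitrary example within UTM zone 32N, and for a different year you would change time_range accordingly:

# Minimal sketch: apply the trained model to a new, unseen area.
# The bbox is an arbitrary example within UTM zone 32N.
new_bbox_utm = reproject_bbox((11.1, 46.3, 11.2, 46.4), "EPSG:4326", crs_utm)
ds_new = store.open_data(
    data_id="sentinel-2-l2a",
    bbox=new_bbox_utm,
    time_range=["2025-06-01", "2025-08-31"],
    spatial_res=10,
    crs=crs_utm,
    variables=["b04", "b03", "b02", "b08"],
)
new_median = ds_new.groupby("time.month").median().to_dataarray(dim="bands")
predicted_new = predict(new_median.compute(), ["bands", "month"])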