Model Validation for Carbon Monoxide Column (XCO)#

This Jupyter notebook describes the validation steps for the SEARS XCO CrIS ML model for Carbon Monoxide Column (XCO) prediction.

The name of the model used is “Swift EARth Science XCO CrIS ML”, also known as “SEARS XCO CrIS ML”, and the version used is v1.9 which at the time of this writing is the latest version of the model. The model was trained using TROPESS data.

This code runs a test for January 2nd, 2024 and compares the TROPESS XCO column (“Truth”) with the SEARS XCO CrIS ML XCO column (“Prediction”) for the same day.

The test shows that the SEARS XCO CrIS ML model can accurately predict the Carbon Monoxide column (XCO) globally from CriS JPSS-1 / NOAA-20 radiance and geolocation data.

Import packages#

%%capture

import os

# data i/o
import xarray as xr
import numpy as np
import pandas as pd

from joblib import load

# Machine Learning
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error
from keras.models import load_model

# Plotting
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
2024-05-11 09:49:26.373373: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Select model version#

# model version
version = 'v1.9'

Read test data#

Read data#

ds = xr.open_dataset('./data/TRPSYL2COCRS1FS.1/2024-01-02/tropess_cris_rad.nc')

# microwindows
x_mw = ds['x_mw'].values

# get the radiances
rad_sw = ds['rad_sw'].values
rad_mw = ds['rad_mw'].values
rad_lw = ds['rad_lw'].values

# get the wavenumber grids
wnum_lw = ds['wnum_lw'].values
wnum_mw = ds['wnum_mw'].values
wnum_sw = ds['wnum_sw'].values

# select the wavenumber range within the x_mw range
index_sw = np.where((wnum_sw >= x_mw[0, 0]) & (wnum_sw <= x_mw[0, -1]))[0]
index_mw = np.where((wnum_mw >= x_mw[0, 0]) & (wnum_mw <= x_mw[0, -1]))[0]
index_lw = np.where((wnum_lw >= x_mw[0, 0]) & (wnum_lw <= x_mw[0, -1]))[0]

# select the radiances within the x_mw range
rad_sw = rad_sw[:, index_sw]
rad_mw = rad_mw[:, index_mw]
rad_lw = rad_lw[:, index_lw]

# concatenate the radiances
rad_all = np.concatenate((rad_sw, rad_mw, rad_lw), axis=1)

Put everything in a DataFrame#

df = pd.DataFrame(rad_all, columns=[f'freq_{i}' for i in range(rad_all.shape[1])])
df['sol_zen'] = ds['sol_zen'].values
df['view_ang'] = ds['view_ang'].values
df['lat'] = ds['lat'].values
df['lon'] = ds['lon'].values
df['surf_alt'] = ds['surf_alt'].values
df['x_col'] = ds['x_col'].values
df
freq_0 freq_1 freq_2 freq_3 freq_4 freq_5 freq_6 freq_7 freq_8 freq_9 ... freq_27 freq_28 freq_29 freq_30 sol_zen view_ang lat lon surf_alt x_col
0 1.728163 1.877602 1.868290 1.487039 1.684189 1.831528 1.614480 1.712737 1.533962 1.192963 ... 1.130634 1.071487 1.139017 0.981569 126.349365 10.695498 -28.976671 14.779457 0.000000 68.560249
1 1.649486 1.698336 1.633244 1.305192 1.466797 1.604789 1.490178 1.549921 1.319077 1.174590 ... 0.955425 0.936192 0.961850 0.917609 127.896027 32.346649 -28.220890 10.905661 0.000000 65.140305
2 1.502273 1.681431 1.603127 1.244511 1.400010 1.578782 1.347940 1.441621 1.259302 0.975720 ... 0.874194 0.813178 0.886241 0.761095 121.915581 41.584274 -30.538448 24.387032 1480.430786 60.777603
3 1.585017 1.788456 1.718454 1.324924 1.513300 1.699426 1.445408 1.541950 1.371280 1.021333 ... 0.927304 0.856228 0.955061 0.834249 122.145279 39.115875 -30.598688 23.613056 1201.312378 61.305073
4 1.720866 1.889747 1.871431 1.504817 1.657643 1.841449 1.634289 1.700501 1.541654 1.176711 ... 1.140300 1.070431 1.134980 0.971587 124.923309 6.004320 -29.820435 17.160793 148.928574 69.753883
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
40465 0.516140 0.525526 0.510668 0.455970 0.474169 0.484743 0.489949 0.454427 0.466745 0.417149 ... 0.337447 0.339147 0.335459 0.338968 91.250351 13.073485 -65.800491 -1.297861 0.000000 38.749565
40466 0.517676 0.546703 0.510474 0.460309 0.480100 0.507740 0.490774 0.489769 0.448977 0.439844 ... 0.346478 0.335610 0.327897 0.317088 91.156967 20.751623 -65.773331 -4.446333 0.000000 41.074181
40467 0.389643 0.392439 0.385829 0.345680 0.364458 0.349169 0.381284 0.356691 0.326714 0.330235 ... 0.270910 0.265628 0.275966 0.275979 90.281052 19.819241 -66.617737 -5.267918 0.000000 39.520351
40468 0.362302 0.374521 0.359075 0.333816 0.350586 0.353371 0.335240 0.349497 0.322690 0.324345 ... 0.266414 0.250868 0.253618 0.259445 90.053825 26.098942 -66.655052 -8.386524 0.000000 40.093941
40469 0.721977 0.794511 0.727240 0.604583 0.612928 0.723280 0.624739 0.654282 0.562119 0.450953 ... 0.337577 0.346104 0.359273 0.357260 91.867027 48.364948 -63.438637 -19.670118 0.000000 42.956348

40470 rows × 37 columns

Define model inputs and outputs#

The first 31 + 5 columns featuring spectral radiance values for 31 wavelengths as well as solar zenith angle, sensor view angle, latitude, longitude, and surface altitude will serve as input (X) variables. The XCO column Carbon Monoxide information serves as output (y) variable, which the model should be able to predict.

input_size = rad_all.shape[1] + 5

X = df.values[:, 0:input_size]
y = df.values[:, input_size:input_size+1]
X.shape, y.shape
((40470, 36), (40470, 1))

Load the model#

Load the model and the input scaler:

# load model
model = load_model(f'./models/{version}_model_xco_cris.keras')

# load the input scaler
input_scaler = load(f'./models/{version}_model_xco_cris_input_scaler.joblib')

Run the model#

%%capture

# Disable TensorFlow debugging logs 
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

#   Level | Level for Humans | Level Description                  
#  -------|------------------|------------------------------------ 
#   0     | INFO             | [Default] Print all messages       
#   1     | WARNING          | Filter out INFO messages           
#   2     | ERROR            | Filter out INFO & WARNING messages 
#   3     | NONE             | Filter out all messages      


# scale the input
X_scaled = input_scaler.transform(X)

# make a prediction
yhat = model.predict(X_scaled)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1715446169.558927 14720122 service.cc:145] XLA service 0x7ff4011dc3f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1715446169.558965 14720122 service.cc:153]   StreamExecutor device (0): Host, Default Version
I0000 00:00:1715446169.589399 14720122 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

Calculate a few metrics#

mae = mean_absolute_error(y, yhat)
rmse = root_mean_squared_error(y, yhat)
pearson = np.sqrt(r2_score(y, yhat))

print(f'Mean Absolute Error: {round(float(mae), 3)} ppbv')
print(f'Root Mean Squared Error: {round(float(rmse), 3)} ppbv')
print(f'Pearson Correlation Coefficient: {round(float(pearson), 3)}')
Mean Absolute Error: 3.792 ppbv
Root Mean Squared Error: 5.536 ppbv
Pearson Correlation Coefficient: 0.974

Compare mean and standard deviation#

print('\nEvaluation of mean and standard deviation of the estimated \nvalues of Carbon Monoxide (XCO) column compared to the actual ones: \n')

print(f'Carbon Monoxide (XCO) Column True mean: {round(float(np.mean(y)), 3)} ppbv')
print(f'Carbon Monoxide (XCO) Column Estimated mean: {round(float(np.mean(yhat)), 3)} ppbv \n')

print(f'Carbon Monoxide (XCO) Column True std: {round(float(np.std(y)), 3)} ppbv')
print(f'Carbon Monoxide (XCO) Column Estimated std: {round(float(np.std(yhat)), 3)} ppbv')
Evaluation of mean and standard deviation of the estimated 
values of Carbon Monoxide (XCO) column compared to the actual ones: 

Carbon Monoxide (XCO) Column True mean: 90.599 ppbv
Carbon Monoxide (XCO) Column Estimated mean: 91.174 ppbv 

Carbon Monoxide (XCO) Column True std: 24.556 ppbv
Carbon Monoxide (XCO) Column Estimated std: 23.689 ppbv

Linear Correlation#

from sklearn.metrics import r2_score

# Calculate R-squared value
r2 = r2_score(y, yhat)

plt.figure(figsize=(8, 6))
plt.scatter(y, yhat, color='black', marker='.', s=15, alpha=0.5)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('True vs Predicted Values')

# Set the limits of x and y to be the same
lims = [
    np.min([plt.xlim(), plt.ylim()]),  # min of both axes
    np.max([plt.xlim(), plt.ylim()]),  # max of both axes
]

plt.xlim(lims)
plt.ylim(lims)

# Draw the diagonal line
plt.plot(lims, lims, 'r-', alpha=0.75, zorder=0)

# Add R^2 value to the plot
plt.annotate(f'$R^2$ = {r2:.2f}', xy=(50, 300))

plt.show()
../_images/4bc2c77ae296836920cfa4f525d16be6c239c6db0fd670a1eaa7554be8bdb292.png

Linear Correlation (Clusters)#

from matplotlib import cm

# Calculate R-squared value
r2 = r2_score(y, yhat)

plt.figure(figsize=(8, 6))
plt.hexbin(y, yhat, gridsize=50, cmap=cm.jet, bins='log')
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('True vs Predicted Values (Clustered)')

cb = plt.colorbar()
cb.set_label('log10(N)')

# Set the limits of x and y to be the same
lims = [
    np.min([plt.xlim(), plt.ylim()]),  # min of both axes
    np.max([plt.xlim(), plt.ylim()]),  # max of both axes
]

plt.xlim(lims)
plt.ylim(lims)

# Draw the diagonal line
plt.plot(lims, lims, 'r-', alpha=0.75, zorder=0)

# Add R^2 value to the plot
plt.annotate(f'$R^2$ = {r2:.2f}', xy=(50, 300))

plt.show()
../_images/f91b94fc876a9f0e48ea8deeebdfd92dcb17a142c78f70cff7880db2beb44243.png

Plot TROPESS XCO (True Value)#

latitude = df['lat']
longitude = df['lon']
x_col = y
x_col_hat = yhat

# Specify figure size (in inches)
fig = plt.figure(figsize=(12, 8))  

# Create a basemap instance
m = Basemap(projection='cyl', resolution='l', 
            llcrnrlat=-90, urcrnrlat=90,  # set latitude limits to -90 and 90
            llcrnrlon=-180, urcrnrlon=180)  # set longitude limits to -180 and 180

m.drawcoastlines(linewidth=0.2)
m.drawcountries(linewidth=0.2)

# Draw parallels (latitude lines) and meridians (longitude lines)
parallels = np.arange(-90., 91., 30.)
m.drawparallels(parallels, labels=[True,False,False,False], linewidth=0.3)

meridians = np.arange(-180., 181., 60.)
m.drawmeridians(meridians, labels=[False,False,False,True], linewidth=0.3)    

# Standard catter plot
# Transform lat and lon to map projection coordinates
xlon, ylat = m(longitude, latitude)

# Plot the data using scatter (you may want to choose a different colormap and normalization)
sc = m.scatter(xlon, ylat, c=x_col, cmap='jet', s=10, vmin=40.0, vmax=200.0, marker='.')

# Add a colorbar
cbar = m.colorbar(sc, location='bottom', pad="10%")
cbar.set_label('x_col')

plt.show()
../_images/5d5e250d2ca15c173c795f0b86417f1f3ddf84b7724989c5080836303744d473.png

Plot Model XCO (Predicted Value)#

# Specify figure size (in inches)
fig = plt.figure(figsize=(12, 8))  

m = Basemap(projection='cyl', resolution='l', 
            llcrnrlat=-90, urcrnrlat=90,  # set latitude limits to -90 and 90
            llcrnrlon=-180, urcrnrlon=180)  # set longitude limits to -180 and 180

m.drawcoastlines(linewidth=0.2)
m.drawcountries(linewidth=0.2)

# Draw parallels (latitude lines) and meridians (longitude lines)
parallels = np.arange(-90., 91., 30.)
m.drawparallels(parallels, labels=[True,False,False,False], linewidth=0.3)

meridians = np.arange(-180., 181., 60.)
m.drawmeridians(meridians, labels=[False,False,False,True], linewidth=0.3)    

# Standard catter plot
# Transform lat and lon to map projection coordinates
xlon, ylat = m(longitude, latitude)

# Plot the data using scatter (you may want to choose a different colormap and normalization)
sc = m.scatter(xlon, ylat, c=x_col_hat, cmap='jet', s=10, vmin=40.0, vmax=200.0, marker='.')

# Add a colorbar
cbar = m.colorbar(sc, location='bottom', pad="10%")
cbar.set_label('x_col_hat')

plt.show()
../_images/8b219c1ad887b239a5cc2fcab6e5df1947c9f302d992032a8c6e4a197e5d2d80.png

Plot Difference (Predicted - True)#

Plot the percentage difference between TROPESS XCO (Truth) and SEARS XCO CrIS ML v1.9 (Predicted):

ydiff = 100 * (x_col_hat - x_col) / x_col
print (f'# of observations: {len(ydiff)}')

# Specify figure size (in inches)
fig = plt.figure(figsize=(12, 8))  

m = Basemap(projection='cyl', resolution='l', 
            llcrnrlat=-90, urcrnrlat=90,  # set latitude limits to -90 and 90
            llcrnrlon=-180, urcrnrlon=180)  # set longitude limits to -180 and 180

m.drawcoastlines(linewidth=0.2)
m.drawcountries(linewidth=0.2)

# Draw parallels (latitude lines) and meridians (longitude lines)
parallels = np.arange(-90., 91., 30.)
m.drawparallels(parallels, labels=[True,False,False,False], linewidth=0.3)

meridians = np.arange(-180., 181., 60.)
m.drawmeridians(meridians, labels=[False,False,False,True], linewidth=0.3)    

# Standard catter plot
# Transform lat and lon to map projection coordinates
xlon, ylat = m(longitude, latitude)

# Plot the data using scatter (you may want to choose a different colormap and normalization)
sc = m.scatter(xlon, ylat, c=ydiff, cmap='jet', s=10, vmin=-30.0, vmax=+30.0, marker='.')

# Add a colorbar
cbar = m.colorbar(sc, location='bottom', pad="10%")
cbar.set_label('100 * (x_col_hat - x_col) / x_col')

plt.show()
# of observations: 40470
../_images/8232d1aa3d0c8cfdddd7e1955a474dbe4edb9e5a38a484d276a5b057d7881d49.png

Plot Difference (less than 10%)#

Plot the percentage difference between the TROPESS XCO (Truth) and the SEARS XCO CrIS ML v1.9 (Predicted) where the difference is less than 10%:

ydiff = 100 * (x_col_hat - x_col) / x_col
print (f'# of observations: {len(ydiff)}')

ydiff_10_index = np.where(abs(ydiff) < 10)[0]
ydiff_10 = ydiff[ydiff_10_index]
print (f'# of observations where difference < 10%: {len(ydiff_10)}')

latitude_10 = latitude[ydiff_10_index]
longitude_10 = longitude[ydiff_10_index]

# Specify figure size (in inches)
fig = plt.figure(figsize=(12, 8))  

m = Basemap(projection='cyl', resolution='l', 
            llcrnrlat=-90, urcrnrlat=90,  # set latitude limits to -90 and 90
            llcrnrlon=-180, urcrnrlon=180)  # set longitude limits to -180 and 180

m.drawcoastlines(linewidth=0.2)
m.drawcountries(linewidth=0.2)

# Draw parallels (latitude lines) and meridians (longitude lines)
parallels = np.arange(-90., 91., 30.)
m.drawparallels(parallels, labels=[True,False,False,False], linewidth=0.3)

meridians = np.arange(-180., 181., 60.)
m.drawmeridians(meridians, labels=[False,False,False,True], linewidth=0.3)    

# Standard catter plot
# Transform lat and lon to map projection coordinates
xlon, ylat = m(longitude_10, latitude_10)

# Plot the data using scatter (you may want to choose a different colormap and normalization)
sc = m.scatter(xlon, ylat, c=ydiff_10, cmap='jet', s=30, vmin=-30.0, vmax=+30.0, marker='.')

# Add a colorbar
cbar = m.colorbar(sc, location='bottom', pad="10%")
cbar.set_label('100 * (x_col_hat - x_col) / x_col < 10%')

plt.show()
# of observations: 40470
# of observations where difference < 10%: 37666
../_images/96161a6f3de0e82cf07f1e07742c68cdd649c3146c9319ef3f1d52fccacb12d7.png