Using 'open virtual dataset' capability to work with TEMPO Level 3 data¶
Summary¶
In this tutorial, we will use the earthaccess.open_virtual_mfdataset()
function to open a week's worth of granules from the Nitrogen Dioxide (NO2) Level-3 data collection of the TEMPO air quality mission (link). We will then calculate means for a subset of the data and visualize the results.
Note that this same approach can be used for a date range of any length, within the mission's duration. Running this notebook for a year's worth of TEMPO Level-3 data took approximately 15 minutes.
Prerequisites¶
AWS US-West-2 Environment: This tutorial is designed to run in an AWS cloud compute instance in the us-west-2 region. It can also be run from a laptop or workstation; everything will still work, just without the speed benefits of in-cloud access.
Earthdata Account: A (free!) Earthdata Login account is required to access data from the NASA Earthdata system. Before requesting TEMPO data, we first need to set up our Earthdata Login authentication, as described in the Earthdata Cookbook's earthaccess tutorial (link).
Packages:
cartopy
dask
earthaccess (version 0.14.0 or greater)
matplotlib
numpy
xarray
Setup¶
import cartopy.crs as ccrs
import earthaccess
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from matplotlib import rcParams
%config InlineBackend.figure_format = 'jpeg'
rcParams["figure.dpi"] = (
80 # Reduce figure resolution to keep the saved size of this notebook low.
)
Log in using Earthdata Login¶
auth = earthaccess.login()
if not auth.authenticated:
# Ask for credentials and persist them in a .netrc file.
auth.login(strategy="interactive", persist=True)
print(earthaccess.__version__)
0.14.0
Search for data granules¶
We search for TEMPO Nitrogen Dioxide (NO2) data for a week-long period (note: times are in UTC), between January 11th and 18th, 2024.
results = earthaccess.search_data(
short_name="TEMPO_NO2_L3",
version="V03",
temporal=("2024-01-11 12:00", "2024-01-18 12:00"),
)
print(f"Number of results: {len(results)}")
Number of results: 81
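As a rough sanity check on this count (assumption, not stated in the collection metadata: TEMPO Level-3 produces roughly one gridded scan per daylight hour), 81 granules over a 7-day window works out to about 11-12 scans per day:

```python
# 81 granules were returned for the 7-day search window above.
n_granules = 81
n_days = 7

scans_per_day = n_granules / n_days
print(f"~{scans_per_day:.1f} scans per day")
```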
Opening Virtual Multifile Datasets¶
First we set the argument options to be used by earthaccess.open_virtual_mfdataset.

Considerations for the load argument:

load=True works. Within earthaccess.open_virtual_mfdataset, a temporary virtual reference file (a "virtual dataset") is created and then immediately loaded with kerchunk. This is because the function assumes the user is making this request for the first time and the combined manifest file needs to be generated first. In the future, however, earthaccess.open_virtual_mfdataset may provide a way to save the combined manifest file, at which point you could avoid repeating these steps and proceed directly to loading with kerchunk/virtualizarr.
load=False results in KeyError: "no index found for coordinate 'longitude'" because it creates ManifestArrays without indexes (see the earthaccess documentation here (link)).
open_options = {
"access": "direct",
"load": True,
"concat_dim": "time",
"coords": "minimal",
"compat": "override",
"combine_attrs": "override",
"parallel": True,
}
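To see what the combine-related options do before applying them to real granules, here is a minimal, self-contained sketch (synthetic single-time-step datasets standing in for TEMPO granules) of how xarray.combine_nested stitches them along the time dimension with the same coords, compat, and combine_attrs settings:

```python
import numpy as np
import xarray as xr


def make_granule(hour):
    # One synthetic "granule": a single time step on a shared latitude grid.
    return xr.Dataset(
        {"no2": (("time", "latitude"), np.random.rand(1, 3))},
        coords={
            "time": [np.datetime64(f"2024-01-11T{hour:02d}:00:00")],
            "latitude": [35.0, 36.0, 37.0],
        },
        attrs={"scan": hour},
    )


granules = [make_granule(h) for h in (13, 14)]

# The same combine settings as open_options above (minus the
# earthaccess-specific "access", "load", and "parallel" keys).
combined = xr.combine_nested(
    granules,
    concat_dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="override",
)
print(dict(combined.sizes))  # time is now length 2
```

Note that combine_attrs="override" keeps the attributes of the first dataset, which is why these options suit granules that share identical metadata.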
Because TEMPO data are processed and archived in a netCDF4 format using a group hierarchy, we open each group (i.e., 'root', 'product', and 'geolocation') and then merge them together.
%%time
result_root = earthaccess.open_virtual_mfdataset(granules=results, **open_options)
result_product = earthaccess.open_virtual_mfdataset(
granules=results, group="product", **open_options
)
result_geolocation = earthaccess.open_virtual_mfdataset(
granules=results, group="geolocation", **open_options
)
Merge the root group with the subgroups.
result_merged = xr.merge([result_root, result_product, result_geolocation])
result_merged
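As a minimal illustration of this merge step (synthetic stand-ins for the root, product, and geolocation groups, all sharing the same grid):

```python
import numpy as np
import xarray as xr

lat = [35.0, 36.0]

# Synthetic stand-ins: the root group often carries only coordinates,
# while the subgroups carry the science variables.
root = xr.Dataset(coords={"latitude": lat})
product = xr.Dataset(
    {"vertical_column_troposphere": ("latitude", np.random.rand(2))},
    coords={"latitude": lat},
)
geolocation = xr.Dataset(
    {"solar_zenith_angle": ("latitude", np.random.rand(2))},
    coords={"latitude": lat},
)

# Merging aligns on the shared coordinates and collects all variables.
merged = xr.merge([root, product, geolocation])
print(sorted(merged.data_vars))
```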
Temporal mean - a map of averages over the time range¶
temporal_mean_ds = (
result_merged.sel({"longitude": slice(-78, -74), "latitude": slice(35, 39)})
.where(result_merged["main_data_quality_flag"] == 0)
    .mean(dim="time")
)
temporal_mean_ds
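The pattern here (subset with .sel, mask with .where on the quality flag, then average over time) can be sketched on a small synthetic dataset, where the effect of the mask is easy to verify by hand:

```python
import numpy as np
import xarray as xr

# Synthetic dataset: 3 time steps on a 2x2 grid, plus a quality flag (0 = good).
ds = xr.Dataset(
    {
        "vertical_column_troposphere": (
            ("time", "latitude", "longitude"),
            np.array([[[1.0, 2.0], [3.0, 4.0]],
                      [[5.0, 6.0], [7.0, 8.0]],
                      [[9.0, 10.0], [11.0, 12.0]]]),
        ),
        "main_data_quality_flag": (
            ("time", "latitude", "longitude"),
            np.array([[[0, 0], [0, 1]],
                      [[0, 0], [0, 0]],
                      [[0, 0], [0, 0]]]),
        ),
    },
    coords={"latitude": [35.0, 36.0], "longitude": [-78.0, -77.0],
            "time": np.arange(3)},
)

# Keep only good-quality pixels, then average over time (NaNs are skipped).
mean_ds = ds.where(ds["main_data_quality_flag"] == 0).mean(dim="time")

# At (36, -77) the flagged value 4.0 is excluded: mean of 8 and 12 is 10.
print(float(mean_ds["vertical_column_troposphere"].sel(latitude=36.0, longitude=-77.0)))
```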
%%time
mean_vertical_column_trop = temporal_mean_ds["vertical_column_troposphere"].compute()
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
mean_vertical_column_trop.squeeze().plot.contourf(ax=ax)
ax.coastlines()
ax.gridlines(
draw_labels=True,
dms=True,
x_inline=False,
y_inline=False,
)
plt.show()
Spatial mean - a time series of area averages¶
spatial_mean_ds = (
result_merged.sel({"longitude": slice(-78, -74), "latitude": slice(35, 39)})
.where(result_merged["main_data_quality_flag"] == 0)
.mean(dim=("longitude", "latitude"))
)
spatial_mean_ds
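Averaging over the spatial dimensions instead of time collapses each scan to a single number, leaving a time series. A minimal sketch on synthetic data (note this is a plain unweighted mean over grid cells, as in the cell above):

```python
import numpy as np
import xarray as xr

# Synthetic field: 2 time steps over a 2x2 grid.
da = xr.DataArray(
    np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[10.0, 20.0], [30.0, 40.0]]]),
    dims=("time", "latitude", "longitude"),
    coords={"time": [0, 1], "latitude": [35.0, 36.0], "longitude": [-78.0, -77.0]},
    name="vertical_column_troposphere",
)

# Collapse the spatial dimensions; one area average remains per time step.
area_mean = da.mean(dim=("longitude", "latitude"))
print(area_mean.values)  # 2.5 and 25.0, one mean per time step
```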
%%time
spatial_mean_vertical_column_trop = spatial_mean_ds[
"vertical_column_troposphere"
].compute()
spatial_mean_vertical_column_trop.plot()
plt.show()
Single scan subset¶
subset_ds = result_merged.sel(
{
"longitude": slice(-78, -74),
"latitude": slice(35, 39),
"time": slice(
np.datetime64("2024-01-11T13:00:00"), np.datetime64("2024-01-11T14:00:00")
),
}
).where(result_merged["main_data_quality_flag"] == 0)
subset_ds
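Selecting a single scan relies on label-based slicing of the time coordinate, which is inclusive of both endpoints. A self-contained sketch with a synthetic hourly time axis:

```python
import numpy as np
import xarray as xr

# Synthetic hourly time axis spanning one day.
times = np.arange(np.datetime64("2024-01-11T00:00:00"),
                  np.datetime64("2024-01-12T00:00:00"),
                  np.timedelta64(1, "h"))
da = xr.DataArray(np.arange(times.size), dims="time", coords={"time": times})

# Label-based slicing keeps every step within the bounds, endpoints included.
scan = da.sel(time=slice(np.datetime64("2024-01-11T13:00:00"),
                         np.datetime64("2024-01-11T14:00:00")))
print(scan.sizes["time"])  # 2 -> the 13:00 and 14:00 steps
```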
%%time
subset_vertical_column_trop = subset_ds["vertical_column_troposphere"].compute()
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
subset_vertical_column_trop.squeeze().plot.contourf(ax=ax)
ax.coastlines()
ax.gridlines(
draw_labels=True,
dms=True,
x_inline=False,
y_inline=False,
)
plt.show()