Using 'open virtual dataset' capability to work with TEMPO Level 3 data¶
Summary¶
In this tutorial, we will use the earthaccess.open_virtual_mfdataset()
function to open a week's worth of granules from the Nitrogen Dioxide (NO2) Level-3 data collection of the TEMPO air quality mission (link). We will then calculate means for a subset of the data and visualize the results.
Note that this same approach can be used for a date range of any length, within the mission's duration. Running this notebook for a year's worth of TEMPO Level-3 data took approximately 15 minutes.
Prerequisites¶
AWS US-West-2 Environment: This tutorial is designed to run in an AWS cloud compute instance in the us-west-2 region. It can also be run from a laptop or workstation; everything will still work, just without the speed benefits of in-cloud access.
Earthdata Account: A (free!) Earthdata Login account is required to access data from the NASA Earthdata system. Before requesting TEMPO data, we first need to set up our Earthdata Login authentication, as described in the Earthdata Cookbook's earthaccess tutorial (link).
Packages:
cartopy
dask
earthaccess (version 0.14.0 or greater)
matplotlib
numpy
xarray
Setup¶
import cartopy.crs as ccrs
import earthaccess
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from matplotlib import rcParams
%config InlineBackend.figure_format = 'jpeg'
rcParams["figure.dpi"] = (
80 # Reduce figure resolution to keep the saved size of this notebook low.
)
Log in using Earthdata Login¶
auth = earthaccess.login()
if not auth.authenticated:
# Ask for credentials and persist them in a .netrc file.
auth.login(strategy="interactive", persist=True)
print(earthaccess.__version__)
0.14.0
Search for data granules¶
We search for TEMPO Nitrogen Dioxide (NO2) data for a week-long period (note: times are in UTC), between January 11th and 18th, 2024.
results = earthaccess.search_data(
short_name="TEMPO_NO2_L3",
version="V03",
temporal=("2024-01-11 12:00", "2024-01-18 12:00"),
)
print(f"Number of results: {len(results)}")
Number of results: 81
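As a rough sanity check on this count (assumption, not stated in the collection metadata: TEMPO Level-3 produces roughly one gridded scan per daylight hour), 81 granules over a 7-day window works out to about 11-12 scans per day:

```python
# 81 granules were returned for the 7-day search window above.
n_granules = 81
n_days = 7

scans_per_day = n_granules / n_days
print(f"~{scans_per_day:.1f} scans per day")
```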
Opening Virtual Multifile Datasets¶
First we set the argument options to be used by earthaccess.open_virtual_mfdataset.

Considerations for the load argument:

load=True works. Within earthaccess.open_virtual_mfdataset, a temporary virtual reference file (a "virtual dataset") is created and then immediately loaded with kerchunk. This is because the function assumes the user is making this request for the first time and the combined manifest file needs to be generated first. In the future, however, earthaccess.open_virtual_mfdataset may provide a way to save the combined manifest file, at which point you could avoid repeating these steps and proceed directly to loading with kerchunk/virtualizarr.
load=False results in KeyError: "no index found for coordinate 'longitude'" because it creates ManifestArrays without indexes (see the earthaccess documentation here (link)).
open_options = {
"access": "direct",
"load": True,
"concat_dim": "time",
"coords": "minimal",
"compat": "override",
"combine_attrs": "override",
"parallel": True,
}
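To see what the combine-related options do before applying them to real granules, here is a minimal, self-contained sketch (synthetic single-time-step datasets standing in for TEMPO granules) of how xarray.combine_nested stitches them along the time dimension with the same coords, compat, and combine_attrs settings:

```python
import numpy as np
import xarray as xr


def make_granule(hour):
    # One synthetic "granule": a single time step on a shared latitude grid.
    return xr.Dataset(
        {"no2": (("time", "latitude"), np.random.rand(1, 3))},
        coords={
            "time": [np.datetime64(f"2024-01-11T{hour:02d}:00:00")],
            "latitude": [35.0, 36.0, 37.0],
        },
        attrs={"scan": hour},
    )


granules = [make_granule(h) for h in (13, 14)]

# The same combine settings as open_options above (minus the
# earthaccess-specific "access", "load", and "parallel" keys).
combined = xr.combine_nested(
    granules,
    concat_dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="override",
)
print(dict(combined.sizes))  # time is now length 2
```

Note that combine_attrs="override" keeps the attributes of the first dataset, which is why these options suit granules that share identical metadata.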
Because TEMPO data are processed and archived in a netCDF4 format using a group hierarchy, we open each group (i.e., 'root', 'product', and 'geolocation') and then merge them together.
%%time
result_root = earthaccess.open_virtual_mfdataset(granules=results, **open_options)
result_product = earthaccess.open_virtual_mfdataset(
granules=results, group="product", **open_options
)
result_geolocation = earthaccess.open_virtual_mfdataset(
granules=results, group="geolocation", **open_options
)
Merge the root group with the subgroups.
result_merged = xr.merge([result_root, result_product, result_geolocation])
result_merged
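As a minimal illustration of this merge step (synthetic stand-ins for the root, product, and geolocation groups, all sharing the same grid):

```python
import numpy as np
import xarray as xr

lat = [35.0, 36.0]

# Synthetic stand-ins: the root group often carries only coordinates,
# while the subgroups carry the science variables.
root = xr.Dataset(coords={"latitude": lat})
product = xr.Dataset(
    {"vertical_column_troposphere": ("latitude", np.random.rand(2))},
    coords={"latitude": lat},
)
geolocation = xr.Dataset(
    {"solar_zenith_angle": ("latitude", np.random.rand(2))},
    coords={"latitude": lat},
)

# Merging aligns on the shared coordinates and collects all variables.
merged = xr.merge([root, product, geolocation])
print(sorted(merged.data_vars))
```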
Temporal mean - a map of averages over the time range¶
temporal_mean_ds = (
result_merged.sel({"longitude": slice(-78, -74), "latitude": slice(35, 39)})
.where(result_merged["main_data_quality_flag"] == 0)
    .mean(dim="time")
)
temporal_mean_ds
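The pattern here (subset with .sel, mask with .where on the quality flag, then average over time) can be sketched on a small synthetic dataset, where the effect of the mask is easy to verify by hand:

```python
import numpy as np
import xarray as xr

# Synthetic dataset: 3 time steps on a 2x2 grid, plus a quality flag (0 = good).
ds = xr.Dataset(
    {
        "vertical_column_troposphere": (
            ("time", "latitude", "longitude"),
            np.array([[[1.0, 2.0], [3.0, 4.0]],
                      [[5.0, 6.0], [7.0, 8.0]],
                      [[9.0, 10.0], [11.0, 12.0]]]),
        ),
        "main_data_quality_flag": (
            ("time", "latitude", "longitude"),
            np.array([[[0, 0], [0, 1]],
                      [[0, 0], [0, 0]],
                      [[0, 0], [0, 0]]]),
        ),
    },
    coords={"latitude": [35.0, 36.0], "longitude": [-78.0, -77.0],
            "time": np.arange(3)},
)

# Keep only good-quality pixels, then average over time (NaNs are skipped).
mean_ds = ds.where(ds["main_data_quality_flag"] == 0).mean(dim="time")

# At (36, -77) the flagged value 4.0 is excluded: mean of 8 and 12 is 10.
print(float(mean_ds["vertical_column_troposphere"].sel(latitude=36.0, longitude=-77.0)))
```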
%%time
mean_vertical_column_trop = temporal_mean_ds["vertical_column_troposphere"].compute()
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
mean_vertical_column_trop.squeeze().plot.contourf(ax=ax)
ax.coastlines()
ax.gridlines(
draw_labels=True,
dms=True,
x_inline=False,
y_inline=False,
)
plt.show()
Spatial mean - a time series of area averages¶
spatial_mean_ds = (
result_merged.sel({"longitude": slice(-78, -74), "latitude": slice(35, 39)})
.where(result_merged["main_data_quality_flag"] == 0)
.mean(dim=("longitude", "latitude"))
)
spatial_mean_ds
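Averaging over the spatial dimensions instead of time collapses each scan to a single number, leaving a time series. A minimal sketch on synthetic data (note this is a plain unweighted mean over grid cells, as in the cell above):

```python
import numpy as np
import xarray as xr

# Synthetic field: 2 time steps over a 2x2 grid.
da = xr.DataArray(
    np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[10.0, 20.0], [30.0, 40.0]]]),
    dims=("time", "latitude", "longitude"),
    coords={"time": [0, 1], "latitude": [35.0, 36.0], "longitude": [-78.0, -77.0]},
    name="vertical_column_troposphere",
)

# Collapse the spatial dimensions; one area average remains per time step.
area_mean = da.mean(dim=("longitude", "latitude"))
print(area_mean.values)  # 2.5 and 25.0, one mean per time step
```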
%%time
spatial_mean_vertical_column_trop = spatial_mean_ds[
"vertical_column_troposphere"
].compute()
spatial_mean_vertical_column_trop.plot()
plt.show()
Single scan subset¶
subset_ds = result_merged.sel(
{
"longitude": slice(-78, -74),
"latitude": slice(35, 39),
"time": slice(
np.datetime64("2024-01-11T13:00:00"), np.datetime64("2024-01-11T14:00:00")
),
}
).where(result_merged["main_data_quality_flag"] == 0)
subset_ds
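Selecting a single scan relies on label-based slicing of the time coordinate, which is inclusive of both endpoints. A self-contained sketch with a synthetic hourly time axis:

```python
import numpy as np
import xarray as xr

# Synthetic hourly time axis spanning one day.
times = np.arange(np.datetime64("2024-01-11T00:00:00"),
                  np.datetime64("2024-01-12T00:00:00"),
                  np.timedelta64(1, "h"))
da = xr.DataArray(np.arange(times.size), dims="time", coords={"time": times})

# Label-based slicing keeps every step within the bounds, endpoints included.
scan = da.sel(time=slice(np.datetime64("2024-01-11T13:00:00"),
                         np.datetime64("2024-01-11T14:00:00")))
print(scan.sizes["time"])  # 2 -> the 13:00 and 14:00 steps
```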
%%time
subset_vertical_column_trop = subset_ds["vertical_column_troposphere"].compute()
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
subset_vertical_column_trop.squeeze().plot.contourf(ax=ax)
ax.coastlines()
ax.gridlines(
draw_labels=True,
dms=True,
x_inline=False,
y_inline=False,
)
plt.show()