Introduction to AirVeg
AirVeg.Rmd
The goal of AirVeg is to make it easy to import and manipulate data from air quality models and measurements. Validation of simulations with respect to observations can be performed with the provided statistical scores.
In this example, we will perform a simple validation of predicted (model) data with observed measurements of ground level O3 concentrations in the city of Bologna. Please note that the dataset cited in the following are not available inside the package due to the size of the files.
Import data
Firstly, we load the AirVeg
library. We also load the
data.table
package as it will be used to manipulate columns
and values of the dataset:
library(AirVeg)
library(data.table)
data.table
provides a high-performance version of base
R’s data.frame
. If you are not familiar with it, please
check the documentation.
Let’s assume we have a dataset of model data in the NetCDF
file BO_c_2015_SVR_hourly_01.nc
. We can check the content
of the archive with the AirVeg::nc_summary()
function:
nc_summary("../devtemp/data/BO/FARM/BO_c_2015_SVR_hourly_01.nc")
#>
#> --- netCDF archive info ---
#> CF Convention : COARDS
#> First deadline : 01/01/2015 03:00:00 UTC
#> Last deadline : 01/02/2015 00:00:00 UTC
#> Deadline interval (s) : 3600
#> Number of deadlines : 742
#> Number of gridpoints (x, y, z): 50, 50, 0
#> Coord of SW corner (km, km) : 660.500 4905.500
#> Coord of NE corner (km, km) : 709.500 4954.500
#> Grid cell sizes (x, y) : 1.00 1.00
#> List of 2D variables ( 4) : c_O3 c_NO2 c_PM25 c_PM10
We notice that the archive contains hourly values of four pollutants
for the single month of January 2015. If we want to read the
Ozone in the the c_O3 variable, we can use the
AirVeg::nc_get_variable()
function, as:
farmData <- nc_get_variable(input = "../devtemp/data/BO/FARM/BO_c_2015_SVR_hourly_01.nc",
variable = "c_O3",
tz = "Etc/GMT-1")
We can use the data.table
notation to check for the
content of farmData
:
farmData[]
#> time y x c_O3
#> <POSc> <num> <num> <num>
#> 1: 2015-01-01 03:00:00 4905.5 660.5 13.69436
#> 2: 2015-01-01 03:00:00 4905.5 661.5 15.36455
#> 3: 2015-01-01 03:00:00 4905.5 662.5 16.47361
#> 4: 2015-01-01 03:00:00 4905.5 663.5 16.27375
#> 5: 2015-01-01 03:00:00 4905.5 664.5 15.30639
#> ---
#> 1854996: 2015-02-01 00:00:00 4954.5 705.5 20.36149
#> 1854997: 2015-02-01 00:00:00 4954.5 706.5 21.08920
#> 1854998: 2015-02-01 00:00:00 4954.5 707.5 21.57969
#> 1854999: 2015-02-01 00:00:00 4954.5 708.5 21.88763
#> 1855000: 2015-02-01 00:00:00 4954.5 709.5 22.32390
We see that the variable is composed of 4 columns: time
with date and time; x
and y
with grid
coordinates; c_O3
with the concentration values of O3. Each line of the table contains all
the information about the given deadline.
The grid coordinates are expressed in kilometres. As it will
be clear in the following, it is convenient to convert then to
metres. In order to do so, we again use the compact notation of
data.table
and perform an in-place
transformation.
farmData[, `:=`(x = x * 1000, y = y * 1000)]
farmData[]
#> time y x c_O3
#> <POSc> <num> <num> <num>
#> 1: 2015-01-01 03:00:00 4905500 660500 13.69436
#> 2: 2015-01-01 03:00:00 4905500 661500 15.36455
#> 3: 2015-01-01 03:00:00 4905500 662500 16.47361
#> 4: 2015-01-01 03:00:00 4905500 663500 16.27375
#> 5: 2015-01-01 03:00:00 4905500 664500 15.30639
#> ---
#> 1854996: 2015-02-01 00:00:00 4954500 705500 20.36149
#> 1854997: 2015-02-01 00:00:00 4954500 706500 21.08920
#> 1854998: 2015-02-01 00:00:00 4954500 707500 21.57969
#> 1854999: 2015-02-01 00:00:00 4954500 708500 21.88763
#> 1855000: 2015-02-01 00:00:00 4954500 709500 22.32390
To perform a very simple validation of the modeled data, we need a
corresponding observations dataset. In this case we can use
AirVeg::import_data_brace
to read observations data from a
local file with a subset of the BRACE database:
obsO3 <- import_data_brace("../devtemp/data/BO/BRACE/O3_2015_BRACE/",
tz = "Etc/GMT-1")
obsO3[]
#> Key: <cod_brace>
#> cod_brace datetime obs x y nomestaz
#> <int> <POSc> <int> <num> <num> <char>
#> 1: 803708 2015-01-01 00:00:00 23 687271.6 4928255 GIARDINI_MARGHERITA
#> 2: 803708 2015-01-01 01:00:00 23 687271.6 4928255 GIARDINI_MARGHERITA
#> 3: 803708 2015-01-01 02:00:00 NA 687271.6 4928255 GIARDINI_MARGHERITA
#> 4: 803708 2015-01-01 03:00:00 23 687271.6 4928255 GIARDINI_MARGHERITA
#> 5: 803708 2015-01-01 04:00:00 29 687271.6 4928255 GIARDINI_MARGHERITA
#> ---
#> 17516: 803719 2015-12-31 19:00:00 6 681741.6 4929951 VIA_CHIARINI
#> 17517: 803719 2015-12-31 20:00:00 4 681741.6 4929951 VIA_CHIARINI
#> 17518: 803719 2015-12-31 21:00:00 3 681741.6 4929951 VIA_CHIARINI
#> 17519: 803719 2015-12-31 22:00:00 4 681741.6 4929951 VIA_CHIARINI
#> 17520: 803719 2015-12-31 23:00:00 3 681741.6 4929951 VIA_CHIARINI
#> regione tipo zona quota lat lon cod_airbase
#> <char> <char> <char> <int> <num> <num> <char>
#> 1: Emilia-Romagna background urban 43 44.48333 11.35500 IT0892A
#> 2: Emilia-Romagna background urban 43 44.48333 11.35500 IT0892A
#> 3: Emilia-Romagna background urban 43 44.48333 11.35500 IT0892A
#> 4: Emilia-Romagna background urban 43 44.48333 11.35500 IT0892A
#> 5: Emilia-Romagna background urban 43 44.48333 11.35500 IT0892A
#> ---
#> 17516: Emilia-Romagna background suburban 56 44.50000 11.28611 IT2075A
#> 17517: Emilia-Romagna background suburban 56 44.50000 11.28611 IT2075A
#> 17518: Emilia-Romagna background suburban 56 44.50000 11.28611 IT2075A
#> 17519: Emilia-Romagna background suburban 56 44.50000 11.28611 IT2075A
#> 17520: Emilia-Romagna background suburban 56 44.50000 11.28611 IT2075A
The resulting output is again a data.table
object which
contains measurements for two stations, as it can be easily
verified:
uniqueN(obsO3[, .(cod_brace)])
#> [1] 2
Namely the stations are “Giardini Margherita” and “Via Chiarini” respectively of background/urban and background/suburban type:
unique(obsO3[, .(nomestaz, tipo, zona), by = .(cod_brace)])
#> Key: <cod_brace>
#> cod_brace nomestaz tipo zona
#> <int> <char> <char> <char>
#> 1: 803708 GIARDINI_MARGHERITA background urban
#> 2: 803719 VIA_CHIARINI background suburban
Model data extraction
We now need to extract the model data at the location of the two
measurement stations. This task can be accomplished with the help the
AirVeg::nc_extract()
function. Here we need to specify the
predicted and observed dataset and the names of the relevant
columns.
farmDataBrace <- nc_extract(mydata = farmData,
name.x = "x", name.y = "y", name.time = "time", name.value = "c_O3",
stations = obsO3,
st.x = "x", st.y = "y", st.time = "datetime",
st.value = "obs", st.code = "cod_brace")
In the first group of arguments, information about the model data are provided. The corresponding information for the observations are given in the second group of arguments.
In the example, it is used the default extraction method
(method="simple"
) i.e. the values of cell in which a point
falls are returned.
Alternatively a bilinear interpolation of the four nearest neighbours
can be performed by choosing the "bilinear"
method.
farmDataBrace
is a new data.table
object
obtained merging the two orignal tables with both the observed data and
the model values extracted at the stations coordinates:
colnames(farmDataBrace)
#> [1] "cod_brace" "time" "x" "y" "c_O3"
#> [6] "obs" "nomestaz" "regione" "tipo" "zona"
#> [11] "quota" "lat" "lon" "cod_airbase"
To improve readability, we can drop the unnecessary columns:
farmDataBrace[, `:=`(cod_airbase = NULL, lat = NULL, lon = NULL, quota = NULL,
regione = NULL, tipo = NULL, zona = NULL)]
farmDataBrace[]
#> Key: <cod_brace, time, x, y>
#> cod_brace time x y c_O3 obs
#> <int> <POSc> <num> <num> <num> <int>
#> 1: 803708 2015-01-01 03:00:00 687271.6 4928255 1.234904e+01 23
#> 2: 803708 2015-01-01 04:00:00 687271.6 4928255 2.263929e+01 29
#> 3: 803708 2015-01-01 05:00:00 687271.6 4928255 2.417113e+01 26
#> 4: 803708 2015-01-01 06:00:00 687271.6 4928255 1.723468e+01 21
#> 5: 803708 2015-01-01 07:00:00 687271.6 4928255 2.061162e+00 30
#> ---
#> 1480: 803719 2015-01-31 20:00:00 681741.6 4929951 2.941550e+00 4
#> 1481: 803719 2015-01-31 21:00:00 681741.6 4929951 3.731866e-01 5
#> 1482: 803719 2015-01-31 22:00:00 681741.6 4929951 2.186908e-04 5
#> 1483: 803719 2015-01-31 23:00:00 681741.6 4929951 2.199765e-02 5
#> 1484: 803719 2015-02-01 00:00:00 681741.6 4929951 2.679052e-01 5
#> nomestaz
#> <char>
#> 1: GIARDINI_MARGHERITA
#> 2: GIARDINI_MARGHERITA
#> 3: GIARDINI_MARGHERITA
#> 4: GIARDINI_MARGHERITA
#> 5: GIARDINI_MARGHERITA
#> ---
#> 1480: VIA_CHIARINI
#> 1481: VIA_CHIARINI
#> 1482: VIA_CHIARINI
#> 1483: VIA_CHIARINI
#> 1484: VIA_CHIARINI
The dataset now includes all the necessary columns to perform some
data validation. Specifically, the observed measurements are in the
obs
column while the predicted data from FARM model are in
the c_O3
column.
Data validation
Now we can compute some validation scores, grouping the data by
station, using the compact data.table
syntax:
farmDataBrace[, .(Ndata = countComplete(obs, c_O3),
bias = bias(obs = obs, pred = c_O3, use = "complete.obs"),
rmse = rmse(obs = obs, pred = c_O3, use = "complete.obs"),
corr = corr(obs = obs, pred = c_O3, use = "na.or.complete"),
nmse = nmse(obs = obs, pred = c_O3, use = "complete.obs"),
fb = fb(obs = obs, pred = c_O3, use = "complete.obs"),
fac2 = fac2(obs = obs, pred = c_O3, use = "complete.obs"),
ioa = ioa(obs = obs, pred = c_O3, use = "complete.obs")),
by = .(cod_brace)]
#> Key: <cod_brace>
#> cod_brace Ndata bias rmse corr nmse fb fac2
#> <int> <int> <num> <num> <num> <num> <num> <num>
#> 1: 803708 656 -2.788975 17.27184 0.4919275 1.820024 -0.2165626 0.2743902
#> 2: 803719 711 -2.350166 12.46522 0.5916661 1.034304 -0.1908693 0.3248945
#> ioa
#> <num>
#> 1: 0.6922993
#> 2: 0.7598378
In many of the used statistical functions, we need to set the
"complete.obs"
option to select only the rows where both
observed and predicted data are present. Otherwise, if some value is
missing (Not Available, NA), the result would be null:
farmDataBrace[, .(Ndata = countComplete(obs, c_O3),
bias = bias(obs = obs, pred = c_O3),
rmse = rmse(obs = obs, pred = c_O3),
corr = corr(obs = obs, pred = c_O3),
nmse = nmse(obs = obs, pred = c_O3),
fb = fb(obs = obs, pred = c_O3),
fac2 = fac2(obs = obs, pred = c_O3),
ioa = ioa(obs = obs, pred = c_O3)),
by = .(cod_brace)]
#> Key: <cod_brace>
#> cod_brace Ndata bias rmse corr nmse fb fac2 ioa
#> <int> <int> <num> <num> <num> <num> <num> <num> <num>
#> 1: 803708 656 NA NA NA NA NA NA NA
#> 2: 803719 711 NA NA NA NA NA NA NA
Data visualization
Eventually we can use the AirVeg::scatterPlot()
graphic
function to visualize the observed and predicted (model) data we just
validated:
library(ggplot2)
scatterPlot(data = farmDataBrace, obs = "obs", mod = "c_O3") +
scale_x_continuous(limits = c(0, 100),
name = expression("Observated data ["~mu~g/m^3~"]")) +
scale_y_continuous(limits = c(0, 100), name = expression("Model data ["~mu~g/m^3~"]"))
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Warning: Removed 117 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Please note that scatterPlot
returns an object of class
ggplot
, therefore it can be further customized by the
directives of the ggplot2
library, as with
scale_x_continuous
and scale_y_continuous
. In
order to do so, we had to explicitly load the library before running the
command. This is the case for all the graphics functions provided in
AirVeg
.