Motivation

GAMs are useful when the type of relations between a response variable and explanatory variables are complex or nonlinear.

  • The relation between the explanatory variables and the response variable is described by a smooth pattern that can be nonlinear

  • Several smoothed relations can be added to estimate the response

We will use the package mgcv developed by Simon Wood


An Example

Suppose we have the following (simulated) variables and we which to explain or predict variations in y using x.

First, let’s simulate some data.


set.seed(123) # make the example reproducible
n<-200 # number of points
x<-sort(runif(n,0,10)) 

f<-function(i){ # build a complex function
  cos(i)+exp(0.2*i)
}
y<-f(x)+rnorm(length(x),0,0.5) # add some random variation
d<-data.frame(x,y) # build a data.frame

Here is what the data looks like. The line is the true generating function.


plot(y~x,data=d,xlab="x",ylab="y",col="grey60")
v<-seq(min(d$x),max(d$x),by=0.1)
lines(v,f(v),col="grey60",lwd=2) 


Obviously, a linear regression does not make sense as a model.


plot(y~x,data=d,xlab="x",ylab="y",col="grey60")
lines(v,f(v),col="grey60",lwd=2) ## generating function (TRUE function)
m<-lm(y~x,data=d)
p<-predict(m,data.frame(x=v))
lines(v,p,lwd=1)


This is also obvious when checking linear model assumptions.

par(mfrow=c(2,2))
plot(m)


We could also decide to fit polynomial regressions


plot(y~x,data=d,xlab="x",ylab="y",col="grey60")
lines(v,f(v),col="grey60",lwd=2) ## generating function (TRUE function)

m2<-lm(y~poly(x,2),data=d)
p<-predict(m2,data.frame(x=v))
lines(v,p,col=2)

m3<-lm(y~poly(x,3),data=d)
p<-predict(m3,data.frame(x=v))
lines(v,p,col=3)

m4<-lm(y~poly(x,4),data=d)
p<-predict(m4,data.frame(x=v))
lines(v,p,col=4)

legend("topleft",lwd=1,cex=2,col=2:4,legend=c(expression(x^2),expression(x^3),expression(x^4)),bty="n")


Now, let’s use a GAM model with the gam function from package mgcv and a smooth over x using the function s. We can compare the fit of the GAM model to the data generating function. The fit is quite close to the true generating function.

library(mgcv)

# plot the data and the true curve
plot(y ~ x, data = d, xlab = "x", ylab = "y", col = "grey60")
lines(v, f(v), col = "grey60", lwd = 2)

# fit the gam model
g <- gam(y ~ s(x), data = d)

# add the line
p <- predict(g, data.frame(x = v))
lines(v, p, col = "black")


CHALLENGE 1

Generate you own complex single variable function and try to describe it using gam and the smooth function s.


Under the Hood

(but just a bit…)

Model formulation

General linear model

A general linear model is a linear combination of explanatory variables.

\[y \sim \mathcal{N}(\mu, \sigma^2)\]

\[\mu=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n\]


Generalized linear model

A generalized linear model is a linear combination of explanatory variables, but there is a link function \(\mathcal{g}\) that relates the response to the linear predictor and the error follows a distribution from the exponential family.

\[y \sim \mathcal{ExpFam}(\mu, ...)\]

\[\mathcal{g}(\mu)=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n\]


Generalized additive models

A generalized additive model has the same component as a GLM, but the linear predictor is made of non-linear functions (smooths) of the explanatory variables.

\[y \sim \mathcal{ExpFam}(\mu,...)\]

\[\mathcal{g}(\mu)=\beta_0 + \mathcal{f_1} (X_1) + \mathcal{f_2} (X_2) + ... + \mathcal{f_n}(X_n)\]

There are many ways to construct these functions. Each smooth function \(\mathcal{f}\) has a basis of a given dimension \(\mathcal{q}\) and for a given \(X\) can be written as:

\[\mathcal{f} (X)=\sum_{i=1}^{\mathcal{q}} \mathcal{b_i}(X)\beta_i\]

where the \(\beta_i\) are arbitrary functions. Thus, a GAM is a linear combination of smooth terms. An example for a basis could be a simple cubic polynomial.

\[\mathcal{f} (X)=\beta_1 + \beta_2 X + \beta_3 X^2 + \beta_4 X^3\]


GAM construction

To have a slightly better understanding of what a GAM may represent, let’s decompose the data we studied in the example into sections and fit cubic polynomial to each.


plot(y ~ x, data = d, xlab = "x", ylab = "y", col = "grey60")
knots <- seq(min(d$x), max(d$x), length.out = 4)
b <- cut(d$x, breaks = knots)
abline(v = knots, lty = 2, col = "grey60")

d$b <- b

l <- split(d, d$b)

invisible(lapply(l, function(i) {
    m <- lm(y ~ poly(x, 3), data = i)
    v <- seq(min(i$x), max(i$x), by = 0.1)
    p <- predict(m, data.frame(x = v))
    lines(v, p)
}))


We now have different pieces fitted to each section. But the pieces need to be smoothly connected together to form splines which are piecewise polynomial curves. Without entering into details, these pieces are connected using constraints on first- and second-order derivatives to ensure smooth connections. These connections are made at knots. There are several ways to determine the number of knots and it may also depend on the type of splines used.

Now, let’s go back to back to our gam model.

# extract the bases
mm <- predict(g, type = "lpmatrix")

# plot them
matplot(d$x, mm[, -1], type = "l", ylab = "Basis functions")

# add a legend
nn <- ncol(mm[, -1])
legend("top", horiz = TRUE, colnames(mm[, -1]), cex = 0.8, col = seq_len(nn), 
    lty = seq_len(nn), inset = c(0, -0.1), xpd = TRUE, bty = "n")


plot(y ~ x, data = d)
lines(d$x, mm %*% g$coefficients)  # combine the linear predictor


In Practice

mgcv

As taken from the package help pages, mgcv is a Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation and GAMMs by REML/PQL.

The package name mgcv stands for Mixed GAM Computation Vehicle.


There are a lot of sources of information for using GAMs with mgcv. A list of useful links is given at the end of this document.

Contrary to many packages, the help files for the mgcv package are extremely detailed (even if a bit complex) and one should not underestimate the amount of information contained in the help files. They also contain info on special introductory and more advanced topics. A couple of examples.

?mgcv general info on the package

?gam general info on the function

?gam.models how to specify a gam model

?gam.selection model selection with gam models

?mgcv.FAQ frequently asked questions concerning the package


GCV stands for Generalized Cross-Validation. This is one of the procedure that determines the optimal amount of smoothing and it is done automatically by the mgcv algorithms.

Generally speaking, cross-validation is a model validation technique for assessing how well a model will perform on an independent dataset. For example, this can be done by running a model on a random partition of the data (say 90%, the training data) and then predict the rest of the data (10%, the validation data) and calcute the prediction errors. There are several different ways of doing this. Another is to run the model by removing a single observation at a time and then try to predict that observation with the model (leave-one-out cross-validation). Intuitively, GCV is similar, but it is done in a more computationally efficient way.

Thus, one important advantage of using gam and mgcv is that the optimal level of smoothness is automatically determined. Beware that the level of smoothness determined by this method may not be appropriate when the is collinearity, non-independence or for small sample sizes.

Other methods than GCV are now available. Type ?gam.selection for more detailed information on the different methods possible.


The gam function

Standard non-smoothed variables can go in the gam formula. In this case, the ouput returned is similar to the lm output.

g <- gam(y ~ x, data = d)
summary(g)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ x
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03701    0.12106  -0.306     0.76    
## x            0.61577    0.02104  29.270   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.811   Deviance explained = 81.2%
## GCV = 0.66796  Scale est. = 0.66128   n = 200

What does the summary output looks like with a smooth term?

g <- gam(y ~ s(x), data = d)
summary(g)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.08121    0.03391   90.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value    
## s(x) 7.582  8.506 333.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.934   Deviance explained = 93.7%
## GCV = 0.24022  Scale est. = 0.22991   n = 200
  1. The first part deals with parametric coefficients (factors, non-smoothed variables)

  2. The second part deals with smoothed terms and report the edf which stands for estimated degrees of freedom.

Without going into details, it roughly represents the level of smooth complexity required for modeling the response.

  1. The third part reports \(R^2\)-type measures and the generalized cross validation value.

Several methods (functions) are available to deal with gam objects:

  • print
  • summary
  • anova
  • plot
  • predict
  • residuals

Smooths


s()

gam(y~s(x),data=d)

Used for univariate smooths, isotropic smooths of several numeric variables (with variables on the same scale, e.g., XY coordinates), interactions, simple random effects.



te()

gam(y~te(x,z),data=d)

Tensor product smooths of several variables (interactions) e.g.



ti()

gam(y~ti(x)+ti(z)+ti(x,z),data=d)

Tensor product smooths with marginal smooths excluded (and lower order interaction terms)



The list of available smooth terms is given in ?smooth.terms


Set up a model

All families available in GLMs are also available in GAMs and others are also available.

?family.mgcv


CHALLENGE 2

Download mean daily temperature data from a weather station in Québec city. Can you estimate the mean trend for the year 2015? To download the data, you will need to install the rclimateca package. The following code gives you the instructions to get the data.

The ec_climate_search_locations function searches for weather stations near a location.

library(rclimateca)

ec_climate_search_locations("quebec", timeframe = "daily", year = 2015)
## Search results for ec_climate_search_locations(
##   query = "quebec"
##   timeframe = "daily"
##   year = 2015
## ) 
## [1] QUEBEC/JEAN LESAGE INTL A QC 5251 (daily 1943-2017)
## [2] QUEBEC/JEAN LESAGE INTL QC 26892 (daily 1992-2018)

The ec_climate_data function downloads the data for the station for a given time frame. We turn this to a data.frame and add the julian date to the data


x <- "QUEBEC/JEAN LESAGE INTL QC 26892"

d <- ec_climate_data(x, timeframe = "daily", start = "2015-01-01", end = "2015-12-31")

d <- as.data.frame(d)  # transform to a data.frame

d$jul <- as.numeric(format(d$date, "%j"))  # turn the date to a julian date

d$temp <- d$mean_temp_c  # choose a specific temperature (here the mean)

Arguments

Here is a list of some important arguments to the smooth function s or te.

bs: For choosing the basis type (?smooth.terms)

by: For allowing the smooths to vary according to a variable. It can be a numeric variable or a categorical variable. For the latter, the simple term usually also has to be added to the model (e.g. y~treatment+s(x,by=treatment)).

k: The dimension of the basis used to construct the smooth term. See ?choose.k for good advices. Higher values allow more complex relations.


Bases

Cubic regression spline (cr)

  • Use regular knots

Cyclic cubic regression spline (cc)

  • For cyclic (seasonal) response

Thin plate spline (tp)

  • Does not use regular knots

Thin plate spline with shrinkage (ts)

  • Can be used for variable selection

Cubic regression spline with shrinkage (cs)

  • Can be used for variable selection

Plotting

Plotting is essential with GAMs because it is impossible to interpret coefficients to and have an idea of the effect of variables. There are several functions for producing gam plots.

plot

plot(g, residuals = TRUE, pch = 1, shade = TRUE)


vis.gam

This function is for visualizing two variables with 3D or contour plots.


visreg::visreg

visreg is a very useful package for visualizing a lot of models (LM, GLM, randomForest, etc.) including GAMs


library(visreg)
visreg(g, "jul")


itsadug::plot.smooth


library(itsadug)
plot_smooth(g, view = "jul")
## Summary:
##  * jul : numeric predictor; with 30 values ranging from 1.000000 to 365.000000.


predict

If you want to have full control over the plot you are producing, you can use predict to generate your values and then use your favorite plotting tools to construct your plot.


Random Effects

There are several ways to include random effects in GAMs

  1. Use gam and bs="re" for simple random effects.

  2. Use gamm and specify your random effects just like in lme. The function uses lme behing the scene.

  3. Use the package gamm4 which relies on package lme4, but it is less developed.

For more info on random effects with mgcv see ?random.effects and the help page for ?gamm


Temporal Autocorrelation

Since gamm uses lme and the nlme package, all functionalities related to temporal autocorrelation are also available to gamm (corAR1, corARMA, etc.)

See also the package itsadug which specializes on GAMs and time-series.


GAMLSS

GAMLSS (Generalized Additive Models for Location, Scale and Shape ) are an extension to GAMS where all parameters of an assumed distribution for the response variable can be modeled as an additive model of the explanatory variables. This means for example that in addition to the modeling of the mean response, the variance can also be modeled. They also allow for the use of distributions beyond the exponential family.


CHALLENGE 3

  1. Compare the climate between two different towns in Canada in 2015 and illustrate their differences. You might need the by argument.

ec_climate_search_locations(c("quebec", "vancouver"), timeframe = "daily", year = 2015)

x <- c("QUEBEC/JEAN LESAGE INTL QC 26892", "VANCOUVER INTL A BC 51442")

d <- as.data.frame(ec_climate_data(x, timeframe = "daily", start = "2015-01-01", 
    end = "2015-12-31"))

d$jul <- as.numeric(format(d$date, "%j"))
d$temp <- d$mean_temp_c
d$location <- as.factor(d$location)

  1. Estimate the mean daily temperature yearly trend for the 2000-2015 period at the Québec station using a cyclical basis or seasonal basis (e.g. "cc").

  1. Across the region of Québec city, explore the temperature for January 1, 2017 using the weather station coordinates.

# search 200 locations aroun
l <- as.data.frame(ec_climate_geosearch_locations("quebec QC", year = 2017, 
    timeframe = "daily", limit = 200, province == "QUEBEC"))

# restrict locations kept
l <- l[l$latitude < 47.299 & l$longitude > (-76.269), ]

# download data for each location
d <- as.data.frame(ec_climate_data(l$location, timeframe = "daily", start = "2017-01-01", 
    end = "2017-01-01"))

# add latlon the the data
m <- match(d$location, l$location)
d$lon <- l$longitude[m]
d$lat <- l$latitude[m]

# convert to julian date
d$jul <- as.numeric(format(d$date, "%j"))
d$temp <- d$mean_temp_c



Are there more questions you could answer with the weather data using GAMs?