--- title: "Spectral modelling and predictions using the `caret` package" author: "Pierre Roudier" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Spectral modelling and prediction using the caret package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = 'center', fig.height = 5, fig.width = 5 ) ``` The `spectacles` package focuses on making the handling of spectral data (along with its associated attribute data) easy: *by design*, the tasks of tuning and fitting prediction models (either for regression or classification) are out-of-scope. Rather than re-implementing those routines, `spectacles` delegates these tasks to the numerous R packages that facilitates this. In particular, the package works very well with [the `caret` package](https://topepo.github.io/caret/). ```{r libs, message=FALSE} library(dplyr) library(spectacles) library(caret) ``` Here we demonstrate a simple example of tuning and fitting a prediction model for the soil organic carbon content field of the `australia` dataset that is shipped with the `spectacles` package. This vignette assumes some basic understanding of the `caret` package. The reader is in particular referred to [the short introduction to `caret` vignette](https://cran.r-project.org/package=caret/vignettes/caret.html). # Loading the example dataset The `australia` dataset, shipped with the `spectacles` package is a collection of 100 visible near-infrared spectra collected on air-dried soils. More information about this dataset is available on its manual page (`?australia`). The dataset can be loaded quickly using the `load_oz` function: ```{r data} # This loads the "australia" example dataset oz <- load_oz() ``` This creates the object `oz`, of class `SpectraDataFrame`. A dedicated vignette will be available to explain the creation of `SpectraDataFrame` objects from scratch. # Pre-processing Pre-processing will be the focus of a dedicated vignette. Here we keep things simple, and limit pre-processingto remove the splice steps that affect the spectra. This is done using the `splice` function: ```{r splice} oz <- splice(oz) ``` # Data split First, the `australia` dataset is split into a calibration and a validation set. Here we keep things simple, and use a 75%--25% random split: ```{r split} set.seed(1) # To make the split reproducible idx <- sample(1:nrow(oz), size = 75) oz_calib <- oz[idx, ] oz_valid <- oz[-idx, ] ``` Note that `SpectraDataFrame` objects can be subsetted simply by using the `[` operator. # Parametrisation of a PLS model The main `spectacles` function used to interface with `caret` is the `spectra` function, which extracts the spectral matrix that is associated with analytical data. This matrix represents the predictors used to predict a given outcome ("`x`"), while the outcome of the model ("`y`") is an analytical attribute, and can be extracted from the `SpectraDataFrame` object using the `$` operator: ```{r extracts} # The `spectra` function extracts the spectral matrix... spec_mat <- spectra(oz) big.head(spec_mat) # ... while analytical data can be accessed using `$` oz$carbon ``` Therefore, the `train` function can simply be used by populating its `x` and `y` arguments: ```{r fit_1} fit1 <- train( # The `spectra` function extract the spectra matrix x = spectra(oz_calib), # analytical data can be extracted using `$` y = oz_calib$carbon, # Here we choose the PLS regression method method = "pls", # The train function will try 3 possible parameters for the PLS tuneLength = 3 ) ``` # Using spectroscopy-specific performance metrics ## As part of the model tuning The `spectacles` even provide a summary functions akin to those in `caret`, but that work better for spectroscopy. The `spectroSummary` function works like the `defaultSummary` function in `caret`, but adds indicators that are popular in spectroscopy, such as RPD, RPIQ, or CCC: ```{r fit_2} fit2 <- train( x = spectra(oz_calib), y = oz_calib$carbon, method = "pls", tuneLength = 10, trControl = trainControl( # Here we can specify the summary function used during parametrisation summaryFunction = spectroSummary ), # Here we can specifiy which metric to use to optimise the model parameters metric = "RPIQ" ) ``` The parametrisation of the resulting model can be plotted and inspected using the usual `caret` tools: ```{r summarySpectro, } plot(fit2) print(fit2) ``` Different models can also be compared: ```{r 2mods} preds <- extractPrediction( # Here we specify the `caret` models we want to compare models = list( pls1 = fit1, pls2 = fit2 ), testX = spectra(oz_valid), testY = oz_valid$carbon ) # necessary so 2 PLS model can be compared in `plotObsVsPred` preds$model <- preds$object plotObsVsPred(preds) ``` The model `fit2` outperforms the model `fit1`: hardly a surprise as we limited the latter to 3 latent variables, which is clearly too few in this instance. ## During the assessment of the performance of the model Finally, those specific performance indicators can also be used to asses the final results. PLS predictions can be generated using the `predict` function from the `caret` package, and its result passed to `postResampleSpectro`: ```{r postResampleSpectro1} # Simple example for the entire dataset postResampleSpectro( pred = predict(fit2, spectra(oz)), obs = oz_valid$carbon ) ``` Again, this function mimics its `caret` equivalent, `postResample`. A more useful thing to do, from a modelling standpoint, is to compare those performance results on the calibration, validation, and bootstrapped sets (especially the two latter ones): ```{r postResampleSpectro2} # Run model predictions and extract performance statistics for # caliration and validation res_calibration <- postResampleSpectro(pred = predict(fit2, spectra(oz_calib)), obs = oz_calib$carbon) res_validation <- postResampleSpectro(pred = predict(fit2, spectra(oz_valid)), obs = oz_valid$carbon) # Bootstrapped results can be extracted from the `train` object: res_boot <- fit2$results %>% filter(ncomp == fit2$bestTune$ncomp) %>% select(names(res_calibration)) # Assemble the calibration, validation, and # bootstrapped results in a single data.frame res <- rbind( data.frame(type = "Calibration", t(res_calibration)), data.frame(type = "Validation", t(res_validation)), data.frame(type = "Bootstrap", res_boot) ) ``` Which gives the following results: ```{r res_table, results='asis', echo=FALSE} knitr::kable(res, digits = 2) ```