--- title: "Introduction to conditioned Latin hypercube sampling with the clhs package" author: "Pierre Roudier" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > # %\VignetteIndexEntry{intro-clhs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r make_things_reproducible, echo=FALSE, eval=TRUE} suppressWarnings(RNGversion("3.5.0")) set.seed(42) ``` # A simple example ```{r load_diamonds} data(diamonds, package = 'ggplot2') diamonds <- data.frame(diamonds) head(diamonds) nrow(diamonds) ``` In this example we sample the `diamonds` data set and pick a subset of 100 individuals using the cLHS method. To reduce the length of the optimisation step to 1000 iterations to save computing time. This is controlled through the `iter` option. The progress bar is disabled because it doesn't renders well in the vignette. By default, the index of the selected individuals in the original object are returned. ```{r simple_clhs, echo=TRUE, eval=TRUE} library(clhs) res <- clhs(diamonds, size = 100, use.cpp = TRUE) str(res) ``` # Tweaking the parameters (work in progress) # Including existing samples in the sampling routine This functionality is controlled by the `include` option, which can be used to specify the row indices of samples that needs to be included in the final sampled set. ```{r existing_samples, echo=TRUE, eval=TRUE} suppressWarnings(RNGversion("3.5.0")) set.seed(1) candidates_samples <- data.frame( x = runif(500), y = rnorm(500, mean = 0.5, sd = 0.5) ) existing_samples <- data.frame( x = runif(5), y = runif(5) ) res <- clhs( x = rbind(existing_samples, candidates_samples), size = 10, include = 1:5 ) ``` In this case we have 5 individuals (red triangles) that need to be retained in the selected set of samples: ```{r plot_mandatory_1, echo=FALSE, fig=TRUE, fig.height=6, fig.width=6} suppressPackageStartupMessages(library(ggplot2)) res_df <- rbind(existing_samples, candidates_samples) res_df$mandatory_sample <- c(rep(TRUE, length.out = 5), rep(FALSE, length.out = 500)) res_df$selected_sample <- FALSE res_df$selected_sample[res] <- TRUE p0 <- ggplot(data = res_df, mapping = aes(x = x, y = y)) + geom_point(aes(colour = mandatory_sample, size = mandatory_sample, shape = mandatory_sample)) + scale_colour_manual(values = c("grey70", "red")) + scale_size_manual(values = c(2, 4)) + theme_bw() print(p0) ``` The red individuals are the selected samples. Note the triangles showing the samples that were compulsory: ```{r plot_mandatory_2, echo=FALSE, fig=TRUE, fig.height=6, fig.width=6} p1 <- ggplot(data = res_df, mapping = aes(x = x, y = y)) + geom_point(aes(colour = selected_sample, size = selected_sample, shape = mandatory_sample)) + scale_colour_manual(values = c("grey70", "red")) + scale_size_manual(values = c(2, 4)) + theme_bw() print(p1) ``` # Cost-constrained implementation (work in progress) ```{r cost_clhs, echo=TRUE, eval=TRUE} diamonds$cost <- runif(nrow(diamonds)) res_cost <- clhs(diamonds, size = 100, progress = FALSE, iter = 1000, cost = 'cost') ``` # DLHS (work in progress) # Plotting the results If you want to report on the cLHS results, e.g. plot the evolution of the objective function, or compare the distribution of attributes in the initial object and in the sampled subset, you need to switch the `simple` option to `FALSE`. Instead f simply returning a numeric vector giving the index of the sampled individuals in the original object, a specific, more complex will be returned. This object can be handled by a specific `plot` method: ```{r plot_clhs_1, echo=TRUE, fig=TRUE, fig.height=8, fig.width=8} res <- clhs(diamonds, size = 100,cost = "cost", simple = FALSE, progress = FALSE, iter = 2000) plot(res,c("obj","cost")) ``` The default plotting method plots the evolution of the objective function with the number of iterations. However, you can get more details using the `modes` option, which controls which indicators are plotted. Three `modes` can be simultaneously plotted: - `obj`: evolution of the objective function (default) - `cost`: evolution of the cost function (if present) - `dens` OR `box` OR `hist`: comparison of the distributions of each attribute in the original object and in the proposed sample, respectively using probability density functions, boxplots or histograms. Note that categorical attributes are always reported using dotplots. These modes should be given as a vector of characters. ```{r plot_clhs_3, echo=TRUE, fig=TRUE, fig.height=8, fig.width=12} res_cost <- clhs(diamonds, size = 100, progress = FALSE, iter = 1000, cost = 'cost', simple = FALSE, use.cpp = F) plot(res_cost, c('obj', 'cost')) ``` ```{r plot_clhs_4, echo=TRUE, fig=TRUE, fig.height=8, fig.width=12, warning=FALSE} plot(res_cost, c('obj', 'cost', 'box')) ```