--- title: "Sampling" output: rmarkdown::html_vignette description: > Learn how to use sample_* functions. vignette: > %\VignetteIndexEntry{Sampling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{css,echo=FALSE} .infobox { padding: 1em 1em 1em 4em; margin-bottom: 10px; border: 2px solid #43602d; border-radius: 10px; background: #738e41 5px center/3em no-repeat; } .caution { background-image: url("https://tgoodbody.github.io/sgsR/logo.png"); } ``` ```{r,warning=F,message=F,echo=FALSE} library(sgsR) library(terra) library(dplyr) par(mar = c(1, 1, 1, 1)) #--- Load mraster and access files ---# r <- system.file("extdata", "mraster.tif", package = "sgsR") #--- load the mraster using the terra package ---# mraster <- terra::rast(r) a <- system.file("extdata", "access.shp", package = "sgsR") #--- load the access vector using the sf package ---# access <- sf::st_read(a, quiet = TRUE) #--- apply quantiles algorithm to metrics raster ---# sraster <- strat_quantiles( mraster = mraster$zq90, # use mraster as input for sampling nStrata = 4 ) # algorithm will produce 4 strata #--- apply stratified sampling algorithm ---# existing <- sample_strat( sraster = sraster, # use mraster as input for sampling nSamp = 200, # request 200 samples be taken mindist = 100 ) # define that samples must be 100 m apart sf::st_crs(existing) <- terra::crs(sraster) #--- algorithm table ---# a <- c("`sample_srs()`", "`sample_systematic()`", "`sample_strat()`", "`sample_sys_strat()`", "`sample_nc()`", "`sample_clhs()`", "`sample_balanced()`", "`sample_ahels()`", "`sample_existing()`") d <- c("Simple random", "Systematic", "Stratified", "Systematic Stratified", "Nearest centroid", "Conditioned Latin hypercube", "Balanced sampling", "Adapted hypercube evaluation of a legacy sample", "Sub-sampling an `existing` sample") s <- c("", "", "[Queinnec, White, & Coops (2021)](https://www.sciencedirect.com/science/article/pii/S0034425721002303)", "", "[Melville & Stone (2016)](https://doi.org/10.1080/00049158.2016.1218265)", "[Minasny & McBratney (2006)](https://www.sciencedirect.com/science/article/pii/S009830040500292X?via%3Dihub)", "[Grafström, A. Lisic, J (2018)](http://www.antongrafstrom.se/balancedsampling/)", "[Malone, Minasny, & Brungard (2019)](https://peerj.com/articles/6451/)", "") urls <- c("#srs", "#systematic", "#sstrat", "#sysstrat", "#nc", "#clhs", "#balanced", "#ahels", "#samp-existing") df <- data.frame(Algorithm = a, Description = d, Reference = s) df$Algorithm <- paste0("[", df$Algorithm, "](", urls, ")") ``` Currently, there are 9 functions associated with the `sample` verb in the `sgsR` package: ```{r echo=FALSE} knitr::kable(df, align = "c") ``` ## `sample_srs` {#srs .unnumbered} We have demonstrated a simple example of using the `sample_srs()` function in `vignette("sgsR")`. We will demonstrate additional examples below. ::: {.infobox .caution data-latex="{caution}"} **`raster`** The input required for `sample_srs()` is a `raster`. This means that `sraster` and `mraster` are supported for this function. 

:::

```{r,warning=F,message=F}
#--- perform simple random sampling ---#
sample_srs(
  raster = sraster, # input sraster
  nSamp = 200, # number of desired sample units
  plot = TRUE
) # plot
```

```{r,warning=F,message=F}
sample_srs(
  raster = mraster, # input mraster
  nSamp = 200, # number of desired sample units
  access = access, # define access road network
  mindist = 200, # minimum distance sample units must be apart from one another
  buff_inner = 50, # inner buffer - no sample units within this distance from road
  buff_outer = 200, # outer buffer - no sample units further than this distance from road
  plot = TRUE
) # plot
```

## `sample_systematic` {#systematic .unnumbered}

The `sample_systematic()` function applies systematic sampling across an area, with the `cellsize` parameter defining the resolution of the tessellation. The tessellation shape can be modified using the `square` parameter: `TRUE` (default) produces a regular grid, while `FALSE` produces a hexagonal grid. The location of sample units within each tessellation can be adjusted using the `location` parameter, where `"centers"` takes the center, `"corners"` takes all corners, and `"random"` takes a random location within each tessellation. Random start points and translations are applied when the function is called.

```{r,warning=F,message=F}
#--- perform grid sampling ---#
sample_systematic(
  raster = sraster, # input sraster
  cellsize = 1000, # grid distance
  plot = TRUE
) # plot
```

```{r,warning=F,message=F}
#--- perform grid sampling ---#
sample_systematic(
  raster = sraster, # input sraster
  cellsize = 500, # grid distance
  square = FALSE, # hexagonal tessellation
  location = "random", # randomly sample within tessellation
  plot = TRUE
) # plot
```

```{r,warning=F,message=F}
sample_systematic(
  raster = sraster, # input sraster
  cellsize = 500, # grid distance
  access = access, # define access road network
  buff_outer = 200, # outer buffer - no sample units further than this distance from road
  square = FALSE, # hexagonal tessellation
  location = "corners", # take corners instead of centers
  plot = TRUE
)
```

## `sample_strat` {#sstrat .unnumbered}

`sample_strat()` provides two `method`s for performing sampling:

* `"Queinnec"` - Hierarchical sampling using a focal window to isolate contiguous groups of stratum pixels, originally developed by Martin Queinnec.

* `"random"` - Traditional stratified random sampling. This `method` ignores much of the functionality of the algorithm so that users can apply standard stratified random sampling approaches without a focal window being used to locate contiguous stratum cells.

### `method = "Queinnec"` {#queinnec .unnumbered}

_Queinnec, M., White, J. C., & Coops, N. C. (2021). Comparing airborne and spaceborne photon-counting LiDAR canopy structural estimates across different boreal forest types. Remote Sensing of Environment, 262(August 2020), 112510._

This algorithm uses a moving window (`wrow` and `wcol` parameters) to filter the input `sraster` and prioritize sample unit allocation to areas where stratum pixels are spatially grouped, rather than to dispersed individual pixels across the landscape.

Sampling is performed using 2 rules:

* **Rule 1** - Sample within spatially grouped stratum pixels. Moving window defined by `wrow` and `wcol`.

* **Rule 2** - If no additional sample units exist to satisfy the desired sample size (`nSamp`), individual stratum pixels are sampled.

The rule applied to select each sample unit is recorded in the `rule` attribute of the output samples.
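
To see how the two rules contribute to a given sample, the `rule` attribute of the returned `sf` object can be tabulated. The short check below is illustrative only; the object name `s_rules` is ours, not part of the package.

```{r,warning=F,message=F}
#--- generate a stratified sample and tally which rule selected each unit ---#
s_rules <- sample_strat(sraster = sraster, nSamp = 200)

table(s_rules$rule)
```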

We give a few examples below:

```{r,warning=F,message=F}
#--- perform stratified random sampling ---#
sample_strat(
  sraster = sraster, # input sraster
  nSamp = 200 # desired sample size
)
```

In some cases, users might want to include an `existing` sample within the algorithm. In order to adjust the total number of sample units needed per stratum to reflect those already present in `existing`, we can use the intermediate function `extract_strata()`. This function uses the `sraster` and the `existing` sample units and extracts the stratum for each. These sample units can then be included within `sample_strat()`, which adjusts the total sample units required per class based on their representation in `existing`.

```{r,warning=F,message=F}
#--- extract strata values to existing samples ---#
e.sr <- extract_strata(
  sraster = sraster, # input sraster
  existing = existing
) # existing samples to add strata value to
```

::: {.infobox .caution data-latex="{caution}"}

**TIP!**

`sample_strat()` requires the `sraster` input to have an attribute named `strata` and will give an error if it doesn't.

:::

```{r,warning=F,message=F}
sample_strat(
  sraster = sraster, # input sraster
  nSamp = 200, # desired sample size
  access = access, # define access road network
  existing = e.sr, # existing sample with strata values
  mindist = 200, # minimum distance sample units must be apart from one another
  buff_inner = 50, # inner buffer - no sample units within this distance from road
  buff_outer = 200, # outer buffer - no sample units further than this distance from road
  plot = TRUE
) # plot
```

The example above defined the `mindist` parameter, which specifies the minimum Euclidean distance that new sample units must be from one another.

Notice that the sample units have `type` and `rule` attributes which outline whether they are `existing` or `new`, and whether `rule1` or `rule2` was used to select them. If `type` is _existing_ (a user-provided `existing` sample), `rule` will be _existing_ as well, as seen above.

```{r,warning=F,message=F}
sample_strat(
  sraster = sraster, # input
  nSamp = 200, # desired sample size
  access = access, # define access road network
  existing = e.sr, # existing samples with strata values
  include = TRUE, # include existing sample in nSamp total
  buff_outer = 200, # outer buffer - no samples further than this distance from road
  plot = TRUE
) # plot
```

The `include` parameter determines whether `existing` sample units should be included in the total sample size defined by `nSamp`. By default, `include` is set to `FALSE`.

### `method = "random"` {#stratrandom .unnumbered}

Stratified random sampling with equal probability for all cells (using default algorithm values for `mindist` and no use of `access` functionality). In essence, this method performs the `sample_srs()` algorithm for each stratum separately to meet the specified sample size.

```{r,warning=F,message=F}
#--- perform stratified random sampling ---#
sample_strat(
  sraster = sraster, # input sraster
  method = "random", # stratified random sampling
  nSamp = 200, # desired sample size
  plot = TRUE
) # plot
```

## `sample_sys_strat` {#sysstrat .unnumbered}

The `sample_sys_strat()` function implements systematic stratified sampling on an `sraster`. This function uses the same functionality as `sample_systematic()` but takes an `sraster` as input and performs sampling on each stratum iteratively.

```{r}
#--- perform grid sampling on each stratum separately ---#
sample_sys_strat(
  sraster = sraster, # input sraster with 4 strata
  cellsize = 1000, # grid size
  plot = TRUE # plot output
)
```

Just like with `sample_systematic()`, we can specify where we want our samples to fall within our tessellations. We specify `location = "corners"` below. Note that the tessellations are returned in a list when `details = TRUE`, should the user want to save them.

```{r}
sample_sys_strat(
  sraster = sraster, # input sraster with 4 strata
  cellsize = 500, # grid size
  square = FALSE, # hexagon tessellation
  location = "corners", # samples on tessellation corners
  plot = TRUE # plot output
)
```

This sampling approach could be especially useful in combination with `strat_poly()` to ensure consistency of sampling across specific management units.

```{r}
#--- read polygon coverage ---#
poly <- system.file("extdata", "inventory_polygons.shp", package = "sgsR")
fri <- sf::st_read(poly)

#--- stratify polygon coverage ---#

#--- specify polygon attribute to stratify ---#
attribute <- "NUTRIENTS"

#--- specify features within attribute & how they should be grouped ---#
#--- as a single vector ---#
features <- c("poor", "rich", "medium")

#--- get polygon stratification ---#
srasterpoly <- strat_poly(
  poly = fri,
  attribute = attribute,
  features = features,
  raster = sraster
)

#--- systematic stratified sampling for each stratum ---#
sample_sys_strat(
  sraster = srasterpoly, # input sraster from strat_poly() with 3 strata
  cellsize = 500, # grid size
  square = FALSE, # hexagon tessellation
  location = "random", # randomize plot location
  plot = TRUE # plot output
)
```

## `sample_nc` {#nc .unnumbered}

The `sample_nc()` function implements the Nearest Centroid sampling algorithm described in [Melville & Stone (2016)](https://doi.org/10.1080/00049158.2016.1218265). The algorithm uses kmeans clustering where the number of clusters (centroids) is equal to the desired sample size (`nSamp`). Cluster centers are located in covariate space, and the nearest neighbour `mraster` pixel to each center is selected (assuming the default `k` parameter). These nearest neighbours are the output sample units.

```{r}
#--- perform nearest centroid sampling ---#
sample_nc(
  mraster = mraster, # input
  nSamp = 25, # desired sample size
  plot = TRUE
)
```

Altering the `k` parameter leads to a multiplicative increase in output sample units, where total output samples $= nSamp \times k$.

```{r}
#--- perform nearest centroid sampling with k = 2 ---#
samples <- sample_nc(
  mraster = mraster, # input
  k = 2, # number of nearest neighbours to take for each kmeans center
  nSamp = 25, # desired sample size
  plot = TRUE
)

#--- total samples = nSamp * k (25 * 2) = 50 ---#
nrow(samples)
```

Visualizing what the kmeans centers and sample units look like is possible when using `details = TRUE`. The `$kplot` output provides a quick visualization of where the centers are, based on a scatter plot of the first 2 layers in `mraster`. Notice that the centers are well distributed in covariate space and the chosen sample units are the pixels closest to each center (nearest neighbours).

```{r}
#--- perform nearest centroid sampling with details ---#
details <- sample_nc(
  mraster = mraster, # input
  nSamp = 25, # desired sample number
  details = TRUE
)

#--- plot ggplot output ---#
details$kplot
```

## `sample_clhs` {#clhs .unnumbered}

The `sample_clhs()` function implements conditioned Latin hypercube (clhs) sampling methodology from the [`clhs`](https://CRAN.R-project.org/package=clhs/) package.

::: {.infobox .caution data-latex="{caution}"}

**TIP!**

A number of other functions in the `sgsR` package help to provide guidance on clhs sampling, including `calculate_pop()` and `calculate_lhsOpt()`. Check out these functions to better understand how sample numbers could be optimized.

:::

The syntax for this function is similar to others shown above, although parameters like `iter`, which defines the number of iterations within the Metropolis-Hastings process, are important to consider. In these examples we use a low `iter` value for efficiency. The default value for `iter` within the `clhs` package is 10,000.

```{r,eval = FALSE}
sample_clhs(
  mraster = mraster, # input
  nSamp = 200, # desired sample size
  plot = TRUE, # plot
  iter = 100
) # number of iterations
```

```{r,warning=F,message=F,echo=F,results = FALSE}
sample_clhs(
  mraster = mraster, # input
  nSamp = 200, # desired sample size
  plot = TRUE, # plot
  iter = 100
) # number of iterations
```

The `cost` parameter defines the `mraster` covariate that is used to constrain the clhs sampling. An example could be the distance a pixel is from road `access` (e.g. from `calculate_distance()`; see the example below), terrain slope, the output from `calculate_coobs()`, or many others.

```{r,warning=F,message=F}
#--- cost constrained examples ---#

#--- calculate distance to access layer for each pixel in mraster ---#
mr.c <- calculate_distance(
  raster = mraster, # input
  access = access, # define access road network
  plot = TRUE
) # plot
```

```{r,eval=F}
sample_clhs(
  mraster = mr.c, # input
  nSamp = 250, # desired sample size
  iter = 100, # number of iterations
  cost = "dist2access", # cost parameter - name defined in calculate_distance()
  plot = TRUE
) # plot
```

```{r,warning=F,message=F,echo=F,results = FALSE}
sample_clhs(
  mraster = mr.c, # input
  nSamp = 250, # desired sample size
  iter = 100, # number of iterations
  cost = "dist2access", # cost parameter - name defined in calculate_distance()
  plot = TRUE
) # plot
```

## `sample_balanced` {#balanced .unnumbered}

The `sample_balanced()` algorithm performs a balanced sampling methodology from the [`stratifyR / SamplingBigData`](http://www.antongrafstrom.se/balancedsampling/) packages.

```{r,warning=F,message=F}
sample_balanced(
  mraster = mraster, # input
  nSamp = 200, # desired sample size
  plot = TRUE
) # plot
```

```{r,warning=F,message=F}
sample_balanced(
  mraster = mraster, # input
  nSamp = 100, # desired sample size
  algorithm = "lcube", # algorithm type
  access = access, # define access road network
  buff_inner = 50, # inner buffer - no sample units within this distance from road
  buff_outer = 200 # outer buffer - no sample units further than this distance from road
)
```

## `sample_ahels` {#ahels .unnumbered}

The `sample_ahels()` function performs the adapted Hypercube Evaluation of a Legacy Sample (ahels) algorithm using `existing` sample data and an `mraster`. New sample units are allocated based on quantile ratios between the `existing` sample and the `mraster` covariate dataset. This algorithm was adapted from the one presented in the paper below, which we highly recommend.

*_Malone BP, Minasny B, Brungard C. 2019. Some methods to improve the utility of conditioned Latin hypercube sampling. PeerJ 7:e6451 DOI 10.7717/peerj.6451_*

This algorithm:

1. Determines the quantile distributions of `existing` sample units and `mraster` covariates.

2. Determines quantiles where there is a disparity between sample units and covariates.

3. Prioritizes sampling within those quantiles to improve representation.
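
To make the ratio idea in steps 1 and 2 concrete, the sketch below compares quantile coverage of one covariate (`zq90`) between the population and the `existing` sample created earlier. This is only a conceptual illustration, not the internal `sample_ahels()` implementation, and the object names (`breaks`, `pop_prop`, `samp_prop`) are ours:

```{r,warning=F,message=F}
#--- conceptual illustration of the quantile-ratio idea (not the internal sgsR code) ---#
nQuant <- 10

#--- population values of one covariate ---#
vals <- as.numeric(terra::values(mraster$zq90))
vals <- vals[!is.na(vals)]

#--- quantile breaks across the population ---#
breaks <- quantile(vals, probs = seq(0, 1, length.out = nQuant + 1))

#--- proportion of population pixels and existing sample units in each quantile ---#
pop_prop <- prop.table(table(cut(vals, breaks, include.lowest = TRUE)))
samp_vals <- extract_metrics(mraster = mraster, existing = existing)$zq90
samp_prop <- prop.table(table(cut(samp_vals, breaks, include.lowest = TRUE)))

#--- ratios near 1 indicate the sample represents that quantile well ---#
round(samp_prop / pop_prop, 2)
```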

To use this function, the user must first specify the number of quantiles (`nQuant`) followed by either the `nSamp` (total number of desired sample units to be added) or the `threshold` (sampling ratio vs. covariate coverage ratio for quantiles - default is 0.9) parameter.

```{r,eval = FALSE}
#--- remove `type` variable from existing - causes plotting issues ---#
existing <- existing %>% select(-type)

sample_ahels(
  mraster = mraster,
  existing = existing, # existing sample
  plot = TRUE
) # plot
```

```{r,warning=F,message=F,echo=FALSE, results = FALSE}
s <- sample_ahels(
  mraster = mraster,
  existing = existing
) # existing samples
```

```{r,echo=FALSE}
s
```

::: {.infobox .caution data-latex="{caution}"}

**TIP!**

Notice that no `threshold`, `nSamp`, or `nQuant` were defined. That is because the defaults are `threshold = 0.9` and `nQuant = 10`.

:::

The first matrix output shows the quantile ratios between the sample and the covariates. A value of 1.0 indicates that the sample is representative of quantile coverage. Values > 1.0 indicate over-representation of sample units, while values < 1.0 indicate under-representation.

```{r,eval = FALSE}
sample_ahels(
  mraster = mraster,
  existing = existing, # existing sample
  nQuant = 20, # define 20 quantiles
  nSamp = 300 # desired sample size
)
```

```{r,warning=F,message=F,echo=FALSE, results = FALSE}
s <- sample_ahels(
  mraster = mraster,
  existing = existing, # existing sample
  nQuant = 20, # define 20 quantiles
  nSamp = 300 # desired sample size
)
```

```{r,echo=FALSE}
s
```

Notice that the total number of samples is 500. This value is the sum of the `existing` sample units (200) and the number of new sample units defined by `nSamp = 300`.

## `sample_existing` {#samp-existing .unnumbered}

It is important to acknowledge that `existing` sample networks are common. There is significant investment in these samples, and in order to keep inventories up-to-date, we often need to collect new data for sample units. The `sample_existing()` algorithm provides the user with methods for sub-sampling an `existing` sample network should the financial / logistical resources not be available to collect data at all sample units. The function allows users to choose between algorithm types (`type = "clhs"` - default, `type = "balanced"`, `type = "srs"`, `type = "strat"`). Differences in `type` result in calling internal `sample_existing_*()` functions (`sample_existing_clhs()` (default), `sample_existing_balanced()`, `sample_existing_srs()`, `sample_existing_strat()`). These functions are not exported to be used stand-alone; however, they employ the same functionality as their `sample_clhs()` etc. counterparts.

While using `sample_existing()`, should the user wish to specify algorithm-specific parameters (e.g. `algorithm = "lcube"` in `sample_balanced()` or `allocation = "equal"` in `sample_strat()`), they can specify them within `sample_existing()` as if calling the function directly.

We give applied examples for all methods below that are based on the following scenario:

* We have a systematic sample where sample units are 200 m apart.

* We know we only have resources to sample 300 of them.

* We have some ALS data available (`mraster`), which we can use to improve knowledge of the metric populations.

See our `existing` sample for the scenario below.

```{r,warning=F,message=F}
#--- generate existing samples and extract metrics ---#
existing <- sample_systematic(raster = mraster, cellsize = 200, plot = TRUE)

#--- extract metrics to the existing sample ---#
e <- existing %>% extract_metrics(mraster = mraster, existing = .)
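
#--- quick check (illustrative): how many sample units are available to sub-sample ---#
nrow(e)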

```

### `sample_existing(type = "clhs")`

The `type = "clhs"` approach is unique in that it has two fundamental approaches:

1. Sampling exclusively using `existing` and the attributes it contains.

```{r,warning=F,message=F}
#--- sub sample using the attributes of existing ---#
sample_existing(existing = e, nSamp = 300, type = "clhs")
```

2. Sub-sampling using `raster` distributions.

Our systematic sample of ~900 plots is fairly comprehensive; however, we can better characterize the true population distribution by including the ALS metrics in the sampling process. The metrics will be included in internal Latin hypercube sampling to help guide sub-sampling of `existing`.

```{r,warning=F,message=F}
#--- sub sample using raster distributions ---#
sample_existing(
  existing = existing, # our existing sample
  nSamp = 300, # desired sample size
  raster = mraster, # include mraster metrics to guide sampling of existing
  plot = TRUE
) # plot
```

The sample distribution again mimics the population distribution quite well! Now let's try using a cost variable to constrain the sub-sample.

```{r,warning=F,message=F}
#--- create distance from roads metric ---#
dist <- calculate_distance(raster = mraster, access = access)
```

```{r,warning=F,message=F}
#--- sub sample using dist as a cost constraint ---#
sample_existing(
  existing = existing, # our existing sample
  nSamp = 300, # desired sample size
  raster = dist, # include mraster metrics to guide sampling of existing
  cost = 4, # either provide the index (band number) or the name of the cost layer
  plot = TRUE
) # plot
```

Finally, should the user wish to further constrain the sample based on `access`, as with other sampling approaches in `sgsR`, that is also possible.

```{r,warning=F,message=F}
#--- ensure access and existing are in the same CRS ---#
sf::st_crs(existing) <- sf::st_crs(access)

#--- sub sample using an access constraint ---#
sample_existing(
  existing = existing, # our existing sample
  nSamp = 300, # desired sample size
  raster = dist, # include mraster metrics to guide sampling of existing
  cost = 4, # either provide the index (band number) or the name of the cost layer
  access = access, # roads layer
  buff_inner = 50, # inner buffer - no sample units within this distance from road
  buff_outer = 300, # outer buffer - no sample units further than this distance from road
  plot = TRUE
) # plot
```

::: {.infobox .caution data-latex="{caution}"}

**TIP!**

The more constraints we add to sampling, the less likely we are to have strong correlations between the population and the sample, so it's always important to understand these limitations and plan accordingly.

:::

### `sample_existing(type = "balanced")`

When `type = "balanced"`, users can define all parameters that are found within `sample_balanced()`. This means that one can change the `algorithm`, `p`, etc.

```{r,warning=F,message=F}
sample_existing(existing = e, nSamp = 300, type = "balanced")
```

```{r,warning=F,message=F}
sample_existing(existing = e, nSamp = 300, type = "balanced", algorithm = "lcube")
```

### `sample_existing(type = "srs")`

The simplest, `type = "srs"`, randomly selects sample units.

```{r,warning=F,message=F}
sample_existing(existing = e, nSamp = 300, type = "srs")
```

### `sample_existing(type = "strat")`

When `type = "strat"`, `existing` must have an attribute named `strata` (just like how `sample_strat()` requires a `strata` layer). If it doesn't exist you will get an error. Let's define an `sraster` so that we are compliant.

```{r,warning=F,message=F}
sraster <- strat_kmeans(mraster = mraster, nStrata = 4)

e_strata <- extract_strata(sraster = sraster, existing = e)
```

When we do have a `strata` attribute, the function works very much the same as `sample_strat()` in that it allows the user to define the `allocation` method (`"prop"` - default, `"optim"`, `"manual"`, `"equal"`).

```{r,warning=F,message=F}
#--- proportional stratified sampling of existing ---#
sample_existing(existing = e_strata, nSamp = 300, type = "strat", allocation = "prop")
```

::: {.infobox .caution data-latex="{caution}"}

**TIP!**

Remember that when `allocation = "equal"`, the `nSamp` value will be allocated for each stratum.

:::

We get 400 sample units in our output below because we have 4 strata and `nSamp = 100`.

```{r,warning=F,message=F}
#--- equal stratified sampling of existing ---#
sample_existing(existing = e_strata, nSamp = 100, type = "strat", allocation = "equal")
```

```{r,warning=F,message=F}
#--- manual stratified sampling of existing with user defined weights ---#
s <- sample_existing(existing = e_strata, nSamp = 100, type = "strat", allocation = "manual", weights = c(0.2, 0.6, 0.1, 0.1))
```

We can check the proportion of samples from each stratum with:

```{r,warning=F,message=F}
#--- check proportions match weights ---#
table(s$strata) / 100
```

Finally, `allocation = "optim"` allows the user to define a `raster` metric to be used to optimize within-strata variances.

```{r,warning=F,message=F}
#--- optimal allocation stratified sampling of existing using the zq90 metric ---#
sample_existing(existing = e_strata, nSamp = 100, type = "strat", allocation = "optim", raster = mraster, metric = "zq90")
```

In each case the output is a sub-sample of `existing` of the requested size. When `raster`-guided sub-sampling is used with `plot = TRUE`, the plotted output shows cumulative frequency distributions of the population (all `existing` sample units) and the sub-sample.
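
As a quick, illustrative check (the object name `s_optim` is ours), the returned object can be kept and inspected to confirm the sub-sample size and how units were allocated across strata:

```{r,warning=F,message=F}
#--- keep the optimal allocation output so it can be inspected ---#
s_optim <- sample_existing(
  existing = e_strata, nSamp = 100, type = "strat",
  allocation = "optim", raster = mraster, metric = "zq90"
)

#--- total sub-sample size and allocation across strata ---#
nrow(s_optim)

table(s_optim$strata)
```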