--- title: "Calculating" output: rmarkdown::html_vignette description: > Learn how to use calculate_* functions. vignette: > %\VignetteIndexEntry{Calculating} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{css,echo=FALSE} .infobox { padding: 1em 1em 1em 4em; margin-bottom: 10px; border: 2px solid #43602d; border-radius: 10px; background: #738e41 5px center/3em no-repeat; } .caution { background-image: url("https://tgoodbody.github.io/sgsR/logo.png"); } ``` ```{r,warning=F,message=F,echo=FALSE,results=FALSE} library(sgsR) library(terra) #--- Load mraster and access files ---# r <- system.file("extdata", "mraster.tif", package = "sgsR") #--- load the mraster using the terra package ---# mraster <- terra::rast(r) a <- system.file("extdata", "access.shp", package = "sgsR") #--- load the access vector using the sf package ---# access <- sf::st_read(a, quiet = TRUE) #--- apply quantiles algorithm to metrics raster ---# sraster <- strat_quantiles( mraster = mraster$zq90, # use mraster as input for sampling nStrata = 4 ) # algorithm will produce 4 strata #--- apply stratified sampling algorithm ---# existing <- sample_strat( sraster = sraster, # use mraster as input for sampling nSamp = 200, # request 200 samples be taken mindist = 100 ) # define that samples must be 100 m apart #--- algorithm table ---# a <- c("`calculate_representation()`", "`calculate_distance()`", "`calculate_pcomp()`", "`calculate_sampsize()`", "`calculate_allocation()`", "`calculate_coobs()`", "`calculate_pop()`", "`calculate_lhsOpt()`") d <- c("Determine representation of strata in existing sample unit", "Per pixel distance to closest `access` vector", "Principal components of input `mraster`", "Estimated sample sizes based on relative standard error", "Proportional / optimal / equal / manual allocation", "Detail how `existing` sample units are distributed among `mraster`", "Population level information (PCA / quantile matrix / covariance matrix) of `mraster`", "Determine optimal Latin hypercube sampling parameters including sample size") urls <- c("#rep", "#dist", "#pcomp", "#sampsize", "#allocation", "#coobs", "#lhspop", "#lhsopt") df <- data.frame(Algorithm = a, Description = d) df$Algorithm <- paste0("[", df$Algorithm, "](", urls, ")") ``` Currently, there are 8 functions associated with the `calculate` verb in the `sgsR` package: ```{r echo=FALSE} knitr::kable(df, align = "c") ``` ## `calculate_representation()` {#rep .unnumbered} `calculate_representation()` function allows the users to verify how a stratification is spatially represented in an `existing` sample. Result in tabular and graphical (if `plot = TRUE`) outputs that compare strata coverage frequency and sampling frequency. ```{r,warning=F,message=F} #--- quantile sraster ---# quantiles <- strat_quantiles( mraster = mraster$zq90, nStrata = 8 ) #--- random samples ---# srs <- sample_srs( raster = sraster, nSamp = 50 ) #--- calculate representation ---# calculate_representation( sraster = quantiles, existing = srs, plot = TRUE ) ``` The tabular output presents the frequency of coverage for each strata (`srasterFreq`) (what % of the landscape does the strata cover) and the sampling frequency within each strata (`sampleFreq`) (what % of total `existing` samples are in the strata). 
The difference (`diffFreq`) between coverage frequency and sampling frequency determines whether strata are over-represented (positive values) or under-represented (negative values). This value translates to a discrete `need` attribute, which defines whether samples need to be added or removed to meet the number necessary for the sample to be considered representative of the spatial coverage of strata in the `sraster`.

Performing the algorithm on a sample set derived using `sample_strat()` shows sampling proportional to strata coverage, given that sample units were allocated proportionally to spatial coverage.

```{r, warning=F, message=F}
calculate_representation(
  sraster = sraster,
  existing = existing,
  plot = TRUE
)
```

::: {.infobox .caution data-latex="{caution}"}
**TIP!**

Very small (negligible) differences between `srasterFreq` and `sampleFreq` are common. __In these situations, it is important for the user to determine whether samples should be added or removed.__
:::

## `calculate_distance` {#dist .unnumbered}

`calculate_distance()` takes input `raster` and `access` data and outputs the per-pixel distance to the nearest access point. This algorithm is particularly valuable for constraining sampling protocols, such as `sample_clhs()`, where the output raster layer can be used as the `cost` for the constraint. The output raster consists of the input with the calculated distance layer (`dist2access`) appended.

The `slope` parameter also exists to calculate slope distance instead of geographic distance, which is very handy in steep, mountainous areas and is faster from a computational point of view. If `slope = TRUE`, the `raster` provided should be a digital terrain model.

```{r, warning=F, message=F}
calculate_distance(
  raster = sraster, # input
  access = access, # define access road network
  plot = TRUE
) # plot
```

_This function performs considerably slower when `access` has many features. Consider merging features for improved efficiency._

## `calculate_pcomp` {#pcomp .unnumbered}

`calculate_pcomp()` takes an `mraster` as input and performs principal component analysis. The `nComp` parameter specifies the number of components to rasterize.

```{r, warning=F, message=F}
calculate_pcomp(
  mraster = mraster, # input
  nComp = 3, # number of components to output
  plot = TRUE, # plot
  details = TRUE
) # details about the principal component analysis appended
```

## `calculate_sampsize` {#sampsize .unnumbered}

The `calculate_sampsize()` function allows the user to estimate an appropriate sample size using the relative standard error (`rse`) of input metrics. If the input `mraster` contains multiple layers, sample sizes are determined for all layers. If `plot = TRUE` and `rse` is defined, a sequence of `rse` values is visualized, with indicators marking the values for the matching sample size.

```{r, warning = FALSE}
#--- determine sample size based on a relative standard error (rse) of 1% ---#
calculate_sampsize(
  mraster = mraster,
  rse = 0.01
)
```

```{r, warning = FALSE}
#--- change default threshold sequence values ---#
#--- if increment and rse are not divisible, the closest value will be taken ---#
p <- calculate_sampsize(
  mraster = mraster,
  rse = 0.025,
  start = 0.01,
  end = 0.08,
  increment = 0.01,
  plot = TRUE
)

p
```
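To see how the estimated sample size responds to different precision targets, the function can simply be called over a vector of `rse` values; a minimal, illustrative sketch (the target values are arbitrary):

```{r, warning = FALSE, eval = FALSE}
#--- illustrative: estimate sample sizes for several target rse values ---#
targets <- c(0.01, 0.025, 0.05)

lapply(targets, function(t) {
  calculate_sampsize(
    mraster = mraster$zq90, # single metric layer
    rse = t
  )
})
```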
## `calculate_allocation` {#allocation .unnumbered}

`calculate_allocation()` determines how sample units are allocated based on the desired sample size (`nSamp`) and the input `sraster`. It is used internally by a number of algorithms, including [`sample_strat`](#strat). Currently, there are four allocation methods: proportional (`prop`; default), optimal (`optim`), equal (`equal`), and manual (`manual`).

### Proportional allocation {#proportional .unnumbered}

Sample units are allocated based on the spatial coverage area of the strata. This is the default allocation method.

```{r, warning=F, message=F}
#--- proportional allocation ---#
calculate_allocation(
  sraster = sraster,
  nSamp = 200
)
```

In this case, the `sraster` has equally sized strata, so the number of sample units allocated to each stratum is the same.

```{r, warning=F, message=F}
#--- extract strata values for existing samples to include them ---#
e.sr <- extract_strata(
  sraster = sraster,
  existing = existing
)

calculate_allocation(
  sraster = sraster,
  nSamp = 200,
  existing = e.sr
)
```

At times, values under `total` can be negative. Negative values indicate that `existing` sample units over-represent those strata and that some sample units could be removed to prevent over-representation. `$total` indicates the number of sample units that could be added or removed.

### Optimal allocation {#optimal .unnumbered}

Sample units are allocated based on within-strata variance. This means that, in addition to an `sraster`, a specific metric raster (`mraster`) must be provided to calculate the variation used to optimally allocate sample units.

```{r, warning=F, message=F}
calculate_allocation(
  sraster = sraster, # stratified raster
  nSamp = 200, # desired sample number
  existing = e.sr, # existing samples
  allocation = "optim", # optimal allocation
  mraster = mraster$zq90, # metric raster
  force = TRUE
) # force nSamp number
```

### Equal allocation {#equal .unnumbered}

There may be situations where the user wants an equal number of sample units allocated to each stratum. In these situations, use `allocation = "equal"`. In this case, `nSamp` refers to the number of sample units allocated per stratum rather than the overall total.

```{r}
calculate_allocation(
  sraster = sraster, # stratified raster
  nSamp = 20, # desired sample number per stratum
  allocation = "equal"
) # equal allocation
```

The code above yields a total of 80 sample units (20 `nSamp` for each of the 4 strata in `sraster`).

### Manual allocation {#manual .unnumbered}

The user may wish to manually assign weights to strata. In this case, `allocation = "manual"` can be used, and `weights` must be provided as a numeric vector (e.g. `weights = c(0.2, 0.2, 0.2, 0.4)`, where `sum(weights) == 1`). `nSamp` is then allocated based on `weights`.

```{r}
weights <- c(0.2, 0.2, 0.2, 0.4)

calculate_allocation(
  sraster = sraster, # stratified raster
  nSamp = 20, # desired sample number
  allocation = "manual", # manual allocation
  weights = weights
) # weights adding to 1
```

The code above yields a total of 20 sample units, with units allocated according to the `weights` provided in ascending strata order.
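Because `weights` must sum to 1 and are applied in ascending strata order, a quick guard before allocating can catch mistakes early; a minimal sketch (the `stopifnot()` check is illustrative, not part of the function):

```{r, eval = FALSE}
#--- illustrative guard: weights must sum to 1 ---#
weights <- c(0.1, 0.2, 0.3, 0.4)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

calculate_allocation(
  sraster = sraster,
  nSamp = 100, # 100 * weights = 10, 20, 30, 40 sample units per stratum
  allocation = "manual",
  weights = weights
)
```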
## Sample evaluation algorithms {#sampeval .unnumbered}

The following algorithms were initially developed by Dr. Brendan Malone from the University of Sydney. Dr. Malone and his colleagues supplied an in-depth description of the functionality of these algorithms, which were originally developed to improve soil sampling strategies. These functions were modified and implemented for use in structurally guided sampling approaches. Many thanks to Dr. Malone for being a proponent of open-source algorithms.

Please consult the original reference for these scripts and ideas, as the publication holds helpful and valuable information for understanding the rationale behind the sampling approaches and algorithm development.

_Malone BP, Minasny B, Brungard C. 2019. Some methods to improve the utility of conditioned Latin hypercube sampling. PeerJ 7:e6451 DOI 10.7717/peerj.6451_

### `calculate_coobs` {#coobs .unnumbered}

The `calculate_coobs()` function performs the **CO**unt of **OBS**ervations (coobs) algorithm using an `existing` sample and an `mraster`. This algorithm helps the user understand how an `existing` sample is distributed across the landscape in relation to the `mraster` data.

::: {.infobox .caution data-latex="{caution}"}
**TIP!**

The output coobs raster can be used to constrain cLHS sampling with the `sample_clhs()` function to areas that are under-represented.
:::

The coobs raster determines how many observations are similar in terms of the covariate space at every pixel, and takes advantage of parallel processing routines.

```{r, warning=F, message=F, eval = FALSE}
calculate_coobs(
  mraster = mraster, # input
  existing = existing, # existing samples
  cores = 4, # parallel cores to use
  details = TRUE, # provide details from algorithm output
  plot = TRUE
) # plot
```

## Latin hypercube sampling evaluation algorithms {#lhseval .unnumbered}

The following two algorithms provide the means to maximize the effectiveness of [latin hypercube sampling](#clhs) protocols.

### `calculate_pop` {#lhspop .unnumbered}

`calculate_pop()` calculates population-level statistics of an `mraster`, including principal components, quantile and covariate distributions, and Kullback-Leibler divergence testing. The outputs produced by this function are required to use the `calculate_lhsOpt()` function described in the following section.

::: {.infobox .caution data-latex="{caution}"}
**TIP!**

This algorithm can be used pre-emptively to calculate `matQ` and `matCov`, two values that are used by the `sample_ahels()` function. This will save processing time during sampling.
:::

```{r, warning=F, message=F, eval = FALSE}
#--- by default all statistical data are calculated ---#
calculate_pop(mraster = mraster) # input
```

The output list contains the following:

* `$values` - Pixel values from `mraster`
* `$pcaLoad` - PCA loadings
* `$matQ` - Quantile matrix
* `$matCov` - Covariate matrix

```{r, warning=F, message=F, eval = FALSE}
#--- individual statistical analyses can be disabled by setting their parameters to `FALSE` ---#
mat <- calculate_pop(
  mraster = mraster, # input
  nQuant = 10
) # desired number of quantiles

#--- use the matrix output within the sample ahels algorithm ---#
sample_ahels(
  mraster = mraster, # input
  existing = existing, # existing sample
  nQuant = 10, # number of desired quantiles
  nSamp = 50, # desired sample size
  matCov = mat
) # covariance matrix
```
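The individual matrices can also be pulled from the returned list by name; a minimal sketch based on the output list described above:

```{r, warning=F, message=F, eval = FALSE}
#--- access individual population statistics from the output list ---#
pop <- calculate_pop(mraster = mraster)

pop$pcaLoad # PCA loadings
pop$matQ # quantile matrix
pop$matCov # covariate matrix
```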
### `calculate_lhsOpt` {#lhsopt .unnumbered}

The `calculate_lhsOpt()` function performs a bootstrapped latin hypercube sampling approach in which population-level analysis of `mraster` data is performed to determine the optimal latin hypercube sample size. Using the population statistics from `calculate_pop()` and varying sample sizes defined by `minSamp`, `maxSamp`, `step`, and `rep`, sampling protocols are conducted, and the statistical effectiveness of those sampling outcomes is evaluated to determine where sample size is minimized and statistical representation is maximized.

```{r, warning=F, message=F, eval = FALSE}
#--- calculate population details ---#
poplhs <- calculate_pop(mraster = mraster)

calculate_lhsOpt(popLHS = poplhs)
```

```{r, warning=F, message=F, eval = FALSE}
calculate_lhsOpt(
  popLHS = poplhs,
  PCA = FALSE,
  iter = 200
)
```
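The range of candidate sample sizes can also be tuned; a minimal sketch using the parameters named in the description above (the values shown are arbitrary):

```{r, warning=F, message=F, eval = FALSE}
#--- illustrative: test sample sizes from 10 to 100 in steps of 10, 5 repetitions each ---#
calculate_lhsOpt(
  popLHS = poplhs,
  minSamp = 10, # smallest sample size tested
  maxSamp = 100, # largest sample size tested
  step = 10, # increment between tested sample sizes
  rep = 5 # repetitions per sample size
)
```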