--- title: "Ordination in the tidyverse" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Ordination in the tidyverse} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r knitr options, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.width = 6, fig.height = 5 ) ``` This vignette introduces the goals and functionality of the **ordr** package. Users should have some familiarity with the class of _ordination_ models powered by singular value decomposition, such as principal components analysis, correspondence analysis, and linear discriminant analysis, and with the _biplot_ statistical graphic used to visualize these models. Users should also be familiar with the **tidyverse** R package collection for data science and parts of the **tidymodels** collection for statistical modeling, most notably **tibble**, **dplyr**, **broom**, and **ggplot2**. Briefly, **ordr** incorporates ordination models into a "tidy" workflow. Specifically, for fitted ordination models of a variety of classes, users can * augment the model with row- or column-level diagnostics and annotations, * summarize the model or its components as tidy data frames, and * produce biplots of the model using a layered grammar of graphics. As an example, this vignette performs a correspondence analysis (CA) of the `HairEyeColor` data set installed with R, using the fitting engine `corresp()` provided by the **MASS** package and its base plotting methods, then showcases the more flexible and elegant methods provided by **ordr**. While some techniques specific to CA are not as natural in **ordr**, most can be reproduced through the principled use of general steps. ```{r setup} data(HairEyeColor) library(MASS) library(ordr) ``` ## the hair and eye color data We begin with an inspection of the data using base R. For more information about the data set, call `help(HairEyeColor)`. ```{r} print(HairEyeColor) plot(HairEyeColor) ``` The data were collected by students in one of Ronald Snee's statistics courses.[^snee] They consist of the hair color and eye color, each binned into four groups, of 592 subjects. The data are also stratified by sex, forming a 3-way array. The (default) mosaic plot reveals only subtle differences by sex, so we lose little by flattening the array into a $4 \times 4$ matrix. The resulting count table is suitable for correspondence analysis, and we fit this model next. [^snee]: Snee RD (1974) "Graphical Display of Two-way Contingency Tables". _The American Statistician_ **28**(1), 9-12. ## correspondence analysis using **MASS** The implementation `MASS::corresp()` returns an object of class 'correspondence'. In addition to the information included in its `print()` method, we can use the canonical correlations to calculate the proportion of variance along each dimension: ```{r} haireye <- apply(HairEyeColor, c(1L, 2L), sum) haireye_ca <- corresp(haireye, nf = 3L) print(haireye_ca) # proportion of variance in each dimension haireye_ca$cor^2 / sum(haireye_ca$cor^2) ``` The variation in the table, in terms of the $\chi^2$ distances between the distributions of hair color among people with the same eye color (or, equivalently, vice-versa), lies largely ($89\%$) along a single dimension, with the remaining variation largely ($9.5\%$) along a single orthogonal dimension. The first dimension best distinguishes between subjects with black hair and brown eyes from those with blond hair and blue eyes. Subjects with brown and red hair, or with hazel and green eyes, lie between these extremes. The second dimension distinguishes subjects with black or blond hair, and with brown or blue eyes, from those with brown or red hair, and with hazel or green eyes. Subjects with red hair and green eyes are especially distinguished along this dimension. This disrupts the impression from the first dimension alone that subjects lie along a spectrum from black hair--brown eyes to blond hair--blue eyes, which may accurately include an intermediate phenotype (brown hair--hazel eyes), and reveals a phenotype (red hair--green eyes) that diverges from this spectrum. As an exercise, we can recover the row and column standard coordinates returned by `corresp()` from direct computations, e.g. following [the Wikipedia article on correspondence analysis](https://en.wikipedia.org/wiki/Correspondence_analysis), starting from the data matrix (count table) $X$ with total count $n = 1^\top X 1$: 1. The _correspondence matrix_ (the matrix of relative frequencies) $P = \frac{1}{n} X$ 2. The row and column weights $r = \frac{1}{n} X 1$ and $c = \frac{1}{n} 1^\top X$ 3. The diagonals of inverse weights $D_r = \operatorname{diag}(r)$ and $D_c = \operatorname{diag}(c)$ 4. The matrix of _standardized residuals_ $S = D_r (P - rc) D_c$ 5. The singular value decomposition $S = U \Sigma V^\top$ 6. The row and column standard coordinates $F = D_r U$ and $G = D_c V$ ```{r} # correspondence matrix (matrix of relative frequencies) (haireye_p <- haireye / sum(haireye)) # row and column weights (haireye_r <- rowSums(haireye) / sum(haireye)) (haireye_c <- colSums(haireye) / sum(haireye)) # matrix of standardized residuals (haireye_s <- diag(1 / sqrt(haireye_r)) %*% (haireye_p - haireye_r %*% t(haireye_c)) %*% diag(1 / sqrt(haireye_c))) # singular value decomposition haireye_svd <- svd(haireye_s) # row and column standard coordinates diag(1 / sqrt(haireye_r)) %*% haireye_svd$u[, 1:3] diag(1 / sqrt(haireye_c)) %*% haireye_svd$v[, 1:3] ``` We can generate a biplot display via the `biplot()` method for the 'correspondence' class, also provided by **MASS**: ```{r, fig.height=6} biplot( haireye_ca, type = "symmetric", cex = .8, main = "Correspondence analysis of subjects' hair & eye colors" ) ``` This _symmetric biplot_ evenly distributes the inertia between the rows and columns. Distances between points in the same matrix factor do not approximate their $\chi^2$ distances, but inner products between row and column points approximate their standardized residuals. The row and column profile markers are resized to represent the masses of the groups. ## **ordr** methods for CA models **ordr** provides a new class, 'tbl_ord', that wraps ordination objects like those of class 'prcomp' without directly modifying them. (The original model can be recovered with `un_tbl_ord()`.) ```{r} (haireye_ca_ord <- as_tbl_ord(haireye_ca)) ``` The `print()` method for 'tbl_ord' is based on that of tibbles. It prints two tibbles, like that for the 'tbl_graph' class of [**tidygraph**](https://tidygraph.data-imaginist.com/), one for each matrix factor. The header reminds us of the dimensions of the matrix factors and how the inertia is distributed. In 'correspondence' objects, by default both row and column profiles are in standard coordinates: $D_r F$ and $D_c G$, but these can be reassigned to any pair of proportions $D_r S {D_c}^\top = (D_r U \Sigma^{p}) (D_c V \Sigma^{q})^\top$, even if $p + q \neq 1$. By assigning `"symmetric"` inertia, we distribute half of the inertia to each matrix factor: ```{r} get_conference(haireye_ca_ord) confer_inertia(haireye_ca_ord, c(.25, .75)) confer_inertia(haireye_ca_ord, c(1, 1)) (haireye_ca_ord <- confer_inertia(haireye_ca_ord, "symmetric")) ``` `broom::glance()` returns a single-row tibble summary of a model object. It was designed for analysis pipelines involving multiple models (e.g. model selection), to facilitate summaries of multiple models at once. 'tbl_ord' objects wrap a potentially huge variety of models, for which only a few summary statistics will usually be useful. This method includes the rank of the matrix factorization; the proportion of inertia/variance in the first two dimensions, which characterize the fidelity of a biplot to the complete data; and the original object class. ```{r} glance(haireye_ca_ord) ``` Analogous to `broom::augment()`, this tbl_ord-specific function preserves the 'tbl_ord' class but augments the row and column tibbles with any metadata or diagnostics found in the model object. Vertical bars separate the coordinate matrices from annotations. ```{r} augment_ord(haireye_ca_ord) ``` Additional row- and column-level variables can also be augmented and manipulated using a handful of **dplyr**-like verbs, each specific to the matrix factor being affected (`rows` or `cols`). Each tibble is split between the shared coordinates on the left and any additional annotation columns on the right. The `broom::tidy()` method for tbl_ords returns a tibble with one row per artificial coordinate.[^tidy] In CA, these are variably called dimensions or components. The 'correspondence' object contains a `$cor` vector of canonical correlations, which is included in the result; other coordinate-level attributes vary by model object class. [^tidy]: Note that `ordr::tidy()` takes precedence for 'tbl_ord' objects over the class of the underlying model because the 'tbl_ord' class precedes this class. ```{r tidy} tidy(haireye_ca_ord) ``` The `.inertia` and `.prop_var` fields are calculated from the singular values or eigenvalues contained in the ordination object and always appear when defined. This means that `tidy()` prepares any ordination object derived from such a decomposition for a scree plot with `ggplot2::ggplot()`: ```{r scree plot} ggplot(tidy(haireye_ca_ord), aes(x = name, y = inertia)) + geom_col() + labs(x = "Component", y = "Inertia") + ggtitle("Correspondence analysis of subjects' hair & eye colors", "Decomposition of inertia") ``` While `ggplot2::fortify()` may rarely be called directly, is plays a special role in **ordr** by converting a 'tbl_ord' object to a 'tbl_df' object. To do this, the fortifier row-binds the two matrix factor tibbles and adds an additional `.matrix` column to remember which was which: ```{r fortify} fortify(haireye_ca_ord) ``` This fortifier will also preserve any row and column annotations, so it can be composed with `augment_ord()` or with the row- and column-specific verbs. `NA`s are introduced when an annotation is present for one matrix factor but not the other. (The `.element` column becomes important when a model produces supplementary as well as active elements.) The `.matrix` column also plays a key role in `ggbiplot()`: The row- and column-specific ploy layers, which take the form `geom_rows_*()` or `stat_cols_*()`, for example, use this column to subset the data internally. This enables the layered grammar of graphics of **ggplot2** to apply separately to the two matrix factors and their annotation. Though note that it can also be applied to the entire fortified data frame using conventional plot layers, as below when the `.matrix` column is used to distinguish the colors and shapes of the row and column profile markers: ```{r} haireye_ca_ord %>% augment_ord() %>% fortify() %>% transform(feature = ifelse(.matrix == "rows", "Hair", "Eye")) %>% ggbiplot(aes(color = feature, shape = feature, label = name), clip = "off") + theme_biplot() + geom_origin() + geom_rows_point() + geom_cols_point() + geom_rows_text(vjust = -1, hjust = 0, size = 3) + geom_cols_text(vjust = -1, hjust = 0, size = 3) + scale_color_brewer(type = "qual", palette = "Dark2") + scale_size_area() + ggtitle("Correspondence analysis of subjects' hair & eye colors", "Symmetric biplot") ``` Note a few conveniences: * The position aesthetics are assumed to be the first and second artificial coordinates, unless otherwise specified. Biplots can also be specified by setting the `x` and `y` aesthetics to integers, which are converted to the corresponding artificial coordinates. * By default, the aspect ratio is set to 1. This is essential for biplots, which rely on distances between markers and angles between vectors to convey information. * The partial theme `theme_biplot()` removes several plot elements that are usually not important to biplots, most notably gridlines, while retaining other properties of the current theme. * The layer `geom_origin()` is one of two shortcuts for plotting elements commonly used in biplots, the other being `geom_unit_circle()`. ## session info ```{r} sessioninfo::session_info() ```