## Overview

In the Introduction vignette users learned how to perform the most fundamental computations and visualizations for phylotranscriptomics analyses using the myTAI package. Especially in the Introduction vignette we demonstrated how to perform Phylostratigraphy and Divergence Stratigraphy as well as the construction of PhyloExpressionSets and DivergenceExpressionSets using the MatchMap() function.

In the Intermediate vignette users learned about more detailed analyses and more specialized techniques to investigate the observed phylotranscriptomics patterns (TAI, TDI, RE, etc.).

## Investigating Age or Divergence Category Specific Expression Level Distributions

Gene expression levels are a fundamental aspect of phylotranscriptomics studies. In detail, phylotranscriptomic measures aim to quantify the expression intensity of genes deriving from common age or divergence categories to detect stages of evolutionary constraints. Hence, the gene expression distribution of age or divergence categories as well as their differences within and between stages or categories allow us to investigate the age (PS) or divergence (DS) category specific contribution to the corresponding transcriptome.

For this purpose, the PlotCategoryExpr() aims to visualize the expression level distribution of each phylostratum during each time point or experiment as barplot, dot plot, or violin plot enabling users to quantify the age (PS) or divergence (DS) category specific contribution to the corresponding transcriptome.

This way of visualizing the gene expression distribution of each age (PS) or divergence (DS) category during all developmental stages or experiments allows users to detect specific age or divergence categories contributing significant levels of gene expression to the underlying biological process (transcriptome).

library(myTAI)
data(PhyloExpressionSetExample)
# category-centered visualization of PS specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
legendName    = "PS",
test.stat     = TRUE,
type          = "category-centered",
distr.type    = "boxplot",
log.expr      = TRUE)
                  Zygote Quadrant Globular Heart Torpedo Bent  Mature
category-centered "***"  "***"    "***"    "***" "***"   "***" "***"

The resulting boxplot illustrates the log expression levels of each phylostratum during each developmental stage. Additionally, a Kruskal-Wallis Rank Sum Test as well as a Benjamini & Hochberg p-value adjustment for multiple comparisons is performed (test.stat = TRUE) to statistically quantify the differences between expression levels of different age or divergence categories. This type of analysis allows users to detect stages or experiments that show high diviation between age or divergence category contributions to the overall transcriptome or no significant deviations of age or divergence categories, suggesting equal age or divergence category contributions to the overall transcriptome. The corresponding P-values are printed to the console using the following notation:

• ’*’ = P-Value $$\leq$$ 0.05

• ’**’ = P-Value $$\leq$$ 0.005

• ’***’ = P-Value $$\leq$$ 0.0005

• ‘n.s.’ = not significant = P-Value > 0.05

In this case all developmental stages show significant differences in phylostratum specific gene expression.

Please notice that users need to define the legendName argument as PS or DS to specify whether the input ExpressionSet is a PhyloExpressionSet (legendName = 'PS') or DivergenceExpressionSet (legendName = 'DS').

Alternatively, users can investigate the differences of gene expression between all stages or experiments for each age or divergence category by specifying type = 'stage-centered'.

library(myTAI)
data(PhyloExpressionSetExample)
# stage-centered visualization of PS specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
legendName    = "PS",
test.stat     = TRUE,
type          = "stage-centered",
distr.type    = "boxplot",
log.expr      = TRUE)
               PS1   PS2   PS3    PS4 PS5    PS6 PS7    PS8    PS9    PS10  PS11   PS12
stage-centered "***" "***" "n.s." "*" "n.s." "*" "n.s." "n.s." "n.s." "***" "n.s." "***"

Here, the Kruskal-Wallis Rank Sum Test (with Benjamini & Hochberg p-value adjustment) quantifies whether or not the gene expression distribution of a single age or divergence category significantly changes throughout development or experiments. This type of analysis allows users to detect specific age or divergence categories that significantly change their expression levels throughout development or experiments.

In this case, users will observe that PS3,5,7-9,11 do not show significant differences of gene expression between developmental stages suggesting that their contribution to the overall transcriptome remains constant throughout development.

Finally, users can choose the following plot types to visualize expression distributions:

Argument: distr.type

• distr.type = "boxplot" This specification allows users to visualize the expression distribution of all PS or DS as boxplot.

• distr.type = "violin" This specification allows users to visualize the expression distribution of all PS or DS as violin plot.

• distr.type = "dotplot" This specification allows users to visualize the expression distribution of all PS or DS as dot plot.

Together, studies perfomed with PlotCategoryExpr() allow users to conclude that genes originating in specific PS or DS contribute significantly more to the overall transcriptome than other genes originating from different PS or DS categories. More specialized analyses such as PlotMeans(), PlotRE(), PlotBarRE(), TAI(), TDI(), etc. will then allow them to study the exact mean expression patterns of these age or divergence categories.

Users will notice that so far all examples shown above specified log.expr = TRUE illustrating boxplots based on log2 expression levels. This way of visualization allows better visual comparisons between age or divergence categories. However, when specifying log.expr = FALSE absolute expression levels will be visualized in the corresponding boxplot.

Alternatively, instead of specifying log.expr = TRUE users can directly pass log2 transformed expression levels to PlotCategoryExpr() via tf(PhyloExpressionSetExample,log2) (when log.expr = FALSE):

data(PhyloExpressionSetExample)
# category-centered visualization of PS specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = tf(PhyloExpressionSetExample, log2),
legendName    = "PS",
test.stat     = TRUE,
type          = "category-centered",
distr.type    = "boxplot",
log.expr      = FALSE)
                  Zygote Quadrant Globular Heart Torpedo Bent  Mature
category-centered "***"  "***"    "***"    "***" "***"   "***" "***" 

Or any other expression level transformation, e.g. sqrt.

data(PhyloExpressionSetExample)
# category-centered visualization of PS specific expression level distributions (sqrt-scale)
PlotCategoryExpr(ExpressionSet = tf(PhyloExpressionSetExample, sqrt),
legendName    = "PS",
test.stat     = TRUE,
type          = "category-centered",
distr.type    = "boxplot",
log.expr      = FALSE)
                  Zygote Quadrant Globular Heart Torpedo Bent  Mature
category-centered "***"  "***"    "***"    "***" "***"   "***" "***" 

## Gene Subset Age or Divergence Category Specific Expression Level Distributions

In some cases, users wish to visualize the gene expression distributions for a subset of genes in contrast to the entire transcriptome. For this purpose, the gene.set argument allows users to specify the gene ids of a subset of genes that shall be matched in the input ExpressionSet and for which expression level distributions shall be visualized.

library(myTAI)
data(PhyloExpressionSetExample)
# define an example gene subset (500 genes) which
# can be found in the input ExpressionSet
set.seed(234)
example.gene.set <- PhyloExpressionSetExample[sample(1:25260,500) , 2]
# visualize the gene expression distributions for these 500 genes (category-centered)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
legendName    = "PS",
test.stat     = TRUE,
type          = "category-centered",
distr.type    = "boxplot",
log.expr      = TRUE,
gene.set      = example.gene.set)
                  Zygote Quadrant Globular Heart Torpedo Bent Mature
category-centered "*"    "*"      "*"      "*"   "*"     "*"  "n.s."

Or analogously stage-centered:

library(myTAI)
data(PhyloExpressionSetExample)
# define an example gene subset (500 genes) which
# can be found in the input ExpressionSet
set.seed(234)
example.gene.set <- PhyloExpressionSetExample[sample(1:25260,500) , 2]
# visualize the gene expression distributions for these 500 genes (stage-centered)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
legendName    = "PS",
test.stat     = TRUE,
type          = "stage-centered",
distr.type    = "boxplot",
log.expr      = TRUE,
gene.set      = example.gene.set)
               PS1    PS2    PS3    PS4    PS5    PS6    PS7    PS8    PS9    PS10   PS11
stage-centered "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s."
PS12
stage-centered "n.s."

For example, users interested in the enrichment of PS or DS values in D. rerio brain genes (see Enrichment Vignette for details) could also visualize their gene expression distributions throughout development with PlotCategoryExpr() in cases where expression data is available.

## Computing the significant differences between gene expression distributions of PS or DS groups

As proposed by Quint et al., 2012 in some cases users whish to compare the difference of group specific expression levels using a statsitical test.

For this purpose, the PlotGroupDiffs() function performs a test to quantify the statistical significance between the global expression level distributions of groups of PS or DS. It therefore allows users to investigate significant groups of PS or DS that significantly differ in their gene expression level distibution within specific developmental stages or experiments.

Analogous to the PlotRE() or PlotMeans() function (see Introduction for details), users need to pass the Groups to PlotGroupDiffs() specifying the groups that shall be compared.

library(myTAI)
data(PhyloExpressionSetExample)

PlotGroupDiffs(ExpressionSet = PhyloExpressionSetExample,
Groups        = list(group_1 = 1:3,group_2 = 4:12),
legendName    = "PS",
plot.type     = "p-vals",
type          = "b",
lwd           = 6,
xlab          = "Ontogeny")

In cases where no plot shall be drawn and only the resulting p-value shall be returned users can specify the plot.type = NULL argument to receive only p-values returned by the underlying test statistic.

library(myTAI)
data(PhyloExpressionSetExample)
# only receive the p-values without the corresponding plot
PlotGroupDiffs(ExpressionSet = PhyloExpressionSetExample,
Groups        = list(group_1 = 1:3,group_2 = 4:12),
legendName    = "PS",
plot.type     = NULL)

Optionally, users can also visualize the difference in expression level distributions of groups of PS/DS during each developmental stage by specifying the plot.type = "boxplot" argument.

library(myTAI)
data(PhyloExpressionSetExample)
# visualize difference as boxplot
PlotGroupDiffs(ExpressionSet = tf(PhyloExpressionSetExample,log2),
Groups        = list(group_1 = 1:3,group_2 = 4:12),
legendName    = "PS",
plot.type     = "boxplot")

Here, we use log2 transformed expression levels for better visualization (tf(PhyloExpressionSetExample,log2)).

Internally, the PlotGroupDiffs() function performs a Wilcoxon Rank Sum test to quantify the statistical significance of PS/DS group expression. This quantification allows users to detect developmental stages of significant expression level differences between PS/DS groups. In this example we chose genes originated before the evolution of embryogenesis evolved in plants (Group1 = PS1-3) versus genes originated after the evolution of embryogenesis evolved in plants (Group2 = PS4-12). As a result, we observe that indeed the difference in total gene expression between these groups is significant throughout embryogenesis. In terms of the P-value quantification we observe that the P-value is minimized towards the phylotypic period. Hence, the expression level difference between the studied PS groups is maximized during the phylotypic period.