# Quick introduction to foodingraph

library(foodingraph)

## Introduction

A simple R package to infer food networks from categorical and binary variables.

Displays a weighted undirected graph from an adjacency matrix. Can perform confidence-interval bootstrap inference with mutual information or maximal information coefficient.

#### How it works

From an adjacency matrix, the package can infer the network with confidence-interval (CI) bootstraps of the distribution of mutual information1 values or maximal information coefficients (MIC)2for each pairwise association. The CI bootstrap calculated is compared to the CI bootstraps from simulated independent pairwise associations. The CI bootstrap from simulated independent pairwise variables is used to define a threshold of non-significance in the network. Our approach is to use a threshold for each pairwise variable type : two ordinal variables, two binary variables, one ordinal variable and one ordinal variable.

For example, For each pairwise association, if the 99th percentile of the simulated CI is higher than the 1th percentile of the sample bootstrap distribution, the edge is removed.

From the inferred adjacency matrix, the package can then display the graph using ggplot23, igraph4 and ggraph5.

## Example data set

For the purpose of this example, I invented some food intakes data on $$n=13$$ subjects and $$f=8$$ food groups : $$o=6$$ ordinally-encoded (from 0 to 13) and $$b=2$$ binary-encoded (0 or 1). Therefore, do not expect these examples to reflect reality.

# Food intakes (ordinaly- or binary-encoded)
obs_data <- data.frame(
#| Foods | Subject 1   2   3   4   5   6   7   8   9  10  11  12  13  |
#|-------|------------------------------------------------------------|
alcohol_cat  = c(8,  1,  3,  0, 10,  5,  1, 10,  2,  8,  1,  3,  9),
bread_cat    = c(7,  4,  3,  4,  0,  9,  4,  5,  7,  3,  4,  0,  9),
coffee_cat   = c(3,  6,  6,  6,  2,  3,  5,  8,  8,  6,  6,  2,  3),
duck_cat     = c(0,  3,  1,  0,  0,  2, 13,  1,  0,  0,  2, 13,  1),
eggs_cat     = c(5,  5,  4,  5,  8,  8,  6,  9,  6,  8,  2,  3,  1),
fruit_cat    = c(1,  7,  5,  8,  2,  3,  1,  0,  7,  7,  5,  8,  2),
gin_bin      = c(1,  0,  1,  0,  1,  0,  0,  1,  0,  0,  1,  0,  1),
ham_bin      = c(1,  1,  1,  1,  1,  1,  1,  0,  1,  1,  1,  0,  1)
)

head(obs_data)
#>   alcohol_cat bread_cat coffee_cat duck_cat eggs_cat fruit_cat gin_bin
#> 1           8         7          3        0        5         1       1
#> 2           1         4          6        3        5         7       0
#> 3           3         3          6        1        4         5       1
#> 4           0         4          6        0        5         8       0
#> 5          10         0          2        0        8         2       1
#> 6           5         9          3        2        8         3       0
#>   ham_bin
#> 1       1
#> 2       1
#> 3       1
#> 4       1
#> 5       1
#> 6       1
# The legend for the graph
legend <- data.frame(
name   = colnames(obs_data),
title  = c("Alcohol", "Bread",   "Coffee",    "Duck",    "Eggs", "Fruit", "Gin",     "Ham"),
family = c("Alcohol", "Cereals", "Beverages", "Poultry", "Eggs", "Fruit", "Alcohol", "Meats")
)

# Transform family intro factors?

Now let’s calculate the maximal information coefficient6 adjacency matrix, with the foodingraph function mic_adj_matrix.

adjacency_matrix <-  mic_adj_matrix(obs_data)

## Network inference

This step is optional. If you want to visualize the network, jump to Network visualization

### Arbitrary threshold

Foodingraph allows to select edges on the basis of a threshold value in the adjacency matrix. It can either be applied to the adjacency matrix by the functions graph_from_matrix() or links_nodes_from_mat(), with two parameters:

1. threshold (default is 0) : the threshold value
2. abs_threshold (bool, default TRUE) : if the threshold should apply to the absolute values of the edges or not. If TRUE, it will not convert the values of the adjacency matrix to absolute values, only compare the threshold to the absolute values.

### Confidence-interval bootstrap inference

Foodingraph allows to perform confidence-interval (CI) bootstrap inference, by comparing the CI bootstrap of simulated independent data to the CI bootstrap of each pairwise association of the dataset. Two methods to calculate the CI bootstrap exist : mutual information7 or maximal information coefficient8.

NOTE If you want to use mutual information, be sure to install the minet package available on Bioconductor. It will not be automatically downloaded when installing foodingraph.

#### CI bootstrap of independent simulated data

Let’s start by simulating independent data. As our dataset is comprised of ordinal and binary variables, we will simulate independent :

• pairwise ordinal variables
• pairwise binary variables
• pairwise ordinal & binary variables.

This will allow to compare each pairwise association of the dataset to the corresponding type of threshold.

For this example, we will use MIC.

# Ordinal vs. ordinal
thresh_ord_ord <- boot_simulated_cat_bin("cat", method = "mic", size = 500)
#> Confidence-interval bootstrap on simulated independent variables of type: cat
#> Number of simulations: 10
#> Number of bootstraps per simulations: 5000
#> Sample size for each simulation: 500
#> Contigency table of the simulated data:
#> 80673223530705025741138281041101310731006120000220000001000001010000
#> Simulation 1 : 0.0325156329536082
#> Simulation 2 : 0.0314411514630731
#> Simulation 3 : 0.0323417126850695
#> Simulation 4 : 0.0329836898244721
#> Simulation 5 : 0.0326975566229673
#> Simulation 6 : 0.0311743153258114
#> Simulation 7 : 0.0317563896617964
#> Simulation 8 : 0.0324681183893674
#> Simulation 9 : 0.0323719826652018
#> Simulation 10 : 0.0322471276786498
#> Mean of the percentiles: 0.0321997677270017
# Binary vs. binary
thresh_bin_bin <- boot_simulated_cat_bin("bin", method = "mic", size = 500)
#> Confidence-interval bootstrap on simulated independent variables of type: bin
#> Number of simulations: 10
#> Number of bootstraps per simulations: 5000
#> Sample size for each simulation: 500
#> Contigency table of the simulated data:
#> 1716220067
#> Simulation 1 : 0.00959234862161741
#> Simulation 2 : 0.00981309726162512
#> Simulation 3 : 0.00931286722002853
#> Simulation 4 : 0.00976621630717264
#> Simulation 5 : 0.00995851366409042
#> Simulation 6 : 0.00995319987683481
#> Simulation 7 : 0.0100071579896927
#> Simulation 8 : 0.0095677574348969
#> Simulation 9 : 0.00976409836009721
#> Simulation 10 : 0.00950546936773275
#> Mean of the percentiles: 0.00972407261037885
# Ordinal vs. binary
thresh_ord_bin <- boot_simulated_cat_bin("bincat", method = "mic", size = 500)
#> Confidence-interval bootstrap on simulated independent variables of type: bincat
#> Number of simulations: 10
#> Number of bootstraps per simulations: 5000
#> Sample size for each simulation: 500
#> Contigency table of the simulated data:
#> 40471553101671176132642
#> Simulation 1 : 0.0243750210123508
#> Simulation 2 : 0.0242125362129548
#> Simulation 3 : 0.0242291789462314
#> Simulation 4 : 0.0242102954103764
#> Simulation 5 : 0.0249391710165224
#> Simulation 6 : 0.0241030436053627
#> Simulation 7 : 0.0238956562884486
#> Simulation 8 : 0.0248165211990196
#> Simulation 9 : 0.0240867364120961
#> Simulation 10 : 0.0236899272448566
#> Mean of the percentiles: 0.024255808734822

#### CI bootstrap inference

Now let’s perform the CI bootstrap inference on the observed data. To do this, foodingraph needs a list of the ordinal (a.k.a. categorical) and binary variables, so it can accurately compare the correct threshold to the correct pairwise variables.

As the computations can take some time, a progress bar is built into the function. You can deactivate it by setting the parameter show_progress to FALSE (function boot_cat_bin). Recommended if the output is in a Rmarkdown document.

cat_var <- c("alcohol_cat", "bread_cat", "coffee_cat", "duck_cat", "eggs_cat",
"fruit_cat")
bin_var <- c("gin_bin", "ham_bin")

list_cat_var = cat_var,
list_bin_var = bin_var,
method = "mic",
threshold_cat = thresh_ord_ord,
threshold_bin = thresh_bin_bin,
threshold_bin_cat = thresh_ord_bin,
boots = 5000,
show_progress = FALSE)
#> Performing boostrap inference with method :  mic
# Print how many edges have been removed
print(paste(n_null_after - n_null_before, "edges have been removed"))
#> [1] "9 edges have been removed"

## Network visualization

### Quick start: directly from the adjacency matrix

graph1 <- graph_from_matrix(adjacency_matrix, legend, main_title = "My graph", layout = "graphopt")
graph1
#> $igraph #> IGRAPH e10217d UNW- 8 36 -- #> + attr: name (v/c), title (v/c), family (v/c), size (v/n), weight #> | (e/n), width (e/n), sign (e/c), alpha (e/n) #> + edges from e10217d (vertex names): #> [1] alcohol_cat--alcohol_cat alcohol_cat--bread_cat #> [3] alcohol_cat--coffee_cat alcohol_cat--duck_cat #> [5] alcohol_cat--eggs_cat alcohol_cat--fruit_cat #> [7] alcohol_cat--gin_bin alcohol_cat--ham_bin #> [9] bread_cat --bread_cat bread_cat --coffee_cat #> [11] bread_cat --duck_cat bread_cat --eggs_cat #> [13] bread_cat --fruit_cat bread_cat --gin_bin #> + ... omitted several edges #> #>$net

#>
#> $deg #> name degrees #> 1 Alcohol 9 #> 2 Bread 9 #> 3 Coffee 9 #> 4 Duck 9 #> 5 Eggs 9 #> 6 Fruit 9 #> 7 Gin 9 #> 8 Ham 9 #> #> attr(,"class") #> [1] "list" "foodingraph" ### Save the graph in a file save_graph(graph1) ### Customization Many options and layouts exist to customize the graph. library(ggplot2) custom1 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Node label as name", layout = "graphopt", node_label_title = F, node_label_size = 5) custom2 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Node type as label", layout = "graphopt", node_type = "label") custom3 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Grid layout", layout = "grid", node_label_size = 5) custom4 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Circle layout", layout = "circle", node_label_size = 5) custom1$net
custom2$net custom3$net
custom4\$net

### Compare graphs

Foodingraph provides a useful graph comparison function, which harmonizes the graphs’ weights and node degree sizes, in order to facilitate the visual comparison.

First, let’s generate a second graph.

# New set of observation data
obs_data_2 <- matrix(c(round(runif(78, 0, 13)), round(runif(26))), nrow = 13, ncol = 8)
colnames(obs_data_2) <- colnames(obs_data)

# Compute the MIC adjacency matrix

graph2 <- graph_from_matrix(adjacency_matrix_2, legend, main_title = "My graph 2",
layout = "graphopt")

Then let’s compare the first graph and this one on a single, unified plot using compare_graphs().

comp1_2 <- compare_graphs(graph1, graph2, position = "horizontal")
comp1_2

You can also save this new graph. It will automatically have a bigger size.

save_graph(comp1_2)

## References

1. Meyer, Patrick E, Frédéric Lafitte, and Gianluca Bontempi. “Minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information.” BMC Bioinformatics 9, no. 1 (December 2008). https://doi.org/10.1186/1471-2105-9-461.

2. Albanese, Davide, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello. “Minerva and Minepy: A C Engine for the MINE Suite and Its R, Python and MATLAB Wrappers.” Bioinformatics 29, no. 3 (February 1, 2013): 407–8. https://doi.org/10.1093/bioinformatics/bts707.

3. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

4. Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. http://igraph.org

5. Thomas Lin Pedersen, https://ggraph.data-imaginist.com/

6. Albanese, Davide, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello. “Minerva and Minepy: A C Engine for the MINE Suite and Its R, Python and MATLAB Wrappers.” Bioinformatics 29, no. 3 (February 1, 2013): 407–8. https://doi.org/10.1093/bioinformatics/bts707.

7. Meyer, Patrick E, Frédéric Lafitte, and Gianluca Bontempi. “Minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information.” BMC Bioinformatics 9, no. 1 (December 2008). https://doi.org/10.1186/1471-2105-9-461.

8. Reshef, D. N., Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. “Detecting Novel Associations in Large Data Sets.” Science 334, no. 6062 (December 16, 2011): 1518–24. https://doi.org/10.1126/science.1205438.