Interspecific Ecometabolomics Tutorial • ecomet

This Relative Paths will work if you open the .Rjroject.

pak::pak("phytoecia/eCOMET", upgrade = TRUE)

library(here)
here()

library(ecomet)
library(dplyr)
library(ggplot2)

Background

In this tutorial, we will demonstrate how to use the eCOMET package for analyzing multiple species of trees that co-occur accross a network of forest plots. We start by comparing chemical richness and diversity across samples. We then compare the composition of samples using dimensional reduction (hierarchical clustering, PCAs, NMDS) to display how samples within groups compare to one another. Up until this point, we have assumed that all MZ-RT features are (statistically) independent. However,this assumption is often wrong when analyzing untargeted metabolomic datasets because multiple features are detected for each compound (i.e. adducts, isotopes and in-source fragments). To address this in the second half of the tutorial we take measures to account for associations between features, first by grouping features from to the same metabolite, and then by accounting for structural similarity between features.

Unlike in a treatment based study, where the expectation is that many compounds will be shared between treatment groups, and the goal is to identify differential expressed metabolites. In studies comparing across species, we are often presented with a case where the vast majority of compounds are not shared between groups (i.e species of plants). In these cases, metrics of chemical similarity can become relatively meaningless if we rely completely on shared MZ-RT features. To address this, chemical ecologists have integrated tools that account for relationships between detected features - feature based and ion identity molecular networking and structurally weighted scores of chemical similarity. These tools take into account for the fact that although two samples do not contain the same mz-rt feature, they both are producing compounds from the same biosynthetic origin and thus, should be considered more chemically similar than another sample that produces compounds from a different class/pathway.

Step 1: Load input data and generate the MMO object

here()

demo_feature <- here("tutorial/tutorial_2_interspecific/EcoMET_Interspecific_full_feature_table.csv")
demo_metadata <- here("tutorial/tutorial_2_interspecific/EcoMET_Interspecific_metadata_no_blank.csv")
demo_sirius_formula <- here("tutorial/tutorial_2_interspecific/canopus_formula_summary.tsv")
demo_sirius_structure <- here("tutorial/tutorial_2_interspecific/structure_identifications.tsv")
demo_dreams <- here("tutorial/tutorial_2_interspecific/EcoMET_Interspecific_edges_dreams_dreams.csv")

# Initialize eCOMET object
here()

mmo <- GetMZmineFeature(mzmine_dir=demo_feature, metadata_dir = demo_metadata, group_col = 'Species_binomial',sample_col = "filename")

mmo

Building the MMO object creates three objects:

Sample Metadata:

We are working with a dataset containing 10 species of tropical trees with 5-8 replicate samples for each species.

mmo$metadata

Feature info…

mmo$feature_info

Feature data…

Feature x Abundance matrix where the first two columns identify individual metabolites and the remaining columns represent the abundance in each sample

mmo$feature_data

Task 1: A) Calculate the number of features observed in each sample and each species:

We can use the GetAlphaDiversity function to calculate many diversity indices from the data. This function uses the functional hill framework(CITE) for calculating diversity and thus can integrate simple diversity measures (i.e. richness) or measures that take into account relative abundance of feature AND structural relationships between features.

For simple Richness, we set Q= 0, as this converts the abundance matrix to presence absence and we set mode= “unweighted” meaning we do not weight features by their structural similarity.

By default we set a

#Will update functional hill number to calculate above threshold not on NA.
sample_richness <- GetAlphaDiversity(mmo, q = 0, normalization = 'None',mode = 'unweighted', threshold = 0)
#must be here! return length(p when counting presence absence is just going to return the number of features?)
# hill_numbers <- apply(feature[, -(1:2)], 2, function(x) {
#     p <- x / sum(x)
#     if (q == 0) {
#       return(length(p))
#     } else if (q == 1) {
#       return(exp(-sum(p * log(p))))
#     } else {
#       return((sum(p^q))^(1 / (1 - q)))
#     }
#   })


sample_richness <- GetAlphaDiversity(
  mmo,
  mode = "richness",
  threshold = 0,
  output = "sample_level"
)

ggplot2::ggplot(sample_richness, ggplot2::aes(x = group, y = value)) +
  ggplot2::geom_boxplot(outlier.shape = NA) +
  ggplot2::geom_jitter(width = 0.15, height = 0, alpha = 0.6) +
  ggplot2::labs(
    x = "Species",
    y = "Feature Richness"
  ) +
  ggplot2::theme_classic() +
  ggplot2::theme(
    axis.text.x = element_text(angle = 90, hjust = 0.5)
  )

This plot shows the distribution of richness, but we can also summarize the richness per group directly…

group_mean_richness <- GetAlphaDiversity(
  mmo,
  mode = "richness",
  threshold = 0,
  output = "group_average",
  ci = 0.95   # optional; controls lwr/upr quantiles in the summary
)
group_mean_richness

We can calculate the cumualitive richess across each group.

#Should add option for mean or sum when pooling samples

group_pooled_richness <- GetAlphaDiversity(
  mmo,
  mode = "richness",
  threshold = 0,
  output = "group_cumulative",
  pool_method = "sum"
)

group_pooled_richness

Finally, we can create a compound accumulation curve per group by rarefying samples and seeing how compound richness increases as we add samples

#this is too slow because of filtering. Need to update functions so they don't require mmo..
#update so it includes raw data of the rarefaction needs ot be conserved
group_rarefied_richness <- GetAlphaDiversity(
  mmo,
  mode = "richness",
  threshold = 0,
  output = "rarefied_sample",
  n_perm = 200,    # permutations for CI
  ci = 0.95,
  seed = 1,
  pool_method = "sum"
)


group_rarefied_richness$summary
group_rarefied_richness$raw

p <- RarefactionAUC(group_rarefied_richness, n_boot = 500)

ggplot(p$auc_boot) +
  geom_density(aes(x = auc, fill = group)) +
  theme_classic()

#Stacked Barplot and Loading in Sirius… This is in bad shape. Need to discuss.

#Eventually Dale will implement addGNPS and addDREAMS hits...

# Add SIRIUS annotation
mmo <- AddSiriusAnnot(mmo, canopus_structuredir = demo_sirius_structure, canopus_formuladir = demo_sirius_formula)
mmo$sirius_annot

#MS will add this...
###Next add stacked barcplot...

PlotNPCStackedBar(mmo, 
    group_col = 'Species_binomial', 
    outdir = 'NPC_stakced_bar.pdf', 
    width = 8, height =4, 
    save_output = FALSE)

#Task 2: Visualize difference between Species

PCA and Bray Curtis

## MS will add this
mmo <- ReplaceZero(mmo,method = "one")

PCAplot(mmo, normalization = 'None', save_output = FALSE)

HC of Bray-Curtis - Dale adds does this.

We illustrate how different species of trees share a small fraction of there mz-rt features and show how this can be useful for identifying chemotypes.

introduce relationship amongst features…

We introduce feature based molecular networking and compound dendrograms that group mz-features into structurally similar groups on a tree.

# Add Dreams distance
mmo <- AddChemDist(mmo, dreams_dir = demo_dreams)

mmo$dreams.dissim[1:10,1:10]

Image of molecular network

Generate Dendrogram

build a dendrogram and plot

#write it out into itol…

unique(mmo$metadata$group)
mmo_sample <- filter_mmo(mmo,group_list = "Inga alba")

mmo_tree <- FeatureDendrogram(mmo_sample)
plot(mmo_tree$phylo,type = "fan")

length(mmo_tree$phylo$tip.label)

We illustrate how a feature abundance matrix and a chemical dendrogram/similarity matrix can be used to calculate:

#GetAlpha w/ weighted mode and hillnumber

Structurally weighted Chemical Diversity c) Structurally weighted Chemical Similarity between samples b) richness by pathway NPCstacked barplot - potentially add in pathway stacked

Note you can change the group column at any time using: SwitchGroup()

names(mmo$metadata)

mmo_genus <- SwitchGroup(mmo, new_group_col = "Genus")
mmo_genus

For the final tutorial I will create data that only includes this subset but for now we can just filter it…

samples_keep <- mmo$metadata$sample[mmo$metadata$Keep != "No"]
mmo <- filter_mmo(mmo,sample_list  = samples_keep,drop_empty_feat = TRUE)

Step 2: Estimate Chemical Richness in each sample and average across species.

we want to compare across species. *Note this assumes that all features re indepenent of one another. richness sirius annotation stacked barplot cumulative richness

Step 3: Visualize differences between species using a PCA plot and using Hierarchical Clustering

PCA and Bray-Curtis

Bray-Curtis

###?? why does it use the dreams distance? Isn’t Bray not using relationship among features ### Need to check what this is actually doing…

DLF needs to add a function for generating a hclust plot

beta_diversity_dreams_bray <- GetBetaDiversity(mmo, method = 'bray', 
 normalization = 'Log', distance = 'dreams', filter_feature = FALSE)

# Convert to 'dist' object
d <- as.dist(beta_diversity_dreams_bray)

library(ape)
library(colorspace)

# Run hierarchical clustering
hc <- hclust(d, method = "ward.D2")   # or "ward.D2", "complete"
plot(hc)

Pvclust allows you to assess significance between groups.

We illustrate how different species of trees share a small fraction of there mz-rt features and show how this can be usefull for identifying chemotypes.

Accounting for spectral, structural and biosynthetic similarity among features

Group Features by class NPCstacked plot

richness by pathway NPCstacked barplot - potentially add in pathway stacked

Use Ion Identity Molecular Networking to group coeluting features together.
Account for Structural Similarity between features using feature based molecular networks

We introduce feature based molecular networking and compound dendrograms that group mz-features into structurally similar groups on a tree.

show a feature based molecular network…

Show different trees. actual based on the standards chemodev cosine, dreams,

#dendrogram of

We illustrate how a feature abundance matrix and a chemical dendrogram/similarity matrix can be used to calculate:

Structurally weighted Chemical Diversity
Structurally weighted Chemical Similarity between samples

mmo$feature_info %>% arrange(feature_group)

We can add the feature x feature pairwise distance to the mmo object based on the output of MzMine. This can accept modified cosine, dreaMS and MS2Deepscore

# Add Dreams distance
mmo <- AddChemDist(mmo, dreams_dir = demo_dreams)

mmo$dreams.dissim[1:10,1:10]

Sample Metadata: In this tutorial we are comparing 5 species of hickory along side a sample of Eastern black walnut.

mmo $metadata$ Species

There are also two types of samples “pools” and "samples. Pools represent species pools where each individual sample for each spcies was combined and run for ms/ms data analysis. Samples represent MS1 data for anindividual plant.

mmo$metadata

Note we can subset an mmo object using filter mmo to filte based on a list of samples, a list of groups or a list of features

In this tutorial we will subset to only individual samples removing the pooled samples

Build a dendrogram three options…

We are going to randomly pick 100 feature groups in order to illustrate how to generate how to make a dendrogram

# samples_single <- mmo$metadata$sample[mmo$metadata$speciesCode == "ingalb"]
# mmo_single <- filter_mmo(mmo,sample_list  = samples_single,drop_empty_feat = TRUE)
# 
# View(mmo_single$feature_info %>% arrange(feature_group,`ion_identities:iin_id`))
# 
table(mmo_single$feature_info$feature_group)
table(table(mmo_single$feature_info$feature_group))

samples_single <- mmo$metadata$sample[mmo$metadata$speciesCode %in% c("stanard1","stanard2")]
mmo_single <- filter_mmo(mmo,sample_list  = samples_single,drop_empty_feat = TRUE)
 

table(mmo_single$feature_info$feature_group)


feature_info_abundance <- mmo_single$feature_info %>%
  mutate(id = as.character(id)) %>%
  left_join(
    mmo_single$feature_data[, c(1, 3:4)] %>%
      mutate(id = as.character(id)) %>%
      group_by(id) %>%
      summarise(max = max(I1_C1_1, I1_C2_1), .groups = "drop"),
    by = "id"
  )

abundant_standard_feature <- feature_info_abundance$id[feature_info_abundance$max > 1e6]

mmo_single <- filter_mmo(mmo,feature_list = abundant_standard_feature,drop_empty_feat = TRUE)

mmo_single
table(mmo_single$feature_info$feature_group)

mmo_single$feature_info %>% filter(feature_group== "253")

mmo_single$feature_info %>% filter(rt > 6.0 & rt < 6.2)
mmo_single$feature_info %>% filter(mz > 139.02 & mz < 139.04)

mmo_single$feature_data %>% filter(id %in% as.character(mmo_single$feature_info %>% filter(feature_group== "255") %>% pull(id))) %>% select(id,feature,I1_C1_1,I1_C2_1) %>% arrange(-I1_C2_1)



mmo_single_dendrogram_correlation <- FeatureDendrogram(mmo=mmo_single,distance = "dreams",
  method = "average",
  ion_identity = "correlation",
  corr_col = "feature_group",
  iin_col  = "ion_identities:iin_id",
  within_group_dist = 0,
  save_output = FALSE,
  outprefix = "feature_tree"
)
plot(mmo_single_dendrogram_correlation$phylo, type = "fan", cex = 0.4, label.offset = 0.002,main="Dreams_Distance")

all_feature_groups <- unique(mmo_individual $feature_info$ feature_group[!is.na(mmo_individual $feature_info$ feature_group)])

sampled_feature_groups <- sample(all_feature_groups,50)

sampled_features <- mmo_individual $feature_info$ id[mmo_individual $feature_info$ feature_group %in% sampled_feature_groups]

mmo_individual_sampled_features <- filter_mmo(mmo_individual,feature_list = sampled_features,drop_empty_feat = TRUE)

mmo_single_dendrogram <- FeatureDendrogram(mmo=mmo_individual_sampled_features,distance = “dreams”, method = “average”, ion_identity = “none”, corr_col = “feature_group”, iin_col = “ion_identities:iin_id”, within_group_dist = 0, save_output = FALSE, outprefix = “feature_tree” )

plot(mmo_single_dendrogram$phylo, type = “fan”, cex = 0.4, label.offset = 0.002,main=“Dreams_Distance”)

table(mmo_individual_sampled_features$feature_info$feature_group)
table(mmo_individual_sampled_features$feature_info$`ion_identities:iin_id`)

#highlight_groups <- c("249", "41", "16", "195", "604")

highlight_groups <- c("12", "19", "24", "50", "74")

highlight_groups <- as.character(highlight_groups)

phy <- mmo_single_dendrogram$phylo
fi  <- mmo_individual_sampled_features$feature_info

fi$id <- as.character(fi$id)
tips  <- as.character(phy$tip.label)

idx <- match(tips, fi$id)
if (anyNA(idx)) {
  stop("Some phylo tip labels not found in feature_info$id")
}

#fg <- as.character(fi$feature_group[idx])
fg <- as.character(fi$`ion_identities:iin_id`[idx])
fg[is.na(fg) | fg == "" | fg == "NA"] <- "other"

# base color for non-highlighted tips
tip_cols <- rep("grey80", length(fg))

# colors ONLY for highlighted groups
hl_cols <- setNames(
  grDevices::rainbow(length(highlight_groups)),
  highlight_groups
)

# apply highlight colors
sel <- fg %in% highlight_groups
tip_cols[sel] <- hl_cols[fg[sel]]


plot(
  phy,
  type = "fan",
  tip.color = tip_cols,
  cex = 0.4,
  label.offset = 0.002,
  main = "Dreams distance (selected feature groups highlighted)"
)

mmo_single_dendrogram_correlation <- FeatureDendrogram(mmo=mmo_individual_sampled_features,distance = "dreams",
  method = "average",
  ion_identity = "correlation",
  corr_col = "feature_group",
  iin_col  = "ion_identities:iin_id",
  within_group_dist = 0,
  save_output = FALSE,
  outprefix = "feature_tree"
)

plot(mmo_single_dendrogram_correlation$phylo, type = "fan", cex = 0.4, label.offset = 0.002,main="Dreams_Distance + Force by Correlation Group ID")


phy <- mmo_single_dendrogram_correlation$phylo
fi  <- mmo_individual_sampled_features$feature_info

fi$id <- as.character(fi$id)
tips  <- as.character(phy$tip.label)

idx <- match(tips, fi$id)
if (anyNA(idx)) {
  stop("Some phylo tip labels not found in feature_info$id")
}

fg <- as.character(fi$feature_group[idx])
fg[is.na(fg) | fg == "" | fg == "NA"] <- "other"

# base color for non-highlighted tips
tip_cols <- rep("grey80", length(fg))

# colors ONLY for highlighted groups
hl_cols <- setNames(
  grDevices::rainbow(length(highlight_groups)),
  highlight_groups
)

# apply highlight colors
sel <- fg %in% highlight_groups
tip_cols[sel] <- hl_cols[fg[sel]]


plot(
  phy,
  type = "fan",
  tip.color = tip_cols,
  cex = 0.4,
  label.offset = 0.002,
  main = "Dreams_Distance + Force by Correlation Group ID"
)

mmo_single_dendrogram_ion_identity <- FeatureDendrogram(mmo=mmo_individual_sampled_features,distance = "dreams",
  method = "average",
  ion_identity = "ion_identity_network",
  corr_col = "feature_group",
  iin_col  = "ion_identities:iin_id",
  within_group_dist = 0,
  save_output = FALSE,
  outprefix = "feature_tree"
)

plot(mmo_single_dendrogram_ion_identity$phylo, type = "fan", cex = 0.4, label.offset = 0.002,main="Dreams_Distance + Force by Ion Identity Group ID")

phy <- mmo_single_dendrogram_ion_identity$phylo
fi  <- mmo_individual_sampled_features$feature_info

fi$id <- as.character(fi$id)
tips  <- as.character(phy$tip.label)

idx <- match(tips, fi$id)
if (anyNA(idx)) {
  stop("Some phylo tip labels not found in feature_info$id")
}

#fg <- as.character(fi$feature_group[idx])
fg <- as.character(fi$`ion_identities:iin_id`[idx])
fg[is.na(fg) | fg == "" | fg == "NA"] <- "other"

# base color for non-highlighted tips
tip_cols <- rep("grey80", length(fg))

# colors ONLY for highlighted groups
hl_cols <- setNames(
  grDevices::rainbow(length(highlight_groups)),
  highlight_groups
)

# apply highlight colors
sel <- fg %in% highlight_groups
tip_cols[sel] <- hl_cols[fg[sel]]


plot(
  phy,
  type = "fan",
  tip.color = tip_cols,
  cex = 0.4,
  label.offset = 0.002,
  main = "Dreams_Distance + Force by Ion Identity Group ID"
)

#samples_keep <- mmo$metadata$sample[mmo$metadata$Type != "pool"]

#samples_keep <- mmo$metadata$sample[mmo$metadata$Keep_Demo == "Yes"]
samples_keep <- mmo$metadata$sample[mmo$metadata$Keep != "No"]
mmo_individual <- filter_mmo(mmo,sample_list  = samples_keep,drop_empty_feat = TRUE)

mmo_individual$metadata


mmo_individual <- FeaturePresence(mmo_individual,threshold = 1)

library(ggvenn)

fp <- mmo_individual$feature_presence
md <- mmo_individual$metadata

# 1) Identify which columns in feature_presence are sample columns
#    (everything except the feature ID column)
sample_cols <- setdiff(colnames(fp), "feature")

# 2) Keep metadata rows that correspond to samples in the presence table
group_df <- mmo_individual$metadata[, c("sample", "group")]
groups <- unique(group_df$group)


# 3) For each group, collect features present in ANY sample in that group

VennInput <- list()

for (g in groups) {

  # samples that belong to this group
  group_samples <- group_df$sample[group_df$group == g]

  # subset the presence matrix to just those samples
  sub <- mmo_individual$feature_presence[, c("feature", group_samples), drop = FALSE]

  # feature is present in the group if it is >0 in at least one sample
  present_in_group <- apply(sub[, -1, drop = FALSE], 1, function(x) any(x > 0))

  # store the feature IDs
  VennInput[[g]] <- sub$feature[present_in_group]
}

# Quick check: number of features per group
sapply(VennInput, length)

# 4) Plot Venn diagram
p <- ggvenn(
  VennInput,
  stroke_size = 0.5,
  set_name_size = 4,
  show_percentage = FALSE
) +
  theme(legend.position = "none")

p

# 5) Save
if (!dir.exists("plot")) dir.create("plot", recursive = TRUE)
ggsave("plot/Venn_FeaturePresence.pdf", plot = p, height = 5, width = 5)



require(ggvenn)
library(ggvenn)
VennInput <- list(
  
)

ggvenn(VennInput, stroke_size = 0.5, set_name_size = 4, show_percentage = FALSE) +
  theme(legend.position = "none") 
ggsave("plot/Venn_Upreg.pdf", height = 5, width = 5)

2) Visualize distance between samples with PLSDA

library(colorspace)

make_pal <- function(n) qualitative_hcl(n, palette = "Dark 3")

n_groups <- length(unique(mmo_individual$metadata$Species_binomial))  # change column
colors <- make_pal(n_groups)

#PLSDAplot(mmo_individual, color = colors, outdir = "~/Downloads/PLSDA.pdf")

beta_diversity_dreams_bray <- GetBetaDiversity(mmo_individual, method = 'bray', 
 normalization = 'Log', distance = 'dreams', filter_feature = FALSE)

beta_diversity_dreams_bray

# Convert to 'dist' object
d <- as.dist(beta_diversity_dreams_bray)

# Run hierarchical clustering
hc <- hclust(d, method = "ward.D2")   # or "ward.D2", "complete"
plot(hc)


library(ape)
library(colorspace)


# convert to phylo
tr <- as.phylo(hc)

# ---- map species to tree tip labels (EDIT these column names) ----
md <- mmo_individual$metadata
# md must contain: sample ID that matches tr$tip.label, and species

tip_ids <- tr$tip.label

species_vec <- md$Species_binomial[match(tip_ids, md$sample)]
if (anyNA(species_vec)) {
  bad <- tip_ids[is.na(species_vec)]
  stop("Species missing for some tree tips. Example missing IDs:\n",
       paste(head(bad, 20), collapse = "\n"))
}
species_vec <- as.character(species_vec)

# palette
sp_levels <- sort(unique(species_vec))
sp_cols <- setNames(qualitative_hcl(length(sp_levels), "Dark 3"), sp_levels)

tip_cols <- unname(sp_cols[species_vec])

# ---- plot ----
par(mar = c(2, 2, 2, 10))  # extra right margin for long labels
plot(tr, type = "phylogram", cex = 0.6, tip.color = tip_cols,
     main = "Hierarchical clustering (DREAMS–Bray)")

legend("topleft", legend = sp_levels, col = sp_cols, pch = 15,
       cex = 0.6, bty = "n")

mmo_individual_dendrogram_correlation <- FeatureDendrogram(mmo=mmo_individual,distance = "dreams",
  method = "average",
  ion_identity = "correlation",
  corr_col = "feature_group",
  iin_col  = "ion_identities:iin_id",
  within_group_dist = 0,
  save_output = FALSE,
  outprefix = "feature_tree"
)

beta_diversity_dreams_Gen_Uni <- GetBetaDiversity(mmo_individual, method = 'Gen.Uni', 
 normalization = 'Log', distance = 'dreams', filter_feature = FALSE)

dim(beta_diversity_dreams_Gen_Uni)

# Convert to 'dist' object
d_Gen_Uni <- as.dist(beta_diversity_dreams_Gen_Uni)

# Run hierarchical clustering
hc_Gen_Uni <- hclust(d_Gen_Uni, method = "average")   # or "ward.D2", "complete"
plot(hc_Gen_Uni, main = "Hierarchical clustering Gen_Uni")

3) Calculate Beta Diversity between samples

#First we calculate dreams
beta_diversity_dreams <- GetBetaDiversity(mmo_individual, method = 'CSCS', 
 normalization = 'None', distance = 'dreams', filter_feature = FALSE)