Interactive microbiome analysis with Jupyter notebooks
Today we’re announcing a big update to our notebooks platform and our
onecodex Python client library. It’s now easier than ever to quickly generate interactive, informative plots from your microbiome data on One Codex. In this post we’ll walk through how to answer some of the most common questions you’ll encounter in your microbiome work.
To get started, we’ll look at data from the DIABIMMUNE cohort, which was used to study the “hygiene hypothesis.” This hypothesis proposes that susceptibility to auto-immune disorders is linked to diverse antigen exposure in youth.
In this study, the gut microbiome of infants from three countries was sampled over the course of three years. This enabled the authors to investigate whether the abundances of particular microbial taxa were correlated with any of their rich study metadata — including a number of markers for allergy and auto-immune disease.
To set up this analysis, we simply imported 500 samples from the DIABIMMUNE project into One Codex. Everything that follows directly uses the One Codex API and library to interact with this data. Feel free to follow along as we go — you can either create a new notebook or copy the example notebook to get started!
from onecodex import Api ocx = Api() project = ocx.Projects.get("d53ad03b010542e3") # get DIABIMMUNE project by ID samples = ocx.Samples.where(project=project.id, public=True, limit=500) samples.metadata[[ "gender", "host_age", "geo_loc_name", "totalige", "eggs", "vegetables", "milk", "wheat", "rice", ]].head(20)
To save space, we’re showing a few columns of study metadata for the first 20 samples. In a live notebook, you can slice and dice this data however you’d like (and update metadata for your own samples). Storing rich metadata alongside your samples makes quick, exploratory analyses much faster.
Question #1: How does alpha diversity vary by sample group?
Diversity metrics are a common “first look” at microbiome datasets. With our Jupyter notebooks, you can plot the diversity of samples grouped by their metadata with a single line of code. Here, we combine several of these plots to display Chao1, Simpson’s Index, and Shannon Entropy grouped by the region of birth. Each group includes samples taken across the entire three-year longitudinal study.
chao1 = samples.plot_metadata(vaxis="chao1", haxis="geo_loc_name", return_chart=True) simpson = samples.plot_metadata(vaxis="simpson", haxis="geo_loc_name", return_chart=True) shannon = samples.plot_metadata(vaxis="shannon", haxis="geo_loc_name", return_chart=True) chao1 | simpson | shannon
As with all of these plotting routines, you can view the documentation and read about all different plotting options within your notebook, e.g., by entering
samples.plot_metadata?. You can also return the
altair chart objects directly and modify them or create your own custom plots. And if you have ideas for a new useful plot, issues and PRs are welcome!
Question #2: How does the microbiome change over time?
You might also want to see how the abundance of a particular organism or group of organisms (in this case the genus Bacteroides) varies over time across your cohort. We can use the convenient
plot_metadata method again to search through all taxa and plot the number of reads or relative abundances.
samples.plot_metadata(haxis="host_age", vaxis="Bacteroides", plot_type="scatter")
Note: This chart is fully interactive! You can pan around and zoom in on individual samples. Clicking a point will open the underlying analysis in One Codex.
Question #3: How does an individual subject’s gut change over time?
Another common question is to investigate how a given individual’s microbiome changed over time. We can plot this with only a few lines of code. It’s nice to see the expected high abundance of Bifidobacterium early in life, giving way to Bacteroides near age three!
# generate a dataframe containing relative abundances (default metric for WGS datasets) df_rel = samples.to_df(rank="genus") # fetch all samples for subject P014839 subject_metadata = samples.metadata.loc[samples.metadata["host_subject_id"] == "P014839"] subject_df = df_rel.loc[subject_metadata.index] # order by subject age subject_df = subject_df.loc[subject_metadata["host_age"].sort_values().index] # you can access our library using the ocx accessor on pandas dataframes! subject_df.ocx.plot_bargraph( rank="genus", label=lambda metadata: str(metadata["host_age"]), title="Subject P014839 Over Time", xlabel="Host Age at Sampling Time (days)", ylabel="Normalized Read Count", legend="Genus", )
Technical note: Under the hood, we use Pandas dataframes, enabling arbitrary, complex manipulations with code you’re already familiar with. Below, we drop into a dataframe, slice it to fetch all the data points from a single subject of the study, and generate a stacked bar plot. Even after this custom manipulation, we still have access to all of the handy One Codex library
Question #4: Heatmaps?!
Next up is a simple heatmap plot, allowing you to see how key organisms vary across samples. To save space, we’re only displaying the first 30 samples in this dataset and the 10 most abundant genera. Be sure to hover over each of the boxes — see how the geographical origin is displayed in a tooltip? You can display any of your metadata in a tooltip in pretty much all of our plots!
df_rel[:30].ocx.plot_heatmap(legend="Normalized Read Count", tooltip="geo_loc_name")
Question #5: How do samples cluster?
The next few plots demonstrate the different ways of calculating distances between samples and finding clusters that are available in our library. First up, we’ll plot a heatmap of weighted UniFrac distance between the first 30 samples in the dataset. This requires absolute read counts, which we’ll generate in a new dataframe.
# generate a dataframe containing absolute read counts df_abs = samples.to_df() df_abs[:30].ocx.plot_distance(metric="weighted_unifrac")
Question #6: Can I do PCA?
In the original paper, Figure 2A shows two PCA plots: one colored by geographical location and another by host age at the time of sampling. In our library, generating these plots is easy! You can also size the points in the plot by our own metadata, or the relative abundance of a particular taxon. For example, in the below plot, large points correspond to high abundances of Bifidobacterium in samples taken just after birth. The older samples approaching age three, as expected, have low abundances of this taxon.
Use your mouse to zoom in on and scroll around this plot — it’s fully interactive!
Question #7: Can I do something better than PCA?
We also support multidimensional scaling using several distance metrics and either an iterative
smacof) or eigenvector decomposition (
PCoA) algorithm. Like PCA, points in these
plots can be colored and scaled according to data of your choice.
Clicking on a point in this plot will take you to the analysis page on our website, where you can learn about the sample, perform analyses on it, and more!
samples.plot_mds( metric="weighted_unifrac", method="pcoa", color="geo_loc_name" )
In our next blog post, we'll show you how easy it is to reproduce a publication using our platform. Thanks for reading and [drop us a note](mailto:email@example.com) if you have any questions!