Interactive microbiome analysis with Jupyter notebooks

Today we’re announcing a big update to our notebooks platform and our onecodex Python client library. It’s now easier than ever to quickly generate interactive, informative plots from your microbiome data on One Codex. In this post we’ll walk through how to answer some of the most common questions you’ll encounter in your microbiome work.

To get started, we’ll look at data from the DIABIMMUNE cohort, which was used to study the “hygiene hypothesis.” This hypothesis proposes that susceptibility to auto-immune disorders is linked to diverse antigen exposure in youth.

Vatanen T, Kostic A, d’Hennezel E, et al.
Variation in Microbiome LPS Immunogenicity Contributes to Autoimmunity in Humans.
Cell. Online April 28, 2016. DOI: 10.1016/j.cell.2016.04.007

In this study, the gut microbiome of infants from three countries was sampled over the course of three years. This enabled the authors to investigate whether the abundances of particular microbial taxa were correlated with any of their rich study metadata — including a number of markers for allergy and auto-immune disease.

To set up this analysis, we simply imported 500 samples from the DIABIMMUNE project into One Codex. Everything that follows directly uses the One Codex API and library to interact with this data. Feel free to follow along as we go — you can either create a new notebook or copy the example notebook to get started!

from onecodex import Api
ocx = Api()

project = ocx.Projects.get("d53ad03b010542e3")  # get DIABIMMUNE project by ID
samples = ocx.Samples.where(, public=True, limit=500)


To save space, we’re showing a few columns of study metadata for the first 20 samples. In a live notebook, you can slice and dice this data however you’d like (and update metadata for your own samples). Storing rich metadata alongside your samples makes quick, exploratory analyses much faster.

gender host_age geo_loc_name totalige eggs vegetables milk wheat rice
6579e99943f84ad2 Male 1093 Finland:Espoo 62.90 True True False True True
b50c176668234fe7 Female 686 Russia:Petrozavodsk 91.50 True True True False True
e0422602de41479f Female 673 Estonia:Tartu 112.00 True True True True True
23f2a0a6723f4d10 Female 173 Russia:Petrozavodsk NaN False False False False False
d353e3d4f5304e29 Male 493 Russia:Petrozavodsk 42.30 False True False False False
d75858183ff54c5d Male 229 Estonia:Tartu 36.50 False False False False False
4180ce34885f4273 Male 502 Estonia:Tartu 7.39 True True True True True
c4937be840af4c01 Male 390 Estonia:Tartu 698.00 True True True True True
95717cd0285444f3 Male 427 Estonia:Tartu 88.10 True True True True True
d60071c6a3cc4e6b Female 587 Estonia:Tartu 25.20 False True True True True
ee5bf7fa639a4103 Female 598 Estonia:Tartu 131.00 True True True True True
31e2157fa89b4266 Female 594 Estonia:Tartu 30.80 True True True True True
25cd500155f34a3e Male 410 Estonia:Tartu 7.39 True True True True True
2441a5c9ff864022 Male 500 Estonia:Tartu 86.60 True True True True True
dbdc0630ed9c45a8 Female 406 Estonia:Tartu 13.70 True True True True True
0d3b62b9525e4e64 Female 278 Russia:Petrozavodsk 12.10 True True True False True
2d9a5b5a037a42c1 Male 483 Estonia:Tartu 698.00 True True True True True
9633c41650824108 Female 500 Estonia:Tartu 23.90 True True True False False
e0ee7b3625f54325 Male 675 Estonia:Tartu 25.00 True True True True True
3da0cb6de24846cd Male 479 Estonia:Tartu 88.10 True True True True True

Question #1: How does alpha diversity vary by sample group?

Diversity metrics are a common “first look” at microbiome datasets. With our Jupyter notebooks, you can plot the diversity of samples grouped by their metadata with a single line of code. Here, we combine several of these plots to display Chao1, Simpson’s Index, and Shannon Entropy grouped by the region of birth. Each group includes samples taken across the entire three-year longitudinal study.

chao1 = samples.plot_metadata(vaxis="chao1", haxis="geo_loc_name", return_chart=True)
simpson = samples.plot_metadata(vaxis="simpson", haxis="geo_loc_name", return_chart=True)
shannon = samples.plot_metadata(vaxis="shannon", haxis="geo_loc_name", return_chart=True)

chao1 | simpson | shannon

As with all of these plotting routines, you can view the documentation and read about all different plotting options within your notebook, e.g., by entering samples.plot_metadata?. You can also return the altair chart objects directly and modify them or create your own custom plots. And if you have ideas for a new useful plot, issues and PRs are welcome!

Question #2: How does the microbiome change over time?

You might also want to see how the abundance of a particular organism or group of organisms (in this case the genus Bacteroides) varies over time across your cohort. We can use the convenient plot_metadata method again to search through all taxa and plot the number of reads or relative abundances.

samples.plot_metadata(haxis="host_age", vaxis="Bacteroides", plot_type="scatter")

Note: This chart is fully interactive! You can pan around and zoom in on individual samples. Clicking a point will open the underlying analysis in One Codex.

Question #3: How does an individual subject’s gut change over time?

Another common question is to investigate how a given individual’s microbiome changed over time. We can plot this with only a few lines of code. It’s nice to see the expected high abundance of Bifidobacterium early in life, giving way to Bacteroides near age three!

# generate a dataframe containing relative abundances (default metric for WGS datasets)
df_rel = samples.to_df(rank="genus")

# fetch all samples for subject P014839
subject_metadata = samples.metadata.loc[samples.metadata["host_subject_id"] == "P014839"]
subject_df = df_rel.loc[subject_metadata.index]

# order by subject age
subject_df = subject_df.loc[subject_metadata["host_age"].sort_values().index]

# you can access our library using the ocx accessor on pandas dataframes!
    label=lambda metadata: str(metadata["host_age"]),
    title="Subject P014839 Over Time",
    xlabel="Host Age at Sampling Time (days)",
    ylabel="Normalized Read Count",

Technical note: Under the hood, we use Pandas dataframes, enabling arbitrary, complex manipulations with code you’re already familiar with. Below, we drop into a dataframe, slice it to fetch all the data points from a single subject of the study, and generate a stacked bar plot. Even after this custom manipulation, we still have access to all of the handy One Codex library plot_* methods.

Question #4: Heatmaps?!

Next up is a simple heatmap plot, allowing you to see how key organisms vary across samples. To save space, we’re only displaying the first 30 samples in this dataset and the 10 most abundant genera. Be sure to hover over each of the boxes — see how the geographical origin is displayed in a tooltip? You can display any of your metadata in a tooltip in pretty much all of our plots!

df_rel[:30].ocx.plot_heatmap(legend="Normalized Read Count", tooltip="geo_loc_name")

Question #5: How do samples cluster?

The next few plots demonstrate the different ways of calculating distances between samples and finding clusters that are available in our library. First up, we’ll plot a heatmap of weighted UniFrac distance between the first 30 samples in the dataset. This requires absolute read counts, which we’ll generate in a new dataframe.

# generate a dataframe containing absolute read counts
df_abs = samples.to_df()

Question #6: Can I do PCA?

In the original paper, Figure 2A shows two PCA plots: one colored by geographical location and another by host age at the time of sampling. In our library, generating these plots is easy! You can also size the points in the plot by our own metadata, or the relative abundance of a particular taxon. For example, in the below plot, large points correspond to high abundances of Bifidobacterium in samples taken just after birth. The older samples approaching age three, as expected, have low abundances of this taxon.

Use your mouse to zoom in on and scroll around this plot — it’s fully interactive!

samples.plot_pca(color="host_age", size="Bifidobacterium")

Question #7: Can I do something better than PCA?

We also support multidimensional scaling using several distance metrics and either an iterative optimization (smacof) or eigenvector decomposition (PCoA) algorithm. Like PCA, points in these plots can be colored and scaled according to data of your choice.

Clicking on a point in this plot will take you to the analysis page on our website, where you can learn about the sample, perform analyses on it, and more!

    metric="weighted_unifrac", method="pcoa", color="geo_loc_name"

In our next blog post, we'll show you how easy it is to reproduce a publication using our platform. Thanks for reading and [drop us a note]( if you have any questions!
← Back to the One Codex blog SMS-seq  🧬 → ☎️ → 🦠 →