Interactive microbiome analysis with Jupyter notebooks

April 25, 2019 Python , bioinformatics , notebooks , code

Today we’re announcing a big update to our notebooks platform and our onecodex Python client library. It’s now easier than ever to quickly generate interactive, informative plots from your microbiome data on One Codex. In this post we’ll walk through how to answer some of the most common questions you’ll encounter in your microbiome work.

To get started, we’ll look at data from the DIABIMMUNE cohort, which was used to study the “hygiene hypothesis.” This hypothesis proposes that susceptibility to auto-immune disorders is linked to diverse antigen exposure in youth.

Vatanen T, Kostic A, d’Hennezel E, et al.
Variation in Microbiome LPS Immunogenicity Contributes to Autoimmunity in Humans.
Cell. Online April 28, 2016. DOI: 10.1016/j.cell.2016.04.007

In this study, the gut microbiome of infants from three countries was sampled over the course of three years. This enabled the authors to investigate whether the abundances of particular microbial taxa were correlated with any of their rich study metadata — including a number of markers for allergy and auto-immune disease.

To set up this analysis, we simply imported 500 samples from the DIABIMMUNE project into One Codex. Everything that follows directly uses the One Codex API and library to interact with this data. Feel free to follow along as we go — you can either create a new notebook or copy the example notebook to get started!

from onecodex import Api
ocx = Api()

project = ocx.Projects.get("d53ad03b010542e3")  # get DIABIMMUNE project by ID
samples = ocx.Samples.where(project=project.id, public=True, limit=500)

samples.metadata[[
    "gender",
    "host_age",
    "geo_loc_name",
    "totalige",
    "eggs",
    "vegetables",
    "milk",
    "wheat",
    "rice",
]].head(20)

To save space, we’re showing a few columns of study metadata for the first 20 samples. In a live notebook, you can slice and dice this data however you’d like (and update metadata for your own samples). Storing rich metadata alongside your samples makes quick, exploratory analyses much faster.

	gender	host_age	geo_loc_name	totalige	eggs	vegetables	milk	wheat	rice
classification_id
6579e99943f84ad2	Male	1093	Finland:Espoo	62.90	True	True	False	True	True
b50c176668234fe7	Female	686	Russia:Petrozavodsk	91.50	True	True	True	False	True
e0422602de41479f	Female	673	Estonia:Tartu	112.00	True	True	True	True	True
23f2a0a6723f4d10	Female	173	Russia:Petrozavodsk	NaN	False	False	False	False	False
d353e3d4f5304e29	Male	493	Russia:Petrozavodsk	42.30	False	True	False	False	False
d75858183ff54c5d	Male	229	Estonia:Tartu	36.50	False	False	False	False	False
4180ce34885f4273	Male	502	Estonia:Tartu	7.39	True	True	True	True	True
c4937be840af4c01	Male	390	Estonia:Tartu	698.00	True	True	True	True	True
95717cd0285444f3	Male	427	Estonia:Tartu	88.10	True	True	True	True	True
d60071c6a3cc4e6b	Female	587	Estonia:Tartu	25.20	False	True	True	True	True
ee5bf7fa639a4103	Female	598	Estonia:Tartu	131.00	True	True	True	True	True
31e2157fa89b4266	Female	594	Estonia:Tartu	30.80	True	True	True	True	True
25cd500155f34a3e	Male	410	Estonia:Tartu	7.39	True	True	True	True	True
2441a5c9ff864022	Male	500	Estonia:Tartu	86.60	True	True	True	True	True
dbdc0630ed9c45a8	Female	406	Estonia:Tartu	13.70	True	True	True	True	True
0d3b62b9525e4e64	Female	278	Russia:Petrozavodsk	12.10	True	True	True	False	True
2d9a5b5a037a42c1	Male	483	Estonia:Tartu	698.00	True	True	True	True	True
9633c41650824108	Female	500	Estonia:Tartu	23.90	True	True	True	False	False
e0ee7b3625f54325	Male	675	Estonia:Tartu	25.00	True	True	True	True	True
3da0cb6de24846cd	Male	479	Estonia:Tartu	88.10	True	True	True	True	True

Question #1: How does alpha diversity vary by sample group?

Diversity metrics are a common “first look” at microbiome datasets. With our Jupyter notebooks, you can plot the diversity of samples grouped by their metadata with a single line of code. Here, we combine several of these plots to display Chao1, Simpson’s Index, and Shannon Entropy grouped by the region of birth. Each group includes samples taken across the entire three-year longitudinal study.

chao1 = samples.plot_metadata(vaxis="chao1", haxis="geo_loc_name", return_chart=True)
simpson = samples.plot_metadata(vaxis="simpson", haxis="geo_loc_name", return_chart=True)
shannon = samples.plot_metadata(vaxis="shannon", haxis="geo_loc_name", return_chart=True)

chao1 | simpson | shannon

As with all of these plotting routines, you can view the documentation and read about all different plotting options within your notebook, e.g., by entering samples.plot_metadata?. You can also return the altair chart objects directly and modify them or create your own custom plots. And if you have ideas for a new useful plot, issues and PRs are welcome!

Question #2: How does the microbiome change over time?

You might also want to see how the abundance of a particular organism or group of organisms (in this case the genus Bacteroides) varies over time across your cohort. We can use the convenient plot_metadata method again to search through all taxa and plot the number of reads or relative abundances.

samples.plot_metadata(haxis="host_age", vaxis="Bacteroides", plot_type="scatter")

Note: This chart is fully interactive! You can pan around and zoom in on individual samples. Clicking a point will open the underlying analysis in One Codex.

Question #3: How does an individual subject’s gut change over time?

Another common question is to investigate how a given individual’s microbiome changed over time. We can plot this with only a few lines of code. It’s nice to see the expected high abundance of Bifidobacterium early in life, giving way to Bacteroides near age three!

# generate a dataframe containing relative abundances (default metric for WGS datasets)
df_rel = samples.to_df(rank="genus")

# fetch all samples for subject P014839
subject_metadata = samples.metadata.loc[samples.metadata["host_subject_id"] == "P014839"]
subject_df = df_rel.loc[subject_metadata.index]

# order by subject age
subject_df = subject_df.loc[subject_metadata["host_age"].sort_values().index]

# you can access our library using the ocx accessor on pandas dataframes!
subject_df.ocx.plot_bargraph(
    rank="genus",
    label=lambda metadata: str(metadata["host_age"]),
    title="Subject P014839 Over Time",
    xlabel="Host Age at Sampling Time (days)",
    ylabel="Normalized Read Count",
    legend="Genus",
)

Technical note: Under the hood, we use Pandas dataframes, enabling arbitrary, complex manipulations with code you’re already familiar with. Below, we drop into a dataframe, slice it to fetch all the data points from a single subject of the study, and generate a stacked bar plot. Even after this custom manipulation, we still have access to all of the handy One Codex library plot_* methods.

Question #4: Heatmaps?!

Next up is a simple heatmap plot, allowing you to see how key organisms vary across samples. To save space, we’re only displaying the first 30 samples in this dataset and the 10 most abundant genera. Be sure to hover over each of the boxes — see how the geographical origin is displayed in a tooltip? You can display any of your metadata in a tooltip in pretty much all of our plots!

df_rel[:30].ocx.plot_heatmap(legend="Normalized Read Count", tooltip="geo_loc_name")

Question #5: How do samples cluster?

The next few plots demonstrate the different ways of calculating distances between samples and finding clusters that are available in our library. First up, we’ll plot a heatmap of weighted UniFrac distance between the first 30 samples in the dataset. This requires absolute read counts, which we’ll generate in a new dataframe.

# generate a dataframe containing absolute read counts
df_abs = samples.to_df()
df_abs[:30].ocx.plot_distance(metric="weighted_unifrac")

Question #6: Can I do PCA?

In the original paper, Figure 2A shows two PCA plots: one colored by geographical location and another by host age at the time of sampling. In our library, generating these plots is easy! You can also size the points in the plot by our own metadata, or the relative abundance of a particular taxon. For example, in the below plot, large points correspond to high abundances of Bifidobacterium in samples taken just after birth. The older samples approaching age three, as expected, have low abundances of this taxon.

Use your mouse to zoom in on and scroll around this plot — it’s fully interactive!

samples.plot_pca(color="host_age", size="Bifidobacterium")

Question #7: Can I do something better than PCA?

We also support multidimensional scaling using several distance metrics and either an iterative optimization (smacof) or eigenvector decomposition (PCoA) algorithm. Like PCA, points in these plots can be colored and scaled according to data of your choice.

Clicking on a point in this plot will take you to the analysis page on our website, where you can learn about the sample, perform analyses on it, and more!

samples.plot_mds(
    metric="weighted_unifrac", method="pcoa", color="geo_loc_name"
)

In our next blog post, we'll show you how easy it is to reproduce a publication using our platform. Thanks for reading and [drop us a note](mailto:hello@onecodex.com) if you have any questions!