Distances for time series clustering¶

April 7, 2015

Mick Watson wrote a great blog post introducing heatmaps produced by hierarchical clustering and the distance functions they critically depend on. The toy example happened to be time series gene expression data, and I wanted to make a subtle but important point about the choice of distance function for time series data.

Another toy example¶

Consider time series expression data for the three genes below. Each gene is expressed at two time points. I calculated both Pearson and Spearman correlation for the gene pairs, both of which are frequently used to derive distances for clustering.

from numpy import random as rd
from scipy import stats as st
%pylab inline
rd.seed(472015)

rc('axes', titlesize='x-large', labelsize='x-large')
rc('legend', fontsize='x-large')

# Each gene will be expressed at two time points.  Expression and time in arbitrary units.
expression = zeros((3,24))
# Gene A
expression[0,2] = 1
expression[0,3] = 1
# Gene B
expression[1,19] = 1
expression[1,20] = 1
# Gene C
expression[2,21] = 1
expression[2,22] = 1
# Amplitude of the signal
expression *= 20
# Add noise
noise = rd.randn(3,24)
# Non-negative expression levels make for a better visualization
expression += abs(amin(noise))
expression += noise

plot(expression.T, lw='3')
legend(['A', 'B', 'C'], loc='center left', bbox_to_anchor=(1, 0.5))
title('Time series toy example')
xlabel('Time')
ylabel('Expression')

# Compute gene-gene correlation
print('\nOriginal time series gene expression correlations')
print('Gene pair\tPearson\t\tSpearman')
print('A-B\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[1,:])[0], st.spearmanr(expression[0,:], expression[1,:])[0])
print('A-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[2,:])[0], st.spearmanr(expression[0,:], expression[2,:])[0])
print('B-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[1,:], expression[2,:])[0], st.spearmanr(expression[1,:], expression[2,:])[0])

Populating the interactive namespace from numpy and matplotlib

Original time series gene expression correlations
Gene pair	Pearson		Spearman
A-B		-0.03		0.02
A-C		-0.11		-0.25
B-C		-0.15		-0.32

Correlation observations¶

None of the gene pairs are well-correlated because the times at which they are expressed are disjoint. Intuitively, one might wish to treat B and C as similar temporal profiles because both are expressed "late" whereas A is expressed "early". Correlation doesn't capture that visual trend.

Permuting time points¶

In fact, these correlation measures don't care about the temporal structure in the data at all. They treat the time series as a bag of samples. To illustrate, I randomly reordered the time points and recompute the correlation. The visual trends are destroyed, but the correlation remains the same.

# Generate a random ordering of the time points and rearrange the columns
order = rd.permutation(24)
expression = expression[:,order]

# Plot the time series in the permuted order
figure()
plot(expression.T, lw='3')
legend(['A', 'B', 'C'], loc='center left', bbox_to_anchor=(1, 0.5))
title('Toy example - permuted time points')
xlabel('Permuted time')
ylabel('Expression')

# Compute gene-gene correlation on the permuted matrix
print('Permuted time series gene expression correlations')
print('Gene pair\tPearson\t\tSpearman')
print('A-B\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[1,:])[0], st.spearmanr(expression[0,:], expression[1,:])[0])
print('A-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[2,:])[0], st.spearmanr(expression[0,:], expression[2,:])[0])
print('B-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[1,:], expression[2,:])[0], st.spearmanr(expression[1,:], expression[2,:])[0])

Permuted time series gene expression correlations
Gene pair	Pearson		Spearman
A-B		-0.03		0.02
A-C		-0.11		-0.25
B-C		-0.15		-0.32

Time series clustering¶

So what if a user does want to cluster based on temporal trends instead of the bag of samples correlation? A few existing options include:

Temporally-aware distance functions There are time series-specific distance functions that consider alignment of the temporal expression profiles, transformations of the time series, or other temporal features that can be extracted for each gene.
Temporal smoothness or interpolation Some clustering algorithms interpolate unobserved time points by fitting splines or other curves before clustering. Others use Gaussian processes as a prior that expression levels should be smooth over time.
Template or shape-based clusters These methods cluster time series data by matching them to reference temporal profiles, for instance, identifying genes that follow the expression pattern "up-up-unchanged-down-unchanged".
Probabilistic graphical models Hidden-markov models and dynamic Bayesian networks have also been used in clustering.

Two reviews I wrote with co-authors provide a few specific references to methods implementing these strategies. If there is sufficient interest, I would also consider expanding this discussion to include more references.

Contact information is at my website if you have comments, want additional information, or can't access the reviews.