Mick Watson wrote a great blog post introducing heatmaps produced by hierarchical clustering and the distance functions they critically depend on. The toy example happened to be time series gene expression data, and I wanted to make a subtle but important point about the choice of distance function for time series data.
Consider time series expression data for the three genes below. Each gene is expressed at two time points. I calculated both Pearson and Spearman correlation for the gene pairs, both of which are frequently used to derive distances for clustering.
from numpy import random as rd
from scipy import stats as st
%pylab inline
rd.seed(472015)
rc('axes', titlesize='x-large', labelsize='x-large')
rc('legend', fontsize='x-large')
# Each gene will be expressed at two time points. Expression and time in arbitrary units.
expression = zeros((3,24))
# Gene A
expression[0,2] = 1
expression[0,3] = 1
# Gene B
expression[1,19] = 1
expression[1,20] = 1
# Gene C
expression[2,21] = 1
expression[2,22] = 1
# Amplitude of the signal
expression *= 20
# Add noise
noise = rd.randn(3,24)
# Non-negative expression levels make for a better visualization
expression += abs(amin(noise))
expression += noise
plot(expression.T, lw='3')
legend(['A', 'B', 'C'], loc='center left', bbox_to_anchor=(1, 0.5))
title('Time series toy example')
xlabel('Time')
ylabel('Expression')
# Compute gene-gene correlation
print('\nOriginal time series gene expression correlations')
print('Gene pair\tPearson\t\tSpearman')
print('A-B\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[1,:])[0], st.spearmanr(expression[0,:], expression[1,:])[0])
print('A-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[2,:])[0], st.spearmanr(expression[0,:], expression[2,:])[0])
print('B-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[1,:], expression[2,:])[0], st.spearmanr(expression[1,:], expression[2,:])[0])
None of the gene pairs are well-correlated because the times at which they are expressed are disjoint. Intuitively, one might wish to treat B and C as similar temporal profiles because both are expressed "late" whereas A is expressed "early". Correlation doesn't capture that visual trend.
In fact, these correlation measures don't care about the temporal structure in the data at all. They treat the time series as a bag of samples. To illustrate, I randomly reordered the time points and recompute the correlation. The visual trends are destroyed, but the correlation remains the same.
# Generate a random ordering of the time points and rearrange the columns
order = rd.permutation(24)
expression = expression[:,order]
# Plot the time series in the permuted order
figure()
plot(expression.T, lw='3')
legend(['A', 'B', 'C'], loc='center left', bbox_to_anchor=(1, 0.5))
title('Toy example - permuted time points')
xlabel('Permuted time')
ylabel('Expression')
# Compute gene-gene correlation on the permuted matrix
print('Permuted time series gene expression correlations')
print('Gene pair\tPearson\t\tSpearman')
print('A-B\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[1,:])[0], st.spearmanr(expression[0,:], expression[1,:])[0])
print('A-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[0,:], expression[2,:])[0], st.spearmanr(expression[0,:], expression[2,:])[0])
print('B-C\t\t%.2f\t\t%.2f') % (st.pearsonr(expression[1,:], expression[2,:])[0], st.spearmanr(expression[1,:], expression[2,:])[0])
So what if a user does want to cluster based on temporal trends instead of the bag of samples correlation? A few existing options include:
Two reviews I wrote with co-authors provide a few specific references to methods implementing these strategies. If there is sufficient interest, I would also consider expanding this discussion to include more references.
Contact information is at my website if you have comments, want additional information, or can't access the reviews.