Conjugate Dirichlet Process Mixture Models: Gene Expression,
Efficient Sampling, and Clustering

David Dahl, PhD Candidate, Departments of Statistics and Biostatistics & Medical Informatics, UW-Madison

Friday, January 23, 2004, 12-1 p.m.

132 WARF Building, 610 Walnut St.

ABSTRACT

This talk proposes a novel conjugate Dirichlet process mixture (DPM) model for the analysis of gene expression data, introduces a new MCMC sampling algorithm for fitting general conjugate DPM models, and describes a quick mode-finding algorithm for clustering in a particular class of conjugate DPM models. Since biologists are typically interested in expression patterns across a variety of treatment conditions, the proposed model clusters genes with similar patterns of expression (rather than similar levels of expression) and naturally accommodates any number of treatment conditions. Further, hypotheses are easily tested and false discovery rates are readily estimated. The second part of the talk addresses the formidable computational issues that arise in the use of DPM models by introducing a new MCMC sampling algorithm for any conjugate DPM model, not just the gene expression model. The new algorithm is a merge-split sampler that uses ideas similar to those in sequential importance sampling. Simulations indicate that the proposed sampler can be significantly faster than existing methods. Finally, for the case of two treatment conditions, a very quick clustering algorithm is introduced that is guaranteed to find the mode of the posterior clustering distribution in a class of conjugate DPM models. Pre-prints are available at http://www.stat.wisc.edu/~dbdahl.
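
For readers new to DPM models: a Dirichlet process prior induces a prior over partitions of the items (here, genes) known as the Chinese restaurant process. Each item in turn joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The sketch below draws one partition from that prior only. It is not the merge-split sampler, the gene expression model, or the mode-finding algorithm described in the talk; the function name `sample_crp_partition`, the parameter name `alpha`, and the values used are illustrative assumptions.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw one partition of n_items from a Chinese restaurant process.

    alpha is the (assumed) concentration parameter; larger values
    tend to produce more clusters.
    """
    labels = []  # labels[i] is the cluster index assigned to item i
    sizes = []   # sizes[k] is the number of items currently in cluster k
    for _ in range(n_items):
        # Existing cluster k gets weight sizes[k]; a new cluster gets weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, sizes


rng = random.Random(0)  # fixed seed so the draw is reproducible
labels, sizes = sample_crp_partition(n_items=20, alpha=1.0, rng=rng)
print("cluster labels:", labels)
print("cluster sizes: ", sizes)
```

A full DPM sampler would combine these prior weights with the likelihood of each item's data under each cluster, which the conjugacy assumption makes available in closed form.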