Summary by Lisa Chung 11/19/2008

********************************
*  Data and Analysis Methods   *
********************************
- Data:
HPV dataset with 5 tissue types (NORMAL-CIN1-CIN2-CIN3-CANCER)
Here, we combine CIN1 and CIN2 as an EARLY tissue type
-------------------------------
| NORMAL  EARLY   CIN3 CANCER | 
|     24     36     40     28 | : total of 128 arrays
------------------------------- 

- Feature selection: 

1) A two-way ANOVA (tissue type and batch effect, with no interaction) is performed for each probeset.

2) Benjamini-Hochberg correction is used for p-value adjustment, with cutoff = 0.05.

3) Out of 54675 probesets, *6663 probesets* show a significant difference in gene expression across tissue types.
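
The p-value adjustment in step 2 can be sketched as follows. This is a minimal pure-Python illustration of the standard Benjamini-Hochberg procedure, not the code used for this analysis; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (hypothetical, one per probeset).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adj = benjamini_hochberg(pvals)
# Probesets kept at the 0.05 cutoff used above.
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

In the analysis itself the input would be the tissue-type p-value from the ANOVA for each of the 54675 probesets.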

- EB calculation:
1) With all 51 possible patterns (excluding the null), apply emfit with 20 iterations to estimate the model parameters.

2) With all 540 possible ordered structures (excluding the null), apply emfit with 0 and 100 iterations.

3) After the iterations, compute the probability of each structure given the expression data for every gene (Pr(structure|data)).
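
As a sanity check on the counts above: 51 and 540 coincide with the number of unordered and ordered groupings of five groups once the all-equal (null) grouping is excluded. The sketch below only reproduces that arithmetic; the function names are hypothetical and it says nothing about how emfit enumerates patterns.

```python
from math import comb


def bell(n):
    """Number of ways to split n labelled groups into unordered blocks."""
    b = [1]
    for k in range(n):
        b.append(sum(comb(k, j) * b[j] for j in range(k + 1)))
    return b[n]


def ordered_bell(n):
    """Number of orderings of n labelled groups when ties are allowed."""
    a = [1]
    for k in range(1, n + 1):
        a.append(sum(comb(k, j) * a[k - j] for j in range(1, k + 1)))
    return a[n]


# Subtract 1 in each case to drop the null (all groups equal).
n_patterns = bell(5) - 1
n_ordered = ordered_bell(5) - 1
```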

- Structure Assignment rule:
1) maximum assignment rule
Assign the structure with the maximum Pr(structure|data).

2) cutoff = 0.9
Assign a structure only if the maximum Pr(structure|data) is greater than 0.9 (a more stringent rule).

3) After structure assignment based on the EB probabilities, I removed a few probesets whose average log2 expression isn't consistent with the assigned structure.
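
The assignment rules above can be sketched as below. This is an illustration only: the function names and toy numbers are hypothetical, and the consistency check is an assumption (group means must follow the strict ordering implied by the structure); the criterion actually used in step 3 may differ.

```python
def assign_structure(posteriors, cutoff=None):
    """Pick a structure for one probeset.

    posteriors: dict mapping structure label -> Pr(structure|data).
    cutoff=None gives the maximum assignment rule; cutoff=0.9 gives the
    more stringent rule.
    """
    best = max(posteriors, key=posteriors.get)
    if cutoff is not None and posteriors[best] <= cutoff:
        return "Unassigned"
    return best


def is_consistent(structure, means):
    """Assumed check: a lower structure level must have a lower mean."""
    n = len(structure)
    return all(
        means[i] < means[j]
        for i in range(n)
        for j in range(n)
        if structure[i] < structure[j]
    )


# Toy posterior probabilities for one probeset (hypothetical values).
post = {"(1,1,2,2)": 0.85, "(1,2,2,2)": 0.10, "(1,1,1,2)": 0.05}
by_max = assign_structure(post)
by_cutoff = assign_structure(post, cutoff=0.9)

# Toy average log2 expression for NORMAL, EARLY, CIN3, CANCER.
ok = is_consistent((1, 1, 2, 2), [7.1, 7.3, 8.4, 8.9])
bad = is_consistent((1, 1, 2, 2), [7.1, 8.5, 8.4, 8.9])
```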


- Gene Enrichment Calculation: 4-set analysis
set1: Normal < (Early, CIN3, CANCER) or  Normal > (Early, CIN3, CANCER)
      i.e. (1,2,2,2) or (2,1,1,1)
set2: (Normal, Early) < (CIN3, CANCER) or  (Normal, Early) > (CIN3, CANCER)
      i.e. (1,1,2,2) or (2,2,1,1)
set3: (Normal, Early, CIN3) < CANCER or  (Normal, Early, CIN3) > CANCER
      i.e. (1,1,1,2) or (2,2,2,1)
set4: Normal < Early < CIN3 < CANCER or  Normal > Early > CIN3 > CANCER
      i.e. (1,2,3,4) or (4,3,2,1)

set1, set2, and set3 are groups of genes that change *only* at a single
transition point.

I collected probesets for each structure using the *maximum assignment rule after 100 EM iterations*.
I removed a few probesets whose mean expression levels are not consistent with the structure assignment.
For each set, the probeset score is set to 1 if the probeset is in the set and to 0 otherwise. This binary score is used for the gene enrichment calculation with the allez package.
I listed interesting GO/KEGG pathways with z.score > 4 and number of genes > 4.
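
The set definitions and binary scoring described above can be sketched as follows. This is a hypothetical illustration: the probeset IDs and the `binary_scores` helper are made up, and the enrichment step itself (allez) is not reproduced here.

```python
# Structures are written as (NORMAL, EARLY, CIN3, CANCER) levels.
SETS = {
    "set1": {(1, 2, 2, 2), (2, 1, 1, 1)},
    "set2": {(1, 1, 2, 2), (2, 2, 1, 1)},
    "set3": {(1, 1, 1, 2), (2, 2, 2, 1)},
    "set4": {(1, 2, 3, 4), (4, 3, 2, 1)},
}


def binary_scores(assignments, set_name):
    """Score each probeset 1 if its assigned structure is in the set, else 0."""
    members = SETS[set_name]
    return {probe: int(struct in members) for probe, struct in assignments.items()}


# Toy structure assignments (hypothetical probeset IDs).
assignments = {
    "probe_A": (1, 2, 2, 2),
    "probe_B": (2, 2, 1, 1),
    "probe_C": (1, 2, 3, 4),
    "probe_D": "Unassigned",
}
scores_set1 = binary_scores(assignments, "set1")
scores_set2 = binary_scores(assignments, "set2")
```

One score vector per set would then be passed to the enrichment calculation.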

*************
*  Files:   *
*************
Average log2 expression and structure assignment for all 6663 probesets:
All.dataTable.txt
(* in structure assignment,
   Unassigned: unassigned because of low probabilities
   Unassigned.by.fc: unassigned due to inconsistency between structure and average expression)

Gene Enrichment calculation result (which I showed at the last meeting, 11/06):
GeneEnrichment-4sets.xls 

List of genes in each set (which I showed at the last meeting, 11/06):
geneList-4sets.xls

Trajectory plots (based on the raw, unlogged scale):
trajectory-4-set-max.png, trajectory-4-set-9.png: 
trajectory plots for 4-set analysis with max. rule and with cutoff = 0.9, respectively

trajectory-top20-max.png, trajectory-top20-9.png:
trajectory plots for top 20 structures with max. rule and with cutoff = 0.9, respectively