Transcription Unit Predictions for E. coli K-12

The file "TU.txt" contains our transcription unit (TU) predictions for E. coli K-12, made using the methods described in "Predicting bacterial transcription units using sequence and expression data" (Bockhorst et al., ISMB 2003).

This file contains a set of TU records separated by blank lines. Each TU record consists of seven lines. These lines identify the name of the TU, the location and strand of the TU, the genes contained in the TU, and the evidence used to predict the TU boundaries.

The predictions of our TU model were constrained so that all of the regulatory elements in our training set appear in a TU (note that this contrasts with the methodology used in the article). For example, each of the known operons in our training set is associated with a TU prediction. As a result, certain cases, such as overlapping operons, are reported here that the TU model as described in the article could not predict.

Here is an example TU record:

    TU.5.0
    12047 15346 FORWARD
    dnaK dnaJ
    first-gene-source: [PROMOTER] [START-OF-RUN] [START-OF-OPERON]
    TSS-source: [PROMOTER]
    last-gene-source: [END-OF-OPERON]
    TES-source: [PREDICTION]

The first line contains the name of the TU as "TU.x.y", where x identifies the run the TU is in and y indexes TUs within that run.

The second line indicates the location of the TU as

    <5' position> <3' position> <strand>

where <5' position> is the position of the first base in the TU, <3' position> is the position of the last base in the TU, and <strand> indicates whether the TU lies on the forward (FORWARD) or reverse (REVERSE) strand.

The third line gives the names of all genes that are wholly contained within this TU and lie on the same strand.

The fourth line indicates the evidence used for predicting the first gene of the TU. There are six different cases that cause a gene to be the first gene of a predicted TU.

  (i) [START-OF-OPERON] The gene is the first gene of a known operon.
  (ii) [PROMOTER] A promoter lies in the non-coding region (NCR) upstream of the gene.
  (iii) [START-OF-RUN] The gene is the first gene of a run.
  (iv) [UPSTREAM-OPERON] The upstream neighbor gene is the last gene in a known operon.
  (v) [UPSTREAM-TERMINATOR] A terminator lies in the NCR following the upstream neighbor.
  (vi) [PREDICTION] The TU model predicts the gene to be the first gene in a TU based on DNA sequence and expression data.

At least one of these cases applies to every TU record, and in general multiple cases can apply to a single record. However, [PREDICTION] applies only if none of the other cases applies.

The fifth line indicates the evidence associated with the TSS. There are three possible cases and exactly one applies to every TU:

  (i) [PROMOTER] The TSS is based on a known promoter.
  (ii) [PREDICTION] The TSS is based on the prediction of the TU model.
  (iii) [GENE-START] The TSS refers to the first base of the first gene in the TU.

The last case is used when evidence indicates that the gene is the first gene of a TU (based on case (i), (iv), or (v) of the fourth line) but the upstream non-coding region is too small for our TU model to predict a TSS.

The sixth line indicates the evidence used for predicting the last gene of the TU. As for the first gene, there are six different cases that cause a gene to be the last gene of a predicted TU.

  (i) [END-OF-OPERON] The gene is the last gene of a known operon.
  (ii) [TERMINATOR] A terminator lies in the NCR downstream of the gene.
  (iii) [END-OF-RUN] The gene is the last gene in a run.
  (iv) [DOWNSTREAM-OPERON] The nearest downstream neighbor is the first gene of a known operon.
  (v) [DOWNSTREAM-PROMOTER] A promoter lies in the NCR upstream of the downstream neighbor gene.
  (vi) [PREDICTION] The TU model predicts the gene to be the last gene in a TU based on DNA sequence and expression data.

The seventh line indicates the evidence associated with the TES.
As with the TSS, there are three possible cases and exactly one applies to every TU:

  (i) [TERMINATOR] The TES is based on a known terminator.
  (ii) [PREDICTION] The TES is based on the prediction of the TU model.
  (iii) [GENE-END] The TES refers to the last base of the last gene in the TU.

The last case is analogous to case (iii) for the TSS and comes up when evidence suggests the gene to be the last gene in a TU but the downstream NCR is too small for our TU model to make a prediction.
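
The seven-line record layout described above can be read with a short script. What follows is a minimal sketch in Python and is not part of the original distribution: the names TURecord, parse_record, parse_records and check_record are hypothetical, and the sketch assumes that every record has exactly seven non-empty lines and that evidence tags are written in square brackets as in the example record. The check function covers only the constraints stated in this file (one TSS case, one TES case, and [PREDICTION] as first-gene evidence only when no other case applies).

```python
import re
from dataclasses import dataclass
from typing import List

# Example record copied from this README; used here only for illustration.
EXAMPLE = """TU.5.0
12047 15346 FORWARD
dnaK dnaJ
first-gene-source: [PROMOTER] [START-OF-RUN] [START-OF-OPERON]
TSS-source: [PROMOTER]
last-gene-source: [END-OF-OPERON]
TES-source: [PREDICTION]
"""

TAG = re.compile(r"\[([A-Z-]+)\]")  # matches evidence tags such as [START-OF-RUN]


@dataclass
class TURecord:  # hypothetical container, not defined by the original file
    name: str
    run: int                 # x in "TU.x.y"
    index: int               # y in "TU.x.y"
    five_prime: int          # position of the first base of the TU
    three_prime: int         # position of the last base of the TU
    strand: str              # "FORWARD" or "REVERSE"
    genes: List[str]
    first_gene_source: List[str]
    tss_source: List[str]
    last_gene_source: List[str]
    tes_source: List[str]


def _tags(line: str, label: str) -> List[str]:
    """Return the evidence tags on a 'label: [A] [B]' line."""
    key, _, rest = line.partition(":")
    if key.strip() != label:
        raise ValueError(f"expected '{label}:' but found {line!r}")
    return TAG.findall(rest)


def parse_record(block: str) -> TURecord:
    """Parse one seven-line TU record."""
    lines = [ln.strip() for ln in block.strip().splitlines()]
    if len(lines) != 7:
        raise ValueError(f"expected 7 lines, found {len(lines)}")
    _, run, index = lines[0].split(".")
    five, three, strand = lines[1].split()
    if strand not in ("FORWARD", "REVERSE"):
        raise ValueError(f"unknown strand {strand!r}")
    return TURecord(
        name=lines[0], run=int(run), index=int(index),
        five_prime=int(five), three_prime=int(three), strand=strand,
        genes=lines[2].split(),
        first_gene_source=_tags(lines[3], "first-gene-source"),
        tss_source=_tags(lines[4], "TSS-source"),
        last_gene_source=_tags(lines[5], "last-gene-source"),
        tes_source=_tags(lines[6], "TES-source"),
    )


def parse_records(text: str) -> List[TURecord]:
    """Split the file contents on blank lines and parse each record."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [parse_record(b) for b in blocks if b.strip()]


def check_record(rec: TURecord) -> List[str]:
    """Report violations of the constraints stated in this README."""
    problems = []
    if len(rec.tss_source) != 1:
        problems.append("TSS-source must list exactly one case")
    if len(rec.tes_source) != 1:
        problems.append("TES-source must list exactly one case")
    if not rec.first_gene_source:
        problems.append("first-gene-source must list at least one case")
    if "PREDICTION" in rec.first_gene_source and len(rec.first_gene_source) > 1:
        problems.append("[PREDICTION] must not be combined with other first-gene cases")
    return problems
```

To process the whole file under the same assumptions, read "TU.txt" into a string and pass it to parse_records, for example parse_records(open("TU.txt").read()). The sketch keeps the 5' and 3' positions exactly as written and does not assume which one is numerically larger.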