Introduction to perl

All statisticians should be proficient in C (for speed), perl (for data manipulation), and R (for interactive analyses and graphics). Think "CPR".

I find the perl programming language absolutely crucial for manipulating data files. For my work in statistical genetics, virtually every new set of data that I receive is in a unique format and so must be converted into one or each of the various formats required by my own software and others' software. (Moreover, I often wish to combine such data with other information such as the locations of genetic markers and so forth.) To do such manipulation “by hand” would be a nightmare. Having a program to do such things is valuable for at least three reasons: (a) it's faster, (b) it avoids errors, and (c) it can quickly be redone if the primary data are modified.

Most C programmers likely realize that it would be a real pain to do such manipulation of text files in a C program. But many things that are hard to do in C are easy in perl. Perl is a combination of the unix utilities sed, awk and also shell scripting. It is excellent for text manipulation, and one should make special use of its regular expression facilities.

I should say that another major use of perl, for me, is to run simulations testing/using other people's programs, pulling out summary bits from the output of such programs. I can (and sometimes do) write the simulation part within perl, but mostly I use the shell script features of perl to repeatedly call other programs, and the text processing features to pull out the interesting pieces from the output of other programs.

In this document, I will try to assist the reader in getting started in perl programming. Perl can do a mess of things. I'll focus solely on the manipulation of text files, since that is probably the most important use of perl for statisticians.

If you have any suggestions for improving this document, please send me an email (kbroman at biostat.wisc.edu).

Topics

Resources

Before I get started, I need to mention some books and web resources:

Books

Tutorials

Other links

Getting started

Perl is an interpreted language (as opposed to a compiled language). The perl interpreter is usually located in /usr/bin/perl. On the Hopkins biostatistics system, the perl libraries and so forth are located in /opt/LWperl. (That's important, because the perl libraries [see CPAN] contain enourmous amounts of excellent code that we can reuse.

To get a perl program to run, you place the code in a text file, say program.pl. The first line of the program should be

#!/usr/bin/perl -w

Then make the file executable using (from the unix command line)

chmod +x program.pl

Then you just type program.pl to get the thing to run. [Alternatively, you can skip the #!/usr/bin/perl and chmod bit and type perl -w program.pl.] Note: the -w switch turns on perl's warnings facility to help you to find errors in your code. Type perl -h or man perl for a complete list of such switches. (I'm embarassed to admit that I seldom use the -w switch myself.)

The usual stupid first example

Following tradition, I'll give a stupid first example to show how to get a perl program to run. We'll consider a program that prints the phrase, “Hello, world!”

Here is the code:

#!/usr/bin/perl -w

print("Hello, world!\n");

Here is some sample input and output:

> hello.pl 
Hello, world! 
> 

Comments:

A much more complex example

Here I will discuss a fabricated but relatively realistic example of reorganizing a set of data. I must warn you...this is maybe a bit too complicated for a second example.

I received, from a collaborator, three files: a file giving the genotype data for a set of individuals at a set of genetic markers on a single chromosome, a file giving the parents and sex of each individual, and a file giving the order of the markers along the chromosome. I want to combine the data into a single file in the format used by the program CRI-MAP (a .gen file).

Input data

Here is the input data:

Output data

Here is the output data, in the format I want (as a CRI-MAP .gen file):

One might say, perhaps, that this is more of a mess then the original data, but at least now it's all in one file, and (more importantly), it's now ready for use with the CRI-MAP program.

Perl program

Now, here's the perl program to do the conversion. It's not very long, but there are a lot of rather complicated things going on, which I will explain a bit at a time below. Note: I've put this in an html file so that you can view it easily. If you want to save it and try running it, be sure to first clip off the html code at the top of the file.

Before I go into the details of the code, I should give an important caveat. I'm not a terribly good perl programmer: I likely break a lot of the rules, and my code is not very efficient or concise. Still, my programs generally do what I want them to do. One can use a variety of different approaches to any one coding problem in perl. The approaches I've taken are not necessary the best (but they do work).

Okay, now here's the detailed explanation of the code.

#!/usr/bin/perl
#
# Combine the data in "genotypes.txt", "markers.txt" and "families.txt" 
# and convert into a CRI-MAP .gen file.

The first line is as I'd mentioned above: it's needed so that the unix shell knows to call the perl compiler. Note that I dropped the -w switch, which turns on perl's warnings, because they end up saying a bunch of stuff about uninitialized variables which I find painful. Also note that the # symbol indicates a comment. Wherever it appears, the rest of the line is treated as a comment (and thus ignored by the compiler).

# file names
$gfile = "genotypes.txt";  # genotype data
$mfile = "markers.txt";    # list of markers, in order
$ffile = "families.txt";   # family information 
$ofile = "data.gen";       # output file

Here I define the names of the three input files and the output file. In perl, scalar variables consist of a name preceded by a $ symbol. (I found the $, @ and % symbols very confusing when I was first learning perl, since it is so different than C.) Note that I do not declare variables before I start using them. Also note that (like C) each statement ends with a semicolon. A funny (but very convenient) feature of perl is that variables are not explicitly integers, doubles or strings. Perl converts scalars from integer to double to string according to context; we seldom need to worry about it.

# read marker names and place in the vector @markers
open(IN, $mfile) or die("Cannot read from $mfile");

This line opens the file $mfile for reading, using the “file handle” IN. If the file fails to open (e.g., if it doesn't exist or we don't have read permission), open returns a 0, and so the other half of the or statement is executed. The statement die stops the program, printing its contents. A nice feature of perl is that it will evalute any variables inside the quotes, so that the name of the marker data file is printed rather than the variable name $mfile. If the file is successfully opened, a non-zero value is returned, and the right side of the or statement is ignored. One could use a more complex if statement, but this is concise and (with practice) quite readable.

while($line = <IN>) {
    chomp($line);
    push(@markers, $line);
}
close(IN);

This group of code reads each successive line in the $mfile file, strips off the newline characters at the end of each line, and then builds up a vector @markers containing the marker names in the same order as in the file.

The while statement loops until its argument is zero.

In the $line = <IN> statement, the <IN> part reads a single line from the file (that's how the file handle IN is used) and places it into the scalar $line. When the end of the file is reached, this statement has value zero, and so the while loop is broken.

chomp($line) cuts off the ending “newline” character from $line, if there is one. (Compare this to chop, which cuts off the last character no matter what it is.) Note: Be careful when you move files between Windows (and perhaps MacOS) and unix, since the end-of-line characters are different between them. This has been a constant source of irritation for me. The perl script clean (which I got from the guys in the UC Berkeley Statistical Computing Facility, to whom I am eternally grateful for many reasons) fixes text files from Windows to conform to the unix format. Use it.

@markers is a vector (i.e. an ordered set of scalars). The @ denotes vector just as $ denoted scalar.

The function push places the scalar $line onto the end of the vector @markers.

close(IN) closes the file $mfile. I don't think it's absolutely necessary, but I always do it.

(That sure was a lot of explantion for five lines of code! Maybe I should be chopping things up into smaller bits.)

# read family information and place in the hashes %dad, %mom, %sex
open(IN, $ffile) or die("Cannot read from $ffile");
$line = <IN>;
while($line = <IN>) {
    chomp($line);

This set of lines should now be clear from the above.

    @v = split(/\s+/, $line);

In this line, the function split chops the scalar $line at the regular expression \s+. The two / symbols indicate a “regular expression” (a pattern to be matched), \s stands for any “white space” (spaces, tabs, etc.), and + indicates one or more bits of white space. The return value is a vector, @v composed of the bits of $line, now split up.

    if($v[0] eq "") { shift @v; }

The point of this lines is that if $line has leading whitespace, the first element of the vector returned from split will be empty. Here we look for that, and dump the empty element if there is one. Thus a line like

     1          1      0      0   1

is initially converted into the vector ("",1,1,0,0,1), and after this line it becomes (1,1,0,0,1).

$v[0] is the way that we refer to a single element of a vector. Vectors are indexed by 0, 1, 2, etc. We could refer to a “subvector” by @v[(1,3,5)]. Note that when we use @v in a “scalar” context, its value taken to be the length of the vector. Thus @v[(1..(@v-1))] is @v but dropping the first element. (I guess I should mention that ($a..$b) creates a vector with a sequence of numbers from $a to $b, inclusive, with increments of 1.

eq is for string comparison (compare this to == for numeric comparison)...this part tests whether $v[0] exactly matches the empty string "". Note that "" is different than "0" and 0 when using eq but not when using ==. This has been a recurring source of stupid bugs for me.

if is obviously a conditional statement. Note that, unlike C, you must always use the brackets with if statements and such, even if (as here) there is only a single line in the “then” part of the statement.

Now, if the first element of the @v vector is empty, the shift @v; statment is executed. shift will dump the first element of the vector. (Compare this to pop.) Note that shift @v; could be replaced with @v = @v[1..(@v-1)];

    ($fam, $ind, $dad, $mom, $sex) = @v;

This line is not really necessary, but I find it convenient. It places the first five components of the @v vector into $fam, $ind, etc., respectively. Thus I can refer to them as $fam and so forth rather than as $v[0], $v[1], ..., $v[4].

    $dad{$fam}{$ind} = $dad;
    $mom{$fam}{$ind} = $mom;

Oog. This brings up the issue of hashes. These are very important and incredibly useful, but took me a while to figure out (especially in figuring out why I should use them). Let me try to explain them.

A scalar is a single number/string, and is given a name like $scalar. A vector is an ordered set of scalars indexed by the whole numbers 0, 1, 2, ..., and is given a name like @vector. We refer to particular elements of the vector by $vector[0], $vector[1], etc.

Now a hash is an (unordered) set of scalars indexed by another set of scalars (called keys), and is given a name like %hash. We refer to particular elements of the hash using an expression like $hash{$scalar}. For example, we might have $color{"sky"} = "blue"; and $color{"dirt"} = "brown";. Note the use of braces {} rather than brackets [] (as is used for vectors) or parentheses () (as is used for functions/subroutines).

The %dad and %mom hashes that I am forming in these lines are doubly indexed hashes. Each individual in these data has a family ID and an individual ID. (The individual IDs are unique within a family but not necessarily between families.) So here I'm building up tables indicating the ID for the father and mother of each individual in each family.

I find hashes useful for at least two reasons. First, a typical study has a bunch of variables measure on each of a set of subjects. I find it very convenient to index the measured variables by the subjects' IDs rather than keep track of an ordered vector for each variable. (It's particularly convenient for sorting things but subject ID, which we'll see below.) Second, when confronted with a categorial variable with an unknown set of possible values, hashes make it very convenient to identify the set of distinct possible values (using the expression keys %hash...the keys returns a vector of the keys to the hash).

This explanation is probably insufficient. There will be more below, and you'll have to dig into some other books/tutorials. Hopefully you now at least vaguely understand some of the possibilities.

    if($sex == 2) {   # make female = 0 rather than = 2 
	$sex{$fam}{$ind} = 0;
    }
    else {  
	$sex{$fam}{$ind} = $sex;
    }
}
close(IN);

This is like the previous bit. Here I'm also switching the sex = 2 to sex = 0, since while the input data coded female as 2, CRI-MAP codes it as 0. I then close the $ffile since I'm done with it.

# read genotype data
open(IN, $gfile) or die("Cannot read from $gfile");

# header line : parse family/individual IDs
$line = <IN>; chomp($line);
@v = split(/\s+/, $line);
if($v[0] eq "") { shift @v; } # if first entry is blank, get rid of it
shift @v; # get rid of the first entry "Marker"

I think we should now understand this part. I open the genotype data file, read the initial header line, chomp off any extra newline character, split it into a vector at any white space, dump the first element if it's empty (if there's leading whitespace on the line), and dump the first element ("Marker") since I'm not interested in it.

foreach $v (@v) {
    ($fam,$ind) = split(/-/, $v); # split at '-'
    push(@fam, $fam);
    push(@ind, $ind);
}

The remaining elements from the header line are of the form familyID-indID. We wish to create two vectors, one containing all of the family IDs and the other containing all of the individual IDs.

foreach $v (@v) loops through the elements of the vector @v one at a time in order; the element is placed in $v and the body of the statement is executed.

split breaks the $v scalar at the hyphen and places the part in front of the hyphen in $fam and the part after in $ind.

The two push lines, as we'd seen earlier, place the scalars $fam and $ind onto the end of the vectors @fam and @ind, respectively. Notice that I don't need to define or declare these vectors before I use them.

# parse rest of file 
while($line = <IN>) {

    $marker = substr($line, 0, 9);
    $marker =~ s/\s+//g; # remove any extra spaces

Here we read the rest of the file, one line at a time. With these lines, we need to work with particular columns (characters): the marker name is in columns 1-9; the genotypes follow that, taking up 7 columns for each individual.

substr($str,$skip,$length) picks out a portion from the string $str, first skipping $skip characters and picking out a total of $length characters. So here we pick out the first nine characters from $line.

The next line removes any extra whitespace from the marker name. This is another use of regular expressions. The =~ symbol says to apply the pattern matching on the right to the scalar on the left. The s///g construction does the search-and-replace bit. The s indicates search-and-replace. The stuff between the first and second / is the regular expression to search for (here \s+, which as we saw before indicates whitespace). The stuff between the second and third / is what to replace the found whitespace with. Since we put nothing there, the whitespace is deleted. The final g indicates to do this globally throughout the string. If we'd left off the g, the line would only delete the first bit of whitespace found.

I should emphasize...the last line here gives you a glimpse at the power of perl. I recommend that you look into =~ and // and s/// further!

    foreach $i (0..(@fam-1)) {
	$g = substr($line, 9+$i*7, 7); 
	($g1, $g2) = split(/\//, $g);  # split at '/'
	$g1 =~ s/\s+//g; # remove any extra spaces
	$g2 =~ s/\s+//g;

Here we have another foreach loop. Recall that when we refer to a vector in a scalar context, it acts like a scalar whose value is the length of the vector. So here (0..(@fam-1)) creates a vector 0, 1, 2, ... up to 1 less than the length of the @fam vector. (Thus it gives the indices for the @fam vector. So the foreach loop repeatedly executes the stuff between the braces, with $i incrementing through the indices of the @fam (or equivalently the @ind) vector.

In the next line, we use substr to pull out the appropriate seven columns from $line.

We then split this at the / symbol. Note that I need to use \/ rather than just / so that the two regular expression /s aren't confused with their contents.

The last two lines in this group remove any extra spaces. This is not really necessary, but it's helpful.

	if($g1 == 0 or $g2==0) { # replace blanks with 0's
            $g1 = 0;
            $g2 = 0;
        }  

Here we replace ensure that if the missing values (blank in the input data) become 0's (which is what we want in the output data). We also make sure that either both $g1 and $g2 are missing or neither are missing. (The program CRI-MAP doesn't like to have one allele missing and one not.)

The == symbol checks whether $g1 or $g2 is numerically equivalent to 0 (either blank or actually a 0). If either is 0, we give them the explicit value 0. Note the use of or. I like this a lot. You could also use || as in C, but or is nicer to read.

	$gen{$fam[$i]}{$ind[$i]}{$marker} = $g1 . " " . $g2;
	
    }
}
close(IN);

To end this loop and the reading of the genotype data, we form a triply indexed hash (indexed by family ID, individual ID, and marker name) containing the genotype as two alleles separated by a single space. The periods do string concatenation.

# write genotype data as cri-map .gen file
open(OUT, ">$ofile") or die("Cannot write to $ofile");

Now we've read in all of the data and are ready to write it to the output file in CRI-MAP format. This line opens a file for writing. Note the > symbol within the filename indicates writing. Because perl interpolates (is that the right word?) the values of variables within quotes, ">$ofile" becomes the string ">data.gen" (see the definition of $ofile way back at the beginning of this perl script).

$nfam = keys %gen;      # number of families

keys %gen gives a vector of all of the unique keys to the hash %gen, which is a triply indexed hash. Since the innermost index is the family ID, this statement gives a vector with the distinct family IDs. Since this vector resides in a scalar context, $nfam takes the length of the vector (that is, the number of families).

print OUT ("$nfam\n");

We now print the value $nfam to the file $ofile. Recall that \n indicates a newline character. Note that print OUT("$nfam\n"); doesn't work...you need the space between OUT and the (.

$nmar = @markers;       # number of markers
print OUT ("$nmar\n");
foreach $mar (@markers) {
    print OUT ("$mar\n");
}

Is this part clear, or am I just getting tired? Print out the number of markers followed by the marker names (in the order they had appeared in the markers.txt file), one per line.

foreach $fam (sort numerically keys %gen) {  # families in numerical order

Here we use a foreach loop again to loop one-by-one through the family IDs. Recall that keys %gen gives a vector containing the set of unique family IDs. Since the keys to hashes are in no special order, and since I wish to print out the families in numerical order (not really necessary, but I want to), we use sort to sort the vector. The sort function by default sorts things by alphabetical type order (placing 111 ahead of 20), we need to use a subroutine in order to get sort to put things into numerical order. I'll try to explain the numerically subroutine below, where it appears at the end of the perl script.

    print OUT ("$fam\n");

We print the family name to the output file.

    $nind = keys %{$gen{$fam}};  # number of individuals in family
    print OUT ("$nind\n");

This is a bit tricky. %gen refers to the overall hash. But since this is a triply indexed hash, %{$gen{$fam}} refers to the part of the hash corresponding to the family $fam (whatever family we happen to be on so far in the loop). The keys to the hash %{$gen{$fam}} are the individual IDs for family $fam. Thus $nind takes the number of individuals in this family, which we print to the output file (on a separate line).

If you study the brackets and % and $ symbols here, I think you'll make a good step towards how to work with these multiply indexed hashes and arrays, but no doubt further study/reading will be necessary. It may take a while, but it all will become clear eventually.

    foreach $ind (sort numerically keys %{$gen{$fam}}) {
	print OUT ("$ind $mom{$fam}{$ind} $dad{$fam}{$ind} $sex{$fam}{$ind}\n");

Now we loop through each individual (sorted numerically...see below) in the family $fam and print out a line containing his/her ID, mother, father and sex.

	foreach $mar (@markers) {  # markers in order
	    print OUT ("$gen{$fam}{$ind}{$mar} ");
	}

	print OUT ("\n");
    }
}
close(OUT);

Here we loop through the marker names, in the order in which they appeared in the marker.txt file. We print, on a single line, the genotypes for the individual, separated by spaces. There will be an extra space at the end of the line, but that doesn't matter. Then we close the output file, and we're done (except for explaining that subroutine).

# subroutine to allow sorts by numerical order
sub numerically { $a <=> $b; }

Subroutines are very useful in breaking up complex perl programs into pieces; if the pieces are chosen carefully, you can create nicely modular, reuseable code.

The definition of a subroutine begins with the word sub followed by the name of the subroutine. Then the definition is given in brackets.

This is a special kind of subroutine to be used by the sort function. The one line in the subroutine takes value 0 if $a is the same (numerically) as $b, and is positive (negative) if $a is bigger (smaller) than $b. I won't explain this further, but will leave it to you to read further.

Other things

To print an array separated by something other than spaces (for example, commas) use the function join:

#print @res separated by commas
print OUT (join(",", @res), "\n");


Back to home page

Last modified: Mon Feb 4 08:48:50 2013