All statisticians should be proficient in C (for speed), perl (for data manipulation), and R (for interactive analyses and graphics). Think "CPR".
I find the perl programming language absolutely crucial for manipulating data files. For my work in statistical genetics, virtually every new set of data that I receive is in a unique format and so must be converted into one or each of the various formats required by my own software and others' software. (Moreover, I often wish to combine such data with other information such as the locations of genetic markers and so forth.) To do such manipulation by hand would be a nightmare. Having a program to do such things is valuable for at least three reasons: (a) it's faster, (b) it avoids errors, and (c) it can quickly be redone if the primary data are modified.
Most C programmers likely realize that it would be a real pain to do such manipulation of text files in a C program. But many things that are hard to do in C are easy in perl. Perl is a combination of the unix utilities sed, awk and also shell scripting. It is excellent for text manipulation, and one should make special use of its regular expression facilities.
I should say that another major use of perl, for me, is to run simulations testing/using other people's programs, pulling out summary bits from the output of such programs. I can (and sometimes do) write the simulation part within perl, but mostly I use the shell script features of perl to repeatedly call other programs, and the text processing features to pull out the interesting pieces from the output of other programs.
In this document, I will try to assist the reader in getting started in perl programming. Perl can do a mess of things. I'll focus solely on the manipulation of text files, since that is probably the most important use of perl for statisticians.
If you have any suggestions for improving this document, please send me an email (kbroman at biostat.wisc.edu).
Before I get started, I need to mention some books and web resources:
Perl is an interpreted language (as opposed to a compiled
language). The perl interpreter is usually located in
/usr/bin/perl
. On the Hopkins biostatistics
system, the perl libraries and so forth are located in
/opt/LWperl
. (That's important, because the perl
libraries [see CPAN] contain
enourmous amounts of excellent code that we can reuse.
To get a perl program to run, you place the code in a text file,
say program.pl
. The first line of the program should be
#!/usr/bin/perl -w
Then make the file executable using (from the unix command line)
chmod +x program.pl
Then you just type program.pl
to get the thing to run.
[Alternatively, you can skip the #!/usr/bin/perl
and chmod
bit and type perl -w program.pl
.]
Note: the -w
switch turns on perl's warnings
facility to help you to find errors in your code. Type
perl -h
or man perl
for a complete list of such
switches. (I'm embarassed to admit that I seldom use the
-w
switch myself.)
The usual stupid first example
Following tradition, I'll give a stupid first example to show how to get a perl program to run. We'll consider a program that prints the phrase, Hello, world!
Here is the code:
#!/usr/bin/perl -w
print("Hello, world!\n");
Here is some sample input and output:
> hello.pl
Hello, world!
>
Comments:
#!/usr/bin/perl
as the
first line, and then use chmod +x theprog.pl
to make the
file executable.
print
(obviously) is a command for printing. Like
in C, statements must end with a semicolon. Also like in C,
\n
is the code for a newline.
Here I will discuss a fabricated but relatively realistic example of reorganizing a set of data. I must warn you...this is maybe a bit too complicated for a second example.
I received, from a collaborator, three
files: a file giving the genotype data for a set of individuals at a
set of genetic markers on a single chromosome, a file giving the
parents and sex of each individual, and a file giving the order of the
markers along the chromosome. I want to combine the data into a
single file in the format used by the program CRI-MAP
(a .gen
file).
Input data
Here is the input data:
genotypes.txt
[the genotype data]
Each row corresponds to a single marker. The genotype data for
an individual is a pair of numbers separated by /
.
Each genotype is given seven characters, with no spaces between
columns, and missing genotypes are left completely blank. The
individuals are identified in the first row, with family ID and
individual ID separated by a hyphen. The markers have been
sorted by their names.
families.txt
[the family data]
Each row corresponds to a single individual, and is relatively straight forward. In the father and mother columns, 0's (for the founders in each family) indicate missing. In the sex column, 1 = male and 2 = female.
markers.txt
[the genetic markers]
This is a simple list of the markers in their order along the chromosome.
Here is the output data, in the format I want (as a
CRI-MAP
.gen
file):
data.gen
The first line gives the number of families. The second line gives the number of markers. The next set of lines gives the marker names, in their order along the chromosome.
Following that is a group of lines for each family: the family ID, the number of individuals in the family, and two lines for each individual: the first line contains the individual ID, the IDs of the individual's mother and father, and then sex (1 = male; 0 = female); the second line contains the genotype data, pairs of numbers all separated by spaces, with things in the same order as the marker names and 0's in place of missing data.
One might say, perhaps, that this is more of a mess then the original
data, but at least now it's all in one file, and (more importantly),
it's now ready for use with the CRI-MAP program.
Perl program
Now, here's the perl program to do the conversion. It's not very long, but there are a lot of rather complicated things going on, which I will explain a bit at a time below. Note: I've put this in an html file so that you can view it easily. If you want to save it and try running it, be sure to first clip off the html code at the top of the file.
Before I go into the details of the code, I should give an important caveat. I'm not a terribly good perl programmer: I likely break a lot of the rules, and my code is not very efficient or concise. Still, my programs generally do what I want them to do. One can use a variety of different approaches to any one coding problem in perl. The approaches I've taken are not necessary the best (but they do work).
Okay, now here's the detailed explanation of the code.
#!/usr/bin/perl
#
# Combine the data in "genotypes.txt", "markers.txt" and "families.txt"
# and convert into a CRI-MAP .gen file.
The first line is as I'd mentioned above: it's needed so that the
unix shell knows to call the perl compiler. Note that I dropped the
-w
switch, which turns on perl's warnings, because they
end up saying a bunch of stuff about uninitialized variables which I
find painful. Also note that the #
symbol indicates a
comment. Wherever it appears, the rest of the line is treated as a
comment (and thus ignored by the compiler).
# file names
$gfile = "genotypes.txt"; # genotype data
$mfile = "markers.txt"; # list of markers, in order
$ffile = "families.txt"; # family information
$ofile = "data.gen"; # output file
Here I define the names of the three input files and the output
file. In perl, scalar variables consist of a name preceded by a
$
symbol. (I found the $
, @
and %
symbols very confusing when I was first learning
perl, since it is so different than C.) Note that I do not declare
variables before I start using them. Also note that (like C) each
statement ends with a semicolon. A funny (but very convenient)
feature of perl is that variables are not explicitly integers, doubles
or strings. Perl converts scalars from integer to double to string
according to context; we seldom need to worry about it.
# read marker names and place in the vector @markers
open(IN, $mfile) or die("Cannot read from $mfile");
This line opens the file $mfile
for reading, using
the file handle IN
. If the file fails to
open (e.g., if it doesn't exist or we don't have read permission),
open
returns a 0, and so the other half of the
or
statement is executed. The statement die
stops the program, printing its contents. A nice feature of perl is
that it will evalute any variables inside the quotes, so that the name
of the marker data file is printed rather than the variable name
$mfile
. If the file is successfully opened, a non-zero
value is returned, and the right side of the or
statement
is ignored. One could use a more complex if
statement,
but this is concise and (with practice) quite readable.
while($line = <IN>) {
chomp($line);
push(@markers, $line);
}
close(IN);
This group of code reads each successive line in the
$mfile
file, strips off the newline characters at the end
of each line, and then builds up a vector @markers
containing the marker names in the same order as in the file.
The while
statement loops until its argument is zero.
In the $line = <IN>
statement, the
<IN>
part reads a single line from the file (that's
how the file handle IN
is used) and places it into the
scalar $line
. When the end of the file is reached, this
statement has value zero, and so the while
loop
is broken.
chomp($line)
cuts off the ending newline
character from $line
, if there is one. (Compare this to
chop
, which cuts off the last character no matter what it
is.) Note: Be careful when you move files between Windows (and
perhaps MacOS) and unix, since the end-of-line characters are different
between them. This has been a constant source of irritation for me.
The perl script clean
(which I
got from the guys in the UC Berkeley Statistical Computing Facility, to whom I am
eternally grateful for many reasons) fixes text files from Windows to
conform to the unix format. Use it.
@markers
is a vector (i.e. an ordered set of scalars).
The @
denotes vector just as $
denoted
scalar.
The function push
places the scalar
$line
onto the end of the vector @markers
.
close(IN)
closes the file $mfile
. I
don't think it's absolutely necessary, but I always do it.
(That sure was a lot of explantion for five lines of code! Maybe I should be chopping things up into smaller bits.)
# read family information and place in the hashes %dad, %mom, %sex
open(IN, $ffile) or die("Cannot read from $ffile");
$line = <IN>;
while($line = <IN>) {
chomp($line);
This set of lines should now be clear from the above.
@v = split(/\s+/, $line);
In this line, the function split
chops the scalar
$line
at the regular expression \s+
. The
two /
symbols indicate a regular expression
(a pattern to be matched), \s
stands for any white
space (spaces, tabs, etc.), and +
indicates one or
more bits of white space. The return value is a vector,
@v
composed of the bits of $line
, now split
up.
if($v[0] eq "") { shift @v; }
The point of this lines is that if $line
has leading
whitespace, the first element of the vector returned from
split
will be empty. Here we look for that, and dump the
empty element if there is one. Thus a line like
1 1 0 0 1
is initially converted into the vector ("",1,1,0,0,1)
,
and after this line it becomes (1,1,0,0,1)
.
$v[0]
is the way that we refer to a single element of
a vector. Vectors are indexed by 0, 1, 2, etc. We could refer to a
subvector by @v[(1,3,5)]
. Note that when we
use @v
in a scalar context, its value taken
to be the length of the vector. Thus @v[(1..(@v-1))]
is
@v
but dropping the first element. (I guess I should
mention that ($a..$b)
creates a vector with a sequence of
numbers from $a
to $b
, inclusive, with
increments of 1.
eq
is for string comparison (compare this to
==
for numeric comparison)...this part tests whether
$v[0]
exactly matches the empty string
""
. Note that ""
is different than
"0"
and 0
when using eq
but
not when using ==
. This has been a recurring source
of stupid bugs for me.
if
is obviously a conditional statement. Note that,
unlike C, you must always use the brackets with if
statements and such, even if (as here) there is only a single line in
the then part of the statement.
Now, if the first element of the @v
vector is
empty, the shift @v;
statment is executed.
shift
will dump the first element of the vector.
(Compare this to pop
.) Note that shift
@v;
could be replaced with @v = @v[1..(@v-1)];
($fam, $ind, $dad, $mom, $sex) = @v;
This line is not really necessary, but I find it convenient. It
places the first five components of the @v
vector into
$fam
, $ind
, etc., respectively. Thus I can
refer to them as $fam
and so forth rather than as
$v[0]
, $v[1]
, ..., $v[4]
.
$dad{$fam}{$ind} = $dad;
$mom{$fam}{$ind} = $mom;
Oog. This brings up the issue of hashes. These are very important and incredibly useful, but took me a while to figure out (especially in figuring out why I should use them). Let me try to explain them.
A scalar is a single number/string, and is given a name
like $scalar
. A vector is an ordered set of
scalars indexed by the whole numbers 0, 1, 2, ..., and is given a name
like @vector
. We refer to particular elements of the
vector by $vector[0]
, $vector[1]
, etc.
Now a hash is an (unordered) set of scalars indexed by
another set of scalars (called keys), and is given a name like
%hash
. We refer to particular elements of the hash using
an expression like $hash{$scalar}
. For example, we might
have $color{"sky"} = "blue";
and $color{"dirt"} =
"brown";
. Note the use of braces {}
rather than
brackets []
(as is used for vectors) or parentheses
()
(as is used for functions/subroutines).
The %dad
and %mom
hashes that I am
forming in these lines are doubly indexed hashes. Each individual in
these data has a family ID and an individual ID. (The individual IDs
are unique within a family but not necessarily between families.) So
here I'm building up tables indicating the ID for the father and
mother of each individual in each family.
I find hashes useful for at least two reasons. First, a typical
study has a bunch of variables measure on each of a set of subjects.
I find it very convenient to index the measured variables by the
subjects' IDs rather than keep track of an ordered vector for each
variable. (It's particularly convenient for sorting things but subject
ID, which we'll see below.) Second, when confronted with a categorial
variable with an unknown set of possible values, hashes make it very
convenient to identify the set of distinct possible values (using the
expression keys %hash
...the keys
returns a
vector of the keys to the hash).
This explanation is probably insufficient. There will be more below, and you'll have to dig into some other books/tutorials. Hopefully you now at least vaguely understand some of the possibilities.
if($sex == 2) { # make female = 0 rather than = 2
$sex{$fam}{$ind} = 0;
}
else {
$sex{$fam}{$ind} = $sex;
}
}
close(IN);
This is like the previous bit. Here I'm also switching the sex = 2
to sex = 0, since while the input data coded female as 2, CRI-MAP
codes it as 0. I then close the $ffile
since I'm done
with it.
# read genotype data
open(IN, $gfile) or die("Cannot read from $gfile");
# header line : parse family/individual IDs
$line = <IN>; chomp($line);
@v = split(/\s+/, $line);
if($v[0] eq "") { shift @v; } # if first entry is blank, get rid of it
shift @v; # get rid of the first entry "Marker"
I think we should now understand this part. I open the genotype
data file, read the initial header line, chomp off any extra newline
character, split it into a vector at any white space, dump the first
element if it's empty (if there's leading whitespace on the line), and
dump the first element ("Marker"
) since I'm not
interested in it.
foreach $v (@v) {
($fam,$ind) = split(/-/, $v); # split at '-'
push(@fam, $fam);
push(@ind, $ind);
}
The remaining elements from the header line are of the form
familyID-indID
. We wish to create two vectors, one
containing all of the family IDs and the other containing all of the
individual IDs.
foreach $v (@v)
loops through the elements of the
vector @v
one at a time in order; the element is placed
in $v
and the body of the statement is executed.
split
breaks the $v
scalar at the hyphen
and places the part in front of the hyphen in $fam
and
the part after in $ind
.
The two push
lines, as we'd seen earlier, place the
scalars $fam
and $ind
onto the end of the
vectors @fam
and @ind
, respectively. Notice
that I don't need to define or declare these vectors before I use
them.
# parse rest of file
while($line = <IN>) {
$marker = substr($line, 0, 9);
$marker =~ s/\s+//g; # remove any extra spaces
Here we read the rest of the file, one line at a time. With these lines, we need to work with particular columns (characters): the marker name is in columns 1-9; the genotypes follow that, taking up 7 columns for each individual.
substr($str,$skip,$length)
picks out a portion from the
string $str
, first skipping $skip
characters
and picking out a total of $length
characters. So here
we pick out the first nine characters from $line
.
The next line removes any extra whitespace from the marker name.
This is another use of regular expressions. The =~
symbol says to apply the pattern matching on the right to the
scalar on the left. The s///g
construction does the
search-and-replace bit. The s
indicates
search-and-replace. The stuff between the first and second
/
is the regular expression to search for (here
\s+
, which as we saw before indicates whitespace). The
stuff between the second and third /
is what to replace
the found whitespace with. Since we put nothing there, the whitespace
is deleted. The final g
indicates to do this globally
throughout the string. If we'd left off the g
, the line
would only delete the first bit of whitespace found.
I should emphasize...the last line here gives you a glimpse at the
power of perl. I recommend that you look into =~
and
//
and s///
further!
foreach $i (0..(@fam-1)) {
$g = substr($line, 9+$i*7, 7);
($g1, $g2) = split(/\//, $g); # split at '/'
$g1 =~ s/\s+//g; # remove any extra spaces
$g2 =~ s/\s+//g;
Here we have another foreach
loop. Recall that when
we refer to a vector in a scalar context, it acts like a
scalar whose value is the length of the vector. So here
(0..(@fam-1))
creates a vector 0, 1, 2, ... up to 1 less
than the length of the @fam
vector. (Thus it gives the
indices for the @fam
vector. So the foreach loop
repeatedly executes the stuff between the braces, with $i
incrementing through the indices of the @fam
(or
equivalently the @ind
) vector.
In the next line, we use substr
to pull out the
appropriate seven columns from $line
.
We then split this at the /
symbol. Note that I need
to use \/
rather than just /
so that the
two regular expression /
s aren't confused with their
contents.
The last two lines in this group remove any extra spaces. This is not really necessary, but it's helpful.
if($g1 == 0 or $g2==0) { # replace blanks with 0's
$g1 = 0;
$g2 = 0;
}
Here we replace ensure that if the missing values (blank in the
input data) become 0's (which is what we want in the output data). We
also make sure that either both $g1
and $g2
are missing or neither are missing. (The program CRI-MAP doesn't like
to have one allele missing and one not.)
The ==
symbol checks whether $g1
or
$g2
is numerically equivalent to 0 (either blank or
actually a 0). If either is 0, we give them the explicit value 0.
Note the use of or
. I like this a lot. You could also
use ||
as in C, but or
is nicer to read.
$gen{$fam[$i]}{$ind[$i]}{$marker} = $g1 . " " . $g2;
}
}
close(IN);
To end this loop and the reading of the genotype data, we form a triply indexed hash (indexed by family ID, individual ID, and marker name) containing the genotype as two alleles separated by a single space. The periods do string concatenation.
# write genotype data as cri-map .gen file
open(OUT, ">$ofile") or die("Cannot write to $ofile");
Now we've read in all of the data and are ready to write it to the
output file in CRI-MAP format. This line opens a file for writing.
Note the >
symbol within the filename indicates
writing. Because perl interpolates (is that the right word?) the
values of variables within quotes, ">$ofile"
becomes
the string ">data.gen"
(see the definition of
$ofile
way back at the beginning of this perl script).
$nfam = keys %gen; # number of families
keys %gen
gives a vector of all of the unique keys to
the hash %gen
, which is a triply indexed hash. Since the
innermost index is the family ID, this statement gives a vector with
the distinct family IDs. Since this vector resides in a scalar
context, $nfam
takes the length of the vector (that is,
the number of families).
print OUT ("$nfam\n");
We now print the value $nfam
to the file
$ofile
. Recall that \n
indicates a newline
character. Note that print OUT("$nfam\n");
doesn't
work...you need the space between OUT
and the
(
.
$nmar = @markers; # number of markers
print OUT ("$nmar\n");
foreach $mar (@markers) {
print OUT ("$mar\n");
}
Is this part clear, or am I just getting tired? Print out the
number of markers followed by the marker names (in the order they had
appeared in the markers.txt
file), one per line.
foreach $fam (sort numerically keys %gen) { # families in numerical order
Here we use a foreach
loop again to loop one-by-one
through the family IDs. Recall that keys %gen
gives a
vector containing the set of unique family IDs. Since the keys to
hashes are in no special order, and since I wish to print out the
families in numerical order (not really necessary, but I want to), we
use sort
to sort the vector. The sort
function by default sorts things by alphabetical type order (placing
111 ahead of 20), we need to use a subroutine in order to get
sort
to put things into numerical order. I'll try to
explain the numerically
subroutine below, where it
appears at the end of the perl script.
print OUT ("$fam\n");
We print the family name to the output file.
$nind = keys %{$gen{$fam}}; # number of individuals in family
print OUT ("$nind\n");
This is a bit tricky. %gen
refers to the overall
hash. But since this is a triply indexed hash,
%{$gen{$fam}}
refers to the part of the hash
corresponding to the family $fam
(whatever family we
happen to be on so far in the loop). The keys to the hash
%{$gen{$fam}}
are the individual IDs for family
$fam
. Thus $nind
takes the number of
individuals in this family, which we print to the output file (on a
separate line).
If you study the brackets and %
and $
symbols here, I think you'll make a good step towards how to work with
these multiply indexed hashes and arrays, but no doubt further
study/reading will be necessary. It may take a while, but it all will
become clear eventually.
foreach $ind (sort numerically keys %{$gen{$fam}}) {
print OUT ("$ind $mom{$fam}{$ind} $dad{$fam}{$ind} $sex{$fam}{$ind}\n");
Now we loop through each individual (sorted numerically...see
below) in the family $fam
and print out a line containing
his/her ID, mother, father and sex.
foreach $mar (@markers) { # markers in order
print OUT ("$gen{$fam}{$ind}{$mar} ");
}
print OUT ("\n");
}
}
close(OUT);
Here we loop through the marker names, in the order in which they
appeared in the marker.txt
file. We print, on a single
line, the genotypes for the individual, separated by spaces. There
will be an extra space at the end of the line, but that doesn't
matter. Then we close the output file, and we're done (except for
explaining that subroutine).
# subroutine to allow sorts by numerical order
sub numerically { $a <=> $b; }
Subroutines are very useful in breaking up complex perl programs into pieces; if the pieces are chosen carefully, you can create nicely modular, reuseable code.
The definition of a subroutine begins with the word
sub
followed by the name of the subroutine. Then the
definition is given in brackets.
This is a special kind of subroutine to be used by the
sort
function. The one line in the subroutine
takes value 0 if $a
is the same (numerically) as
$b
, and is positive (negative) if $a
is
bigger (smaller) than $b
. I won't explain this further,
but will leave it to you to read further.
Other things
To print an array separated by something other than spaces (for
example, commas) use the function join
:
#print @res separated by commas
print OUT (join(",", @res), "\n");
Last modified: Mon Feb 4 08:48:50 2013 |