Perl 2007: Lesson 6


This text is part of the lecture notes for a programming course taught at University of Tilburg, The Netherlands.

6. File management

This section describes how you can work with files in Perl. We will examine some basic facts about files and directories, look at the file operations in Perl and finish with some miscellaneous topics related to file management.

6.1. Files and directories

A file is a collection of data that is stored on a disk. We will use files for storing large amounts of data. In order to access files, we need to be able to read from them and write to them. Operations for performing these actions will be presented in the next section. Analogously, a directory is a collection of files and directories. Sometimes we need to read from a directory in order to find out which files it contains. Therefore we will also need a read operation for directories. Writing directly to a directory is dangerous, so we will not attempt to do that.

The files that we will use contain text. This means that they contain characters that you can find on your keyboard. On the disk, the characters are represented by numbers from 0 to 255, each stored in one byte. In the early days of computing, only the numbers 0-127 were used. The computer alphabets of those days contained the characters used in American English: a-zA-Z0-9_ and a few punctuation signs. The map converting these characters to numbers is called ASCII: American Standard Code for Information Interchange. In order to be able to use characters with accents, numbers higher than 127 had to be used. ISO-8859-1 (or Latin-1) is the most frequently used mapping from characters to numbers in the range 0-255. For larger character sets there is Unicode, which assigns numbers in the range 0-1,114,111; UTF-8 is the most frequently used way of storing these numbers, using one to four bytes per character.
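To make the mapping between characters and numbers concrete, here is a minimal sketch that uses the built-in functions ord and chr and the Encode module that ships with Perl. The choice of the letter e with an acute accent (number 233) is only an illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ASCII: every character maps to a number between 0 and 127
my $code = ord("A");   # the number of the character A
my $char = chr(97);    # the character with number 97

# e with acute accent has number 233 in Latin-1 and in Unicode
my $accented = chr(233);

# count the bytes needed to store this character in both encodings
my $latin1_bytes = length(encode("ISO-8859-1", $accented));
my $utf8_bytes   = length(encode("UTF-8", $accented));

print "A has number $code, number 97 is $char\n";
print "Latin-1 bytes: $latin1_bytes, UTF-8 bytes: $utf8_bytes\n";
```

The accented character fits in a single byte in Latin-1, while UTF-8 needs two bytes for it.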

Often a text contains more information than just characters, words and sentences. The text may also contain information about its structure and the way words or phrases should be displayed. Structural information includes information about the locations of paragraph boundaries and headings. Display information includes things like the size of the characters and the font type which should be used for them. There are different ways to put this extra information in the text, to name a few: Word's DOC format and HTML. Some of these formats use human-readable code like <p> in HTML for defining the start of a paragraph. Others, like DOC, use unreadable codes and turn the text into a mess: it becomes a binary file. If you want your Perl program to process such a text, you should try to convert it to a different format first (for DOC that format is RTF). Sometimes you may even want to remove all structural information from the texts before processing them. However, you should be aware of the fact that structural information can help in analyzing the text. For example, paragraph and heading information helps to find the boundaries of sentences.
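As a rough sketch of removing structural information, the following code strips the tags from a small piece of HTML. It assumes simple, well-formed tags; the example text is made up, and real HTML pages usually need a proper parser. The end tags of headings and paragraphs are turned into newlines first, so that these boundaries are not lost.

```perl
use strict;
use warnings;

my $html = "<h1>Title</h1><p>Hello <b>world</b>.</p>";

# replace the end of a heading or paragraph by a newline
(my $text = $html) =~ s/<\/(h1|p)>/\n/g;

# remove every remaining tag: "<", anything but ">", then ">"
$text =~ s/<[^>]+>//g;

print $text;
```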

6.2. File operations

Up until now we have seen three types of variables: scalars, lists and hashes. In this section we introduce the fourth: the file handle. This is a variable which is tied to a file. File handles can be used in two ways: to read data from a file and to write data to it. We define them with the open command; you do not need to define them with my even if you are programming in a strict environment. Here are a few examples of using open:
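A minimal sketch of the different ways of opening, with a hypothetical file name and with echo as an example of a program to read from. The or die part stops the program when opening fails.

```perl
use strict;
use warnings;

my $file = "demo.txt";   # hypothetical file name

# open for writing: an existing file is overwritten
open(OUTFILE, ">$file") or die "cannot write $file: $!\n";
close(OUTFILE);

# open for appending: new text is added at the end
open(LOGFILE, ">>$file") or die "cannot append to $file: $!\n";
close(LOGFILE);

# open for reading: "<$file" and "$file" are equivalent
open(INFILE, "<$file") or die "cannot read $file: $!\n";
close(INFILE);

# read from a program: the pipe symbol follows the command
open(PROGRAM, "echo hello |") or die "cannot run echo: $!\n";
my $output = <PROGRAM>;
close(PROGRAM);

# directories have their own commands
opendir(DIR, ".") or die "cannot open directory: $!\n";
closedir(DIR);

unlink($file);   # remove the demo file again
```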

When we open a file, we should define what action to perform on the file: writing (prefix name with ">"), appending (">>") or reading (nothing or "<"). We can also write to a program by putting a pipe symbol (|) in front of its name, or read from a program by putting the pipe symbol behind its name. A file handle name can be anything but it has become customary to use only capital letters. File handle names do not require a special prefix character. Note that for opening a directory we use a special command: opendir.

As soon as we have defined a file handle, we can perform operations on the file. Here are some examples:
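A minimal sketch of these operations, assuming a hypothetical file demo.txt that the program creates itself:

```perl
use strict;
use warnings;

my $file = "demo.txt";   # hypothetical file name

# writing: the file handle stands between print and the text
open(OUTFILE, ">$file") or die "cannot write $file: $!\n";
print OUTFILE "one\n", "two\n", "three\n";
close(OUTFILE);

# reading from a file
open(INFILE, $file) or die "cannot read $file: $!\n";
my $first = <INFILE>;   # scalar context: reads one line
my @rest  = <INFILE>;   # list context: reads all remaining lines
close(INFILE);

# reading from a directory
opendir(DIR, ".") or die "cannot open directory: $!\n";
my $entry   = readdir(DIR);   # scalar context: one directory entry
my @entries = readdir(DIR);   # list context: all remaining entries
closedir(DIR);

chomp($first);
print "first line: $first, remaining lines: ", scalar(@rest), "\n";

unlink($file);   # remove the demo file again
```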

So we can read both from files and directories and the amount of information read depends on the context (either scalar context or list context). For writing a text to a file we use print with the file handle put between the command and the text. As soon as you have finished processing a file, you should close it:
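A minimal sketch with a hypothetical file name. Text that has been printed may still be waiting in a buffer; closing the file handle makes sure that it ends up in the file.

```perl
use strict;
use warnings;

my $file = "demo.txt";   # hypothetical file name

open(OUTFILE, ">$file") or die "cannot write $file: $!\n";
print OUTFILE "some text\n";

# close the file handle as soon as we are done with the file
close(OUTFILE) or die "cannot close $file: $!\n";

# after closing, the complete text is in the file
my $size = -s $file;   # size of the file in bytes
print "$file contains $size bytes\n";

unlink($file);   # remove the demo file again
```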

This command can be used for any file handle, regardless of whether it was used for reading or writing. Directory handles that were opened with opendir are closed with closedir.

6.3. Miscellaneous

There are three special file handles: STDIN, STDOUT and STDERR. The first one is used for reading from the keyboard or another program and the second for writing to the screen or another program. Actually these two file handles are the default ones for reading and writing: we have already used <STDIN> for reading lines of text and print STDOUT "text" is the same as the familiar print "text". STDERR is also used for writing to the screen. The reason for having two output streams is that programs sometimes send their output to other programs or files. If an error occurs, we do not want the error message to be put in the output file but we want it to be shown on the screen. Therefore all error messages should be sent to STDERR.
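A minimal sketch of the two output streams; the messages are made up:

```perl
use strict;
use warnings;

# these two statements are equivalent: STDOUT is the default
print "result: 42\n";
print STDOUT "result: 42\n";

# error messages go to STDERR, so they remain visible on the screen
# even when the regular output is sent to a file
print STDERR "error: something went wrong\n";
```

If this code is stored in a file with the hypothetical name streams.pl, then running perl streams.pl > out.txt puts the two result lines in out.txt while the error message still appears on the screen.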

There are a few more commands in Perl that are related to file management:

The first command should be used after opening a file on Windows systems. It will deal with the fact that on these systems lines end with the two characters \r\n instead of the single character \n. Lines that are read from or written to a file on such systems will be converted to the \n end format we have used up until now.
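As a sketch of this conversion: the :crlf I/O layer, which can be passed to binmode, translates \r\n to \n while reading. Whether this is exactly the call meant above is an assumption, and the file name is hypothetical.

```perl
use strict;
use warnings;

my $file = "windows.txt";   # hypothetical file name

# create a file with Windows line endings; binmode without a
# layer prevents any automatic conversion while writing
open(OUTFILE, ">$file") or die "cannot write $file: $!\n";
binmode(OUTFILE);
print OUTFILE "one\r\ntwo\r\n";
close(OUTFILE);

# read the file through the :crlf layer: \r\n becomes \n
open(INFILE, "<$file") or die "cannot read $file: $!\n";
binmode(INFILE, ":crlf");
my @lines = <INFILE>;
close(INFILE);

unlink($file);   # remove the demo file again
```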

The latter two are useful for handling problems while opening files or directories. If a program tries to open a file which does not exist, it should generate an error message and exit. This can be implemented as follows:

   open(INFILE,"myfile") or die("cannot open myfile!");

Perl will try to open the file. When this fails, open returns a false value and the second part of the or statement will be executed. This will make the program exit with the specified error message.
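The error message becomes more useful when it includes the special variable $!, which holds the reason for the failure. The sketch below uses warn instead of die, so that the program continues after reporting the problem; the file name is hypothetical and assumed not to exist.

```perl
use strict;
use warnings;

my $file = "no_such_file.txt";   # assumed not to exist
my $failures = 0;

if (open(INFILE, $file)) {
    # opening succeeded: process the file here
    close(INFILE);
} else {
    # warn prints to STDERR but does not stop the program
    warn "cannot open $file: $!\n";
    $failures++;
}

print "number of files that could not be opened: $failures\n";
```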

You should be aware of two extra ways to use print. First you can use print for printing a block of text rather than a string. Second, there is a variant of print which allows printing a formatted string: printf. Here are two examples:

   # print everything up until EOF
   print <<"EOF";
   this text will be printed
   the fact that it spans four lines
   is no problem
   variable $x will be evaluated as well
   EOF

   # this will print "1 1.234   1.23\n"
   printf "%d %-7s %4.2f\n",1.234,1.234,1.234;

First, print prints every succeeding line until it finds a line that consists only of the specified tag EOF. Second, printf prints its final three arguments according to the format specified in the first argument. There, %d means integer (decimal number), %s means string and %f means floating point number. Between the percent sign and the character you can specify the minimum number of characters that will be printed and, after a period, the number of digits behind the decimal point. A minus sign behind the % means that the value should be left-aligned rather than right-aligned.
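A few more format examples as a minimal sketch. It uses sprintf, which formats like printf but returns the string instead of printing it; the vertical bar only marks the end of each field.

```perl
use strict;
use warnings;

my $right  = sprintf("%5d|", 42);          # width 5, right-aligned
my $left   = sprintf("%-5d|", 42);         # width 5, left-aligned
my $float  = sprintf("%6.2f|", 3.14159);   # width 6, two decimals
my $string = sprintf("%-7s|", "perl");     # width 7, left-aligned

print "$right\n$left\n$float\n$string\n";
```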

6.4. Error messages and testing

Finding errors in your code is an important task. There are two kinds of errors: (1) those that keep your program from running, and (2) those that keep your program from generating the right results. Perl helps you to identify errors of the first type by generating warnings and error messages. Let's look at an example by running the following program for computing one divided by two:

   # program.pl
   use strict;

   $a = 1 
   $b = 2
   $c = $a/$b
   printf "%d",$c

   # example run
   erikt@stuwww:~$ perl -w program.pl
   Scalar found where operator expected at program.pl line 4, near "$b"
           (Missing semicolon on previous line?)
   syntax error at program.pl line 4, near "$b "
   Execution of program.pl aborted due to compilation errors.

Perl can generate an impressive number of error messages for basic errors. Whenever you are faced with error messages, always start with solving the first one and then run the program again. The second, third, and following messages may all be caused by the same problem. In this case, we need to look up line 4 ($b = 2). We find out that, indeed, a semicolon is missing on line 3 and in fact as well on lines 4, 5 and 6. We add them and run the program again:

   # program.pl
   use strict;

   $a = 1;
   $b = 2;
   $c = $a/$b;
   printf "%d",$c;

   # example run
   erikt@stuwww:~$ perl -w program.pl
   Global symbol "$c" requires explicit package name at program.pl line 5.
   Global symbol "$c" requires explicit package name at program.pl line 6.
   Execution of program.pl aborted due to compilation errors.

This time, Perl complains about a variable ($c) not being declared on lines 5 and 6. This is correct: in fact, we did not declare any of our variables. We fix this by adding my before the first use of each of the three variables and run the program again.
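Perl did not complain about $a and $b as well, because these two names are reserved for use in sort blocks, so strict accepts them without my. A minimal sketch of that special role (the list of numbers is only an illustration):

```perl
use strict;
use warnings;

# $a and $b are set by sort itself; no "my" is needed for them
my @sorted = sort { $a <=> $b } (10, 9, 100, 1);

print "@sorted\n";
```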

   # program.pl
   use strict;

   my $a = 1;
   my $b = 2;
   my $c = $a/$b;
   printf "%d",$c;

   # example run
   erikt@stuwww:~$ perl -w program.pl
   0$

Now Perl does not generate any error messages but the code is not generating the right output. First, there is no newline after the number, something we can fix easily. But then the output number is zero and not 0.5. In order to find out what is wrong, we run Perl in debugging mode, with option -d:

   # program.pl
   use strict;

   my $a = 1;
   my $b = 2;
   my $c = $a/$b;
   printf "%d\n",$c;

   # example run
   erikt@stuwww:~$ perl -d program.pl
   Loading DB routines from perl5db.pl version 1.27
   Editor support available.

   Enter h or `h h' for help, or `man perldebug' for more help.

   main::(program.pl:3): my $a = 1;
     DB<1> n
   main::(program.pl:4): my $b = 2;
     DB<1> n
   main::(program.pl:5): my $c = $a/$b;
     DB<1> n
   main::(program.pl:6): printf "%d\n",$c;
     DB<1> p$a
   1
     DB<2> p$b
   2
     DB<3> p$c
   0.5
     DB<4> q
   $

In debugging mode you can run the program step-by-step with the command n followed by Return. After every step, Perl will show what command will be executed next. You can check the values of the variables with the command p followed by the variable name.

We notice that the variables have the expected values right before the print statement. We inspect the statement and notice that we have made an error. We try to print a fraction but the print statement processes an integer (%d). We change this part to %3.1f and run the program again:

   # program.pl
   use strict;

   my $a = 1;
   my $b = 2;
   my $c = $a/$b;
   printf "%3.1f\n",$c;

   # example run
   erikt@stuwww:~$ perl -w program.pl
   0.5

This time the output of the program is correct.

If you want to know more about debugging Perl programs, you can take a look at the Perl debugging tutorial.

Perl contains a built-in method for testing subroutines.
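One standard option is the Test::More module that ships with Perl; that this is the method meant here is an assumption. The sketch below tests a hypothetical subroutine divide and reports for each test whether it passed.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# hypothetical subroutine under test
sub divide {
    my ($x, $y) = @_;
    return $x / $y;
}

# is() compares the result with the expected value
is(divide(1, 2), 0.5, "one divided by two");
is(divide(6, 3), 2, "six divided by three");

# ok() passes when its first argument is true
ok(divide(1, 3) < 1, "a third is smaller than one");
```

Each passing test produces a line starting with ok, and each failing test a line starting with not ok.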


Last update: October 12, 2007. erikt(at)science.uva.nl