
 

A Shortcut to Perl

Erik Tjong Kim Sang, Jakub Zavrel, Guy De Pauw and Walter Daelemans
CNTS - Language Technology Group, University of Antwerp
http://lcg-www.uia.ac.be/~erikt/perl/


This text is part of the lecture notes for a Perl course taught by the CNTS - Language Technology Group at the University of Antwerp.

9. File management

This section describes how you can work with files in Perl. We will examine some basic facts about files and directories, look at the file operations in Perl and finish with some miscellaneous topics related to file management.

9.1. Files and directories

A file is a collection of data that is stored on a disk. We will use files for storing large amounts of data. In order to access files, we need to be able to read from them and write to them. Operations for performing these actions will be presented in the next section. Analogously, a directory is a collection of files and directories. Sometimes we need to read from a directory in order to find out which files it contains. Therefore we will also need a read operation for directories. Writing to a directory is dangerous, so we will not attempt to do that.

The files that we will use contain text. This means that they contain characters that you can see on your keyboard. On the disk, the characters are represented by numbers from 0 to 255. In the early days of computing, only the numbers 0-127 were used. The computer alphabets of those days contained the characters used in American English: a-z, A-Z, 0-9 and a few punctuation signs. The map converting these characters to numbers is called ASCII: American Standard Code for Information Interchange. In order to be able to use characters with accents, numbers higher than 127 had to be used. These 128 extra numbers are insufficient to cover all alphabets of the world and therefore there are different ways to map them to characters. The standard map which you will probably encounter most often is ISO-8859-1.
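The mapping between characters and numbers can be inspected from Perl itself. The following is a minimal sketch which uses the built-in functions ord and chr; the variable names are our own, and the remark about the number 233 assumes that the number is interpreted according to ISO-8859-1:

```perl
use strict;
use warnings;

# ord gives the number that represents a character; chr goes the other way
my $code = ord("A");     # 65 in ASCII
my $char = chr(97);      # "a" in ASCII
print "A is stored as $code\n";
print "97 is the code for $char\n";

# 233 is above 127; in ISO-8859-1 it stands for an e with an acute accent
my $accented = chr(233);
print "code of the accented character: ", ord($accented), "\n";
```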

Often a text contains more information than just characters, words and sentences. The text may also contain information about its structure and the way words or phrases should be displayed. Structural information includes the locations of paragraph boundaries and headings. Display information includes things like the size of the characters and the font type which should be used for them. There are different ways to put this extra information in the text, to name a few: Word, HTML and SGML. Some of these formats use readable codes like <p> in SGML and HTML for defining the start of a paragraph. Others, like Word, use unreadable codes and turn the text into a mess: it becomes a binary file. If you want your Perl program to process such a text, you should try to convert it to a different format first (for Word that format is RTF). Sometimes you may even want to remove all structural information from the text before processing it. However, you should be aware of the fact that structural information can help in analyzing the text. For example, paragraph and heading information helps to find the boundaries of sentences.

9.2. File operations

Up until now we have seen three types of variables: scalars, lists and hashes. In this section we introduce the fourth: the file handle. This is a variable which is tied to a file. We define it with the open command; you do not need to define it with my even if you are programming in a strict environment. Here are a few examples of using open:
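The following is a minimal sketch rather than a definitive recipe. The file name example.txt and the handle names are our own choices, and the files are created in a temporary directory with the standard module File::Temp, which goes beyond what this course has covered so far:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# hypothetical file name inside a temporary directory,
# so that nothing outside that directory is touched
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/example.txt";

# ">" opens a file for writing (an existing file is overwritten)
open(OUTFILE, ">$file")  or die("cannot open $file for writing: $!\n");
print OUTFILE "first line\n";
close(OUTFILE);

# ">>" opens a file for appending
open(LOGFILE, ">>$file") or die("cannot open $file for appending: $!\n");
print LOGFILE "second line\n";
close(LOGFILE);

# "<" (or no prefix at all) opens a file for reading
open(INFILE, "<$file")   or die("cannot open $file for reading: $!\n");
my @lines = <INFILE>;
close(INFILE);

# directories are opened with opendir
opendir(DIR, $dir)       or die("cannot open directory $dir: $!\n");
my @entries = grep { $_ ne "." and $_ ne ".." } readdir(DIR);
closedir(DIR);

print "lines read: ", scalar(@lines), "\n";
print "directory contains: @entries\n";
```

Newer Perl code often uses a three-argument form of open with an ordinary scalar variable as file handle; the two-argument form shown here matches the rest of this text.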

When we open a file, we should define what action to perform on the file: writing (prefix the name with ">"), appending (">>") or reading (nothing or "<"). We can also write to a program by putting a pipe symbol (|) in front of its name, or read from a program by putting a pipe symbol behind its name. A file handle name can be anything but it has become customary to use only capital characters. File handle names do not require a special prefix character. Note that for opening a directory we use a special command: opendir.

As soon as we have defined a file handle, we can perform operations on the file. Here are some examples:
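As a sketch under our own assumptions (a hypothetical file words.txt in a temporary directory created with the standard module File::Temp), reading and writing could look like this:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/words.txt";     # hypothetical file name

# writing: the file handle goes between print and the text
open(OUTFILE, ">$file") or die("cannot open $file: $!\n");
print OUTFILE "one\ntwo\nthree\n";
close(OUTFILE);

open(INFILE, $file) or die("cannot open $file: $!\n");
my $first = <INFILE>;    # scalar context: reads one line
my @rest  = <INFILE>;    # list context: reads all remaining lines
close(INFILE);

opendir(DIR, $dir) or die("cannot open directory $dir: $!\n");
my @names = readdir(DIR);   # list context: all entries, including . and ..
closedir(DIR);

print "first line: $first";
print "remaining lines: ", scalar(@rest), "\n";
```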

So we can read from both files and directories, and the amount of information read depends on the context (either scalar context or list context). For writing a text to a file we use print with the file handle put between the command and the text. As soon as you have finished processing a file, you should close it:
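A small sketch of closing, again with a hypothetical file name in a temporary directory; checking the value returned by close is our own addition and is not required:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/notes.txt";     # hypothetical file name

open(OUTFILE, ">$file") or die("cannot open $file: $!\n");
print OUTFILE "some text\n";
# close returns true on success; for an output file it also hands
# any text that is still waiting in memory over to the operating system
close(OUTFILE) or die("cannot close $file: $!\n");

opendir(DIR, $dir) or die("cannot open directory $dir: $!\n");
closedir(DIR);                   # directory handles are closed with closedir

my $size = -s $file;             # size in bytes of the closed file
print "$file contains $size bytes\n";
```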

This command can be used for any file handle, regardless of whether it was used for reading or writing.

9.3. Miscellaneous

There are three special file handles: STDIN, STDOUT and STDERR. The first one is used for reading from the keyboard or another program and the second for writing to the screen or another program. These two file handles are the defaults for reading and writing: print STDOUT "text" is the same as print "text", and <> reads from STDIN when no file names have been given on the command line. STDERR is also used for writing to the screen. The reason for having two output streams is that programs sometimes send their output to other programs or files. If an error occurs, we do not want the error message to be put in the output file but we want it to be shown on the screen. Therefore all error messages should be sent to STDERR.
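The difference between the two output streams can be tried out with a sketch like the following; the texts and the variable $ok are our own:

```perl
use strict;
use warnings;

# normal output: the same as print "this is normal output\n";
print STDOUT "this is normal output\n";

# error messages go to STDERR; print returns true when it succeeds
my $ok = print STDERR "this is an error message\n";
```

If the output of this program is sent to a file, for example with > out.txt behind the command on the command line, only the first line ends up in the file. The error message still appears on the screen.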

There are a few more commands in Perl that are related to file management:

The first command should be used after opening a file on Windows systems. It deals with the fact that on these systems lines end with the two characters \r\n instead of the single \n. Lines that are read from or written to a file on such systems will be converted to the \n end format we have used up until now.

The latter two are useful for handling problems while opening files or directories. If a program tries to open a file which does not exist, it should generate an error message and exit. This can be implemented as follows:

   open(INFILE,"myfile") or die("cannot open myfile!");

Perl will try to open the file. When this fails, open returns a false value and the second part of the or expression will be executed. This will make the program exit with the specified error message.
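Perl stores the reason for a failed open in the special variable $!, which can be added to the message. The sketch below assumes a hypothetical file name which does not exist. To keep the sketch running after the failure, it puts the message in a variable instead of calling die:

```perl
use strict;
use warnings;

my $file = "no-such-file.txt";   # hypothetical name, assumed not to exist

# the part behind "or" is only executed when open fails
my $message = "";
open(INFILE, $file) or $message = "cannot open $file: $!\n";
close(INFILE) if $message eq "";

print STDERR $message;
```

In a real program the assignment would be replaced by die("cannot open $file: $!\n"), which prints the same message and exits.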

You should be aware of two extra ways to use print. First you can use print for printing a block of text rather than a string. Second, there is a variant of print which allows printing a formatted string: printf. Here are two examples:

   print <<"EOF"; # print everything up until EOF
   this text will be printed
   the fact that it spans four lines
   is no problem
   variable $x will be evaluated as well
   EOF
   # this will print "1 1.234   1.23\n"
   printf "%d %-7s %4.2f\n",1.234,1.234,1.234;

First, print prints every succeeding line until it finds a line which contains nothing but the specified tag EOF. Second, printf prints its final three arguments according to the format specified in the first argument. There, %d means integer (decimal number), %s means string and %f means floating point number. Between the percent sign and the character you can specify the minimum number of characters the expression should occupy, and for %f also the number of digits after the decimal point: %4.2f means at least four characters with two digits after the point. A minus sign after the % means that the token should be left-aligned rather than aligned at the right.
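The effect of the different format parts can be examined one at a time. The sketch below uses sprintf, a variant which returns the formatted string instead of printing it; sprintf and the square brackets that make the spaces visible are our own additions:

```perl
use strict;
use warnings;

my $x = 1.234;
my $int   = sprintf("%d", $x);       # "1": the fraction is dropped
my $left  = sprintf("[%-7s]", $x);   # "[1.234  ]": left-aligned in 7 positions
my $right = sprintf("[%7s]", $x);    # "[  1.234]": right-aligned in 7 positions
my $fixed = sprintf("[%6.2f]", $x);  # "[  1.23]": 6 positions, 2 decimals
print "$int $left $right $fixed\n";

# a block of text can be stored in a variable as well;
# the closing tag has to start in the first column of its line
my $block = <<"EOF";
value of x: $x
EOF
print $block;
```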

9.4. Programming example

The file management functions of Perl enable us to work with large amounts of data. We will give a demonstration of this by adding two extra commands to the texttool program: first readFile, which will read a file and put its contents in a text and second store, which will store a text in a file. We will start with the error checking code:

   # added error checking code for readFile
   if ($command eq "readFile" and @args != 2) { $errorNbr = 1; }
   if ($command eq "readFile" and @args == 2) {
      if (not(open(INFILE,"$args[0]"))) { $errorNbr = 6; }
      else { close(INFILE); }
   }
   # added error checking code for store
   if ($command eq "store" and @args != 2) { $errorNbr = 1; }
   if ($command eq "store" and @args == 2) {
      if (not(defined($text{$args[0]}))) { $errorNbr = 2; }
      elsif (not(open(OUTFILE,">$args[1]"))) { $errorNbr = 6; }
      else { close(OUTFILE); }
   }

So, apart from checking if the number of arguments is correct and the text variable used by store is defined, the program also attempts to open the file. If this fails, it will generate an error message. Otherwise, the program has succeeded in opening a file which it does not need yet, so it will close it again. It is reasonable to check an input file before we try to read from it: it might not exist. However, we also need to check output files since the operating system might forbid writing to some files.

The processing code for readFile and store is very similar to the code for read and print:

   elsif ($command eq "readFile") {
      open(INFILE,"$args[0]");
      $text{$args[1]} = "";
      while (<INFILE>) { $text{$args[1]} .= $_; }
      close(INFILE);
   }
   elsif ($command eq "store") {
      open(OUTFILE,">$args[1]");
      print OUTFILE $text{$args[0]}; 
      close(OUTFILE);
   }
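The two commands can also be tried outside texttool. The following sketch is not the texttool code itself: the subroutines readFile and store, the text names t1 and t2 and the file name are hypothetical, and the temporary directory is created with the standard module File::Temp. It stores a text in a file, reads the file back into a second text and compares the two:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my %text;                          # text name -> contents, as in texttool

# hypothetical helper: read file $file into the text named $name
sub readFile {
   my ($file, $name) = @_;
   open(INFILE, $file) or return 0;
   $text{$name} = "";
   while (<INFILE>) { $text{$name} .= $_; }
   close(INFILE);
   return 1;
}

# hypothetical helper: store the text named $name in file $file
sub store {
   my ($name, $file) = @_;
   return 0 if not defined($text{$name});
   open(OUTFILE, ">$file") or return 0;
   print OUTFILE $text{$name};
   close(OUTFILE);
   return 1;
}

my $dir = tempdir(CLEANUP => 1);
$text{"t1"} = "a small text\nof two lines\n";
store("t1", "$dir/t1.txt")    or die("store failed\n");
readFile("$dir/t1.txt", "t2") or die("readFile failed\n");
print "round trip ", ($text{"t1"} eq $text{"t2"} ? "ok" : "failed"), "\n";
```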

Equipped with these two commands, the texttool program has become much more useful than it was before. We have tested it by processing a file containing almost 200,000 tokens. Apart from some small delays caused by the increased amount of computation, the program performed fine.


Last update: March 31, 2000. erikt@uia.ua.ac.be