A Shortcut to Perl

Erik Tjong Kim Sang, Jakub Zavrel, Guy De Pauw and Walter Daelemans
CNTS - Language Technology Group, University of Antwerp
http://lcg-www.uia.ac.be/~erikt/perl/

8. Subroutines

This section presents subroutines and related concepts such as variable scope and program arguments.

8.1. The basic facts

Subroutines are named blocks of code. Because they have a name, we can execute their body from anywhere in the program by calling them by name. Calls to subroutines are easy to recognize because the subroutine name is often preceded by the special character &. Here is an example:
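A minimal sketch of such a program, assuming that askForInput does nothing more than print a prompt (the wording of the prompt is illustrative):

```perl
# definition of the subroutine: its body is not executed at this point
sub askForInput {
    print "Please enter some text: ";
}

# call of the subroutine: the body of askForInput is executed here
&askForInput();
```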

When the subroutine askForInput is called at the end of the program, its body will be executed and a request for input will be printed. Note that the subroutine body will only be executed when the subroutine is called. So if the subroutine is located at the start of the program, the code in the body will just be skipped when the program is read.

Subroutines typically perform small, useful tasks. After you have been programming for some months, you will have written many subroutines, and some of these will be useful in several programs. You do not want to copy a subroutine into a program file every time you need it. There is a convenient solution for this: put related subroutines in a file and include the file with the command require:
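The following sketch shows the two steps: adding a directory to Perl's search list @INC and reading the file with require. To keep it runnable on its own, it first writes a tiny nlp.pm to a temporary directory; that setup, the subroutine greet and the directory are assumptions made for illustration. Normally the file would already exist in a directory of your own.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# setup for this sketch only: create a small package file nlp.pm
my $dir = tempdir(CLEANUP => 1);
open(my $fh, ">", "$dir/nlp.pm") or die "cannot write nlp.pm: $!";
print $fh 'sub greet { return "hello from nlp\n"; }', "\n1;\n";
close($fh);

# step 1: define in which directory the subroutine files are stored
push(@INC, $dir);
# step 2: read the file; the extension .pm is added by Perl
require nlp;

# the subroutine defined in nlp.pm can now be called
print &greet();
```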

Here we start by defining the directory in which the files with subroutines are stored. Then we read the file we want to use. Files included like this are called packages, modules or libraries. It is customary to give them a name with the extension .pm, which does not have to be specified in the require command. So in this example we are actually including the file nlp.pm. When you write your own packages, you should be aware that Perl requires packages to return a true value. This can be enforced by letting the package end with a one followed by a semicolon (1;).

8.2. Variable scope

Suppose we have written a program with one subroutine. A variable $a is used both in the subroutine and in the main part of the program. Here is an overview of the program:
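A sketch of such a program, assuming the subroutine is called changeA and sets $a to 1:

```perl
# no "use strict" here: $a is used as an undeclared global variable
$a = 0;
print "$a\n";      # first print, in the main part of the program

sub changeA {
    $a = 1;        # modifies the global variable $a
}

print "$a\n";      # second print, before the subroutine has been called
&changeA();
print "$a\n";      # third print, after the subroutine call
```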

The value of $a will be printed three times and the question is what values will be printed. The first time the value is printed, the value 0 has been put in $a so this value will be printed. This will also happen the second time because the body of the subroutine definition will be skipped. The third time, the value 1 will be printed because $a has been changed by the subroutine call on the previous line. In this example, the variable $a can be accessed throughout the program: both in the main part and in the subroutine. We say that $a is a global variable.

Suppose that we have written a subroutine which is included in some package. This subroutine may modify many variables. When we use this subroutine in some large program, we do not want to have to look inside the subroutine to see which variable names might clash with the variables of the main program. We want the subroutine to hide its variables from the rest of the program. This can be done by declaring the variables with my:
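A sketch of the same kind of program, assuming a subroutine changeA that now declares its own $a with my:

```perl
# no "use strict" here: the main program uses $a as a global variable
$a = 0;
print "$a\n";

sub changeA {
    my $a = 1;     # a new variable $a, visible only inside changeA
}

print "$a\n";
&changeA();
print "$a\n";      # the global $a has not been touched by changeA
```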

Because we have put my before the first use of $a in the subroutine, changeA obtains its own variable $a, which is not related to the variable used in the main program. This means that the value of the main program variable will not be changed and the program will print 0 three times. The my construct influences the scope of a variable: the part of the program in which it can be used. It restricts the scope to the part starting after the variable declaration and ending at the end of the block in which the variable is declared, here the body of the subroutine.

8.3. Communication between subroutines and programs

Subroutines communicate with other parts of programs by exchanging variable values. Input of a subroutine can be specified by providing the input as arguments of the subroutine call, for example: &doSomething(2,"a",$abc). A peculiarity of Perl subroutines is that they convert their input variables to a flat list. This means that &doSomething((2,"a"),$abc) will result in the same as the earlier example. Inside the subroutine the argument values can be accessed via the special list @_. So the first argument (here 2) will be put in $_[0], the second (here "a") in $_[1] and the third (here $abc) in $_[2].
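The flattening of arguments and the access through @_ can be tried out with a small sketch. The body of doSomething is an assumption made for illustration: it reports how many arguments it received and what the first three are.

```perl
use strict;
use warnings;

sub doSomething {
    my $count = scalar(@_);   # number of arguments after flattening
    return "$count arguments: $_[0] $_[1] $_[2]";
}

my $abc = "xyz";
# both calls pass the same flat list (2, "a", "xyz") to the subroutine
print &doSomething(2, "a", $abc), "\n";
print &doSomething((2, "a"), $abc), "\n";
```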

When a variable is used as an argument of a subroutine, a modification of the @_ location corresponding to that argument will modify the variable itself. So, if the subroutine doSomething modifies $_[2] after the example call in the previous paragraph, then $abc will be modified. A tricky problem is passing two or more lists as arguments of a subroutine. If we just say something like &sub(@a,@b), the subroutine will receive the two lists as one big list and it will be unable to determine where the first ends and where the second starts. The solution to this problem is to pass the lists as reference arguments: &sub(\@a,\@b). We will not go further into this but refer to the Perl literature for the details.
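Both points can be illustrated with a short sketch. The subroutine names changeThird and countBoth are hypothetical; the second one assumes that it is called with exactly two list references.

```perl
use strict;
use warnings;

# assigning to $_[2] changes the variable that was passed as third argument
sub changeThird {
    $_[2] = "changed";
}

# receives two references and dereferences them with @$...
sub countBoth {
    my ($first, $second) = @_;
    return (scalar(@$first), scalar(@$second));
}

my $abc = "original";
&changeThird(2, "a", $abc);
print "$abc\n";                  # $abc was modified through $_[2]

my @a = (1, 2, 3);
my @b = (4, 5);
my ($sizeA, $sizeB) = &countBoth(\@a, \@b);
print "$sizeA $sizeB\n";         # the two lists were kept apart
```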

Just as a subroutine uses a list for its input, it also uses a list for its output. In general the return value of a subroutine is equal to the value of its final command. We can enforce a specific return value by specifying it as the final part of the subroutine. For example, if we end a subroutine with a line containing (1,2), or more explicitly return(1,2), then it will return the list (1,2). These return values can be intercepted by using the subroutine call as the right-hand side of an assignment, for example: ($a,$b) = &subr(). The return values of the subroutine will be stored in the variables on the left-hand side.
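A small sketch of both ways of returning the list (1,2). The name subrImplicit and the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

# explicit return value
sub subr {
    return (1, 2);
}

# implicit return value: the value of the final command
sub subrImplicit {
    (1, 2);
}

# intercept the return values with a list assignment
my ($first, $second) = &subr();
my ($third, $fourth) = &subrImplicit();
print "$first $second $third $fourth\n";
```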

The concept of communication explained here applies not only to subroutines but also to complete programs. A Perl program can return a value to its caller as well: the exit status, which is set with exit, as in the exit(0) you may have seen at the end of some of the example programs. Such values are only interesting for programs that communicate with other programs or with the operating system. More useful is the concept of program arguments. These are the strings specified on the command line after the program name. Like the texttool program, Perl programs can be called with arguments, as in texttool /p "TT>". These arguments are stored in the special list @ARGV (here $ARGV[0]="/p" and $ARGV[1]="TT>"; the extra quotes were removed by the operating system).
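The contents of @ARGV can be inspected with a sketch like the following. The helper describeArguments is hypothetical; it takes the arguments as a list so that it can also be tried out without a command line.

```perl
use strict;
use warnings;

# build one descriptive line per program argument
sub describeArguments {
    my @arguments = @_;
    my @lines = ();
    for (my $i = 0; $i < @arguments; $i++) {
        push(@lines, "\$ARGV[$i] = $arguments[$i]");
    }
    return @lines;
}

# show the arguments this program was called with
foreach my $line (&describeArguments(@ARGV)) {
    print "$line\n";
}

# exit status 0 signals success to the operating system
exit(0) if 0;   # left inactive in this sketch; a real program would call exit(0)
```

If the sketch is called with the arguments /p "TT>", it prints one line for $ARGV[0] and one for $ARGV[1].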

8.4. Programming example

In the exercises of the earlier sessions we have worked on translation programs. Here we will construct yet another translation program, but this time we will use subroutines. We will also enable the user to influence the behavior of the program by using command line arguments. The program will translate Dutch to English or English to Dutch, and the user can use the arguments d-e and e-d to select one of the translation directions. We start with making a top-level description of the program:
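A possible top-level description, written as pseudocode and based on the tasks named in the remainder of this section:

```text
determine the translation mode
while a text can be read
    translate the text
    print the translation
```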

This program contains a loop and four tasks. Each of the tasks can be put in a subroutine. However, the reading and printing parts are simple, so we will only create subroutines for mode determination and translation. The translation memory will consist of a hash with Dutch words as keys and English words as values. Translation consists of word lookup, and when translation from English to Dutch is required we will swap the keys with the values (reverse in Perl). We will start with filling the translation memory and determining the translation mode:
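A sketch of this first part. The hash name %dictionary and its four entries are assumptions made for illustration; the program argument e-d is taken from the description above.

```perl
use strict;
use warnings;

# translation memory: Dutch words as keys, English words as values
# ("our" declares a global variable under "use strict")
our %dictionary = ("en" => "and", "gingen" => "went",
                   "naar" => "to", "het" => "the");

# determine the translation mode from the first program argument
sub detTrMod {
    if (defined($ARGV[0]) and $ARGV[0] eq "e-d") {
        # English to Dutch: swap the keys with the values
        %dictionary = reverse(%dictionary);
    }
}

&detTrMod();
```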

The subroutine detTrMod does not use arguments and returns no values. It checks the argument of the program and reverses the translation dictionary when English to Dutch translation is required.

The translation part is more complex than the subroutine detTrMod. The text will be received in a string. It needs to be converted to a list, be translated and be converted back to a string. Since we will translate by lookup in the dictionary, we need to perform some cleaning up as well by removing punctuation marks. We will keep the translation subroutine restricted to processing a clean list of words:
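A sketch of such a subroutine, assuming a global hash %dictionary as translation memory (its entries are illustrative):

```perl
use strict;
use warnings;

# translation memory: Dutch words as keys, English words as values
our %dictionary = ("en" => "and", "gingen" => "went",
                   "naar" => "to", "het" => "the");

# translate a clean list of words by dictionary lookup
sub translate {
    my @words = @_;
    my @translation = ();
    foreach my $word (@words) {
        if (defined($dictionary{$word})) {
            push(@translation, $dictionary{$word});
        } else {
            # unknown word: copy it to the translation unchanged
            push(@translation, $word);
        }
    }
    return @translation;
}

print join(" ", &translate("John", "en", "Mary")), "\n";
```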

translate receives a list of words as its arguments. It looks up each of these words in the dictionary and adds the translation of each word to the list of translated words. When a word is not present in the dictionary, the word itself will be added to the translation. After having processed all words, translate returns the list of translated words.

The translation subroutine works by iteration: it contains a loop which processes one word after another. It is also possible to work by recursion: translate one word and leave the rest of the translation to an embedded call of translate. Here is an example:
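A recursive variant could be sketched as follows, again assuming a global hash %dictionary with illustrative entries:

```perl
use strict;
use warnings;

# translation memory: Dutch words as keys, English words as values
our %dictionary = ("en" => "and", "gingen" => "went",
                   "naar" => "to", "het" => "the");

# translate a list of words by recursion
sub translate {
    my @words = @_;
    # empty argument list: nothing to translate
    if (not @words) {
        return ();
    }
    # translate the first word, or keep it when it is unknown
    my $word = shift(@words);
    my $translated = defined($dictionary{$word}) ? $dictionary{$word} : $word;
    # leave the rest of the list to an embedded call of translate
    return ($translated, &translate(@words));
}

print join(" ", &translate("John", "en", "Mary")), "\n";
```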

Now we start by checking if the subroutine was called with an empty argument list. If that is the case we return the empty list since there is nothing to translate. Otherwise, we take the first word from the list and return its translation, if one exists, followed by the translation of the rest. For the latter part we trust translate to be able to translate the rest of the text. The translation task can be solved both by iteration and by recursion. For some other tasks recursion is the most natural solution, which is why we show an example of it here.
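Putting the pieces together, a complete program could be sketched as follows. The dictionary entries, the helper translateText and the in-memory demo input are assumptions made so that the sketch runs unattended; an interactive version would read from STDIN instead of $in.

```perl
use strict;
use warnings;

# translation memory: Dutch words as keys, English words as values
our %dictionary = ("en" => "and", "gingen" => "went",
                   "naar" => "to", "het" => "the");
# demo input, used here instead of text typed by the user
my $input = "John en Mary gingen naar het restaurant.\n";

# reverse the dictionary when English to Dutch translation is requested
sub detTrMod {
    if (defined($ARGV[0]) and $ARGV[0] eq "e-d") {
        %dictionary = reverse(%dictionary);
    }
}

# translate a clean list of words by dictionary lookup
sub translate {
    my @translation = ();
    foreach my $word (@_) {
        if (defined($dictionary{$word})) {
            push(@translation, $dictionary{$word});
        } else {
            push(@translation, $word);
        }
    }
    return @translation;
}

# clean one text, translate it and return the result as a string
sub translateText {
    my ($text) = @_;
    $text =~ s/[^\w\s]//g;            # remove non-word, non-space characters
    my @words = split(" ", $text);    # convert the text to a list of words
    return join(" ", &translate(@words));
}

&detTrMod();
open(my $in, "<", \$input) or die "cannot read demo input: $!";
while (my $line = <$in>) {
    print &translateText($line), "\n";
}
close($in);
```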

After initializing two variables the program checks the arguments and enters a loop. In the loop, a text is read and the non-word characters except white space characters are removed from the text. Next, the text is converted to a list, translated, converted back to a string and printed. This complete program is capable of translating the sentence John and Mary went to the restaurant both from English to Dutch and the other way around.