A Shortcut to Perl

Erik Tjong Kim Sang, Jakub Zavrel, Guy De Pauw and Walter Daelemans
CNTS - Computational Linguistics, University of Antwerp

This text is part of the lecture notes for a Perl course taught at CNTS - Computational Linguistics at the University of Antwerp.

3. String processing

In this section we will examine string processing and introduce regular expressions.

3.1. Basic string operations

In Perl, strings are stored in the same type of variables we have used for storing numbers. String values can be written between double quotes or between single quotes. Between double quotes variables are replaced by their values; between single quotes they are not. For example, if $s1 contains the word example, then after $s2 = "$s1$s1" the second string variable will contain exampleexample, while after $s2 = '$s1$s1' it will contain the literal text $s1$s1.
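The difference between the two kinds of quotes can be checked with a few lines. This is only a small sketch; the variable names $double and $single are chosen for this illustration.

```perl
use strict;
use warnings;

my $s1 = "example";

my $double = "$s1$s1";   # double quotes: variables are replaced by their values
my $single = '$s1$s1';   # single quotes: the text is kept exactly as written

print "$double\n";       # prints exampleexample
print "$single\n";       # prints $s1$s1
```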

There are two basic string operators available. The first one is the concatenation operator: $s2 . $s1. It joins two strings by putting the second directly behind the first. The second is the repetition operator, which resembles the multiplication operator: $s x $n. This repeats string $s $n times. Both have a related assignment operator (.= and x=).
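A short sketch of both operators and their assignment forms follows. The strings are made up for this illustration.

```perl
use strict;
use warnings;

my $s1 = "ha";
my $s2 = "ho";

my $joined   = $s2 . $s1;   # concatenation: hoha
my $repeated = $s1 x 3;     # repetition: hahaha

my $line = "-";
$line x= 5;                 # same as $line = $line x 5, giving -----
$line .= ">";               # same as $line = $line . ">", giving ----->

print "$joined $repeated $line\n";
```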

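The comparison operators for strings are different from the number comparison operators. Here is an overview:

   numbers   strings   meaning
   ==        eq        equal
   !=        ne        not equal
   <         lt        less than
   >         gt        greater than
   <=        le        less than or equal
   >=        ge        greater than or equal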

A string consisting of the characters a-z is less than a similar string if it would be placed before that string in a dictionary. Note that the final four operators might behave differently from what you expect if the strings contain characters outside of the range a-z (capital letters, characters with accents and punctuation marks).
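The following sketch shows a few string comparisons, including two cases that may be surprising. It assumes Perl's default behavior, in which strings are compared character by character using their character codes. The variable names are only for illustration.

```perl
use strict;
use warnings;

# each test stores "yes" or "no" so that the result is easy to print
my $dictionary = ("apple" lt "banana") ? "yes" : "no";  # yes: dictionary order
my $capital    = ("Zebra" lt "apple")  ? "yes" : "no";  # yes: capitals come before a-z
my $as_text    = ("10" lt "9")         ? "yes" : "no";  # yes: "1" comes before "9"
my $as_number  = (10 < 9)              ? "yes" : "no";  # no: numeric comparison
my $equal      = ("abc" eq "abc")      ? "yes" : "no";  # yes

print "$dictionary $capital $as_text $as_number $equal\n";
```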

Perl contains a convenient function for determining the length of a string: length. When applied to a number, the function will return the number of characters needed to write that number. For example, length(246) is equal to 3. Note that this function includes newline characters in the count as well.
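A small sketch of the length function, including the effect of a trailing newline character. The function chomp, which removes a trailing newline, is used here to show the difference.

```perl
use strict;
use warnings;

my $word = "example";
my $line = "example\n";

my $len_word   = length($word);   # 7
my $len_line   = length($line);   # 8: the newline is counted as well
chomp($line);                     # remove the trailing newline
my $len_chomp  = length($line);   # 7
my $len_number = length(246);     # 3

print "$len_word $len_line $len_chomp $len_number\n";
```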

3.2. String substitution and string matching

We will often need to change a part of a string. There are two operators available for this. The s/// operator modifies sequences of characters and the tr/// operator changes individual characters. Both operators contain two parts. The first part, between the first two slashes, contains a search pattern and the second part, between the final two slashes, contains the replacement. After the final slash we can put characters that modify the behavior of the operator. By default s/// replaces only the first occurrence of the search pattern; with a g appended it replaces every occurrence. The tr/// operator allows the modification characters c (replace the complement of the search class), d (delete characters of the search class that are not replaced) and s (squeeze sequences of identical replaced characters to one character). Here are a few examples of the two operators:

   # replace first occurrence of "bug"
   $text =~ s/bug/feature/;
   # replace all occurrences of "bug"
   $text =~ s/bug/feature/g;
   # convert to lower case (tr/// takes bare ranges; brackets would be literal characters)
   $text =~ tr/A-Z/a-z/;
   # delete vowels
   $text =~ tr/AEIOUaeiou//d;
   # replace each sequence of non-digit characters with a single x
   $text =~ tr/0-9/x/cs;
   # replace all capital characters by CAPS
   $text =~ s/[A-Z]/CAPS/g;

Note the notation for ranges of successive characters, as in [A-Z]. The operator =~ is new. It is not an assignment operator but a binding operator: we need it to make clear that the operation on the right-hand side should be applied to the variable on the left-hand side.
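To see what the two operators do, we can apply them to copies of a string and inspect the results. The sketch below uses made-up example strings. It also shows that s/// returns the number of replacements it has made.

```perl
use strict;
use warnings;

my $text = "a bug is a bug";

# s/// changes the variable on the left-hand side of =~
my $first = $text;
$first =~ s/bug/feature/;                  # a feature is a bug
my $all = $text;
my $count = ($all =~ s/bug/feature/g);     # a feature is a feature, $count is 2

# tr/// works on individual characters
my $greeting = "Hello World";
(my $lower = $greeting) =~ tr/A-Z/a-z/;               # hello world
(my $no_vowels = $greeting) =~ tr/AEIOUaeiou//d;      # Hll Wrld
(my $digits = "Room 101, Floor 3") =~ tr/0-9/x/cs;    # x101x3

print "$first\n$all ($count)\n$lower\n$no_vowels\n$digits\n";
```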

There is a similar operator available for performing tests on parts of strings: the matching operator: m// or in short //. This operator tests if an expression matches a string. Its most important modification character is i (case-insensitive matching). Here is an example:

   if ($text =~ /danger/i) {                        # any capitalization
      if ($text =~ /DANGER/) { print "DANGER\n"; }   # capitals only
      else { print "danger\n"; }
   }

This time we use =~ as a comparison operator.
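The effect of the i modification character can be checked as follows. The test string is an assumption made for this sketch.

```perl
use strict;
use warnings;

my $text = "Danger: thin ice";

# store 1 when the pattern matches and 0 when it does not
my $any_case = ($text =~ /danger/i) ? 1 : 0;   # 1: capitalization is ignored
my $exact    = ($text =~ /danger/)  ? 1 : 0;   # 0: the text contains Danger, not danger
my $capitals = ($text =~ /DANGER/)  ? 1 : 0;   # 0

print "$any_case $exact $capitals\n";
```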

3.3. Regular expressions

The examples we have seen until now contain fixed strings. However, many problems are difficult to solve with fixed strings only. For example, you have an English text and are asked to write a program for extracting palindromic three-character words with a vowel in the middle. In that case you would need a flexible string matching scheme. Perl and many other computer languages offer this as regular expressions. With one of these expressions you can describe a set of strings. This was made possible by giving some character sequences a special meaning:

The sequences whose names start with a backslash (\) and which match more than one character have related sequences which match the opposite. These sequences have the same names as the positive version but are spelled with a capital letter instead of a lower case letter; for example, \W is the opposite of \w. A note should be made about the \w sequence: it matches a-z, A-Z, 0-9 and the underscore, but whether it matches characters with accents as well depends on the setup of your system.
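One way to get a feeling for these sequences is to replace or delete everything they match. The sketch below does this for \d (digits), \s (white space) and \w (word characters) and their opposites; the example string is made up.

```perl
use strict;
use warnings;

my $text = "Room 12b, 3rd floor";

(my $marked     = $text) =~ s/\d/#/g;   # Room ##b, #rd floor
(my $digits     = $text) =~ s/\D//g;    # 123 (everything that is not a digit is deleted)
(my $underlined = $text) =~ s/\s/_/g;   # Room_12b,_3rd_floor
(my $word_chars = $text) =~ s/\W//g;    # Room12b3rdfloor

# the underscore counts as a word character
my $underscore = ("snake_case" =~ /^\w+$/) ? 1 : 0;   # 1

print "$marked\n$digits\n$underlined\n$word_chars\n$underscore\n";
```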

With these sequences we can create many regular expressions but we need more. We need something for finding a repetition of the same string of arbitrary length, for example ha, haha, hahaha and so on. This is possible with quantifiers. Here are the quantifiers offered by Perl's regular expressions:

The quantifier will operate on the previous token, usually the previous character. For example, ha* will match h, ha, haa, haaa and so on. If we want the operator to match more than one character, we have to include these characters between round brackets. So (ha)+ will match our example laughing strings. When applied to a string, a quantifier will attempt to match as many characters as possible: it is greedy. So if we apply (ha)+ to he said hahaha it will match with hahaha. There are also quantifiers available for matching strings which are as short as possible. They consist of the standard quantifier with a question mark added to it. So when applied to the example string, (ha)+? will initially match ha.
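The difference between greedy and non-greedy matching becomes visible when we replace the matched part by a marker. This is a small sketch with made-up strings; the marker X has no special meaning.

```perl
use strict;
use warnings;

# * applies to the preceding character only
(my $star = "h ha haaa") =~ s/ha*/X/g;              # X X X

# round brackets make the quantifier apply to the whole group
(my $greedy = "he said hahaha") =~ s/(ha)+/X/;      # he said X
(my $lazy   = "he said hahaha") =~ s/(ha)+?/X/;     # he said Xhaha

# ? makes the preceding character optional
(my $optional = "color colour") =~ s/colou?r/X/g;   # X X

print "$star\n$greedy\n$lazy\n$optional\n";
```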

The quantifiers allow us to recognize successive repetitions. However, sometimes we need to be able to refer back to an earlier part of the string that is not immediately before the part we are referring from. Perl regular expressions have a solution for this. We can put the first part between round brackets and refer to it with \p where p is the number which indicates the position of the marked part. For example, if we have marked two parts and we want to refer to the second part, then we should use \2. Now we have enough material for solving the palindromic three-character word question: \b(\w)[aeiou]\1\b. Note: in a matching context the name for the first referred part is \1 but in a replacement context it should be $1.
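The palindrome pattern can be tried out on a sentence. In the sketch below an extra pair of round brackets has been put around the whole word so that it can be stored. Because marked parts are numbered by their opening bracket, the first letter then becomes part 2 and the back reference changes from \1 to \2. The sentence and the variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

my $text = "dad did not see the pup near mom";

# matching context: refer to a marked part with \2
my @found;
while ($text =~ /\b((\w)[aeiou]\2)\b/g) {
   push @found, $1;                  # $1 is the whole word, $2 its first letter
}
print "@found\n";                    # dad did pup mom

# replacement context: refer to marked parts with $1 and $2
(my $swapped = "hello world") =~ s/(\w+) (\w+)/$2 $1/;
print "$swapped\n";                  # world hello
```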

3.4. Programming example

In this programming example we will write a program for extracting URIs (Uniform Resource Identifiers) from web pages. A URI is a description of the location of some resource. Usually it starts with http:// but there are alternatives such as ftp:// and mailto:. A URI can be absolute, for example when it starts with one of the three strings mentioned in the previous sentence, or it can be relative to the current location. Our goal will be to extract both absolute and relative URIs and translate the latter to the former.

When we examine the source of a web page, we will notice that URIs usually occur in an environment preceded by href=" and followed by a double quote. However, not every string in such an environment is a URI; the previous sentence is an example of that. The only proper way to extract URIs is to parse the source of the web page and find the exact spots which contain URIs. This is too much work for our example program. We will make our program assume that every environment described above contains a URI. This means that we accept that our program will occasionally make errors. There are two types of errors the program can make. Sometimes it will return a string which is not a URI and sometimes it will miss a URI.

In order to find out what subtasks this program requires, we construct a small prototype program which performs the task in the simplest way we could think of: read each line and when it contains a URI, print it. Here is the corresponding program:

   while (<>) {
      if (/href=\".*\"/) {

This program will perform the if statement as long as something can be read. We did not specify in what variable the input line should be stored, so Perl will store it in the default variable $_. In the if condition we did not specify what variable should be matched with the pattern either, which means that the default variable will be tested. When the regular expression matches, everything but the URI will be removed from the default variable and the result will be printed. Again the default variable will be processed, because neither the third nor the fourth line contains an argument.
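The prototype can be tried out without a real web page. The sketch below is one possible complete version, not necessarily the one used in the course. It reads from a small made-up page stored in the string $html instead of from <>, and it collects the results in @found. The substitution that isolates the URI is an assumption based on the description above.

```perl
use strict;
use warnings;

# hypothetical sample page; the program in the text reads from <> instead
my $html = <<'END';
<p>See the <a href="exercises.html">exercises</a>.</p>
<p>No link on this line.</p>
<a href="http://www.perl.org/">Perl</a>
END

open(my $in, '<', \$html) or die "cannot read sample page: $!";
my @found;
while (<$in>) {                      # each line is stored in $_
   if (/href=\".*\"/) {              # the line seems to contain a URI
      chomp;                         # remove the newline from $_
      s/.*href=\"(.*)\".*/$1/;       # assumed: keep only the text between the quotes
      push @found, $_;
      print "$_\n";
   }
}
close($in);
```

Because .* is greedy, this version goes wrong when a line contains more than one double-quoted attribute, which is acceptable for a first prototype.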

The little program almost does what we want. One important problem is that it does not expand the relative URIs. In order to make that possible, the program needs to know the URI of the current web page. Usually this is not stored in the source, so we need to supply it separately with an extra read instruction before the loop.

There are two types of relative URIs: those that point from a directory to a file and those that point to a location in the current file. The latter can be identified because they start with a hash mark (#). We will assume that the former start with anything other than a string of letters followed by a colon. The two types need to be treated differently. Before a file pointer we should place the URI of the directory, but before a location pointer we should put both the directory URI and the file name. This is why the URI in the program is split into a directory and a file part:

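   chomp($_ = <>);                       # first input line: the URI of the page
   if (m|(.*/)(.*)|) {                   # split it into a directory and a file part
      $dir = $1;
      $file = $2;
      while (<>) {
         if (/href=\".*\"/) {
            s/.*href=\"(.*)\".*/$1/;     # assumed step: keep only the quoted part
            s/^/$file/ if /^#/;
            s/^/$dir/ if ! /^[a-zA-Z]*:/;
            print;
         }
      }
   }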

This program will produce a perfect result for the course home page. However, it is still only a prototype and we should test it on other web pages. We have tested it on a second page, and four important errors occurred. First, that page contains a third type of relative URI: one pointing from the host name. Our program cannot handle those. Second, sometimes the page contains multiple URI pointers on the same line. The current program combines these in a strange way. Third, the href attribute might be specified in capital letters and our program only recognizes the lower case variant. And fourth, the program included extra attributes after href in the URI. We have updated the program to get rid of these errors:

   chomp($_ = <>);                              # first input line: the URI of the page
   if (m|(.*://)([^/]*)(.*/)([^/]*)|) {
      $xtp = $1;                                # protocol, for example http://
      $host = $2;
      $dir = $3;
      $file = $4;
      while ($line = <>) {
         while ($line =~ /href=\"([^"]*)\"/ig) { # every href on the line, in any case
            $_ = $1;
            s/^/$file/ if /^#/;
            s/^/$dir/ if ! (/^[a-zA-Z]*:/ || /^\//);
            s/^/$xtp$host/ if ! /^[a-zA-Z]*:/;
            print "$_\n";
         }
      }
   }

This program seems to work correctly for the second test page. We should continue testing it and perhaps expanding it, but we will leave the program as it is right now. Note that the basic commands of the initial program are still present in this version. The more we test the program, the less frequent its errors will become. We can continue adding extensions, but each extension will deal with a less frequent problem.
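The three rewriting steps of the final program can also be tested in isolation, without reading a web page. The sketch below applies them to a list of made-up links. The page address in $base, the links in @links and the array @absolute are assumptions made for this illustration.

```perl
use strict;
use warnings;

my $base  = "http://www.example.org/course/chapter3.html";   # hypothetical page URI
my @links = ("#top", "exercises.html", "/index.html",
             'mailto:someone@example.org', "ftp://ftp.example.org/pub/");

my @absolute;
if ($base =~ m|(.*://)([^/]*)(.*/)([^/]*)|) {
   my ($xtp, $host, $dir, $file) = ($1, $2, $3, $4);
   for my $link (@links) {
      $_ = $link;                                  # work on a copy in $_
      s/^/$file/ if /^#/;                          # location in the current file
      s/^/$dir/ if ! (/^[a-zA-Z]*:/ || /^\//);     # file relative to the directory
      s/^/$xtp$host/ if ! /^[a-zA-Z]*:/;           # anything that is still relative
      push @absolute, $_;
   }
}
print "$_\n" for @absolute;
```

With this input the first three links are expanded to http://www.example.org/course/chapter3.html#top, http://www.example.org/course/exercises.html and http://www.example.org/index.html, while the mailto: and ftp:// links are left unchanged.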

Last update: February 19, 2000.