Previous | Home | Exercises | PDF slides | Next

Perl 2007: Lesson 3

This text is part of the lecture notes for a programming course taught at University of Tilburg, The Netherlands.

3. String processing

In this section we will examine string processing and introduce regular expressions.

3.1. Basic string operations

In Perl, strings are stored in the same type of variables we have used for storing numbers. String values can be specified between double and single quotes. In the first one variables will be evaluated, in the second one they will not. So for example, if $s1 contains the word example then after $s2 = "$s1$s1" the second string variable will contain exampleexample while after $s2 = '$s1$s1' it would have contained $s1$s1.

There are two basic string operators available. The first one is the concatenation operator: $s2 . $s1. It puts two strings behind each other. The second is similar to the multiplication operator: $s x $n. This repeats string $s $n times. Both have a related assignment operator (.= and x=).

The comparison operators for strings are different from the number comparison operators. Here is an overview:

eq: equal
ne: not equal
lt: less than
le: less than or equal to
gt: greater than
ge: greater than or equal to

A string consisting the characters a-z is less than a similar string when it would be placed before that string in a dictionary. Note that Perl makes a difference between "A" and "a" (capital characters are first in the alphabet) and that the order of non-alphabetic characters and characters with accents is nontrivial (see for example the ISO 8859-1 table).

Perl contains a convenient function for determining the length of a string. When applied to a number, function will return the number of digits in that number. For example, length(246) is equal to 3. Note that this functions includes the newline characters in the count as well.

3.2. String substitution and string matching

We will often need to change a part of a string. There are two operators available for this. The s/// operator modifies sequences of characters and the tr/// operator changes individual characters. Both operators contain two parts. The first part between the first two slashes contains a search pattern and the second part between the final two slashes contains the replacement. Behind the final slash we can put characters for modifying the behavior of the commands. By default s/// only replaces the first occurrence of the search pattern and by appending a g to the operator it will replace every occurrence. The tr/// operator allows the modification characters c (replace the complement of the search class), d (delete characters of the search class that are not replaced) and s (squeeze sequences of identical replaced charters to one character). Here are a few examples of the two operators:

   # replace first occurrence of "bug"
   $text =~ s/bug/feature/;
   # replace all occurrences of "bug"
   $text =~ s/bug/feature/g;
   # convert to lower case
   $text =~ tr/A-Z/a-z/;
   # delete vowels
   $text =~ tr/AEIOUaeiou//d;
   # replace nonnumber sequences with x
   $text =~ tr/0-9/x/cs;
   # replace all capital characters by CAPS
   $text =~ s/[A-Z]/CAPS/g;

Note the notation for sequences of successive characters like for example in [A-Z]. The assignment operator =~ is new. We need a special operator to make clear that the right-hand part of the operator should be applied to the left-hand part. The operation specified to the right of =~ will be applied to the variable to the left and the result will be stored in the same variable.

There is a similar operator available for performing tests on parts of strings: the matching operator: m// or in short //. This operator tests if an expression matches a string. Its most important modification character is i (case-insensitive matching). Here is an example:

   if ($text =~ /danger/i) {
      if ($test =~ /DANGER/) {
         ...

This shows that =~ can also be used as a comparison operator.

3.3 Regular expressions

The examples we have seen until now contain fixed strings. However, many problems are difficult to solve with fixed strings only. For example, you have an English text and are asked to write a program for extracting palindromic three-character words with a vowel in the middle (dad, did, gig, mom, ...). In that case you would need a flexible string matching scheme. Perl and many other computer languages offer this as regular expressions. With one of these expressions you can describe a set of strings. This was made possible by giving some character sequences a special meaning:

\b: word boundaries
\d: digits
\n: newline
\r: carriage return
\s: white space characters
\t: tab
\w: alphanumeric characters
^: beginning of string
$: end of string
.: any character
[bdkp]: characters b, d, k and p
[a-f]: characters a to f
[^a-f]: all characters except a to f
abc|def: string abc or string def

The sequences which match more than one character and of which the names start with a backslash (\), have related sequences which matches the opposite. These sequences have the same names as the positive version but spelled with a capital character instead of a lower case character. A note should be made about the \w sequence: it matches a-zA-Z0-9 but whether it matches characters with accents as well depends on the setup of your operating system.

With these sequences we can create many regular expressions but we need more. We need something for finding a repetition of the same string of arbitrary length, for example ha, haha, hahaha and so on. This is possible with quantifiers. Here are the quantifiers offered by Perl's regular expressions:

*: zero or more times
+: one or more times
?: zero or one time
{p,q}: at least p times and at most q times
{p,}: at least p times
{p}: exactly p times

The quantifier will operate on the previous token, usually the previous character. For example, ha* will match h, ha, haa, haaa and so on. If we want the operator to match more than one character, we have to include these characters between round brackets. So (ha)+ will match our example laughing strings. When applied to a string, a quantifier will attempt to match as many characters as possible: it is greedy. So if we search for (ha)+ to he said hahaha it will match with hahaha. There are also quantifiers available for matching strings which are as short as possible. They consist of the standard quantifier with a question mark added to it. So when applied to the example string, (ha)+? will initially match ha.

The quantifiers allow us to recognize successive repetitions. However, sometimes we need to be able to refer back to an earlier part of the string that is not immediately before the part we are referring from. Perl regular expressions have a solution for this. We can put the first part between round brackets and refer to it with \p where p is the number which indicates the position of the marked part. For example, if we have marked two parts and we want to refer to the second part, then we should use \2. Now we have enough material for solving the palindromic three-character word question: \b(\w)[aeiou]\1\b. Note: in a matching context (like if (...)) the name for the first referred part is \1 but in a replacement context ($a =~ ...) it should be $1.

3.4. Programming example

In this programming example we will write a program from extracting URIs (Uniform Resource Identifiers) from web pages. A URI is a description of a location of some resource, for example a web page. Usually it starts with http:// but there are alternatives such as ftp:// and mailto:. A URI can be absolute, for example when it starts with one of the three strings mentioned in the previous sentence (http://uvt.nl), or it can be relative to the current location (ex03.html). Our goal will be to extract both absolute and relative URIs and translate the latter to the former.

When we examine the source of a web page, we will notice that URIs usually occur in an environment preceded by href=" and followed by a double quote ("). Not every string in such an environment is a URI; the former sentence is an example of this. The best method for extracting URIs is to parse the source of the web page and find the exact spots which contain URIs. This is too much work for our example program. Therefore we will just use href=" as an indicator for the start of a URI. We accept that our program occasionally will make identification errors. There are two types of errors the program can make. Sometimes it will return a string which is not a URI (false positive) and sometimes it will fail to recognize a URI (false negative).

In order to find out what subtasks this program requires, we construct a small proto-type program which performs the task in the simplest way we could think of: read each line and when it contains a URI, print it. Here is the corresponding program:

   # read lines of html
   while (<STDIN>) {
      # store the current line in $line
      $line = $_;
      chomp($line);
      # is there a URI in this line?
      if ($line =~ /href="(.*)"/) {
         # yes: put URI in \$uri and print the result
         $uri = $1;
         print "$uri\n";
      }
   }

This program will perform commands as long as something can be read from STDIN. We did not specify in what variable the input line should be stored and so Perl will store it in the default variable $_. The first thing we do, is to copy the contents of $_ to the variable $line and remove the newline character from $line with chomp. In the if condition we check if $line contains href=" followed by another double quote. When this is the case, everything but the URI will be removed from $line and the result will be printed.

The little program almost does what we want. One important problem is that it does not expand the referential URIs. In order to make that possible, the program needs to know the URI of the current web page. Usually this is not stored in the source, so we need to supply it separately with an extra read instruction before the loop.

There are two types of referential URIs: those that point from a directory to a file and those that point to a location in the current file. The latter can be identified because they start with a hash mark (#). We will assume that the first one starts with anything else except with a string of characters followed by a colon. The two referential URIs need to be treated differently. If we find a file URI, the current directory should be put in front of it. If we find a location URI (starting with #), we should put both the current directory and the current file in front of it. This is why the URI in the program is split into a directory and file part:

   # read the address of this web page
   $address = <STDIN>;
   chomp($address);
   # split the address in directory and file
   if ($address =~ /(.*\/)(.*)/) {
      $dir = $1;
      $file = $2;
   } else {
      die "error: cannot parse URI $address\n";
   }
   # read lines of html
   while (<STDIN>) {
      # store the current line in $line
      $line = $_;
      chomp($line);
      # is there a URI in this line?
      if ($line =~ /href=\".*\"/) {
         # yes: remove the part before and after the URI
         $line =~ s/.*href=\"(.*)\".*/$1/;
         # add the current file name to location URIs
         if ($line =~ /^#/) { $line =~ s/^/$file/; }
         # add the current directory name to file URIs
         if ($line !~ /^[a-zA-Z]*:/) { $line =~ s/^/$dir/; }
         # print the result
         print "$line\n";
      }
   }

This program will produce a perfect result for the course home page. However, it is still only a proto-type and we should test it on other web pages. We have tested it on http://www.uvt.nl/. Three important errors occurred. First, this page contains a third type of referential URI: pointing from the host name (/internet/newscentre/). Our program cannot handle those. Second, sometimes the page contains multiple URI pointers on the same line. The current program combines these in a strange way. And third, the program included extra attributes after href in the URI. We have updated the program to get rid of these errors:

   # read the address of this web page
   $address = <STDIN>;
   chomp($address);
   # split the address in directory and file
   if ($address =~ /(.*:\/\/[^\/]*)(.*)([^\/]*)/) {
      $host = $1;
      $dir = $2;
      $file = $3;
   } else {
      die "error: cannot parse URI $address\n";
   }
   # read lines of html
   while (<STDIN>) {
      # store the current line in $line
      $line = $_;
      chomp($line);
      # is there URIs in this line?
      while ($line =~ /href="([^"]*)"/g) {
         # yes: put the uri in $uri
         $uri = $1;
         # add the current file name to location URIs
         if ($uri =~ /^#/) { $uri =~ s/^/$file/; }
         # add the current directory name to file URIs
         if ($uri !~ /^[a-zA-Z]*:/ and $uri !~ /^\//) { $uri =~ s/^/$dir/; }
         # add the current host name to directory URIs
         if ($uri !~ /^[a-zA-Z]*:/) { $uri =~ s/^/$host/; }
         # print the result
         print "$uri\n";
      }
   }

This program seems to work correctly for the second test page. We should continue testing and expanding the program but we will leave it as it is right now. Note that the basic commands of the initial program are still present in this version. The more we will test the program, the more infrequent its errors will be. We can continue adding extensions but the extensions will deal with less and less frequent problems.

Previous | Home | Exercises | PDF slides | Next

Last update: September 25, 2007. erikt(at)science.uva.nl