Previous | Home | Exercises | Next

A Shortcut to Perl

Erik Tjong Kim Sang, Jakub Zavrel, Guy De Pauw and Walter Daelemans
CNTS - Computational Linguistics, University of Antwerp
http://lcg-www.uia.ac.be/~erikt/perl/

This text is part of the lecture notes for a Perl course taught at CNTS - Computational Linguistics at the University of Antwerp.

5. Hashes (Associative arrays)

In this section we introduce one of the most powerful data types in Perl, the hash. Another (more descriptive) name for this data type is associative array. Hashes are very similar to arrays, in that they are collections of scalar values that can be accessed through an index. But unlike in arrays, the index, or key, can be any type of scalar, for example a string. You can see hashes as a one-directional mapping function from scalars to scalars (or, in some contexts, as an unordered bag of unique associated key-value pairs). A hash variable is always written with a percentage sign (%) before its name, instead of respectively the dollar ($) for scalar variables, and the at sign (@) for array variables. Again, homonymous unrelated hashes, arrays and scalars may peacefully coexist in Perl, but they are likely to confuse you.

The internal storage mechanism for hashes is designed to allow very fast lookup using so called hash functions, hence the name of this data type (But there's no need to understand this mechanism to work with hashes in Perl). This allows for almost instantaneous lookup of a value that is associated with some particular key. For example, think of words and their dictionary entries, categories and their frequencies, people and their phone numbers, etc. To access a hash-value by its key, we use the curly bracket notation: $hash{$key}. Remember, that hashes are in essence a one-directional mapping! It requires some effort to retrieve a key that belongs to a particular value, just like it is tedious to find the name of a person with a particular phone number in a phone book.

Existing, Defined and true. If the value for a key does not exist in the hash, the access to it returns the undef value. You can use the special test function exists(HASHENTRY), which returns true if the hash key exists in the hash, whereas, if the hash entry had been created by $hash{$key} = undef, a simple test such as if($hash{$key}){...}, or if(defined($hash{$key})){...} would return false.

5.1. How to put information in hashes?

By referencing a hash value, e.g. $hash{$key}, we do not create it (unlike in AWK); it springs into existence only after we assign some value to it. For example:

$wordfrequency{"the"} = 12731;     # creates key "the", value 12731
$phonenumber{"An De Wilde"} = "+31-20-6777871";
$index{$word} = $nwords;
$occurrences{$a}++;                # if this is the first reference,
                                   # the value associated with $a will 
				   # be increased from 0 to 1

There is no literal representation for the contents of a hash, so if we want to read or fill an entire hash at once, we must make use of the fact that the key value pairs in a hash can be converted to and created from a list. This is called unwinding the hash. The conversion takes the odd-numbered items from the list as keys and the even-numbered to the right of them as values. So we can create a hash as follows:

# fill the hash 
%birthdays = ("An","25-02-1975","Bert","12-10-1953","Cindy","23-05-1969","Dirk","01-04-1961");  
@list = %birthdays; # make a list of the key/value pairs
%copy_of_bdays = %birthdays;

Note that the conversion to and from a hash does not preserve the order of the list, since a hash has no well-defined internal ordering. You should never rely on any particular ordering. Note also that when hashes are copied, as in the last example above, there are in fact two conversions after another.

A different notation that is often used to fill a hash, uses the => operator. This operator is similar to a comma, but it fits the unidirectional semantics of the hash better. It also quotes bare identifiers to the left of it:

%birthdays = (                   # this does the same thing as above.
        An   => "25-02-1975",
	Bert => "12-10-1953",
	Cindy => "23-05-1969",
	Dirk => "01-04-1961"
	);

5.2. Operations on hashes

We have already seen that we can unwind the hash by using it in a list context. By the way, if we use a hash in a scalar context we can test whether it contains any keys at all, e.g.if(! %hash){ print "No entries\n"; }. If we want to impose a fixed order, we can sort the result, but this will mix both keys and values in the sort order. This is where the keys and values functions come in handy. They function as their name suggests:

keys HASH returns a list with only the keys in the hash. As with any list, using it in a scalar context returns the number of keys in that list.

values HASH returns a list with only the values in the hash, in the same order as the keys returned by keys.

Now we can make a list which is sorted from the hash:

foreach $key ( sort keys %hash ){
   push @sortedlist, ( $key , $hash{$key} );
   print "Key $key has value $hash{$key}\n";
}

This example also shows that individual hash items can be interpolated in double-quoted strings. There is no way to interpolate the entire hash, however. You will need to iterate over it or unwind it into a list.

You can also use unwinding of a hash to reverse the direction of the mapping, i.e. to construct a hash with keys and values swapped:

%backwards = reverse %forward;

Of course, if %forward has two identical values (associated with different keys, of course), those will end up as only a single element in %backwards, because a hash only has one entry for each unique key, so this is best performed only on hashes with unique keys and values.

Just as we can take sublists of arrays using slices, we can also select a range of values from a hash using a so called hash slice. A hash slice can also be assigned to as in the following example (please take notice of the use of the at sign as a prefix, because the slice is a list):

@birthdays{"An","Bert","Cindy","Dirk"} = ("25-02-1975","12-10-1953","23-05-1969","01-04-1961");

Hash slices can also be used to merge two hashes into a larger one. In the following example, the values of the %mybirthdayhash overwrite the values that were present in %birthdays in the case of a duplicate.

@birthdays{keys %mybirthdayhash} = values %mybirthdayhash;

All values that have a key in %mybirthdayhash are merged into the %birthdays hash, using a hash slice. This is equivalent to the much slower operation (which requires a number of conversions):

%birthdays = (%birthdays, %mybirthdayhash);

To iterate over a hash, we can access it using keys, and look up the value associated with each returned key. A more efficient way is to use the function each( HASH ), which returns a key/value pair as a two-element list. Each evaluation of this function for the same hash returns the next successive key-value pair until all the elements have been accessed. When there are no pairs left, each returns an empty list. For example:

while (($name,$date) = each(%birthdays)) {
    print "$name's birthday is $date\n";
}

Assigning a new value to the entire hash that is the subject of the iteration resets the each function to the beginning. Adding or deleting elements of the hash while iterating with each is quite likely to confuse it (and possibly you as well). If you want to do this, use the construction:

foreach $key ( keys %hash ){
    if($del_condition){
       delete $hash{$key};
    }
    if($add_condition){
       $hash{$something} = $somethingelse;
    }
}

because the call of the keys function will result in a fixed list to iterate over. This example also introduced the last hash related operation we will discuss here, the delete( HASHENTRY ) function. delete() takes a single hash reference as an argument, and, as its name suggests, deletes it from the hash, so that a subsequent call of exists() on the same reference yields false.

5.3 Multidimensional data structures

Sometimes you need to make your data structures more complex than just arrays or hashes. For example, if you want to store a matrix of bigram frequencies (such as in Exercise 4.4) , or transition probabilities between letters, you would ideally want to use a two-dimensional array or hash. Perl does not really have multi-dimensional data structures, but it has a nice way of emulating them that's hardly distinguishable from the real thing. The secret of this emulation lies in the fact that apart from strings and numbers a scalar value can also be a reference. This is analogous to pointers in other languages. Thus you can emulate a two-dimensional array by storing references to (one-dimensional) arrays in an array. The same goes for hashes. The subject of references is too complicated to fully cover here (consult your book if you are interested), but you should know that you can work with notation like:

$matrix[$i][$j] = $x;
$lexicon1{"word"}[1] = $partofspeech;
$lexicon2{"word"}{"noun"} = $frequency;

@matrix = (             # an array of references to anonymous arrays
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
); 

%lexicon1 = (            # a hash from strings to anonymous arrays
    the => [ "Det", 12731 ],
    man => [ "Noun", 658 ],
    with => [ "Prep", 3482 ]
);

%lexicon2 = (            # a hash from strings to anonymous hashes of strings to numbers
    the => { Det => 12731 },
    man => { Noun =>  658 , Verb => 12 },
    with => { Prep => 3482 }
);

Using this notation you are in fact using shorthands for the underlying reference representation. The embedded "[" and "{" create a reference to a so called anonymous array. For example $matrix[0][0] really is another way of writing: $matrix[0]->[0]. But the fact that we haven't discussed the full functionality of references in Perl should not stop us from doing useful stuff with it. Pay attention, however, that often when you get an item from a complex structure with a function like pop or values, the resulting value is a reference, not a string or number.

5.4. Programming example

Sometimes you need to store strings and have a unique numerical identifier for them. In this example we write a program that reads lines of text, gives a unique index number to each word and counts the word frequencies. In the end we print the list of words in alphabetical order, together with their index number and frequency. For both counting and indexing, the hash is the ideal data structure: in a single step you can look up whether you know a word, and what its previous frequency was.

#!/usr/local/bin/perl  # only needed on Unix systems

# example.5.4: perform task for programming example 5.4
# usage: example.5.4
# 2000-01-03 zavrel@uia.ua.ac.be

# read all lines in the input
#
$nwords = 0;
while(defined($line = <>)){                 
    
    # cut off leading and trailing whitespace
    #
    $line =~ s/^\s*//;
    $line =~ s/\s*$//;
    
    # and put the words in an array
    #
    @words = split /\s+/, $line;             
    if(!@words){  # there are no words?
       next;
    }
    
    # process each word...
    #
    while($word = pop @words){

       # if it's unknown assign a new index
       #
       if(!exists($index{$word})){
          $index{$word} = $nwords++;
       }
       # always update the frequency
       $frequency{$word}++;
    }
}

# now we print the words sorted
#
foreach $word ( sort keys %index ){
   print "$word has frequency $frequency{$word} and index $index{$word}\n";
}

If we would like to have the words sorted by their frequency instead of by alphabet, we need a construct that imposes a different sort order. The sort function can use any sort order that is provided as an expression to it. The usual alphabetical sort order is in fact a shorthand notation for:

sort { $a cmp $b } @list;

Where $a and $b are placeholders for the two items from the list that are to be compared. The comparison function should return a negative number if $a ought to appear before $b in the output list, 0 if they're the same and their order doesn't matter, or a positive number if $a ought to appear after $b. The operator cmp does just this. We can achieve a numerical sort order (so that e.g. 7 comes before 11), by changing the comparison operator to the numerical "<=>" operator. In fact we might supply sort with any operator or function we like.

sort { $a <=> $b } @list;

To get a reverse sort, we change the order of the arguments:

sort { $b <=> $a } @list;

And in order to sort the keys of a hash by their value instead of by their own identity, we substitute those values for the arguments of sort:

sort { $hash{$b} <=> $hash{$a} } ( keys %hash )

Previous | Home | Exercises | Next

Last update: March 1st, 2000. zavrel@uia.ua.ac.be