From: Subject: Perl Solutions (5) Date: Thu, 6 Feb 2003 12:01:58 +0100 MIME-Version: 1.0 Content-Type: text/html; charset="windows-1250" Content-Transfer-Encoding: quoted-printable Content-Location: http://lcg-www.uia.ac.be/~erikt/perl/so05.html X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 Perl Solutions (5)

Previous | = Home | Lecture = Notes | Exercises | = Next=20

 

Perl Solutions (5)


These exercise solutions are part of a Perl course taught at CNTS -=20 Computational Linguistics at the University of Antwerp.=20

The programs corresponding with these exercises can be found in the = appendix.=20

Exercise 5.1 (set difference and intersection)

Write a program that reads two lines of words (further called A = and B),=20 stores the words from each line in separate hash, and then prints a = sorted list=20 of words marked with the following symbols: "A" (occurs only in line A), = "B"=20 (occurs only in B), and "AB" (occurs both in A and B).

To do this, we must read two lines and split them into words. We put = each of=20 the lines in a hash. There are many ways to print the appropriate = symbols, but=20 in the solution given below we choose to merge the two hashes first. = Then we can=20 iterate over all the words that are in the merge and print an A if they = are in=20 the first line, and a B if they are in the second line. Because we = append the=20 two results together, we get the intersection symbol "AB" for free.=20

Exercise 5.2a (word for word translation)

Write a program that stores a list of word translations between = two=20 languages in a hash, and translates in both directions in a word for = word=20 fashion. Make a small lexicon of about twenty words that allows you to = translate=20 some simple sentences.

We get the reverse translation by reversing the first translation = hash. Then=20 we read lines, split them into words, and look up each word in the = hashes. For=20 unknown words we print the original string between brackets. It is a = word for=20 word translation with little elegance.=20

Exercise 5.2b*

Extend the translation program from the previous exercise to = handle=20 ambiguous words. Hint: use multidimensional hashes and a few simple = rules that=20 look at the context of a word.

We indicate the senses of a word in the dictionary with a "_number" = suffix.=20 If we do not know a word's direct translation, we first look it up in a = hash of=20 ambiguous words and decide on the translation on the basis of a very = simple=20 heuristic, looking at the word to the right of it. If nothing matches we = give=20 the DEFAULT sense. Of course this is not how you want to do some serious = translation, but there are in fact products which do little more.=20

Exercise 5.3 (bigram statitics)

Write a program that takes a chunk of text as input, and = outputs a list=20 of letter bigrams and unigrams from that text together with their = frequency,=20 reverse sorted by alphabet. Use hashes! A unigram is a single letter = character,=20 a letter bigram is a sequence of two adjacent characters. E.g. "bigram" = contains=20 the bigrams "bi ig gr ra am". Ignore case and whitespace.

The program below is a straightforward extension of last week's = solution with=20 hashes subsituted for lists.=20

Exercise 5.4* Bigram language model

Extend the program from exercise 5.3 to compute the probability = of a=20 word, assuming the probability of a word is the product of all letter = bigram=20 transition probabilities, as given in the following formula:=20

P(Word) =3D product_i P(Char_i | Char_i-1)=20

The bigram transition probabilities for a bigram xy are = defined as:=20 P(y|x) =3D freq(xy)/freq(x). Where freq(xy) is the frequency of = a bigram=20 xy, and freq(x) that of unigram x. Your = program=20 should first read some amount of text to estimate the probabilities and = then ask=20 for words and compute their probabilities.

To make a language model we also need transition probabilities from = the start=20 of a word to its first letter, so we need to keep track of those as = well. Once=20 we have all probabilities, the value for a word is a simple product. = Note that=20 an unseen bigram will produce a zero term, and we want to avoid this, so = we add=20 one to all frequencies. The solution below is stolen from Gert Durieux = :-)=20

Appendix

Exercise 5.1

# exercise5.1.pl
# 2000-03-10 zavrel@uia.ua.ac.be

print "Input line A:";
$line_A =3D <>;                     # read A
@linea  =3D split /\s+/, $line_A;   # get the words
while($word =3D pop @linea){        # iterate over words
    if($word =3D~ /\w/){            # filter out all non-words
       $hash_A{$word}=3D1;          # put into hash
    }
}

print "Input line B:";            # same as A
$line_B =3D <>;
@lineb  =3D split /\s+/, $line_B;
while($word =3D pop @lineb){
    $hash_B{$word}=3D1;
}

%hash_merge =3D ( %hash_A , %hash_B ); # do the merge

foreach $key ( sort keys %hash_merge ){

    if(exists($hash_A{$key})){
	print "A";
    }
    if(exists($hash_B{$key})){
	print "B";
    }   =20
    print "\t$key\n";
}

Exercise 5.2

# exercise5.2.pl
# 2000-03-10 zavrel@uia.ua.ac.be

%dutch2english =3D (
		  Jan =3D> "John",
		  ziet =3D> "sees",
		  Marie =3D> "Mary",
		  de =3D> "the",
		  kinderen =3D> "children"
		  );

%english2dutch =3D reverse %dutch2english;

do{

    print "Give a sentence in Dutch or English for translation:\n> ";
    $line =3D <>;
    if($line =3D~ /\w/){
=09
	$line =3D~ s/^\s*//;
	$line =3D~ s/\s*$//;

	@words =3D split /\s+/, $line;
=09
	foreach $word ( @words ){
	    if(exists($dutch2english{$word})){
		print "$dutch2english{$word} ";
	    }
	    elsif(exists($english2dutch{$word})){
		print "$english2dutch{$word} ";
	    }
	    else{
		print "[$word] "; # unknown
	    }
	}
	print "\n";
    }
}while($line =3D~ /\w/);

Exercise 5.2b

# exercise5.2b.pl
# 2000-03-10 zavrel@uia.ua.ac.be

%dutch2english =3D (
		  Jan =3D> "John",
		  ziet =3D> "sees",
		  Marie =3D> "Mary",
		  kinderen =3D> "children",
		  boek =3D> "book",
		  de =3D> "the_1",   # the subscripts denote various senses
		  het =3D> "the_2",
		  haar_1 =3D> "hair",
		  haar_2 =3D> "her",
		  );

%ambiword_english =3D (
		     the =3D> {=20
			      DEFAULT =3D> "de",  # default sense
			      book =3D> "het"     # simple look-right heuristics
			      }
		     );

%ambiword_dutch =3D (
		   haar =3D> {=20
		       DEFAULT =3D> "hair",
		       kinderen =3D> "her" ,
		       boek =3D> "her",
		       haar =3D> "her",
		       }
		   );

%english2dutch =3D reverse %dutch2english;

do{

    print "Give a sentence in Dutch or English for translation:\n> ";
    $line =3D <>;
    if($line =3D~ /\w/){
=09
	$line =3D~ s/^\s*//;
	$line =3D~ s/\s*$//;

	@words =3D split /\s+/, $line;

	$position =3D 0;
	foreach $word ( @words ){
	    if(exists($dutch2english{$word})){
		$trans =3D $dutch2english{$word};
		$trans =3D~ s/_\d+//; # strip ambiguity marker
		print "$trans ";
	    }
	    elsif(exists($english2dutch{$word})){
		$trans =3D $english2dutch{$word};
		$trans =3D~ s/_\d+//;=20
		print "$trans ";
	    }
	    else{

		# look up whether it's not one of these
		#
		if(exists($ambiword_dutch{$word})){

		    # resolve the ambiguity if possible
		    if(exists($ambiword_dutch{$word}{$words[$position+1]})){
			$trans =3D $ambiword_dutch{$word}{$words[$position+1]};
			print "$trans ";
		    }
		    else{
			$trans =3D $ambiword_dutch{$word}{"DEFAULT"};
			print "$trans ";
		    }
		}
		# or one of those
		elsif(exists($ambiword_english{$word})){

		    # resolve the ambiguity if possible
		    if(exists($ambiword_english{$word}{$words[$position+1]})){
			$trans =3D $ambiword_english{$word}{$words[$position+1]};
			print "$trans ";
		    }
		    else{
			$trans =3D $ambiword_english{$word}{"DEFAULT"};
			print "$trans ";
		    }		   =20
		}
		else{
		    print "[$word] "; # unknown
		}
	    }
	    $position++;
	}
	print "\n";
    }
}while($line =3D~ /\w/);



Exercise 5.3

# exercise5.3.pl:=20
# 2000-03-10 zavrel@uia.ua.ac.be

print "Give some input text:";

while(defined($line =3D <>)){

    chomp $line;
    $line =3D lc($line);
    $line =3D~ s/^\s*//;
    $line =3D~ s/\s*$//;

    @wordline =3D split /\s+/, $line;
    foreach $word (@wordline){
=09
	@letterline =3D split //, $word;
   =20
	for($i=3D0; $i <=3D $#letterline; $i++){
	    $unigramfrequency{$letterline[$i]}++;
	    if($i < $#letterline ){
		$bigram =3D $letterline[$i] . $letterline[$i+1];
		$bigramfrequency{$bigram}++;
	    }
	}
    }
}

print "Unigrams frequencies:\n";
foreach $key (sort { $unigramfrequency{$b} <=3D> =
$unigramfrequency{$a} } keys %unigramfrequency){
    print "$key $unigramfrequency{$key}\n";
}
print "Bigrams frequencies:\n";
foreach $key (sort { $bigramfrequency{$b} <=3D> =
$bigramfrequency{$a} }keys %bigramfrequency){
    print "$key $bigramfrequency{$key}\n";
}

Exercise 5.4*

# Exercise 5.4 -- bigram language mode
#
# usage:	perl exc54.pl textfile
#
# (c) Gert Durieux


# read textfile from command line
while (<>) {
	@line =3D map lc, split //, $_;
	for ($i =3D 0; $i < @line; $i++) {

		# ignore spaces
		if ($line[$i] =3D~ /\S/) {
			$unigram =3D $line[$i];
			$unigrams{$unigram}++;
			if ($line[$i+1] =3D~ /\S/) {
				$bigram =3D $line[$i] . $line[$i+1];
				$bigrams{$bigram}++;
			}
		}

		# but keep track of word beginnings
		if ($i+1 < @line) {
			$bigram =3D $line[$i] . $line[$i+1];
			if ($bigram =3D~ /^.\b/) {
				$unigram =3D "^";
				$unigrams{$unigram}++;
				$bigram =3D "^" . $line[$i+1];
				$bigrams{$bigram}++;
			}
		}

	}
}

# process user input
print "Enter some words, line by line. Finish with ^D\n> ";
while (<>) {
	$line =3D $_;
	$line =3D~ s/^\s+//;
	$line =3D~ s/\s+$//;

	@words =3D split /\s+/, $line;
	foreach $word (@words) {

		# split words into letters and add BOS
		@letters =3D map lc, split //, $word;
		@letters =3D ("^", @letters);

		$probability =3D 1;
		for ($i =3D 0; $i < $#letters; $i++) {

			$unigram =3D $letters[$i];
			$bigram =3D $letters[$i] . $letters[$i+1];

			# smooth by adding 1 to each uni- and bigram freq
			if (exists($unigrams{$unigram})) {
				$ufreq =3D $unigrams{$unigram} + 1;
			} else {
				$ufreq =3D 1;
			}

			if (exists($bigrams{$bigram})) {
				$bfreq =3D $bigrams{$bigram} + 1;
			} else {
				$bfreq =3D 1;
			}

			$probability *=3D $bfreq / $ufreq;
		}
		print "$probability\n> ";
	}
}

print "\n";
exit(0);


Previous = | Home | Lecture = Notes | Exercises | = Next=20
Last update: March 10, 2000. zavrel@uia.ua.ac.be=20