#!/bin/ksh
# messUp reads a text from standard input, corrupts it, and writes the
# result to standard output. This script can be used for creating
# corrupted texts which can serve as tests for error-correcting
# programs. Invoke this program as messUp < InputFile > OutputFile
# Erik Tjong Kim Sang (erikt@strindberg.ling.uu.se)  
# Wed Oct  4 16:15:37 DFT 1995
#
# Note that all lines in this file that start with # are comment lines
#
# This program will mess up a text according to some error model. The
# program that will simulate the error model is awk. For awk things will
# work best if the input is presented one character at a time (one
# character per line). At least that worked best for me! So we will
# first have to split the input over multiple lines.
#
# addNewlines is a little C program that will do the job. If you present
# it "test" as input it will return "t[nl]e[nl]s[nl]t[nl]" in which [nl]
# is the newline character. So actually it will return four lines,
# each containing one character.
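#
# If you do not have addNewlines at hand, the splitting step can be
# sketched in awk. This is only an assumption about how addNewlines
# behaves, not its actual source: every character goes on its own
# line and the original end of line becomes an empty line.
#
# awk '{ n = length($0)
#        for (i = 1; i <= n; i++) print substr($0, i, 1)
#        print "" }'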
#
# addNewlines reads input from the standard input. Therefore you should
# put your text in some input file and invoke this messUp program as
#
# messUp < someInputFile 
#
# Here the '<' means use someInputFile as the input of the messUp program.
#
./addNewlines |\
#
# Note the | at the end of the line. This is called a pipe. It is a UNIX
# shell construct which enables commands to give their output to other
# commands. You could for example do something like:
#
# messUp < InputFile | messUp | messUp
#
# This means: run messUp with InputFile as input and give the output
# to another copy of the messUp program which will process the
# corrupted text and present its output to yet another copy of the
# messUp program. You can make this chain as long as you want.
#
# We will give the output of addNewlines to awk. This is a string
# processing program. The program reads a line and performs actions
# which depend on the format of the line. The general structure of
# a rule in an awk program is:
#
# /Pattern/ { Action }
#
# If awk finds Pattern in a line of input text, it will perform
# Action. Our awk rules are simple. The first rule (BEGIN...)
# initializes the random number generator (don't bother about that).
# The second rule states that if the line contains [ae], which is
# either an 'a' or an 'e', the performed action will be to print
# either an 'a' (probability 50%) or an 'e' (probability 50%).
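#
# A small illustration of the Pattern/Action structure (a sketch you
# can try at the prompt, not part of messUp itself): the command
#
# printf 'a\nx\ne\n' | awk '/[ae]/ { print "vowel: " $0 }'
#
# prints "vowel: a" and "vowel: e". The line "x" does not match the
# pattern, so no action is performed for it.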
# 
# Remember that our text now contains one character per line. What
# this awk program will do is replace certain characters with
# other characters in 50% of the cases. Remember the error model
# used by Yves Schabes and Mark Liberman that I have presented in
# class. This error model is similar but contains fewer replacement
# possibilities.
#
awk '
     BEGIN  { srand() }
     /[ae]/ { 
              if (rand()>=0.5) { print "a" }
              else { print "e" }
            }
     /[io]/ { 
              if (rand()>=0.5) { print "i" }
              else { print "o" }
            }
     /[uy]/ { 
              if (rand()>=0.5) { print "u" }
              else { print "y" }
            }
     /[AE]/ { 
              if (rand()>=0.5) { print "A" }
              else { print "E" }
            }
     /[IO]/ { 
              if (rand()>=0.5) { print "I" }
              else { print "O" }
            }
     /[UY]/ { 
              if (rand()>=0.5) { print "U" }
              else { print "Y" }
            }
     /^$/   { print $0 }
     /[^aeiouyAEIOUY]/ { print $0 }
    ' |\
# 
# The final two rules of the awk program are necessary for processing
# lines that do not contain characters that can be replaced. The
# pattern ^$ means empty line and the pattern [^xyz] means all 
# possible characters except the characters 'x', 'y' and 'z'. Notice 
# that awk also has the pipe construct appended to it: it will send
# its output to another command as well.
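#
# For example (a sketch you can try at the prompt), given the three
# lines "a", "" and "b", the command
#
# printf 'a\n\nb\n' | awk '/^$/ { print "empty" } /[^a]/ { print "other: " $0 }'
#
# prints "empty" for the second line and "other: b" for the third.
# Nothing is printed for "a" because it matches neither pattern.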
#
# If you want to learn more about the awk command, type in "info" at
# the prompt and search for "awk". Then you will find the awk manual.
#
# When the awk command is completed we have to remove the extra
# newlines we have added. The addNewlines command will do this job
# for us when we call it with option -r (reverse).
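#
# The reverse step can be sketched in awk as well. This assumes that
# an empty line stands for an original end of line, which is a guess
# about addNewlines and not taken from its source:
#
# awk '{ if ($0 == "") printf "\n"; else printf "%s", $0 }'
#
# Every non-empty line is glued to the previous one and every empty
# line is turned back into a newline.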
# 
./addNewlines -r
#
# addNewlines will put its output on the screen. If you want to
# view it, you can call messUp like messUp < someInputFile | more
# The more program will present the output to you screen by screen.
# However, you will ultimately have to save the output of messUp to a
# file and this can be done like this:
#
# messUp < someInputFile > someOutputFile
#
# With the > redirection construct you can send output of a command to
# a file. Congratulations, you have managed to produce a corrupted
# text. Now we can start trying to improve it. Remember that you
# have the original text available for making comparisons.
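#
# One way to make such a comparison (a sketch; the file names are the
# placeholders used above) is to count the characters that differ:
#
# cmp -l someInputFile someOutputFile | wc -l
#
# cmp -l prints one line for every byte position at which the two
# files differ. Since messUp replaces characters one for one, both
# files should have the same length, and the count is then the number
# of corrupted characters.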