
Frequency

master
Daniel Hillerström 5 years ago
parent
commit
9fc66e6b62
  1. 17
      thesis.bib
  2. 104
      thesis.tex

17
thesis.bib

@ -1712,6 +1712,23 @@
OPTaddress = {Boston, MA, USA}
}
@book{PizziniBMG20,
author = {Ken Pizzini and Paolo Bonzini and Jim Meyering and Assaf Gordon},
@Comment = {David MacKenzie
@Comment and Jim Meyering
@Comment and Ross Paterson
@Comment and François Pinard
@Comment and Karl Berry
@Comment and Brian Youmans
@Comment and Richard Stallman},
title = {{GNU} sed, a stream editor},
note = {For version 4.8},
month = jan,
year = 2020,
publisher = {Free Software Foundation},
OPTaddress = {Boston, MA, USA}
}
# Expressiveness
@inproceedings{Felleisen90,
author = {Matthias Felleisen},

104
thesis.tex

@ -7151,7 +7151,7 @@ invoked with the resumption of the producer along with a thunk that
applies the consumer's resumption to the yielded value.
%
For aesthetics, we define a right-associative infix alias for pipe:
$p \mid c \defas \Pipe\,\Record{p;c}$.
$p \mid c \defas \lambda\Unit.\Pipe\,\Record{p;c}$.
Let us put the pipe operator to use by performing a simple string
frequency analysis on a file. We will implement the analysis as a
@ -7245,6 +7245,16 @@ the character was nil in which case the process
terminates. Alternatively, if the character was a newline the function
applies itself recursively with $n$ decremented by one. Otherwise it
applies itself recursively with the original $n$.
The $\head$ filter does not transform the shape of its data stream. It
both awaits and yields a character. However, the awaits and yields
need not operate on the same type within the same filter, meaning we
can implement a filter that transforms the shape of the data. Let us
implement a variation of the GNU coreutil \emph{paste} which merges
lines of files~\cite[Section~8.2]{MacKenzieMPPBYS20}. Our
implementation will join characters in its input stream into strings
separated by spaces and newlines such that the string frequency
analysis utility need not operate on the low level of characters.
%
\[
\bl
@ -7264,6 +7274,33 @@ applies itself recursively with the original $n$.
\el
\]
%
The heavy lifting is delegated to the recursive function $paste'$
which accepts two parameters: 1) the next character in the input
stream, and 2) a string buffer for building the output string. The
function is initially applied to the first character from the stream
(returned by the invocation of $\Await$) and the empty string
buffer. The function $paste'$ is defined by pattern matching on the
character parameter. The first three definitions handle the special
cases when the received character is nil, newline, and space,
respectively. If the character is nil, then the function yields the
contents of the string buffer followed by a string containing
only the nil character. If the character is a newline, then the
function yields the string buffer followed by a string containing the
newline character. Afterwards the function applies itself recursively
with the next character from the input stream and an empty string
buffer. The case when the character is a space is similar to the
previous case except that it does not yield a newline string. The
final definition simply concatenates the character onto the string
buffer and recurses.
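As a rough illustration of the four cases of $paste'$ described above, here is a minimal Python sketch. It is an analogy under stated assumptions, not the thesis's definition: `Await` is modelled by pulling from an iterator of characters, `Yield` by a generator `yield`, and the nil character by the hypothetical constant `NIL`.

```python
NIL = "\0"  # assumption: stand-in for the nil character that ends a stream


def paste(chars):
    """Join a stream of characters into strings split on spaces and newlines."""
    buf = ""  # string buffer for the word currently being built
    for c in chars:
        if c == NIL:
            # End of stream: flush the buffer, then pass the nil marker on.
            yield buf
            yield NIL
            return
        elif c == "\n":
            # Newline: flush the buffer and yield the newline as its own string.
            yield buf
            yield "\n"
            buf = ""
        elif c == " ":
            # Space: flush the buffer without yielding a separator string.
            yield buf
            buf = ""
        else:
            # Any other character is appended to the buffer.
            buf += c
```

The loop plays the role of the recursive calls; resetting `buf` corresponds to recursing with an empty string buffer.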
Another useful filter is the GNU stream editor, abbreviated
\emph{sed}~\cite{PizziniBMG20}. It is an advanced text processing
editor, whose complete functionality we will not attempt to replicate
here. We will implement only the ability to replace one string by
another. This will be useful for normalising the input stream to the
frequency analysis utility, e.g.\ to decapitalise words or remove
unwanted characters.
%
\[
\bl
\sed : \Record{\String;\String} \to \UnitType \eff \{\Await : \UnitType \opto \String;\Yield : \String \opto \UnitType\}\\
@ -7276,6 +7313,16 @@ applies itself recursively with the original $n$.
\el
\]
%
The function $\sed$ takes two string arguments. The first argument is
the string to be replaced in the input stream, and the second argument
is the replacement. The function first awaits the next string from the
input stream, then it checks whether the received string is the same
as $target$ in which case it yields the replacement $str'$ and
recurses. Otherwise it yields the received string and recurses.
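The replacement behaviour just described can be sketched in Python as a generator over a stream of strings. This is a hedged analogy, not the thesis's $\sed$: the parameter names are illustrative, and termination simply follows exhaustion of the input iterator, which is an assumption of this sketch.

```python
def sed(target, replacement, strings):
    """Yield each incoming string, replacing exact matches of `target`."""
    for s in strings:
        if s == target:
            # The received string equals the target: yield the replacement.
            yield replacement
        else:
            # Otherwise pass the received string through unchanged.
            yield s
```

For example, `sed("To", "to", ["To", "be"])` produces the strings `to` and `be`. Only whole strings are compared, so substrings are never rewritten.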
Now let us implement the string frequency analysis utility. It works on
strings and counts the occurrences of each string in the input stream.
%
\[
\bl
\freq : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \String;\Yield : \List\,\Record{\String;\Int} \opto \UnitType\}\\
@ -7300,20 +7347,50 @@ applies itself recursively with the original $n$.
\el
\]
%
\[
\bl
\intToString : \Int \to \String
\el
\]
The auxiliary recursive function $freq'$ implements the analysis. It
takes two arguments: 1) the next string from the input stream, and 2)
a table to keep track of how many times each string has occurred. The
table is implemented as an association list indexed by strings. The
function is initially applied to the first string from the input
stream and the empty list. The function is defined by pattern matching
on the string argument. The first definition handles the case when the
input stream has been exhausted in which case the function yields the
table. The other case is responsible for updating the entry associated
with the string $str$ in the table $tbl$. There are two subcases to
consider: 1) the string has not been seen before, thus a new entry
will have to be created; or 2) the string already has an entry in the
table, thus the entry will have to be updated. We handle both subcases
simultaneously by making use of the handler $\faild$, where the
default value accounts for the first subcase, and the computation
accounts for the second. The computation attempts to look up the entry
associated with $str$ in $tbl$; if the lookup fails, then $\faild$
returns the default value, which is the original table augmented with
an entry for $str$. If an entry already exists, it gets incremented by
one. The resulting table $tbl'$ is supplied to a recursive application
of $freq'$.
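The table update can be sketched in Python as follows. This is a minimal sketch under assumptions: the end of the stream is signalled by the hypothetical `NIL` string, the table is a list of pairs kept in first-seen order, and Python's `for`/`else` stands in for the $\faild$ handler, with the `else` branch supplying the default when the lookup fails.

```python
NIL = "\0"  # assumption: stand-in for the marker that ends the input stream


def freq(strings):
    """Count occurrences of each string; yield the table once at end of stream."""
    tbl = []  # association list of (string, count) pairs
    for s in strings:
        if s == NIL:
            # Input exhausted: yield the finished table and stop.
            yield tbl
            return
        for i, (key, count) in enumerate(tbl):
            if key == s:
                # Lookup succeeded: increment the existing entry.
                tbl[i] = (key, count + 1)
                break
        else:
            # Lookup failed: augment the table with a fresh entry.
            tbl.append((s, 1))
```

Unlike the recursive $freq'$, which passes an updated table $tbl'$ to each recursive call, this sketch mutates a single list in place.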
We need one more building block to complete the pipeline. The utility
$\freq$ yields a value of type $\List~\Record{\String;\Int}$; we need
a utility to render the value as a string in order to write it to a
file.
%
\[
\bl
\printTable : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \List\,\Record{\String;\Int}\}\\
\printTable\,\Unit \defas
\dec{map}\,\Record{\lambda\Record{s;i}.s \concat \strlit{:} \concat \intToString~i \concat \strlit{;};\Do\;\Await~\Unit}
\map\,\Record{\lambda\Record{s;i}.s \concat \strlit{:} \concat \intToString~i \concat \strlit{;};\Do\;\Await~\Unit}
\el
\]
%
The function performs one invocation of $\Await$ to receive the table,
and then performs a $\map$ over the table. The function argument to
$\map$ builds a string from the string-integer pair.
%
Here we make use of an auxiliary function,
$\intToString : \Int \to \String$, that turns an integer into a
string. The definition of the function is omitted here for brevity.
%
%
% \[
% \bl
% \wc : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \Char;\Yield : \Int \opto \UnitType\}\\
@ -7343,13 +7420,14 @@ applies itself recursively with the original $n$.
\qquad\qquad\status\,(\lambda\Unit.
\ba[t]{@{}l}
\quoteHamlet~\redirect~\strlit{hamlet};\\
\Let\;cs \revto
\Let\;p \revto
\bl
(\lambda\Unit.\cat~\strlit{hamlet}) \mid (\lambda\Unit.\head~2) \mid \paste\\
\mid (\lambda\Unit.\sed\,\Record{\strlit{be,};\strlit{live}}) \mid (\lambda\Unit.\sed\,\Record{\strlit{To};\strlit{to}})\\
~~(\lambda\Unit.\cat~\strlit{hamlet}) \mid (\lambda\Unit.\head~2) \mid \paste\\
\mid (\lambda\Unit.\sed\,\Record{\strlit{be,};\strlit{be}}) \mid (\lambda\Unit.\sed\,\Record{\strlit{To};\strlit{to}})\\
\mid (\lambda\Unit.\sed\,\Record{\strlit{question:};\strlit{question}})\\
\mid \freq \mid \printTable
\el\\
\In\;(\lambda\Unit.\echo~cs)~\redirect~\strlit{analysis})})))}
\In\;(\lambda\Unit.\echo~(p\,\Unit))~\redirect~\strlit{analysis})})))}
\ea
\el \smallskip\\
\reducesto^+&
@ -7368,8 +7446,8 @@ applies itself recursively with the original $n$.
\ba[t]{@{}l}
\Record{2;
\ba[t]{@{}l@{}l}
\texttt{"}&\texttt{to:2;live:2;or:1;not:1;\nl:2;that:1;is:1}\\
&\texttt{the:1;question::1;"}},
\texttt{"}&\texttt{to:2;be:2;or:1;not:1;\nl:2;that:1;is:1}\\
&\texttt{the:1;question:1;"}},
\ea\\
\Record{1;
\ba[t]{@{}l@{}l}
