mirror of
https://github.com/dhil/phd-dissertation
synced 2026-03-13 11:08:25 +00:00
Frequency
17
thesis.bib
@@ -1712,6 +1712,23 @@
|
|||||||
OPTaddress = {Boston, MA, USA}
|
OPTaddress = {Boston, MA, USA}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@book{PizziniBMG20,
|
||||||
|
author = {Ken Pizzini and Paolo Bonzini and Jim Meyering and Assaf Gordon},
|
||||||
|
@Comment = {David MacKenzie
|
||||||
|
@Comment and Jim Meyering
|
||||||
|
@Comment and Ross Paterson
|
||||||
|
@Comment and François Pinard
|
||||||
|
@Comment and Karl Berry
|
||||||
|
@Comment and Brian Youmans
|
||||||
|
@Comment and Richard Stallman},
|
||||||
|
title = {{GNU} sed, a stream editor},
|
||||||
|
note = {For version 4.8},
|
||||||
|
month = jan,
|
||||||
|
year = 2020,
|
||||||
|
publisher = {Free Software Foundation},
|
||||||
|
OPTaddress = {Boston, MA, USA}
|
||||||
|
}
|
||||||
|
|
||||||
# Expressiveness
|
# Expressiveness
|
||||||
@inproceedings{Felleisen90,
|
@inproceedings{Felleisen90,
|
||||||
author = {Matthias Felleisen},
|
author = {Matthias Felleisen},
|
||||||
|
|||||||
104
thesis.tex
@@ -7151,7 +7151,7 @@ invoked with the resumption of the producer along with a thunk that
|
|||||||
applies the consumer's resumption to the yielded value.
|
applies the consumer's resumption to the yielded value.
|
||||||
%
|
%
|
||||||
For aesthetics, we define a right-associative infix alias for pipe:
|
For aesthetics, we define a right-associative infix alias for pipe:
|
||||||
$p \mid c \defas \Pipe\,\Record{p;c}$.
|
$p \mid c \defas \lambda\Unit.\Pipe\,\Record{p;c}$.
|
||||||
|
|
||||||
Let us put the pipe operator to use by performing a simple string
|
Let us put the pipe operator to use by performing a simple string
|
||||||
frequency analysis on a file. We will implement the analysis as a
|
frequency analysis on a file. We will implement the analysis as a
|
||||||
@@ -7245,6 +7245,16 @@ the character was nil in which case the process
|
|||||||
terminates. Alternatively, if the character was a newline the function
|
terminates. Alternatively, if the character was a newline the function
|
||||||
applies itself recursively with $n$ decremented by one. Otherwise it
|
applies itself recursively with $n$ decremented by one. Otherwise it
|
||||||
applies itself recursively with the original $n$.
|
applies itself recursively with the original $n$.
|
||||||
|
|
||||||
|
The $\head$ filter does not transform the shape of its data stream. It
|
||||||
|
both awaits and yields a character. However, the awaits and yields
|
||||||
|
need not operate on the same type within the same filter, meaning we
|
||||||
|
can implement a filter that transforms the shape of the data. Let us
|
||||||
|
implement a variation of the GNU coreutil \emph{paste} which merges
|
||||||
|
lines of files~\cite[Section~8.2]{MacKenzieMPPBYS20}. Our
|
||||||
|
implementation will join characters in its input stream into strings
|
||||||
|
separated by spaces and newlines such that the string frequency
|
||||||
|
analysis utility need not operate on the low level of characters.
|
||||||
%
|
%
|
||||||
\[
|
\[
|
||||||
\bl
|
\bl
|
||||||
@@ -7264,6 +7274,33 @@ applies itself recursively with the original $n$.
|
|||||||
\el
|
\el
|
||||||
\]
|
\]
|
||||||
%
|
%
|
||||||
|
The heavy lifting is delegated to the recursive function $paste'$
|
||||||
|
which accepts two parameters: 1) the next character in the input
|
||||||
|
stream, and 2) a string buffer for building the output string. The
|
||||||
|
function is initially applied to the first character from the stream
|
||||||
|
(returned by the invocation of $\Await$) and the empty string
|
||||||
|
buffer. The function $paste'$ is defined by pattern matching on the
|
||||||
|
character parameter. The first three definitions handle the special
|
||||||
|
cases when the received character is nil, newline, and space,
|
||||||
|
respectively. If the character is nil, then the function yields the
|
||||||
|
contents of the string buffer followed by a string containing
|
||||||
|
only the nil character. If the character is a newline, then the
|
||||||
|
function yields the string buffer followed by a string containing the
|
||||||
|
newline character. Afterwards the function applies itself recursively
|
||||||
|
with the next character from the input stream and an empty string
|
||||||
|
buffer. The case when the character is a space is similar to the
|
||||||
|
previous case except that it does not yield a newline string. The
|
||||||
|
final definition simply concatenates the character onto the string
|
||||||
|
buffer and recurses.
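
As a rough illustration of the four cases described above, here is a minimal Python sketch that models $\Await$ as pulling from an iterator and $\Yield$ as a generator `yield`. It is not the thesis's definition of $paste'$: the names `paste` and `NIL`, the encoding of the nil character as `"\0"`, and the use of a loop in place of recursion are all assumptions made for this example.

```python
from typing import Iterable, Iterator

NIL = "\0"  # assumed encoding of the nil character that ends a stream


def paste(chars: Iterable[str]) -> Iterator[str]:
    """Join characters into strings, splitting on spaces and newlines."""
    buf = ""  # string buffer for the word currently being built
    for c in chars:
        if c == NIL:
            # End of stream: flush the buffer, then signal termination.
            yield buf
            yield NIL
            return
        elif c == "\n":
            # Flush the buffer, then emit the newline as its own string.
            yield buf
            yield "\n"
            buf = ""
        elif c == " ":
            # Like the newline case, but no separator string is emitted.
            yield buf
            buf = ""
        else:
            # Ordinary character: append it to the buffer.
            buf += c
```

For instance, `list(paste("To be\nor\0"))` evaluates to `["To", "be", "\n", "or", "\0"]`. Note that, following the prose literally, two consecutive separators cause an empty string to be yielded.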
|
||||||
|
|
||||||
|
Another useful filter is the GNU stream editor, abbreviated
|
||||||
|
\emph{sed}~\cite{PizziniBMG20}. It is an advanced text processing
|
||||||
|
editor, whose complete functionality we will not attempt to replicate
|
||||||
|
here. We will just implement the ability to replace a string by
|
||||||
|
another. This will be useful for normalising the input stream to the
|
||||||
|
frequency analysis utility, e.g. decapitalise words, remove unwanted
|
||||||
|
characters, etc.
|
||||||
|
%
|
||||||
\[
|
\[
|
||||||
\bl
|
\bl
|
||||||
\sed : \Record{\String;\String} \to \UnitType \eff \{\Await : \UnitType \opto \String;\Yield : \String \opto \UnitType\}\\
|
\sed : \Record{\String;\String} \to \UnitType \eff \{\Await : \UnitType \opto \String;\Yield : \String \opto \UnitType\}\\
|
||||||
@@ -7276,6 +7313,16 @@ applies itself recursively with the original $n$.
|
|||||||
\el
|
\el
|
||||||
\]
|
\]
|
||||||
%
|
%
|
||||||
|
The function $\sed$ takes two string arguments. The first argument is
|
||||||
|
the string to be replaced in the input stream, and the second argument
|
||||||
|
is the replacement. The function first awaits the next string from the
|
||||||
|
input stream, then it checks whether the received string is the same
|
||||||
|
as $target$ in which case it yields the replacement $str'$ and
|
||||||
|
recurses. Otherwise it yields the received string and recurses.
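
The replacement behaviour described above can be sketched in Python as a generator over a stream of strings. This is only an approximation under stated assumptions: the parameter names `target` and `replacement` and the explicit `strings` argument are introduced for the example, and the generator stops when its input is exhausted, whereas the function in the text recurses after every yield.

```python
from typing import Iterable, Iterator


def sed(target: str, replacement: str, strings: Iterable[str]) -> Iterator[str]:
    """Replace every string equal to `target` by `replacement`."""
    for s in strings:
        # Whole-string comparison, as in the text: no substring matching.
        yield replacement if s == target else s
```

For example, `list(sed("To", "to", ["To", "be,", "To"]))` evaluates to `["to", "be,", "to"]`. Since the comparison is on whole strings, `"be,"` is left untouched unless it is itself the target.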
|
||||||
|
|
||||||
|
Now let us implement the string frequency analysis utility. It works on
|
||||||
|
strings and counts the occurrences of each string in the input stream.
|
||||||
|
%
|
||||||
\[
|
\[
|
||||||
\bl
|
\bl
|
||||||
\freq : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \String;\Yield : \List\,\Record{\String;\Int} \opto \UnitType\}\\
|
\freq : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \String;\Yield : \List\,\Record{\String;\Int} \opto \UnitType\}\\
|
||||||
@@ -7300,20 +7347,50 @@ applies itself recursively with the original $n$.
|
|||||||
\el
|
\el
|
||||||
\]
|
\]
|
||||||
%
|
%
|
||||||
\[
|
The auxiliary recursive function $freq'$ implements the analysis. It
|
||||||
\bl
|
takes two arguments: 1) the next string from the input stream, and 2)
|
||||||
\intToString : \Int \to \String
|
a table to keep track of how many times each string has occurred. The
|
||||||
\el
|
table is implemented as an association list indexed by strings. The
|
||||||
\]
|
function is initially applied to the first string from the input
|
||||||
|
stream and the empty list. The function is defined by pattern matching
|
||||||
|
on the string argument. The first definition handles the case when the
|
||||||
|
input stream has been exhausted in which case the function yields the
|
||||||
|
table. The other case is responsible for updating the entry associated
|
||||||
|
with the string $str$ in the table $tbl$. There are two subcases to
|
||||||
|
consider: 1) the string has not been seen before, thus a new entry
|
||||||
|
will have to be created; or 2) the string already has an entry in the
|
||||||
|
table, thus the entry will have to be updated. We handle both cases
|
||||||
|
simultaneously by making use of the handler $\faild$, where the
|
||||||
|
default value accounts for the first subcase, and the computation
|
||||||
|
accounts for the second. The computation attempts to look up the entry
|
||||||
|
associated with $str$ in $tbl$; if the lookup fails then $\faild$
|
||||||
|
returns the default value, which is the original table augmented with
|
||||||
|
an entry for $str$. If an entry already exists it gets incremented by
|
||||||
|
one. The resulting table $tbl'$ is supplied to a recursive application
|
||||||
|
of $freq'$.
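
The interplay between the lookup and the default value can be approximated in Python by modelling the failure handler with an exception. Everything named here is an assumption for the sake of the sketch: `faild` is emulated with `try`/`except LookupError`, `increment` is a hypothetical helper, the nil string is encoded as `"\0"`, and new entries are appended at the end of the association list (the text does not say where they go).

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
Table = List[Tuple[str, int]]  # association list indexed by strings
NIL = "\0"  # assumed encoding of the nil string that ends the stream


def faild(default: T, computation: Callable[[], T]) -> T:
    """Run `computation`; if it fails with a lookup error, return `default`."""
    try:
        return computation()
    except LookupError:
        return default


def increment(tbl: Table, key: str) -> Table:
    """Return a copy of `tbl` with the count for `key` incremented by one.

    Raises LookupError if `key` has no entry (hypothetical helper).
    """
    for i, (k, n) in enumerate(tbl):
        if k == key:
            return tbl[:i] + [(k, n + 1)] + tbl[i + 1:]
    raise LookupError(key)


def freq(strings: Iterable[str]) -> Iterator[Table]:
    """Count occurrences of each string, then yield the table once."""
    tbl: Table = []
    for s in strings:
        if s == NIL:
            break  # input exhausted
        # The default covers the unseen-string case; the computation
        # covers the case where an entry already exists.
        tbl = faild(tbl + [(s, 1)], lambda: increment(tbl, s))
    yield tbl
```

For example, `next(freq(["to", "be", "to", "\0"]))` evaluates to `[("to", 2), ("be", 1)]`.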
|
||||||
|
|
||||||
|
We need one more building block to complete the pipeline. The utility
|
||||||
|
$\freq$ returns a value of type $\List~\Record{\String;\Int}$; we need
|
||||||
|
a utility to render the value as a string in order to write it to a
|
||||||
|
file.
|
||||||
%
|
%
|
||||||
\[
|
\[
|
||||||
\bl
|
\bl
|
||||||
\printTable : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \List\,\Record{\String;\Int}\}\\
|
\printTable : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \List\,\Record{\String;\Int}\}\\
|
||||||
\printTable\,\Unit \defas
|
\printTable\,\Unit \defas
|
||||||
\dec{map}\,\Record{\lambda\Record{s;i}.s \concat \strlit{:} \concat \intToString~i \concat \strlit{;};\Do\;\Await~\Unit}
|
\map\,\Record{\lambda\Record{s;i}.s \concat \strlit{:} \concat \intToString~i \concat \strlit{;};\Do\;\Await~\Unit}
|
||||||
\el
|
\el
|
||||||
\]
|
\]
|
||||||
%
|
%
|
||||||
|
The function performs one invocation of $\Await$ to receive the table,
|
||||||
|
and then performs a $\map$ over the table. The function argument to
|
||||||
|
$\map$ builds a string from the string-integer pair.
|
||||||
|
%
|
||||||
|
Here we make use of an auxiliary function,
|
||||||
|
$\intToString : \Int \to \String$, that turns an integer into a
|
||||||
|
string. The definition of the function is omitted here for brevity.
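
To make the rendering step concrete, the following Python sketch maps each string-integer pair to a string of the form `s:i;`. It is an illustration only: the name `print_table` is made up for the example, Python's built-in `str` stands in for the omitted $\intToString$, and the table is passed as an argument rather than received through $\Await$.

```python
from typing import List, Tuple


def print_table(tbl: List[Tuple[str, int]]) -> List[str]:
    """Render each (string, count) entry as 'string:count;'."""
    # str(i) plays the role of the integer-to-string conversion.
    return [s + ":" + str(i) + ";" for (s, i) in tbl]
```

For example, `"".join(print_table([("to", 2), ("be", 2)]))` evaluates to `"to:2;be:2;"`.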
|
||||||
|
%
|
||||||
|
%
|
||||||
% \[
|
% \[
|
||||||
% \bl
|
% \bl
|
||||||
% \wc : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \Char;\Yield : \Int \opto \UnitType\}\\
|
% \wc : \UnitType \to \UnitType \eff \{\Await : \UnitType \opto \Char;\Yield : \Int \opto \UnitType\}\\
|
||||||
@@ -7343,13 +7420,14 @@ applies itself recursively with the original $n$.
|
|||||||
\qquad\qquad\status\,(\lambda\Unit.
|
\qquad\qquad\status\,(\lambda\Unit.
|
||||||
\ba[t]{@{}l}
|
\ba[t]{@{}l}
|
||||||
\quoteHamlet~\redirect~\strlit{hamlet};\\
|
\quoteHamlet~\redirect~\strlit{hamlet};\\
|
||||||
\Let\;cs \revto
|
\Let\;p \revto
|
||||||
\bl
|
\bl
|
||||||
(\lambda\Unit.\cat~\strlit{hamlet}) \mid (\lambda\Unit.\head~2) \mid \paste\\
|
~~(\lambda\Unit.\cat~\strlit{hamlet}) \mid (\lambda\Unit.\head~2) \mid \paste\\
|
||||||
\mid (\lambda\Unit.\sed\,\Record{\strlit{be,};\strlit{live}}) \mid (\lambda\Unit.\sed\,\Record{\strlit{To};\strlit{to}})\\
|
\mid (\lambda\Unit.\sed\,\Record{\strlit{be,};\strlit{be}}) \mid (\lambda\Unit.\sed\,\Record{\strlit{To};\strlit{to}})\\
|
||||||
|
\mid (\lambda\Unit.\sed\,\Record{\strlit{question:};\strlit{question}})\\
|
||||||
\mid \freq \mid \printTable
|
\mid \freq \mid \printTable
|
||||||
\el\\
|
\el\\
|
||||||
\In\;(\lambda\Unit.\echo~cs)~\redirect~\strlit{analysis})})))}
|
\In\;(\lambda\Unit.\echo~(p\,\Unit))~\redirect~\strlit{analysis})})))}
|
||||||
\ea
|
\ea
|
||||||
\el \smallskip\\
|
\el \smallskip\\
|
||||||
\reducesto^+&
|
\reducesto^+&
|
||||||
@@ -7368,8 +7446,8 @@ applies itself recursively with the original $n$.
|
|||||||
\ba[t]{@{}l}
|
\ba[t]{@{}l}
|
||||||
\Record{2;
|
\Record{2;
|
||||||
\ba[t]{@{}l@{}l}
|
\ba[t]{@{}l@{}l}
|
||||||
\texttt{"}&\texttt{to:2;live:2;or:1;not:1;\nl:2;that:1;is:1}\\
|
\texttt{"}&\texttt{to:2;be:2;or:1;not:1;\nl:2;that:1;is:1}\\
|
||||||
&\texttt{the:1;question::1;"}},
|
&\texttt{the:1;question:1;"}},
|
||||||
\ea\\
|
\ea\\
|
||||||
\Record{1;
|
\Record{1;
|
||||||
\ba[t]{@{}l@{}l}
|
\ba[t]{@{}l@{}l}
|
||||||
|
|||||||