\begin{savequote}[75mm]
Handle a book as a bee does a flower, extract its sweetness but do not damage it.
\qauthor{John Muir}
\end{savequote}
% % % % % % % % % % % % % % % % % % % % % %
\chapter{Feature Extraction}
\label{ch:feature_extraction}
\chaplettrine{I}{n the previous chapter}, we described how to obtain references from documents and how to compute confidence scores that reflect how certain we are that two references refer to the same entity.
During the link classification step we assumed that records consisted only of information directly relating to a reference, e.g., first names, last names, roles, and titles.
If the probability mass function of the values of a property is known, it can be incorporated, but doing so does not extend the records themselves.
When we consider the case of a reference containing a \emph{common name}, however, it becomes apparent that more information is required.
The confidence score is inversely proportional to the frequency of the reference's components, yielding a low score in the case of a common name.
This weighting can only improve the fraction of correctly classified matches (precision); it cannot prevent true matches from being missed because of a low score (recall).
To achieve that, more information is required.
Looking at the documents from which references are extracted, we see that more data is available that could potentially be useful in aiding classification.
In this chapter, we will look at two complementary approaches that extract contextual information.
The first is based on modeling the content of a section of text in which a reference appears, while the second is based on co-occurrence of other people.
The goal is to extract information from the text such that ambiguous cases, such as ``John Smith'', can still be linked.
% % % % % % % % % % % % % % % % % % % % % %
\section{Maximally informative $k$-itemsets}
\label{sec:miki}
The primary targets for the methods studied in this thesis are historical documents.
Often these documents comprise the proceedings of a meeting or a summarized version thereof.
They contain a description of events that took place on a specific date, who was present, etc.
When looking more closely at the contents it is often possible to extract a few interesting words.
As an example, consider the following sections taken from ``Officials of the Boards of Trade and Plantations'' \citep{TradePlantations}.
\begin{quote}
A letter from the Secretary to Mr. Carkesse, desiring him to move the Commissioners of the Customs, that their Officers in the Out Ports may give this Board an Account of the quantities of Salt that is necessary and used in curing several species of Fish, was agreed and ordered to be sent.
\end{quote}
\begin{quote}
Ordered that Mr. Carkesse be desired to let this Board have on Tuesday next, if possible, the Account of Fish exported, which was desired the 17th of the last month.
\end{quote}
\noindent In both cases, there is a reference to a ``Mr. Carkesse'', which might be the same person.
Both texts discuss matter that is related to the export of fish, so we could say the overarching topic is ``fish''.
The fact that Mr. Carkesse is mentioned twice in the context of this topic gives us additional information that we can exploit to determine whether these references should be matched.
Note that if there had been multiple people called Mr. Carkesse, additional information would probably have been provided to avoid confusion.
It is therefore unlikely that two different persons are being referred to.
However, it is a good example of the kind of `circumstantial evidence' that we want to extract from the documents.
Any word from the text may potentially be used as a topic for the text; in fact, we could select all of them.
However, this would cause the topics, treated as features of the text, to be highly correlated.
If we were to select ``fish'' as a topic and decide to also include ``fishing'', it is easy to see that probably not much additional information is included.
Instead of trying to decide on a topic by looking at a single document, we can look at the complete set of documents under consideration and construct a set of uncorrelated, or \emph{orthogonal}, topics.
This is very much related to the process of \emph{feature selection} \citep{Knobbe2006}, that strives to select a subset of features with the purpose of reducing problems caused when too many features are used, such as overfitting.
In our case, we aim for the selection of a set of topics that provide as good a distinction between texts as possible.
This is exactly what \emph{maximally informative $k$-itemsets} \citep{Knobbe2006}, or \emph{MIKIs} for short, provide.
As the name suggests, a MIKI is an itemset (equivalent to a mathematical \emph{set}) of size $k$ that provides maximal information for distinguishing between the entities considered.
The amount of information of the itemset is measured as the \emph{joint entropy} of the itemset in \emph{Shannon bits} (or just \emph{bits}), which is defined as follows \citep{Knobbe2006}:
\begin{definition}[Joint entropy]
Suppose that $X = \left\{ x_{1}, \dots, x_{k} \right\}$ is an itemset, and $B = \left( b_{1}, \dots, b_{k} \right) \in \left\{ 0, 1 \right\}^{k}$ is a tuple of binary values. The \emph{joint entropy} of $X$ is defined as
\begin{equation*}
H(X) = -\sum_{B \in \left\{ 0, 1 \right\}^{k}} p \left( x_{1} = b_{1}, \dots, x_{k} = b_{k} \right) \lg p \left( x_{1} = b_{1}, \dots, x_{k} = b_{k} \right)
\end{equation*}
\label{def:joint_entropy}
\end{definition}
\begin{definition}[Maximally informative $k$-itemset]
Suppose that $I$ is a collection of $n$ items. An itemset $X \subseteq I$ of cardinality $k$ is a \emph{maximally informative $k$-itemset} if and only if, for all itemsets $Y \subseteq I$ of cardinality $k$,
\begin{equation}
H(Y) \leq H(X)
\end{equation}
\label{def:miki}
\end{definition}
We will write a $k$-element subset of $I$ as a list of integers that refer to the elements of $I$.
\begin{equation*}
A = \left[ x_{1}, \dots, x_{k} \right]
\end{equation*}
\noindent where
\begin{equation*}
x_{1} < \dots < x_{k}
\end{equation*}
Entropy can also be explained as a measure of disorder or uncertainty, i.e., the higher the entropy of an item, the more unpredictable it is.
For example, a toss of a fair coin is maximally unpredictable and has an entropy of $1$, since its two outcomes ``heads'' and ``tails'' are equally likely.
Analogous to the coin toss example, an itemset $X$ of maximum entropy consists of items that are uniformly distributed over the data, i.e., a sample contains an item, or combination of items, of $X$ with equal probability.
This is best illustrated with an example.
\begin{table}[!tbp]
\centering
\begin{minipage}{.3\textwidth}
\centering
\begin{tabular}{?c ^c ^c ^c}
\toprule
\rowstyle{\bfseries}
A & B & C & D \\
\midrule
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 \\
\bottomrule
\end{tabular}
\end{minipage}%
\begin{minipage}{.3\textwidth}
\centering
\sisetup{round-mode=places}
\begin{tabular}{c S[round-precision=3]}
\toprule
$I$ & $H$ \\
\midrule
A & 1.0 \\
B & 1.0 \\
C & 1.0 \\
D & 0.95443400292496494 \\
\bottomrule
\end{tabular}
\end{minipage}%
\begin{minipage}{.3\textwidth}
\centering
\sisetup{round-mode=places}
\begin{tabular}{c c S[round-precision=3]}
\toprule
$I_{1}$ & $I_{2}$ & $H$ \\
\midrule
A & B & 1.8112781244591329 \\
A & C & 2.0 \\
A & D & 1.4056390622295665 \\
B & C & 1.8112781244591329 \\
B & D & 1.4056390622295665 \\
C & D & 1.9056390622295665 \\
\bottomrule
\end{tabular}
\end{minipage}
\caption{An example dataset of size $8$ and cardinality $4$ (left), the entropy of each individual item (middle), and the joint entropy of each $2$-itemset (right).}
\label{tab:joint_entropy}
\end{table}
\cref{tab:joint_entropy} shows a small example dataset of size $8$ and cardinality $4$. Items $A$, $B$ and $C$ are uniformly distributed over the dataset, so each of them is a $1$-MIKI of entropy \num{1}.
It can be seen that items $A$ and $D$ are mostly complementary: no sample in the table contains both, so if a sample includes $A$, it will not include $D$.
Another way of stating this is that $D$ becomes predictable given $A$, and thus the itemset $\left\{ A, D \right\}$ has low entropy (see \cref{tab:joint_entropy}).
The itemset $\left\{ A, C \right\}$ is uniformly spread over the dataset and therefore has the maximum entropy of \num{2}, making it a $2$-MIKI.
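As a check on these values, the joint entropy of $\left\{ A, D \right\}$ follows directly from the counts in \cref{tab:joint_entropy}: the combination $(1, 0)$ occurs in four samples, $(0, 1)$ in three, $(0, 0)$ in one, and $(1, 1)$ in none, so (taking $0 \lg 0 = 0$)
\begin{equation*}
H\left( \left\{ A, D \right\} \right) = -\left( \tfrac{4}{8} \lg \tfrac{4}{8} + \tfrac{3}{8} \lg \tfrac{3}{8} + \tfrac{1}{8} \lg \tfrac{1}{8} \right) \approx 1.406 .
\end{equation*}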
% % % % % % % % % % % % % % % % % % % % % %
\section{Algorithms for computing MIKIs}
\label{sec:miki_computation}
\begin{algorithm}[!tbp]
\caption{Computes the lexicographic successor of itemset $X$ of size $k$}
\label{alg:lexicographic}
\input{algorithms/lexicographic}
\end{algorithm}
\begin{algorithm}[!tbp]
\caption{Computes an exact MIKI by exhaustive search}
\label{alg:exhaustive_miki}
\input{algorithms/exhaustive_miki}
\end{algorithm}
To compute a MIKI of size $k$ over a collection of $n$ items, an exhaustive search algorithm, as given in \cref{alg:exhaustive_miki}, can be employed \citep{Knobbe2006}.
This algorithm computes the joint entropy of every itemset of the given size $k$ and outputs the one for which this value is maximized.
The order in which the itemsets are processed is determined by the function \texttt{LexicographicSuccessor} (see \cref{alg:lexicographic}).
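For concreteness, the following Python sketch mirrors the exhaustive strategy; it enumerates the candidate itemsets with \texttt{itertools.combinations} rather than an explicit \texttt{LexicographicSuccessor} routine, and the function names and the inline example are illustrative assumptions rather than the exact pseudocode of \cref{alg:exhaustive_miki}.
\begin{verbatim}
from itertools import combinations
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    """Joint entropy (in bits) of the columns in `itemset` over binary rows."""
    counts = Counter(tuple(row[i] for i in itemset) for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def exhaustive_miki(rows, n_items, k):
    """Return the k-itemset with maximal joint entropy by exhaustive search."""
    best, best_h = None, -1.0
    for candidate in combinations(range(n_items), k):
        h = joint_entropy(rows, candidate)
        if h > best_h:
            best, best_h = candidate, h
    return best, best_h

# The example dataset from the table (columns A, B, C, D).
rows = [(1,1,1,0), (1,1,0,0), (1,1,1,0), (1,0,0,0),
        (0,1,1,0), (0,0,0,1), (0,0,1,1), (0,0,0,1)]
print(exhaustive_miki(rows, 4, 2))   # ((0, 2), 2.0), i.e. {A, C}
\end{verbatim}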
\begin{algorithm}[!tbp]
\caption{Computes an approximate MIKI}
\label{alg:forward_selection}
\input{algorithms/forward_selection}
\end{algorithm}
Note that for the intended purpose of computing a MIKI out of a vocabulary of $n$ words, it quickly becomes infeasible to perform an exhaustive search, since the number of candidate itemsets is equal to $\binom{n}{k}$.
Knobbe et al. have defined various bounds on the joint entropy of an itemset, which can be utilized to reduce the number of candidate itemsets significantly \citep{Knobbe2006}.
They provide four algorithms that are guaranteed to find a MIKI, and one called \texttt{ForwardSelection} (see \cref{alg:forward_selection}) that may or may not find a MIKI but reduces the complexity to $\mathcal{O}(kn)$.
While the exact algorithms consider all possible itemsets and skip portions of the search space whenever possible, the approximation algorithm constructs its result incrementally.
It starts by selecting the item with the highest entropy, resulting in the $1$-MIKI.
An (approximate) MIKI of size $2$ is then found by computing the joint entropy of the union of the current MIKI with each of the remaining items, keeping track of the maximum value.
This process is repeated until an itemset of size $k$ is found.
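A minimal Python sketch of this greedy construction is given below; the helper \texttt{joint\_entropy} and the example data are repeated from the previous listing, and the code is meant only to illustrate the incremental selection, not to reproduce \cref{alg:forward_selection} verbatim.
\begin{verbatim}
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    counts = Counter(tuple(row[i] for i in itemset) for row in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def forward_selection(rows, n_items, k):
    """Greedily grow an (approximate) MIKI, one item at a time,
    using O(k*n) joint-entropy evaluations."""
    selected = []
    for _ in range(k):
        remaining = [i for i in range(n_items) if i not in selected]
        # Add the item whose inclusion maximizes the joint entropy so far.
        best = max(remaining, key=lambda i: joint_entropy(rows, selected + [i]))
        selected.append(best)
    return selected, joint_entropy(rows, selected)

rows = [(1,1,1,0), (1,1,0,0), (1,1,1,0), (1,0,0,0),
        (0,1,1,0), (0,0,0,1), (0,0,1,1), (0,0,0,1)]
print(forward_selection(rows, 4, 2))   # ([0, 2], 2.0) on this dataset
\end{verbatim}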
% % % % % % % % % % % % % % % % % % % % % %
\section{Utilizing MIKIs for entity resolution}
\label{sec:miki_utilization}
By computing the MIKI for a given corpus we aim to capture distinct topics that occur in it.
In order to incorporate this information in the record linker, as described in the previous chapter, we treat each of the chosen words in the MIKI as a new binary attribute.
This is achieved by extending each reference with $k$ new attributes, each of which is $1$ if the corresponding word occurs in the section in which the reference appears, and $0$ otherwise.
To classify candidate record pairs, the same procedure as described in \cref{sec:classification} can be used if a suitable similarity function for the binary attributes is provided.
For this purpose, the logical \emph{and} function can be used: probability values are only retrieved if the values in both records are $1$, i.e., only the words that appear in the context of both references are used.
The probability of a word is equal to the fraction of sections in which it appears.
The classification procedure remains unchanged: each attribute pair is tested for approximate equality and, whenever the test succeeds, the probability of the value is obtained and fed into an aggregation function, resulting in a confidence score.
The only addition to the classification algorithm is the logical \emph{and} function for the new binary attributes.
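As an illustration of this bookkeeping, the sketch below extends two (hypothetical) references with binary topic attributes and collects probabilities only for the MIKI words both have in common; the record representation, the word list, and the aggregation by summing $-\lg p$ are assumptions made for the example, not the exact procedure of \cref{sec:classification}.
\begin{verbatim}
from math import log2

# Hypothetical MIKI words selected for the corpus (illustrative only).
miki_words = ["fish", "salt", "customs"]

def extend_reference(reference, section_text, miki_words):
    """Add one binary attribute per MIKI word: 1 if the word occurs in the
    section the reference was extracted from, 0 otherwise."""
    tokens = set(section_text.lower().split())
    for word in miki_words:
        reference["topic:" + word] = 1 if word in tokens else 0
    return reference

def topic_evidence(ref_a, ref_b, word_section_fraction, miki_words):
    """Collect the probabilities of MIKI words shared by both references
    (logical 'and' on the binary attributes); a placeholder aggregation
    sums -lg(p), so rarer shared topics contribute more evidence."""
    score = 0.0
    for word in miki_words:
        if ref_a["topic:" + word] == 1 and ref_b["topic:" + word] == 1:
            p = word_section_fraction[word]  # fraction of sections with the word
            score += -log2(p)
    return score

ref_a = extend_reference({"name": "Mr. Carkesse"},
                         "the account of fish exported", miki_words)
ref_b = extend_reference({"name": "Mr. Carkesse"},
                         "salt used in curing fish", miki_words)
print(topic_evidence(ref_a, ref_b,
                     {"fish": 0.25, "salt": 0.1, "customs": 0.05},
                     miki_words))   # 2.0: only "fish" is shared
\end{verbatim}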