Lexicon API

Lexicons

CJKFrequencies.Lexicon — Type

Lexicon()
Lexicon(io_or_filename)
Lexicon(words)

Construct a lexicon. It can be empty (no arguments), read from an IO-like object or a file name, or built from a sequence/iterable of words.

A lexicon is a list of (known) words, each of which can carry one or more tags (e.g. indicating how the word is known).

CJKFrequencies.tagged_with — Function

tagged_with(lexicon, tag)

The set of words or characters in a lexicon tagged with tag.
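As a rough mental model of tagging, the sketch below stands in for a lexicon with a plain `Dict` from each word to its set of tags. The names `toy_lexicon` and `sketch_tagged_with` are hypothetical, and string tags are an assumption; the package's actual storage and tag type may differ.

```julia
# Hypothetical stand-in for a lexicon: each word maps to its set of tags.
toy_lexicon = Dict(
    "你好" => Set(["seen", "studied"]),
    "谢谢" => Set(["seen"]),
    "再见" => Set{String}(),   # a known word with no tags
)

# Hypothetical counterpart of `tagged_with`: all words carrying `tag`.
sketch_tagged_with(lexicon, tag) =
    Set(word for (word, tags) in lexicon if tag in tags)

studied = sketch_tagged_with(toy_lexicon, "studied")
seen = sketch_tagged_with(toy_lexicon, "seen")
```

Returning a `Set` mirrors the description above: the result is a collection of words in which order and multiplicity carry no meaning.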

Coverage and Mutual Information

"Coverage" is a family of statistics about the overlap between

character frequencies,
lexicons, and
text (represented by its character frequencies).

CJKFrequencies.coverage — Function

coverage([filter,] coverer, covered)

Compute an "intersection-over-latter" coverage metric.

The coverage of a lexicon of known words over a character frequency list is the fraction of tokens or types in the frequency list that are also present in the lexicon. There are two varieties:

token coverage counts every occurrence, so a repeated character counts each time it appears
type coverage counts each distinct character once

Suppose the lexicon contains all the words you know and the frequency list represents words extracted from a text you wish to read. The token coverage represents how much of the text you are expected to know (by character), and the type coverage represents how much of its vocabulary you are expected to know. The lower the coverage, the higher the "switching cost" in terms of vocabulary.

Coverage can be computed for

a lexicon over a frequency list
a lexicon over text (represented as a frequency list)
one frequency list over another frequency list

by both tokens and types.
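The two varieties can be sketched with plain dictionaries. This is a minimal illustration of the "intersection-over-latter" idea, not the package's implementation; `sketch_coverage` is a hypothetical helper, and both arguments are assumed to be `Dict`s from a character to its count. Only the keys of the coverer matter, which is why a lexicon (which has no counts) can play that role.

```julia
# Hypothetical helper, not part of CJKFrequencies: coverage of `coverer`
# over `covered`, both plain Dicts mapping a character to its count.
function sketch_coverage(coverer, covered)
    # Characters of `covered` that the coverer also contains.
    shared = [ch for ch in keys(covered) if haskey(coverer, ch)]
    # Token coverage: share of all occurrences in `covered` that are known.
    token_coverage = sum((covered[ch] for ch in shared); init=0) / sum(values(covered))
    # Type coverage: share of distinct characters in `covered` that are known.
    type_coverage = length(shared) / length(covered)
    return (token_coverage = token_coverage, type_coverage = type_coverage)
end

known = Dict("a" => 5, "b" => 11, "c" => 3)
text = Dict("b" => 2, "c" => 4, "d" => 4)
result = sketch_coverage(known, text)
println(result)
```

With these counts, 6 of the 10 tokens and 2 of the 3 types in `text` are covered, the same values this page reports for `coverage(set1, set2)`.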

Parameters

The first parameter must be a covering type, i.e. one of Accumulator, CJKFrequency, or Lexicon. The second parameter must be a coverable type, i.e. either Accumulator or CJKFrequency. Anything else must be convertible to CJKFrequency via the charfreq function.

If three arguments are provided, the first argument acts as a context filter. This must be a covering type. For example, if the arguments are a lexicon, frequency list, and some text (in that order), the coverage of the frequency list over the text will be computed, ignoring any characters in the text that do not appear in the lexicon.

Examples
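The three-argument form described under Parameters can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: `sketch_filtered_coverage` is a hypothetical helper, the context filter is modelled as a `Set` of characters, and only the covered argument is restricted by the filter.

```julia
# Hypothetical helper: drop characters of `covered` that are absent from
# `context`, then compute token and type coverage on what remains.
function sketch_filtered_coverage(context, coverer, covered)
    # Assumption: the filter restricts only the covered side.
    kept = filter(p -> first(p) in context, covered)
    shared = [ch for ch in keys(kept) if haskey(coverer, ch)]
    # Note: if nothing survives the filter, these divisions yield NaN.
    token_coverage = sum((kept[ch] for ch in shared); init=0) / sum(values(kept); init=0)
    type_coverage = length(shared) / length(kept)
    return (token_coverage = token_coverage, type_coverage = type_coverage)
end

lexicon_chars = Set(["a", "b", "d"])            # stands in for a lexicon
freq = Dict("a" => 5, "b" => 11, "c" => 3)      # frequency list (coverer)
text = Dict("b" => 2, "c" => 4, "d" => 4)       # text as counts (covered)

filtered = sketch_filtered_coverage(lexicon_chars, freq, text)
println(filtered)
```

Here "c" is ignored because it is not in the lexicon, so only "b" and "d" remain in the text; of those, "b" is covered by the frequency list.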

Mutual information is a more rigorously defined concept from information theory. In this context, some possible interpretations are

"the amount of text you can understand from knowing a set of characters"
"the amount of one text you can read if you can read this other text"

CJKFrequencies.mutual_information — Function

mutual_information(charfreq1, charfreq2)

Compute the mutual information between two frequency lists, in bits.

Subject to refinement

Because no good source of information for the joint probability mass function (PMF) is available, this function currently approximates it using the average of the two marginal PMFs.
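One possible reading of this approximation is sketched below. It is an assumption, chosen because it reproduces the value this page shows for `mutual_information(set1, set2)`, and it is not taken from the package source: for each shared character, the joint probability is replaced by the mean of the two marginal probabilities, and characters present in only one list are skipped. `sketch_mutual_information` is a hypothetical name.

```julia
# Hypothetical helper, not the package's implementation.
function sketch_mutual_information(freq1, freq2)
    n1, n2 = sum(values(freq1)), sum(values(freq2))
    bits = 0.0
    for (ch, count1) in freq1
        haskey(freq2, ch) || continue      # only shared characters contribute
        p1 = count1 / n1                   # marginal probability in list 1
        p2 = freq2[ch] / n2                # marginal probability in list 2
        joint = (p1 + p2) / 2              # assumed stand-in for the joint PMF
        bits += joint * log2(joint / (p1 * p2))
    end
    return bits
end

set1 = Dict("a" => 5, "b" => 11, "c" => 3)
set2 = Dict("b" => 2, "c" => 4, "d" => 4)
mi = sketch_mutual_information(set1, set2)
println(mi)
```

Because the averaged term and the product of marginals are both unchanged when the two lists are swapped, this sketch is symmetric in its arguments.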

Differences between coverage and mutual information metrics:

Coverage is asymmetric because the intersection is normalized by the second argument; mutual information is symmetric.
Coverage considers counts of shared tokens or types; mutual information considers (entropy of) shared tokens only.
Due to its grounding in information theory, mutual information also includes a log factor.

A middle ground between mutual information and coverage is a symmetric "intersection-over-union" coverage:

set1 = CJKFrequency("a" => 5, "b" => 11, "c" => 3)
set2 = CJKFrequency("b" => 2, "c" => 4, "d" => 4)

size(set1 ∩ set2) / size(set1 ∪ set2)

0.7368421052631579

coverage(set1, set2)

(token_coverage = 0.6, type_coverage = 0.6666666666666666)

mutual_information(set1, set2)

1.2793598991105817