Lexicon API
Lexicons
CJKFrequencies.Lexicon — Type
Lexicon()
Lexicon(io_or_filename)
Lexicon(words)
Construct a lexicon. It can be empty (no parameters) or created from some IO-like object or a sequence/iterable of words.
A lexicon is a list of (known) words, each of which can carry one or more tags (e.g. indicating how the word is known).
CJKFrequencies.tagged_with — Function
tagged_with(lexicon, tag)
The set of words or characters in a lexicon tagged with tag.
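To make the idea of a tagged word list concrete, here is a minimal sketch of a toy lexicon. The names `ToyLexicon`, `tag!`, and `toy_tagged_with` are hypothetical and are not part of CJKFrequencies. Representing tags as `Symbol`s is also an assumption, since this page does not say how tags are stored.

```julia
# Hypothetical toy model, not the CJKFrequencies implementation:
# each word maps to the set of tags attached to it.
struct ToyLexicon
    tags::Dict{String,Set{Symbol}}
    # Empty lexicon, mirroring the zero-argument constructor.
    ToyLexicon() = new(Dict{String,Set{Symbol}}())
end

# Lexicon from any iterable of words; every word starts untagged.
function ToyLexicon(words)
    lex = ToyLexicon()
    for w in words
        get!(lex.tags, String(w), Set{Symbol}())
    end
    return lex
end

# Attach a tag to a word, adding the word if it is not yet present.
function tag!(lex::ToyLexicon, word, tag::Symbol)
    push!(get!(lex.tags, String(word), Set{Symbol}()), tag)
    return lex
end

# All words carrying the given tag.
toy_tagged_with(lex::ToyLexicon, tag::Symbol) =
    Set(w for (w, ts) in lex.tags if tag in ts)

lex = ToyLexicon(["你好", "谢谢", "再见"])
tag!(lex, "你好", :textbook)
tag!(lex, "谢谢", :textbook)
tag!(lex, "谢谢", :flashcards)
toy_tagged_with(lex, :textbook)   # the set containing "你好" and "谢谢"
```

A word can hold several tags at once, so the same word may appear in the result for more than one tag.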
Coverage and Mutual Information
"Coverage" is a family of statistics about the overlap between
- character frequencies,
- lexicons, and
- text (represented by its character frequencies).
CJKFrequencies.coverage — Function
coverage([filter,] coverer, covered)
Compute an "intersection-over-latter" coverage metric.
The coverage of a lexicon of known words over a character frequency list is the ratio of tokens or types in the frequency list which are also present in the lexicon. There are two varieties:
- token coverage counts each token separately (considering repeated characters)
- type coverage counts each unique token once
Suppose the lexicon contains all the words you know and the frequency list represents words extracted from a text you wish to read. The token coverage represents how much of the text you are expected to know (by character), and the type coverage represents how much of the vocabulary you are expected to know. The lower the coverage, the higher the "switching cost" in terms of vocabulary.
Coverage can be computed between
- lexicon over a frequency list
- lexicon over text (represented as a frequency list)
- frequency list over another frequency list
by both tokens and types.
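The two varieties can be sketched with plain dictionaries. The helper `toy_coverage` below is hypothetical and only mirrors the definitions above. It is not the library's `coverage`, and it assumes a frequency list can be modelled as a `Dict` from token to count and a lexicon as a `Set` of tokens.

```julia
# Hypothetical helper: `coverer` is any collection supporting `in`
# (e.g. a Set of known tokens), `covered` maps each token to its count.
function toy_coverage(coverer, covered::Dict{String,Int})
    # Types of `covered` that also occur in `coverer`.
    shared = [tok for tok in keys(covered) if tok in coverer]
    # Token coverage weights each shared type by its count.
    shared_tokens = sum(Int[covered[tok] for tok in shared])
    return (token_coverage = shared_tokens / sum(values(covered)),
            type_coverage = length(shared) / length(covered))
end

known = Set(["a", "b", "c"])
text = Dict("b" => 2, "c" => 4, "d" => 4)
result = toy_coverage(known, text)
# token coverage: (2 + 4) / 10, type coverage: 2 / 3
```

Both ratios are normalized by the second argument only, which is why the metric is "intersection-over-latter". To use a frequency list as the coverer in this sketch, pass its `keys`.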
Parameters
The first parameter must be a covering type, i.e. one of Accumulator, CJKFrequency, or Lexicon. The second parameter must be a coverable type, i.e. either Accumulator or CJKFrequency. Anything else must be convertible to CJKFrequency via the charfreq function.
If three arguments are provided, the first argument acts as a context filter. This must be a covering type. For example, if the arguments are a lexicon, frequency list, and some text (in that order), the coverage of the frequency list over the text will be computed, ignoring any characters in the text that do not appear in the lexicon.
Examples
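As a rough illustration of the three-argument form, the sketch below first drops every token of the covered text that the context filter does not contain, then computes token and type coverage on what remains. `toy_filtered_coverage` is a hypothetical helper built on plain `Dict`s and `Set`s. It is one reading of the description above and not the library's implementation.

```julia
# Hypothetical helper: `context` and `coverer` are collections supporting `in`,
# `covered` maps each token to its count.
function toy_filtered_coverage(context, coverer, covered::Dict{String,Int})
    # Ignore tokens of `covered` that do not appear in the context filter.
    kept = filter(p -> first(p) in context, covered)
    # Among the remaining types, find those the coverer contains.
    shared = [tok for tok in keys(kept) if tok in coverer]
    shared_tokens = sum(Int[kept[tok] for tok in shared])
    return (token_coverage = shared_tokens / sum(values(kept)),
            type_coverage = length(shared) / length(kept))
end

lexicon = Set(["a", "b", "c"])             # context filter
freqlist = Set(["b"])                      # types present in the covering list
text = Dict("b" => 2, "c" => 4, "d" => 4)  # covered text as counts

filtered = toy_filtered_coverage(lexicon, freqlist, text)
# "d" is ignored, so token coverage is 2 / 6 and type coverage is 1 / 2
```

Without the filter, the same coverer would score 2 / 10 by tokens and 1 / 3 by types, because "d" would count toward the denominators.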
Mutual information is a more rigorously defined concept from information theory. In this context, some possible interpretations are
- "the amount of text you can understand from knowing a set of characters"
- "the amount of one text you can read if you can read this other text"
CJKFrequencies.mutual_information — Function
mutual_information(charfreq1, charfreq2)
Compute the mutual information between two frequency lists, in bits.
Because no good source of information is available for the joint PMF (probability mass function), this function currently approximates it using the average of the two marginal PMFs.
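One possible reading of this approximation is sketched below. It assumes that the sum runs over shared tokens only and that the joint probability of a token is the mean of its two marginal probabilities. The helpers `toy_pmf` and `toy_mutual_information` are hypothetical, and the library's actual implementation may differ.

```julia
# Normalize counts into a probability mass function.
function toy_pmf(freq::Dict{String,Int})
    total = sum(values(freq))
    return Dict(tok => n / total for (tok, n) in freq)
end

# Assumed formula: sum over shared tokens of j * log2(j / (p1 * p2)),
# where j is the average of the two marginal probabilities.
function toy_mutual_information(f1::Dict{String,Int}, f2::Dict{String,Int})
    p1, p2 = toy_pmf(f1), toy_pmf(f2)
    mi = 0.0
    for tok in keys(p1)
        haskey(p2, tok) || continue          # only shared tokens contribute
        joint = (p1[tok] + p2[tok]) / 2      # stand-in for the joint PMF
        mi += joint * log2(joint / (p1[tok] * p2[tok]))
    end
    return mi
end

freq1 = Dict("a" => 5, "b" => 11, "c" => 3)
freq2 = Dict("b" => 2, "c" => 4, "d" => 4)
toy_mutual_information(freq1, freq2)   # roughly 1.2794 bits
```

Under these assumptions the result is symmetric in its arguments, and for these inputs it agrees with the value shown in the example at the end of this page.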
Differences between coverage and mutual information metrics:
- Coverage is asymmetric as the intersection is normalized over the second argument; mutual information is symmetric.
- Coverage considers counts of shared tokens or types; mutual information considers (entropy of) shared tokens only.
- Due to its grounding in information theory, mutual information also includes a log factor.
A middle ground between mutual information and coverage is a symmetric "intersection-over-union" coverage:
set1 = CJKFrequency("a" => 5, "b" => 11, "c" => 3)
set2 = CJKFrequency("b" => 2, "c" => 4, "d" => 4)
size(set1 ∩ set2) / size(set1 ∪ set2)
0.7368421052631579
coverage(set1, set2)
(token_coverage = 0.6, type_coverage = 0.6666666666666666)
mutual_information(set1, set2)
1.2793598991105817