Frequency List API
A character (or word) frequency can be computed or loaded via the charfreq
function, either from some text or a predefined corpus.
CJKFrequencies.charfreq — Function
charfreq(text)
charfreq(charfreq_type)
Create a character frequency mapping from text, or load one from the default location of a predefined character frequency dataset (e.g. SimplifiedLCMC, SimplifiedJunDa, etc.).
Examples
When creating a character frequency from text, this method behaves almost exactly like DataStructures.counter, except that the return value always has type CJKFrequency (which usually behaves like Accumulator{String, Int}).
julia> text = split("王老师性格内向,沉默寡言,我除在课外活动小组“文学研究会”听过他一次报告,并听-邓知识渊博,是“老师的老师”外,对他一无所知。所以,研读他的作", "");
julia> charfreq(text)
CJKFrequency{SubString{String}, Int64}(Accumulator(除 => 1, 报 => 1, 是 => 1, 知 => 2, 并 => 1, 性 => 1, , => 6, 言 => 1, 邓 => 1, 外 => 2, 所 => 2, 对 => 1, 动 => 1, 寡 => 1, 。 => 1, 渊 => 1, 学 => 1, - => 1, 听 => 2, 我 => 1, 次 => 1, 一 => 2, 读 => 1, 作 => 1, 格 => 1, “ => 2, 博 => 1, 课 => 1, 老 => 3, 会 => 1, 告 => 1, 无 => 1, 活 => 1, 组 => 1, 内 => 1, 师 => 3, 的 => 2, 小 => 1, 文 => 1, 默 => 1, 究 => 1, 过 => 1, 在 => 1, 以 => 1, ” => 2, 研 => 2, 他 => 3, 向 => 1, 沉 => 1, 王 => 1), Base.RefValue{Int64}(71))
See the documentation for individual character frequency dataset structs for examples of the second case.
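The counting behavior in the first case can be pictured with a minimal sketch in plain Base Julia. This is not the package's implementation: the helper name `count_chars` is hypothetical, and a `Dict` stands in for the `CJKFrequency` type.

```julia
# Hypothetical helper: count how often each character occurs in a string.
# A plain Dict is used here as a stand-in for CJKFrequency.
function count_chars(text::AbstractString)
    counts = Dict{String,Int}()
    for ch in text
        key = string(ch)                       # store each character as a String key
        counts[key] = get(counts, key, 0) + 1  # increment, starting from 0
    end
    return counts
end

counts = count_chars("老师的老师")
```

Here each distinct character becomes a key and its value is the number of occurrences, which is the same shape of result that charfreq(text) reports.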
Supported Predefined Character Frequency Datasets
The name of a Chinese character frequency dataset's struct is prefixed with either Traditional or Simplified, depending on whether it is based on a traditional or simplified text corpus.
CJKFrequencies.SimplifiedLCMC — Type
SimplifiedLCMC([categories])
A word frequency dataset: the Lancaster Corpus for Mandarin Chinese, simplified terms only, based on a simplified text corpus. See their website for more details about the corpus.
The word frequency can be based only on selected categories (see CJKFrequencies.LCMC_CATEGORIES for valid category keys and corresponding category names). Any invalid categories will be ignored.
Examples
Loading all the categories:
julia> charfreq(SimplifiedLCMC())
DataStructures.Accumulator{String,Int64} with 45411 entries:
"一路… => 1
"舍得" => 9
"58" => 1
"神农… => 1
"十点" => 8
"随从" => 9
"荡心… => 1
"尺码" => 1
⋮ => ⋮
Or loading just a subset (argument can be any iterable):
julia> charfreq(SimplifiedLCMC("ABEGKLMNR"))
DataStructures.Accumulator{String,Int64} with 35488 entries:
"废… => 1
"蜷" => 1
"哇" => 13
"丰… => 1
"弊… => 3
"议… => 10
"滴" => 28
"美… => 1
⋮ => ⋮
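The rule that invalid categories are ignored can be illustrated with a small Base Julia sketch. Everything here is an assumption for illustration: the helper `keep_valid` is hypothetical, and the key set below is a placeholder rather than the actual contents of CJKFrequencies.LCMC_CATEGORIES.

```julia
# Placeholder set of valid category keys (an assumption; consult
# CJKFrequencies.LCMC_CATEGORIES for the real keys).
valid_keys = Set("ABCDEFGHJKLMNPR")

# Hypothetical helper: keep recognized keys and silently drop the rest.
keep_valid(categories, valid) = [c for c in categories if c in valid]

# Any iterable of keys works, e.g. a String or a Vector{Char}.
selected = keep_valid("ABXZ", valid_keys)
```

With this placeholder set, 'X' and 'Z' are dropped without raising an error, mirroring the documented behavior.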
Licensing/Copyright
Note: This corpus has some conflicting licensing information, depending on who is supplying the data.
- The original corpus is provided primarily for non-profit-making research. Be sure to see the full end user license agreement.
- Via the Oxford Text Archive, this corpus is distributed under the CC BY-NC-SA 3.0 license.
CJKFrequencies.SimplifiedJunDa — Type
SimplifiedJunDa([list])
A character frequency dataset of modern Chinese compiled by Jun Da, for simplified characters.
By default, the modern Chinese list is fetched, but this can be changed by providing a different list argument. The available lists are as follows:
List Name | Symbol |
---|---|
Modern Chinese (default) | :modern |
Classical Chinese | :classical |
Modern + Classical Chinese | :combined |
《现代汉语常用字表》 | :common |
News Corpus Bigrams | :bigram_news |
Fiction Corpus Bigrams | :bigram_fiction |
Note that although :classical uses a Classical Chinese corpus, it still uses the simplified character set.
Examples
julia> charfreq(SimplifiedJunDa())
DataStructures.Accumulator{String,Int64} with 9932 entries:
"蜷… => 837
"哇… => 4055
"湓… => 62
"滴… => 8104
"堞… => 74
"狭… => 6901
"尚… => 38376
"懈… => 2893
⋮ => ⋮
Licensing/Copyright
The original author maintains full copyright to the character frequency lists, but provides the lists for research and teaching/learning purposes only. Commercial use is not allowed without permission from the author. See their full disclaimer and copyright notice.
CJKFrequencies.TraditionalHuangTsai — Type
TraditionalHuangTsai()
A character frequency dataset initially created by Shih-Kun Huang and then further compiled and added to by Chih-Hao Tsai.
The original corpus was collected in 1993-94.
Licensing/Copyright
Copyright 1996-2006 Chih-Hao Tsai. Licensing information unknown, so use at your own risk.
CJKFrequencies.SimplifiedLeidenWeibo — Type
SimplifiedLeidenWeibo()
A word frequency dataset built from Weibo messages. The corpus also includes geo-lexical frequencies keyed by city, but these are not included in this frequency list.
This data was collected in 2012.
Licensing/Copyright
The data is generated from the Leiden Weibo Corpus, which is released openly under the CC BY-NC-SA 3.0 license.
CJKFrequencies.SimplifiedSUBTLEX — Type
SimplifiedSUBTLEX(form)
A word and character frequency dataset generated from film subtitles. To get the respective frequency list, pass either :char or :word for the form parameter.
This dataset was published in 2010.
Licensing/Copyright
The dataset was developed under a non-commercial grant, and the researchers have released free access for research purposes.
Support for other datasets is planned. To add a dataset to this API, see the Developer Docs page.
Frequency List Type
CJKFrequencies.CJKFrequency — Type
Accumulator-like data structure for storing frequencies of CJK words (although other tokens can be stored as well). It usually behaves like the type Accumulator{String, Int}.
You generally don't need to explicitly call this struct's constructor yourself; rather, use the charfreq function.
Common operations on CJKFrequency:
- DataStructures.inc!
- DataStructures.dec!
- DataStructures.reset!
- and most typical "iterable" or "indexable" functions.
Both length and size are defined: the length of a frequency list is the number of terms in the frequency list, whereas the size is the total count of all tokens.
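The distinction between the two quantities can be shown on a plain `Dict`, used here only as a stand-in for a frequency list (an assumption for illustration; the variable names are hypothetical, and the package's own `size` method is not called).

```julia
# Stand-in frequency list: term => count.
counts = Dict("老" => 3, "师" => 3, "的" => 2)

# "Length" in the sense above: the number of distinct terms.
n_terms = length(counts)

# "Size" in the sense above: the total count of all tokens.
n_tokens = sum(values(counts))
```

For this example there are three distinct terms but eight tokens in total, so the two quantities differ whenever any term occurs more than once.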
CJKFrequencies.entropy — Function
entropy(charfreq)
Compute the information theoretic entropy for a character frequency, defined as
\[-\sum_{(c, v)\in CF} \frac{v}{s}\log_2\left( \frac{v}{s} \right), \quad s=\sum_{(c, v) \in CF} v\]
where $CF$ is the character frequency, $c$ is a character, $v$ is the count for that character, and $s$ is the total count of all tokens.
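As a rough illustration of the formula, the following Base Julia sketch computes the same quantity from a plain `Dict` of counts. The function name `shannon_entropy` is hypothetical, the `Dict` is a stand-in for a character frequency, and skipping zero counts is an assumption of this sketch rather than documented behavior of the package.

```julia
# Hypothetical helper: Shannon entropy (in bits) of a term => count mapping.
function shannon_entropy(counts::AbstractDict)
    s = sum(values(counts))      # total count of all tokens
    h = 0.0
    for v in values(counts)
        v > 0 || continue        # assumption: zero counts contribute nothing
        p = v / s                # relative frequency v / s
        h -= p * log2(p)         # accumulate -p * log2(p)
    end
    return h
end

# Four equally frequent characters carry log2(4) = 2 bits of entropy.
h = shannon_entropy(Dict("a" => 1, "b" => 1, "c" => 1, "d" => 1))
```

A frequency list with a single term yields an entropy of zero, while a more even spread of counts over more terms yields a larger value.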