Frequency List API

A character (or word) frequency list can be computed from text or loaded from a predefined corpus via the charfreq function.

CJKFrequencies.charfreq — Function

charfreq(text)
charfreq(charfreq_type)

Create a character frequency mapping from text, or load one from the default location of a predefined character frequency dataset (e.g. SimplifiedLCMC, SimplifiedJunDa, etc.).

Examples

When creating a character frequency from text, this method behaves almost exactly like DataStructures.counter, except that the return value is always a CJKFrequency, which wraps an Accumulator.

julia> text = split("王老师性格内向，沉默寡言，我除在课外活动小组“文学研究会”听过他一次报告，并听-邓知识渊博，是“老师的老师”外，对他一无所知。所以，研读他的作", "");

julia> charfreq(text)
CJKFrequency{SubString{String}, Int64}(Accumulator(除 => 1, 报 => 1, 是 => 1, 知 => 2, 并 => 1, 性 => 1, ， => 6, 言 => 1, 邓 => 1, 外 => 2, 所 => 2, 对 => 1, 动 => 1, 寡 => 1, 。 => 1, 渊 => 1, 学 => 1, - => 1, 听 => 2, 我 => 1, 次 => 1, 一 => 2, 读 => 1, 作 => 1, 格 => 1, “ => 2, 博 => 1, 课 => 1, 老 => 3, 会 => 1, 告 => 1, 无 => 1, 活 => 1, 组 => 1, 内 => 1, 师 => 3, 的 => 2, 小 => 1, 文 => 1, 默 => 1, 究 => 1, 过 => 1, 在 => 1, 以 => 1, ” => 2, 研 => 2, 他 => 3, 向 => 1, 沉 => 1, 王 => 1), Base.RefValue{Int64}(71))
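For intuition, the counting step can be approximated in plain Julia without any packages. This is only a sketch and not the package's implementation: the helper name `count_chars` is hypothetical, and it returns an ordinary `Dict` rather than a CJKFrequency.

```julia
# Hypothetical helper (not part of CJKFrequencies): count how often each
# character occurs in a string, using only Base Julia.
function count_chars(text::AbstractString)
    counts = Dict{String,Int}()
    for ch in text                       # iterating a String yields Chars
        key = string(ch)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

counts = count_chars("老师的老师")
println(counts["老"])          # 2
println(counts["的"])          # 1
println(sum(values(counts)))   # 5
```

In practice charfreq is preferable, since its result also supports the operations described for the frequency list type.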

See the documentation for individual character frequency dataset structs for examples of the second case.

Supported Predefined Character Frequency Datasets

The struct name of each Chinese character frequency dataset is prefixed with either Traditional or Simplified, depending on whether the dataset is based on a traditional or simplified text corpus.

CJKFrequencies.SimplifiedLCMC — Type

SimplifiedLCMC([categories])

A word frequency dataset built from the Lancaster Corpus of Mandarin Chinese (LCMC), a simplified text corpus; it contains simplified terms only. See their website for more details about the corpus.

The word frequency can be based only on selected categories (see CJKFrequencies.LCMC_CATEGORIES for valid category keys and corresponding category names). Any invalid categories will be ignored.

Examples

Loading all the categories:

julia> charfreq(SimplifiedLCMC())
DataStructures.Accumulator{String,Int64} with 45411 entries:
  "一路…   => 1
  "舍得"   => 9
  "５８"   => 1
  "神农…   => 1
  "十点"   => 8
  "随从"   => 9
  "荡心…   => 1
  "尺码"   => 1
  ⋮      => ⋮

Or loading just a subset (argument can be any iterable):

julia> charfreq(SimplifiedLCMC("ABEGKLMNR"))
DataStructures.Accumulator{String,Int64} with 35488 entries:
  "废…  => 1
  "蜷"  => 1
  "哇"  => 13
  "丰…  => 1
  "弊…  => 3
  "议…  => 10
  "滴"  => 28
  "美…  => 1
  ⋮    => ⋮

Licensing/Copyright

Note: This corpus has some conflicting licensing information, depending on who is supplying the data.

The original corpus is provided primarily for non-profit-making research. Be sure to see the full end user license agreement.
Via the Oxford Text Archive, this corpus is distributed under the CC BY-NC-SA 3.0 license.

CJKFrequencies.SimplifiedJunDa — Type

SimplifiedJunDa([list])

A character frequency dataset of modern Chinese compiled by Jun Da, for simplified characters.

By default, the modern Chinese list is fetched; a different list can be selected by passing the corresponding symbol as the list argument. The available lists are as follows:

List Name	Symbol
Modern Chinese (default)	`:modern`
Classical Chinese	`:classical`
Modern + Classical Chinese	`:combined`
《现代汉语常用字表》	`:common`
News Corpus Bigrams	`:bigram_news`
Fiction Corpus Bigrams	`:bigram_fiction`

Note that although :classical uses a Classical Chinese corpus, it still uses the simplified character set.

Examples

julia> charfreq(SimplifiedJunDa())
DataStructures.Accumulator{String,Int64} with 9932 entries:
  "蜷… => 837
  "哇… => 4055
  "湓… => 62
  "滴… => 8104
  "堞… => 74
  "狭… => 6901
  "尚… => 38376
  "懈… => 2893
  ⋮   => ⋮

Licensing/Copyright

The original author maintains full copyright to the character frequency lists, but provides the lists for research and teaching/learning purposes only; commercial use requires permission from the author. See their full disclaimer and copyright notice.

CJKFrequencies.TraditionalHuangTsai — Type

TraditionalHuangTsai()

A character frequency dataset initially created by Shih-Kun Huang and then further compiled and added to by Chih-Hao Tsai.

The original corpus was collected in 1993-94.

Licensing/Copyright

CJKFrequencies.SimplifiedLeidenWeibo — Type

SimplifiedLeidenWeibo()

A word frequency dataset built from Weibo messages. The corpus also includes geo-lexical frequencies keyed by city, but these are not included in this frequency list.

This data was collected in 2012.

Licensing/Copyright

The data is generated from the Leiden Weibo Corpus, which is released openly under the CC BY-NC-SA 3.0 license.

CJKFrequencies.SimplifiedSUBTLEX — Type

SimplifiedSUBTLEX(form)

A word and character frequency dataset generated from film subtitles. To get the respective frequency list, pass either :char or :word for the form parameter.

This dataset was published in 2010.

Licensing/Copyright

The dataset was developed under a non-commercial grant, and the researchers have made it freely available for research purposes.

More datasets are planned. To add a dataset to this API, see the Developer Docs page.

Frequency List Type

CJKFrequencies.CJKFrequency — Type

Accumulator-like data structure for storing frequencies of CJK words (although other tokens can be stored as well). In most cases it behaves like an Accumulator{String, Int}.

You generally don't need to explicitly call this struct's constructor yourself; rather, use the charfreq function.

Common operations on CJKFrequency:

DataStructures.inc!
DataStructures.dec!
DataStructures.reset!
and most typical "iterable" or "indexable" functions.

Both length and size are defined: the length of a frequency list is the number of distinct terms it contains, whereas the size is the total count of all tokens.
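The difference between the two quantities can be illustrated with a plain `Dict` standing in for a frequency list. This is an assumption made for illustration only: the variable names `n_terms` and `n_tokens` are hypothetical, and on an actual CJKFrequency the same two numbers are what length and size report according to the description above.

```julia
# A plain Dict used as a stand-in for a frequency list (illustration only).
counts = Dict("老" => 3, "师" => 3, "的" => 2)

n_terms  = length(counts)        # number of distinct terms
n_tokens = sum(values(counts))   # total count of all tokens

println(n_terms)    # 3
println(n_tokens)   # 8
```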

CJKFrequencies.entropy — Function

entropy(charfreq)

Compute the information theoretic entropy for a character frequency, defined as

\[-\sum_{(c, v)\in CF} \frac{v}{s}\log_2\left( \frac{v}{s} \right), \quad s=\sum_{(c, v) \in CF} v\]

where $CF$ is the character frequency, $c$ is a character, $v$ is the count for that character, and $s$ is the total count over all characters.
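As a worked check of the formula, the sketch below evaluates it on plain dictionaries of counts using only Base Julia. The helper name `entropy_bits` is hypothetical and is not the package's entropy function; it simply follows the definition above, skipping zero counts so that they contribute nothing to the sum.

```julia
# Hypothetical helper (not part of CJKFrequencies): Shannon entropy in bits
# of a mapping from tokens to counts, following the formula above.
function entropy_bits(counts::AbstractDict{<:Any,<:Integer})
    s = sum(values(counts))   # total token count
    return -sum(v / s * log2(v / s) for v in values(counts) if v > 0)
end

# Four equally frequent characters: each has probability 1/4.
uniform = Dict("甲" => 1, "乙" => 1, "丙" => 1, "丁" => 1)
println(entropy_bits(uniform))   # 2.0

# Probabilities 1/2, 1/4, 1/4.
skewed = Dict("甲" => 2, "乙" => 1, "丙" => 1)
println(entropy_bits(skewed))    # 1.5
```

The uniform case gives the maximum entropy for four characters, log2(4) = 2 bits; any skew in the counts lowers the value.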
