Frequency List API

A character (or word) frequency can be computed or loaded via the charfreq function, either from some text or a predefined corpus.

CJKFrequencies.charfreq - Function
charfreq(text)
charfreq(charfreq_type)

Create a character frequency mapping from text, or load one from the default location of a predefined character frequency dataset (e.g. SimplifiedLCMC, SimplifiedJunDa, etc.).

Examples

When creating a character frequency from text, this method behaves almost exactly like DataStructures.counter, except that the return value is always a CJKFrequency wrapping the Accumulator, as the output of the example shows.

julia> text = split("王老师性格内向,沉默寡言,我除在课外活动小组“文学研究会”听过他一次报告,并听-邓知识渊博,是“老师的老师”外,对他一无所知。所以,研读他的作", "");

julia> charfreq(text)
CJKFrequency{SubString{String}, Int64}(Accumulator(除 => 1, 报 => 1, 是 => 1, 知 => 2, 并 => 1, 性 => 1, , => 6, 言 => 1, 邓 => 1, 外 => 2, 所 => 2, 对 => 1, 动 => 1, 寡 => 1, 。 => 1, 渊 => 1, 学 => 1, - => 1, 听 => 2, 我 => 1, 次 => 1, 一 => 2, 读 => 1, 作 => 1, 格 => 1, “ => 2, 博 => 1, 课 => 1, 老 => 3, 会 => 1, 告 => 1, 无 => 1, 活 => 1, 组 => 1, 内 => 1, 师 => 3, 的 => 2, 小 => 1, 文 => 1, 默 => 1, 究 => 1, 过 => 1, 在 => 1, 以 => 1, ” => 2, 研 => 2, 他 => 3, 向 => 1, 沉 => 1, 王 => 1), Base.RefValue{Int64}(71))
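The counting step itself can be pictured with Base Julia alone. The following is a minimal sketch, not the package's implementation: `count_tokens` is a hypothetical helper, and it returns a plain Dict rather than a CJKFrequency.

```julia
# Count how often each element of an iterable occurs, using only Base Julia.
# `count_tokens` is a hypothetical helper and is not part of CJKFrequencies.
function count_tokens(tokens)
    counts = Dict{String,Int}()
    for t in tokens
        key = String(t)                       # normalize SubString to String
        counts[key] = get(counts, key, 0) + 1 # start at 0 for unseen tokens
    end
    return counts
end

# Splitting on the empty string yields one token per character.
tokens = split("老师的老师", "")
counts = count_tokens(tokens)
```

Here `counts` maps "老" and "师" to 2 and "的" to 1. charfreq adds the CJKFrequency wrapper on top of this kind of tally.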

See the documentation for individual character frequency dataset structs for examples of the second case.


Supported Predefined Character Frequency Datasets

The struct name of a Chinese character frequency dataset is prefixed with either Traditional or Simplified, depending on whether the dataset is based on a traditional or simplified text corpus.

CJKFrequencies.SimplifiedLCMC - Type
SimplifiedLCMC([categories])

A word frequency dataset built from the Lancaster Corpus of Mandarin Chinese (LCMC), which is based on a simplified text corpus and contains simplified terms only. See the corpus website for more details.

The word frequency can be based only on selected categories (see CJKFrequencies.LCMC_CATEGORIES for valid category keys and corresponding category names). Any invalid categories will be ignored.
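The "invalid categories are ignored" behavior can be pictured as a simple filter. This is only a sketch under stated assumptions: `ASSUMED_LCMC_KEYS` and `select_categories` are hypothetical names, the key set is an assumption that has not been read from CJKFrequencies.LCMC_CATEGORIES, and the keys are assumed to be single characters.

```julia
# Hypothetical set of valid category keys (an assumption for illustration;
# the authoritative list is CJKFrequencies.LCMC_CATEGORIES).
const ASSUMED_LCMC_KEYS = Set("ABCDEFGHJKLMNPR")

# Keep only recognized keys, silently dropping everything else.
# Works for any iterable of characters, e.g. a String or a Vector{Char}.
select_categories(cats) = [c for c in cats if c in ASSUMED_LCMC_KEYS]

selected = select_categories("ABXZ")  # 'X' and 'Z' are dropped
```

With this filter, `selected` contains only 'A' and 'B', so a typo in the category argument narrows the selection rather than raising an error.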

Examples

Loading all the categories:

julia> charfreq(SimplifiedLCMC())
DataStructures.Accumulator{String,Int64} with 45411 entries:
  "一路…   => 1
  "舍得"   => 9
  "58"   => 1
  "神农…   => 1
  "十点"   => 8
  "随从"   => 9
  "荡心…   => 1
  "尺码"   => 1
  ⋮      => ⋮

Or loading just a subset (argument can be any iterable):

julia> charfreq(SimplifiedLCMC("ABEGKLMNR"))
DataStructures.Accumulator{String,Int64} with 35488 entries:
  "废…  => 1
  "蜷"  => 1
  "哇"  => 13
  "丰…  => 1
  "弊…  => 3
  "议…  => 10
  "滴"  => 28
  "美…  => 1
  ⋮    => ⋮

Licensing/Copyright

Note: This corpus has some conflicting licensing information, depending on who is supplying the data.

CJKFrequencies.SimplifiedJunDa - Type
SimplifiedJunDa([list])

A character frequency dataset of modern Chinese compiled by Jun Da, for simplified characters.

By default, the modern Chinese list is fetched, but another list can be selected by passing a different list argument. The available lists are as follows:

List Name                      Symbol
Modern Chinese (default)       :modern
Classical Chinese              :classical
Modern + Classical Chinese     :combined
《现代汉语常用字表》              :common
News Corpus Bigrams            :bigram_news
Fiction Corpus Bigrams         :bigram_fiction

Note that although :classical uses a Classical Chinese corpus, it still uses the simplified character set.

Examples

julia> charfreq(SimplifiedJunDa())
DataStructures.Accumulator{String,Int64} with 9932 entries:
  "蜷… => 837
  "哇… => 4055
  "湓… => 62
  "滴… => 8104
  "堞… => 74
  "狭… => 6901
  "尚… => 38376
  "懈… => 2893
  ⋮   => ⋮

Licensing/Copyright

The original author maintains full copyright to the character frequency lists, but provides the lists for research and teaching/learning purposes only. Commercial use requires permission from the author. See their full disclaimer and copyright notice.

CJKFrequencies.TraditionalHuangTsai - Type
TraditionalHuangTsai()

A character frequency dataset initially created by Shih-Kun Huang and then further compiled and added to by Chih-Hao Tsai.

The original corpus was collected in 1993-94.

Licensing/Copyright

Copyright 1996-2006 Chih-Hao Tsai. Licensing information unknown, so use at your own risk.

CJKFrequencies.SimplifiedLeidenWeibo - Type
SimplifiedLeidenWeibo()

A word frequency dataset built from Weibo messages. The corpus also includes geo-lexical frequencies keyed by city, but these are not included in this frequency list.

This data was collected in 2012.

Licensing/Copyright

The data is generated from the Leiden Weibo Corpus, which is released openly under the CC BY-NC-SA 3.0 license.

CJKFrequencies.SimplifiedSUBTLEX - Type
SimplifiedSUBTLEX(form)

A word and character frequency dataset generated from film subtitles. To get the respective frequency list, pass either :char or :word for the form parameter.

This dataset was published in 2010.

Licensing/Copyright

The dataset was developed under a non-commercial grant, and the researchers have released free access for research purposes.


More datasets are planned. To add a dataset to this API, see the Developer Docs page.

Frequency List Type

CJKFrequencies.CJKFrequency - Type

Accumulator-like data structure for storing frequencies of CJK words (although other tokens can be stored as well). In most cases it behaves like an Accumulator{String, Int}.

You generally don't need to explicitly call this struct's constructor yourself; rather, use the charfreq function.


Common operations on CJKFrequency:

  • DataStructures.inc!
  • DataStructures.dec!
  • DataStructures.reset!
  • and most typical "iterable" or "indexable" functions.

Both length and size are defined: the length of a frequency list is the number of terms in the frequency list, whereas the size is the total count of all tokens.
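The difference between the two quantities can be illustrated with a plain Dict standing in for a frequency list. This is a sketch only: the variable names are hypothetical, and on an actual CJKFrequency the two values are obtained through length and size as described above.

```julia
# A plain Dict used as a stand-in for a frequency list (illustration only).
counts = Dict("老" => 3, "师" => 3, "的" => 2)

num_terms = length(counts)           # number of distinct terms: 3
total_tokens = sum(values(counts))   # total count of all tokens: 8
```

The first number grows only when a new term appears, while the second grows with every counted token.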

CJKFrequencies.entropy - Function
entropy(charfreq)

Compute the information-theoretic entropy (in bits) for a character frequency, defined as

\[-\sum_{(c, v)\in CF} \frac{v}{s}\log_2\left( \frac{v}{s} \right), \quad s=\sum_{(c, v) \in CF} v\]

where $c$ is a character, $v$ is the count for that character, and $s$ is the total count over the character frequency $CF$.
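As a worked instance of the formula, the sketch below computes the same quantity over a plain Dict of counts. `entropy_bits` is a hypothetical helper written for illustration and is not the package's entropy function.

```julia
# Shannon entropy in bits of a count table, following the formula above.
# `entropy_bits` is a hypothetical helper and is not part of CJKFrequencies.
function entropy_bits(counts)
    s = sum(values(counts))  # total count over all entries
    # Zero counts contribute nothing and are skipped to avoid log2(0).
    return -sum(v / s * log2(v / s) for v in values(counts) if v > 0)
end

# Counts 1, 1, 2 give probabilities 1/4, 1/4, 1/2.
h = entropy_bits(Dict("a" => 1, "b" => 1, "c" => 2))
```

For these counts the sum is 0.25 * 2 + 0.25 * 2 + 0.5 * 1 = 1.5 bits. A uniform distribution over four entries would give 2 bits.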
