cnt.rulebase.const package

Submodules

cnt.rulebase.const.chinese_chars module

Consts for detecting Chinese chars.

cnt.rulebase.const.chinese_chars.ITV_CHINESE_CHARS = [(11904, 12019), (12032, 12245), (12272, 12283), (12549, 12591), (12704, 12730), (12736, 12771), (13312, 19893), (19968, 40869), (40870, 40943), (58368, 58856), (58880, 59087), (59413, 59503), (63744, 64217), (131072, 173782), (173824, 177972), (177984, 178205), (178208, 183969), (183984, 191456), (194560, 195101)]

Chinese chars. Ranges pulled from https://www.qqxiuzi.cn/zh/hanzi-unicode-bianma.php. Note that U+3007 is a delimiter and is therefore not included.

Range generation:

lines = '''copy paste the table here'''
# One list of tab-separated cells per table row.
rows = [line.split('\t') for line in lines.strip().split('\n')]
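
The snippet above only splits the pasted table into cells. As a minimal sketch of the remaining step, the following turns each row into a closed interval of integer code points. The helper name parse_hex_ranges and the row layout (last column holding a hex range such as 4E00-9FA5) are assumptions made for illustration, not part of the package.

```python
from typing import List, Tuple


def parse_hex_ranges(table: str) -> List[Tuple[int, int]]:
    """Hypothetical helper: read 'START-END' hex ranges from the last column."""
    ranges = []
    for line in table.strip().split('\n'):
        columns = line.split('\t')
        # Assumed layout: the last cell looks like '4E00-9FA5'.
        start_hex, end_hex = columns[-1].strip().split('-')
        ranges.append((int(start_hex, 16), int(end_hex, 16)))
    return ranges


sample = "Basic CJK\t4E00-9FA5\nExtension A\t3400-4DB5"
print(parse_hex_ranges(sample))  # [(19968, 40869), (13312, 19893)]
```

Both sample intervals appear in ITV_CHINESE_CHARS, which suggests the published list stores inclusive decimal code points.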

cnt.rulebase.const.delimiters module

Consts for detecting delimiter chars.

cnt.rulebase.const.delimiters.ITV_DELIMITERS = [(33, 47), (58, 64), (91, 96), (123, 126), (183, 183), (8208, 8231), (8237, 8238), (8240, 8286), (12289, 12351), (65072, 65103), (65281, 65295), (65306, 65312), (65339, 65344), (65371, 65380), (65504, 65518)]

Delimiters.

cnt.rulebase.const.digits module

Consts for detecting digit chars.

cnt.rulebase.const.digits.ITV_DIGITS = [(48, 57), (65296, 65305)]

Digits.

cnt.rulebase.const.english_chars module

Consts for detecting English chars.

cnt.rulebase.const.english_chars.ITV_ENGLISH_CHARS = [(65, 90), (97, 122), (65313, 65338), (65345, 65370)]

English Chars.

cnt.rulebase.const.utils module

Utility functions.

cnt.rulebase.const.utils.normalize_cjk_compatibility_ideographs(seq)
Return type

str
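
This function carries no description here, so the following is only a guess at what such a normalization might look like, not the package's implementation. It assumes the goal is to map code points in the CJK Compatibility Ideographs block (U+F900 to U+FAFF) to their unified equivalents through the standard library's canonical normalization. The function name is hypothetical.

```python
import unicodedata

# Assumed bounds of the CJK Compatibility Ideographs block.
COMPAT_START, COMPAT_END = 0xF900, 0xFAFF


def normalize_compat_ideographs_sketch(seq: str) -> str:
    """Hypothetical sketch: canonically normalize only compatibility ideographs."""
    out = []
    for char in seq:
        if COMPAT_START <= ord(char) <= COMPAT_END:
            # Characters without a canonical mapping are returned unchanged.
            char = unicodedata.normalize('NFC', char)
        out.append(char)
    return ''.join(out)


print(normalize_compat_ideographs_sketch('\uf900') == '\u8c48')  # True
```

Restricting normalization to that block leaves every other character in the input untouched, which a whole-string normalization would not guarantee.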

cnt.rulebase.const.utils.normalize_cjk_fullwidth_ascii(seq)

Convert fullwidth ASCII to halfwidth ASCII. See https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

Return type

str
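
To illustrate the conversion, here is a minimal sketch that relies on the fixed offset between the fullwidth forms U+FF01 to U+FF5E and ASCII U+0021 to U+007E. The function name is hypothetical, and mapping the ideographic space U+3000 to an ASCII space is an extra assumption. The package's own implementation may differ.

```python
FULLWIDTH_START, FULLWIDTH_END = 0xFF01, 0xFF5E
OFFSET = 0xFEE0  # 0xFF01 - 0x21
IDEOGRAPHIC_SPACE = 0x3000


def fullwidth_to_halfwidth_sketch(seq: str) -> str:
    """Hypothetical sketch: shift fullwidth ASCII variants down to ASCII."""
    out = []
    for char in seq:
        code = ord(char)
        if FULLWIDTH_START <= code <= FULLWIDTH_END:
            char = chr(code - OFFSET)
        elif code == IDEOGRAPHIC_SPACE:
            # Assumption: treat the ideographic space as a plain space.
            char = ' '
        out.append(char)
    return ''.join(out)


# Fullwidth "ABC123!" written with escapes to keep the source ASCII-only.
print(fullwidth_to_halfwidth_sketch('\uff21\uff22\uff23\uff11\uff12\uff13\uff01'))  # ABC123!
```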

cnt.rulebase.const.utils.sorted_chain(*ranges)

Chain & sort ranges.

Return type

List[Tuple[int, int]]
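
One plausible reading of "chain & sort" is to concatenate several interval lists and sort the result by start point. The sketch below shows that reading under a hypothetical name. Whether the real function also merges overlapping intervals is not stated here, and this sketch does not.

```python
from itertools import chain
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]


def sorted_chain_sketch(*ranges: Iterable[Interval]) -> List[Interval]:
    """Hypothetical sketch: flatten the interval lists, then sort the tuples."""
    return sorted(chain(*ranges))


digits = [(48, 57), (65296, 65305)]
letters = [(65, 90), (97, 122)]
print(sorted_chain_sketch(digits, letters))
# [(48, 57), (65, 90), (97, 122), (65296, 65305)]
```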

Module contents

All consts for rule-based tasks.

Naming patterns:

  • EM_*: List of exact match strings.

  • ITV_*: List of closed intervals.

  • RE_*: List of regular expressions.
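
Since each ITV_* constant is a list of closed intervals, membership can be tested with a binary search over the interval starts. The sketch below assumes the list is sorted and non-overlapping. The helper in_intervals is hypothetical, and ITV_DIGITS is copied inline so the example runs on its own.

```python
from bisect import bisect_right
from typing import List, Tuple

# Copied from cnt.rulebase.const.digits for a self-contained example.
ITV_DIGITS = [(48, 57), (65296, 65305)]


def in_intervals(code_point: int, intervals: List[Tuple[int, int]]) -> bool:
    """Hypothetical helper: True if code_point lies in any closed interval."""
    starts = [start for start, _ in intervals]
    # Index of the last interval whose start is <= code_point.
    idx = bisect_right(starts, code_point) - 1
    return idx >= 0 and code_point <= intervals[idx][1]


print(in_intervals(ord('5'), ITV_DIGITS))       # True
print(in_intervals(ord('\uff15'), ITV_DIGITS))  # True (fullwidth digit five)
print(in_intervals(ord('a'), ITV_DIGITS))       # False
```

Both endpoints count as members because the intervals are closed, so 48 and 57 are each treated as digits.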