cnt.rulebase.rules.sentence_segmentation package
Submodules
cnt.rulebase.rules.sentence_segmentation.const module
Constants for detecting sentence endings and valid sentence characters.
cnt.rulebase.rules.sentence_segmentation.const.EM_SENTENCE_ENDINGS
= ['?!"', '?!"', '?!”', '?!”', '?!″', '?!″', '……"', '……"', '……”', '……″', '。\'"', '。'"', "。'”", '。'”', "。'″", '。'″', '。’"', '。’"', '。’”', '。’″', '。′"', '。′"', '。′”', '。′″', '!\'"', '!'"', '!'”', "!'”", '!'″', "!'″", '!’"', '!’"', '!’”', '!’”', '!’″', '!’″', '!′"', '!′"', '!′”', '!′”', '!′″', '!′″', '。"', '。"', '。”', '。″', '!"', '!"', '!”', '!”', '!″', '!″', '?"', '?"', '?”', '?”', '?″', '?″', ';"', ';"', ';”', ';”', ';″', ';″', '?!', '?!', '…', '。', '!', '!', '?', '?', ';', ';']¶ For detecting sentence endings.
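How a list like EM_SENTENCE_ENDINGS can drive segmentation is easiest to see in a small example. The sketch below is not the package's implementation: it uses a naive longest-match scan, and `ENDINGS` (a short subset of the entries above) and `split_at_endings` are hypothetical names introduced for illustration.

```python
# Hypothetical subset of EM_SENTENCE_ENDINGS, chosen for illustration only.
ENDINGS = ['。”', '?”', '?!', '…', '。', '!', '?', ';']


def split_at_endings(text, endings):
    """Split text after each sentence ending, preferring the longest ending."""
    # Try longer endings first so that '。”' wins over the bare '。'.
    ordered = sorted(endings, key=len, reverse=True)
    sentences = []
    start = 0
    pos = 0
    while pos < len(text):
        matched = next((e for e in ordered if text.startswith(e, pos)), None)
        if matched is None:
            pos += 1
            continue
        # Close the current sentence right after the matched ending.
        pos += len(matched)
        sentences.append(text[start:pos])
        start = pos
    if start < len(text):
        # Keep trailing text that has no ending of its own.
        sentences.append(text[start:])
    return sentences


result = split_at_endings('你好。他说“走吧。”好?!', ENDINGS)
```

Sorting by length matters because several endings share a prefix: without it, the closing quote after '。' would be pushed into the next sentence.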
cnt.rulebase.rules.sentence_segmentation.const.ITV_SENTENCE_VALID_CHARS = [(33, 47), (48, 57), (58, 64), (65, 90), (91, 96), (97, 122), (123, 126), (183, 183), (8208, 8231), (8237, 8238), (8240, 8286), (11904, 12019), (12032, 12245), (12272, 12283), (12289, 12351), (12549, 12591), (12704, 12730), (12736, 12771), (13312, 19893), (19968, 40869), (40870, 40943), (58368, 58856), (58880, 59087), (59413, 59503), (63744, 64217), (65072, 65103), (65281, 65295), (65296, 65305), (65306, 65312), (65313, 65338), (65339, 65344), (65345, 65370), (65371, 65380), (65504, 65518), (131072, 173782), (173824, 177972), (177984, 178205), (178208, 183969), (183984, 191456), (194560, 195101)]
Code point intervals for detecting the valid characters of a sentence.
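Each pair appears to be a range of Unicode code points with both ends included (the single-point entry (183, 183) suggests this). A minimal sketch of a membership test under that assumption, using a few pairs copied from the list; `is_valid_char` is a hypothetical helper, not part of the package:

```python
from bisect import bisect_right

# A few pairs copied from ITV_SENTENCE_VALID_CHARS, sorted by start.
# Assumption: both ends of each pair are inclusive.
INTERVALS = [(33, 47), (48, 57), (65, 90), (97, 122), (183, 183),
             (12289, 12351), (19968, 40869)]
STARTS = [start for start, _ in INTERVALS]


def is_valid_char(ch):
    """Return True if the code point of ch falls inside one of INTERVALS."""
    code = ord(ch)
    # Find the last interval whose start is <= code.
    idx = bisect_right(STARTS, code) - 1
    return idx >= 0 and code <= INTERVALS[idx][1]
```

Binary search keeps the lookup logarithmic in the number of intervals, which is convenient when the full 40-entry list is used.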
cnt.rulebase.rules.sentence_segmentation.sentence_segmenter module
Chinese sentence segmentation.
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.CommaLabeler(input_sequence, config)
Bases: cnt.rulebase.workflow.basic_workflow.BasicSequentialLabeler
Marks commas.
COMMAS = (',', '‚', ',')
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.DelimitersLabeler(input_sequence, config)
Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler
Marks delimiters used to extend sentence endings.
ITV_RE_PATTERN: Optional[re] = re.compile('[!-/:-@[-`{-~·-·‐-‧\u202d-\u202e‰-⁞、-〿︰-﹏!-/:-@[-`{-、¢-○]+')
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.DynamicSentenceEndingLabeler(input_sequence, config)
Bases: cnt.rulebase.workflow.exact_match_labeler.ExactMatchLabeler
Supports dynamic sentence endings that are built at runtime.
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceEndingLabeler(input_sequence, config)
Bases: cnt.rulebase.workflow.exact_match_labeler.ExactMatchLabeler
Marks sentence endings based on cnt.rulebase.const.sentence_endings.EM_SENTENCE_ENDINGS.
AC_AUTOMATION: Any = <ahocorasick.Automaton object>
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationConfig(enable_strict_sentence_charset, enable_comma_ending, extend_ending_with_delimiters, dynamic_endings)
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationLabelProcessor(input_sequence, index_labels_generator, config)
Bases: cnt.rulebase.workflow.basic_workflow.BasicLabelProcessor
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationOutputGenerator(input_sequence, label_processor_result, config)
Bases: cnt.rulebase.rules.sentence_segmentation.sentence_segmenter._SentenceSegementationOutputGeneratorLazy
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationOutputGeneratorLazy(input_sequence, label_processor_result, config)
Bases: cnt.rulebase.rules.sentence_segmentation.sentence_segmenter._SentenceSegementationOutputGeneratorLazy
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceValidCharacterLabeler(input_sequence, config)
Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler
Marks valid characters of a Chinese sentence.
ITV_RE_PATTERN
: Optional[re] = re.compile('[!-/0-9:-@A-Z[-`a-z{-~·-·‐-‧\u202d-\u202e‰-⁞⺀-⻳⼀-⿕⿰-⿻、-〿ㄅ-\u312fㆠ-ㆺ㇀-㇣㐀-䶵一-龥龦-\u9fef\ue400-\ue5e8\ue600-\ue6cf\ue815-\ue86f豈-龎︰-﹏!-/0-9:-@A-Z[-`a-z{-、¢-○𠀀-𪛖𪜀-𫜴𫝀-𫠝𫠠-𬺡\U0002ceb0-\U0002ebe0丽-𪘀]+')¶
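The character class above mirrors the pairs in ITV_SENTENCE_VALID_CHARS, so it was presumably generated from them. The sketch below shows one way such a pattern could be built and used to find maximal runs of valid characters; `build_interval_pattern` and the three-interval subset are assumptions for illustration, not the package's code.

```python
import re

# Subset of ITV_SENTENCE_VALID_CHARS, used here for brevity.
INTERVALS = [(97, 122), (12289, 12351), (19968, 40869)]


def build_interval_pattern(intervals):
    """Compile a regex matching one or more characters from the intervals."""
    # Each (lo, hi) pair becomes a 'lo-hi' range inside one character class.
    parts = [f"{re.escape(chr(lo))}-{re.escape(chr(hi))}" for lo, hi in intervals]
    return re.compile("[" + "".join(parts) + "]+")


pattern = build_interval_pattern(INTERVALS)
# Spans are half-open (start, end) offsets into the input string.
spans = [m.span() for m in pattern.finditer("abc 中文。 def")]
```

Here `spans` holds one entry per run of valid characters, with the two spaces acting as separators.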
-
-
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.WhitespaceLabeler(input_sequence, config)
Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler
Marks Unicode whitespace.
ITV_RE_PATTERN: Optional[re] = re.compile('\\s+')
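For the whitespace pattern, a short sketch of what `\s+` matches may help. In Python 3, `\s` in a `str` pattern matches Unicode whitespace by default, including the ideographic space U+3000 common in Chinese text. `whitespace_spans` is a hypothetical helper written for this example and does not use the package's labeler classes.

```python
import re

# Same pattern as WhitespaceLabeler.ITV_RE_PATTERN.
WHITESPACE = re.compile(r"\s+")


def whitespace_spans(text):
    """Return half-open (start, end) spans for each run of whitespace."""
    return [m.span() for m in WHITESPACE.finditer(text)]


# U+3000 followed by an ASCII space forms a single run; the newline is another.
spans = whitespace_spans("你好。\u3000 再见!\n")
```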
Module contents
Chinese sentence segmentation.