cnt.rulebase.rules.sentence_segmentation package

Submodules

cnt.rulebase.rules.sentence_segmentation.const module

Constants for detecting sentence endings.

cnt.rulebase.rules.sentence_segmentation.const.EM_SENTENCE_ENDINGS = ['?!"', '?!"', '?!”', '?!”', '?!″', '?!″', '……"', '……"', '……”', '……″', '。\'"', '。'"', "。'”", '。'”', "。'″", '。'″', '。’"', '。’"', '。’”', '。’″', '。′"', '。′"', '。′”', '。′″', '!\'"', '!'"', '!'”', "!'”", '!'″', "!'″", '!’"', '!’"', '!’”', '!’”', '!’″', '!’″', '!′"', '!′"', '!′”', '!′”', '!′″', '!′″', '。"', '。"', '。”', '。″', '!"', '!"', '!”', '!”', '!″', '!″', '?"', '?"', '?”', '?”', '?″', '?″', ';"', ';"', ';”', ';”', ';″', ';″', '?!', '?!', '…', '。', '!', '!', '?', '?', ';', ';']

For detecting sentence endings.

cnt.rulebase.rules.sentence_segmentation.const.ITV_SENTENCE_VALID_CHARS = [(33, 47), (48, 57), (58, 64), (65, 90), (91, 96), (97, 122), (123, 126), (183, 183), (8208, 8231), (8237, 8238), (8240, 8286), (11904, 12019), (12032, 12245), (12272, 12283), (12289, 12351), (12549, 12591), (12704, 12730), (12736, 12771), (13312, 19893), (19968, 40869), (40870, 40943), (58368, 58856), (58880, 59087), (59413, 59503), (63744, 64217), (65072, 65103), (65281, 65295), (65296, 65305), (65306, 65312), (65313, 65338), (65339, 65344), (65345, 65370), (65371, 65380), (65504, 65518), (131072, 173782), (173824, 177972), (177984, 178205), (178208, 183969), (183984, 191456), (194560, 195101)]

For detecting the valid characters of a sentence. Each tuple is an inclusive range of Unicode code points.
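
The interval list above lines up with the character classes shown in the ITV_RE_PATTERN attributes further down. A minimal sketch of how such intervals could be compiled into a regular expression follows; the helper name intervals_to_pattern is hypothetical and is not part of the cnt API, and the library may build its patterns differently.

```python
import re
from typing import List, Tuple


def intervals_to_pattern(intervals: List[Tuple[int, int]]) -> "re.Pattern[str]":
    """Compile inclusive code point intervals into a '[...]+' pattern.

    Hypothetical helper for illustration only.
    """
    parts = []
    for start, end in intervals:
        # Escape both bounds so that characters such as '[' or '-' stay literal.
        lo = re.escape(chr(start))
        hi = re.escape(chr(end))
        parts.append(lo if start == end else f"{lo}-{hi}")
    return re.compile("[" + "".join(parts) + "]+")


# A small subset of ITV_SENTENCE_VALID_CHARS: digits, Latin letters, CJK ideographs.
pattern = intervals_to_pattern([(48, 57), (65, 90), (97, 122), (19968, 40869)])
runs = pattern.findall("你好, world")
```

With this subset, runs holds the two maximal runs of accepted characters, because the comma and the space fall outside the chosen intervals.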

cnt.rulebase.rules.sentence_segmentation.sentence_segmenter module

Chinese sentence segmentation.

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.CommaLabeler(input_sequence, config)[source]

Bases: cnt.rulebase.workflow.basic_workflow.BasicSequentialLabeler

Mark commas.

COMMAS = (',', '‚', ',')
label(index)[source]

Return the boolean label for self.input_sequence[index]. Derived classes must override this method.

Parameters

index (int) – The index of self.input_sequence.

Return type

bool

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.DelimitersLabeler(input_sequence, config)[source]

Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler

Mark delimiters for sentence ending extension.

ITV_RE_PATTERN: Optional[re] = re.compile('[!-/:-@[-`{-~·-·‐-‧\u202d-\u202e‰-⁞、-〿︰-﹏!-/:-@[-`{-、¢-○]+')
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.DynamicSentenceEndingLabeler(input_sequence, config)[source]

Bases: cnt.rulebase.workflow.exact_match_labeler.ExactMatchLabeler

Support dynamic sentence endings that are built at runtime.

intervals_generator()[source]
Return type

Generator[Tuple[int, int], None, None]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceEndingLabeler(input_sequence, config)[source]

Bases: cnt.rulebase.workflow.exact_match_labeler.ExactMatchLabeler

Mark sentence endings based on cnt.rulebase.rules.sentence_segmentation.const.EM_SENTENCE_ENDINGS.

AC_AUTOMATION: Any = <ahocorasick.Automaton object>
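
The AC_AUTOMATION attribute suggests that exact matching is done with an Aho-Corasick automaton. As a rough illustration of the idea only, the sketch below swaps in a plain left-to-right scan that prefers longer endings. The function name match_endings, the reduced endings list, and the longest-match policy are all assumptions, not the library's implementation.

```python
from typing import Iterator, Sequence, Tuple


def match_endings(text: str, endings: Sequence[str]) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) intervals of matched sentence endings.

    Plain scan, not Aho-Corasick; hypothetical helper for illustration.
    """
    # Try longer endings first so that '?!' wins over '?'.
    ordered = sorted(set(endings), key=len, reverse=True)
    i = 0
    while i < len(text):
        for ending in ordered:
            if text.startswith(ending, i):
                yield (i, i + len(ending))
                i += len(ending)
                break
        else:
            # No ending starts at this position; move on by one character.
            i += 1


intervals = list(match_endings("好吗?!好。", ["?", "!", "?!", "。"]))
```

Here '?!' is reported as a single ending rather than as two separate ones. An automaton finds the same matches in one pass over the text, which matters once the list of endings grows.
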
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationConfig(enable_strict_sentence_charset, enable_comma_ending, extend_ending_with_delimiters, dynamic_endings)[source]

Bases: cnt.rulebase.workflow.basic_workflow.BasicConfig

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationLabelProcessor(input_sequence, index_labels_generator, config)[source]

Bases: cnt.rulebase.workflow.basic_workflow.BasicLabelProcessor

result()[source]

Generate intervals indicating the valid sentences.

Return type

Generator[Tuple[int, int], None, None]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationOutputGenerator(input_sequence, label_processor_result, config)[source]

Bases: cnt.rulebase.rules.sentence_segmentation.sentence_segmenter._SentenceSegementationOutputGeneratorLazy

result()[source]

An output generator may return any type. Derived classes must override this method.

Return type

List[Tuple[str, Tuple[int, int]]]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationOutputGeneratorLazy(input_sequence, label_processor_result, config)[source]

Bases: cnt.rulebase.rules.sentence_segmentation.sentence_segmenter._SentenceSegementationOutputGeneratorLazy

result()[source]

An output generator may return any type. Derived classes must override this method.

Return type

Generator[Tuple[str, Tuple[int, int]], None, None]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceValidCharacterLabeler(input_sequence, config)[source]

Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler

Mark valid characters of a Chinese sentence.

ITV_RE_PATTERN: Optional[re] = re.compile('[!-/0-9:-@A-Z[-`a-z{-~·-·‐-‧\u202d-\u202e‰-⁞⺀-⻳⼀-⿕⿰-⿻、-〿ㄅ-\u312fㆠ-ㆺ㇀-㇣㐀-䶵一-龥龦-\u9fef\ue400-\ue5e8\ue600-\ue6cf\ue815-\ue86f豈-龎︰-﹏!-/0-9:-@A-Z[-`a-z{-、¢-○𠀀-𪛖𪜀-𫜴𫝀-𫠝𫠠-𬺡\U0002ceb0-\U0002ebe0丽-𪘀]+')
class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.WhitespaceLabeler(input_sequence, config)[source]

Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler

Mark Unicode whitespace.

ITV_RE_PATTERN: Optional[re] = re.compile('\\s+')
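
An interval labeler of this kind can be pictured as a regular expression whose match spans become labeled intervals. The sketch below uses the same '\s+' pattern; the function name whitespace_intervals is hypothetical, and half-open intervals are an assumption about the convention.

```python
import re
from typing import Iterator, Tuple

WHITESPACE = re.compile(r"\s+")


def whitespace_intervals(text: str) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) spans of consecutive whitespace."""
    for match in WHITESPACE.finditer(text):
        yield match.span()


# '\s' on a str pattern also covers Unicode whitespace such as U+3000.
spans = list(whitespace_intervals("a \t b\u3000c"))
```
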
cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.sentseg(text, enable_strict_sentence_charset=False, enable_comma_ending=False, extend_ending_with_delimiters=False, dynamic_endings=None)[source]
Return type

List[Tuple[str, Tuple[int, int]]]

cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.sentseg_lazy(text, enable_strict_sentence_charset=False, enable_comma_ending=False, extend_ending_with_delimiters=False, dynamic_endings=None)[source]
Return type

Generator[Tuple[str, Tuple[int, int]], None, None]
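
The two functions above differ only in their return type: sentseg returns a list, while sentseg_lazy returns a generator over the same (sentence, (start, end)) tuples. The toy sketch below mimics that pairing with a much simpler rule set so that the output shape is easy to see. The names toy_sentseg and toy_sentseg_lazy, the reduced set of endings, and the half-open offsets are assumptions, and the real functions apply many more rules.

```python
import re
from typing import Iterator, List, Tuple

Sentence = Tuple[str, Tuple[int, int]]

# Hypothetical, reduced rule: a run of non-ending characters followed by
# any number of ending characters.
_SENTENCE = re.compile(r"[^。!?]+[。!?]*")


def toy_sentseg_lazy(text: str) -> Iterator[Sentence]:
    """Yield (sentence, (start, end)) pairs one at a time."""
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Shift the start past any leading whitespace that was stripped.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        yield stripped, (start, start + len(stripped))


def toy_sentseg(text: str) -> List[Sentence]:
    """Eager variant: collect every sentence into a list."""
    return list(toy_sentseg_lazy(text))


sentences = toy_sentseg("你好。 今天天气好吗?")
```

The lazy form suits long inputs where sentences are consumed one by one; the eager form is more convenient when the whole result is needed at once.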

Module contents

Chinese sentence segmentation.