cnt.rulebase.rules.sentence_segmentation package

Submodules

cnt.rulebase.rules.sentence_segmentation.const module

Constants for detecting sentence endings.

cnt.rulebase.rules.sentence_segmentation.const.EM_SENTENCE_ENDINGS = ['?!"', '？！＂', '?!”', '？！”', '?!″', '？！″', '……＂', '……"', '……”', '……″', '。\'"', '。＇＂', "。'”", '。＇”', "。'″", '。＇″', '。’＂', '。’"', '。’”', '。’″', '。′"', '。′＂', '。′”', '。′″', '!\'"', '！＇＂', '！＇”', "!'”", '！＇″', "!'″", '！’＂', '!’"', '!’”', '！’”', '!’″', '！’″', '!′"', '！′＂', '！′”', '!′”', '！′″', '!′″', '。"', '。＂', '。”', '。″', '!"', '！＂', '!”', '！”', '!″', '！″', '?"', '？＂', '?”', '？”', '？″', '?″', ';"', '；＂', ';”', '；”', ';″', '；″', '?!', '？！', '…', '。', '！', '!', '?', '？', '；', ';']

For detecting sentence endings.

cnt.rulebase.rules.sentence_segmentation.const.ITV_SENTENCE_VALID_CHARS = [(33, 47), (48, 57), (58, 64), (65, 90), (91, 96), (97, 122), (123, 126), (183, 183), (8208, 8231), (8237, 8238), (8240, 8286), (11904, 12019), (12032, 12245), (12272, 12283), (12289, 12351), (12549, 12591), (12704, 12730), (12736, 12771), (13312, 19893), (19968, 40869), (40870, 40943), (58368, 58856), (58880, 59087), (59413, 59503), (63744, 64217), (65072, 65103), (65281, 65295), (65296, 65305), (65306, 65312), (65313, 65338), (65339, 65344), (65345, 65370), (65371, 65380), (65504, 65518), (131072, 173782), (173824, 177972), (177984, 178205), (178208, 183969), (183984, 191456), (194560, 195101)]

Inclusive code point intervals for detecting the valid characters of a sentence.
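The interval list above lines up with the character ranges in the ITV_RE_PATTERN attributes documented further down. The following is a minimal sketch of how such code point intervals could be turned into a regular expression; the helper name intervals_to_pattern and the four-interval subset are illustrative assumptions, not the library's own code.

```python
import re
from typing import Iterable, Tuple


def intervals_to_pattern(intervals: Iterable[Tuple[int, int]]) -> "re.Pattern[str]":
    """Build a character-class pattern from inclusive code point intervals.

    Hypothetical helper for illustration; not part of cnt.
    """
    # \UXXXXXXXX escapes avoid any quoting issues with characters such as '[' or '-'.
    parts = [f"\\U{start:08x}-\\U{end:08x}" for start, end in intervals]
    return re.compile("[" + "".join(parts) + "]+")


# A small subset of the documented intervals: digits, A-Z, a-z, CJK ideographs.
SUBSET = [(48, 57), (65, 90), (97, 122), (19968, 40869)]
pattern = intervals_to_pattern(SUBSET)

# Each maximal run of characters inside the intervals becomes one match.
runs = pattern.findall("abc 你好 123!")
```

With this subset, the space and the exclamation mark fall outside every interval, so they split the text into separate runs.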

cnt.rulebase.rules.sentence_segmentation.sentence_segmenter module

Chinese sentence segmentation.

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.CommaLabeler(input_sequence, config)

Bases: cnt.rulebase.workflow.basic_workflow.BasicSequentialLabeler

Mark commas.

COMMAS = ('，', '‚', ',')

label(index)

Return the boolean label for self.input_sequence[index]. Derived classes must override this method.

Parameters: index (int) – The index of self.input_sequence.
Return type: bool
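As a rough illustration of what a per-index boolean labeler produces, the sketch below labels every position of a string against the documented COMMAS tuple. The function label_commas is a hypothetical stand-in; the real class works through the BasicSequentialLabeler workflow, which is not reproduced here.

```python
from typing import List

# Same three characters as the documented COMMAS attribute.
COMMAS = ("，", "‚", ",")


def label_commas(input_sequence: str) -> List[bool]:
    """Return one boolean per character: True where the character is a comma.

    Hypothetical helper for illustration; not part of cnt.
    """
    return [char in COMMAS for char in input_sequence]


labels = label_commas("a，b,c")
```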

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.DelimitersLabeler(input_sequence, config)

Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler

Mark delimiters for sentence ending extension.

ITV_RE_PATTERN: Optional[re] = re.compile('[!-/:-@[-`{-~·-·‐-‧\u202d-\u202e‰-⁞、-〿︰-﹏！-／：-＠［-｀｛-､￠-￮]+')

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.DynamicSentenceEndingLabeler(input_sequence, config)

Bases: cnt.rulebase.workflow.exact_match_labeler.ExactMatchLabeler

Support dynamic sentence endings that are built at runtime.

intervals_generator()

Return type: Generator[Tuple[int, int], None, None]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceEndingLabeler(input_sequence, config)

Bases: cnt.rulebase.workflow.exact_match_labeler.ExactMatchLabeler

Mark sentence endings based on cnt.rulebase.rules.sentence_segmentation.const.EM_SENTENCE_ENDINGS.

AC_AUTOMATION: Any = <ahocorasick.Automaton object>
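The AC_AUTOMATION attribute indicates that exact matching is backed by an Aho-Corasick automaton. The sketch below swaps in a much simpler longest-first scan over a handful of endings to show the kind of intervals an exact-match labeler might yield. The function name ending_intervals and the half-open (start, end) convention are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple


def ending_intervals(text: str, endings: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) intervals of sentence endings in text.

    Naive scan for illustration only; not the Aho-Corasick approach
    that the library attribute suggests.
    """
    # Try longer endings first so that '。"' wins over the bare '。'.
    ordered = sorted(endings, key=len, reverse=True)
    pos = 0
    while pos < len(text):
        for ending in ordered:
            if text.startswith(ending, pos):
                yield (pos, pos + len(ending))
                pos += len(ending)
                break
        else:
            # No ending starts here; move on to the next character.
            pos += 1


found = list(ending_intervals('他说："好。"走吧！', ['。', '。"', '！']))
```

Preferring the longest ending keeps a closing quotation mark attached to the sentence it closes, which is presumably why the documented list contains multi-character endings alongside the bare punctuation marks.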

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationConfig(enable_strict_sentence_charset, enable_comma_ending, extend_ending_with_delimiters, dynamic_endings)

Bases: cnt.rulebase.workflow.basic_workflow.BasicConfig

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationLabelProcessor(input_sequence, index_labels_generator, config)

Bases: cnt.rulebase.workflow.basic_workflow.BasicLabelProcessor

result()

Generate intervals indicating the valid sentences.

Return type: Generator[Tuple[int, int], None, None]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationOutputGenerator(input_sequence, label_processor_result, config)

Bases: cnt.rulebase.rules.sentence_segmentation.sentence_segmenter._SentenceSegementationOutputGeneratorLazy

result()

An output generator may return any type. Derived classes must override this method.

Return type: List[Tuple[str, Tuple[int, int]]]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceSegementationOutputGeneratorLazy(input_sequence, label_processor_result, config)

Bases: cnt.rulebase.rules.sentence_segmentation.sentence_segmenter._SentenceSegementationOutputGeneratorLazy

result()

An output generator may return any type. Derived classes must override this method.

Return type: Generator[Tuple[str, Tuple[int, int]], None, None]

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.SentenceValidCharacterLabeler(input_sequence, config)

Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler

Mark the valid characters of a Chinese sentence.

ITV_RE_PATTERN: Optional[re] = re.compile('[!-/0-9:-@A-Z[-`a-z{-~·-·‐-‧\u202d-\u202e‰-⁞⺀-⻳⼀-⿕⿰-⿻、-〿ㄅ-\u312fㆠ-ㆺ㇀-㇣㐀-䶵一-龥龦-\u9fef\ue400-\ue5e8\ue600-\ue6cf\ue815-\ue86f豈-龎︰-﹏！-／０-９：-＠Ａ-Ｚ［-｀ａ-ｚ｛-､￠-￮𠀀-𪛖𪜀-𫜴𫝀-𫠝𫠠-𬺡\U0002ceb0-\U0002ebe0丽-𪘀]+')

class cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.WhitespaceLabeler(input_sequence, config)

Bases: cnt.rulebase.workflow.interval_labeler.IntervalLabeler

Mark Unicode whitespace.

ITV_RE_PATTERN: Optional[re] = re.compile('\\s+')
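To make the interval idea concrete, here is a small sketch that collects the spans matched by the same \s+ pattern using only the standard re module. The function whitespace_intervals is a hypothetical helper, and treating the spans as half-open (start, end) pairs follows re.Match.span() rather than any documented cnt convention.

```python
import re
from typing import List, Tuple

# Same pattern as the documented ITV_RE_PATTERN; for str patterns,
# \s also matches Unicode whitespace such as the ideographic space U+3000.
WHITESPACE = re.compile(r"\s+")


def whitespace_intervals(text: str) -> List[Tuple[int, int]]:
    """Return the half-open spans of every whitespace run in text.

    Hypothetical helper for illustration; not part of cnt.
    """
    return [match.span() for match in WHITESPACE.finditer(text)]


spans = whitespace_intervals("你好 \u3000世界\n")
```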

cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.sentseg(text, enable_strict_sentence_charset=False, enable_comma_ending=False, extend_ending_with_delimiters=False, dynamic_endings=None)

Return type: List[Tuple[str, Tuple[int, int]]]
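The return type pairs each sentence with an integer interval. The sketch below shows one way a caller might sanity-check such a result, assuming the interval is a half-open (start, end) character offset into the original text; that interpretation is an assumption, since this page does not spell it out. The sample list is hand-written to match the documented return type and is not actual sentseg output.

```python
from typing import Iterable, Tuple

Segment = Tuple[str, Tuple[int, int]]


def offsets_consistent(text: str, segments: Iterable[Segment]) -> bool:
    """Check that every interval slices text back to its paired sentence.

    Hypothetical helper; assumes half-open [start, end) offsets.
    """
    return all(text[start:end] == sentence for sentence, (start, end) in segments)


text = "你好。再见！"
# Hand-written stand-in shaped like List[Tuple[str, Tuple[int, int]]].
sample = [("你好。", (0, 3)), ("再见！", (3, 6))]
ok = offsets_consistent(text, sample)
```

If the check fails on real output, the interval convention probably differs from the one assumed here.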

cnt.rulebase.rules.sentence_segmentation.sentence_segmenter.sentseg_lazy(text, enable_strict_sentence_charset=False, enable_comma_ending=False, extend_ending_with_delimiters=False, dynamic_endings=None)

Return type: Generator[Tuple[str, Tuple[int, int]], None, None]
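Because the lazy variant returns a generator, a caller can stop early without segmenting the rest of the text. The following sketch illustrates that with itertools.islice. The generator fake_lazy_segments is a hypothetical stand-in with hand-written values, used here only so the example runs without the library installed.

```python
from itertools import islice
from typing import Iterator, List, Tuple

Segment = Tuple[str, Tuple[int, int]]


def take_first(segments: Iterator[Segment], n: int) -> List[Segment]:
    """Pull at most n items from a lazy segment stream."""
    return list(islice(segments, n))


def fake_lazy_segments(log: List[str]) -> Iterator[Segment]:
    """Hypothetical stand-in for a lazy segmenter; records what it produces."""
    data = [("你好。", (0, 3)), ("再见！", (3, 6)), ("谢谢。", (6, 9))]
    for sentence, interval in data:
        log.append(sentence)
        yield sentence, interval


produced: List[str] = []
first_two = take_first(fake_lazy_segments(produced), 2)
```

The third sentence is never produced in this run, which is the practical difference from a function that returns a fully built list.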

Module contents

Chinese sentence segmentation.