nltk.tokenize.StanfordSegmenter¶
- class nltk.tokenize.StanfordSegmenter[source]¶
Bases: TokenizerI
Interface to the Stanford Segmenter
If the stanford-segmenter version is older than 2016-10-31, then path_to_slf4j should be provided, for example:
seg = StanfordSegmenter(path_to_slf4j='/YOUR_PATH/slf4j-api.jar')
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config('zh')
>>> sent = u'这是斯坦福中文分词器测试'
>>> print(seg.segment(sent))
这 是 斯坦福 中文 分词器 测试
>>> seg.default_config('ar')
>>> sent = u'هذا هو تصنيف ستانفورد العربي للكلمات'
>>> print(seg.segment(sent.split()))
هذا هو تصنيف ستانفورد العربي ل الكلمات
- __init__(path_to_jar=None, path_to_slf4j=None, java_class=None, path_to_model=None, path_to_dict=None, path_to_sihan_corpora_dict=None, sihan_post_processing='false', keep_whitespaces='false', encoding='UTF-8', options=None, verbose=False, java_options='-mx2g')[source]¶
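A minimal construction sketch with explicit paths, as an alternative to default_config(). The directory and file names below are placeholders for a local install, and the java_class value assumes the Stanford CRF classifier class used for Chinese segmentation:
seg = StanfordSegmenter(
    path_to_jar='/YOUR_PATH/stanford-segmenter.jar',        # segmenter jar (placeholder path)
    path_to_model='/YOUR_PATH/data/pku.gz',                 # Chinese model file (placeholder name)
    path_to_dict='/YOUR_PATH/data/dict-chris6.ser.gz',      # dictionary file (placeholder name)
    path_to_sihan_corpora_dict='/YOUR_PATH/data',           # Sihan corpora dictionary directory
    java_class='edu.stanford.nlp.ie.crf.CRFClassifier',     # assumed CRF class for Chinese
)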
- default_config(lang)[source]¶
Attempt to initialize the Stanford Word Segmenter for the specified language, using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables.
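A sketch of the environment-variable setup that default_config() relies on; the exact directories the variables should point to are an assumption here, and the paths are placeholders:
import os
os.environ['STANFORD_SEGMENTER'] = '/YOUR_PATH/stanford-segmenter'    # assumed: directory containing the segmenter jar
os.environ['STANFORD_MODELS'] = '/YOUR_PATH/stanford-segmenter/data'  # assumed: directory containing the model files
seg = StanfordSegmenter()
seg.default_config('zh')  # 'zh' for Chinese, 'ar' for Arabic, as in the example above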
- span_tokenize(s: str) → Iterator[Tuple[int, int]]¶
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Parameters
s (str) –
- Return type
Iterator[Tuple[int, int]]
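StanfordSegmenter inherits this method from TokenizerI. To illustrate the offset semantics, here is a sketch using WhitespaceTokenizer, another TokenizerI subclass that implements span_tokenize (the example sentence is arbitrary):
>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = 'Good muffins cost $3.88'
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23)]
>>> s[5:12]
'muffins'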
- span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]]¶
Apply self.span_tokenize() to each element of strings. I.e.:
return [self.span_tokenize(s) for s in strings]
- Yield
List[Tuple[int, int]]
- Parameters
strings (List[str]) –
- Return type
Iterator[List[Tuple[int, int]]]
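Continuing the WhitespaceTokenizer sketch from above, each input string yields its own list of spans:
>>> from nltk.tokenize import WhitespaceTokenizer
>>> sents = ['Good muffins', 'cost $3.88']
>>> list(WhitespaceTokenizer().span_tokenize_sents(sents))
[[(0, 4), (5, 12)], [(0, 4), (5, 10)]]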