jptranstokenizer.mainword.sudachi

class jptranstokenizer.mainword.sudachi.SudachiTokenizer(split_mode: str | None = 'A', config_path: str | None = None, resource_dir: str | None = None, dict_type: str | None = 'core', do_lower_case: bool = False, normalize_text: bool = True, ignore_max_byte_error: bool = False)

Bases: MainTokenizerABC

Tokenizer that splits text into words using Sudachi. SudachiTra is required; for installation instructions, see https://pypi.org/project/SudachiTra/. The class can also be imported from the shorter path:

>>> from jptranstokenizer.mainword import SudachiTokenizer
Parameters:
  • split_mode (str, optional, defaults to "A") – The splitting mode. One of "A", "B", or "C". For details, see Sudachi#The modes of splitting or Sudachi#分割モード (Japanese).

  • config_path (str, optional) – Path to a SudachiPy config file used to initialize the Sudachi dictionary.

  • resource_dir (str, optional) – Path to a resource directory containing resource files, such as "sudachi.json".

  • dict_type (str, optional, defaults to "core") – Sudachi dictionary type used for tokenization. One of "small", "core", or "full". For details, see Sudachi#Dictionaries or Sudachi#辞書の取得 (Japanese).

  • do_lower_case (bool, optional, defaults to False) – Whether to lowercase the input when tokenizing.

  • normalize_text (bool, optional, defaults to True) – Whether to apply Unicode normalization to the text before tokenization.

  • ignore_max_byte_error (bool, optional, defaults to False) – Whether to ignore the max-bytes error (only valid with Juman and Sudachi). If True, the tokenizer returns an empty list when the error occurs.
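
The constructor parameters above could be assembled as in the following minimal sketch. The helper names build_sudachi_kwargs and make_tokenizer are hypothetical and not part of jptranstokenizer. The up-front checks only mirror the values documented for split_mode and dict_type; whether the library performs the same validation is not stated here. The import sits inside make_tokenizer so the sketch can be loaded without SudachiTra installed.

```python
from typing import Any, Dict

# Values documented above for the two string-valued options.
VALID_SPLIT_MODES = ("A", "B", "C")
VALID_DICT_TYPES = ("small", "core", "full")


def build_sudachi_kwargs(
    split_mode: str = "A", dict_type: str = "core", **extra: Any
) -> Dict[str, Any]:
    """Hypothetical helper: check documented option values before construction."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"split_mode must be one of {VALID_SPLIT_MODES}, got {split_mode!r}")
    if dict_type not in VALID_DICT_TYPES:
        raise ValueError(f"dict_type must be one of {VALID_DICT_TYPES}, got {dict_type!r}")
    # Remaining keyword arguments (e.g. do_lower_case) are passed through unchanged.
    return {"split_mode": split_mode, "dict_type": dict_type, **extra}


def make_tokenizer(**kwargs: Any):
    """Hypothetical helper: build a SudachiTokenizer from checked keyword arguments."""
    # Lazy import: requires jptranstokenizer and SudachiTra to be installed.
    from jptranstokenizer.mainword import SudachiTokenizer

    return SudachiTokenizer(**build_sudachi_kwargs(**kwargs))
```

With the dependencies installed, make_tokenizer(split_mode="C", dict_type="full", ignore_max_byte_error=True) would request the coarsest splitting mode with the full dictionary.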

tokenize(text: str, **kwargs: Dict[str, Any]) → List[str]

Converts a string into a sequence of words. Other kwargs (such as never_split) are ignored.

Parameters:

text (str) – The text to be tokenized.

Returns:

A list of words.

Return type:

List[str]