jptranstokenizer.mainword.sudachi
- class jptranstokenizer.mainword.sudachi.SudachiTokenizer(split_mode: str | None = 'A', config_path: str | None = None, resource_dir: str | None = None, dict_type: str | None = 'core', do_lower_case: bool = False, normalize_text: bool = True, ignore_max_byte_error: bool = False)
Bases: MainTokenizerABC
Tokenizer that splits text into words using Sudachi. SudachiTra is required to use this class. For installation instructions, see https://pypi.org/project/SudachiTra/
You can import this class through a shorter path:
>>> from jptranstokenizer.mainword import SudachiTokenizer
- Parameters:
split_mode (str, optional, defaults to "A") – The splitting mode. "A", "B", or "C" can be specified. For details, see Sudachi#The modes of splitting or Sudachi#分割モード (the Japanese-language section).
config_path (str, optional) – Path to a SudachiPy config file used to initialize the Sudachi dictionary.
resource_dir (str, optional) – Path to a resource directory containing resource files, such as "sudachi.json".
dict_type (str, optional, defaults to "core") – Sudachi dictionary type used for tokenization. "small", "core", or "full" can be specified. For details, see Sudachi#Dictionaries or Sudachi#辞書の取得 (the Japanese-language section).
do_lower_case (bool, optional, defaults to False) – Whether to lowercase the input when tokenizing.
normalize_text (bool, optional, defaults to True) – Whether to apply Unicode normalization to the text before tokenization.
ignore_max_byte_error (bool, optional, defaults to False) – Whether to ignore the max-bytes error (only applies to Juman and Sudachi). If enabled, the tokenizer returns an empty list when the error occurs.
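As a rough illustration of what the do_lower_case and normalize_text flags do to the input, the following standard-library sketch may help. It is not the library's implementation: preprocess is a hypothetical helper, the NFKC normalization form is an assumption (the parameter description does not name a form), and so is the order in which the two steps are applied.

```python
import unicodedata


def preprocess(text: str, do_lower_case: bool = False, normalize_text: bool = True) -> str:
    """Hypothetical helper mirroring the two preprocessing flags.

    Not part of jptranstokenizer; it only illustrates the effect of the flags.
    """
    if normalize_text:
        # Assumption: NFKC. It folds full-width Latin letters, digits, and
        # the ideographic space into their ASCII counterparts.
        text = unicodedata.normalize("NFKC", text)
    if do_lower_case:
        text = text.lower()
    return text


# Full-width input, as often found in Japanese text.
print(preprocess("Ｓｕｄａｃｈｉ　２０２４"))                      # Sudachi 2024
print(preprocess("Ｓｕｄａｃｈｉ　２０２４", do_lower_case=True))  # sudachi 2024
print(preprocess("Ｓｕｄａｃｈｉ", normalize_text=False))           # unchanged
```

With both flags left at the defaults shown in the class signature, only the normalization step would run, so letter case is preserved.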
See also