jptranstokenizer.mainword.juman
- class jptranstokenizer.mainword.juman.JumanTokenizer(do_lower_case: bool = False, normalize_text: bool = True, ignore_max_byte_error: bool = False)
Bases: MainTokenizerABC
Tokenizer that splits text into words using Juman. Juman++ and pyknp are required. You can import this class via a shorter path:
>>> from jptranstokenizer.mainword import JumanTokenizer
- Parameters:
do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.
normalize_text (bool, optional, defaults to True) – Whether to apply Unicode normalization to the text before tokenization.
ignore_max_byte_error (bool, optional, defaults to False) – Whether or not to ignore the max-bytes error (only valid with Juman and Sudachi). If enabled, the tokenizer returns an empty list.
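The following is a minimal, standard-library-only sketch of how the three flags described above might interact. It is not the library's implementation: the names `preprocess`, `tokenize_sketch`, and `ASSUMED_MAX_BYTES` are hypothetical, the NFKC normalization form and the 4096-byte limit are assumptions, and whitespace splitting stands in for the Juman word splitter.

```python
import unicodedata
from typing import List

# Assumed limit for illustration only; the real limit is imposed by Juman.
ASSUMED_MAX_BYTES = 4096


def preprocess(text: str, do_lower_case: bool = False,
               normalize_text: bool = True) -> str:
    """Apply the optional normalization and lowercasing steps."""
    if normalize_text:
        # NFKC is an assumption; the documentation only says
        # "unicode normalization".
        text = unicodedata.normalize("NFKC", text)
    if do_lower_case:
        text = text.lower()
    return text


def tokenize_sketch(text: str, *, do_lower_case: bool = False,
                    normalize_text: bool = True,
                    ignore_max_byte_error: bool = False) -> List[str]:
    """Hypothetical stand-in that mirrors the documented flag semantics."""
    text = preprocess(text, do_lower_case, normalize_text)
    if len(text.encode("utf-8")) > ASSUMED_MAX_BYTES:
        if ignore_max_byte_error:
            # Documented behavior: return an empty list.
            return []
        raise ValueError("input exceeds the assumed byte limit")
    # Whitespace splitting is a placeholder for Juman word segmentation.
    return text.split()


if __name__ == "__main__":
    print(preprocess("ＡＢＣ", do_lower_case=True))
    print(tokenize_sketch("あ" * 2000, ignore_max_byte_error=True))
```

With the real class, the same flags are passed to the constructor shown in the signature, for example `JumanTokenizer(do_lower_case=False, normalize_text=True, ignore_max_byte_error=True)`.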
See also