jptranstokenizer.mainword.base
- class jptranstokenizer.mainword.base.MainTokenizerABC(do_lower_case: bool = False, normalize_text: bool = True)
Bases: ABC
Abstract tokenizer class for main word division.
- Parameters:
do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.
normalize_text (bool, optional, defaults to True) – Whether to apply Unicode normalization to text before tokenization.
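The abstract-base pattern described above can be sketched as follows. This is a minimal, self-contained stand-in rather than the library's own code: `SketchMainTokenizer`, `WhitespaceTokenizer`, and the `preprocess` helper are hypothetical names, the abstract `tokenize` method is assumed from the `Normalizer` subclass documented on this page, and NFKC as the normalization form is an assumption (the documentation only says "unicode normalization").

```python
import unicodedata
from abc import ABC, abstractmethod
from typing import Any, List


class SketchMainTokenizer(ABC):
    """Hypothetical stand-in for MainTokenizerABC (not the library class)."""

    def __init__(self, do_lower_case: bool = False, normalize_text: bool = True) -> None:
        self.do_lower_case = do_lower_case
        self.normalize_text = normalize_text

    def preprocess(self, text: str) -> str:
        # Hypothetical helper: apply the two constructor flags in order.
        if self.normalize_text:
            # NFKC is an assumption; the docs do not name the normalization form.
            text = unicodedata.normalize("NFKC", text)
        if self.do_lower_case:
            text = text.lower()
        return text

    @abstractmethod
    def tokenize(self, text: str, **kwargs: Any) -> List[str]:
        """Split text into main words; concrete subclasses must implement this."""


class WhitespaceTokenizer(SketchMainTokenizer):
    """Hypothetical concrete subclass that splits on whitespace."""

    def tokenize(self, text: str, **kwargs: Any) -> List[str]:
        return self.preprocess(text).split()
```

Because `tokenize` is abstract, instantiating `SketchMainTokenizer` directly raises `TypeError`; only subclasses that implement it can be created.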
- class jptranstokenizer.mainword.base.Normalizer(do_lower_case: bool = False, normalize_text: bool = True)
Bases: MainTokenizerABC
A main word tokenizer that only normalizes and lowercases the text.
- Parameters:
do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.
- tokenize(text: str, **kwargs: Dict[str, Any]) → List[str]
Tokenizer that only normalizes and lowercases the text. It may be used as a dummy main tokenizer. Other kwargs (such as never_split) are ignored.
- Parameters:
text (str) – A sequence to be encoded.
- Returns:
A list containing the sentence.
- Return type:
List[str]
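The documented behaviour of `tokenize` can be approximated with the standard library alone. The function below is a hypothetical re-creation for illustration, not the library's implementation: the name `normalize_only_tokenize` is invented, NFKC is an assumed normalization form, and returning the whole sentence as a single list element is an interpretation of "a list of a sentence".

```python
import unicodedata
from typing import Any, List


def normalize_only_tokenize(
    text: str,
    do_lower_case: bool = False,
    normalize_text: bool = True,
    **kwargs: Any,
) -> List[str]:
    """Hypothetical sketch of a normalize-and-lowercase 'tokenizer'."""
    if normalize_text:
        # NFKC is an assumption; the docs only say "unicode normalization".
        text = unicodedata.normalize("NFKC", text)
    if do_lower_case:
        text = text.lower()
    # Extra kwargs such as never_split are accepted but deliberately ignored.
    return [text]


# Full-width Latin letters are folded to ASCII, then lowercased.
print(normalize_only_tokenize("ＡＢＣ テスト", do_lower_case=True, never_split=["ABC"]))
# → ['abc テスト']
```

No word splitting happens here, which is why such a tokenizer can serve as a placeholder when a later subword step does all the segmentation.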