jptranstokenizer.tokenization_utils
- class jptranstokenizer.tokenization_utils.JapaneseTransformerTokenizer(vocab_file: str | PathLike | None = None, word_tokenizer_type: str = 'basic', subword_tokenizer_type: str = 'wordpiece', normalize_text: bool = True, ignore_max_byte_error: bool = False, do_lower_case: bool = False, do_word_tokenize: bool = True, do_subword_tokenize: bool = True, do_subword_by_word: bool = True, unk_token: str | AddedToken | None = '[UNK]', sep_token: str | AddedToken | None = '[SEP]', pad_token: str | AddedToken | None = '[PAD]', cls_token: str | AddedToken | None = '[CLS]', mask_token: str | AddedToken | None = '[MASK]', call_from_pretrained: bool = False, mecab_dic: str | None = 'ipadic', mecab_option: str | None = None, sudachi_split_mode: str | None = 'A', sudachi_config_path: str | None = None, sudachi_resource_dir: str | None = None, sudachi_dict_type: str | None = 'core', sp_model_kwargs: Dict[str, Any] | None = None, **kwargs)[source]
Bases: BertJapaneseTokenizer
Japanese tokenizer that combines a (main) word tokenizer with a subword tokenizer. Inherits from transformers.BertJapaneseTokenizer. The class can also be imported from the package top level:
>> from jptranstokenizer import JapaneseTransformerTokenizer
- Parameters:
vocab_file (str or os.PathLike, optional, defaults to None) – Path to the vocabulary file.
word_tokenizer_type (str, defaults to "basic") – Type of word tokenizer. "mecab", "juman", "spacy-luw", "sudachi", "basic", or "none" (only normalizes text) can be specified.
subword_tokenizer_type (str, defaults to "wordpiece") – Type of subword tokenizer. "wordpiece", "sentencepiece", or "character" (splits text into single characters) can be specified.
normalize_text (bool, optional, defaults to True) – Whether to apply Unicode normalization to text before tokenization.
do_lower_case (bool, optional, defaults to False) – Whether to lowercase the input when tokenizing.
ignore_max_byte_error (bool, optional, defaults to False) – Whether to ignore the maximum-byte-length error (only applies to Juman and Sudachi). When enabled, the tokenizer returns an empty list for such input.
do_word_tokenize (bool, optional, defaults to True) – Whether to do (main) word tokenization.
do_subword_tokenize (bool, optional, defaults to True) – Whether to do subword tokenization.
do_subword_by_word (bool, optional, defaults to True) – Whether to apply subword tokenization word by word. If False, subword tokenization is applied once to the whole input, with the words joined by spaces.
unk_token (str or tokenizers.AddedToken, optional) – A special token representing an out-of-vocabulary token.
sep_token (str or tokenizers.AddedToken, optional) – A special token separating two different sentences in the same input (used by BERT for instance).
pad_token (str or tokenizers.AddedToken, optional) – A special token used to make arrays of tokens the same size for batching purposes. It is then ignored by attention mechanisms and loss computation.
cls_token (str or tokenizers.AddedToken, optional) – A special token representing the class of the input (used by BERT for instance).
mask_token (str or tokenizers.AddedToken, optional) – A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT).
call_from_pretrained (bool, optional, defaults to False) – Whether __init__ is called from from_pretrained. You do not need to set this manually.
mecab_dic (str, optional, defaults to "ipadic") – (For MeCab) Name of the dictionary used for MeCab initialization: "ipadic", "unidic", or "unidic_lite". To use a system-installed dictionary, set this option to None and adjust mecab_option.
mecab_option (str, optional) – (For MeCab) String passed to the MeCab constructor.
sudachi_split_mode (str, optional, defaults to "A") – (For Sudachi) The splitting mode. "A", "B", or "C" can be specified.
sudachi_config_path (str, optional) – (For Sudachi) Path to a SudachiPy config file used to initialize the Sudachi dictionary.
sudachi_resource_dir (str, optional) – (For Sudachi) Path to a resource directory containing resource files, such as "sudachi.json".
sudachi_dict_type (str, optional, defaults to "core") – (For Sudachi) Sudachi dictionary type used for tokenization. "small", "core", or "full" can be specified.
sp_model_kwargs (Dict[str, Any], optional) – (For sentencepiece) Optional arguments for sentencepiece.SentencePieceProcessor.
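The flags do_word_tokenize, do_subword_tokenize, and do_subword_by_word describe a two-stage pipeline. The sketch below shows one plausible reading of how they interact, based only on the parameter descriptions; it is not the library's implementation. The names pipeline, toy_wordpiece, and TOY_VOCAB are hypothetical, and str.split on pre-segmented text stands in for a real word tokenizer such as MeCab.

```python
from typing import Callable, List

# Hypothetical toy vocabulary; a real tokenizer would read it from vocab_file.
TOY_VOCAB = {"東京", "##大学", "に", "行く"}


def toy_wordpiece(text: str, unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match WordPiece over each whitespace-separated word."""
    output: List[str] = []
    for word in text.split():
        pieces: List[str] = []
        start = 0
        while start < len(word):
            end = len(word)
            while end > start:
                # Non-initial pieces carry the "##" continuation prefix.
                piece = ("##" if start > 0 else "") + word[start:end]
                if piece in TOY_VOCAB:
                    break
                end -= 1
            if end == start:  # nothing in the vocabulary matched
                pieces = [unk]
                break
            pieces.append(piece)
            start = end
        output.extend(pieces)
    return output


def pipeline(
    text: str,
    word_tokenize: Callable[[str], List[str]],
    subword_tokenize: Callable[[str], List[str]],
    do_word_tokenize: bool = True,
    do_subword_tokenize: bool = True,
    do_subword_by_word: bool = True,
) -> List[str]:
    """Assumed control flow: word tokenization first, then subword tokenization."""
    words = word_tokenize(text) if do_word_tokenize else [text]
    if not do_subword_tokenize:
        return words
    if do_subword_by_word:
        # One subword call per word.
        return [piece for word in words for piece in subword_tokenize(word)]
    # One subword call on the whole input, with words joined by spaces.
    return subword_tokenize(" ".join(words))


if __name__ == "__main__":
    print(pipeline("東京大学 に 行く", str.split, toy_wordpiece))
```

With a WordPiece-style subword tokenizer both settings of do_subword_by_word give the same pieces in this toy setup; the difference is whether the subword tokenizer sees each word separately or the whole space-joined input, which can matter for models such as sentencepiece.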
- convert_tokens_to_string(tokens: List[str])[source]
Converts a sequence of tokens (strings) into a single string.
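As an illustration of what such a conversion does for WordPiece tokens, the sketch below uses the BERT-style convention of joining tokens with spaces and gluing "##" continuation pieces onto the preceding token. This is an assumption about the WordPiece case only; the method's behavior for other subword tokenizer types (for example sentencepiece) may differ. The function name wordpiece_tokens_to_string is hypothetical.

```python
from typing import List


def wordpiece_tokens_to_string(tokens: List[str]) -> str:
    """Join tokens with spaces, then merge '##' continuation pieces."""
    return " ".join(tokens).replace(" ##", "").strip()


if __name__ == "__main__":
    # "##大学" is attached to "東京"; the other tokens stay space-separated.
    print(wordpiece_tokens_to_string(["東京", "##大学", "に", "行く"]))
```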
- classmethod from_pretrained(tokenizer_name_or_path: str | PathLike, **kwargs)[source]
Instantiate a transformers.BertJapaneseTokenizer (or a derived class) from a predefined tokenizer.
- Parameters:
tokenizer_name_or_path (str or os.PathLike) – Can be either:
  - A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be namespaced under a user or organization name, like cl-tohoku/bert-base-japanese.
  - A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the transformers.tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained method, e.g., ./my_model_directory/.
  - (Deprecated, not applicable to all derived classes) A path or url to a single saved vocabulary file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g., ./my_model_directory/vocab.txt.
word_tokenizer_type (str, defaults to "basic") – Type of word tokenizer. "mecab", "juman", "spacy-luw", "sudachi", "basic", or "none" (only normalizes text) can be specified.
tokenizer_class (str, optional) – Must be specified when tokenizer_name_or_path is not in the supported list. "AlbertTokenizer", "T5Tokenizer", and "BertJapaneseTokenizer" (classes from the transformers library) are available.
normalize_text (bool, optional, defaults to True) – Whether to apply Unicode normalization to text before tokenization.
ignore_max_byte_error (bool, optional, defaults to False) – Whether to ignore the maximum-byte-length error (only applies to Juman and Sudachi). When enabled, the tokenizer returns an empty list for such input.
do_lower_case (bool, optional, defaults to False) – Whether to lowercase the input when tokenizing.
do_word_tokenize (bool, optional, defaults to True) – Whether to do (main) word tokenization.
do_subword_by_word (bool, optional, defaults to True) – Whether to apply subword tokenization word by word. If False, subword tokenization is applied once to the whole input, with the words joined by spaces.
mecab_dic (str, optional, defaults to "ipadic") – (For MeCab) Name of the dictionary used for MeCab initialization: "ipadic", "unidic", or "unidic_lite". To use a system-installed dictionary, set this option to None and adjust mecab_option.
mecab_option (str, optional) – (For MeCab) String passed to the MeCab constructor.
sudachi_split_mode (str, optional, defaults to "A") – (For Sudachi) The splitting mode. "A", "B", or "C" can be specified.
sudachi_config_path (str, optional) – (For Sudachi) Path to a SudachiPy config file used to initialize the Sudachi dictionary.
sudachi_resource_dir (str, optional) – (For Sudachi) Path to a resource directory containing resource files, such as "sudachi.json".
sudachi_dict_type (str, optional, defaults to "core") – (For Sudachi) Sudachi dictionary type used for tokenization. "small", "core", or "full" can be specified.
sp_model_kwargs (Dict[str, Any], optional) – (For sentencepiece) Optional arguments for sentencepiece.SentencePieceProcessor.
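A minimal usage sketch for from_pretrained, assuming jptranstokenizer, a MeCab binding with the ipadic dictionary, and network access to huggingface.co are available. The keyword names come from the parameter list above; the chosen values, the helper names tokenizer_kwargs and load_tokenizer, and the pairing of cl-tohoku/bert-base-japanese with the "mecab" word tokenizer are illustrative assumptions.

```python
from typing import Any, Dict

# Keyword arguments named in the parameter list above; values are illustrative.
DEFAULT_KWARGS: Dict[str, Any] = {
    "word_tokenizer_type": "mecab",
    "mecab_dic": "ipadic",
    "normalize_text": True,
    "do_lower_case": False,
}


def tokenizer_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Merge caller overrides into the illustrative defaults."""
    return {**DEFAULT_KWARGS, **overrides}


def load_tokenizer(name_or_path: str = "cl-tohoku/bert-base-japanese", **overrides: Any):
    """Load a tokenizer via from_pretrained (requires the package to be installed)."""
    # Imported lazily so this module can be imported without jptranstokenizer.
    from jptranstokenizer import JapaneseTransformerTokenizer

    return JapaneseTransformerTokenizer.from_pretrained(
        name_or_path, **tokenizer_kwargs(**overrides)
    )
```

Switching to Sudachi would then be a matter of passing, for example, word_tokenizer_type="sudachi" and sudachi_split_mode="C" as overrides, provided the corresponding dependencies are installed.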
- jptranstokenizer.tokenization_utils.get_word_tokenizer(word_tokenizer_type: str, normalize_text: bool = True, ignore_max_byte_error: bool = False, do_lower_case: bool = False, mecab_dic: str | None = 'ipadic', mecab_option: str | None = None, sudachi_split_mode: str | None = 'A', sudachi_config_path: str | None = None, sudachi_resource_dir: str | None = None, sudachi_dict_type: str | None = 'core')[source]
Load the main word tokenizer dynamically. The function can also be imported from the package top level:
>> from jptranstokenizer import get_word_tokenizer
- Parameters:
word_tokenizer_type (str) – Type of word tokenizer. "mecab", "juman", "spacy-luw", "sudachi", "basic", or "none" (only normalizes text) can be specified.
normalize_text (bool, optional, defaults to True) – Whether to apply Unicode normalization to text before tokenization.
do_lower_case (bool, optional, defaults to False) – Whether to lowercase the input when tokenizing.
ignore_max_byte_error (bool, optional, defaults to False) – Whether to ignore the maximum-byte-length error (only applies to Juman and Sudachi). When enabled, the tokenizer returns an empty list for such input.
mecab_dic (str, optional, defaults to "ipadic") – (For MeCab) Name of the dictionary used for MeCab initialization: "ipadic", "unidic", or "unidic_lite". To use a system-installed dictionary, set this option to None and adjust mecab_option.
mecab_option (str, optional) – (For MeCab) String passed to the MeCab constructor.
sudachi_split_mode (str, optional, defaults to "A") – (For Sudachi) The splitting mode. "A", "B", or "C" can be specified.
sudachi_config_path (str, optional) – (For Sudachi) Path to a SudachiPy config file used to initialize the Sudachi dictionary.
sudachi_resource_dir (str, optional) – (For Sudachi) Path to a resource directory containing resource files, such as "sudachi.json".
sudachi_dict_type (str, optional, defaults to "core") – (For Sudachi) Sudachi dictionary type used for tokenization. "small", "core", or "full" can be specified.
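The normalize_text and do_lower_case options describe preprocessing applied before word tokenization, and the "none" word tokenizer type performs only this step. The sketch below shows what such preprocessing could look like. The normalization form NFKC is an assumption (it is the form used by BERT-style Japanese tokenizers in transformers), and the function name preprocess is hypothetical.

```python
import unicodedata


def preprocess(text: str, normalize_text: bool = True, do_lower_case: bool = False) -> str:
    """Apply Unicode normalization (assumed NFKC) and optional lowercasing."""
    if normalize_text:
        # NFKC folds full-width Latin letters and digits to ASCII and
        # half-width katakana to full-width katakana.
        text = unicodedata.normalize("NFKC", text)
    if do_lower_case:
        text = text.lower()
    return text


if __name__ == "__main__":
    print(preprocess("ＡＢＣ１２３ ｶﾞ"))                  # ABC123 ガ
    print(preprocess("ＡＢＣ", do_lower_case=True))      # abc
```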