jptranstokenizer.tokenization_utils

class jptranstokenizer.tokenization_utils.JapaneseTransformerTokenizer(vocab_file: str | PathLike | None = None, word_tokenizer_type: str = 'basic', subword_tokenizer_type: str = 'wordpiece', normalize_text: bool = True, ignore_max_byte_error: bool = False, do_lower_case: bool = False, do_word_tokenize: bool = True, do_subword_tokenize: bool = True, do_subword_by_word: bool = True, unk_token: str | AddedToken | None = '[UNK]', sep_token: str | AddedToken | None = '[SEP]', pad_token: str | AddedToken | None = '[PAD]', cls_token: str | AddedToken | None = '[CLS]', mask_token: str | AddedToken | None = '[MASK]', call_from_pretrained: bool = False, mecab_dic: str | None = 'ipadic', mecab_option: str | None = None, sudachi_split_mode: str | None = 'A', sudachi_config_path: str | None = None, sudachi_resource_dir: str | None = None, sudachi_dict_type: str | None = 'core', sp_model_kwargs: Dict[str, Any] | None = None, **kwargs)[source]

Bases: BertJapaneseTokenizer

Japanese tokenizer that combines a main-word tokenizer with a subword tokenizer. Inherits from transformers.BertJapaneseTokenizer. The class can also be imported through a shorter path:

>>> from jptranstokenizer import JapaneseTransformerTokenizer

Parameters:
  • vocab_file (str or os.PathLike, optional, defaults to None) – Path to the vocabulary file.

  • word_tokenizer_type (str, defaults to "basic") – Type of word tokenizer. "mecab", "juman", "spacy-luw", "sudachi", "basic", "none" (only normalize texts) can be specified.

  • subword_tokenizer_type (str, defaults to "wordpiece") – Type of subword tokenizer. "wordpiece", "sentencepiece", "character" (split into single characters) can be specified.

  • normalize_text (bool, optional, defaults to True) – Whether to apply unicode normalization to text before tokenization.

  • do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.

  • ignore_max_byte_error (bool, optional, defaults to False) – Whether or not to ignore the max-bytes error (only applies to Juman and Sudachi). If enabled, the tokenizer returns an empty list when that error occurs.

  • do_word_tokenize (bool, optional, defaults to True) – Whether to do (main) word tokenization.

  • do_subword_tokenize (bool, optional, defaults to True) – Whether to do subword tokenization.

  • do_subword_by_word (bool, optional, defaults to True) – Whether to apply subword tokenization word by word. If False, subword tokenization is applied to the whole input (with spaces) at once.

  • unk_token (str or tokenizers.AddedToken, optional) – A special token representing an out-of-vocabulary token.

  • sep_token (str or tokenizers.AddedToken, optional) – A special token separating two different sentences in the same input (used by BERT for instance).

  • pad_token (str or tokenizers.AddedToken, optional) – A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by attention mechanisms or loss computation.

  • cls_token (str or tokenizers.AddedToken, optional) – A special token representing the class of the input (used by BERT for instance).

  • mask_token (str or tokenizers.AddedToken, optional) – A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT).

  • call_from_pretrained (bool, optional, defaults to False) – Whether __init__ is called from from_pretrained. You do not need to set this manually.

  • mecab_dic (str, optional, defaults to "ipadic") – (For MeCab) Name of the dictionary to be used for MeCab initialization. "ipadic", "unidic", or "unidic_lite" can be specified. If you are using a system-installed dictionary, set this option to None and modify mecab_option.

  • mecab_option (str, optional) – (For MeCab) String passed to MeCab constructor.

  • sudachi_split_mode (str, optional, defaults to "A") – (For Sudachi) The mode of splitting. "A", "B", or "C" can be specified.

  • sudachi_config_path (str, optional) – (For Sudachi) Path to a config file of SudachiPy to be used for the sudachi dictionary initialization.

  • sudachi_resource_dir (str, optional) – (For Sudachi) Path to a resource dir containing resource files, such as "sudachi.json".

  • sudachi_dict_type (str, optional, defaults to "core") – (For Sudachi) Sudachi dictionary type to be used for tokenization. "small", "core", or "full" can be specified.

  • sp_model_kwargs (Dict[str, Any], optional) – (For sentencepiece) Optional arguments for sentencepiece.SentencePieceProcessor.
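
The interaction of do_word_tokenize, do_subword_tokenize, and do_subword_by_word can be illustrated with a minimal sketch. It is not the library's implementation: run_pipeline, toy_word_tokenize, and toy_subword_tokenize are hypothetical names, the word tokenizer simply splits on whitespace, and the placeholder subword tokenizer wraps whatever string it receives in angle brackets so that the granularity of each call is visible.

```python
from typing import Callable, List


def toy_word_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a word tokenizer: split on whitespace."""
    return text.split()


def toy_subword_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer.

    It returns its input wrapped in angle brackets, which makes it easy
    to see which string the subword stage was called with.
    """
    return [f"<{text}>"]


def run_pipeline(
    text: str,
    word_tokenize: Callable[[str], List[str]] = toy_word_tokenize,
    subword_tokenize: Callable[[str], List[str]] = toy_subword_tokenize,
    do_word_tokenize: bool = True,
    do_subword_tokenize: bool = True,
    do_subword_by_word: bool = True,
) -> List[str]:
    # Stage 1: main-word tokenization (or keep the text as one unit).
    words = word_tokenize(text) if do_word_tokenize else [text]
    if not do_subword_tokenize:
        return words
    # Stage 2a: call the subword tokenizer once per word.
    if do_subword_by_word:
        return [piece for word in words for piece in subword_tokenize(word)]
    # Stage 2b: join the words with spaces and call it once (assumed behavior).
    return subword_tokenize(" ".join(words))


text = "東京 に 行く"
print(run_pipeline(text))
print(run_pipeline(text, do_subword_by_word=False))
print(run_pipeline(text, do_subword_tokenize=False))
```

With do_subword_by_word=True the subword stage sees each word separately; with False it sees one space-joined string, which mainly matters for subword models that treat whitespace as part of the input.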

convert_tokens_to_string(tokens: List[str])[source]

Converts a sequence of tokens (strings) into a single string.
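
As a rough illustration of what such a conversion looks like for WordPiece output, the sketch below follows the convention commonly used by BERT-style tokenizers, where a "##" prefix marks a continuation piece. Whether this class does exactly this, and how it treats sentencepiece or character tokens, is not stated here; join_wordpiece_tokens is a hypothetical helper, not part of the library.

```python
from typing import List


def join_wordpiece_tokens(tokens: List[str]) -> str:
    """Join WordPiece tokens, merging '##' continuation pieces.

    Assumption: continuation pieces start with '##', as in BERT vocabularies.
    """
    return " ".join(tokens).replace(" ##", "").strip()


print(join_wordpiece_tokens(["東京", "##都", "に", "住", "##む"]))
```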

classmethod from_pretrained(tokenizer_name_or_path: str | PathLike, **kwargs)[source]

Instantiate a transformers.BertJapaneseTokenizer (or a derived class) from a predefined tokenizer.

Parameters:
  • tokenizer_name_or_path (str or os.PathLike) –

    Can be either:

    • A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be namespaced under a user or organization name, like cl-tohoku/bert-base-japanese.

    • A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the transformers.tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained method, e.g., ./my_model_directory/.

    • (Deprecated, not applicable to all derived classes) A path or url to a single saved vocabulary file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g., ./my_model_directory/vocab.txt.

  • word_tokenizer_type (str, defaults to "basic") – Type of word tokenizer. "mecab", "juman", "spacy-luw", "sudachi", "basic", "none" (only normalize texts) can be specified.

  • tokenizer_class (str, optional) – Must be specified when tokenizer_name_or_path is not in the supported list. "AlbertTokenizer", "T5Tokenizer", and "BertJapaneseTokenizer" (classes from the transformers library) are available.

  • normalize_text (bool, optional, defaults to True) – Whether to apply unicode normalization to text before tokenization.

  • ignore_max_byte_error (bool, optional, defaults to False) – Whether or not to ignore the max-bytes error (only applies to Juman and Sudachi). If enabled, the tokenizer returns an empty list when that error occurs.

  • do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.

  • do_word_tokenize (bool, optional, defaults to True) – Whether to do (main) word tokenization.

  • do_subword_by_word (bool, optional, defaults to True) – Whether to apply subword tokenization word by word. If False, subword tokenization is applied to the whole input (with spaces) at once.

  • mecab_dic (str, optional, defaults to "ipadic") – (For MeCab) Name of the dictionary to be used for MeCab initialization. "ipadic", "unidic", or "unidic_lite" can be specified. If you are using a system-installed dictionary, set this option to None and modify mecab_option.

  • mecab_option (str, optional) – (For MeCab) String passed to MeCab constructor.

  • sudachi_split_mode (str, optional, defaults to "A") – (For Sudachi) The mode of splitting. "A", "B", or "C" can be specified.

  • sudachi_config_path (str, optional) – (For Sudachi) Path to a config file of SudachiPy to be used for the sudachi dictionary initialization.

  • sudachi_resource_dir (str, optional) – (For Sudachi) Path to a resource dir containing resource files, such as "sudachi.json".

  • sudachi_dict_type (str, optional, defaults to "core") – (For Sudachi) Sudachi dictionary type to be used for tokenization. "small", "core", or "full" can be specified.

  • sp_model_kwargs (Dict[str, Any], optional) – (For sentencepiece) Optional arguments for sentencepiece.SentencePieceProcessor.
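
A hedged usage sketch for from_pretrained follows. It only uses parameters listed above; sudachi_options and load_tokenizer are hypothetical helpers, and the import is deferred so the option check can run without the package, its optional tokenizer backends, or network access. The commented call reuses the ./my_model_directory/ example path and assumes that directory holds a BERT-style vocabulary.

```python
from typing import Any, Dict

# Values documented for the Sudachi-related parameters.
VALID_SPLIT_MODES = {"A", "B", "C"}
VALID_DICT_TYPES = {"small", "core", "full"}


def sudachi_options(split_mode: str = "A", dict_type: str = "core") -> Dict[str, Any]:
    """Build keyword arguments selecting Sudachi as the word tokenizer."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"sudachi_split_mode must be one of {sorted(VALID_SPLIT_MODES)}")
    if dict_type not in VALID_DICT_TYPES:
        raise ValueError(f"sudachi_dict_type must be one of {sorted(VALID_DICT_TYPES)}")
    return {
        "word_tokenizer_type": "sudachi",
        "sudachi_split_mode": split_mode,
        "sudachi_dict_type": dict_type,
    }


def load_tokenizer(tokenizer_name_or_path: str, **kwargs: Any):
    """Load a tokenizer; the import happens only when this is called."""
    from jptranstokenizer import JapaneseTransformerTokenizer

    return JapaneseTransformerTokenizer.from_pretrained(tokenizer_name_or_path, **kwargs)


print(sudachi_options(split_mode="C", dict_type="full"))

# Example call (requires jptranstokenizer and SudachiPy to be installed):
# tokenizer = load_tokenizer(
#     "./my_model_directory/",
#     tokenizer_class="BertJapaneseTokenizer",
#     **sudachi_options(split_mode="C"),
# )
```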

jptranstokenizer.tokenization_utils.get_word_tokenizer(word_tokenizer_type: str, normalize_text: bool = True, ignore_max_byte_error: bool = False, do_lower_case: bool = False, mecab_dic: str | None = 'ipadic', mecab_option: str | None = None, sudachi_split_mode: str | None = 'A', sudachi_config_path: str | None = None, sudachi_resource_dir: str | None = None, sudachi_dict_type: str | None = 'core')[source]

Load a main-word tokenizer dynamically. The function can also be imported through a shorter path:

>>> from jptranstokenizer import get_word_tokenizer

Parameters:
  • word_tokenizer_type (str) – Type of word tokenizer. "mecab", "juman", "spacy-luw", "sudachi", "basic", "none" (only normalize texts) can be specified.

  • normalize_text (bool, optional, defaults to True) – Whether to apply unicode normalization to text before tokenization.

  • do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.

  • ignore_max_byte_error (bool, optional, defaults to False) – Whether or not to ignore the max-bytes error (only applies to Juman and Sudachi). If enabled, the tokenizer returns an empty list when that error occurs.

  • mecab_dic (str, optional, defaults to "ipadic") – (For MeCab) Name of the dictionary to be used for MeCab initialization. "ipadic", "unidic", or "unidic_lite" can be specified. If you are using a system-installed dictionary, set this option to None and modify mecab_option.

  • mecab_option (str, optional) – (For MeCab) String passed to MeCab constructor.

  • sudachi_split_mode (str, optional, defaults to "A") – (For Sudachi) The mode of splitting. "A", "B", or "C" can be specified.

  • sudachi_config_path (str, optional) – (For Sudachi) Path to a config file of SudachiPy to be used for the sudachi dictionary initialization.

  • sudachi_resource_dir (str, optional) – (For Sudachi) Path to a resource dir containing resource files, such as "sudachi.json".

  • sudachi_dict_type (str, optional, defaults to "core") – (For Sudachi) Sudachi dictionary type to be used for tokenization. "small", "core", or "full" can be specified.
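
To make the effect of normalize_text and do_lower_case concrete, here is a small standalone sketch. It assumes the Unicode normalization form is NFKC and that lowercasing is applied after normalization; neither detail is stated above, and preprocess is a hypothetical helper rather than a function of this module.

```python
import unicodedata


def preprocess(text: str, normalize_text: bool = True, do_lower_case: bool = False) -> str:
    """Apply the two text-level options before word tokenization."""
    if normalize_text:
        # Assumption: NFKC, which folds full-width Latin letters and digits
        # and half-width katakana into their canonical forms.
        text = unicodedata.normalize("NFKC", text)
    if do_lower_case:
        text = text.lower()
    return text


print(preprocess("ＢＥＲＴ１２３"))
print(preprocess("ＢＥＲＴ", do_lower_case=True))
print(preprocess("ｶﾞｲﾄﾞ"))
```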