Quickstart

In JapaneseTransformerTokenizer, several main-word and subword tokenizers are available.

Available Tokenizers

The following types of tokenizers are available:
- MeCab (main word, using transformers.models.bert_japanese.MecabTokenizer)
  - fugashi is required (as in transformers.BertJapaneseTokenizer)
  - ipadic, unidic-lite, or unidic is also required for the dictionary
- JumanTokenizer() (main word)
  - Juman++ and pyknp are required
- SpacyluwTokenizer() (main word)
  - LUW: Long-Unit-Word
  - spaCy and the LUW model are required
- SudachiTokenizer() (main word)
  - sudachitra is required
- Normalizer() (main word, only normalizes with unicodedata)
- SentencepieceTokenizer() (subword)
  - sentencepiece is required
- WordPiece (subword, using transformers.models.bert.tokenization_bert.WordpieceTokenizer)
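
Most of these tokenizers depend on optional third-party packages. The snippet below is a minimal sketch for checking which dependencies are installed before choosing a tokenizer; the import names are assumed to match the package names listed above.

>>> import importlib.util
>>> # Report which optional main-word/subword dependencies are importable.
>>> for pkg in ("fugashi", "pyknp", "spacy", "sudachitra", "sentencepiece"):
...     print(pkg, "available" if importlib.util.find_spec(pkg) else "missing")
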
See also

- fugashi: https://github.com/polm/fugashi
- ipadic: https://pypi.org/project/ipadic/
- unidic-lite: https://pypi.org/project/unidic-lite/
- unidic: https://pypi.org/project/unidic/
- Juman++: https://github.com/ku-nlp/jumanpp
- LUW model: https://github.com/megagonlabs/UD_Japanese-GSD/releases/tag/r2.9-NE
- sudachitra: https://github.com/WorksApplications/SudachiTra
- sentencepiece: https://github.com/google/sentencepiece
Example 1

Tokenizer of the nlp-waseda/roberta-base-japanese model:

>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']
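
Since the tokenizer is loaded through the usual from_pretrained interface, the resulting tokens can presumably also be mapped to vocabulary ids with the standard transformers method; the continuation below is a sketch under that assumption.

>>> ids = tokenizer.convert_tokens_to_ids(tokens)  # map the subword strings to vocabulary ids
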
You can load the tokenizer of another model by changing tokenizer_name_or_path.

Example 2
>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained(
...     "organization-name/model-name",
...     word_tokenizer="sudachi",
...     tokenizer_class="AlbertTokenizer",
...     sudachi_split_mode="C"
... )
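
Assuming the tokenizer loaded above exposes the same interface as in Example 1 (organization-name/model-name is just a placeholder), it can then be used in the same way; the call below is a sketch under that assumption.

>>> tokens = tokenizer.tokenize("外国人参政権")  # Sudachi main-word split (mode C), then subwords via the AlbertTokenizer class
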
Example 3
You can load local files for tokenizers as follows:
>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer_1 = JapaneseTransformerTokenizer(
...     vocab_file="spm.model",
...     word_tokenizer="mecab",
...     subword_tokenizer="sentencepiece",
...     mecab_dic="unidic_lite"
... )
>>> tokenizer_2 = JapaneseTransformerTokenizer(
...     vocab_file="vocab.txt",
...     word_tokenizer="juman",
...     subword_tokenizer="wordpiece"
... )
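
Both locally loaded tokenizers can then be used like the pretrained one in Example 1; the calls below are a sketch assuming the same tokenize interface.

>>> tokens_1 = tokenizer_1.tokenize("外国人参政権")  # MeCab (unidic-lite) main words + SentencePiece subwords
>>> tokens_2 = tokenizer_2.tokenize("外国人参政権")  # Juman++ main words + WordPiece subwords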