jptranstokenizer.subword.sentencepiece

class jptranstokenizer.subword.sentencepiece.SentencepieceTokenizer(vocab_file: str | None = None, sp_model_kwargs: Dict[str, Any] | None = None, sp_model: Any | None = None)

Bases: object

Runs sentencepiece tokenization. This class can also be imported through a shorter path:

>>> from jptranstokenizer.subword import SentencepieceTokenizer
Parameters:
  • vocab_file (str, optional) – Path to the sentencepiece model file.

  • sp_model_kwargs (Dict[str, Any], optional) – Dictionary of keyword arguments passed to sentencepiece.SentencePieceProcessor.

  • sp_model (sentencepiece.SentencePieceProcessor, optional) – An already trained SentencePieceProcessor instance.

tokenize(text: str) → List[str]

Converts a string into a sequence of tokens.

Parameters:

text (str) – The input string (for example, a single word-level token) to be encoded.

Returns:

A list of sentencepiece tokens.

Return type:

List[str]
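
The constructor arguments and tokenize() described above can be illustrated with a minimal sketch. This is not the library's actual implementation. SketchTokenizer and FakeProcessor are hypothetical names introduced here, and two behaviors are assumptions: that sp_model takes precedence over vocab_file, and that a ValueError is raised when neither is given. FakeProcessor stands in for sentencepiece.SentencePieceProcessor so the example runs without a trained model; it only mimics the encode(text, out_type=str) call by splitting on whitespace, which real sentencepiece does not do.

```python
from typing import Any, Dict, List, Optional


class FakeProcessor:
    """Hypothetical stand-in for sentencepiece.SentencePieceProcessor."""

    def encode(self, text: str, out_type: type = str) -> List[str]:
        # Real sentencepiece applies a trained subword model. This toy
        # version only prefixes each whitespace-separated word with "▁".
        return ["▁" + word for word in text.split()]


class SketchTokenizer:
    """Hypothetical wrapper mirroring the documented constructor signature."""

    def __init__(
        self,
        vocab_file: Optional[str] = None,
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        sp_model: Optional[Any] = None,
    ) -> None:
        if sp_model is not None:
            # Assumption: a ready-made processor wins over vocab_file.
            self.sp_model = sp_model
        elif vocab_file is not None:
            # Imported lazily so the sketch runs without sentencepiece
            # when a processor object is supplied directly.
            import sentencepiece as spm

            self.sp_model = spm.SentencePieceProcessor(
                model_file=vocab_file, **(sp_model_kwargs or {})
            )
        else:
            # Assumption: the sketch rejects a call with no model source.
            raise ValueError("either vocab_file or sp_model must be provided")

    def tokenize(self, text: str) -> List[str]:
        # Ask the processor for string pieces rather than integer ids.
        return self.sp_model.encode(text, out_type=str)


tokenizer = SketchTokenizer(sp_model=FakeProcessor())
print(tokenizer.tokenize("hello world"))  # ['▁hello', '▁world']
```

With a real model, the same call pattern would pass a model file path as vocab_file instead of a FakeProcessor, and the returned pieces would depend on the trained vocabulary.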