jptranstokenizer.mainword.spacy_luw

class jptranstokenizer.mainword.spacy_luw.SpacyluwTokenizer(do_lower_case: bool = False, normalize_text: bool = True)[source]

Bases: MainTokenizerABC

Tokenizer that splits text into words using ja_gsdluw in spaCy. Both spaCy and ja_gsdluw are required. For installation, see spaCy and ja_gsdluw. You can import this class with a shorter path:

>>> from jptranstokenizer.mainword import SpacyluwTokenizer
Parameters:
  • do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.

  • normalize_text (bool, optional, defaults to True) – Whether to apply unicode normalization to text before tokenization.
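
The effect of the two flags can be illustrated with a small stand-alone sketch. It is not the library's implementation: the helper name `preprocess` is hypothetical, the order of the two steps is a guess, and NFKC is assumed as the normalization form, since the description above says only "unicode normalization".

```python
import unicodedata


def preprocess(text: str, do_lower_case: bool = False, normalize_text: bool = True) -> str:
    """Hypothetical helper mirroring the two constructor flags."""
    if normalize_text:
        # Assumption: NFKC. The form the tokenizer actually uses is not stated here.
        text = unicodedata.normalize("NFKC", text)
    if do_lower_case:
        # Lowercasing only affects cased scripts such as Latin; kana and kanji are unchanged.
        text = text.lower()
    return text
```

Under these assumptions, full-width Latin letters are folded to their ASCII forms when normalize_text is enabled and left untouched when it is disabled.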

tokenize(text: str, **kwargs: Dict[str, Any]) → List[str][source]

Converts a string into a sequence of words. Other kwargs (such as never_split) are ignored.

Parameters:

text (str) – The text to be tokenized.

Returns:

A list of words.

Return type:

List[str]
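
A minimal usage sketch based on the signatures above. It assumes jptranstokenizer, spaCy, and ja_gsdluw are installed. The wrapper name `tokenize_luw` is hypothetical and exists only to keep the example self-contained.

```python
from typing import List


def tokenize_luw(text: str, do_lower_case: bool = False) -> List[str]:
    """Hypothetical wrapper: build a SpacyluwTokenizer and tokenize one string."""
    # Imported inside the function so this module loads even when spaCy is absent.
    from jptranstokenizer.mainword import SpacyluwTokenizer

    tokenizer = SpacyluwTokenizer(do_lower_case=do_lower_case, normalize_text=True)
    # tokenize() returns a List[str]; extra kwargs such as never_split are ignored.
    return tokenizer.tokenize(text)
```

Calling `tokenize_luw` with a Japanese sentence returns the list of words produced by ja_gsdluw. A new tokenizer is constructed on every call here for brevity; in real code, one instance would normally be created once and reused.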