freebsd-ports/textproc/py-sentencepiece/pkg-descr

8 lines
477 B
Plaintext

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for
Neural Network-based text generation systems where the vocabulary size is
predetermined prior to the neural model training. SentencePiece implements
subword units (e.g., byte-pair-encoding (BPE)) and unigram language model
with the extension of direct training from raw sentences. SentencePiece
allows us to make a purely end-to-end system that does not depend on
language-specific pre/postprocessing.