lingvo.tools.bpe_word_tokenizer module
Generates the words_to_ids file from a BPE encoded corpus and BPE vocab file.
Extracts all the words in the corpus with their corresponding list of ids. Each subword in the vocab file is mapped to its line number, which serves as its id. The lines of the output file look like:
…
TAKE 43,7,50,14
THAT 16,35
THE 26
THEIR 16,4,9,56
…
This format is compatible with the BPE tokenizer op in core/tokenizer.py.
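The output line format shown above can be illustrated with a small sketch. The helper name format_words_to_ids is hypothetical and not part of the module. The single-space separator follows the sample lines, while the alphabetical ordering is an assumption based on that sample.

```python
from typing import Dict, List


def format_words_to_ids(word_to_ids: Dict[str, List[int]]) -> List[str]:
    """Render each word as 'WORD id,id,...' (hypothetical helper)."""
    lines = []
    # Sorting is an assumption; the sample output happens to be alphabetical.
    for word in sorted(word_to_ids):
        ids = ",".join(str(token_id) for token_id in word_to_ids[word])
        lines.append(f"{word} {ids}")
    return lines


for line in format_words_to_ids({"THE": [26], "THAT": [16, 35]}):
    print(line)
```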
Typical workflow:
subword-nmt learn-bpe train_file code_file
subword-nmt apply-bpe code_file train_file train_bpe_file
subword-nmt get-vocab train_bpe_file vocab_file
bpe_word_tokenizer train_bpe_file vocab_file words_to_ids_file
- lingvo.tools.bpe_word_tokenizer._GetVocabulary(vocab_filepath)[source]
Maps the first word in each line of the given file to its line number.
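A minimal sketch of this mapping, not the module's actual implementation. The name build_vocab is hypothetical, and it takes an iterable of lines rather than a file path so that it can run without any files. Zero-based line numbering and the "subword count" line layout are assumptions.

```python
from typing import Dict, Iterable


def build_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Map the first token on each line to that line's index (hypothetical helper)."""
    vocab: Dict[str, int] = {}
    # Zero-based numbering is an assumption; the source only says "line number".
    for line_number, line in enumerate(lines):
        fields = line.split()
        if fields:  # blank lines add no entry but still advance the line number
            vocab[fields[0]] = line_number
    return vocab


# Assumed vocab file layout: one "<subword> <count>" pair per line.
print(build_vocab(["the 120", "th@@ 45", "at 30"]))
```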
- lingvo.tools.bpe_word_tokenizer._ExtractTokenization(encoded_filepath, vocab)[source]
Maps the words in the encoded file to their list of token ids.
Reads all the subwords in the encoded file and concatenates them for as long as each one has @@ as its last two characters. The last token of a word is the subword without @@. Each full word is then mapped to the list of vocab ids of its subwords, looked up in the vocab dictionary.
- Parameters
encoded_filepath – String, filepath of the BPE encoded file.
vocab – Dictionary of subwords (string) to token ids (int).
- Returns
Dictionary of words (string) to list of token ids (list of int).
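The concatenation rule described for _ExtractTokenization can be sketched as follows. This is an illustration under stated assumptions, not the module's code: extract_tokenization is a hypothetical name, it takes lines instead of a file path, subwords are assumed to be whitespace-separated, and the @@ marker is assumed to be dropped from the assembled word (consistent with the sample output, which shows no @@).

```python
from typing import Dict, Iterable, List


def extract_tokenization(
    lines: Iterable[str], vocab: Dict[str, int]
) -> Dict[str, List[int]]:
    """Map each full word to the ids of its subwords (hypothetical helper)."""
    word_to_ids: Dict[str, List[int]] = {}
    for line in lines:
        pieces: List[str] = []
        ids: List[int] = []
        for subword in line.split():
            # Raises KeyError for a subword missing from vocab (a choice made here).
            ids.append(vocab[subword])
            if subword.endswith("@@"):
                pieces.append(subword[:-2])  # continuation piece: strip the marker
            else:
                pieces.append(subword)  # final piece closes the word
                word_to_ids["".join(pieces)] = ids
                pieces, ids = [], []
        # Pieces left over at the end of a line without a closing subword are dropped.
    return word_to_ids


vocab = {"th@@": 0, "at": 1, "the": 2}
print(extract_tokenization(["th@@ at the"], vocab))
```

A word that occurs more than once simply overwrites its earlier entry, which is harmless as long as the BPE encoding of a given word is the same everywhere in the corpus.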