aivivn-tone

Submission for the AIviVN Vietnamese diacritics restoration contest: https://www.aivivn.com/contests/3.

A more detailed summary of the approach can be found here (Vietnamese).

Requirements

Python > 3.6, PyTorch 1.0.1, torchtext, unidecode, dill, visdom, tqdm, kenlm.

visdom is mainly used for visualizing training loss and accuracy.

kenlm can be found here. I had some trouble with the version on the master branch, so the stable release may work better.

Overview

Character-level BiLSTM seq2seq model

The embedding layer and encoder are standard. The model consists of three decoders, each with its own softmax prediction layer:

  • a left-to-right decoder
  • a right-to-left decoder
  • a combined decoder, constructed by concatenating the output LSTM states of the two directional decoders

The final loss is the sum of the three component losses: L = L_ltr + L_rtl + L_combined
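
As a rough sketch (names and shapes below are illustrative, not the repository's actual interfaces), the combined decoder consumes the concatenated hidden states of the two directional decoders, and training simply sums the three cross-entropy terms:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def combined_decoder_input(h_ltr, h_rtl):
    # Align the right-to-left states with the left-to-right ones by reversing
    # the time dimension, then concatenate along the feature dimension.
    return torch.cat([h_ltr, h_rtl.flip(dims=[1])], dim=-1)

def total_loss(logits_ltr, logits_rtl, logits_comb, targets):
    # L = L_ltr + L_rtl + L_combined, all computed against the same targets.
    flat_targets = targets.reshape(-1)
    return sum(
        criterion(logits.reshape(-1, logits.size(-1)), flat_targets)
        for logits in (logits_ltr, logits_rtl, logits_comb)
    )
```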

Since only a certain set of characters requires diacritics restoration (a, d, e, i, o, u, y), we can apply teacher forcing at both training time and test time.
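
A minimal illustration of that idea (the RESTORABLE set and function name are mine, not the repository's): whenever the source character lies outside the restorable set, its output is already known, so it can be fed to the decoder as if it were the gold character even at test time.

```python
RESTORABLE = set("adeiouy")  # characters that may gain diacritics

def next_decoder_input(src_char, gold_char=None, predicted_char=None):
    if src_char.lower() not in RESTORABLE:
        return src_char            # output is known to equal the input
    if gold_char is not None:
        return gold_char           # training: ordinary teacher forcing
    return predicted_char          # test: fall back to the model's prediction
```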

In addition, since each character only has a fixed set of targets (e.g., for i it's i, í, ì, ỉ, ĩ, ị), masked softmax can also be applied.
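
A sketch of masked softmax under those fixed target sets (the TARGETS table is truncated and the names are illustrative): logits for characters outside the source character's set are pushed to -inf before the softmax.

```python
import torch

# Truncated, illustrative target sets; the real mapping covers all restorable characters.
TARGETS = {
    "i": list("iíìỉĩị"),
}

def masked_softmax(logits, src_char, char2idx):
    """Softmax restricted to the diacritic variants of src_char."""
    mask = torch.full_like(logits, float("-inf"))
    allowed = [char2idx[c] for c in TARGETS.get(src_char, [src_char])]
    mask[allowed] = 0.0
    return torch.softmax(logits + mask, dim=-1)
```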

We run a standard beam search in both directions, left-to-right and right-to-left, and combine the results. If the two searches disagree at any position, the procedure is repeated until no disagreements remain; to guard against infinite recursion, we fall back on exhaustive search after a fixed number of recursive calls.
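
The reconciliation step might look roughly like the following (the callables and the way agreed characters are merged back are assumptions on my part; the repository's own logic lives in predict.py):

```python
def restore(sentence, search_ltr, search_rtl, exhaustive, depth=0, max_depth=5):
    """Combine left-to-right and right-to-left beam-search results."""
    left, right = search_ltr(sentence), search_rtl(sentence)
    if left == right:
        return left
    if depth >= max_depth:
        # Guard against infinite recursion: fall back to exhaustive search.
        return exhaustive(sentence)
    # Keep positions the two passes agree on and re-run on the merged string.
    merged = "".join(l if l == r else s for l, r, s in zip(left, right, sentence))
    return restore(merged, search_ltr, search_rtl, exhaustive, depth + 1, max_depth)
```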

A 4-gram word-level language model is used to score candidates during beam search.
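
With the kenlm Python bindings, scoring a candidate during beam search can be as simple as the following (the model path is a placeholder; the actual model file is linked further down):

```python
import kenlm

lm = kenlm.Model("lm/4gram.bin")  # placeholder path to the trained 4-gram model

def lm_score(candidate_sentence):
    # Total log10 probability of the whitespace-tokenised candidate,
    # including begin- and end-of-sentence transitions.
    return lm.score(candidate_sentence, bos=True, eos=True)
```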

The beam search component is separate from the seq2seq model (the two are not trained jointly), so it can be used with any other model.

Replicating submission results

I filtered out sentences longer than 300 characters and divided the training data into smaller splits so that they would fit in my computer's limited RAM. The data I used can be found here.
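
The filtering and splitting step is straightforward; a sketch, assuming one sentence per line and illustrative file names and split size:

```python
MAX_LEN = 300         # drop sentences longer than this
SPLIT_SIZE = 500_000  # sentences per split, chosen to fit in RAM

with open("train.txt", encoding="utf-8") as f:
    sentences = [line.rstrip("\n") for line in f if len(line.rstrip("\n")) <= MAX_LEN]

for i in range(0, len(sentences), SPLIT_SIZE):
    with open(f"train.{i // SPLIT_SIZE:02d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sentences[i:i + SPLIT_SIZE]) + "\n")
```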

I trained the seq2seq model until the accuracy on the validation set stopped increasing. The final model can be found here.

The n-gram language model can be found here.

The main functions in train.py and predict.py show how to train the model from scratch and run predictions on the test data. I set the beam size to a very large number, so running predictions may take a long time.

Credits

Some of the code is taken from IBM's PyTorch seq2seq implementation.

The code that produces the cleaned test data was written by Khoi Nguyen.

The data used to train the n-gram language model are taken from this repo by @binhvq.

The Vietnamese dictionary used during beam search is taken from this repo by @undertheseanlp.

Finally, I'd like to thank the AIviVN admins for organizing the contest, providing the data, and preparing a script to convert the predicted text file into a submission file.