Training

The training process of our SMT system is performed by the script training.perl. The entire system can be built in several steps, starting from a training bitext and the corresponding word-to-word alignments. Next, we show a training example using the French-English data:
$> training.perl \
     --train-src data/newsco.fr \
     --train-trg data/newsco.en \
     --alignment-file data/newsco.fr2en.ali \
     --train-src-rules data/newsco.fr.pos \
     --train-src-bm data/newsco.fr \
     --train-trg-bm data/newsco.en \
     --name-src-bm w \
     --name-trg-bm w \
     --output-dir fr2en \
     --first-step 1 \
     --last-step 8 
The first three parameters indicate the input files needed to build our SMT system (source and target training bitext and the corresponding alignment).

In the first step, the word lexical distributions are built from the word alignments: fr2en/lex.f2n and fr2en/lex.n2f. Translation units are then extracted (step 1) into fr2en/unfoldNULL and filtered (step 2) to discard source-NULLed units, producing fr2en/unfold.

The final vocabulary of units is built after applying several pruning techniques, producing unfold.maxs5.maxf4.tnb30.voc (step 3). Uncontextualized scores are also computed for each tuple and stored in unfold.maxs5.maxf4.tnb30.voc.lex1.lex2.rfreq1.rfreq2.
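Note that, as with the step-4 examples below, these steps can be rerun in isolation by restricting --first-step and --last-step. The following invocation is only a sketch; it assumes steps 1 to 3 need no options beyond the bitext and alignment files of the full run above:
    training.perl --first-step 1 --last-step 3 \
      --train-src data/newsco.fr \
      --train-trg data/newsco.en \
      --alignment-file data/newsco.fr2en.ali \
      --output-dir fr2en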

The parameters --train-src-bm and --train-trg-bm specify the training bitext factors used to estimate a bilingual n-gram language model. Multiple bilingual n-gram models can be estimated using different word factors (step 4). In the example, the bilingual n-gram model is built from raw words (it employs the word files newsco.fr and newsco.en). The parameters --name-src-bm and --name-trg-bm are used to identify the model. The final model is placed in: fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.bil.-order_3_-unk_-gt3min_1_-kndiscount_-interpolate.lm.mmap.

Note that the additional file fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.bil.factor is also built. It is used by the decoder to identify the factor (the token used in the language model) associated with each translation unit.

For instance, if we wanted a language model built from word lemmas (on both sides), we would use:
    training.perl --first-step 4 --last-step 4 \
      --train-src-bm data/newsco.fr.lem \
      --train-trg-bm data/newsco.en.lem \
      --name-src-bm l \
      --name-trg-bm l \
      --output-dir fr2en
Or a language model built from tuples with lemmas (on the source side) and words (on the target side):
    training.perl --first-step 4 --last-step 4 \
      --train-src-bm data/newsco.fr.lem \
      --train-trg-bm data/newsco.en \
      --name-src-bm l \
      --name-trg-bm w \
      --output-dir fr2en
The corresponding files fr2en/unfold.maxs5.maxf4.tnb30.voc.l-l.bil.factor and fr2en/unfold.maxs5.maxf4.tnb30.voc.l-w.bil.factor are also built.

Rewrite rules are then extracted (step 5), in the example from POS tags (--train-src-rules data/newsco.fr.pos), into fr2en/posrules.max10.smooth0.... Lexicalized reordering models are then estimated (step 6) in fr2en/unfold.maxs5.maxf4.tnb30.voc.msdcfb.
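These two steps can likewise be run on their own. The invocation below is a sketch; it assumes steps 5 and 6 only require the rewrite-rule factor file (here the POS tags) in addition to the tuple vocabulary already present in the output directory:
    training.perl --first-step 5 --last-step 6 \
      --train-src-rules data/newsco.fr.pos \
      --output-dir fr2en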

A source-side unfolded (reordered) language model is also estimated in the final step (step 7), in the example from source POS tags (--train-src-unf data/newsco.fr.pos). Note that, as for the bilingual n-gram models, a file describing the source factors used for each tuple is also built: fr2en/unfold.maxs5.maxf4.tnb30.voc.p-p.src.factor.
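As a sketch, this step could be rerun on its own as follows, assuming --train-src-unf is the only additional option it requires (file names taken from the example above):
    training.perl --first-step 7 --last-step 7 \
      --train-src-unf data/newsco.fr.pos \
      --output-dir fr2en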