A French-English sample


From here you can download a 'small' (23 MB) data set (the News Commentary corpus, French-English) that we use to detail the steps needed to train, optimize and test an Ncode French-to-English SMT system.

Unpack the tarball newsco.fr-en.tgz somewhere in your file system (in my case somewhere=/home/jmcrego) using:
$> tar xvzf newsco.fr-en.tgz

The following file structure is created:
newsco.fr-en
|-- BINCODER_fr2en.bash
|-- data
|   |-- newsco.en
|   |-- newsco.en.pos
|   |-- newsco.en2fr.ali
|   |-- newsco.fr
|   |-- newsco.fr.pos
|   |-- newsco.fr2en.ali
|   |-- test2009.en
|   |-- test2009.en.pos
|   |-- test2009.fr
|   |-- test2009.fr.pos
|   |-- test2010.en
|   |-- test2010.en.pos
|   |-- test2010.fr
|   `-- test2010.fr.pos
The data subdirectory contains the training, tuning and test bitexts, as well as the corresponding training alignment files (obtained with GIZA++). Part-of-speech tags for each file were computed with the TreeTagger toolkit.
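
As a quick sanity check (optional, not part of the original recipe), you can verify that the two sides of the training bitext are parallel; if the alignment file holds one line per sentence pair, all three counts should match:
$> cd newsco.fr-en
$> wc -l data/newsco.fr data/newsco.en data/newsco.fr2en.ali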

The code shown in this sample is available in the BINCODER_fr2en.bash script.


Training

Using the downloaded data, we can build the entire set of models used by Ncode. In what follows, we assume that the software toolkits introduced in the Download and Install section are available, together with the corresponding environment variables.
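
For example, with placeholder paths (an illustration only; adapt them to your installation, such that $SRILM contains ngram-count and $KENLM contains build_binary):
$> export SRILM=/path/to/srilm/bin
$> export KENLM=/path/to/kenlm/bin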

The system is built using the training.perl script.
$> here=`pwd`
$> training.perl --first-step 0 --last-step 6 \
    --train-src $here/data/newsco.fr \
    --train-trg $here/data/newsco.en \
    --train-src-rules $here/data/newsco.fr.pos \
    --train-src-bm $here/data/newsco.fr \
    --train-trg-bm $here/data/newsco.en \
    --name-src-bm w \
    --name-trg-bm w \
    --alignment-file $here/data/newsco.fr2en.ali \
    --output-dir $here/fr2en

$> training.perl --first-step 4 --last-step 4 \
    --train-src-bm $here/data/newsco.fr.pos \
    --train-trg-bm $here/data/newsco.en.pos \
    --name-src-bm p \
    --name-trg-bm p \
    --output-dir $here/fr2en

Different models are built from the training bitexts and word alignments; they are created in $here/fr2en. The second training.perl call re-runs step 4 over the part-of-speech files to add a bilingual model built on POS factors (the p-p files used below).
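
If all went well, a listing of the model directory should show, among others, the reordering rules (posrules.*) and the tuple vocabulary and score files (unfold.*) used in the following steps:
$> ls $here/fr2en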

In order to help the SMT decoder find the right translation hypothesis, you probably want to use a target n-gram language model. For this sample, use the target side of the training bitext to build a 'weak' target LM:
$> $SRILM/ngram-count -text $here/data/newsco.en \
    -lm $here/fr2en/newsco.en.3guki.lm \
    -order 3 -unk -kndiscount -interpolate

$> $KENLM/build_binary -trie $here/fr2en/newsco.en.3guki.lm  $here/fr2en/newsco.en.3guki.lm.mmap
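
Optionally, as a sanity check we add here (not a required step), you can measure the perplexity of the development set under the new LM using SRILM's ngram tool:
$> $SRILM/ngram -order 3 -unk \
    -lm $here/fr2en/newsco.en.3guki.lm \
    -ppl $here/data/test2009.en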


Optimization

Before optimization, information about the development sentences (used to tune our system) is collected and passed to the SMT decoder. This process applies the reordering rules learned during training to hypothesize the reorderings needed to translate the new sentences. It also filters the tables built during training, keeping only the entries useful for translating the given sentences.

Hence, each file to be translated is preprocessed using binrules and binfiltr:
$> binrules -maxc 4 -maxr 9 \
    -wrd $here/data/test2009.fr \
    -tag $here/data/test2009.fr.pos \
    -rrules $here/fr2en/posrules.max10.smooth0.. \
    > $here/data/test2009.fr.rrules.maxc4.maxr9

$> cat $here/data/test2009.fr.rrules.maxc4.maxr9 | binfiltr -maxs 5 \
    -tunits $here/fr2en/unfold.maxs5.maxf4.tnb30.voc \
    -scores $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.lex1.lex2.rfreq1.rfreq2 \
    -lexrm $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.msdcfb \
    -bilfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.bil.factor \
    -bilfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.p-p.bil.factor \
    -trgfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.trg.factor \
    > $here/data/test2009.fr.rrules.maxc4.maxr9.filtered.maxs5.bww.bpp.tww
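
The very same two steps are applied below to the test set; if you prefer, they can be wrapped in a small shell function (a sketch using this sample's file names; call it as preprocess test2009 or preprocess test2010):
preprocess () {
    # $1 is the basename of the set to translate, e.g. test2009
    binrules -maxc 4 -maxr 9 \
        -wrd $here/data/$1.fr \
        -tag $here/data/$1.fr.pos \
        -rrules $here/fr2en/posrules.max10.smooth0.. \
        > $here/data/$1.fr.rrules.maxc4.maxr9
    cat $here/data/$1.fr.rrules.maxc4.maxr9 | binfiltr -maxs 5 \
        -tunits $here/fr2en/unfold.maxs5.maxf4.tnb30.voc \
        -scores $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.lex1.lex2.rfreq1.rfreq2 \
        -lexrm $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.msdcfb \
        -bilfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.bil.factor \
        -bilfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.p-p.bil.factor \
        -trgfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.trg.factor \
        > $here/data/$1.fr.rrules.maxc4.maxr9.filtered.maxs5.bww.bpp.tww
}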
Since Ncode implements a linear combination of multiple feature functions, it needs an optimization process in order to assign each feature its corresponding (optimal) weight.

This optimization is carried out with the mert-run.perl script, which works as a wrapper around the MERT toolkit of the widely known Moses SMT system:
$> tlmww=$here/fr2en/newsco.en.3guki.lm.mmap
$> blmww=$here/fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.bil.-order_3_-unk_-gt3min_1_-kndiscount_-interpolate.lm.mmap
$> blmpp=$here/fr2en/unfold.maxs5.maxf4.tnb30.voc.p-p.bil.-order_3_-unk_-gt3min_1_-kndiscount_-interpolate.lm.mmap
$> optdir=$here/fr2en/mert.blmww.tlmww.msd
$> flags="-blm0 0,$blmww -tlm0 0,6,$tlmww -b 25 -threads 4"
$> lambdas="lb0:1 lt0:1 lp0:1 lp1:1 ld:1 la0:1 la1:1 la2:1 la3:1 lmc:1 lsc:1 ldc:1 lmp:1 lsp:1 ldp:1"

$> mert-run.perl --working-dir=$optdir \
    --input=$here/data/test2009.fr.rrules.maxc4.maxr9.filtered.maxs5.bww.bpp.tww \
    --refs=$here/data/test2009.en \
    --decoder-flags="$flags" \
    --lambdas="$lambdas" \
    --nbest=300 &> $optdir.log &
The $flags variable holds the default decoder parameters used during optimization, while $lambdas lists the weights to be optimized together with their initial values. $optdir is the directory where the optimization is run, and $optdir.log contains a log of the optimization, which can later be parsed to translate test files with the optimized weights.
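
Since the optimization runs in the background and may take a while, you can follow its progress as the log is written:
$> tail -f $optdir.log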


Test

As with the development set, test sentences are also preprocessed using binrules and binfiltr (or the helper function sketched above):
$> binrules -maxc 4 -maxr 9 \
    -wrd $here/data/test2010.fr \
    -tag $here/data/test2010.fr.pos \
    -rrules $here/fr2en/posrules.max10.smooth0.. \
    > $here/data/test2010.fr.rrules.maxc4.maxr9

$> cat $here/data/test2010.fr.rrules.maxc4.maxr9 | binfiltr -maxs 5 \
    -tunits $here/fr2en/unfold.maxs5.maxf4.tnb30.voc \
    -scores $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.lex1.lex2.rfreq1.rfreq2 \
    -lexrm $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.msdcfb \
    -bilfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.bil.factor \
    -bilfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.p-p.bil.factor \
    -trgfactor $here/fr2en/unfold.maxs5.maxf4.tnb30.voc.w-w.trg.factor \
    > $here/data/test2010.fr.rrules.maxc4.maxr9.filtered.maxs5.bww.bpp.tww
Once the optimization has finished, you can translate a test file using the weights it found, via mert-tst.perl:
$> mert-tst.perl -optimal $optdir.log \
    -i $here/data/test2010.fr.rrules.maxc4.maxr9.filtered.maxs5.bww.bpp.tww \
    -o $optdir/out.test2010 \
    -run 
Translation hypotheses are output in $optdir/out.test2010.
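
For a quick look at the first hypotheses:
$> head -n 3 $optdir/out.test2010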

Finally, we can evaluate the quality of the translated file (BLEU score) using evaluate.perl:
$> evaluate.perl -tst $optdir/out.test2010 -ref $here/data/test2010.en -b