Decoding

The decoding process of our SMT system is deployed in three sequential steps: word reordering, model filtering, and search. The main reason for splitting the process into three steps is to save time during optimization work, since the initial reordering and filtering steps do not need to be repeated in the tuning loop.
In the first step, word reorderings are computed, typically based on POS tags (any other word factor can also be used). The algorithm requires three input files: the test set built from raw words (-wrd); the same test set expressed in the factored form used by the collected reordering rules (-tag); and the file containing the reordering rules collected from the training corpus (-rrules).

Rules can be filtered according to their size (-maxr) and their likelihood, or cost (-maxc).
   ============================================
     binrules: BIlingual N-gram rewrite RULES  
     by jmcrego@limsi.fr, January 2011          
   ============================================
usage: binrules -wrd s -tag s -rrules s [-maxr i] [-maxc f] [-verbose] [-help]
  Input files:
       -wrd       s   : file to translate (word forms)
       -tag       s   : file to translate (POS tags)
       -rrules    s   : reordering rules
  Rules/Units filtering:
       -maxc      f   : max cost reordering rules       [default 0]
       -maxr      i   : max size reordering rules       [default 9]
  Other:
       -verbose       : verbose output                  [default false]
       -silent        : silent output                   [default false]
       -help          : this help
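
For instance, the first step might be run as follows (a minimal sketch: file names are placeholders, and we assume the reordering lattice is written to standard output, since the filtering step reads a lattice file on standard input):

   binrules -wrd test.wrd -tag test.pos -rrules training.rrules > test.lattice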

The second step, filtering, prepares the data that will be used by the decoder on a sentence-by-sentence basis. Its main task is to collect the set of tuples that can be used for each test sentence: it traverses the word lattice containing reordering hypotheses and outputs the tuples that can be used to translate each source segment (sequence of source words in the word lattice).

As can be seen in the usage below, apart from filtering the set of tuples (-tunits), it also outputs the corresponding uncontextualized scores (-scores), lexicalized reordering scores (-lexrm), and bilingual, source, and target factors (-bilfactor, -srcfactor, -trgfactor). In this step we can also filter units according to their size (-maxs). The following is an example tuple entry in the filtered file:
   de@8 ||| from ||| 0.4993 1.6338 3.8078 0.9002 ||| 0.2177 4.6121 1.6834 0.2054 1.8465 3.5772 ||| 412513 267 ||| from PRP
The first two fields consist of the source and target tuple words. Note that source words are also tagged with their position in the test sentence (mainly used by the decoder to apply the distortion penalty and to compute the orientation type of the lexicalized reordering model). The next four scores correspond to the uncontextualized tuple translation models. The next six scores correspond to the lexicalized reordering models. Next we find the tuple IDs used by the first bilingual n-gram language model (with tuples built from words) and the second bilingual n-gram language model (with tuples built from POS tags). The last two fields correspond to two different factored forms of the tuple target side (using words 'from' and POS tags 'PRP').
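Field by field, the entry above thus breaks down as:

   de@8                                        source word 'de' at sentence position 8
   from                                        target words
   0.4993 1.6338 3.8078 0.9002                 four uncontextualized translation scores
   0.2177 4.6121 1.6834 0.2054 1.8465 3.5772   six lexicalized reordering scores
   412513 267                                  tuple IDs for the two bilingual n-gram LMs
   from PRP                                    target-side factors (words and POS tags)
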
   ================================================
     binfiltr: BIlingual N-gram smt models FILTeR  
     by jmcrego@limsi.fr, January 2011             
   ================================================
usage: binfiltr -tunits s [-maxs i] [-verbose] [-help] < flattice
       flattice   s   : file with input lattices
       -tunits    s   : translation units
       -scores    s   : unit scores
       -lexrm     s   : lexicalized RM scores
       -bilfactor s   : bilingual factored units
       -srcfactor s   : src factored units
       -trgfactor s   : trg factored units
       -maxs      i   : max size translation units [default 5]
       -verbose       : verbose output             [default false]
       -help          : this help
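
For instance, taking the lattice produced by the reordering step on standard input (all file names below are placeholders), the filtering step might be run as:

   binfiltr -tunits test.tunits -scores test.scores -lexrm test.lexrm \
            -bilfactor test.bilfactor -srcfactor test.srcfactor \
            -trgfactor test.trgfactor -maxs 5 < test.lattice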

Finally, the third step, search, is in charge of producing translation hypotheses. It receives as input the filtered file produced in the previous step (-i). You can indicate the first and last test sentences to be translated within the input file (-f i, -l i), which allows the decoding effort for a test set to be distributed across several machines, as sketched next.
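For instance, a 2,000-sentence test set could be split across two machines as follows (a minimal sketch, assuming sentences are numbered from 1, and with the model files and weights given in a placeholder config file ncode.cfg):

   bincoder -c ncode.cfg -i test.filtered -f 1 -l 1000 -o test.out.1
   bincoder -c ncode.cfg -i test.filtered -f 1001 -l 2000 -o test.out.2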

The standard search strategy is 2^J, which considers as many stacks as there are nodes in the input word graph (partial hypotheses are placed in the same stack if they translate exactly the same input words). Alternatively, you can use the J strategy, which places in the same stack all partial hypotheses that translate the same number of words. However, remember that our decoder does not implement the future cost estimation typically needed to avoid the search bias introduced by such a strategy.

You can set different search parameters that may help speed up the search: the maximum number of hypotheses to be expanded in each stack (-b), and the maximum number of translation choices to be considered for each tuple source side (-t).

Global language model files are set using -{b,t,s}lm(i) j[,k],s, where b stands for bilingual, t for target and s for source. The index i identifies the i-th language model the file refers to; j identifies the factor over which the language model is estimated; s is the path of the model file; and the optional k is an additional cost added to the LM score when the <unk> unigram is requested. The equivalent parameter structure is used for the sentence-based n-gram language models.
Consider again the example entry of the filtered file introduced above. We can use the parameter settings sketched below, where two bilingual n-gram language models (0:bilngram_built_with_word_factors.mmap and 1:bilngram_built_with_pos_factors) and two target n-gram language models (0:targetngram_built_with_word_factors.mmap and 1:targetngram_built_with_pos_factors) are considered. The first bilingual n-gram model is estimated over the first tuple factor ('412513') while the second is estimated over the second tuple factor ('267'). Similarly, in the case of the target LMs, the first is estimated over words ('from') while the second over POS tags ('PRP').
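A minimal sketch of those settings (assuming the index i is appended directly to the flag name, e.g. -blm0, and that factors are numbered from 0 in the order they appear in the filtered file):

   -blm0 0,bilngram_built_with_word_factors.mmap
   -blm1 1,bilngram_built_with_pos_factors
   -tlm0 0,targetngram_built_with_word_factors.mmap
   -tlm1 1,targetngram_built_with_pos_factors
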
Model weights are set using the -l* parameters. You can also decide how to handle unknown source words: either pass them through as they appear (the default) or drop them (-dropoov).

Apart from the raw output file, you can also output the search graph (-ograph), the N-best translation hypotheses (-nbest i), and the set of units used in the one-best translation output (-units).

Finally, you can also set the number of threads (-threads i) to use when translating the test set. Roughly speaking, this allows up to i sentences to be translated in parallel.
   ==========================================
     bincoder: BIlingual N-gram smt deCODER  
     jmcrego[at]limsi[dot]fr (May 2011)      
   ==========================================
usage : bincoder [I/O] [Search settings] [Model files] [Model weights] [Other]
  I/O:
    -i                   s : input file
    -f                   i : first sentence to translate in input file (0:first of ifile)    [default 0]
    -l                   i : last sentence to translate in input file  (0:last of ifile)     [default 0]
    -o                   s : output file
    -c                   s : config file
  Search settings:
    -s                   s : search strategy                                                 [default 2J]
                             'J' (J stacks) hyps covering the same number of source words
                             '2J' (up to 2^J stacks) hyps covering the same source words
    -t                   i : consider i-best tuple translation choices (0:all choices)       [default 25]
    -b                   i : expand at most i-best states of each stack (0:expand all)       [default 50]
  Model files:
    -{b,t,s}lm(i)  j[,k],s : i-th (b:bilingual, t:target, s:source) LM (factor j)            (i>=0, j>=0)
    -s{b,t,s}lm(i) j[,k],s : i-th sent-based (b:bilingual, t:target, s:source) LM (factor j) (i>=0, j>=0)
                             increase by k the cost of p(<unk>|...)                     [default 0]
    -s{b,t,s}xm(i)     j,s : i-th sent-based (b:bilingual, t:target, s:source) XM (factor j) (i>=0, j>=0)
  Model weights:
    -l{b,t,s}(i)         f : i-th (b:bilingual, t:target, s:source) LM                       (i>=0)
    -ls{b,t,s}(i)        f : i-th sentence-based (b:bilingual, t:target, s:source) LM        (i>=0)
    -lx{b,t,s}(i)        f : i-th sentence-based (b:bilingual, t:target, s:source) XM        (i>=0)
    -la(i)               f : i-th tuple (uncontextualized) model                             (i>=0)
    -l{m,s,d,c,f,b}{c,p} f : lexicalized RM
    -lg                  f : input graph
    -ld                  f : distortion penalty
    -lp{0,1,2}           f : bonus models (0:tuple, 1:target word, 2:source word)
  Other:
    -dropoov               : drop OOV (source) words                                         [default 0]
    -ograph                : output graph in ofile.GRAPH                                     [default 0]
    -nbest               i : output i-best hyps in ofile.NBEST                               [default 300]
    -units                 : write units in ofile.UNITS                                      [default 0]
    -threads             i : use i threads                                                   [default 1]
    -verbose               : verbose output in ofile.VERBOSE                                 [default 0]
    -help                  : this help
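
Putting everything together, a complete invocation might look as follows (a sketch only: file names and weight values are placeholders, and the appended-index flag spelling is assumed as above):

   bincoder -i test.filtered -o test.out -s 2J -b 50 -t 25 \
            -blm0 0,bilngram_built_with_word_factors.mmap -lb0 0.5 \
            -blm1 1,bilngram_built_with_pos_factors -lb1 0.2 \
            -tlm0 0,targetngram_built_with_word_factors.mmap -lt0 0.4 \
            -tlm1 1,targetngram_built_with_pos_factors -lt1 0.1 \
            -ld 0.2 -nbest 300 -units -threads 4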