Feature Functions

The log-linear model approach underlying BNSMT allows the combination of several feature functions, each scoring a different aspect of translation quality. The probability of a translation e of an input sentence f is computed as:

p(e|f) = Πi hi(e,f)^λi
where hi are the feature functions and λi the corresponding weights. Note that the decoder internally uses negative logarithms (positive costs).
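
As an illustration, the following sketch combines several feature scores in the negative-log (cost) domain, exactly as described above. The feature names, probabilities and weights are hypothetical and only serve to show how the weights λi enter the computation.

import math

# Hypothetical feature values h_i(e,f) for one hypothesis and their tuned weights lambda_i.
features = {"bilingual_lm": 1.2e-4, "target_lm": 3.5e-3, "tuple_bonus": math.exp(5)}
weights  = {"bilingual_lm": 1.0,    "target_lm": 0.8,    "tuple_bonus": 0.3}

# p(e|f) is proportional to the weighted product of the feature functions ...
score = 1.0
for name, h in features.items():
    score *= h ** weights[name]

# ... which the decoder handles as a sum of weighted negative logarithms (positive costs).
cost = sum(-weights[name] * math.log(h) for name, h in features.items())

print(score, math.exp(-cost))  # both views agree: score == exp(-cost)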

The tuning stage of the system is used to set the weights. Next, we detail the components (models) implemented in our system.


Bilingual n-gram Language Models

The 'main' translation model implemented by our SMT system is based on bilingual n-grams (tuples). This model actually constitutes a language model over a particular 'bi-language' composed of tuples. In this way, the translation model probabilities at the sentence level are approximated using n-grams of tuples, as described by the following equation:

p(f1^J, e1^I) = Πk=1..K p((f,e)k | (f,e)k-1, ..., (f,e)k-n+1)
where e refers to the target, f to the input, (f,e)k to the kth tuple of a given bilingual sentence pair, and n is the language model order.

Multiple bilingual n-gram language models can be used at decoding time. Each bilingual LM must be computed over one of the tuple factored forms available to the decoder. Typically, tuples are built from raw words, but many other factored forms can be used (POS tags, lemmas, stems, etc.).
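
A minimal sketch of the 'bi-language' idea follows: each tuple becomes a single bilingual token, and the resulting token stream is what a standard n-gram language model is estimated on. The sentence pair, the tuple segmentation and the token format are illustrative, not the decoder's internal representation.

# Tuples (source side, target side) extracted for one training sentence pair (illustrative).
tuples = [("nous", "we"), ("voulons", "want"), ("traduire", "to translate")]

# Each tuple is mapped to one token of the bi-language; an empty target side becomes NULL.
def to_bitoken(src, trg):
    return "{}|||{}".format(src.replace(" ", "_"), trg.replace(" ", "_") if trg else "NULL")

bitext_line = " ".join(to_bitoken(s, t) for s, t in tuples)
print(bitext_line)
# -> nous|||we voulons|||want traduire|||to_translate
# A standard n-gram LM estimated over such lines yields p((f,e)k | (f,e)k-1, ...).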


Tuple Bonus Model

The use of any language model probability is associated with a length comparison problem. In other words, when two hypotheses compete in the search for the most probable path, the one using fewer elements (whether words or translation units) is favored over the one using more, since its accumulated partial score is computed by multiplying a smaller number of probabilities. This problem results from the fact that the number of tokens used to translate a given sentence is not fixed and equivalent across all paths.

The tuple bonus model is used to compensate for the system's preference for hypotheses using fewer tuples (i.e. tuples with larger source sides). It is implemented according to the following equation:

p(f1^J, e1^I) = exp(K)
where K is the number of tuples used in the translation hypothesis.
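
The effect is easiest to see in the cost domain: since -log exp(K) = -K, the feature contributes a fixed bonus per tuple, counteracting the extra probabilities that longer tuple sequences must multiply in. The sketch below uses purely illustrative probabilities and an illustrative weight.

import math

def bilm_cost(tuple_probs):
    # cost contributed by the bilingual LM: sum of negative log probabilities
    return sum(-math.log(p) for p in tuple_probs)

def bonus_cost(k, weight):
    # tuple bonus feature: p = exp(K), so its weighted cost is -weight * K
    return -weight * k

hyp_a = [0.15, 0.15, 0.15]              # 3 large tuples
hyp_b = [0.30, 0.30, 0.30, 0.30, 0.30]  # 5 smaller tuples

for hyp in (hyp_a, hyp_b):
    print(len(hyp), bilm_cost(hyp), bilm_cost(hyp) + bonus_cost(len(hyp), weight=0.5))
# Without the bonus, the 3-tuple derivation is cheaper; with this (illustrative)
# weight, the per-tuple bonus makes the 5-tuple derivation the cheaper one.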


Target n-gram Language Models

This feature provides information about the target language structure and fluency, by favoring those partial-translation hypotheses which are more likely to constitute correctly structured target sentences over those which are not. The model implements a standard word n-gram model of the target language, which is computed according to the following expression:

p(f1^J, e1^I) = Πi=1..I p(ei | ei-1, ..., ei-n+1)
where e refers to the target, f to the input, ei to the ith target word of a given translation hypothesis, and n is the language model order.

Multiple target n-gram language models can be used at decoding time. Each model must be computed over one of the tuple target-side factored forms available to the decoder.
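
To make the chain of conditionals above concrete, the sketch below estimates a tiny count-based bigram model (n = 2) and scores a hypothesis target side with it. A real system would use a smoothed, higher-order model trained on large monolingual data; the sentences here are illustrative.

import math
from collections import Counter

corpus = [["<s>", "the", "forces", "brought", "mixed", "results", "</s>"],
          ["<s>", "the", "results", "were", "mixed", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])               # history counts (</s> never acts as a history)
    bigrams.update(zip(sent[:-1], sent[1:]))

def p_bigram(w, prev):
    # unsmoothed maximum-likelihood estimate of p(w | prev)
    return bigrams[(prev, w)] / unigrams[prev]

hyp = ["<s>", "the", "forces", "brought", "mixed", "results", "</s>"]
logprob = sum(math.log(p_bigram(w, prev)) for prev, w in zip(hyp[:-1], hyp[1:]))
print(math.exp(logprob))  # product over i of p(ei | ei-1)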


Target Word Bonus Model

Similar to the tuple bonus model, the target word bonus model is used to compensate for the system's preference for shorter translation hypotheses. It is implemented according to the following equation:

p(f1^J, e1^I) = exp(I)


Source n-gram Language Models

This feature aims at providing information about the reordered input language (reordered so as to allow the tuple target sides to match the target language word order). The model implements a standard word n-gram model of the input language, computed according to the following expression:

p(f1^J, e1^I) = Πj=1..J p(fj | fj-1, ..., fj-n+1)
where fj refers to the jth input word of a given translation hypothesis, and n is the language model order.

Multiple source n-gram language models can be used at decoding time. Each model must be computed over one of the tuple source-side factored forms available to the decoder.


Source Word Bonus Model

Similar to the previous bonus models, the source word bonus model is used to compensate for the system's preference for shorter input sentences. Hence, it is only applied when translating word lattices containing paths of different lengths. It is implemented according to the following equation:

p(f1^J, e1^I) = exp(J)


Tuple (unigram) Models

As in standard SMT systems, Ncoder introduces additional models to account for the statistical consistency of the pair of word sequences present in each translation unit. All models follow the same equation:

p(f1^J, e1^I) = Πk=1..K p((f,e)k)

Our current system implementation employs four different estimations of p((f,e)k), where ei and fj are respectively the ith and jth words of the tuple target and source sides. The lexical probability distribution plex is computed based on counts using the training bitext word alignments.
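
The four estimation formulas are not reproduced here, so the sketch below only illustrates two count-based quantities such models typically rely on: a relative-frequency estimate of p((f,e)k) from tuple counts, and a word-level lexicon built from the training bitext word alignments (the kind of table a plex score aggregates over the words of a tuple). Both are assumptions about the general approach, not the exact definitions used by Ncoder, and all counts are illustrative.

from collections import Counter

# Tuple occurrences extracted from the training bitext (illustrative counts).
tuple_counts = Counter({("des Talibans", "Taliban"): 14, ("des", "NULL"): 250, ("forces", "forces"): 40})
total_tuples = sum(tuple_counts.values())

def p_tuple(src, trg):
    # relative-frequency (unigram) estimate of p((f,e)k)
    return tuple_counts[(src, trg)] / total_tuples

# Word-alignment links (source word, target word) collected from the bitext (illustrative counts).
link_counts = Counter({("forces", "forces"): 38, ("Talibans", "Taliban"): 14, ("des", "the"): 120})
src_totals = Counter()
for (f_word, e_word), c in link_counts.items():
    src_totals[f_word] += c

def p_word(e_word, f_word):
    # word translation probability p(e|f) estimated from the alignment counts
    return link_counts[(f_word, e_word)] / src_totals[f_word]

print(p_tuple("forces", "forces"), p_word("Taliban", "Talibans"))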


Lexicalized Reordering

As in standard SMT systems, we consider a lexicalized reordering model that conditions reordering on the actual tuples. In our implementation, we consider four base reordering types, as well as two overlapped types. More formally, we introduce a reordering model po that predicts an orientation type orientation ∈ {m,s,f,b} given the tuple currently used in translation:

po(orientation|f,e)

In order to learn this reordering model, we count how often each extracted tuple is found with each of the four orientation types. The probability distribution po is then estimated based on these counts using the maximum likelihood principle:

po(orientation|f,e) = count(orientation,f,e) / Σo count(o,f,e)

Given the sparse statistics of the orientation types, we may want to smooth the probability distribution with some factor σ:

po(orientation|f,e) = (σ/4 + count(orientation,f,e)) / (σ + Σo count(o,f,e))

where σ = 1 / Σo count(o,f,e)
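
A small sketch of this estimation step follows: orientation counts are collected per tuple and turned into a smoothed distribution with the formula above. The observations are illustrative.

from collections import Counter, defaultdict

ORIENTATIONS = ("m", "s", "f", "b")

# (source, target, orientation) observations collected during tuple extraction (illustrative).
observations = ([("forces", "forces", "m")] * 30
                + [("forces", "forces", "b")] * 3
                + [("forces", "forces", "s")] * 1)

counts = defaultdict(Counter)
for src, trg, o in observations:
    counts[(src, trg)][o] += 1

def p_orientation(orientation, src, trg):
    c = counts[(src, trg)]
    total = sum(c.values())
    sigma = 1.0 / total                      # sigma = 1 / sum_o count(o,f,e)
    return (sigma / 4 + c[orientation]) / (sigma + total)

print({o: round(p_orientation(o, "forces", "forces"), 4) for o in ORIENTATIONS})
# Unseen orientations receive a small non-zero probability thanks to the smoothing term.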

The next example shows the reordering orientations (last column) computed over a sequence of tuples:
les          ||| NULL          ||| 0     ||| (m) 
opérations   ||| operations    ||| 1     ||| (m)
contre       ||| there against ||| 2     ||| (m)
les          ||| NULL          ||| 3     ||| (m)
des Talibans ||| Taliban       ||| 5 6   ||| (f)
et           ||| and           ||| 7     ||| (m)
d' Al-Qaeda  ||| al-Qaeda      ||| 8 9   ||| (m)
forces       ||| forces        ||| 4     ||| (b)
ont obtenu   ||| brought       ||| 10 11 ||| (f)
des          ||| NULL          ||| 12    ||| (m)
mitigés      ||| mixed         ||| 14    ||| (f)
résultats    ||| result        ||| 13    ||| (s)
.            ||| .             ||| 15    ||| (f)
A particular case concerns tuples with non-consecutive input words (see the third tuple in the example below). In such cases, we arbitrarily set the orientation of a unit to swap (s) when it has words falling within the span of the previous one (or when the previous one has words falling within the current one).
qui       ||| that     ||| 40    ||| (m)
vraiment  ||| truly    ||| 42    ||| (f)
ont été   ||| were     ||| 41 43 ||| (s)
terribles ||| dreadful ||| 44    ||| (m)
.         ||| .        ||| 45    ||| (m)
The reordering model may also be conditioned on the next tuple (not only on the previous one); the same estimation method is used.
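
The text does not give closed-form definitions of the four orientation types, so the following sketch reconstructs them from the two examples above: m when a unit starts right after the previous one ends, s when it ends right before the previous one starts, f when it jumps further forward, b when it jumps backward, together with the stated special case for units whose words interleave with the previous one. This is an inference from the examples, not necessarily the decoder's exact rule.

def orientation(prev_positions, cur_positions):
    prev_lo, prev_hi = min(prev_positions), max(prev_positions)
    cur_lo, cur_hi = min(cur_positions), max(cur_positions)
    # special case: one unit has words falling inside the span of the other -> swap
    if (any(prev_lo < p < prev_hi for p in cur_positions)
            or any(cur_lo < p < cur_hi for p in prev_positions)):
        return "s"
    if cur_lo == prev_hi + 1:
        return "m"   # monotone: starts right after the previous unit
    if cur_hi == prev_lo - 1:
        return "s"   # swap: ends right before the previous unit
    if cur_lo > prev_hi + 1:
        return "f"   # forward jump (a gap is left on the input side)
    return "b"       # backward jump

# third tuple of the second example: 'vraiment' (42) followed by 'ont été' (41 43)
print(orientation([42], [41, 43]))  # -> s, via the interleaving rule
print(orientation([3], [5, 6]))     # -> f, as for 'des Talibans' in the first example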


Distortion Lattice Model

Each rewrite rule used to extend the word lattice with a reordering path (see the Decoding section) has an associated probability distribution pl, estimated based on counts using the maximum likelihood principle:

pl(f*|f) = count(f,f*) / Σf' count(f,f')
where f is an original sequence of input words (or POS tags), and f* is a permutation (or reordering) of f.


Distortion Penalty

This model implements a 'weak' distance-based reordering model (with distance measured in words) that penalizes long reorderings, which are only allowed when sufficiently supported by the rest of the models. It follows the equation:

p(f1^J, e1^I) = Πj=1..J exp(|position(j) - position(j-1)| - 1)
where position(j) refers to the original position of the jth input word of the hypothesis.
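
The quantity accumulated by this feature is the sum over consecutive input words of |position(j) - position(j-1)| - 1 (the parenthesization is one reading of the expression above); the associated log-linear weight then scales it into a cost. The sketch below computes this sum for a monotone order and for the word order induced by the first lexicalized reordering example, assuming the positions inside a tuple are visited left to right.

def distortion(positions):
    # sum over j of (|position(j) - position(j-1)| - 1); a strictly monotone order costs 0
    return sum(abs(cur - prev) - 1 for prev, cur in zip(positions, positions[1:]))

# input word positions in the order they are consumed by the tuple sequence of the
# first lexicalized reordering example above
example_order = [0, 1, 2, 3, 5, 6, 7, 8, 9, 4, 10, 11, 12, 14, 13, 15]
print(distortion(list(range(16))))  # monotone order: 0
print(distortion(example_order))    # reordered example: 12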