Feature Functions

The log-linear model approach underlying BNSMT allows the combination of several feature functions, each scoring a different aspect of translation quality. The probability of a translation e of an input sentence f is computed as:

p(e|f) = Πi hi(e,f)^λi
where hi are the feature functions and λi the corresponding weights. Note that the decoder internally uses negative logarithms (positive costs).
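
As an illustration, the following sketch combines several feature scores in the negative-log (cost) domain, exactly as described above. The feature names, probabilities and weights are hypothetical and only serve to show how the weights λi enter the computation.

import math

# Hypothetical feature values h_i(e,f) for one hypothesis and their tuned weights lambda_i.
features = {"bilingual_lm": 1.2e-4, "target_lm": 3.5e-3, "tuple_bonus": math.exp(5)}
weights  = {"bilingual_lm": 1.0,    "target_lm": 0.8,    "tuple_bonus": 0.3}

# p(e|f) is proportional to the weighted product of the feature functions ...
score = 1.0
for name, h in features.items():
    score *= h ** weights[name]

# ... which the decoder handles as a sum of weighted negative logarithms (positive costs).
cost = sum(-weights[name] * math.log(h) for name, h in features.items())

print(score, math.exp(-cost))  # both views agree: score == exp(-cost)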

The tuning stage of the system is used to set the weights. Next, we detail the components (models) implemented in our system.


Bilingual n-gram Language Models

The 'main' translation model implemented by our SMT system is based on bilingual n-grams (tuples). This model actually constitutes a language model over a particular 'bi-language' composed of tuples. In this way, the translation model probabilities at the sentence level are approximated using n-grams of tuples, as described by the following equation:

p(f1^J, e1^I) = Πk=1..K p((f,e)k | (f,e)k-1, ..., (f,e)k-n+1)
where e refers to the target, f to the input, (f,e)k to the kth tuple of a given bilingual sentence pair, and n is the language model order.

Multiple bilingual n-gram language models can be used at decoding time. Each bilingual LM must be computed over one of the tuple factored forms available to the decoder. Typically, tuples are built from raw words, but many other factored forms can be used (POS tags, lemmas, stems, etc.).
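
A minimal sketch of the 'bi-language' idea follows: each tuple becomes a single bilingual token, and the resulting token stream is what a standard n-gram language model is estimated on. The sentence pair, the tuple segmentation and the token format are illustrative, not the decoder's internal representation.

# Tuples (source side, target side) extracted for one training sentence pair (illustrative).
tuples = [("nous", "we"), ("voulons", "want"), ("traduire", "to translate")]

# Each tuple is mapped to one token of the bi-language; an empty target side becomes NULL.
def to_bitoken(src, trg):
    return "{}|||{}".format(src.replace(" ", "_"), trg.replace(" ", "_") if trg else "NULL")

bitext_line = " ".join(to_bitoken(s, t) for s, t in tuples)
print(bitext_line)
# -> nous|||we voulons|||want traduire|||to_translate
# A standard n-gram LM estimated over such lines yields p((f,e)k | (f,e)k-1, ...).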


Tuple Bonus Model

The use of any language model probability is associated with a length comparison problem. In other words, when two hypotheses compete in the search for the most probable path, the one using fewer elements (whether words or translation units) is favored over the one using more, since its accumulated partial score is computed by multiplying a smaller number of probabilities. This problem results from the fact that the number of tokens used to translate a given sentence is not fixed and equivalent across all paths.

The tuple bonus model is used to compensate for the system's preference for hypotheses using fewer tuples (i.e. tuples with larger source sides). It is implemented according to the following equation:

p(f1^J, e1^I) = exp(K)
where K is the number of tuples used in the translation hypothesis.
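
The effect is easiest to see in the cost domain: since -log exp(K) = -K, the feature contributes a fixed bonus per tuple, counteracting the extra probabilities that longer tuple sequences must multiply in. The sketch below uses purely illustrative probabilities and an illustrative weight.

import math

def bilm_cost(tuple_probs):
    # cost contributed by the bilingual LM: sum of negative log probabilities
    return sum(-math.log(p) for p in tuple_probs)

def bonus_cost(k, weight):
    # tuple bonus feature: p = exp(K), so its weighted cost is -weight * K
    return -weight * k

hyp_a = [0.15, 0.15, 0.15]              # 3 large tuples
hyp_b = [0.30, 0.30, 0.30, 0.30, 0.30]  # 5 smaller tuples

for hyp in (hyp_a, hyp_b):
    print(len(hyp), bilm_cost(hyp), bilm_cost(hyp) + bonus_cost(len(hyp), weight=0.5))
# Without the bonus, the 3-tuple derivation is cheaper; with this (illustrative)
# weight, the per-tuple bonus makes the 5-tuple derivation the cheaper one.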


Target n-gram Language Models

This feature provides information about the target language structure and fluency, by favoring those partial-translation hypotheses which are more likely to constitute correctly structured target sentences over those which are not. The model implements a standard word n-gram model of the target language, which is computed according to the following expression:

p(f1^J, e1^I) = Πi=1..I p(ei | ei-1, ..., ei-n+1)
where e refers to the target, f to the input, ei to the ith target word of a given translation hypothesis, and n is the language model order.

Multiple target n-gram language models can be used at decoding time. Each model must be computed over one of the tuple target-side factored forms available to the decoder.
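
To make the chain of conditionals above concrete, the sketch below estimates a tiny count-based bigram model (n = 2) and scores a hypothesis target side with it. A real system would use a smoothed, higher-order model trained on large monolingual data; the sentences here are illustrative.

import math
from collections import Counter

corpus = [["<s>", "the", "forces", "brought", "mixed", "results", "</s>"],
          ["<s>", "the", "results", "were", "mixed", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])               # history counts (</s> never acts as a history)
    bigrams.update(zip(sent[:-1], sent[1:]))

def p_bigram(w, prev):
    # unsmoothed maximum-likelihood estimate of p(w | prev)
    return bigrams[(prev, w)] / unigrams[prev]

hyp = ["<s>", "the", "forces", "brought", "mixed", "results", "</s>"]
logprob = sum(math.log(p_bigram(w, prev)) for prev, w in zip(hyp[:-1], hyp[1:]))
print(math.exp(logprob))  # product over i of p(ei | ei-1)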


Target Word Bonus Model

Similar to the tuple bonus model, the target word bonus model is used to compensate for the system's preference for shorter translation hypotheses. It is implemented according to the following equation:

p(f1^J, e1^I) = exp(I)


Source n-gram Language Models

This feature aims at providing information about the reordered input language (reordered so as to allow the tuple target sides to match the target language word order). The model implements a standard word n-gram model of the input language, computed according to the following expression:

p(f1^J, e1^I) = Πj=1..J p(fj | fj-1, ..., fj-n+1)
where fj refers to the jth input word of a given translation hypothesis, and n is the language model order.

Multiple source n-gram language models can be used at decoding time. Each model must be computed over one of the tuple source-side factored forms available to the decoder.


Source Word Bonus Model

Similar to the previous bonus models, the source word bonus model is used to compensate for the system's preference for shorter input sentences. Hence, it is only applied when translating word lattices containing paths of different lengths. It is implemented according to the following equation:

p(f1^J, e1^I) = exp(J)


Tuple (unigram) Models

As in standard SMT systems, Ncoder introduces additional models to account for the statistical consistency of the pair of word sequences present in each translation unit. All models follow the same equation:

p(f1^J, e1^I) = Πk=1..K p((f,e)k)

Our current system implementation employs four different estimations of p((f,e)k), where ei and fj are respectively the ith and jth words of the tuple target and source sides. The lexical probability distribution plex is computed based on counts using the training bitext word alignments.
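
The four estimation formulas are not reproduced here, so the sketch below only illustrates two count-based quantities such models typically rely on: a relative-frequency estimate of p((f,e)k) from tuple counts, and a word-level lexicon built from the training bitext word alignments (the kind of table a plex score aggregates over the words of a tuple). Both are assumptions about the general approach, not the exact definitions used by Ncoder, and all counts are illustrative.

from collections import Counter

# Tuple occurrences extracted from the training bitext (illustrative counts).
tuple_counts = Counter({("des Talibans", "Taliban"): 14, ("des", "NULL"): 250, ("forces", "forces"): 40})
total_tuples = sum(tuple_counts.values())

def p_tuple(src, trg):
    # relative-frequency (unigram) estimate of p((f,e)k)
    return tuple_counts[(src, trg)] / total_tuples

# Word-alignment links (source word, target word) collected from the bitext (illustrative counts).
link_counts = Counter({("forces", "forces"): 38, ("Talibans", "Taliban"): 14, ("des", "the"): 120})
src_totals = Counter()
for (f_word, e_word), c in link_counts.items():
    src_totals[f_word] += c

def p_word(e_word, f_word):
    # word translation probability p(e|f) estimated from the alignment counts
    return link_counts[(f_word, e_word)] / src_totals[f_word]

print(p_tuple("forces", "forces"), p_word("Taliban", "Talibans"))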


Lexicalized Reordering

As in standard SMT systems, we consider a lexicalized reordering model that conditions reordering on the actual tuples. In our implementation, we consider four base reordering types, as well as two overlapped types. More formally, we introduce a reordering model po that predicts an orientation type orientation ∈ {m,s,f,b} given the tuple currently used in translation:

po(orientation|f,e)

In order to learn this reordering model, we count how often each extracted tuple is found with each of the four orientation types. The probability distribution po is then estimated based on these counts using the maximum likelihood principle:

po(orientation|f,e) = count(orientation,f,e) / Σo count(o,f,e)

Given the sparse statistics of the orientation types, we may want to smooth the probability distribution with some factor σ:

po(orientation|f,e) = (σ/4 + count(orientation,f,e)) / (σ + Σo count(o,f,e))

where σ = 1 / Σo count(o,f,e)
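
A small sketch of this estimation step follows: orientation counts are collected per tuple and turned into a smoothed distribution with the formula above. The observations are illustrative.

from collections import Counter, defaultdict

ORIENTATIONS = ("m", "s", "f", "b")

# (source, target, orientation) observations collected during tuple extraction (illustrative).
observations = ([("forces", "forces", "m")] * 30
                + [("forces", "forces", "b")] * 3
                + [("forces", "forces", "s")] * 1)

counts = defaultdict(Counter)
for src, trg, o in observations:
    counts[(src, trg)][o] += 1

def p_orientation(orientation, src, trg):
    c = counts[(src, trg)]
    total = sum(c.values())
    sigma = 1.0 / total                      # sigma = 1 / sum_o count(o,f,e)
    return (sigma / 4 + c[orientation]) / (sigma + total)

print({o: round(p_orientation(o, "forces", "forces"), 4) for o in ORIENTATIONS})
# Unseen orientations receive a small non-zero probability thanks to the smoothing term.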

The next example shows the reordering orientations (last column) computed over a sequence of tuples:
les          ||| NULL          ||| 0     ||| (m) 
opérations   ||| operations    ||| 1     ||| (m)
contre       ||| there against ||| 2     ||| (m)
les          ||| NULL          ||| 3     ||| (m)
des Talibans ||| Taliban       ||| 5 6   ||| (f)
et           ||| and           ||| 7     ||| (m)
d' Al-Qaeda  ||| al-Qaeda      ||| 8 9   ||| (m)
forces       ||| forces        ||| 4     ||| (b)
ont obtenu   ||| brought       ||| 10 11 ||| (f)
des          ||| NULL          ||| 12    ||| (m)
mitigés      ||| mixed         ||| 14    ||| (f)
résultats    ||| result        ||| 13    ||| (s)
.            ||| .             ||| 15    ||| (f)
A particular case concerns tuples with non-consecutive input words (see the third tuple in the example below). In such cases, we arbitrarily set the orientation of a unit to swap (s) when it has words falling within the span of the previous one (or when the previous one has words falling within the current one).
qui       ||| that     ||| 40    ||| (m)
vraiment  ||| truly    ||| 42    ||| (f)
ont été   ||| were     ||| 41 43 ||| (s)
terribles ||| dreadful ||| 44    ||| (m)
.         ||| .        ||| 45    ||| (m)
The reordering model may also be conditioned on the next tuple (not only on the previous one); the same estimation method is used.
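
The text does not give closed-form definitions of the four orientation types, so the following sketch reconstructs them from the two examples above: m when a unit starts right after the previous one ends, s when it ends right before the previous one starts, f when it jumps further forward, b when it jumps backward, together with the stated special case for units whose words interleave with the previous one. This is an inference from the examples, not necessarily the decoder's exact rule.

def orientation(prev_positions, cur_positions):
    prev_lo, prev_hi = min(prev_positions), max(prev_positions)
    cur_lo, cur_hi = min(cur_positions), max(cur_positions)
    # special case: one unit has words falling inside the span of the other -> swap
    if (any(prev_lo < p < prev_hi for p in cur_positions)
            or any(cur_lo < p < cur_hi for p in prev_positions)):
        return "s"
    if cur_lo == prev_hi + 1:
        return "m"   # monotone: starts right after the previous unit
    if cur_hi == prev_lo - 1:
        return "s"   # swap: ends right before the previous unit
    if cur_lo > prev_hi + 1:
        return "f"   # forward jump (a gap is left on the input side)
    return "b"       # backward jump

# third tuple of the second example: 'vraiment' (42) followed by 'ont été' (41 43)
print(orientation([42], [41, 43]))  # -> s, via the interleaving rule
print(orientation([3], [5, 6]))     # -> f, as for 'des Talibans' in the first example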


Distortion Lattice Model

Each rewrite rule used to extend the word lattice with a reordering path (see the Decoding section) has an associated probability distribution pl, estimated based on counts using the maximum likelihood principle:

pl(f*|f) = count(f,f*) / Σf' count(f,f')
where f is an original sequence of input words (or POS tags), and f* is a permutation (or reordering) of f.


Distortion Penalty

This model implements a 'weak' distance-based reordering model (with distance measured in words) that penalizes long reorderings, which are only allowed when sufficiently supported by the rest of the models. It follows the equation:

p(f1^J, e1^I) = Πj=1..J exp(|position(j) - position(j-1)| - 1)
where position(j) refers to the original position of the jth input word of the hypothesis.
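
The quantity accumulated by this feature is the sum over consecutive input words of |position(j) - position(j-1)| - 1 (the parenthesization is one reading of the expression above); the associated log-linear weight then scales it into a cost. The sketch below computes this sum for a monotone order and for the word order induced by the first lexicalized reordering example, assuming the positions inside a tuple are visited left to right.

def distortion(positions):
    # sum over j of (|position(j) - position(j-1)| - 1); a strictly monotone order costs 0
    return sum(abs(cur - prev) - 1 for prev, cur in zip(positions, positions[1:]))

# input word positions in the order they are consumed by the tuple sequence of the
# first lexicalized reordering example above
example_order = [0, 1, 2, 3, 5, 6, 7, 8, 9, 4, 10, 11, 12, 14, 13, 15]
print(distortion(list(range(16))))  # monotone order: 0
print(distortion(example_order))    # reordered example: 12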