Mahdi & Company

Parallel Corpora

05 Feb 2020

Traditionally, the term bitext refers to a document paired with its translation into another language. Today it is used more broadly for any parallel source, including multiple translations of a common original (e.g. sacred texts, official United Nations documents, movie subtitles).

Furthermore, alternative translations into the same language are also considered a bitext (e.g. The Little Prince has been translated into Arabic and into Persian by multiple translators, with different translation styles and degrees of freedom).

The most important feature of a bitext is that the texts coupled together are correlated in some way. To make that correlation explicit, we align the texts, and a range of techniques exists for doing so.

Parallel corpora usually carry some degree of inaccuracy, which results from alternative translations, unknown translation history, varying translation directions, inconsistent degrees of freedom in translation, and so on. Depending on the needs and purpose the corpus serves, it does not have to be complete in every part.

Parallel corpora can be domain-specific or balanced (i.e. drawing content from various domains in balanced proportions). A corpus can be static, gathered and curated manually, or collected dynamically and still growing. The designer of the collection needs to make explicit, objective decisions about the purpose the corpus serves and about its structure and construction. Most corpora are created with a specific purpose in mind.

It is also possible to construct an ensemble corpus from other open-source or proprietary sources. The designer of such a corpus needs to take into account the quality of each source, the importance and relevance weight of each bitext, and how domain-specific and balanced corpora will be mixed together.
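As a rough illustration of weighting sources in an ensemble corpus, the sketch below draws sentence pairs from several sources in proportion to a weight. The `mix_sources` helper, the source names, and the weights are all hypothetical, and weighted sampling is only one of many possible ways to mix sources.

```python
import random


def mix_sources(sources, weights, size, seed=0):
    """Draw `size` sentence pairs, picking the source of each draw by weight.

    sources: dict mapping a source name to a list of (source_text, target_text) pairs
    weights: dict mapping the same names to a non-negative relevance/quality weight
    """
    rng = random.Random(seed)
    names = [name for name in sources if sources[name]]  # skip empty sources
    picks = rng.choices(names, weights=[weights[n] for n in names], k=size)
    # Sampling is with replacement, so small sources may repeat pairs.
    return [(name, rng.choice(sources[name])) for name in picks]


# Hypothetical toy sources: one domain-specific, one balanced.
sources = {
    "legal": [("the court", "la cour"), ("the law", "la loi")],
    "general": [("good morning", "bonjour"), ("thank you", "merci")],
}
weights = {"legal": 3.0, "general": 1.0}
mixed = mix_sources(sources, weights, size=8)
```

With these weights, roughly three out of four draws are expected to come from the "legal" source. A real pipeline would more likely deduplicate and sample without replacement.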

Before Alignment

Before starting the alignment process, the designer needs to carefully consider two primary questions:

  1. What is known about the input?
    1. How closely are the texts related?
    2. Is the translation complete, with nothing missing from either text?
    3. What is the degree of freedom in the translated text? How much variation is expected?
    4. Why and how was the bitext produced, and what limitations did the collector face?
    5. What was the purpose? (e.g. translation, simplification, summarization, localization)
    6. Can we assume that both texts were targeting the same audience?
  2. What purpose does alignment serve?
    1. What do we want to do with the aligned bitexts?
    2. What level of quality is acceptable?
    3. What other parties could benefit from it? Do we need to keep their needs in mind, or have we decided to leave that for the future?

Alignment Strategy

Aligning a bitext is hard when the search space is vast. The first step is therefore to build the alignment top-down and hierarchically: start with a coarse level of segmentation (e.g. books or movie subtitles) and then proceed to smaller segments (e.g. chapters, sequences). This restricts the search space and increases the probability of a successful match.

This approach works especially well when the text has some kind of structure, for example sacred texts that follow an inherent segmentation (e.g. the books of the Old Testament or the surahs of the Quran). Typically, the outcome of this type of alignment is more reliable.
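A minimal sketch of the top-down idea, assuming each document is already split into named coarse units (here, chapters) that share identifiers across languages. The `align_top_down` helper and the toy data are hypothetical, and the fine-level step simply pairs segments by position as a placeholder for a real sentence aligner.

```python
def align_top_down(source_doc, target_doc):
    """Pair coarse units by their shared id, then pair segments inside each unit.

    Each document is a dict: chapter id -> list of segments (e.g. verses).
    """
    aligned = []
    for chapter_id, src_segments in source_doc.items():
        tgt_segments = target_doc.get(chapter_id)
        if tgt_segments is None:
            continue  # no counterpart at the coarse level, so nothing to search
        # Placeholder fine-level step: pair by position inside the matched chapter.
        for position, (src, tgt) in enumerate(zip(src_segments, tgt_segments)):
            aligned.append((chapter_id, position, src, tgt))
    return aligned


# Hypothetical toy documents keyed by chapter.
source_doc = {"ch1": ["First verse.", "Second verse."], "ch2": ["Third verse."]}
target_doc = {"ch1": ["Premier verset.", "Deuxième verset."], "ch3": ["Autre verset."]}
pairs = align_top_down(source_doc, target_doc)
```

Because segments are only compared within a matched chapter, a segment in "ch2" is never considered against one in "ch3", which is the search-space restriction described above.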

The relationship between the segments of a bitext is not always one-to-one. Languages can be characterized by their average sentence length: language A, and therefore texts written in it, may tend toward long sentences, while language B favours shorter ones. As a result, translators often split one sentence into several or combine several sentences into one.
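To illustrate how split or merged sentences can be handled, here is a simplified length-based dynamic program. It is loosely in the spirit of the length-based method of Gale and Church, but it uses a plain character-length difference as the cost rather than their probabilistic model. The `align_by_length` name, the allowed groupings (1-1, 1-2, 2-1), and the `merge_penalty` value are assumptions made for this sketch.

```python
def align_by_length(src, tgt, merge_penalty=3):
    """Return aligned groups as (source_indices, target_indices) tuples."""
    groupings = [(1, 1), (1, 2), (2, 1)]  # sentences consumed on each side
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in groupings:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                # Cost: length mismatch, plus a penalty for anything but 1-1.
                step = abs(src_len - tgt_len)
                if (di, dj) != (1, 1):
                    step += merge_penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        return []  # no full alignment exists with the allowed groupings
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((tuple(range(pi, i)), tuple(range(pj, j))))
        i, j = pi, pj
    return path[::-1]


# Toy data: the "translator" split the first long sentence into two.
src = ["This is one long sentence that the translator decided to split in two parts.",
       "Short one."]
tgt = ["This is one long sentence.", "The translator split it in two.", "Short one."]
groups = align_by_length(src, tgt)
```

Here the first source sentence is grouped with the first two target sentences, and the second source sentence with the third. Real aligners normalize for the typical length ratio between the two languages, which this sketch ignores.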


Collecting bitexts and aligning them to build a parallel corpus is not a well-defined process; it depends on the intended applications of the corpus. The two ends of the spectrum are listed below, but any combination and degree of variation between them is possible:

  1. A large amount of noisy parallel text that is aligned automatically at a high level of resolution.
  2. A small amount of high-resolution data that is mapped precisely and is highly structured.

Building a Statistical Machine Translation (SMT) system typically requires a large amount of text, and a high level of noise can be tolerated. Translation Memories and Computer-Aided Translation (CAT) tools, by contrast, require precisely mapped and highly structured data. Today the main application of parallel corpora is data-driven machine translation.

Regardless of the application, for a bitext to be useful it needs to be clean, well-defined, and tightly scoped, with known translation directions and mapping metadata.
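One way to make translation direction and mapping metadata explicit is to store them with every aligned unit. The record below is a hypothetical sketch: the `AlignedSegment` class, its field names, and the direction labels are assumptions for illustration, not an established format.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedSegment:
    """One aligned unit of a bitext plus the metadata needed to interpret it."""
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str
    # Which side is the original: "source->target", "target->source", or "unknown".
    direction: str = "unknown"
    # Free-form mapping metadata, e.g. document id or segment position.
    metadata: dict = field(default_factory=dict)

    def is_well_defined(self):
        """True when both sides have text and the translation direction is known."""
        has_text = bool(self.source_text.strip()) and bool(self.target_text.strip())
        return has_text and self.direction != "unknown"


# Hypothetical toy segments: only the first one passes the check.
segments = [
    AlignedSegment("en", "fr", "Thank you.", "Merci.", "source->target",
                   {"doc": "toy-001", "position": 0}),
    AlignedSegment("en", "fr", "Good morning.", "", "source->target"),
    AlignedSegment("en", "fr", "The house.", "La maison."),
]
usable = [s for s in segments if s.is_well_defined()]
```

A filter like `is_well_defined` is a simple way to drop units with a missing side or an unknown direction before the corpus is used downstream.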

To understand the various applications of parallel corpora, it is useful to divide them into two categories according to whether they depend on alignment at word granularity.

1. General applications of the parallel corpus

Example-based Machine Translation

This is translation by analogy. It rests on the idea that translation does not require deep linguistic analysis. Instead, a sentence is decomposed into smaller fragments that are translated by analogy with existing examples. A parallel corpus can supply the database of human-made example translations used to automate the process.

Computer-aided language learning

Computer software can improve both the experience and the results of language learners. A parallel corpus can be used to teach by example rather than by theory.

Supporting translators using parallel corpora

A Translation Memory built with the help of a parallel corpus is a valuable tool for improving the quality and speed of translators' work. It helps them choose the right sense for words that are particularly difficult to translate because their senses, and the contexts in which they are used, differ between languages.

Computer-aided translation (CAT) tools can incorporate domain-specific or balanced corpora to speed up translators' work. Furthermore, a translation memory can be generated from the bitext as it is being translated, which helps translators stay consistent in their word choices and terminology.
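A translation memory lookup can be sketched as a fuzzy search over previously translated segments. The example below uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure. The `lookup` helper, the 0.7 threshold, and the toy memory are assumptions for illustration; actual CAT tools use their own matching schemes.

```python
from difflib import SequenceMatcher


def lookup(memory, query, threshold=0.7):
    """Return (score, source, target) for the closest stored source, or None."""
    best = None
    for source, target in memory:
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None  # nothing similar enough; the translator starts from scratch


# Hypothetical toy memory of (source, target) segments.
memory = [
    ("Press the red button to stop the machine.",
     "Appuyez sur le bouton rouge pour arrêter la machine."),
    ("Close the cover before starting.",
     "Fermez le couvercle avant de démarrer."),
]
match = lookup(memory, "Press the green button to stop the machine.")
```

The query differs from the first stored segment by one word, so that segment is returned as a fuzzy match and its translation can be offered to the translator as a starting point.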

2. Word-aligned bitext

Rule-based Machine Translation

This is the classic approach to Machine Translation. It is based on one or more rules representing the regular structure of the source language and the regular structure of the target language. The approach suffers from multiple shortcomings and has generally been abandoned in favour of Statistical Machine Translation. Nevertheless, word-aligned bitexts may be used to simplify or create translation rules.

Statistical Machine Translation

This approach uses statistical models to predict the most probable translation of a sentence. This is where a parallel corpus built from a large amount of text, even one that contains a lot of noise, can shine.
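As one classic example of a statistical model learned from a sentence-aligned corpus, the sketch below estimates word translation probabilities with a few rounds of expectation-maximization, in the style of IBM Model 1. It is deliberately simplified (no NULL word, no smoothing), and the function names and toy corpus are assumptions for illustration.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] from tokenized sentence pairs."""
    # Start uniform over every co-occurring word pair; the E-step normalizes.
    t = {}
    for src, tgt in pairs:
        for f in tgt:
            for e in src:
                t[(f, e)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for `source_word`."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Toy English-German corpus, already tokenized.
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_model1(pairs)
```

Even on three sentence pairs, the co-occurrence statistics are enough for the model to prefer "das" for "the" and "Buch" for "book". With more data the same procedure tolerates a fair amount of noise, which is why large, noisy corpora remain useful here.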

Mahdi Mamouri - Principal Machine Learning Engineer of Mahdi & Co

In love with building businesses around digital story telling, data mining, and data analytics.