Mahdi & Company

Translation Memory

10 Mar 2020

A translation memory is a database (static or dynamically growing) of existing translations, structured in a way that facilitates translation re-use. Translators have sometimes viewed the concept negatively, because the proprietary software that advertises this feature was sometimes designed with a very basic, limiting, and error-prone implementation.

Nevertheless, computer-aided translation has evolved with the introduction of the translation memory. Sometimes the translation memory is compiled from bitexts produced from other sources, and sometimes it is complemented by the content that is currently being translated. This technique is of great interest when translating documents that contain many similar sentences or that are revisions of previous versions (e.g. legal documents, technical manuals, product documentation, food recipes, etc.). It has economic importance because it reduces the time required to translate updated documents.

Most Machine-Aided Human Translation systems offer dynamic terminology lookup and make use of translation memory. They can also offer a definition for various kinds of elements, such as location names, person names, organization names, etc.

One of the systems that we developed internally extracts information about named entities. The system gives priority to precision over recall, and it relies on various external open source and proprietary databases to extract information.

Smaller units of text and sequencing

Although a sentence is sometimes considered the smallest meaningful unit of a body of text, it is often useful to divide the sentence into smaller units and align those, such as idiomatic expressions, compounds, or meaningful independent expressions (e.g. “DNA sequencing” in “DNA sequencing is the process of determining …”), and even single words.

The result of this process can be used to augment bilingual dictionaries and to implement a powerful user interface that improves the speed and reliability of the translations.

A translation memory is based on a sequence-to-sequence relation between two or more texts. Many of the available products on the market do not provide satisfactory results for alignment. In most translation services, a great number of human work hours are wasted on correcting the automatic alignment.

A reliable solution for sentence alignment would use statistical methods based on the characteristics of the sentence, lexical characteristics, lemma characteristics, and other mixed methods. Rules extracted based on heuristics may also play a partial role in the alignment.
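As a rough illustration of one such sentence characteristic, the sketch below aligns two lists of sentences using character lengths only, with dynamic programming. It is a deliberately simplified stand-in for length-based statistical aligners, not the method used by any particular product. The function name `align_sentences` and the two penalty values are assumptions chosen for this example; a real aligner would combine length with lexical and lemma evidence.

```python
def align_sentences(source, target, skip_penalty=20, merge_penalty=5):
    """Align two lists of sentences using character lengths only.

    Returns a list of (source_sentences, target_sentences) pairs.
    The penalties are illustrative values, not tuned parameters.
    """
    # Allowed alignment shapes: (source sentences used, target sentences used).
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in source[i:ni])
                tgt_len = sum(len(t) for t in target[j:nj])
                # Sentences that translate each other tend to have similar lengths.
                step = abs(src_len - tgt_len)
                if di == 0 or dj == 0:
                    step += skip_penalty   # a sentence with no counterpart
                elif (di, dj) != (1, 1):
                    step += merge_penalty  # two sentences merged into one
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Walk back from the end to recover the cheapest alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = ["Hello.", "This is a long sentence about translation.", "Bye."]
    french = ["Bonjour.", "Ceci est une longue phrase", "sur la traduction.", "Salut."]
    for src, tgt in align_sentences(english, french):
        print(src, "<->", tgt)
```

Length alone is a weak signal, so the output of a sketch like this would still need the lexical checks and human correction described above.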

A Computer-Aided translation tool that provides a Translation Memory database to improve the translators’ efficiency

In structure, a Translation Memory is very similar to a bilingual concordance, but its function is a bit different. The translation memory computes sentence-to-sentence similarity. Typically, the Computer-Aided Translation software helps translators by finding fuzzy or exact matches between sentences in the document and sentences in the Translation Memory database. It also updates the database in real time to include the sentences of the document currently being translated. If sufficiently valuable matches are found, the translators have a starting point, and the research they would otherwise need to do independently is right in front of their eyes. This is particularly valuable for closely related documents, for example the next version of a legal or technical document. The ability to re-use significant parts of previous documents leads to substantial time and cost savings.
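The lookup and real-time update can be sketched in a few lines. This is a minimal illustration, assuming an in-memory list of sentence pairs and using `difflib.SequenceMatcher` from the Python standard library as the similarity measure. The class name `TranslationMemory`, the `threshold` of 0.7, and the example sentences are assumptions made for this sketch, not part of any existing tool.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """A toy in-memory translation memory (illustrative only)."""

    def __init__(self):
        self.units = []  # list of (source sentence, target sentence)

    def add(self, source, target):
        """Store a translated pair; call this as the translator confirms sentences."""
        self.units.append((source, target))

    def lookup(self, sentence, threshold=0.7, limit=3):
        """Return up to `limit` (score, source, target) tuples, best match first.

        A score of 1.0 is an exact match (ignoring case); lower scores are fuzzy.
        """
        scored = []
        for source, target in self.units:
            score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
            if score >= threshold:
                scored.append((score, source, target))
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


if __name__ == "__main__":
    tm = TranslationMemory()
    tm.add("The contract ends on 1 May 2020.", "Le contrat prend fin le 1er mai 2020.")
    tm.add("Press the red button.", "Appuyez sur le bouton rouge.")

    # A sentence from the next version of the document yields a fuzzy match.
    for score, source, target in tm.lookup("The contract ends on 1 June 2021."):
        print(f"{score:.2f}  {source}  ->  {target}")
```

A linear scan is fine for a demonstration; a database of realistic size would need an index so that lookups stay fast while the translator types.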

The core process of a Computer-Aided Translation tool that utilizes Translation Memory is to parse the source documents into “fragments” of text and perform a similarity search on the Translation Memory database. The search finds appropriate candidates, which form an initial version of the target sentence to be reviewed and completed by the translators.

The matching algorithm selects the sentences, but the translator might accept, reject, or edit the provided translation. The matching algorithm needs to sort the matches by score to reduce information overload and present the best matches to the translators. Generally, two factors contribute to the similarity score:

The sentence structure: String of characters, the sequence of the words, the sequence of lemmatized words
The metrics: Density of similar units, order, contiguity
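One possible way to turn these factors into a number is sketched below, working on the sequence of words. The three metrics are computed as simple ratios and combined with a weighted sum. The function names, the tokenizer, and the weights are assumptions made for this example, and lemmatization is left out because it would require a language-specific library.

```python
from difflib import SequenceMatcher

PUNCTUATION = ".,;:!?\"'()"


def tokenize(sentence):
    """Lowercase the sentence and split it into words without punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in sentence.split())
    return [w for w in words if w]


def similarity_metrics(query, candidate):
    """Compute density, order, and contiguity, each between 0.0 and 1.0."""
    q, c = tokenize(query), tokenize(candidate)
    total = max(len(q), len(c)) or 1

    # Density: how many words the two sentences share, regardless of position.
    shared = sum(min(q.count(w), c.count(w)) for w in set(q))

    # Order and contiguity: blocks of words that match in the same sequence,
    # as found by SequenceMatcher (a heuristic, not a guaranteed optimum).
    matcher = SequenceMatcher(None, q, c, autojunk=False)
    sizes = [block.size for block in matcher.get_matching_blocks() if block.size]
    in_order = sum(sizes)
    longest = max(sizes, default=0)

    return {
        "density": shared / total,
        "order": in_order / total,
        "contiguity": longest / total,
    }


def similarity_score(query, candidate, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of the three metrics; the weights are illustrative."""
    m = similarity_metrics(query, candidate)
    return (weights[0] * m["density"]
            + weights[1] * m["order"]
            + weights[2] * m["contiguity"])


if __name__ == "__main__":
    query = "Press the red button."
    for candidate in ["Press the red button.", "The red button, press.", "Close the door."]:
        print(candidate, round(similarity_score(query, candidate), 3))
```

With this scoring, a candidate that contains the same words in a different order ranks below an exact match but above a sentence that shares only one word, which is the kind of ordering the translator needs to see.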

Translators’ Efficiency

Obviously, exact matches are highly preferred, but translators also benefit from seeing close matches. Depending on the complexity of the language and the time required to choose the right sense in the target language, correcting a close match can take less time and effort than starting from a blank page.

Example of a Computer-Aided Translation Tool Interface

In the above example, different fragments of the text have been extracted, interesting matches have been queried, and the user interface has auto-completed the sections that have an exact match. The only thing left for the translator is to fill in the gaps. She can use the TAB key to move between gaps more quickly, and we can provide an autocomplete feature so that she does not need to type the phrases completely.
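The pre-filling step can be sketched as follows. This is a simplified illustration, assuming a fragment memory stored as a plain dictionary and a greedy longest-match segmentation of the source sentence. The function name `prefill_draft`, the gap marker, and the sample fragments are hypothetical. The sketch also keeps the fragments in source order, which a real tool could not assume for every language pair.

```python
def prefill_draft(sentence, fragment_memory, gap="____"):
    """Build a target-side draft: known fragments are filled in, the rest is a gap.

    `fragment_memory` maps lowercase source fragments to their translations.
    """
    words = sentence.split()
    draft = []
    i = 0
    while i < len(words):
        # Try the longest fragment that starts at word i first.
        for j in range(len(words), i, -1):
            key = " ".join(words[i:j]).lower()
            if key in fragment_memory:
                draft.append(fragment_memory[key])
                i = j
                break
        else:
            # No known fragment starts here: leave a single gap for the translator.
            if not draft or draft[-1] != gap:
                draft.append(gap)
            i += 1
    return " ".join(draft)


if __name__ == "__main__":
    memory = {
        "press": "appuyez sur",
        "the red button": "le bouton rouge",
    }
    print(prefill_draft("Press the red button twice", memory))
```

In an interface, each gap marker would become an editable field that the TAB key can jump to.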

The challenge for the future is to develop a new generation of Computer-Aided Translation systems that help the translator use partial matches more efficiently. With enough information, the translator can quickly judge how much effort is needed to translate from scratch versus modifying an existing fragment of the translation. Furthermore, once the translator accepts a template, the software would automatically remove the rejected alternatives.

It is also possible to let translators correct the alignment interactively, or to correct it automatically based on their behaviour, simply by observing which words they select for the target sentence. These analytics can be stored and used for subsequent translation needs by the translator herself or by other users.

Once again, the biggest win, and also the biggest challenge, is to perform statistical and linguistic analysis to match the fragments as closely and as accurately as possible. Otherwise, all the efforts would be unfruitful.

One more thing: The User Experience

The user interface is crucial to maximizing efficiency. Even if we use the most advanced techniques for alignment, the software will be abandoned if our user interface obstructs the translator or makes it more difficult for her to use the software.

It would be counter-productive to assume that the smallest unit of translation is a sentence. The user interface should provide adequate support for translators to translate by divide and conquer, but at the same time provide enough flexibility for them to merge sentences if they deem it appropriate.

It is rare that exact sentence matches from the translation memory can be used without any revisions. The goal of these systems is not to perform automatic translation, but rather to provide as much helpful information as possible to enable translators to make the correct decisions efficiently.

Mahdi Mamouri

In love with building businesses around digital story telling, data mining, and data analytics.
