To Make Better Translation Algorithms, Look To The Bible

Medicine uses Latin because it is a 'dead' language - the meanings of the words will not change over time. But if you want to modernize translations to different languages, an ancient book may help: The Bible.

Tools to translate text between languages are widely available - and rather awful. While they can create literal translations, style is hard to bring across without human intervention. If you read a computer-generated translation of China's Liu Cixin, you would miss everything, most importantly a great example of the best science-fiction culture since America of the 1950s.

Big Data can help, but it takes an enormous amount of data to make it possible. That's where the Bible comes in. Each version of the Bible contains more than 31,000 verses and the book exists in a vast number of languages and versions, which means it offers what researchers call "a large, previously untapped dataset of aligned parallel text."

Bible photo credit Chris Downer. Composite illustration courtesy of Keith Carlson. Provided by Dartmouth College

Using the Bible, researchers were able to produce over 1.5 million unique pairings of source and target verses from 34 versions of the English-language Bible for machine-learning training sets. The Bible is also thoroughly indexed by the consistent use of book, chapter and verse numbers. The predictable organization of the text across versions eliminates the risk of alignment errors that could be caused by automatic methods of matching different versions of the same text.
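To see why the verse index makes alignment easy, here is a minimal sketch in Python. It is not the researchers' code: the version names, the toy verse texts and the function build_pairs are all made up for illustration. It simply assumes each version is stored as a lookup from (book, chapter, verse) to text, and pairs every verse that two versions share.

```python
from itertools import permutations

# Hypothetical toy data, not real Bible text:
# version name -> {(book, chapter, verse): verse text}
versions = {
    "version_a": {
        ("Book", 1, 1): "In the beginning it was thus.",
        ("Book", 1, 2): "And it came to pass.",
    },
    "version_b": {
        ("Book", 1, 1): "At the start, this is how it was.",
        ("Book", 1, 2): "Then it happened.",
    },
    "version_c": {
        ("Book", 1, 1): "First, it was like this.",
    },
}


def build_pairs(versions):
    """Pair source and target verses that share a (book, chapter, verse) key."""
    pairs = []
    # Every ordered pair of distinct versions is a (source style, target style).
    for (src_name, src), (tgt_name, tgt) in permutations(versions.items(), 2):
        # Only verses present in both versions can be aligned.
        for key in sorted(src.keys() & tgt.keys()):
            pairs.append((src_name, tgt_name, key, src[key], tgt[key]))
    return pairs


pairs = build_pairs(versions)
for src_name, tgt_name, key, src_text, tgt_text in pairs[:2]:
    print(src_name, "->", tgt_name, key, "|", src_text, "=>", tgt_text)
```

Because the key is an exact match on book, chapter and verse, no fuzzy sentence matching is needed, which is the point the article makes about avoiding alignment errors.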

To define "style" for the study, the researchers point to sentence length, the use of passive or active voice, and word choice that could result in texts with varying degrees of simplicity or formality. According to the authors, "Different wording may convey different levels of politeness or familiarity with the reader, display different cultural information about the writer, be easier to understand for certain populations."
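Some of these style signals can be roughly measured. The Python sketch below is an assumption-laden illustration, not the study's method: the function style_features is hypothetical, it uses average words per sentence and average letters per word as crude stand-ins for sentence length and word choice, and it leaves out passive voice, which would need a grammatical parser.

```python
import re


def style_features(text):
    """Return crude style proxies for a passage of English text."""
    # Split on runs of sentence-ending punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "chars_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


features = style_features("He went out. She stayed in the house.")
print(features)
```

Comparing these numbers for the same verse in two versions gives a rough sense of which version is the simpler one.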

The team used 34 stylistically distinct Bible versions ranging in linguistic complexity from the "King James Version" to the "Bible in Basic English." The texts were fed into two algorithms - a statistical machine translation system called "Moses" and a neural network framework commonly used in machine translation, "Seq2Seq."

While different versions of the Bible were used to train the algorithms, systems could ultimately be developed that translate the style of any written text for different audiences. For example, a style translator could take an English-language selection from "Moby Dick" and translate it into different versions suitable for young readers, non-native English speakers, or any one of a variety of audiences.