This is the seventh post in our multi-part series covering our ediscovery chapter of a legal informatics textbook. In this series, we’re covering the history of the Electronic Discovery Reference Model (EDRM), core technical ediscovery concepts, the technologies powering ediscovery (encryption, machine learning, transcoding, etc.), as well as the future of ediscovery.
Today we’re covering a critical piece of technology in any case with data in multiple languages: machine translation. You can also get the ebook in full here.
Machine translation is simply translation from one human language to another, performed by a computer. In computing history, the idea is quite old, but it is only very recently that we’ve begun to enjoy machine translation that can approximate the quality of human translation. Even short of that milestone, however, machine translation has made it far more efficient to review foreign-language content.
In its most basic form, machine translation works by substituting each word in the source document with the equivalent word in the target language. This can yield comically poor results, however, because it doesn’t take into account the context of each word’s meaning within the wider phrase or sentence. There are several techniques for producing more nuanced translations, ranging from rules-based (carefully parsing the structure of each sentence, based on the linguistic rules of the source and target languages) to statistical (making informed guesses based on an analysis of massive databases of previously translated documents).
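To make the word-for-word approach concrete, here is a minimal sketch in Python. The tiny Spanish-to-English glossary and the function name `translate_word_for_word` are assumptions made up for illustration, not part of any real translation product. The output shows how ignoring context produces awkward results.

```python
# Toy Spanish-to-English glossary; the entries are illustrative assumptions.
GLOSSARY = {
    "el": "the",
    "gato": "cat",
    "negro": "black",
    "tengo": "i have",
    "hambre": "hunger",
}


def translate_word_for_word(sentence, glossary=GLOSSARY):
    """Replace each source word with its glossary entry; keep unknown words as-is."""
    words = sentence.lower().split()
    return " ".join(glossary.get(word, word) for word in words)


print(translate_word_for_word("El gato negro"))  # the cat black
print(translate_word_for_word("Tengo hambre"))   # i have hunger
```

A fluent translator would write "the black cat" and "I am hungry"; the substitution approach gets the vocabulary right but not the word order or the idiom, which is the gap that rules-based and statistical techniques try to close.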
Most recently, however, the industry has begun a shift to neural machine translation. This technique is based on a cutting-edge type of machine learning, neural networks, whose base unit of artificial intelligence is modeled on the biological neuron. Neural networks have experienced an exhilarating resurgence in the last decade, pushing to the forefront of AI techniques and reaching new levels of accuracy in previously stagnant areas of AI research.
One primary driver of this resurgence is the technique of combining multiple layers of neurons to create more sophisticated analyses; it’s known as deep learning. Deep learning has recently become more tractable for three main reasons: first, improvements in algorithms for training complicated networks; second, the existence of very large data sets, often acquired by leading consumer cloud companies (e.g., Google), with which to train them; and third, the repurposing of graphics processing units, or GPUs (historically used for video gaming and graphics applications), to speed up this training process by several orders of magnitude.
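The idea of stacking layers of neurons can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights are arbitrary numbers chosen by hand (a real network learns them during training), the sigmoid activation is one common choice among many, and the names `neuron`, `layer`, and `forward` are hypothetical.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of inputs passed through a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations


# Arbitrary weights for illustration only; training would normally set these.
network = [
    ([[0.5, -0.2], [0.1, 0.8]], [0.0, 0.1]),  # hidden layer: 2 inputs -> 2 neurons
    ([[1.0, -1.0]], [0.0]),                   # output layer: 2 inputs -> 1 neuron
]

print(forward([1.0, 0.0], network))
```

Production systems have many more layers and millions of weights, which is why the training algorithms, data sets, and GPUs mentioned above matter so much.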
One use of deep learning’s layered approach is to identify the relevant features of the data at ever-greater levels of abstraction. So, instead of looking at text as words or phrases, predetermined by the developer, a deep learning algorithm might first interpret the text as a series of characters, then look at the syllables, then look at the words, and so forth, until it has developed a sense for both the features and their relevance. Compared to statistical machine translation models, neural machine translation is generally both more efficient and more accurate.
In our increasingly globalized economy, the likelihood of encountering foreign-language content in ediscovery is ever greater. The cost of having humans translate all of that content can be enormous, sometimes prohibitively so. This cost is particularly difficult to bear when you know that much (if not most) of the translated content will turn out to be not only irrelevant, but obviously junk. Machine translation changes the equation, making it possible to quickly and accurately determine whether foreign-language content is likely to be relevant to the matter at hand. While this is currently not a substitute for the certified human translators required to generate translations suitable for use in court, machine translation is a powerful tool for focusing those valuable human translators on only the documents that matter.
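As a rough sketch of that triage workflow, the Python below runs each document through a translator and flags only those whose rough translation mentions a term of interest. Everything here is an assumption for illustration: `fake_translate` is a stand-in lookup table rather than a real machine translation service, the document IDs and keywords are invented, and real review platforms use far richer relevance signals than simple keyword matching.

```python
def triage_for_human_translation(documents, machine_translate, keywords):
    """Return the IDs of documents whose rough translation mentions any keyword."""
    flagged = []
    for doc_id, text in documents.items():
        rough = machine_translate(text).lower()
        if any(keyword in rough for keyword in keywords):
            flagged.append(doc_id)
    return flagged


# Stand-in translator for demonstration only: a table of canned translations.
CANNED = {
    "Der Vertrag wurde gestern unterzeichnet.": "The contract was signed yesterday.",
    "Mittagessen um zwölf?": "Lunch at twelve?",
}


def fake_translate(text):
    """Hypothetical placeholder for a call to a machine translation engine."""
    return CANNED.get(text, text)


documents = {
    "DOC-001": "Der Vertrag wurde gestern unterzeichnet.",
    "DOC-002": "Mittagessen um zwölf?",
}

print(triage_for_human_translation(documents, fake_translate, ["contract", "invoice"]))
# ['DOC-001']
```

Only DOC-001 would be routed to a certified human translator in this toy example; the lunch invitation is screened out without incurring any translation cost.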
In our next post, we’ll walk through transcoding, or the process of converting information from one format to another.