New Feature Preview: Foreign Language Detection

by Everlaw

November 14, 2014

Globalization is on the rise: we conduct business across borders and continents, breaking down language barriers daily. When it comes to litigation, this means that you are more likely to see documents in a foreign language. Dealing effectively with those documents is part of modern ediscovery. This month, we will release a feature to help you do that: foreign language detection. Though the feature is not yet live, I wanted to provide a behind-the-scenes preview of the new functionality!

What Is Language Detection, and Why Is It Important?

This new language detection feature will enable users to find documents by language. It searches through vast document collections to identify text in any of the supported foreign languages, so reviewers can isolate documents that need to be translated or assessed by an expert. This type of batch processing improves ediscovery efficiency:

It eliminates the need for reviewers to guess at the languages they see in a document. This results in faster identification of the type of expertise needed during review. For example, a firm can run this check at the start of review, ensuring that the reviewers they hire possess the language skills needed to understand the documents in the collection.
It helps language specialists to receive all of the files they will be working on at one time. This means a faster, more streamlined review, without the inefficiency of many separate sessions.

Moreover, when foreign text is detected in a document, a machine-generated translation is immediately available without having to leave the tool! Though not intended as a replacement for professional translation, this rough translation can help reviewers prioritize documents needing more extensive review. It enables early identification of hot documents, case trends, or additional needed expertise. For example, if the initial translation reveals Portuguese vocabulary pertaining to IP, a Portuguese speaker with IP experience can be sought.

At present, the tool is able to detect 53 different languages, ranging from broadly used languages like Chinese (both simplified and traditional character sets) to less widely used languages like Swedish. (A full list is at the end of this post.) In any given document, it can highlight up to three different languages. This can be critical for cases involving multilingual employees or international offices.
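To make the "up to three languages" idea concrete, here is a minimal sketch. It assumes a document has already been split into segments and that each segment has been given a language label. The function name `top_languages`, the `min_share` cutoff, and the sample labels are all hypothetical, and this is not necessarily how the feature itself chooses which languages to highlight.

```python
from collections import Counter

def top_languages(segment_labels, max_languages=3, min_share=0.05):
    """Return the most common languages in a document, capped at max_languages.

    segment_labels: one language code per text segment (an assumed input format).
    min_share: hypothetical cutoff that drops languages seen in very few segments.
    """
    counts = Counter(segment_labels)
    total = sum(counts.values())
    # most_common() yields languages from most to least frequent.
    ranked = [lang for lang, count in counts.most_common() if count / total >= min_share]
    return ranked[:max_languages]

# Illustrative labels for a 20-segment email thread.
labels = ["en"] * 10 + ["ja"] * 6 + ["zh"] * 3 + ["fr"] * 1
highlighted = top_languages(labels)
```

A cap like this keeps the highlighting readable even when a document contains stray words from many languages.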

What Made Foreign Language Detection Challenging?

The biggest challenge that engineer Jordan came across when creating the feature was how to make the system accurate enough. There are two parts to accuracy: precision and sensitivity.

Precision is a tool’s ability to detect the correct language when a foreign language document is found. In other words, if the tool classifies a document as being in French, how likely is it to actually be French—and not, for example, Italian?
Sensitivity (also called recall), on the other hand, measures how reliably the tool finds a French document in the first place. For example, if there are French words in a document, how likely is the tool to find this text?
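The two metrics above can be computed from a labeled test set. The sketch below uses the standard definitions (precision = true positives / everything flagged as the language; sensitivity = true positives / everything actually in the language). The function name and the tiny sample data are made up for illustration and are not Everlaw's test data.

```python
def precision_and_sensitivity(true_labels, predicted_labels, language):
    """Compute precision and sensitivity for one language on a labeled test set."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == language and p == language)
    false_pos = sum(1 for t, p in pairs if t != language and p == language)
    false_neg = sum(1 for t, p in pairs if t == language and p != language)
    flagged = true_pos + false_pos   # documents the tool called `language`
    actual = true_pos + false_neg    # documents that really are `language`
    precision = true_pos / flagged if flagged else 0.0
    sensitivity = true_pos / actual if actual else 0.0
    return precision, sensitivity

# Illustrative data: one French document is missed, one Italian document is mislabeled.
truth = ["fr", "fr", "fr", "it", "en", "en"]
predicted = ["fr", "fr", "it", "fr", "en", "en"]
precision, sensitivity = precision_and_sensitivity(truth, predicted, "fr")
```

Here both values come out to 2/3: two of the three documents flagged as French really are French, and two of the three truly French documents were found.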

After tweaking the algorithm to maximize precision and sensitivity, the tool averages about 99% in both metrics on conventional data sets. Accuracy statistics vary from language to language, and Jordan will continue testing and improving the algorithm after the release.

Another challenge that Jordan came across was in accurately distinguishing between similar languages. Whether text is in English or in Chinese is easy to detect, in part because of their different character sets. However, telling apart statistically similar languages like Spanish and Portuguese is more difficult. Case administrators can help in this effort by specifying target languages that they expect to appear in their document collections.
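One way target languages could help is by narrowing the set of candidates before a final choice is made. The sketch below is an assumption about how that might look, not a description of the actual implementation: `restrict_to_targets` and the score values are hypothetical.

```python
def restrict_to_targets(scores, target_languages):
    """Keep only admin-specified target languages and renormalize their scores.

    scores: mapping of language code to a detector's probability (assumed format).
    """
    kept = {lang: s for lang, s in scores.items() if lang in target_languages}
    total = sum(kept.values())
    if total == 0:
        return {}  # none of the target languages were plausible
    return {lang: s / total for lang, s in kept.items()}

# Illustrative scores for a short passage that is hard to place.
scores = {"es": 0.40, "pt": 0.35, "it": 0.25}
# The admin expects Portuguese and Italian, but not Spanish, in this collection.
narrowed = restrict_to_targets(scores, {"pt", "it"})
best = max(narrowed, key=narrowed.get)
```

With Spanish ruled out, Portuguese becomes the top candidate for this passage.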

How Does It Work?

The feature leverages state-of-the-art open source software that uses machine learning to detect foreign languages. To maximize accuracy, the tool combines the results of two separate language detectors that take different approaches to pre-processing and scanning the text. This dual-detector system improves precision and sensitivity beyond what either detector achieves on its own.
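This post does not spell out how the two detectors' results are combined. Purely as an illustration, the sketch below averages the per-language probabilities from two detectors; the averaging rule, the function name `combine_detectors`, and the numbers are all assumptions.

```python
def combine_detectors(scores_a, scores_b):
    """Average two detectors' per-language probabilities (an assumed combination rule)."""
    languages = set(scores_a) | set(scores_b)
    return {
        lang: (scores_a.get(lang, 0.0) + scores_b.get(lang, 0.0)) / 2
        for lang in languages
    }

# Hypothetical outputs: the detectors disagree on a Spanish/Portuguese passage.
detector_a = {"es": 0.55, "pt": 0.45}
detector_b = {"es": 0.30, "pt": 0.70}
combined = combine_detectors(detector_a, detector_b)
best = max(combined, key=combined.get)
```

In this made-up case the second detector's stronger vote for Portuguese outweighs the first detector's slight preference for Spanish.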

First, the system analyzes enormous amounts of text to calculate how frequently specific combinations of characters appear in each language. Each sequence of n characters is classified as an “n-gram” (e.g. two characters are a bi-gram and four characters are a quad-gram). The system then determines the probability that each n-gram will appear in various languages. For instance, based on an analysis of a huge collection of text in different languages:

If a word begins with “z,” there is a 53% chance that the word is in German, and a 2% chance that the word is in English. This makes sense, since few English words start with “z.”
If a word contains the bi-gram “th,” there is a 74% chance that the word is in English, and a 1% chance that the word is in French.

The system uses this information to create statistical language profiles, and then uses these profiles to hunt for those languages within the document set.
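The profile idea can be sketched in a few lines. This is a toy version under stated assumptions: it uses bi-grams only, two one-sentence training samples invented for this example, and a simple sum of log frequencies with a small `floor` value for unseen n-grams. The real detectors are trained on far more text and may score documents differently.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Return every run of n consecutive characters in the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(sample_text, n=2):
    """Map each n-gram to its relative frequency in the sample."""
    counts = Counter(char_ngrams(sample_text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def score(text, profile, n=2, floor=1e-6):
    """Sum log frequencies; n-grams missing from the profile get a small floor value."""
    return sum(math.log(profile.get(gram, floor)) for gram in char_ngrams(text, n))

def detect(text, profiles, n=2):
    """Pick the language whose profile gives the text the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang], n))

# Tiny made-up training samples; a real profile needs far more text.
training = {
    "en": "the weather is nice and the children are at the theatre with their mother",
    "fr": "le temps est beau et les enfants sont au parc avec leur famille ce soir",
}
profiles = {lang: build_profile(sample) for lang, sample in training.items()}

english_guess = detect("the other thing", profiles)
french_guess = detect("les enfants sont beaux", profiles)
```

The “th” bi-gram appears repeatedly in the English sample and never in the French one, so text full of “th” scores much higher against the English profile.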

How Is It Better?

We’re not the first ediscovery platform to offer language detection, but whenever we take on a new feature, we strive to be the best. Here is where we think our version excels:

Clear highlighting: The tool doesn’t just tell you that a document has foreign text: it highlights that content for you, to quickly draw your eye to the relevant section. For example, the highlighting will bring your attention immediately to the use of a casual phrase like “C’est la vie!” in an email. That way, you can remove it from the translation batch, saving time and money.
In-place translation: But if you don’t happen to know what “C’est la vie” means, you can immediately see a rough translation. There’s no need to open Google Translate or to email a colleague: you have real-time machine translation directly in the tool.
Multiple languages at once: The tool can detect up to three different languages in each document. So, for example, if an email has English and Japanese text, and the latter includes Chinese Kanji characters, it will detect all three! Document-writers don’t just stick to one language, so why should your ediscovery tool?
Language targeting: The tool lets admins specify target languages. This helps focus reviewers, so they won’t be distracted by false positives—and it helps better batch by language, improving the workflow.

Languages Available for Detection in Everlaw

Everlaw supports the following 53 languages.

Albanian	Hebrew	Punjabi
Afrikaans	Hindi	Romanian
Arabic	Hungarian	Russian
Bulgarian	Indonesian	Slovak
Bengali	Italian	Slovene
Chinese (Simplified)	Japanese	Somali
Chinese (Traditional)	Kannada	Spanish
Croatian	Korean	Swahili
Czech	Latvian	Swedish
Danish	Lithuanian	Tagalog
Dutch	Macedonian	Tamil
English	Malayalam	Telugu
Estonian	Marathi	Thai
Finnish	Nepali	Turkish
French	Norwegian	Ukrainian
German	Persian	Urdu
Greek	Polish	Vietnamese
Gujarati	Portuguese

Everlaw’s advanced technology empowers organizations to navigate the increasingly complex ediscovery landscape, tackle the most pressing technological challenges, and chart a straighter path to the truth, transforming their approach to discovery, investigations, and litigation in the process.