Today we’ll continue our weekly series on our ediscovery chapter of a legal informatics book. In this series, we’re covering the ediscovery basics, core technical ediscovery concepts, the technologies powering ediscovery, and the future of ediscovery.
We’re pleased today to dive into one of the more popular technologies used in modern ediscovery: machine learning, also often referred to as artificial intelligence or AI. You can also get the ebook in full here.
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Instead, the computer uses your input to make educated guesses beyond that input. Machine learning is the foundation of the predictive coding technology that has transformed ediscovery, reducing the amount of data that need to be reviewed by upwards of 80% in some matters. It has also transformed machine translation, machine transcription, optical character recognition, and other technologies employed in ediscovery.
Of course, ediscovery is not the only field that has been disrupted by machine learning. Many of the services you use every day incorporate this technology. For example, any time you give a song a thumbs-up on Pandora, thereby receiving more (and more accurate) recommendations for additional music, you are taking advantage of machine learning. The recommendations you receive on Netflix or Amazon, after having watched some movies or shopped for some items, respectively, are also driven by machine learning. Even your search results on Google are informed by a machine learning algorithm that examines and learns from your past search and browsing behavior.
In most cases, the machine learning process is very similar:
- The machine learning system catalogs the features of each object in the corpus (e.g., songs, movies, products, or web sites). For a song, the features might include the key, tempo, lyrics, and artist(s).
- A human classifies some objects as relevant or irrelevant. In the Pandora example, this is giving songs a thumbs up or thumbs down.
- The machine learning system analyzes the human input to determine which features affect relevance, and how. There are at least a dozen different algorithmic approaches to this particular analysis.
- The machine learning system uses this information to classify the remaining objects as relevant or irrelevant.
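The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular ediscovery product works: treating each lowercase word as a feature, weighting features by smoothed log-odds, and the names `features`, `train`, `score`, and `classify` are all choices made for this example.

```python
import math
from collections import Counter

def features(text):
    # Step 1: catalog features. Assumption: each distinct lowercase word is a feature.
    return set(text.lower().split())

def train(labeled):
    # Step 3: learn which features affect relevance, and how.
    # Assumption: a smoothed log-odds weight per feature (one of many possible approaches).
    rel, irr = Counter(), Counter()
    n_rel = n_irr = 0
    for text, is_relevant in labeled:
        if is_relevant:
            rel.update(features(text))
            n_rel += 1
        else:
            irr.update(features(text))
            n_irr += 1
    weights = {}
    for f in set(rel) | set(irr):
        p_rel = (rel[f] + 1) / (n_rel + 2)  # add-one smoothing avoids log(0)
        p_irr = (irr[f] + 1) / (n_irr + 2)
        weights[f] = math.log(p_rel / p_irr)
    return weights

def score(weights, text):
    # Unknown features contribute nothing to the score.
    return sum(weights.get(f, 0.0) for f in features(text))

def classify(weights, text):
    # Step 4: classify a remaining object as relevant (True) or irrelevant (False).
    return score(weights, text) > 0

# Step 2: a human classifies a few objects (hypothetical sample data).
labeled = [
    ("merger pricing agreement", True),
    ("pricing memo for merger", True),
    ("lunch menu friday", False),
    ("friday office party", False),
]
weights = train(labeled)
print(classify(weights, "draft merger agreement"))  # True
print(classify(weights, "party menu"))              # False
```

Real systems use far richer features and more robust algorithms, but the shape of the process is the same: features in, human labels in, learned weights out, predictions on everything else.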
In ediscovery, machine learning for document classification goes by many names, but the two most common are technology assisted review (TAR) and predictive coding, which are used more or less interchangeably. Regardless of what they are called, they generally fall into one of two categories:
- Simple Passive Learning (or TAR 1.0): A subject matter expert classifies a set of documents to train the system, and the reliability of the system’s predictions is tested as more documents are classified. Once the performance is acceptable, training stops and the prediction model is rolled out to all remaining documents.
- Continuous Active Learning (or TAR 2.0): All review decisions automatically train the system, and the system continually updates the predictions as new human classifications are made.
TAR 2.0 has many advantages over TAR 1.0. TAR 1.0 requires experts to do the initial training, and it is less effective because it can’t learn from review decisions made after training ends. TAR 1.0 also cannot handle rolling productions without starting over. Finally, TAR 1.0 doesn’t work well when the proportion of relevant documents is low.
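To make the difference concrete, here is a rough sketch of a continuous active learning loop. Everything in it is an assumption for illustration: the toy word-count model, the `reviewer` callback that stands in for a human reviewer, and the function and parameter names are all hypothetical. The point is only the loop structure, in which every review decision feeds straight back into training.

```python
from collections import Counter

def train(labeled):
    # Toy model (assumption): each word scores +1 per relevant doc, -1 per irrelevant doc.
    weights = Counter()
    for text, is_relevant in labeled:
        for word in set(text.lower().split()):
            weights[word] += 1 if is_relevant else -1
    return weights

def score(weights, text):
    return sum(weights[w] for w in set(text.lower().split()))

def continuous_active_learning(docs, reviewer, seed, batch_size=1, rounds=3):
    labeled = list(seed)
    seen = {text for text, _ in labeled}
    unreviewed = [d for d in docs if d not in seen]
    for _ in range(rounds):
        if not unreviewed:
            break
        # Retrain on every decision made so far, then re-rank what is left.
        weights = train(labeled)
        unreviewed.sort(key=lambda d: score(weights, d), reverse=True)
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        # Each new human decision becomes training data for the next round.
        labeled.extend((d, reviewer(d)) for d in batch)
    return labeled, unreviewed

# Hypothetical data; the lambda simulates a reviewer's relevance calls.
docs = ["merger pricing terms", "merger closing schedule", "lunch menu", "holiday party"]
seed = [("merger agreement draft", True), ("office lunch order", False)]
labeled, remaining = continuous_active_learning(
    docs, reviewer=lambda d: "merger" in d, seed=seed, rounds=2
)
print([text for text, _ in labeled[len(seed):]])  # documents reviewed, in order
print(remaining)                                   # documents not yet reviewed
```

A TAR 1.0 workflow, by contrast, would call `train` once on the expert's training set and then apply the frozen weights to everything else.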
As with any specialized domain, there is some jargon with which it helps to be familiar:
- Richness or prevalence: the percentage of documents in the dataset that are relevant. So, if there are 100 documents in total, and only 10 are relevant, the prevalence is 10%.
- Recall: the percentage of all relevant documents that the system retrieves. So, if there are 10 relevant documents in the dataset, and the system correctly identifies 6 of them, the recall would be 60%. This is a measure of the completeness of the results.
- Precision: the percentage of retrieved documents that are relevant. So, if the system correctly identifies 6 relevant documents, but incorrectly identifies another 54 as relevant, then precision would be 10%. This is a measure of the purity of the results.
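The arithmetic behind these three definitions is simple enough to check in code. The sketch below reuses the numbers from the examples above, combined into one hypothetical dataset of 100 documents; the document IDs and function names are assumptions made for illustration.

```python
def prevalence(total_docs, relevant):
    # Share of the whole dataset that is relevant.
    return len(relevant) / total_docs

def recall(relevant, retrieved):
    # Share of the relevant documents that the system retrieved.
    return len(relevant & retrieved) / len(relevant)

def precision(relevant, retrieved):
    # Share of the retrieved documents that are actually relevant.
    return len(relevant & retrieved) / len(retrieved)

total_docs = 100
relevant = set(range(10))                          # documents 0-9 are relevant
retrieved = set(range(6)) | set(range(10, 64))     # 6 correct hits plus 54 false hits

print(f"prevalence: {prevalence(total_docs, relevant):.0%}")  # prevalence: 10%
print(f"recall:     {recall(relevant, retrieved):.0%}")       # recall:     60%
print(f"precision:  {precision(relevant, retrieved):.0%}")    # precision:  10%
```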
Precision and recall are used almost universally as closely related measures of a predictive coding model’s performance. There is an inherent tradeoff between the two: you can maximize recall by simply classifying everything as relevant, but that would drag precision down to the prevalence rate, because your “relevant” set would include every irrelevant document as well. Similarly, classifying only one document as relevant might maximize precision (assuming this one document is indeed relevant), but it would minimize recall if there were in fact more than one relevant document. The ideal predictive coding model seeks to optimize this tradeoff, although there are some situations where one metric is more important than the other (e.g., recall, in the hunt for a “smoking gun” document).
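One way to see this tradeoff is to imagine that the model assigns each document a relevance score and that everything at or above a cutoff is treated as relevant. The sketch below assumes exactly that setup; the scores, the cutoff values, and the function name are hypothetical, and real tools may expose the tradeoff differently.

```python
def precision_recall_at(threshold, scored):
    # scored: list of (model_score, truly_relevant) pairs.
    retrieved = [is_rel for s, is_rel in scored if s >= threshold]
    true_positives = sum(retrieved)
    total_relevant = sum(is_rel for _, is_rel in scored)
    # Convention (assumption): with nothing retrieved, precision is reported as 1.0.
    precision = true_positives / len(retrieved) if retrieved else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall

# Hypothetical model scores paired with the true relevance of each document.
scored = [
    (0.95, True), (0.80, True), (0.65, False),
    (0.60, True), (0.30, False), (0.10, False),
]

for threshold in (0.0, 0.5, 0.9):
    p, r = precision_recall_at(threshold, scored)
    print(f"cutoff {threshold:.1f}: precision {p:.2f}, recall {r:.2f}")
```

Lowering the cutoff to 0.0 retrieves everything, so recall is perfect while precision falls to the prevalence of this small sample. Raising it to 0.9 retrieves a single relevant document, so precision is perfect while recall drops to one in three.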
The predictions generated by a predictive coding model can be enormously powerful. They can be used to avoid the manual review of documents that are almost certain to be uninteresting, saving potentially thousands or even millions of dollars in review time. They can be used to prioritize documents for relevance review, making the review process much more efficient, or for other kinds of review, such as identifying potentially privileged documents, reducing the chance of a costly or damaging clawback. They can also be used to double-check review work (i.e., by looking for misalignment between the human and machine classifications), making the review process much more accurate. With these benefits, it is not surprising that the vast majority of corporate counsel are using predictive coding on their cases, nor that courts are routinely approving the use of predictive coding in the cases before them.
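As a small illustration of the double-checking use case, the sketch below flags documents where a reviewer's call conflicts with a confident machine prediction. The score scale, the `threshold` and `margin` parameters, and the document IDs are all assumptions for the example rather than features of any specific review platform.

```python
def flag_disagreements(reviewed, threshold=0.5, margin=0.3):
    # reviewed: list of (doc_id, human_relevant, model_score) tuples.
    flagged = []
    for doc_id, human_relevant, model_score in reviewed:
        machine_relevant = model_score >= threshold
        # Only flag when the model is far from the cutoff, i.e. reasonably confident.
        confident = abs(model_score - threshold) >= margin
        if machine_relevant != human_relevant and confident:
            flagged.append(doc_id)
    return flagged

# Hypothetical review results.
reviewed = [
    ("DOC-1", True, 0.92),   # human and machine agree
    ("DOC-2", False, 0.88),  # human says irrelevant, machine strongly disagrees
    ("DOC-3", True, 0.07),   # human says relevant, machine strongly disagrees
    ("DOC-4", False, 0.45),  # agree (machine is below the cutoff)
    ("DOC-5", True, 0.55),   # agree
]
print(flag_disagreements(reviewed))  # ['DOC-2', 'DOC-3']
```

Flagged documents are candidates for a second look, not proof of error: either the reviewer or the model may be the one that is wrong.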
Next in our series, we’ll cover machine translation and its use during foreign language review of a dataset.