This is the tenth post in our weekly ediscovery series covering our ediscovery chapter of a legal informatics textbook. In this series, we’ve covered the ediscovery basics, core technical ediscovery concepts, and the technologies powering ediscovery (encryption, machine learning, transcoding, etc.), and we’ll soon get to the future of ediscovery. You can also download the ebook in full.
Today we’ll dive into something you may all use without being familiar with its technical details: optical character recognition, or “OCR.”
Optical Character Recognition
Even more basic than recognizing text in audio and video files is the task of recognizing text in images. It’s not unheard of, for instance, for emails to be produced in ediscovery as TIFFs without either embedded text or accompanying text files. In those situations, making the text searchable with accurate optical character recognition (OCR) is the only solution.
OCR is a complex process. Because OCR engines must deal with a wide variety of inputs, including everything from scanned receipts to photos of book pages, they commonly perform a number of pre-processing steps to normalize inbound data. These steps include deskewing (rotating the page so that its lines of text run perfectly horizontal or vertical), removing lines and spots, and analyzing the layout of the page and structure of the text.
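To make one of these pre-processing steps concrete, here is a minimal sketch of spot removal. It assumes the page has already been converted to a black-and-white grid of pixels, and it treats any dark pixel with no dark neighbors as noise. The names Bitmap and remove_isolated_spots are hypothetical, chosen for illustration; production OCR engines use considerably more sophisticated filters.

```python
from typing import List

# Assumption for this sketch: 1 = ink (dark pixel), 0 = background.
Bitmap = List[List[int]]


def remove_isolated_spots(page: Bitmap) -> Bitmap:
    """Return a copy of `page` with ink pixels that have no ink neighbors cleared."""
    height = len(page)
    width = len(page[0]) if height else 0
    cleaned = [row[:] for row in page]  # work on a copy, leave the input untouched

    for y in range(height):
        for x in range(width):
            if page[y][x] != 1:
                continue
            # Look at the (up to) eight surrounding pixels.
            has_ink_neighbor = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and page[ny][nx] == 1:
                        has_ink_neighbor = True
            if not has_ink_neighbor:
                cleaned[y][x] = 0  # a lone speck: treat it as scanner noise
    return cleaned


# A tiny "page": one stray speck in the top-left corner and a vertical stroke.
noisy_page = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
clean_page = remove_isolated_spots(noisy_page)
```

In this example the speck in the corner is erased while the three-pixel stroke survives, because each pixel of the stroke touches at least one other dark pixel.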
With pre-processing complete, the task of recognizing characters begins. There are two primary approaches: pattern matching and feature extraction. The former compares each character, pixel-by-pixel, with a library of stored character images to look for a match. The latter is more modern and, predictably, uses machine learning to develop a more nuanced understanding of the features defining the text and the wider document. Feature extraction can yield accuracy of up to 99%.
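The pattern-matching approach can be sketched in a few lines. This is an illustration only: it assumes each character has already been isolated and scaled to the same 3x5 grid as the stored images, and the names TEMPLATES, match_score, and recognize are hypothetical. A real engine would use a far larger library covering many fonts and sizes.

```python
from typing import Dict, List, Tuple

# Assumption for this sketch: each glyph is a list of rows, "#" = ink, "." = background.
Glyph = List[str]

# A toy library of stored character images (hypothetical, three letters only).
TEMPLATES: Dict[str, Glyph] = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(glyph: Glyph, template: Glyph) -> float:
    """Fraction of pixels on which the glyph and the template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for glyph_pixel, template_pixel in zip(glyph_row, template_row):
            total += 1
            if glyph_pixel == template_pixel:
                same += 1
    return same / total


def recognize(glyph: Glyph) -> Tuple[str, float]:
    """Return the best-matching character and its pixel-agreement score."""
    best_char = max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))
    return best_char, match_score(glyph, TEMPLATES[best_char])


# A scanned "T" with one pixel of its stem missing.
scanned = ["###", ".#.", ".#.", "...", ".#."]
character, score = recognize(scanned)
```

Here the damaged glyph still agrees with the stored "T" on 14 of 15 pixels, more than with any other template, so it is recognized as "T". The same example hints at the weakness of pure pattern matching: a glyph in an unfamiliar font may not closely match any stored image, which is part of why feature-based approaches took over.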
Over time, it is likely that the tools used for OCR will merge with those used for machine translation and transcription, as providers aim to consolidate and harmonize their machine learning approaches. Indeed, Microsoft and Google both offer on-demand OCR services as part of their computer vision tools for recognizing people, places, objects, and other elements beyond merely the text within a given image. Regardless of how it is packaged, however, OCR is likely to decline in importance over time as written materials make up less of the data in litigation.
The next post in our series will expound on the importance of a consumer-grade user experience in any modern ediscovery platform.