Jun's Corner

Jun Kim's method is a useful tool for anyone looking to improve their text cleaning process and streamline their own data analysis.

Python Cleaning Method

Text cleaning is the process of removing unwanted or irrelevant information from a text document, such as special characters, punctuation, and duplicate words. The goal is to prepare the text for analysis and make it easier to understand while keeping it a close match to the original text.
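The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the project: the function name `clean_text` is hypothetical, and "duplicate words" is assumed here to mean a word repeated immediately after itself.

```python
import re


def clean_text(text: str) -> str:
    """Strip punctuation and special characters, then drop immediately repeated words."""
    # Replace every character that is not a letter, digit, or whitespace with a space.
    # (\w also matches the underscore, so it is handled separately.)
    text = re.sub(r"[^\w\s]|_", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    words = text.split()
    cleaned = []
    for word in words:
        # Assumption: a duplicate is a word equal (ignoring case) to the one before it.
        if not cleaned or cleaned[-1].lower() != word.lower():
            cleaned.append(word)
    return " ".join(cleaned)


print(clean_text("The  the quick,, brown #fox!"))
```

Because removing punctuation changes the text, a rule like this is best applied only where the analysis does not depend on the characters being removed.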

Precleaning Text in Preparation for Hand Cleaning

By precleaning the text, we greatly reduced the amount of work the RAs had to do and sped up the hand-cleaning stage overall. On good-quality scans, we achieved >94% accuracy prior to hand cleaning. The most common issues we found were related to the kerning and leading of the text, which resulted in misspelled words, and to the OCR hiccupping and identifying f as long s (ſ).
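How the accuracy figure was measured is not stated here. One possible way to score precleaned OCR output against a hand-cleaned reference is a character-level similarity ratio, sketched below with Python's standard `difflib`. The function name `character_accuracy` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text: str, reference_text: str) -> float:
    """Return a similarity ratio between 0 and 1 for OCR output versus a reference."""
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long sequences, which would distort scores on full pages.
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    return matcher.ratio()


# Hypothetical example: the OCR has read an f as a long s.
score = character_accuracy("the ſirst scan", "the first scan")
print(f"{score:.1%}")
```

A ratio like this is only an approximation of accuracy, since it rewards matching character runs rather than counting individual substitutions, insertions, and deletions.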

Cleaning the Scans

Some scans had too much noise, skew, and other defects that degraded the OCR output, so we set out to design a method to automate cleaning the images.

As we can see, applying various filters can correct issues that affect the quality of the OCR. However, we had to be careful, as the filters themselves can also degrade the quality of the text.

As seen here, after running a Median Blur filter, the letters became less identifiable.
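To illustrate why a median blur can hurt legibility, here is a small pure-Python sketch of a 3x3 median filter applied to a toy image. It is not the project's pipeline, which would more likely call a library routine such as OpenCV's `cv2.medianBlur`. The function name `median_blur` and the sample `page` are made up for this example.

```python
from statistics import median


def median_blur(image, ksize=3):
    """Apply a ksize x ksize median filter to a 2D list of pixel values.

    Pixels outside the image are filled by repeating the nearest edge pixel.
    """
    radius = ksize // 2
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clamping coordinates at the borders.
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            out[y][x] = median(window)
    return out


# 0 = ink, 255 = paper: a one-pixel-wide vertical stroke (column 2)
# plus a single speck of noise on the right edge.
page = [
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 0],
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
]
blurred = median_blur(page)
for row in blurred:
    print(row)
```

In this toy case the filter removes the speck of noise, but it also erases the thin stroke entirely, because ink never makes up the majority of any 3x3 window. The same effect on fine serifs and hairlines is one reason letters can become less identifiable after blurring.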