Jun's Corner
Jun Kim's method is an excellent tool for anyone to enhance their text cleaning process and optimize their own data analysis.
Python Cleaning Method
Text cleaning is the process of removing unwanted or irrelevant information from a text document. This includes removing special characters, punctuation, and duplicate words. The goal of text cleaning is to prepare the text for analysis, making it easier to understand while keeping it a close match to the original text.
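As a rough illustration of the kind of cleaning described above, here is a minimal Python sketch. It is not the project's actual script: the function name `clean_text` is hypothetical, and it assumes that "duplicate words" means a word repeated immediately after itself.

```python
import re

def clean_text(text: str) -> str:
    """Strip special characters and punctuation, then drop immediately repeated words."""
    # Keep letters, digits, and whitespace; replace everything else with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse consecutive duplicate words (case-insensitive), e.g. "the the" -> "the".
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Normalize runs of whitespace to single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("The the quick, brown fox!! jumps"))
```

Note that this simple version also removes apostrophes and hyphens, so a real pipeline would likely need a gentler rule for punctuation inside words.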
Precleaning Text in Preparation for Hand Cleaning
By precleaning the text, we greatly reduced the amount of work the RAs had to do and sped up the hand-cleaning portion overall. On good-quality scans, we achieved >94% accuracy prior to hand cleaning. The most common issues we found were related to the kerning and leading of the text, which resulted in misspelled words, and to the OCR hiccupping and identifying f as long s (ſ).
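The write-up does not say how the accuracy figure was measured. One plausible way to estimate it, sketched here as an assumption, is to compare the OCR output against a hand-cleaned reference character by character using Python's standard `difflib`. The function name `character_accuracy` is hypothetical.

```python
from difflib import SequenceMatcher

def character_accuracy(ocr_text: str, reference_text: str) -> float:
    """Return the fraction of reference characters that the OCR output matches."""
    if not reference_text:
        return 1.0 if not ocr_text else 0.0
    # autojunk=False keeps the matcher from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, reference_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference_text)

# An f read as long s costs one character out of nine.
score = character_accuracy("the ſirst", "the first")
```

A word-level measure would penalize the same error more heavily, since one wrong character spoils the whole word.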
Cleaning the Scans
Some scans had too much noise, skewing, and other obstacles that affected the overall output of the OCR, so we set out to design a method to automate the process of cleaning the images.
As we can see, by applying various filters, we can correct issues that affect the quality of the OCR. However, we had to be careful, as the filters themselves can also degrade the quality of the text.
As seen here, after running a Median Blur filter, the letters became less identifiable.
Through trial and error, we were able to find a set of filters and strengths that allowed us to clean the page and improve the quality.
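To show why a median blur can help and hurt at the same time, here is a small pure-Python sketch of a 3x3 median filter on a toy grid. It is only an illustration: the filters and strengths we settled on are not listed here, and a real preprocessing script would use an image library rather than nested lists. The name `median_blur_3x3` is hypothetical.

```python
from statistics import median

def median_blur_3x3(image):
    """Apply a 3x3 median filter to a 2D list of pixel values, leaving the border unchanged."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Gather the pixel and its eight neighbours, then keep the middle value.
            window = [image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            result[y][x] = median(window)
    return result

# 0 = white paper, 1 = black ink.
speck = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
stroke = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
despeckled = median_blur_3x3(speck)  # the isolated noise pixel is removed
eroded = median_blur_3x3(stroke)     # the interior of the thin stroke is removed too
```

The same filter that wipes out a stray speck also erases a one-pixel-wide stroke, which is why the filter strength has to be tuned against the thickness of the letterforms.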
Original VS Altered
Step By Step Process
Step 1.
Create a new folder and run the custom .bat file to create the necessary folders and files.
Step 2.
Convert the PDF files into .jpg images.
Step 3.
Run preprocessing script and select the images.
Step 4.
Run Rescribe and select the folder with the preprocessed images.
Step 5.
Locate the text folder and run the postprocessing script.
Step 6.
Package and zip the postprocessed text for use.
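Step 6 can be done by hand, but it is also easy to script. The sketch below uses Python's standard `shutil.make_archive`. The function name `package_text` and the folder and archive names in the usage example are hypothetical, not the ones our scripts use.

```python
import shutil
from pathlib import Path

def package_text(text_dir: str, archive_stem: str) -> Path:
    """Zip every file under text_dir into <archive_stem>.zip and return the archive path."""
    source = Path(text_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"text folder not found: {source}")
    # make_archive appends the ".zip" extension and returns the final file name.
    archive = shutil.make_archive(archive_stem, "zip", root_dir=source)
    return Path(archive)

# Example usage (hypothetical paths):
# package_text("project/text_postprocessed", "project/cleaned_text")
```

Keep the archive outside the folder being zipped, so the zip file does not end up inside itself.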