Shape Codebook based Handwritten and Machine Printed Text Zone Extraction

Title: Shape Codebook based Handwritten and Machine Printed Text Zone Extraction
Publication Type: Conference Papers
Year of Publication: 2011
Authors: Kumar J, Prasad R, Cao H, Abd-Almageed W, Doermann D, Natarajan P
Conference Name: Document Recognition and Retrieval
Date Published: 2011/01
Conference Location: San Francisco
Abstract

We present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features, which encode local shape characteristics of text in a consistent manner. We first construct two different codebooks from the shape features extracted from a set of handwritten and printed text documents. We then compute the normalized histogram of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. Because the approach is codebook-based, our method is robust to the background noise present in the image. The TAS features used are invariant to translation, scale and rotation of text. Our experimental results show that a pixel-weighted zone classification accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our method for document page classification and show that high precision can be achieved for machine-printed documents. The proposed method is robust to the size of zones, which may contain text content at the word, line or paragraph level.
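
The histogram step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D vectors stand in for TAS features, the Euclidean nearest-codeword assignment and the concatenation of the two histograms are assumptions made for illustration, and all function names are hypothetical. The resulting vector is the kind of per-zone descriptor that could then be passed to an SVM classifier.

```python
import math

def nearest_codeword(feature, codebook):
    """Index of the codeword closest to `feature` (Euclidean distance, an assumption)."""
    return min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))

def normalized_histogram(features, codebook):
    """Count nearest-codeword matches for a zone and normalize the counts to sum to 1."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_codeword(feature, codebook)] += 1
    total = sum(counts)
    # An empty zone yields an all-zero histogram instead of dividing by zero.
    return [c / total for c in counts] if total else [0.0] * len(codebook)

def zone_descriptor(features, handwritten_codebook, printed_codebook):
    """Concatenate the histograms over both codebooks (concatenation is an assumption)."""
    return (normalized_histogram(features, handwritten_codebook)
            + normalized_histogram(features, printed_codebook))

# Toy codebooks and zone features; real TAS features would be higher-dimensional.
handwritten_codebook = [(0.0, 0.0), (1.0, 1.0)]
printed_codebook = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
zone_features = [(0.1, 0.1), (0.9, 1.0), (0.2, 0.0), (0.1, 0.2)]

descriptor = zone_descriptor(zone_features, handwritten_codebook, printed_codebook)
print(descriptor)
```

Because each histogram is normalized by the number of features in the zone, the descriptor has the same length and scale whether the zone holds a word, a line or a paragraph, which is consistent with the robustness to zone size claimed in the abstract.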