Generating Synthetic Data for Text Analysis Systems

Title: Generating Synthetic Data for Text Analysis Systems
Publication Type: Conference Papers
Year of Publication: 1995
Authors: Doermann D, Yao S
Conference Name: SDAIR
Date Published: 1995
Abstract

In this paper we describe work on a system for modeling errors in the output of
OCR systems. The project is motivated by
the desire to evaluate the performance of
various text analysis systems under varying,
yet controlled conditions. We describe a set
of symbol and page models which are used
to degrade an ideal text by introducing errors
which typically occur during scanning,
decomposition and recognition of document
images. A first generation of the software is
described which implements the page models
and allows the use of transition probabilities,
either extracted from real data or
generated synthetically, to corrupt text.
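
As a rough illustration of corrupting text with transition probabilities, the following minimal Python sketch samples a replacement for each character from a per-character distribution. It is not the authors' software: the table `TRANSITIONS`, its probability values, and the function name `corrupt` are all hypothetical, chosen only to show the general idea of symbol-level degradation.

```python
import random

# Hypothetical transition table: P(observed output | true character).
# All values are invented for illustration; they do not come from the paper.
TRANSITIONS = {
    "l": {"l": 0.90, "1": 0.06, "I": 0.04},
    "o": {"o": 0.95, "0": 0.05},
    "m": {"m": 0.92, "rn": 0.08},           # one symbol split into two
    "e": {"e": 0.97, "c": 0.02, "": 0.01},  # empty string models a deletion
}

def corrupt(text, transitions, rng):
    """Degrade text by sampling each character's output from its distribution."""
    out = []
    for ch in text:
        dist = transitions.get(ch)
        if dist is None:
            out.append(ch)  # characters without an entry pass through unchanged
            continue
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        out.append(rng.choices(outcomes, weights=weights, k=1)[0])
    return "".join(out)

rng = random.Random(0)  # fixed seed so a run can be repeated
print(corrupt("some model text", TRANSITIONS, rng))
```

In a setup like this, the table could in principle be estimated from aligned pairs of ideal and recognized text or written by hand, which mirrors the abstract's distinction between probabilities extracted from real data and those generated synthetically.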