Term selection for searching printed Arabic
Title | Term selection for searching printed Arabic |
Publication Type | Conference Papers |
Year of Publication | 2002 |
Authors | Darwish K, Oard D |
Conference Name | Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval |
Date Published | 2002 |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 1-58113-561-0 |
Keywords | Arabic, information retrieval, OCR, term selection |
Abstract | Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions. |
URL | http://doi.acm.org/10.1145/564376.564423 |
DOI | 10.1145/564376.564423 |
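
The abstract names two kinds of indexing terms, character n-grams and lightly stemmed words, and reports that combining them was robust. The sketch below only illustrates what such a combined term list could look like; it is not the authors' implementation. The function names, the choice of n = 3, and the toy stemmer (which strips nothing but the definite article "ال") are all assumptions made for this example.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a word.

    A word no longer than n is returned whole as its only term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def toy_light_stem(word: str) -> str:
    """Hypothetical stand-in for a light stemmer (assumption, not the paper's).

    Strips only the definite article prefix, and only when at least
    two characters would remain afterwards.
    """
    prefix = "ال"
    if word.startswith(prefix) and len(word) > len(prefix) + 1:
        return word[len(prefix):]
    return word


def index_terms(text: str, n: int = 3) -> list[str]:
    """Combine both term types: one stem plus the n-grams of each word."""
    terms: list[str] = []
    for word in text.split():  # whitespace tokenization, for illustration only
        terms.append(toy_light_stem(word))
        terms.extend(char_ngrams(word, n))
    return terms


if __name__ == "__main__":
    # One stemmed term followed by the trigrams of the surface word.
    print(index_terms("الكتاب"))
```

Taking n-grams from the surface word rather than from the stem is a design choice of this sketch; the abstract does not say which variant the paper evaluated.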