Database and Evaluation Protocols for Arabic Printed Text Recognition - HES SO Valais Publications


Type of publication:	Techreport
Citation:	slim09:tr296
Number:	296-09-01
Year:	2009
Institution:	University of Fribourg, Department of Informatics
URL:	http://www.hennebert.org/downl...
Abstract:	We report on the creation of a database composed of images of Arabic printed text. The purpose of this database is the large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style text recognition systems in Arabic. Such systems take a text image as input and compute as output a character string corresponding to the text in the image. The database is called APTI, for Arabic Printed Text Image. The challenges addressed by the database lie in the variability of the sizes, fonts and styles used to generate the images. A focus is also placed on low-resolution images, where anti-aliasing generates noise on the characters to be recognized. The database is synthetically generated using a lexicon of 113’284 words, 10 Arabic fonts, 10 font sizes and 4 font styles. The database contains 45’313’600 single-word images totaling more than 250 million characters. Ground truth annotation is provided for each image in an XML file. The annotation includes the number of characters, the number of PAWs (Pieces of Arabic Word), the sequence of characters, the size, the style, the font used to generate each image, etc.
Keywords:	Arabic, benchmarking, OCR
Authors:	Slimane, Fouad; Ingold, Rolf; Kanoun, Slim; Alimi, Adel; Hennebert, Jean
Topics:	Institute of Informatics (II)
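
The image count quoted in the abstract can be checked against the stated generation parameters. The following is a minimal sketch, assuming that every lexicon word is rendered once in every combination of font, size and style; the abstract's figures are consistent with that assumption, but the record does not state it explicitly. The constant and function names are illustrative and do not come from the report.

```python
# Figures quoted in the abstract.
LEXICON_WORDS = 113_284
FONTS = 10
FONT_SIZES = 10
FONT_STYLES = 4


def total_word_images(words, fonts, sizes, styles):
    """Image count under the assumption that each word is rendered
    once in every font/size/style combination."""
    return words * fonts * sizes * styles


# Number of renderings of a single word across all combinations.
renderings_per_word = FONTS * FONT_SIZES * FONT_STYLES

total_images = total_word_images(LEXICON_WORDS, FONTS, FONT_SIZES, FONT_STYLES)

print(renderings_per_word)  # 400
print(total_images)         # 45313600
```

The product matches the 45’313’600 single-word images reported in the abstract.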
