Tesseract (software)

Infobox Software
name = Tesseract
author = Ray Smith, Hewlett-Packard [Google (2008). "tesseract-ocr". http://code.google.com/p/tesseract-ocr/. Accessed 2008-07-12.]
developer = Google
latest release version = 2.03
latest release date = April 22, 2008
programming language = C and C++
operating system = Linux, Windows and (unofficially) Mac OS X
genre = Optical character recognition
license = Apache License v2.0
website = http://code.google.com/p/tesseract-ocr/

In computer software, Tesseract is a free optical character recognition engine. It was originally developed at Hewlett-Packard from 1985 until 1995. After ten years with no development, Hewlett-Packard and the University of Nevada, Las Vegas (UNLV) released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0. [Vincent, Luc (August 2006). "Announcing Tesseract OCR". http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html. Accessed 2008-06-26.] [Canonical Ltd. (June 2008). "OCR". https://help.ubuntu.com/community/OCR. Accessed 2008-07-12.]

Tesseract is considered one of the most accurate free software OCR engines currently available. [Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". http://www.linux.com/articles/57222. Accessed 2008-07-18.]

The current version of Tesseract is 2.03, released April 22, 2008. [Google (2008). "tesseract-ocr downloads". http://code.google.com/p/tesseract-ocr/downloads/list. Accessed 2008-07-12.]

About the Tesseract OCR Engine

Tesseract is a raw OCR engine. It has no document layout analysis, no output formatting, and no graphical user interface. It only processes a TIFF image of a single column and creates text from it. TIFF compression is not supported unless libtiff is installed. It can distinguish fixed-pitch from proportional text. The engine was in the top three in terms of character accuracy in 1995. It compiles and runs on Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu Linux are rigorously tested by developers.

Tesseract can process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. It can be trained to work in other languages as well.

Tesseract is suitable for use as a backend, and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus. Further integration with programs such as OCRopus, to better support complicated layouts, is planned. Likewise, frontends such as FreeOCR can add a GUI to make the software easier to use for manual tasks. [Softi Software (2008). "FreeOCR.net V2.4 Free OCR Software". http://softi.co.uk/freeocr.htm. Accessed 2008-06-26.]

History

The Tesseract engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co., Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some conversion to C++ in 1998. Much of the code was originally written in C, and more was later written in C++. Since then all the code has been converted to at least compile with a C++ compiler.

Currently Tesseract builds under Linux with GCC 2.95 or later and under Windows with Visual C++ 6. The C++ code makes heavy use of a list system built with macros. This predates the C++ Standard Template Library and may be more efficient than Standard Template Library lists, but is reportedly harder to debug when a segmentation fault occurs. Another side effect of the C/C++ split is that the C++ data structures are converted to C data structures in order to call the low-level C code. This is clumsy, and the conversion of the C code to C++ is a step towards eliminating it, but that has not happened yet.

Usage

Tesseract is an OCR engine without a graphical user interface; it runs from the command line. It may be called using the following format: [Google Code - Tesseract Readme. http://code.google.com/p/tesseract-ocr/wiki/ReadMe]

    tesseract image.tif output [options]

Tesseract handles image files in TIFF format (with filename extension .tif); other file formats need to be converted to TIFF before being submitted to Tesseract.
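The command-line format described above can be wrapped in a script. The following is a minimal Python sketch, not part of Tesseract itself: the function names `build_tesseract_command` and `run_tesseract` are hypothetical, and accepting the `.tiff` extension alongside `.tif` is an assumption. It builds the argument list in the order image, output name, options, and rejects non-TIFF input before calling the engine.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, options=()):
    """Return the argument list for: tesseract image.tif output [options].

    Hypothetical helper for illustration; not part of Tesseract.
    """
    image = Path(image_path)
    # The article states that only TIFF input is handled, so reject other
    # extensions early. Accepting ".tiff" as well is an assumption.
    if image.suffix.lower() not in (".tif", ".tiff"):
        raise ValueError(f"expected a TIFF image, got {image.name!r}")
    return ["tesseract", str(image), str(output_base), *options]


def run_tesseract(image_path, output_base, options=()):
    """Run the command; requires a `tesseract` executable on PATH."""
    command = build_tesseract_command(image_path, output_base, options)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
```

A call such as `run_tesseract("page.tif", "page")` would then invoke `tesseract page.tif page`. Any options passed in the third argument are forwarded unchanged, so which options are valid depends on the installed version and should be checked against its Readme.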

Tesseract does not support layout analysis, which means that it cannot interpret multi-column text, images or equations and in these cases will produce a garbled text output.

References

See also

*OCRopus
*Document Layout Analysis

External links

* [http://code.google.com/p/tesseract-ocr/ Tesseract OCR] Project page on Google Code
* [http://www.isri.unlv.edu/ Information Science Research Institute at the University of Nevada, Las Vegas]
* http://www.ocropus.org/ - A high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau and novel high-performance layout analysis framework, currently using Tesseract as the OCR plugin.
* http://tesseract-ocr.repairfaq.org/ - C/C++ structure of Tesseract extracted from Doxyfied source code (based on Tesseract V1.03)
* [http://sourceforge.net/projects/archivista Archivista Box] - A complete GPL document management system based on Tesseract and Linux.
* [http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html Tesseract - Summary] - some patches for training on a 64-bit machine.


Wikimedia Foundation. 2010.
