Information extraction



In natural language processing, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, that is, categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents. An example is the extraction of instances of corporate mergers, more formally MergerBetween(company_1, company_2, date), from an online news sentence such as: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." A broad goal of IE is to allow computation on previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences from the logical content of the input data.
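
The merger example can be illustrated with a minimal sketch. It assumes a single hand-written regular expression that only covers sentences shaped like the one quoted above; the names `MergerBetween`, `PATTERN`, and `extract_merger` are hypothetical, and real IE systems rely on far more robust linguistic analysis than this.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """Structured record mirroring MergerBetween(company_1, company_2, date)."""
    company_1: str
    company_2: str
    date: str  # kept as the raw expression; resolving "Yesterday" needs the article date


# Hypothetical pattern: "<date>, <...>based <Company> announced their acquisition of <Company>"
# A company name is assumed to be a run of capitalized words, optionally ending in a period.
PATTERN = re.compile(
    r"^(?P<date>\w+), .*?based (?P<buyer>[A-Z]\w*(?: [A-Z]\w*\.?)*) "
    r"announced (?:their|its) acquisition of (?P<target>[A-Z]\w*(?: [A-Z]\w*\.?)*)"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return a MergerBetween record, or None if the sentence does not fit the pattern."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return MergerBetween(match.group("buyer"), match.group("target"), match.group("date"))


sentence = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
record = extract_merger(sentence)
print(record)
```

Once the sentence is reduced to a record of this kind, its fields can be compared, joined, or stored like any other structured data, which is the "computation on previously unstructured data" described above.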

The significance of IE stems from the growing amount of information available in unstructured form (that is, without metadata), for instance on the Internet. This information can be made more accessible by transforming it into relational form or by marking it up with XML tags. For example, an intelligent agent monitoring a news data feed requires IE to transform the unstructured data into something it can reason with.

A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information. Current approaches to IE use natural language processing techniques that focus on very restricted domains. For example, the Message Understanding Conference (MUC), a competition-based conference, focused on the following domains:
*MUC-1 (1987), MUC-2 (1989): Naval operations messages.
*MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
*MUC-5 (1993): Joint ventures and microelectronics domain.
*MUC-6 (1995): News articles on management changes.
*MUC-7 (1998): Satellite launch reports.
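
The typical application mentioned above, populating a database with extracted information, might look like the following sketch. It assumes the extraction step has already produced tuples; the `merger` table, its column names, and the sample row are hypothetical and are based only on the merger example in this article.

```python
import sqlite3

# Hypothetical output of an upstream extractor: (doc_id, company_1, company_2, date).
extracted = [
    ("doc-1", "Foo Inc.", "Bar Corp.", "Yesterday"),
]

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE merger (doc_id TEXT, company_1 TEXT, company_2 TEXT, date TEXT)"
)
conn.executemany("INSERT INTO merger VALUES (?, ?, ?, ?)", extracted)
conn.commit()

# The extracted facts can now be queried like any other relational data.
rows = conn.execute(
    "SELECT company_1, company_2 FROM merger WHERE company_2 = ?",
    ("Bar Corp.",),
).fetchall()
print(rows)
```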

Natural language text may first need some form of text simplification, producing text that is easier for a machine to process before sentences are extracted from it.

Typical subtasks of IE are:
* Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
* Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference.
* Terminology extraction: finding the relevant terms for a given corpus.
* Relation Extraction: identification of relations between entities, such as:
**PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
**PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
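
The relation extraction subtask can be sketched with a small table of surface patterns, one per relation. This is only an illustration under strong assumptions: the relation names `works_for` and `located_in`, the `PATTERNS` table, and the function `extract_relations` are hypothetical, and a capitalized word stands in for a real named entity recognizer.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical surface patterns; entity detection is approximated by capitalized words.
PATTERNS = [
    ("works_for", re.compile(r"\b([A-Z]\w+) works for ([A-Z]\w+)")),
    ("located_in", re.compile(r"\b([A-Z]\w+) is in ([A-Z]\w+)")),
]


def extract_relations(text: str) -> List[Triple]:
    """Return every (subject, relation, object) triple matched by a pattern."""
    triples: List[Triple] = []
    for relation, pattern in PATTERNS:
        for subject, obj in pattern.findall(text):
            triples.append((subject, relation, obj))
    return triples


triples = extract_relations("Bill works for IBM. Bill is in France.")
print(triples)
```

A pattern table like this is easy to inspect but only finds the phrasings it lists; a sentence such as "Bill is employed by IBM" would be missed unless another pattern were added.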

See also

* Concept mining
* HAREM, a Portuguese named entity recognition contest
* General Architecture for Text Engineering (GATE), which is bundled with a free information extraction system
* ECHELON

External links

* [http://www.opencalais.com OpenCalais] Automated information extraction tool from Reuters
* [http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ MUC]
* [http://projects.ldc.upenn.edu/ace/ ACE] (LDC)
* [http://www.itl.nist.gov/iad/894.01/tests/ace/ ACE] (NIST)
* [http://lcl2.di.uniroma1.it TermExtractor]
* [http://labs.translated.net/terminology-extraction/ TermFinder], an online terminology extractor for English, French and Italian (web application)
* [http://www.cs.washington.edu/research/textrunner/ TextRunner] Part of the KnowItAll Project of the [http://turing.cs.washington.edu/ Turing Center] at the University of Washington
* [http://gate.ac.uk/ GATE]


Wikimedia Foundation. 2010.
