Data extraction

Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.[1]

Usually, the term data extraction is applied when (experimental) data is first imported into a computer from primary sources, such as measuring or recording devices. Today's electronic devices will usually present an electrical connector (e.g. USB) through which 'raw data' can be streamed into a personal computer.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, etc. Extracting data from these unstructured sources has grown into a considerable technical challenge: whereas historically data extraction had to deal with changes in physical hardware formats, the majority of current data extraction deals with unstructured data sources and with differing software formats. This growing process of extracting data from the web is referred to as web scraping.
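As a minimal sketch of web scraping, the following uses only the Python standard library's HTML parser to pull hyperlink targets out of a fragment of markup; the sample HTML is invented for illustration, and production scrapers typically rely on dedicated libraries and must cope with malformed pages.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<p>See <a href="https://example.com/a">A</a> and <a href="https://example.com/b">B</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['https://example.com/a', 'https://example.com/b']
```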

The act of adding structure to unstructured data takes a number of forms:

  • Using text pattern matching, such as regular expressions, to identify small- or large-scale structure, e.g. records in a report and their associated data from headers and footers;
  • Using a table-based approach to identify common sections within a limited domain, e.g. in emailed résumés, identifying skills, previous work experience, qualifications, etc. using a standard set of commonly used headings (these differ from language to language), e.g. education might be found under Education/Qualification/Courses;
  • Using text analytics to attempt to understand the text and link it to other information.
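The first approach above can be sketched with a regular expression that imposes record structure on a plain-text report; the report layout (a header line followed by "item/qty" lines) is an assumption made for illustration.

```python
import re

# Hypothetical mainframe-style report whose body lines follow a
# repeating "item: NAME  qty: N" pattern between header and footer.
report = """\
ACME MONTHLY REPORT
item: widgets  qty: 12
item: gadgets  qty: 7
END OF REPORT
"""

# The pattern captures the item name and quantity from each record line;
# header and footer lines simply fail to match and are skipped.
record = re.compile(r"item:\s*(\w+)\s+qty:\s*(\d+)")
rows = [{"item": m.group(1), "qty": int(m.group(2))}
        for m in record.finditer(report)]
print(rows)  # [{'item': 'widgets', 'qty': 12}, {'item': 'gadgets', 'qty': 7}]
```

The extracted rows now carry explicit field names and types, ready for the transformation and metadata steps mentioned earlier in the data workflow.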

Notes

  1. ^ Definition of data extraction.

External links

  • Data Extraction as a part of the ETL process in a Data Warehousing environment



Wikimedia Foundation. 2010.
