Data scraping

Data scraping

Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.

Contents

Description

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all.

Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end-user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done to either (1) interface to a legacy system which has no other mechanism which is compatible with current hardware, or (2) interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system may even see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.

Data scraping is generally considered an ad hoc, inelegant technique, often used only as a "last resort" when no other mechanism is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash or produce incorrect results.

Screen scraping

Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple cases where the controlling program navigates through the user interface, or more complex scenarios where the controlling program is entering data into an interface meant to be used by a human.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s — the dawn of computerized data processing. Computer to user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use today, for various reasons). The desire to interface such a system to more modern systems is common. A robust solution will often require things no longer available, such as source code, system documentation, APIs, and/or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper which "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system.

In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24x80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder.

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or in the case of GUI applications, querying the graphical controls by programmatically obtaining references to their underlying programming objects.

Web scraping

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site.

Report mining

Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without needing to program an API to the source system.**

See also

References

Further reading

  • Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.

Wikimedia Foundation. 2010.

Игры ⚽ Нужно сделать НИР?

Look at other dictionaries:

  • Data driven journalism — is a journalistic process based on analyzing and filtering large data sets for the purpose of creating a new story. Data driven journalism deals with open data that is freely available online and analyzed with open source tools.[1] Data driven… …   Wikipedia

  • Data warehouse — Overview In computing, a data warehouse (DW) is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations… …   Wikipedia

  • Data extraction — is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by… …   Wikipedia

  • Data Toolbar — Developer(s) DataTool Services Operating system Microsoft Windows Type Browser toolbar, Web scraping Website ww …   Wikipedia

  • Data aggregator — A data aggregator is an organization such as Acxiom and ChoicePoint involved in compiling information from detailed databases on individuals and selling that information to others.[1] Contents 1 Description 2 Role of the Internet …   Wikipedia

  • Data mining — Not to be confused with analytics, information extraction, or data analysis. Data mining (the analysis step of the knowledge discovery in databases process,[1] or KDD), a relatively young and interdisciplinary field of computer science[2][3] is… …   Wikipedia

  • Screen scraping — is a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the… …   Wikipedia

  • Web scraping — (sometimes called harvesting) generically describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context. Those who scrape websites… …   Wikipedia

  • Web-scraping software comparison — This article provides a basic feature comparison for several types of web scraping software. Additional feature details are available from the individual products websites and/or articles. This article is not all inclusive or necessarily up to… …   Wikipedia

  • Screen-Scraping — Der Begriff Screen Scraping (engl., etwa: „Bildschirm auskratzen“) umfasst generell alle Verfahren zum Auslesen von Texten aus Computerbildschirmen. Gegenwärtig wird der Ausdruck jedoch beinahe ausschließlich in Bezug auf Webseiten verwendet… …   Deutsch Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”