Screen scraping


Screen scraping is a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Screen scraping often involves ignoring binary data (usually images or multimedia data) and formatting elements that obscure the essential, desired text data. Optical character recognition (OCR) software is a kind of visual scraper.

There are a number of synonyms for screen scraping, including data scraping, data extraction, web scraping, page scraping, web page wrapping, and HTML scraping (the last four being specific to scraping web pages).

Description

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all.

In contrast, output intended to be human-readable is often the antithesis of this, with display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or inimical to automated processing. However, when the only output available is such a human-oriented display, screen scraping becomes the only automated way of accomplishing a data transfer.

Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. By analogy, screen scraping has also come to mean computerized parsing of the HTML text in web pages. In all cases, the screen scraper has to be programmed to not only process the text data of interest, but also to recognize and discard unwanted data, images, and display formatting.

Screen scraping is most often done either (1) to interface to a legacy system that has no other mechanism compatible with current hardware, or (2) to interface to a third-party system that does not provide a more convenient API. In the second case, the operator of the third-party system may even see screen scraping as unwanted, for reasons such as increased system load, loss of advertisement revenue, or loss of control over the information content.

Screen scraping is generally considered an ad-hoc, inelegant technique, often used only as a "last resort" when no other mechanism is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash or produce incorrect results.

Screen scraping generally requires intensive text parsing algorithms. Computer languages that have strong support for regular expressions and other text processing are thus a popular choice for writing screen scraping programs.
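As a minimal sketch of this kind of text parsing, the following Python example uses regular expressions to pull labelled values out of a human-oriented report. The report text, the field names, and the helpers scrape_report and parse_balance are all hypothetical, invented purely for illustration.

```python
import re

# Hypothetical human-oriented output: decorative rules, labels, and commentary
# surround the two values we actually want.
REPORT = """\
==== ACCOUNT SUMMARY ====
Customer name : Jane Doe
Balance       : $1,234.50   (as of today)
Thank you for banking with us!
"""

# One "Label : value" pair per line; lines without that shape are ignored.
FIELD = re.compile(
    r"^(?P<label>[A-Za-z ]+?)\s*:\s*(?P<value>.+?)\s*$", re.MULTILINE
)


def scrape_report(text):
    """Return a dict of lower-cased label -> raw value for each labelled line."""
    return {
        m.group("label").strip().lower(): m.group("value")
        for m in FIELD.finditer(text)
    }


def parse_balance(raw):
    """Extract the amount from a display string such as '$1,234.50 (as of today)'."""
    match = re.search(r"\$([\d,]+\.\d{2})", raw)
    if match is None:
        # The display layout changed; fail loudly rather than return bad data.
        raise ValueError(f"unrecognised balance format: {raw!r}")
    return float(match.group(1).replace(",", ""))


fields = scrape_report(REPORT)
balance = parse_balance(fields["balance"])
```

Raising an error when the expected pattern is missing is one way to avoid silently producing incorrect results when the display format changes.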

Web scraping

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. Even general-purpose search engines and other web crawlers use many techniques in the same vein as web scraping.
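The following Python sketch illustrates the idea using only the standard library's html.parser module. The page fragment, the TableScraper class, and the product names are hypothetical; a real page would need its own rules for locating the data of interest.

```python
from html.parser import HTMLParser

# Hypothetical page fragment that mixes content with presentation markup.
PAGE = """
<html><body>
<h1 style="color:red">Today's prices</h1>
<table>
  <tr><td><b>Widget</b></td><td><font color="green">4.99</font></td></tr>
  <tr><td><b>Gadget</b></td><td><font color="green">12.50</font></td></tr>
</table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        # Presentation tags such as <b> and <font> are simply ignored.

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


scraper = TableScraper()
scraper.feed(PAGE)
prices = {name: float(price) for name, price in scraper.rows}
```

Here the heading and the formatting tags are discarded, and only the cell text survives as machine-friendly data.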

Scraping by design: towards the Semantic Web

The emergence of XML and web services has lent itself to the creation of technologies that improve the process of extracting machine-friendly data from web pages. Indeed, an explicit goal of the Semantic Web project is to enable the creation of documents which are easily read by both humans and machines. While this is seen as less efficient in terms of computer resources, it is asserted that computer technology has advanced to the point where such efficiency arguments are no longer a primary concern.

Extracting data from a web page or service explicitly designed to be machine-readable differs somewhat from the traditional meaning of screen scraping, which implies a preferred mechanism is not available. However, the techniques used in traditional web scraping are so similar that the same tools are often usable in both situations.

Screen scraping has thus recently taken on a new dimension with tools such as Piggy Bank, part of SIMILE, a joint project of W3C and MIT. The purpose of such technologies is to give the Internet community tools to increase the interoperability of disparate digital resources by adding a new semantic layer to online information. Some of these tools use user-designed scrapers; others analyze the data structure of Web pages and store structures and annotations as metadata, sometimes putting them back online as shared repositories that link to the original sources.

Tools and software products like those listed below enable wrappers to be created for all kinds of web sites, meaning data can be harvested from web sites and converted to XML. More advanced tools like EasyWrap Mashup Studio automate the creation of web wrappers and even allow the creation of RESTful APIs for accessing web sites programmatically.
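The conversion step can be sketched in Python with the standard xml.etree.ElementTree module. The rows, the element names (products, product, name, price), and the rows_to_xml helper are assumptions made for this illustration and do not reflect the output format of any particular tool.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Serialise (name, price) pairs, as harvested by a scraper, to an XML string."""
    root = ET.Element("products")
    for name, price in rows:
        product = ET.SubElement(root, "product")
        ET.SubElement(product, "name").text = name
        ET.SubElement(product, "price").text = price
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


# Hypothetical harvested data.
harvested = [("Widget", "4.99"), ("Gadget", "12.50")]
xml_text = rows_to_xml(harvested)
```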

Technical measures to stop scraping

With the prevalence of web scraping, many website owners have begun using anti-scraping techniques. See Web scraping.

Examples

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized data processing. Computer-to-user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters. (Such systems are still in use today, for various reasons.) The desire to interface such a system to more modern systems is common. An elegant solution will often require things no longer available, such as source code, system documentation, APIs, or programmers with experience in a 45-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper which "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system.
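The "process the resulting display output" step might look like the following Python sketch. It assumes the screen has already been captured as a list of text rows; the Telnet connection and keystroke emulation are not shown. The screen contents, the field positions in FIELDS, and the label positions in LABELS are all hypothetical.

```python
# Hypothetical captured terminal screen, one string per row.
# ljust/rjust are used only to make the fixed column positions explicit.
SCREEN = [
    "ACME INVENTORY SYSTEM".ljust(40) + "PAGE 01",
    "",
    "PART NO: A-1047".ljust(20) + "DESC: LEFT-HANDED WIDGET",
    "ON HAND: 0042".ljust(20) + "UNIT PRICE:" + "3.75".rjust(8),
    "",
    "PRESS ENTER TO CONTINUE",
]

# Assumed field map: name -> (row, start column, end column), zero-based.
FIELDS = {
    "part_no": (2, 9, 20),
    "description": (2, 26, 60),
    "on_hand": (3, 9, 20),
    "unit_price": (3, 31, 39),
}

# Labels expected at fixed (row, column) positions, used as a layout check.
LABELS = {
    (2, 0): "PART NO:",
    (2, 20): "DESC:",
    (3, 0): "ON HAND:",
    (3, 20): "UNIT PRICE:",
}


def scrape_screen(screen):
    """Extract a record from fixed screen positions after verifying the layout."""
    for (row, col), label in LABELS.items():
        if not screen[row].startswith(label, col):
            raise ValueError(
                f"unexpected screen layout: {label!r} not at row {row}, column {col}"
            )
    record = {
        name: screen[row][start:end].strip()
        for name, (row, start, end) in FIELDS.items()
    }
    # Convert display strings to the types the modern system expects.
    record["on_hand"] = int(record["on_hand"])
    record["unit_price"] = float(record["unit_price"])
    return record


record = scrape_screen(SCREEN)
```

Checking the labels before reading the fields means the scraper stops with an error if the legacy screen is rearranged, instead of passing misaligned data to the modern system.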

Modern web scrapers are much easier to find. For example, there are numerous programs and utilities which query commercial web sites (e.g., Google Product Search) to get product information and display it out of the context of the commercial service. Such usage is also an example of why some web-site operators see web scraping as undesirable. A popular method to protect a site from being web scraped is the use of CAPTCHA, which attempts to block automated access to a website.

Legality

In late summer 2008, Ryanair sued V-tours and BravoFly and also cancelled thousands of its own customers' bookings after they were made through the screen-scraping price comparison web sites of internet travel agents. [http://news.bbc.co.uk/1/hi/business/7549547.stm Internet travel agents face lawsuit for screen scraping] The company claimed that such activities were illegal and violated its terms and conditions, that the technology used slowed down its web site for other users, and that passengers using these sites were being forced to pay more for fares and other services.

The move was criticised by the Association of British Travel Agents and the consumer group Which?.

Implementations

The Perl language, and modules from the Comprehensive Perl Archive Network, contain many features suitable for screen scraping, some purpose-built for it.

Microsoft has built into its implementation of web services the ability to create a web service which extracts its data from a web page with the help of an extension to the WSDL standard and the use of regular expressions.

The PHP programming language has developed in areas suited to creating web scraping applications. The release of PHP5 included many new XML and DOM additions, including functions to parse badly formed HTML documents into DOM-trees, and work on them as if they were well-formed XML.

Java offers support for web scraping techniques, by leveraging the W3C's XQuery specification.

Python and Ruby also have libraries for web scraping.

Scroogle is a screen scraping proxy that allows users to perform Google searches without receiving Google advertisements.

Many Greasemonkey or Opera user scripts work by interpreting and adapting website code.

The Outwit platform is a Web Collection Engine and development platform for Web automation. A library of recognition and extraction functions (OutWit Kernel) is available as a Firefox extension, to be used in specific collection applications.

In Unix-like environments, one can render a page as formatted plain text with a text-mode browser, e.g.:

    $ lynx -dump URL
    $ w3m -dump URL
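Where no text-mode browser is installed, a much cruder approximation of the same idea can be sketched in Python with the standard html.parser module. This is not how lynx or w3m work internally and it preserves none of their layout; the TextDumper class, the dump_text helper, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments and collapse whitespace into single spaces.
        # Caveat: a word split by inline tags will gain a space.
        return " ".join(" ".join(self._chunks).split())


def dump_text(html):
    """Return the visible text of an HTML string as a single line."""
    dumper = TextDumper()
    dumper.feed(html)
    dumper.close()
    return dumper.text()


# Hypothetical sample markup.
SAMPLE = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>News</h1><p>Rain <b>expected</b> today.</p>"
    "<script>var x = 1;</script></body></html>"
)
plain = dump_text(SAMPLE)
```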

Software

* Web-scraping software comparison
* [http://www.tethyssolutions.com/screen-scrape.htm Automation Anywhere]
* [http://softbytelabs.com/us/products.html Black Widow]
* [http://www.newprosoft.com/web-content-extractor.htm Content Extractor]
* Dapper
* Data Extractor
* Data Ferret
* [http://www.irobotsoft.com/ IRobotSoft]
* [http://www.kapowtech.com/ Kapow RoboMaker]
* List Extractor Pro
* [http://www.gooseeker.com/en/ MetaSeeker]
* [http://www.mozenda.com/ Mozenda]
* [http://www.screen-scraper.com/ screen-scraper]
* [http://www.sitescraper.co.uk/ site scraper]
* Visual Web Spider
* WSO2 Mashup Server
* [http://www.websundew.com/ WebSundew]
* Web Scraper Lite
* [http://www.velocityscape.com/ Web Scraper Plus+]
* [http://www.webdataextractor.com Web Data Extractor]

References

Books

* Hemenway, Kevin and Calishain, Tara. "Spidering Hacks". Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.

External links

* [http://www.tethyssolutions.com/web-data-extraction-article.htm The Whats and Whys of Web Data Extraction]
* [http://community.screen-scraper.com/overview_of_screen-scraping Article about what screen scraping is and how it is accomplished.]
* [http://www.oooff.com PHP & cURL Screen Scraping Tutorials]
* [http://www.scrapingpages.com PHP scraping] Web site about web scraping using PHP
* [http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails Data extraction for Web 2.0: Screen scraping in Ruby/Rails] - Article about web scraping using Ruby
* [http://www.perl.com/pub/a/2003/01/22/mechanize.html Screen-scraping with WWW::Mechanize] - Article about web scraping using Perl
* [http://simile.mit.edu/piggy-bank/screen-scrapers-howto.html How to write screen scrapers] - Article on writing Javascript-based screen scrapers
* [http://msdn.microsoft.com/en-us/library/xxb0bsdh(VS.71).aspx Creating XML Web Services That Parse the Contents of a Web Page] - Microsoft MSDN article
* [http://blog.screen-scraper.com/2006/03/21/three-common-methods-for-data-extraction/ Three common methods for data extraction] - Article from a blog about Screen Scraping
* [http://www.perl.com/pub/a/2006/06/01/fear-api.html FEAR-less Site Scraping] - An article about how to do screen scraping using [http://freshmeat.net/projects/fear-api/ FEAR::API]
* [http://www.iopus.com/imacros/tutorials/java.htm Web scraping with Java] - Article about web scraping using the Java programming language (requires commercial library)
* [http://www.vogel-nest.de/wiki/Main/WebScraping1 Web scraping with PHP and Tcl] - Articles about web scraping using PHP and Tcl
* [http://www.ttss.net/ TTSS. Rapid implementation of Scanning systems. Since 1991] Innovators in Scanning Airlines and Tour Operators Systems
* [http://www.techreform.com/web-scraping/ Techreform - web scraping] - A commercial provider of web scraping services based in the United Kingdom
* [http://www.outwit.com/ OutWit Technologies] - Publishers of a Web Collection Engine for Firefox
* [http://simile.mit.edu/wiki/Piggy_Bank Piggy Bank] - A joint project by W3C and MIT
* [http://lethain.com/entry/2008/aug/10/an-introduction-to-compassionate-screenscraping/ An Introduction to Compassionate Screen Scraping in Python]


Wikimedia Foundation. 2010.
