ICDL crawling

ICDL crawling is an open, distributed web crawling technology based on the Website Parse Template (WPT) format.

What is a Website Parse Template?

Website Parse Template (WPT) is an XML-based open format that describes the HTML structure of website pages. The WPT format allows web crawlers to generate Semantic Web RDF triples for web pages. WPT is compatible with the existing Semantic Web concepts defined by the W3C (RDF and OWL) and with the UNL specifications.

Distributed ICDL crawling

ICDL crawling parses the content of websites according to the HTML structure templates stored in WPT files.
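
To illustrate the idea of template-driven parsing, the following is a minimal sketch in Python. It assumes a simplified, hypothetical template (a plain dictionary mapping an HTML tag and class to a predicate name) in place of a real WPT file, whose actual XML schema is not reproduced here. The names `TEMPLATE`, `TemplateParser` and `extract_triples`, and the `dc:` predicates, are illustrative only. The sketch handles flat elements and does not track nesting.

```python
from html.parser import HTMLParser

# Hypothetical template: maps (tag, class attribute) to a predicate name.
# It stands in for a WPT file; it is NOT the real WPT schema.
TEMPLATE = {
    ("h1", "title"): "dc:title",
    ("span", "author"): "dc:creator",
}


class TemplateParser(HTMLParser):
    """Collects the text of elements that the template describes."""

    def __init__(self, page_url, template):
        super().__init__()
        self.page_url = page_url
        self.template = template
        self.current_predicate = None  # predicate of the element being read
        self.triples = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        # None when the template has no entry for this element.
        self.current_predicate = self.template.get((tag, css_class))

    def handle_data(self, data):
        text = data.strip()
        if self.current_predicate and text:
            # Emit a (subject, predicate, object) triple for the page.
            self.triples.append((self.page_url, self.current_predicate, text))

    def handle_endtag(self, tag):
        self.current_predicate = None


def extract_triples(page_url, html, template=TEMPLATE):
    """Parse `html` and return triples for elements named in `template`."""
    parser = TemplateParser(page_url, template)
    parser.feed(html)
    parser.close()
    return parser.triples
```

Calling `extract_triples` with a page URL and its HTML returns a list of `(subject, predicate, object)` tuples, one for each element the template names; text in elements the template does not mention is ignored.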

Distributed crawling is carried out by an open-source client/server application installed on volunteers’ personal computers. After authentication, the application registers each PC as a distributed crawling node. The crawler on each node periodically receives tasks from the management console: download the specified websites, parse their content, and submit the results to the Parsed Content Storage. Crawling runs only when the user’s computer is idle and its Internet connection is not in use.
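
The task cycle of a single node can be sketched as follows. This is an assumption-laden illustration, not the actual client: the function `run_node` and its parameters (`tasks`, `is_idle`, `download`, `parse`, `storage`) are hypothetical, and the management console, idle detection and Parsed Content Storage are represented by plain Python objects passed in by the caller.

```python
from collections import deque


def run_node(tasks, is_idle, download, parse, storage):
    """Process queued crawl tasks for as long as the host stays idle.

    tasks    -- deque of URLs, standing in for tasks sent by the console
    is_idle  -- callable returning True while the computer is idle
    download -- callable mapping a URL to its HTML text
    parse    -- callable mapping HTML text to a parse result
    storage  -- list standing in for the Parsed Content Storage

    Returns the number of tasks completed in this run.
    """
    completed = 0
    # Check idleness before every task so crawling stops as soon as
    # the user starts using the machine again.
    while tasks and is_idle():
        url = tasks.popleft()
        html = download(url)
        storage.append((url, parse(html)))  # submit the result
        completed += 1
    return completed
```

Passing the download, parse and idle checks in as callables keeps the loop easy to exercise with stand-ins; unfinished tasks remain in the queue for the next idle period.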

The management console compares the parse results submitted by several crawlers to improve the accuracy of the crawling results. The stored results can be used by thematic and general search engines with different search algorithms, such as Google, Live, Yahoo! and Froogle, to perform more accurate web searches.
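
The article does not state how the results are compared. One simple possibility, shown here purely as an assumption, is a majority vote over the results that different nodes submit for the same page. The function name `consensus` and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def consensus(results, min_agreement=0.5):
    """Return the most common result and its share of all submissions.

    `results` holds one hashable parse result per crawler node.
    The first element is None when no result is shared by more than
    `min_agreement` of the nodes.
    """
    if not results:
        return None, 0.0
    value, count = Counter(results).most_common(1)[0]
    share = count / len(results)
    accepted = value if share > min_agreement else None
    return accepted, share
```

With three submissions of which two agree, the agreed result is accepted with a share of two thirds; with two conflicting submissions, neither exceeds the threshold and no result is accepted.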

See also

* Website Parse Template
* Distributed web crawling
* Web search engine
* Web crawler
* OMFICA

External links

* [http://www.w3c.org World Wide Web Consortium]
* [http://www.google.com/about.html Google]
* [http://www.msn.com Live Search]
* [http://www.yahoo.com Yahoo!]
* [http://www.omfica.org OMFICA]
* [http://www.google.com/products Google Product Search]


Wikimedia Foundation. 2010.
