Efficient watcher based web crawler design
Abstract
Purpose
The purpose of this paper is to design a watcher-based crawler (WBC) that can crawl both static and dynamic web sites and download only the updated and newly added web pages.
Design/methodology/approach
In the proposed WBC, a watcher file, which can be uploaded to a web site's server, prepares a report that contains the addresses of the updated and newly added web pages. In addition, the WBC is split into five units, each responsible for a specific crawling process.
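The watcher idea can be illustrated with a minimal sketch. The abstract does not specify how the watcher file detects changes, so the approach here (comparing content hashes against a stored snapshot) and the names snapshot, build_report and base_url are assumptions made for illustration only. The sketch covers files on disk and does not address dynamically generated pages.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to a SHA-256 digest of its content."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def build_report(
    previous: dict[str, str], current: dict[str, str], base_url: str
) -> dict[str, list[str]]:
    """List the URLs of pages that are new or changed since the previous snapshot.

    Hypothetical report layout: the paper's actual report format is not given
    in the abstract.
    """
    added = [base_url + path for path in current if path not in previous]
    updated = [
        base_url + path
        for path in current
        if path in previous and previous[path] != current[path]
    ]
    return {"added": sorted(added), "updated": sorted(updated)}
```

Under these assumptions, a crawler would fetch the report from the server and download only the listed addresses instead of revisiting the whole site.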
Findings
Several experiments were conducted, and the proposed WBC was observed to increase the number of uniquely visited static and dynamic web sites compared with existing crawling techniques. In addition, the proposed watcher file not only allows crawlers to visit the updated and newly added web pages, but also solves the crawler overlap and communication problems.
Originality/value
The proposed WBC performs all crawling processes automatically, in the sense that it detects all updated and newly added pages without explicit human intervention and without downloading entire web sites.
Keywords
Acknowledgements
The authors would like to thank Assistant Professor Dr Yıltan Bitirim (Eastern Mediterranean University) for his valuable suggestions and comments that greatly improved the manuscript.
Citation
Alqaraleh, S., Ramadan, O. and Salamah, M. (2015), "Efficient watcher based web crawler design", Aslib Journal of Information Management, Vol. 67 No. 6, pp. 663-686. https://doi.org/10.1108/AJIM-02-2015-0019
Publisher
Emerald Group Publishing Limited
Copyright © 2015, Emerald Group Publishing Limited