ARCOMEM crawling architecture

  • The World Wide Web is the largest information repository available today. However, this information is highly volatile, and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving rely on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM’s crawling architecture. We introduce the overall architecture and describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper, which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler developed for the project, as well as the modifications we have implemented to adapt Heritrix, an open source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM’s crawling architecture is effective in acquiring focused information about a topic and in leveraging information from social media.
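
The idea of computing a priority for the pages to be crawled can be illustrated with a minimal sketch of a priority-ordered crawl frontier. This is not ARCOMEM's implementation and does not reflect Heritrix's API: the names `PriorityFrontier`, `QueuedUrl` and `topical_score` are hypothetical, and the scoring rule (term overlap between anchor text and a topic description) is a toy assumption chosen only to make the ordering visible.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedUrl:
    # heapq is a min-heap, so the negated score puts the highest priority first.
    neg_score: float
    order: int
    url: str = field(compare=False)


class PriorityFrontier:
    """Hypothetical crawl frontier: URLs leave in descending priority order."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores leave in insertion order

    def add(self, url, score):
        """Queue a URL with its priority; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, QueuedUrl(-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the queued URL with the highest priority."""
        return heapq.heappop(self._heap).url

    def __len__(self):
        return len(self._heap)


def topical_score(anchor_text, topic_terms):
    """Toy relevance score (assumption): fraction of topic terms in the anchor text."""
    words = set(anchor_text.lower().split())
    terms = {t.lower() for t in topic_terms}
    return len(words & terms) / len(terms) if terms else 0.0
```

In such a sketch, a crawler loop would repeatedly call `pop()`, fetch the page, score its outgoing links, and `add()` them back, so that links judged more relevant to the topic are fetched before less relevant ones.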

Metadata
Author:Vassilis Plachouras, Florent Carpentier, Muhammad Faheem, Julien Masanès, Thomas Risse, Pierre Senellart, Patrick Siehndel, Yannis Stavrakas
URN:urn:nbn:de:hebis:30:3-536511
DOI:https://doi.org/10.3390/fi6030518
ISSN:1999-5903
Parent Title (English):Future Internet
Publisher:MDPI
Place of publication:Basel
Document Type:Article
Language:English
Year of Completion:2014
Date of first Publication:2014/08/19
Publishing Institution:Universitätsbibliothek Johann Christian Senckenberg
Release Date:2020/05/25
Tag:content acquisition; crawling architecture; web archiving
Volume:6
Issue:3
Page Number:24
First Page:518
Last Page:541
Note:
This is an open access article distributed under the Creative Commons Attribution License
HeBIS-PPN:465074693
Institutes:Central Facility / University Library
Dewey Decimal Classification:0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Collections:University publications
Licence:Creative Commons - Attribution-NonCommercial-ShareAlike