Frontera (web crawling)


Frontera is an open-source, web crawling framework implementing crawl frontier component and providing scalability primitives for web crawler applications.

Overview

The content and structure of the World Wide Web changes rapidly. Frontera is designed to be able to adapt quickly to these changes. Most large scale web crawlers operate in batch mode with sequential phases of injection, fetching, parsing, deduplication, and scheduling. This leads to a delay in updating the crawl when the web changes. The design is mostly motivated by the relatively low random access performance of hard disks compared to sequential access. Frontera instead relies on modern key value storage systems, using efficient data structures and powerful hardware to crawling, parsing and schedule indexing of new links concurrently. It's an open-source project designed to fit various use cases, with high flexibility and configurability.
Large-scale web crawls are Frontera's only purpose. Its flexibility allows crawls of moderate size on a single machine with a few cores by leveraging single process and distributed spiders run modes.

Features

Frontera is written mainly in Python. Data transport and formats are well abstracted and out-of-box implementations include support of MessagePack, JSON, Kafka and ZeroMQ.
Although, Frontera isn't a web crawler itself, it requires a streaming crawling architecture rather than a batch crawling approach.
StormCrawler is another stream-oriented crawler built on top of Apache Storm whilst using some components from the Apache Nutch ecosystem. was designed by ISTResearch with precise monitoring and management of the queue in mind. These systems provide fetching and/or queueing mechanisms, but no link database or content processing.

Battle testing

At Scrapinghub Ltd. there is a crawler processing 1600 requests per second at peak, built using primarily Frontera using Kafka as a message bus and HBase as storage for link states and link database. Such crawler operates in cycles, each cycle takes 1.5 months and results in 1.7B of downloaded pages.
Crawl of Spanish internet resulted in 46.5M pages in 1.5 months on AWS cluster with 2 spider machines.

History

First version of Frontera operated in single process, as part of custom scheduler for Scrapy, using on-disk SQLite database to store link states and queue. It was able to crawl for days. After getting to some noticeable volume of links it started to spend more and more time on SELECT queries, making crawl inefficient. This time Frontera is developed under DARPA's Memex program and included in its catalog of open source projects.
In 2015 subsequent versions of Frontera used HBase for storing link database and queue. Application was distributed on two parts: backend and fetcher. Backend was responsible for communicating with HBase by means of Kafka and fetcher was only reading Kafka topic with URLs to crawl, and producing crawl results to another topic consumed by backend, thus creating a closed cycle. First priority queue prototype suitable for web scale crawling was implemented during that time. The queue was producing batches with limits on a number of hosts and requests per host.
Next significant milestone of Frontera development was the introduction of crawling strategy and strategy worker, along with abstraction of the message bus. It became possible to code the custom crawling strategy without dealing with low-level backend code operating with the queue. An easy way to say what links should be scheduled, when and with what priority made Frontera a truly crawl frontier framework. Kafka was quite a heavy requirement for small crawlers and message bus abstraction allowed to integrate almost any messaging system with Frontera.