id cb-intro-cb
title Introduction to Crawlbot
sidebar_label Introduction

Crawlbot works hand-in-hand with a Diffbot API (either automatic or custom). It quickly spiders a site for appropriate links and hands these links to a Diffbot API for processing. All structured page results are then compiled into a single "collection," which can be downloaded in full or searched using the Search API.
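The flow described above (seed a crawl, name the Diffbot API that should process each discovered link) can be sketched as a request builder. This is a minimal sketch, not an authoritative client: the endpoint URL and the `token`, `name`, `seeds`, and `apiUrl` parameter names are assumptions recalled from the Crawlbot API and should be checked against the current API reference. The helper `build_crawl_request` is hypothetical, and it only builds the request without sending anything.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names are not stated in this document;
# verify them against the Crawlbot API reference before use.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_crawl_request(token, name, seeds, api_url):
    """Return (url, form_body) for creating a crawl job. Nothing is sent."""
    fields = {
        "token": token,            # your Diffbot token
        "name": name,              # job name, used later to fetch the collection
        "seeds": " ".join(seeds),  # assumed: seed URLs separated by whitespace
        "apiUrl": api_url,         # the Diffbot API that processes each link
    }
    return CRAWL_ENDPOINT, urlencode(fields)


url, body = build_crawl_request(
    token="YOUR_TOKEN",
    name="example-crawl",
    seeds=["https://example.com"],
    api_url="https://api.diffbot.com/v3/analyze",  # assumed automatic API URL
)
print(url)
print(body)
```

Sending the body as a form-encoded POST to the returned URL would be the next step; that part is left out so the sketch runs without a token or network access.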

Crawlbot is available on Extraction API Plus plans and above, and is accessible in the Developer Dashboard here. Note that a single token can have at most 1000 active crawls. More information here.

Robots.txt

By default, Crawlbot adheres to a site’s robots.txt instructions, including the Disallow and Crawl-delay directives.

In specific cases, typically because of a partnership or agreement you have with the site being crawled, the robots.txt instructions can be ignored or overridden. This is often faster than waiting for the third-party site to update its robots.txt file.

To whitelist Crawlbot for a site, specify the “Diffbot” user-agent in the site’s robots.txt:

User-agent: Diffbot 
Disallow: 

Note that Crawlbot does not adhere to the Allow directive.
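To see how a standards-following parser reads the whitelist entry above, the following sketch uses Python's built-in `urllib.robotparser`. It illustrates the robots.txt semantics only and is not Crawlbot's own implementation. The catch-all `User-agent: *` block and the `example.com` URL are assumptions added to make the effect visible. Unlike Crawlbot, Python's parser does honor `Allow`, so the example avoids that directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every crawler, then whitelist Diffbot
# with an empty Disallow (which means "nothing is disallowed").
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Diffbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/products/widget"
diffbot_allowed = parser.can_fetch("Diffbot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("Diffbot allowed:", diffbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

Under these assumed rules the Diffbot entry takes precedence over the catch-all block for the Diffbot user-agent, while other crawlers remain blocked.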

Data Retention

Depending on your Diffbot plan, inactive crawls will be removed from your account either 14 or 30 days after completion.

This includes the extracted data as well as the job meta information (name, settings, etc.).

“Active” crawls are those that are recurring/repeating and that are not in a permanently “paused” state. Currently active jobs will not be deleted or removed from your account. After a recurring crawl completes its final round, it becomes subject to the regular deletion policy.
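The retention rules above can be modeled as a small date calculation. This is a sketch under stated assumptions: the `Crawl` record and its field names are hypothetical and do not mirror any Diffbot response format, and the retention window (14 or 30 days) has to be supplied by the caller according to their plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Crawl:
    # Hypothetical fields for illustration only.
    name: str
    recurring: bool
    permanently_paused: bool
    final_round_done: bool
    completed_at: Optional[datetime]


def is_active(crawl: Crawl) -> bool:
    """Active = recurring, not permanently paused, with rounds still to run."""
    return (
        crawl.recurring
        and not crawl.permanently_paused
        and not crawl.final_round_done
    )


def removal_date(crawl: Crawl, retention_days: int) -> Optional[datetime]:
    """Return when the crawl would be removed, or None if it is not eligible."""
    if retention_days not in (14, 30):
        raise ValueError("retention_days must be 14 or 30")
    if is_active(crawl) or crawl.completed_at is None:
        return None
    return crawl.completed_at + timedelta(days=retention_days)


one_off = Crawl("one-off", False, False, False, datetime(2024, 1, 1))
nightly = Crawl("nightly", True, False, False, datetime(2024, 1, 1))

print(removal_date(one_off, 14))
print(removal_date(nightly, 14))
```

Because removal covers both the extracted data and the job metadata, a date like this is a reasonable deadline for downloading a collection you want to keep.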

Crawlbot basics

Crawlbot debugging