
Crawler Node Documentation

You can use Firecrawl with the Crawler node to systematically browse and index websites. Whether you're mapping a site's structure or extracting specific data, the node offers a seamless, customizable way to discover and organize site information, simplifying web data extraction and making the results ready for AI applications.

Features

Key Functionalities
  1. Comprehensive Crawling: It recursively traverses websites starting from a specified URL, analyzing the sitemap (if available) and following links to identify and access every reachable subpage for thorough data collection.

  2. Dynamic Content Handling: It effectively manages dynamic content rendered with JavaScript, ensuring comprehensive data extraction from all accessible subpages.

  3. Modular Design: It lets you build reusable workflow components.

Benefits
  1. Reliability: It handles common web scraping challenges, including proxies, rate limits, and anti-scraping measures, ensuring consistent and dependable data extraction.

  2. Efficiency: It intelligently manages requests to minimize bandwidth usage and avoid detection, optimizing the data extraction process.

Prerequisites

Before using Crawler Node, ensure the following:

  • A valid Firecrawl API Key.
  • Access to the Firecrawl service host URL.
  • Properly configured credentials for Firecrawl.
  • A webhook endpoint for receiving notifications (required for the crawler).

Installation

Step 1: Obtain API Credentials

  1. Register on Firecrawl.
  2. Generate an API key from your account dashboard.
  3. Note the Host URL and Webhook Endpoint.

Step 2: Configure Firecrawl Credentials

Use the following format to set up your credentials:

Key Name          | Description                                          | Example Value
------------------|------------------------------------------------------|---------------------------
Credential Name   | Name to identify this set of credentials             | my-firecrawl-creds
Firecrawl API Key | Authentication key for accessing Firecrawl services  | fc_api_xxxxxxxxxxxxx
Host              | Base URL where the Firecrawl service is hosted       | https://api.firecrawl.dev
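
Before saving these credentials in the node, you can sanity-check them with a direct call to the Firecrawl REST API. The sketch below is minimal and assumes Firecrawl's v1 crawl endpoint; the key and host are the placeholder values from the table above, and the payload fields are illustrative rather than authoritative.

import requests  # pip install requests

FIRECRAWL_HOST = "https://api.firecrawl.dev"  # Host from your credentials
FIRECRAWL_API_KEY = "fc_api_xxxxxxxxxxxxx"    # Firecrawl API Key (placeholder)

# Queue a small crawl; a successful response confirms the key and host work.
resp = requests.post(
    f"{FIRECRAWL_HOST}/v1/crawl",
    headers={"Authorization": f"Bearer {FIRECRAWL_API_KEY}"},
    json={"url": "https://example.com", "limit": 10},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # typically includes an id for the queued crawl job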

Step 3: Connect a Notification Webhook

Create a Webhook flow to receive crawl updates and results, and use its endpoint URL as the node's Notification Webhook. A minimal receiver sketch follows below.
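
For local testing outside of a Lamatic Webhook flow, the sketch below shows the general shape of a receiver: an HTTP endpoint that accepts POSTed JSON and acknowledges it with a 2xx response. The route, port, and payload handling here are assumptions for illustration; inspect a real delivery for the exact event schema.

from flask import Flask, request  # pip install flask

app = Flask(__name__)

@app.route("/callback", methods=["POST"])
def crawl_callback():
    event = request.get_json(force=True)  # crawl update or result payload
    # A real handler would persist pages or trigger downstream processing.
    print("crawl event:", event)
    return {"status": "received"}, 200

if __name__ == "__main__":
    app.run(port=8000)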

Configuration Reference

Parameter            | Description                                         | Example Value
---------------------|-----------------------------------------------------|-----------------------------------
Credential Name      | Select previously saved credentials                 | my-firecrawl-creds
URL                  | Starting-point URL for the crawler                  | https://example.com
Notification Webhook | Endpoint to receive crawl updates and results       | https://your-webhook.com/callback
Exclude Path         | URL patterns to exclude from the crawl              | "admin/*", "private/*"
Include Path         | URL patterns to include in the crawl                | "blog/*", "products/*"
Crawl Depth          | Maximum depth to crawl relative to the entered URL  | 3
Crawl Limit          | Maximum number of pages to crawl                    | 1000
Crawl Sub Pages      | Toggle to enable or disable crawling sub pages      | true
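
The Include Path and Exclude Path examples above use glob-style patterns matched against URL paths. As an illustration only (this is not Firecrawl's actual matcher), the filtering behaves roughly like the following sketch:

from fnmatch import fnmatch
from urllib.parse import urlparse

include = ["blog/*", "products/*"]   # Include Path
exclude = ["admin/*", "private/*"]   # Exclude Path

def should_crawl(url: str) -> bool:
    # Approximate the filter: exclusions win, then inclusions (if any) must match.
    path = urlparse(url).path.lstrip("/")
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    return any(fnmatch(path, pat) for pat in include) if include else True

print(should_crawl("https://example.com/blog/post-1"))  # True
print(should_crawl("https://example.com/admin/users"))  # False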

Low-Code Example

nodes:
  - nodeId: crawlerNode_547
    nodeType: crawlerNode
    nodeName: Crawler
    values:
      credentials: ....
      url: https://lamatic.ai/docs
      webhook: ""
      crawlSubPages: false
      crawlLimit: 10
      crawlDepth: 1
      excludePath: []
      includePath: []
    needs:
      - triggerNode_1
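
In this example, sub-page crawling is disabled, the crawl is capped at 10 pages and a depth of 1, and no path filters are set; webhook is left empty and should be set to your Notification Webhook URL to receive results.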

Troubleshooting

Common Issues

Problem                    | Solution
---------------------------|---------------------------------------------------------------------
Invalid API Key            | Ensure the API key is correct and has not expired.
Connection Issues          | Verify that the host URL is correct and reachable.
Webhook Errors             | Check that the webhook endpoint is active and correctly configured.
Crawling Errors            | Review the inclusion/exclusion paths for accuracy.
Dynamic Content Not Loaded | Increase the Wait for Page Load time in the configuration.

Debugging

  • Check Firecrawl logs for detailed error information.
  • Test the webhook endpoint to confirm it is receiving updates.
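
A quick way to test the endpoint is to POST a sample payload yourself and check for a 2xx response. The payload below is made up; it verifies connectivity only, not the real event schema.

import requests

resp = requests.post(
    "https://your-webhook.com/callback",  # your Notification Webhook URL
    json={"event": "test", "message": "webhook connectivity check"},
    timeout=10,
)
print(resp.status_code, resp.text)  # a healthy endpoint returns a 2xx status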
