Scraper Node Documentation

Scraper Node is a web scraping tool that uses Firecrawl to transform web content into structured, LLM-compatible data. It extracts targeted content from specific web pages using customizable rules.

Features

Key Functionalities
  1. Customizable URL Scraping: Allows users to specify the exact URL to scrape, ensuring targeted data extraction.

  2. Selective Content Scraping: Features options to scrape only the main content or include/exclude specific tags, providing control over the scraped data.

  3. TLS Verification: Includes an option to skip TLS verification, so pages can still be scraped from sites whose certificates fail validation (for example, self-signed or expired certificates).

  4. Device Emulation: Offers the ability to emulate mobile devices, ensuring accurate scraping of mobile-optimized pages.

  5. Adjustable Load Timing: Allows users to set a custom wait time (in milliseconds) for pages to load, accommodating different website performance speeds.

Benefits
  1. Precision: Enables targeted scraping by including or excluding specific tags and controlling the scraping scope.

  2. Flexibility: Supports mobile device emulation and TLS verification bypass for greater adaptability across websites.

  3. Efficiency: The adjustable wait time lets you balance scraping speed against the risk of missing dynamically loaded content.

  4. Ease of Use: A user-friendly interface with clear options makes it accessible for both technical and non-technical users.

Prerequisites

Before using Scraper Node, ensure the following:

  1. A valid Firecrawl API Key.
  2. Access to the Firecrawl service host URL.
  3. Properly configured credentials for Firecrawl.
  4. A webhook endpoint for receiving notifications (required for the crawler).

Installation

Step 1: Obtain API Credentials

  1. Register on Firecrawl.
  2. Generate an API key from your account dashboard.
  3. Note the Host URL and Webhook Endpoint.

Step 2: Configure Firecrawl Credentials

Use the following format to set up your credentials:

| Key Name | Description | Example Value |
| --- | --- | --- |
| Credential Name | Name to identify this set of credentials | my-firecrawl-creds |
| Firecrawl API Key | Authentication key for accessing Firecrawl services | fc_api_xxxxxxxxxxxxx |
| Host | Base URL where Firecrawl service is hosted | https://api.firecrawl.dev |

Configuration Reference

| Parameter | Description | Example Value |
| --- | --- | --- |
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| URL | Target URL to scrape | https://example.com/page |
| Main Content | Extract only the main content of the page, excluding headers, navs, footers, etc. | true |
| Skip TLS Verification | Bypass SSL certificate validation | false |
| Include Tags | HTML tags to include in extraction | p, h1, h2, article |
| Exclude Tags | HTML tags to exclude from extraction | nav, footer, aside |
| Emulate Mobile Device | Simulate mobile browser access | true |
| Wait for Page Load | Time to wait for dynamic content (ms) | 123 |
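
As a rough illustration, the example values in the table above might be written in the node's low-code form as follows. This is a sketch, not a confirmed schema: the key names are taken from the Low-Code Example section of this page, while the pairing of each table parameter with a key (for example, Main Content with onlyMainContent), the list form for tags, and passing the credential name as a string are assumptions.

```yaml
nodes:
  - nodeId: scraperNode_680
    nodeType: scraperNode
    nodeName: Scraper
    values:
      credentials: my-firecrawl-creds    # assumption: the saved Credential Name goes here
      url: https://example.com/page
      onlyMainContent: true              # assumed to correspond to "Main Content"
      skipTLsVerification: false
      mobile: true                       # assumed to correspond to "Emulate Mobile Device"
      waitFor: 123                       # assumed to correspond to "Wait for Page Load" (ms)
      includeTags: [p, h1, h2, article]  # assumption: tags given as a YAML list
      excludeTags: [nav, footer, aside]
    needs:
      - triggerNode_1
```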

Low-Code Example

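nodes:
  - nodeId: scraperNode_680
    nodeType: scraperNode
    nodeName: Scraper
    values:
      credentials: ''              # saved Firecrawl credentials to use
      url: https://lamatic.ai/docs # target URL to scrape
      onlyMainContent: false       # true extracts only the main content
      skipTLsVerification: false   # true bypasses TLS certificate validation
      mobile: false                # true emulates a mobile device
      waitFor: 123                 # time to wait for the page to load, in ms
      includeTags: []              # HTML tags to include in extraction
      excludeTags: []              # HTML tags to exclude from extraction
    needs:
      - triggerNode_1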

Troubleshooting

Common Issues

| Problem | Solution |
| --- | --- |
| Invalid API Key | Ensure the API key is correct and has not expired. |
| Connection Issues | Verify that the host URL is correct and reachable. |
| Webhook Errors | Check if the webhook endpoint is active and correctly configured. |
| Crawling Errors | Review the inclusion/exclusion paths for accuracy. |
| Dynamic Content Not Loaded | Increase the Wait for Page Load time in the configuration. |
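
For the "Dynamic Content Not Loaded" case, a minimal sketch of the change, assuming the wait time is set through the waitFor key shown in the Low-Code Example section. The value 5000 is an arbitrary illustration, not a recommended setting; a suitable value depends on how long the target page takes to render.

```yaml
# Fragment of a scraperNode's values; all other keys are omitted here.
values:
  url: https://lamatic.ai/docs
  waitFor: 5000   # milliseconds; illustrative value, raised from 123
```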

Debugging

  • Check Firecrawl logs for detailed error information.
  • Test the webhook endpoint to confirm it is receiving updates.
