Extract from File Node Documentation

The Extract from File node is designed to parse and extract data from multiple file formats. It provides a flexible interface for handling different file types with customizable configuration options, ensuring accurate data extraction and formatting for downstream processing.

Features

Key Functionalities
  1. Multiple Format Support: Handles CSV, JSON, Text, HTML, PDF, DOCX, XLSX, and image files with format-specific parsing options.

  2. Configurable Parsing: Offers detailed configuration options for each file type to control how data is extracted and processed.

  3. Encoding Options: Supports multiple file encodings including UTF-8, ASCII, and UTF-16LE for text-based formats.

  4. Data Transformation: Provides options to clean and transform data during extraction (trimming, filtering, etc.).

  5. Base64 Encoding: Supports encoding the extracted data as base64.

Benefits
  1. Versatility: Single node solution for handling various file formats commonly used in data processing.

  2. Precision Control: Fine-grained control over data extraction through format-specific configuration options.

  3. Data Quality: Built-in options for data cleaning and validation during extraction.

  4. Seamless Integration: Easy integration with other nodes in the workflow for comprehensive data processing.

What can I build?

  1. Data processing pipelines that handle multiple file formats
  2. Automated document parsing systems
  3. Data extraction flows for business intelligence
  4. Content aggregation systems from various file sources
  5. Pipelines that extract file data as base64 for use in AI nodes

Setup

Select the Extract from File Node

  1. Choose the appropriate operation for your file type
  2. Configure format-specific parameters
  3. Provide the file URL
  4. Deploy and test the extraction

Configuration Reference

Common Parameters

Parameter | Description | Required | Default
nodeName | Name of the node instance | Yes | "Extract from File"
operation | Type of file to extract from | Yes | "extractFromCSV"
fileUrl(s) | URL or path to the file, or an array of URLs | Yes | ""
format | File format | Yes | "auto"
encodeAsBase64 | Encode the extracted data as base64, structured as data:content-type;base64,encodedString (e.g. data:text/plain;base64,SW52YWxpZCBwYXJhbWV0ZXJzCg==) | No | false
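
As a rough illustration of the encodeAsBase64 output shape, the Python sketch below splits a data URI of the documented form data:content-type;base64,encodedString and decodes its payload. The helper name parse_data_uri is hypothetical and not part of the node; the sketch only assumes the structure described for encodeAsBase64.

```python
import base64
from typing import Tuple


def parse_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split 'data:<content-type>;base64,<payload>' into (content_type, raw_bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = uri[len("data:"):].partition(",")
    # The last ';'-separated token of the header must be the 'base64' marker.
    content_type, _, marker = header.rpartition(";")
    if marker != "base64":
        raise ValueError("expected a base64-encoded data URI")
    return content_type, base64.b64decode(payload)


# The sample value from the parameter description.
content_type, raw = parse_data_uri(
    "data:text/plain;base64,SW52YWxpZCBwYXJhbWV0ZXJzCg=="
)
print(content_type)               # text/plain
print(repr(raw.decode("utf-8")))  # 'Invalid parameters\n'
```

A downstream step that receives the encoded form can use the content type to decide how to handle the decoded bytes.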

Format-Specific Parameters

CSV Configuration
Parameter | Description | Required | Default
delimiter | The delimiter that separates columns, usually a comma | No | ","
headers | If selected, returns the data as a list of objects keyed by column name | No | true
quote | The character used to quote fields that contain the delimiter (e.g. "first,name",last name => ["first,name", "last name"]) | No | '"'
ignoreEmpty | If true, discards columns that are all white space or delimiters | No | false
comment | If the CSV contains comments, ignores lines that begin with the specified character (e.g. #) | No | null
discardUnmappedColumns | Discards columns that do not map to a header. Only applies when the number of parsed columns is greater than the number of headers | No | true
trim | If true, trims all white space from columns | No | false
rtrim | If true, right-trims all columns | No | false
ltrim | If true, left-trims all columns | No | false
maxRows | Maximum number of rows to parse; 0 means no limit | No | 0
skipRows | Number of rows to skip at the beginning | No | 0
encoding | Select the encoding of the file | No | "utf8"
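
To make the effect of a few CSV options concrete, here is a minimal Python sketch that approximates delimiter, quote, headers, comment, and trim using the standard csv module. It is not the node's implementation: parse_csv is a hypothetical helper, and the exact semantics of each option inside the node are assumed from the descriptions above.

```python
import csv
import io


def parse_csv(text, delimiter=",", quote='"', headers=True, comment=None, trim=False):
    """Approximate a subset of the CSV options with Python's csv module."""
    # Assumed `comment` behavior: drop lines that start with the comment character.
    lines = [
        line for line in text.splitlines()
        if not (comment and line.startswith(comment))
    ]
    reader = csv.reader(io.StringIO("\n".join(lines)), delimiter=delimiter, quotechar=quote)
    rows = list(reader)
    if trim:
        # Assumed `trim` behavior: strip white space on both sides of every cell.
        rows = [[cell.strip() for cell in row] for row in rows]
    if not headers:
        return rows  # plain list of lists
    if not rows:
        return []
    header, *body = rows
    # With headers enabled, each row becomes an object keyed by column name.
    return [dict(zip(header, row)) for row in body]


sample = '# exported sample\nname,city\n"Doe, Jane", Paris \nJohn,Rome\n'
records = parse_csv(sample, comment="#", trim=True)
print(records)
```

The quoted field "Doe, Jane" keeps its embedded comma, and the comment line never reaches the parser.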

JSON Configuration
Parameter | Description | Required | Default
encoding | Select the encoding of the file | No | "utf8"

Text Configuration
Parameter | Description | Required | Default
encoding | Select the encoding of the file | No | "utf8"

HTML Configuration
Parameter | Description | Required | Default
returnRawText | Set to true to return the raw data instead of parsing | No | false

PDF Configuration
Parameter | Description | Required | Default
joinPages | Combine all pages into a single string | No | false
password(s) | Password, or array of passwords to try, if the PDF is encrypted | No | ""
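
The two PDF options can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: shape_pdf_data and password_candidates are hypothetical helpers, the newline separator used when joining pages is a guess, and the node's real behavior may differ.

```python
from typing import List, Union


def shape_pdf_data(page_texts: List[str], join_pages: bool, separator: str = "\n") -> List[str]:
    """Return extracted text as one string per page, or as a single joined string."""
    if join_pages:
        # Assumed joinPages behavior: one entry holding all pages.
        return [separator.join(page_texts)]
    return list(page_texts)


def password_candidates(value: Union[str, List[str]]) -> List[str]:
    """Normalize password(s) into an ordered list of non-empty candidates to try."""
    if isinstance(value, str):
        return [value] if value else []
    return [candidate for candidate in value if candidate]


pages = ["Page one text", "Page two text"]
print(shape_pdf_data(pages, join_pages=False))
print(shape_pdf_data(pages, join_pages=True))
print(password_candidates(["first-guess", "second-guess"]))
```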
DOCX Configuration

No additional config required

Image Configuration

No additional config required

XLSX Configuration
Parameter | Description | Required | Default
ignoreEmpty | Discard empty columns/rows | No | false
headers | If selected, returns the data as a list of objects keyed by column name | No | true

Sample Input

   File URL(s): ["https://calibre-ebook.com/downloads/demos/demo.docx","https://example-files.online-convert.com/document/txt/example.txt","https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"]
   Format: Auto Detect
   All other parameters: default values

Low-Code Example

nodes:
  - nodeId: extractFromFileNode_983
    nodeType: extractFromFileNode
    nodeName: Extract from File
    values:
      trim: false
      ltrim: false
      quote: '"'
      rtrim: false
      comment: "null"
      fileUrl: >-
        ["https://calibre-ebook.com/downloads/demos/demo.docx","https://example-files.online-convert.com/document/txt/example.txt","https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"]
      headers: true
      maxRows: "0"
      encoding: utf8
      maxPages: "0"
      password: ""
      skipRows: "0"
      delimiter: ","
      joinPages: true
      format: auto
      skipLines: "0"
      ignoreEmpty: false
      returnRawText: false
      discardUnmappedColumns: false
    needs:
      - triggerNode_1
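
In the example above, fileUrl is written as a folded YAML string that holds a JSON array, while the parameter table also allows a single URL. The Python sketch below shows one way a caller might normalize both shapes into a list before building the config. normalize_file_urls is a hypothetical helper; how the node itself parses this field is not documented here.

```python
import json
from typing import List, Union


def normalize_file_urls(value: Union[str, List[str]]) -> List[str]:
    """Accept a single URL, a list of URLs, or a JSON-array string of URLs."""
    if isinstance(value, list):
        return list(value)
    text = value.strip()
    if text.startswith("["):
        # A JSON array serialized as a string, as in the low-code example.
        return json.loads(text)
    return [text]


as_string = (
    '["https://calibre-ebook.com/downloads/demos/demo.docx",'
    '"https://example-files.online-convert.com/document/txt/example.txt"]'
)
urls = normalize_file_urls(as_string)
print(len(urls), urls[0])
```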

Output

files

  • An array of objects, each representing a file and its extracted content along with associated metadata.

metadata

  • A nested object containing descriptive attributes of the file.

    • mime_type: Specifies the file's MIME type, indicating its format and encoding.
    • type: Categorizes the file's general type.
    • filename: The name of the file.
    • extension: The file's extension.
    • url: The source URL of the file.
    • size: The file size in bytes, or null if not available.
    • file_id: A unique identifier for the file within the processing context.

data

  • An array of strings containing the extracted content from the file.

additional_fields

  • A nested object containing supplementary data or metadata about file processing.

raw

  • An array of objects providing unprocessed or detailed extraction data.

    • metadata: A nested object within raw detailing file format and creation metadata.
      • format: The file's format version.
      • title: The file's title.
      • author: The file's author.
      • subject: A description of the file's subject.
      • keywords: Keywords associated with the file.
      • creator: The software or tool used to create the file.
      • producer: The software or library that produced the file.
      • creationDate: The file's creation date.
      • modDate: The file's last modification date.
      • trapped: Related to PDF trapping status.
      • encryption: Indicates the file's encryption status, or null if not encrypted.

file_path

  • A string specifying the temporary or local file path used during processing.

page_count

  • An integer indicating the total number of pages in the file.

page

  • An integer specifying the page number of the extracted content.

toc_items

  • An array containing the table of contents items.

tables

  • An array of objects describing tables extracted from the file.

    • bbox: An array of coordinates defining the bounding box of a table.
    • rows: The number of rows in a table.
    • columns: The number of columns in a table.

images

  • An array of image data extracted from the file.

graphics

  • An array of graphic elements extracted from the file.

text

  • A string containing the raw text content extracted from a specific page or section.

words

  • An array of individual words extracted from the text.

Example Output

{
  "files": [
    {
      "metadata": {
        "mime_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "type": "document",
        "filename": "demo.docx",
        "extension": "docx",
        "url": "https://calibre-ebook.com/downloads/demos/demo.docx",
        "size": 1311881,
        "file_id": 0
      },
      "data": [
        "data"
      ]
    },
    {
      "metadata": {
        "file_id": 1,
        "type": "document",
        "url": "https://example-files.online-convert.com/document/txt/example.txt",
        "filename": "example.txt",
        "extension": "txt",
        "mime_type": "text/plain; charset=UTF-8",
        "size": null
      },
      "data": [
        "data"
      ]
    }
  ]
}
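
A downstream step will typically walk the files array and read each entry's metadata and data. The Python sketch below does that for an abbreviated copy of the example output. It assumes the array sits under a top-level "files" key of a JSON object; summarize_files is a hypothetical helper, not part of the node.

```python
import json

# Abbreviated copy of the example output; only a few metadata fields are kept.
output_json = """
{
  "files": [
    {"metadata": {"file_id": 0, "filename": "demo.docx", "extension": "docx", "size": 1311881},
     "data": ["data"]},
    {"metadata": {"file_id": 1, "filename": "example.txt", "extension": "txt", "size": null},
     "data": ["data"]}
  ]
}
"""


def summarize_files(output: dict) -> list:
    """Build one summary row per extracted file."""
    summary = []
    for item in output.get("files", []):
        meta = item.get("metadata", {})
        size = meta.get("size")  # may be null when the size is not available
        summary.append({
            "file_id": meta.get("file_id"),
            "filename": meta.get("filename"),
            "size": "unknown" if size is None else f"{size} bytes",
            "chunks": len(item.get("data", [])),
        })
    return summary


summary = summarize_files(json.loads(output_json))
for row in summary:
    print(row)
```

Handling a null size explicitly matters here because the second file in the example reports no size.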

Troubleshooting

Common Issues

Problem | Solution
File Not Found | Verify the file URL is accessible and correct
Parsing Errors | Check file format matches selected operation
Encoding Issues | Try different encoding options for text-based files

Debugging

  1. Check file accessibility
  2. Verify file format matches operation
  3. Review format-specific configuration
  4. Check node logs for detailed error messages
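
For the encoding advice under Common Issues, it can help to test locally which encoding actually decodes a file before changing the node's setting. The Python sketch below tries the three encodings this page lists (ASCII, UTF-8, UTF-16LE) in order of strictness. decode_with_fallback is a hypothetical helper, and the names are Python codec names, which may not match the node's option values.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    encodings: Sequence[str] = ("ascii", "utf-8", "utf-16-le"),
) -> Tuple[str, str]:
    """Return (encoding, text) for the first encoding that decodes without error."""
    for encoding in encodings:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next, more permissive candidate
    raise ValueError("none of the candidate encodings could decode the data")


print(decode_with_fallback(b"plain text"))
print(decode_with_fallback("héllo".encode("utf-8")))
print(decode_with_fallback("héllo".encode("utf-16-le")))
```

A successful decode does not prove the encoding is right, since UTF-16LE accepts many byte sequences, so the decoded text should still be checked by eye.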
