AWS S3 Node Documentation

The AWS S3 node in Lamatic automates fetching and synchronizing files from Amazon S3 buckets. It supports various file types, including text files, PDFs, Word Documents, and others stored in the bucket. This node enables regular file synchronization to support vectorization and indexing for Retrieval-Augmented Generation (RAG) flows.

Features

✅ Key Functionalities

  • Batch Trigger: Automates file fetching and synchronization on a schedule or in real-time using S3 event notifications.
  • File Type Support: Handles text files, PDFs, Word Documents, and other compatible formats.

✅ Benefits

  • Streamlines file collection from Amazon S3 buckets.
  • Prepares files for vectorization and indexing to enhance RAG flows.
  • Scales efficiently with growing data volumes.

Prerequisites

Before setting up the AWS S3 node, ensure you have the following:

  • An AWS account with appropriate bucket access permissions.
  • The target S3 bucket name and configuration.
  • An understanding of IAM policies and credentials.

Installation

Step 1: Set Up AWS Credentials

  1. Create IAM Policy:
    • Navigate to the IAM console.

    • Create a new policy with required S3 permissions.

    • Use the provided JSON policy template.

          {
              "Version": "2012-10-17",
              "Statement": [
                  {
                      "Effect": "Allow",
                      "Action": [
                          "s3:GetObject",
                          "s3:ListBucket",
                          "s3:ListAllMyBuckets"
                      ],
                      "Resource": "*"
                  }
              ]
          }

      Note: To grant permission only to specific buckets, add them to the Resource key, as in the example below.

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject",
                      "s3:ListBucket"
                  ],
                  "Resource": [
                      "arn:aws:s3:::{your-bucket-name-1}/*",
                      "arn:aws:s3:::{your-bucket-name-1}",
                      "arn:aws:s3:::{your-bucket-name-2}/*",
                      "arn:aws:s3:::{your-bucket-name-2}"
                  ]
              }
          ]
      }

      Note: At this time, object-level permissions alone are not sufficient to successfully authenticate the connection. Please ensure you include the bucket-level permissions as provided in the example above.

  2. Configure IAM User:
    • Create or select an IAM user.

    • Attach the created policy.

    • Generate and securely store access credentials.


      Caution: Your Secret Access Key will only be visible once upon creation. Be sure to copy and store it securely.

      For more information on managing your access keys, please refer to the official AWS documentation.
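
The bucket-scoped policy from step 1 lists two ARNs per bucket: one for the bucket itself and one for the objects inside it. The sketch below generates that JSON for any number of buckets so the pairs cannot drift apart. The function name `build_s3_read_policy` and the bucket names are hypothetical, chosen for illustration only; the output is meant to be pasted into the IAM console, not applied automatically.

```python
import json


def build_s3_read_policy(bucket_names):
    """Return an IAM policy dict granting read access to the given buckets."""
    resources = []
    for name in bucket_names:
        # Object-level ARN, used by s3:GetObject.
        resources.append(f"arn:aws:s3:::{name}/*")
        # Bucket-level ARN, used by s3:ListBucket.
        resources.append(f"arn:aws:s3:::{name}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


# Hypothetical bucket names; replace them with your own.
policy = build_s3_read_policy(["my-docs-bucket", "my-archive-bucket"])
print(json.dumps(policy, indent=4))
```

Keeping both ARNs for every bucket matters because, as noted above, object-level permissions alone are not sufficient to authenticate the connection.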

Step 2: Configure Node in Lamatic

  1. Add the S3 node to your flow.
  2. Provide necessary credentials and bucket details.
  3. Configure sync settings and schedule.

Configuration Reference

s3

| Field | Description | Options/Examples | Requirement/Default |
|---|---|---|---|
| Credentials | Specifies the credentials required to access the S3 bucket. Ensure you have appropriate IAM credentials configured for the node. | Pre-configured S3 credentials (e.g., "S3"). | Mandatory |
| Bucket | Specifies the name of the Amazon S3 bucket that the node will interact with. | Example: new-testing-12345. | Mandatory |
| Parsing Strategy | Defines how files in the S3 bucket will be parsed. | auto (automatic detection), fast, ocr_only. | Default: Auto |
| Globs (Path Patterns) | Allows specifying file path patterns to include or exclude files during processing. | Example: *.csv (include all CSV files). | Default: None (all files included). |
| Days to Sync If History Is Full | Defines how many days' worth of historical data to sync when the history is full. | Example: 3. | Default: 3. |
| Start Date | Specifies the date from which the node should begin syncing files. | Format: YYYY-MM-DDTHH:mm:ss.SSSSSSZ. | Default: Empty (process all files). |
| Sync Mode | Determines how the synchronization is performed. | Incremental (new/updated files), Full Refresh. | Default: Incremental. |
| Sync Schedule | Defines the frequency at which the synchronization process occurs. | Every 24 hours, custom intervals (e.g., hourly). | Default: Every 24 hours. |

Credentials

These are the authentication details used to connect to the S3 bucket. You must configure IAM credentials with appropriate permissions (e.g., read access to the bucket).

Bucket

The S3 bucket acts as the data source for this trigger. Enter the name of the bucket where your files are stored.

Parsing Strategy

This option determines how the files will be interpreted.

  • auto lets the system decide the best way to parse each file; the other strategies are chosen manually and may suit only specific file types.
  • fast extracts text directly from the document, which does not work for all files.
  • ocr_only is more reliable than fast, but slower.

Globs (Path Patterns)

Use glob patterns to filter which files should be processed.

  • **: match everything.
  • **/*.csv: match all files with a specific extension.
  • myFolder/**/*.csv: match all .csv files anywhere under myFolder.
  • */**: match everything at least one folder deep.
  • */*/*/**: match everything at least three folders deep.
  • **/file.*|**/file: match every file called "file" with any extension (or no extension).
  • x/*/y/*: match all files that sit in folder x -> any folder -> folder y.
  • **/prefix*.csv: match all .csv files with a specific prefix.
  • **/prefix*.parquet: match all .parquet files with a specific prefix.

This is helpful for excluding unnecessary files.
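
To check how a pattern behaves before saving it, the matching rules listed above can be approximated in a few lines. This is a minimal sketch, not the matcher Lamatic uses: it assumes `**` crosses folder boundaries, `*` and `?` stay within one path segment, and `|` separates alternatives. The helper names `glob_to_regex` and `matches` are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate one glob (no '|') into a regular expression string."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            # Zero or more whole folders.
            out.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            # Anything, including '/'.
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            # Anything within a single path segment.
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            # Exactly one character within a segment.
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


def matches(key, globs):
    """Return True if the object key matches any of the glob patterns."""
    for glob in globs:
        for alternative in glob.split("|"):
            if re.fullmatch(glob_to_regex(alternative), key):
                return True
    return False


print(matches("myFolder/2024/report.csv", ["myFolder/**/*.csv"]))
print(matches("report.csv", ["*/**"]))
```

Under these assumptions the first call matches, because `**/` may span any number of folders, and the second does not, because `*/**` requires at least one folder in the key.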

Days to Sync If History Is Full

If the system encounters a large backlog of files, this setting limits the synchronization to a defined number of recent days. It helps manage processing time and storage efficiently.

Start Date

Define a specific date from which the system should start processing files. This is particularly useful for incremental syncs where you only want to process data from a certain point in time.


Format: YYYY-MM-DDTHH:mm:ss.SSSSSSZ


Example: 2025-01-03T00:00:00.000000Z
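
A Start Date string in this format can be produced from a timezone-aware datetime with the standard library. The sketch below assumes the trailing `Z` means UTC and that the six `S` digits are microseconds; the helper names `format_start_date` and `parse_start_date` are hypothetical.

```python
from datetime import datetime, timezone

# strftime/strptime equivalent of YYYY-MM-DDTHH:mm:ss.SSSSSSZ
START_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def format_start_date(moment):
    """Render a timezone-aware datetime as a UTC Start Date string."""
    return moment.astimezone(timezone.utc).strftime(START_DATE_FORMAT)


def parse_start_date(text):
    """Parse a Start Date string; raises ValueError if the format is wrong."""
    return datetime.strptime(text, START_DATE_FORMAT).replace(tzinfo=timezone.utc)


start = format_start_date(datetime(2025, 1, 3, tzinfo=timezone.utc))
print(start)
```

Parsing the value back with `parse_start_date` is a quick way to catch a date-only string such as `2025-01-03` before it reaches the node configuration.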

Sync Mode

The S3 node supports the following sync modes:

  1. In Incremental mode, the system only processes files that have been added or updated since the last sync. This reduces redundancy and improves performance.

  2. In Full Refresh mode, the system copies all data from the source to the destination on every sync, regardless of what was processed before.
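
The difference between the two modes comes down to which files are selected on each run. The sketch below illustrates the idea with an in-memory listing; it is not Lamatic's implementation. The mode names `incremental` and `full_refresh`, the function `select_files`, and the file names are all assumptions made for the example.

```python
from datetime import datetime, timezone


def select_files(files, mode, last_sync=None):
    """Pick the object keys to process.

    files maps each object key to its last-modified datetime.
    """
    if mode not in ("incremental", "full_refresh"):
        raise ValueError(f"unknown sync mode: {mode}")
    # A first incremental run has no previous sync, so it takes everything.
    if mode == "full_refresh" or last_sync is None:
        return sorted(files)
    # Incremental: only files added or updated since the last sync.
    return sorted(key for key, modified in files.items() if modified > last_sync)


listing = {
    "contracts/old.pdf": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "contracts/new.pdf": datetime(2025, 1, 5, tzinfo=timezone.utc),
}
last_sync = datetime(2025, 1, 3, tzinfo=timezone.utc)

print(select_files(listing, "incremental", last_sync))
print(select_files(listing, "full_refresh", last_sync))
```

With this listing, the incremental run selects only the file modified after the last sync, while the full refresh selects both.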

Sync Schedule

This setting specifies how often the synchronization process runs, such as every hour or every 24 hours. Regular intervals keep your data up to date. You can schedule the sync to run every 3, 6, 8, 12, or 24 hours.


Low-Code Example

    triggerNode:
      nodeId: triggerNode_1
      nodeType: s3Node
      nodeName: S3
      values:
        credentials: 'AWS'
        bucket: 'TEST'
        strategy: auto
        globs:
          - '**'
        days_to_sync_if_history_is_full: '3'
        start_date: '2025-01-03T00:00:00.000000Z'
        syncMode: incremental_append
        cronExpression: 0 0 00 1/1 * ? * UTC
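
Before deploying a configuration like the one above, it can help to check the `values` mapping against the Configuration Reference. The following is a minimal sketch under stated assumptions: the function `validate_values` is hypothetical, the mandatory fields are taken from the reference table, and the requirement that the day count be a positive whole number is an assumption rather than a documented rule.

```python
# Fields marked Mandatory in the Configuration Reference.
REQUIRED_FIELDS = ("credentials", "bucket")


def validate_values(values):
    """Return a list of problems found in a node's values mapping."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not values.get(field):
            problems.append(f"missing required field: {field}")
    # Assumption: the day count must be a positive whole number (default 3).
    days = str(values.get("days_to_sync_if_history_is_full", "3"))
    if not days.isdigit() or int(days) < 1:
        problems.append("days_to_sync_if_history_is_full must be a positive integer")
    # Globs are given as a list of patterns in the example above.
    if not isinstance(values.get("globs", []), list):
        problems.append("globs must be a list of patterns")
    return problems


example = {
    "credentials": "AWS",
    "bucket": "TEST",
    "strategy": "auto",
    "globs": ["**"],
    "days_to_sync_if_history_is_full": "3",
}
print(validate_values(example))
```

An empty list means no problems were found by these checks; it does not guarantee that the credentials or bucket name are valid on AWS.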
 

Troubleshooting

Common Issues

| Problem | Solution |
|---|---|
| Invalid Credentials | Verify IAM user credentials and policy permissions. |
| Bucket Not Found | Confirm bucket name and region configuration. |
| Sync Not Working | Check sync schedule settings and IAM permissions. |
| File Types Unsupported | Verify file formats are among supported types. |

Debugging

  • Review AWS CloudWatch logs for access issues.
  • Verify IAM policy permissions are correctly configured.
  • Test bucket accessibility using AWS CLI or console.
  • Check Lamatic Flow logs for detailed error information.
