Chunking

Divide large text data into appropriately sized chunks for efficient retrieval and contextualization.

Input Parameters

Input Chunk Field: The text data to be divided into chunks.
Chunking Type: The method used to segment the text, either by size or by separator. The two main chunking types are:
- Recursive Character Text Splitter: Splits the text into chunks of a bounded size. You specify the maximum number of characters per chunk and the overlap between consecutive chunks.
- Text Splitter: Splits the text at user-defined separators, such as punctuation marks or newline characters.
Number of Characters: The maximum length of each chunk, in characters (used by Recursive Character Text Splitter).
Overlapping Characters: The number of characters shared by consecutive chunks, which preserves context across chunk boundaries (used by Recursive Character Text Splitter).
List of Separators: The characters or strings that mark the boundaries between chunks (used by Text Splitter).
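To make the parameters concrete, here is a minimal sketch of the two ideas: size-based chunking with overlap, and separator-based chunking. The helper names chunkBySize and chunkBySeparators are hypothetical, and the code is a simplified illustration rather than the node's actual implementation. In particular, keeping each separator attached to the chunk it ends is a choice made for this sketch.

```javascript
// Hypothetical helper (illustration only): fixed-size chunks with overlap.
// chunkSize plays the role of "Number of Characters",
// overlap plays the role of "Overlapping Characters".
function chunkBySize(text, chunkSize, overlap) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Expected chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap; // distance between consecutive chunk starts
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// Hypothetical helper (illustration only): split at any of the given
// separators, keeping each separator at the end of the chunk it closes.
function chunkBySeparators(text, separators) {
  const chunks = [];
  let current = "";
  let i = 0;
  while (i < text.length) {
    const sep = separators.find((s) => s.length > 0 && text.startsWith(s, i));
    if (sep) {
      current += sep;
      if (current.trim().length > 0) chunks.push(current.trim());
      current = "";
      i += sep.length;
    } else {
      current += text[i];
      i += 1;
    }
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}

const sizeChunks = chunkBySize("abcdefghij", 4, 1);
const sepChunks = chunkBySeparators("One. Two! Three", [".", "!", "?"]);
console.log(sizeChunks); // [ 'abcd', 'defg', 'ghij' ]
console.log(sepChunks);  // [ 'One.', 'Two!', 'Three' ]
```

In the first result, each chunk starts with the last character of the previous one, which is the one-character overlap at work.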

Expected Output

Accurately chunked data, ready to be embedded into a vector database. To get the text of each chunk, extract the pageContent property from each item in the list. For example:

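// Reference the list of chunks produced by the chunking node.
let docs = {{chunk_node_id.output.chunks}};

// Keep only the text of each chunk.
let outputDocs = docs.map(doc => doc.pageContent);

output = outputDocs;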
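To try the extraction outside a workflow, the same mapping can be run against sample data. The sketch below assumes each chunk is an object with a string pageContent property, which is the only field this page describes. The sample array and the extractPageContent helper are hypothetical.

```javascript
// Hypothetical sample standing in for the chunk list produced by the node.
// Only pageContent is described on this page; real chunks may carry other fields.
const sampleChunks = [
  { pageContent: "First chunk of text." },
  { pageContent: "Second chunk of text." },
];

// Hypothetical helper: return the text of each chunk, skipping malformed items.
function extractPageContent(chunks) {
  return chunks
    .filter((doc) => doc && typeof doc.pageContent === "string")
    .map((doc) => doc.pageContent);
}

const texts = extractPageContent(sampleChunks);
console.log(texts); // [ 'First chunk of text.', 'Second chunk of text.' ]
```

The filter step is a defensive choice for this sketch, so that a missing or non-string pageContent does not end up as undefined in the output.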

Example Use Case

In this example workflow for vectorizing text data, text is fed into the chunking node, which performs fixed-size chunking with a maximum of 500 characters per chunk, an overlap of 5 characters between consecutive chunks, and common sentence-ending punctuation as the separators.
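As a rough guide to what these settings produce, the sketch below estimates the number of chunks for a text of a given length. It assumes a plain sliding window in which every chunk except the last is exactly 500 characters long. Because the node also breaks at separators, real chunks are often shorter, so treat the result as a lower bound. The estimateChunkCount helper and the 2,000-character input are hypothetical.

```javascript
// Hypothetical helper: lower-bound estimate of the chunk count for a
// sliding window of chunkSize characters that advances by
// (chunkSize - overlap) characters per chunk.
function estimateChunkCount(textLength, chunkSize, overlap) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Expected chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  const step = chunkSize - overlap; // 495 for the settings in this example
  return Math.ceil((textLength - overlap) / step);
}

const estimate = estimateChunkCount(2000, 500, 5);
console.log(estimate); // 5
```

With these settings, the windows start at characters 0, 495, 990, 1485, and 1980, so a 2,000-character text yields at least five chunks.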

Example Workflow

Learn more about chunking on Weaviate.
