Tomáš Repčík - 19. 10. 2025

Chunking Text for Vector DB

From the simplest to the advanced techniques


A vector database is a powerful tool for storing and retrieving unstructured data, such as text, images, and audio. However, to use it effectively, it is essential to chunk the data into smaller pieces.

This can get quite complex depending on the use case and the type of data being stored.

The most important aspect is actually your data and its structure.

If you have garbage data, no amount of chunking will help you.

This article focuses on text data chunking; dataset sanitation will not be covered here.

Why Chunk Text?

Chunking text is a technique to improve the performance and accuracy of vector databases.

It makes text more concise and easier to process by breaking it into smaller, manageable pieces.

Vector databases have lately improved at handling large texts, but chunking is still required.

Chunking improves the representation of the text in vector space, making it easier to find relevant information.

Chunking is a double-edged sword. If done properly, it can improve the results significantly. If done poorly, it can lead to a loss of context and important information. Only your data and use case can tell you which chunking technique is the best fit.

Base Parameters

Most of the chunking techniques have these parameters in common:

  1. Chunk size: the maximum length of a single chunk, measured in characters or tokens.
  2. Chunk overlap: how much neighbouring chunks share, so that context is not lost at the boundaries.

RAG (Retrieval-Augmented Generation) systems are bound by the LLM's context window size. That is why it is useful to know the target LLM's context window when choosing the chunk size: you can avoid splitting the text into needlessly small chunks and losing context.
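As a quick illustration, here is how these two parameters typically appear in practice, using a text splitter from LangChain (referenced in the resources below). The import path may differ between LangChain versions, and the values are only illustrative:

```python
# A minimal sketch of the two base parameters, assuming LangChain is
# installed (older versions import from `langchain.text_splitter` instead).
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document = "Your long document text goes here... " * 100  # placeholder

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum chunk length, in characters by default
    chunk_overlap=50,  # characters shared by neighbouring chunks
)
chunks = splitter.split_text(long_document)  # -> list[str]
```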

Length-based Chunking / Naive Chunking

Length-based chunking is one of the simplest methods to divide text into smaller pieces. The idea is to split the text into chunks of a fixed length.

Alternatively, after a certain number of words, tokens, sentences, or paragraphs, the algorithm can find the nearest delimiter and split the text there.

This is the most straightforward approach, but as you might guess, it may not work well.

If you have a long paragraph, it might get split in the middle, which can lead to loss of context.
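To make the idea concrete, here is a minimal, dependency-free sketch of length-based chunking: it cuts the text every chunk_size characters, but backs the cut up to the nearest space so words are not split in half (the size is illustrative):

```python
# A minimal sketch of naive length-based chunking. Chunks are cut at a
# fixed length, then the cut is moved back to the nearest space, if any.
def naive_chunks(text: str, chunk_size: int = 500) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            space = text.rfind(" ", start, end)  # nearest delimiter in window
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        start = end
    return chunks
```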

Text-structure-based Chunking

Text-structure-based chunking is the best fit for ordinary documents such as articles, blogs, and books.

It takes human-written text and splits it along the text's natural structure.

People naturally divide texts into paragraphs, sections, and chapters, so we can take advantage of this.

First, we split the text into sections based on headings or indentation, then into paragraphs, then into sentences, and finally into words if needed.

All of this is done recursively until the chunk size is met.

That is why it is also referred to as recursive chunking.

Libraries such as NLTK or spaCy can help with sentence and paragraph splitting, which makes this technique easier to implement.
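Here is a minimal sketch of this idea using NLTK for sentence splitting: paragraphs are split first, then sentences are packed greedily into chunks under the size limit. The chunk size is illustrative, and the tokenizer models must be downloaded once:

```python
# A minimal sketch of text-structure-based chunking with NLTK.
# A single sentence longer than chunk_size is kept whole in this sketch.
import nltk
nltk.download("punkt")  # needed once; newer NLTK may need "punkt_tab"
from nltk.tokenize import sent_tokenize

def structural_chunks(text: str, chunk_size: int = 500) -> list[str]:
    chunks = []
    for paragraph in text.split("\n\n"):           # natural paragraph breaks
        current = ""
        for sentence in sent_tokenize(paragraph):  # sentence boundaries
            if current and len(current) + len(sentence) > chunk_size:
                chunks.append(current.strip())
                current = ""
            current += " " + sentence
        if current.strip():
            chunks.append(current.strip())
    return chunks
```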

Document-based Chunking

Unfortunately, text does not always come as a plain-text document.

On the internet, we usually deal with HTML documents, Markdown files, and other formats. You may also encounter code files, XML, JSON, etc.

These are also structured documents, but every type has its own specifics and rules. Very similar to text-structure-based chunking, we can parse the document and use its structure to chunk it.

The most prevalent markup formats are HTML and Markdown.

Markdown files can be split by headings (#, ##, ###), lists (-, *, 1.), code blocks (```), and paragraphs.

For HTML documents, we can use tags like <h1>, <h2>, <p>, <div>, <section>, and others to identify logical sections of the document.

For programming languages, you can use classes, functions, methods, and blocks to chunk code files.
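LangChain ships splitters for several of these formats. Here is a minimal sketch for Markdown; the header mapping and sample text are illustrative, and the import path may vary by LangChain version:

```python
# A minimal sketch of document-based chunking for Markdown, using
# LangChain's MarkdownHeaderTextSplitter.
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """# Title

Intro paragraph.

## Section A

Details about section A.
"""

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")],
)
docs = splitter.split_text(markdown_text)  # documents with header metadata
for doc in docs:
    print(doc.metadata, doc.page_content)
```

For code files, LangChain's RecursiveCharacterTextSplitter.from_language offers language-aware separators (classes, functions, blocks) for many programming languages.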

Semantic Chunking

From the Cambridge dictionary:

Semantic - connected with the meanings of words.

This means that semantic chunking does not care about the length of the text or its structure.

It takes into account the meaning of the text and tries to chunk it based on that.

We can use vector similarity metrics to find related parts.

With a sliding window, we can calculate embeddings for each window and compare them with the previous one.
If the similarity drops below a certain threshold, we start a new chunk; otherwise, we add to the existing one.

Unfortunately, this technique is quite performance-heavy and requires a lot of computing power or time.

If you have large documents, it might not be feasible to use this technique.
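Still, a small-scale sketch is easy to build with the sentence-transformers package. Here, single sentences serve as the sliding window for simplicity; the model name and threshold are illustrative choices:

```python
# A minimal sketch of semantic chunking, assuming sentence-transformers
# and NLTK are installed. Model name and threshold are illustrative.
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def semantic_chunks(text: str, threshold: float = 0.5) -> list[str]:
    sentences = sent_tokenize(text)
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # compare each sentence's embedding with the previous one
        similarity = cos_sim(embeddings[i - 1], embeddings[i]).item()
        if similarity < threshold:  # meaning shifted -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```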

Agentic Chunking

This is the most advanced and sophisticated chunking technique.

Instead of using fixed rules or similarity metrics, agentic chunking treats the task as a reasoning problem.

An LLM reads the document sequentially and decides whether new content belongs in the current chunk or requires a new one - just as a person would.

The process is as follows:

  1. Extract Propositions: The document is converted into standalone propositions - atomic statements that can be understood independently.
  2. Agentic Reasoning: The system processes each proposition sequentially and evaluates whether it fits into the current chunk or if a new chunk should be started.

This technique is powerful, as it can make the text more compact while preserving context and meaning. However, it is also the most complex and resource-intensive method, mainly due to the LLM usage.
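A minimal sketch of the reasoning loop follows; call_llm is a hypothetical stand-in for whatever LLM client you use, and the prompt wording is illustrative:

```python
# A minimal sketch of agentic chunking. `call_llm` is a hypothetical
# placeholder for your LLM client; the prompt wording is illustrative.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def agentic_chunks(propositions: list[str]) -> list[str]:
    chunks: list[list[str]] = []
    current: list[str] = []
    for proposition in propositions:
        if not current:
            current.append(proposition)
            continue
        # ask the LLM whether the proposition fits the current chunk
        answer = call_llm(
            "Current chunk:\n" + "\n".join(current)
            + f"\n\nNew statement:\n{proposition}\n\n"
            + "Does the new statement belong to the current chunk? "
            + "Answer YES or NO."
        )
        if answer.strip().upper().startswith("YES"):
            current.append(proposition)
        else:
            chunks.append(current)      # close the chunk, start a new one
            current = [proposition]
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]
```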

Conclusion

Chunking text is essential for effectively using vector databases.

There are various techniques available, each with its own advantages and disadvantages.

  1. Length-Based Chunking: This method splits the text into chunks of a fixed length, such as a certain number of tokens or characters. It is simple to implement and works well for many use cases. However, it may not respect the natural structure of the text and can cut off important context.

  2. Text-Structure-Based Chunking: This method relies on the natural structure of the text, such as paragraphs, headings, and lists. It is easy to implement and works well for plain text documents. However, it may not capture the semantic meaning of the text effectively.

  3. Document-Based Chunking: This approach takes into account the specific format of the document, such as HTML or Markdown. It can leverage the document’s structure to create meaningful chunks. The downside is that it requires more effort to implement and may not generalize well to different formats.

  4. Semantic Chunking: This technique focuses on the meaning of the text rather than its structure. It uses vector embeddings to identify similar content and create chunks based on semantic similarity. While powerful, it is computationally intensive and may not be feasible for large documents.

  5. Agentic Chunking: This is the most advanced method, using LLMs to reason about the content and determine the best way to chunk it. It produces high-quality chunks but is also the most complex and resource-intensive.

Resources

  1. Most of the techniques described here are implemented in and originate from LangChain.

Socials

Thanks for reading this article!

For more content like this, follow me here or on X or LinkedIn.
