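RAG Document Splitting
TIP
How documents are split directly affects retrieval quality. Chunks that are too large bury the relevant passage in unrelated text, and chunks that are too small lose the context needed to answer well.
Splitting strategies
Character splitting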
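python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Separators are tried in order, from paragraph breaks down to single
# characters, until each piece fits within chunk_size (characters by default).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)
Token splitting (to control the context window)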
python
from langchain.text_splitter import TokenTextSplitter

# Here chunk_size and chunk_overlap are counted in tokens, not characters.
splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=20)
Splitting recommendations
| Document type | chunk_size |
|---|---|
| Code | 500-1000 |
| Article | 300-500 |
|  | 500 |
| Conversation | 200-300 |
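
To make `chunk_size` and `chunk_overlap` concrete, the following is a minimal pure-Python sketch of fixed-size splitting with overlap. It is an illustration only, not LangChain's algorithm: the function name `split_with_overlap` is hypothetical, sizes are counted in characters, and the value 400 is simply one point inside the 300-500 range listed for articles.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Assumed example input: 1000 characters split with an article-sized chunk.
sample = "x" * 1000
chunks = split_with_overlap(sample, chunk_size=400, chunk_overlap=40)
print([len(c) for c in chunks])
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.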
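Semantic splitting
python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Break where the embedding distance between adjacent sentences
# exceeds a percentile threshold.
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)
# long_text is the document string to split.
chunks = semantic_splitter.split_text(long_text)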
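
The percentile breakpoint idea can be sketched without any external library. The example below is a simplified illustration, not the `SemanticChunker` implementation: the helper names (`cosine_distance`, `percentile`, `semantic_chunks`), the hand-written two-dimensional vectors, and the default of 95 are all assumptions made for the demo. In practice the vectors would come from an embedding model.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences: list[str], vectors: list[list[float]], pct: float = 95.0) -> list[str]:
    """Start a new chunk wherever adjacent sentences are unusually far apart."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: two sentences about cats, then two about markets.
sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = semantic_chunks(sentences, vectors)
print(result)
```

Because the threshold is derived from the document's own distance distribution, the number of breakpoints adapts to the text rather than to a fixed `chunk_size`.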