RAG Document Splitting

TIP

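How you split documents directly affects retrieval quality. Chunks that are too large or too small both degrade the quality of the answers.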

Splitting Strategies

Character Splitting

python
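# Newer LangChain releases also ship this class in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between neighboring chunks
    # Tried in order: paragraphs, lines, sentence punctuation, words, single characters
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)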
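
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window in plain Python. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (that class also tries the separators in order); the function name `split_with_overlap` and the sample string are assumptions made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Hypothetical sample input, small enough to inspect by eye.
sample = "abcdefghijklmnopqrstuvwxyz"
windows = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(windows)
```

Each window repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.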

Token Splitting (Controlling the Context Window)

python
from langchain.text_splitter import TokenTextSplitter

# chunk_size and chunk_overlap are counted in tokens here, not characters
splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=20)
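
One way to relate a token-based `chunk_size` to the context window is a simple budget: subtract the tokens reserved for the prompt and the answer, then divide what remains among the retrieved chunks. The sketch below only illustrates that arithmetic; the function name `max_chunk_tokens` and every number in the example call are assumptions, not values from any particular model.

```python
def max_chunk_tokens(context_window: int, top_k: int, reserved_tokens: int) -> int:
    """Largest per-chunk token budget that lets top_k chunks fit in the window."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    available = context_window - reserved_tokens
    if available <= 0:
        raise ValueError("reserved_tokens leaves no room for retrieved chunks")
    return available // top_k


# Hypothetical setup: 4096-token window, 5 retrieved chunks,
# 1500 tokens set aside for instructions, the question, and the answer.
budget = max_chunk_tokens(context_window=4096, top_k=5, reserved_tokens=1500)
print(budget)
```

A `chunk_size` at or below this budget keeps the retrieved chunks inside the window under these assumptions.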

Splitting Recommendations

Document type    chunk_size
Code             500-1000
Articles         300-500
PDF              500
Conversations    200-300
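
The recommendations above can be kept as a small lookup so splitter settings are chosen by document type. This is a sketch under stated assumptions: the dictionary keys, the helper name `splitter_params`, the choice of the range midpoint, and the 10% overlap ratio (taken from the 500/50 and 200/20 examples earlier on this page) are all illustrative.

```python
# Ranges copied from the recommendation table; the keys are hypothetical labels.
RECOMMENDED_CHUNK_SIZE = {
    "code": (500, 1000),
    "article": (300, 500),
    "pdf": (500, 500),
    "conversation": (200, 300),
}


def splitter_params(doc_type: str, overlap_ratio: float = 0.1) -> dict[str, int]:
    """Pick the midpoint of the recommended range and a proportional overlap."""
    low, high = RECOMMENDED_CHUNK_SIZE[doc_type]
    chunk_size = (low + high) // 2  # assumption: the midpoint is a reasonable start
    return {
        "chunk_size": chunk_size,
        "chunk_overlap": int(chunk_size * overlap_ratio),
    }


params = splitter_params("code")
print(params)
```

The returned dictionary can be unpacked into a splitter constructor, for example `RecursiveCharacterTextSplitter(**params)`, and then tuned against real retrieval results.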

Semantic Splitting

python
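# SemanticChunker is provided by langchain_experimental, not langchain.text_splitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # expects OPENAI_API_KEY to be set

semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # split where neighbor distance exceeds a percentile
)
# long_text: the document string to split, defined elsewhere
chunks = semantic_splitter.split_text(long_text)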
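
The idea behind a percentile breakpoint can be sketched without any embedding service: measure the distance between each pair of neighboring sentence vectors, then start a new chunk wherever that distance exceeds a chosen percentile of all distances. The code below is a simplified illustration, not the `SemanticChunker` implementation; the function names, the toy 2-D vectors, and the 90th-percentile setting are assumptions.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile of a non-empty list, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def semantic_chunks(sentences: list[str], vectors: list[list[float]], pct: float) -> list[str]:
    """Group consecutive sentences, breaking where neighbor distance > percentile."""
    if len(sentences) != len(vectors):
        raise ValueError("need exactly one vector per sentence")
    if len(sentences) < 2:
        return list(sentences)

    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: hand-made 2-D vectors stand in for real sentence embeddings.
toy_sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = semantic_chunks(toy_sentences, toy_vectors, pct=90.0)
print(toy_chunks)
```

Only the jump between the second and third sentence exceeds the threshold here, so the four sentences fall into two chunks. With real embeddings the percentile is the main knob: a lower value produces more, smaller chunks.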