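Fine-tuning Data Preparation
TIP
Data quality largely determines how well fine-tuning works. A well-prepared dataset gets better results with less effort.
Data format
Conversation format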
json
{
  "messages": [
    {"role": "system", "content": "You are an AI assistant"},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}

Data processing
python
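import json
import os
import random


def prepare_data(raw_path, output_path):
    """Convert raw question/answer records (one JSON object per line) into the conversation format."""
    data = []
    with open(raw_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            item = json.loads(line)
            formatted = {
                "messages": [
                    {"role": "system", "content": "You are a domain expert"},
                    {"role": "user", "content": item["question"]},
                    {"role": "assistant", "content": item["answer"]},
                ]
            }
            data.append(formatted)

    # Shuffle, then hold out 10% of the samples for evaluation.
    random.shuffle(data)
    split = int(len(data) * 0.9)

    # Both files are written as JSON Lines (one sample per line) despite the .json extension.
    os.makedirs(output_path, exist_ok=True)
    with open(f"{output_path}/train.json", "w", encoding="utf-8") as f:
        for d in data[:split]:
            f.write(json.dumps(d, ensure_ascii=False) + "\n")
    with open(f"{output_path}/eval.json", "w", encoding="utf-8") as f:
        for d in data[split:]:
            f.write(json.dumps(d, ensure_ascii=False) + "\n")

Data quality checks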
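- Empty-content check: no message should have blank content
- Deduplication: remove repeated samples
- Length check: no sample should be overly long
- Format consistency check: every sample should follow the same structure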
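
The four checks above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the names `check_sample`, `filter_samples`, and `VALID_ROLES` are hypothetical, the `max_chars` limit of 4096 is an arbitrary placeholder, and length is measured in characters as a stand-in for a token count (a real pipeline would likely use the target model's tokenizer). Deduplication here only catches exact duplicates.

```python
import json

# Assumption: only these three roles appear in the conversation format.
VALID_ROLES = {"system", "user", "assistant"}


def check_sample(sample, max_chars=4096):
    """Return a list of problems found in one sample (empty list means it passed)."""
    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["format: missing or empty 'messages' list"]

    problems = []
    total_chars = 0
    for i, msg in enumerate(messages):
        # Format consistency: each message has exactly the keys 'role' and 'content'.
        if not isinstance(msg, dict) or set(msg) != {"role", "content"}:
            problems.append(f"format: message {i} must have exactly 'role' and 'content'")
            continue
        role, content = msg["role"], msg["content"]
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"format: message {i} has unknown role {role!r}")
        # Empty-content check: whitespace-only content counts as empty.
        if not isinstance(content, str) or not content.strip():
            problems.append(f"empty: message {i} has no content")
            continue
        total_chars += len(content)

    # Length check: character count used as a rough proxy for tokens.
    if total_chars > max_chars:
        problems.append(f"length: {total_chars} characters exceeds limit of {max_chars}")
    return problems


def filter_samples(samples, max_chars=4096):
    """Split samples into (kept, rejected); rejected entries carry their problem list."""
    seen = set()
    kept, rejected = [], []
    for sample in samples:
        problems = check_sample(sample, max_chars)
        if not problems:
            # Exact-duplicate detection via a canonical JSON serialization.
            key = json.dumps(sample, sort_keys=True, ensure_ascii=False)
            if key in seen:
                problems = ["duplicate: identical sample already kept"]
            else:
                seen.add(key)
        if problems:
            rejected.append((sample, problems))
        else:
            kept.append(sample)
    return kept, rejected
```

One way to use this would be to call `filter_samples(data)` on the formatted samples before shuffling and splitting, and to inspect the `rejected` list rather than discarding it silently, since a high rejection rate often points to a problem in the raw data.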