
Fine-Tuning Data Preparation

TIP

Data quality is the deciding factor in fine-tuning results. A clean, well-built dataset achieves more with less effort.

Data Format

Conversation Format

json
{
    "messages": [
        {"role": "system", "content": "You are an AI assistant"},
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Machine learning is..."}
    ]
}
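A record in this format can be checked mechanically before it goes into a training set. The following is a minimal sketch, not a definitive validator: the helper name `validate_record`, the allowed role set (taken from the example above), and the rule that a sample should end with an assistant turn are all assumptions made for illustration.

```python
# Assumed role set, taken from the example record above.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one conversation record (empty if none)."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")

    # Assumption: a training sample should end with the assistant reply to learn from.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue in a large file in a single pass.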

Data Processing

python
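import json
import os
import random


def prepare_data(raw_path, output_path):
    """Convert raw question/answer JSONL into chat format and split 90/10 into train/eval."""
    data = []
    with open(raw_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            item = json.loads(line)
            formatted = {
                "messages": [
                    {"role": "system", "content": "You are a domain expert"},
                    {"role": "user", "content": item["question"]},
                    {"role": "assistant", "content": item["answer"]}
                ]
            }
            data.append(formatted)

    # Shuffle before splitting so both sets come from the same distribution
    random.shuffle(data)
    split = int(len(data) * 0.9)

    os.makedirs(output_path, exist_ok=True)

    # Both files are written as JSON Lines: one JSON object per line
    with open(f"{output_path}/train.json", "w", encoding="utf-8") as f:
        for d in data[:split]:
            f.write(json.dumps(d, ensure_ascii=False) + "\n")

    with open(f"{output_path}/eval.json", "w", encoding="utf-8") as f:
        for d in data[split:]:
            f.write(json.dumps(d, ensure_ascii=False) + "\n")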

Data Quality Checks

  • Empty-content check
  • Deduplication of repeated samples
  • Length check (avoid overly long samples)
  • Format consistency check
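The checks above can be sketched as a single filtering pass. This is one possible implementation under stated assumptions, not a prescribed one: the function name `clean_dataset`, the `MAX_CHARS` threshold, and the choice to measure length in characters rather than tokens are all hypothetical. Deduplication here only catches exact duplicates.

```python
import json

# Hypothetical limit; in practice, measure tokens with your model's tokenizer.
MAX_CHARS = 4096


def clean_dataset(records, max_chars=MAX_CHARS):
    """Filter records using the four checks above; return (kept, drop_counts)."""
    kept = []
    seen = set()
    dropped = {"format": 0, "empty": 0, "too_long": 0, "duplicate": 0}

    for rec in records:
        messages = rec.get("messages") if isinstance(rec, dict) else None

        # Format consistency: a non-empty list of {"role": str, "content": str} objects.
        if not isinstance(messages, list) or not messages or not all(
            isinstance(m, dict)
            and isinstance(m.get("role"), str)
            and isinstance(m.get("content"), str)
            for m in messages
        ):
            dropped["format"] += 1
            continue

        # Empty content: any message that is blank after stripping whitespace.
        if any(not m["content"].strip() for m in messages):
            dropped["empty"] += 1
            continue

        # Length: total characters across all messages.
        if sum(len(m["content"]) for m in messages) > max_chars:
            dropped["too_long"] += 1
            continue

        # Exact-duplicate detection on a canonical JSON serialization.
        key = json.dumps(messages, ensure_ascii=False, sort_keys=True)
        if key in seen:
            dropped["duplicate"] += 1
            continue
        seen.add(key)
        kept.append(rec)

    return kept, dropped
```

The returned `drop_counts` dictionary shows how many samples each check removed, which helps spot a problem in the raw data (for example, an unexpectedly high duplicate count) before training starts.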