微信扫码
添加专属顾问
我要投稿
Meta最新开源工具包,一键生成高质量LLM微调数据集。 核心内容: 1. 针对LLM微调的数据获取难题,Meta提供开源解决方案 2. 从原始数据到微调黄金的工作流程:导入、创建、筛选、保存 3. 支持多种文件格式和微调任务,提高数据质量和微调效率
#SDK的命令树 SDK --> SystemCheck[system-check] SDK[synthetic-data-kit] --> Ingest[ingest] SDK --> Create[create] SDK --> Curate[curate] SDK --> SaveAs[save-as] Ingest --> PDFFile[PDF File] Ingest --> HTMLFile[HTML File] Ingest --> YouTubeURL[File Format] Create --> CoT[CoT] Create --> QA[QA Pairs] Create --> Summary[Summary] Curate --> Filter[Filter by Quality] SaveAs --> JSONL[JSONL Format] SaveAs --> Alpaca[Alpaca Format] SaveAs --> FT[Fine-Tuning Format] SaveAs --> ChatML[ChatML Format]
# 从PyPI安装conda create -n synthetic-data python=3.10conda activate synthetic-datapip install synthetic-data-kit
#或者,克隆仓库以获取最新功能:bashgit clone https://github.com/meta-llama/synthetic-data-kit.gitcd synthetic-data-kitpip install -e .
bashvllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000# 创建必要的目录结构:mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}#检查系统是否已准备就绪:synthetic-data-kit system-checkbash# 导入PDFsynthetic-data-kit ingest research_paper.pdf# 生成30个问答对,设置质量阈值synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0# 筛选质量synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5# 以OpenAI微调格式保存synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft
# Example configurationvllm: api_base: "http://localhost:8000/v1" model: "meta-llama/Llama-3.3-70B-Instruct"generation: temperature: 0.7 chunk_size: 4000 num_pairs: 25curate: threshold: 7.0 batch_size: 8
prompts:qa_generation: |You are creating question-answer pairs for fine-tuning a legal assistant.Focus on technical legal concepts, precedents, and statutory interpretation.Below is a chunk of text about: {summary}...Create {num_pairs} high-quality question-answer pairs based ONLY on this text.Return ONLY valid JSON formatted as:[{"question": "Detailed legal question?","answer": "Precise legal answer."},...]Text:---{text}
# Bash script to process multiple filesfor file in data/pdf/*.pdf; dofilename=$(basename "$file" .pdf)synthetic-data-kit ingest "$file"synthetic-data-kit create "data/output/${filename}.txt" -n 20synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatmldone
synthetic-data-kit curate data/generated/report_qa_pairs.json -t 7.0
53AI,企业落地大模型首选服务商
产品:场景落地咨询+大模型应用平台+行业解决方案
承诺:免费POC验证,效果达标后再合作。零风险落地应用大模型,已交付160+中大型企业
2025-12-14
我微调了一个LangChain专家模型,离Vibe Agent又近了一步
2025-12-11
左脚踩右脚:大模型的有趣且简单的微调方式“SHADOW-FT”
2025-12-11
大模型训练的高效内存解决方案:流水线感知的细粒度激活卸载,实现显存开销与吞吐性能的联合最优
2025-12-08
一杯咖啡成本搞定多模态微调:FC DevPod + Llama-Factory 极速实战
2025-12-04
OpenAI公开新的模型训练方法:或许能解决模型撒谎问题,已在GPT-5 thiking验证
2025-11-23
微调Rerank模型完整指南
2025-11-22
大模型微调全流程实战指南:基于IPO框架的深度解析与优化
2025-11-21
AI基础 | Qwen3 0.6B 微调实现轻量级意图识别
2025-10-12
2025-10-14
2025-10-21
2025-09-24
2025-09-20
2025-09-25
2025-11-05
2025-11-05
2025-11-21
2025-12-04