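Meta's latest open-source toolkit generates high-quality LLM fine-tuning datasets in a few commands.
Key points:
1. Meta offers an open-source answer to the problem of obtaining data for LLM fine-tuning.
2. The workflow takes raw data to fine-tuning-ready data in four steps: ingest, create, curate, save.
3. It supports multiple file formats and fine-tuning tasks, improving both data quality and fine-tuning efficiency.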
# SDK command tree
SDK --> SystemCheck[system-check]
SDK[synthetic-data-kit] --> Ingest[ingest]
SDK --> Create[create]
SDK --> Curate[curate]
SDK --> SaveAs[save-as]
Ingest --> PDFFile[PDF File]
Ingest --> HTMLFile[HTML File]
Ingest --> YouTubeURL[File Format]
Create --> CoT[CoT]
Create --> QA[QA Pairs]
Create --> Summary[Summary]
Curate --> Filter[Filter by Quality]
SaveAs --> JSONL[JSONL Format]
SaveAs --> Alpaca[Alpaca Format]
SaveAs --> FT[Fine-Tuning Format]
SaveAs --> ChatML[ChatML Format]
# Install from PyPI
conda create -n synthetic-data python=3.10
conda activate synthetic-data
pip install synthetic-data-kit
# Alternatively, clone the repository to get the latest features:
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .
# Start the vLLM server:
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

# Create the required directory structure:
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}

# Check that the system is ready:
synthetic-data-kit system-check

# Ingest a PDF
synthetic-data-kit ingest research_paper.pdf

# Generate 30 QA pairs with a quality threshold
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0

# Curate by quality
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5

# Save in OpenAI fine-tuning format
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft
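Each stage in the walkthrough above reads the file written by the previous one, so it can help to see the path chain for a given document before running anything. The following is a minimal sketch: `stage_paths` is a hypothetical helper, not part of synthetic-data-kit, and the naming pattern is inferred only from the example commands above, so it may not hold for other versions or configurations.

```shell
# Hypothetical helper (not part of synthetic-data-kit): prints the
# intermediate paths that the example commands above use for one document.
# Assumption: the naming pattern seen in those examples holds in general.
stage_paths() {
  local src="$1"
  local name
  name="$(basename "$src")"
  name="${name%.*}"   # strip the final extension, e.g. ".pdf"
  printf '%s\n' \
    "data/output/${name}.txt" \
    "data/generated/${name}_qa_pairs.json" \
    "data/cleaned/${name}_cleaned.json"
}
```

Calling `stage_paths research_paper.pdf` lists the parsed text file, the generated QA file, and the curated file, in the order the `create`, `curate`, and `save-as` steps consume them.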
# Example configuration
vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"
generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25
curate:
  threshold: 7.0
  batch_size: 8
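Since the configuration points the toolkit at a vLLM endpoint, a quick check that the endpoint answers can save a failed run. The sketch below reads a scalar value out of the config file with a naive line-based lookup. `config_value` is a hypothetical helper, it is not a YAML parser, and it assumes keys are unique and written on a single line as in the example above.

```shell
# Hypothetical helper: prints the value of the first "key: value" line
# found in a config file. Naive text matching, not real YAML parsing.
config_value() {
  local file="$1"
  local key="$2"
  sed -n "s/^[[:space:]]*${key}:[[:space:]]*//p" "$file" | head -n 1 | tr -d '"'
}

# Usage sketch (assumes the config above was saved as config.yaml and that
# the vLLM OpenAI-compatible server is running):
#   curl -s "$(config_value config.yaml api_base)/models"
```

If the `curl` call returns a model list, the server named in `api_base` is reachable; `synthetic-data-kit system-check` remains the check the toolkit itself provides.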
prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.

    Below is a chunk of text about: {summary}...

    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.

    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]

    Text:
    ---
    {text}
# Bash script to process multiple files
for file in data/pdf/*.pdf; do
  filename=$(basename "$file" .pdf)
  synthetic-data-kit ingest "$file"
  synthetic-data-kit create "data/output/${filename}.txt" -n 20
  synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
  synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done
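The loop above keeps going even when a stage fails, so a later stage may run against a file that was never written. One possible variant, sketched below, chains the stages with `&&` and adds a dry-run mode that only prints the commands. The `run`, `process_pdf`, and `process_all` functions and the `DRY_RUN` variable are illustrative names introduced here, not features of synthetic-data-kit; the commands and flags are the ones from the script above.

```shell
# run: executes a command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# process_pdf: runs the four stages for one PDF, stopping at the first failure.
process_pdf() {
  local file="$1"
  local name
  name="$(basename "$file" .pdf)"
  run synthetic-data-kit ingest "$file" &&
  run synthetic-data-kit create "data/output/${name}.txt" -n 20 &&
  run synthetic-data-kit curate "data/generated/${name}_qa_pairs.json" -t 7.5 &&
  run synthetic-data-kit save-as "data/cleaned/${name}_cleaned.json" -f chatml
}

# process_all: processes every PDF in a directory (default: data/pdf),
# reports failures on stderr, and continues with the next file.
process_all() {
  local dir="${1:-data/pdf}"
  local file
  for file in "$dir"/*.pdf; do
    [ -e "$file" ] || continue   # the glob matched nothing
    process_pdf "$file" || printf 'failed: %s\n' "$file" >&2
  done
}
```

Setting `DRY_RUN=1` before calling `process_all` prints the planned commands for review; calling it without the variable executes them.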
synthetic-data-kit curate data/generated/report_qa_pairs.json -t 7.0