DocChat: Training GPT-4-Level Conversational QA in Just a Few Hours

Published: 2024-09-30 18:30:55

Author: Halo咯咯

01.

Overview

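The release of DocChat is a notable step forward for conversational question-answering systems. Drawing on its expertise in machine learning (ML) and large language models (LLMs), Cerebras has released two new models in the DocChat series: Cerebras Llama3-DocChat and Cerebras Dragon-DocChat. These models show what high-performance conversational AI can do, and they are tailored specifically to document-based question answering.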

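Cerebras Llama3-DocChat is built on Llama 3 and incorporates insights from recent research in the field, in particular Nvidia's ChatQA model series. The model was developed in a very short time, drawing on Cerebras's experience in ML training and dataset curation as well as on techniques such as synthetic data generation.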

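Cerebras Dragon-DocChat is a multi-turn retriever model whose recall improves markedly after fine-tuning. It was trained on the ChatQA conversational QA dataset and strengthened with a contrastive loss that uses hard negatives, which gives it strong performance in multi-turn conversation settings.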
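To make the idea of a contrastive loss with hard negatives more concrete, here is a minimal sketch for a single query. It is an InfoNCE-style formulation chosen for illustration, not Cerebras's actual training code; the function names, the toy 2-dimensional embeddings, and the `temperature` parameter are all assumptions.

```python
import math

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(query, positive, negatives, temperature=1.0):
    """InfoNCE-style loss for one query (illustrative, not the DocChat recipe).

    The loss is the negative log-softmax of the positive passage's score
    among the scores of the positive and all negatives.
    """
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    logits = [s / temperature for s in scores]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]

# Toy embeddings (hypothetical values).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # clearly unrelated passage
hard_negative = [0.8, 0.3]    # looks similar to the query, but is wrong

loss_easy = contrastive_loss(query, positive, [easy_negative])
loss_hard = contrastive_loss(query, positive, [hard_negative])
print(loss_easy < loss_hard)
```

A hard negative scores almost as high as the positive, so it produces a larger loss and therefore a stronger training signal than an easy negative. That is the usual motivation for mining hard negatives when fine-tuning a retriever.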

02.

Training Efficiency and Performance

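What stands out most about the DocChat models is how quickly they were trained. Cerebras Llama3-DocChat was trained in just a few hours, and fine-tuning Dragon-DocChat took only a few minutes. Training at this speed sets a high bar for the AI industry.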

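In terms of performance, both models achieve top-tier results across a range of benchmarks and outperform many existing solutions. For example, Cerebras Llama3-DocChat shows notable improvements on benchmarks such as ConvFinQA and SQA, demonstrating its ability to handle complex conversational QA tasks.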

Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "cerebras/Llama3-DocChat-1.0-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")


system = "This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."
instruction = "Please give a full and complete answer for the question."

document = """
# Cerebras Wafer-Scale Cluster

Exa-scale performance, single device simplicity

## AI Supercomputers

Condor Galaxy (CG), the supercomputer built by G42 and Cerebras, is the simplest and fastest way to build AI models in the cloud. With over 16 ExaFLOPs of AI compute, Condor Galaxy trains the most demanding models in hours rather than days. The terabyte scale MemoryX system natively accommodates 100 billion+ parameter models, making large scale training simple and efficient.

| Cluster  | ExaFLOPs | Systems  | Memory |
| -------- | -------- | -------- | ------ |
| CG1      | 4        | 64 CS-2s | 82 TB  |
| CG2      | 4        | 64 CS-2s | 82 TB  |
| CG3      | 8        | 64 CS-3s | 108 TB |
"""

question = "How many total CS systems does Condor Galaxy 1, 2, and 3 have combined, and how many flops does this correspond to?"

user_turn = f"""<context>
{document}
</context>
{instruction} {question}"""

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user_turn}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the tokenizer's EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
)
# Slice off the prompt tokens so that only the generated answer is decoded.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
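The example question has an answer that can be checked by hand from the table in `document`. The sketch below does that arithmetic, giving a reference value to compare the model's output against. The `clusters` dictionary is a hypothetical helper that simply copies the table values; it is not part of the model's API.

```python
# Values copied from the table in `document` (CG1, CG2, CG3).
clusters = {
    "CG1": {"exaflops": 4, "systems": 64},
    "CG2": {"exaflops": 4, "systems": 64},
    "CG3": {"exaflops": 8, "systems": 64},
}

total_systems = sum(c["systems"] for c in clusters.values())
total_exaflops = sum(c["exaflops"] for c in clusters.values())

print(f"{total_systems} CS systems, {total_exaflops} ExaFLOPs")
```

A correct answer needs both a lookup across three table rows and a sum. Questions like this are a useful quick test of document-grounded QA for that reason.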

03.

Open-Source Commitment

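By releasing DocChat, Cerebras demonstrates its commitment to the open-source community. The company has published the model weights, the complete training recipe, and the associated datasets. This level of transparency lets other AI researchers and developers reproduce, build on, and extend Cerebras's work, helping to move the field forward.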

04.

Benchmark Comparison

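In head-to-head comparisons with other models, the DocChat models post impressive results. On the ChatRAG benchmark, Cerebras Llama3-DocChat outperforms Nvidia's Llama3-ChatQA and GPT-4 Turbo on several key metrics. Cerebras Dragon-DocChat likewise surpasses Facebook's Dragon+ and Nvidia's Dragon Multiturn in recall in multi-turn conversation settings.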
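Retriever recall is usually reported as recall@k: the fraction of relevant passages that appear among the top k retrieved results, averaged over queries. The sketch below shows the metric only. The passage IDs and rankings are made up, and the exact evaluation protocol used for Dragon-DocChat may differ.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages found in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)

# Hypothetical retrieval results for three conversational queries.
runs = [
    (["p3", "p1", "p7"], ["p1"]),  # gold passage at rank 2
    (["p2", "p9", "p4"], ["p4"]),  # gold passage at rank 3
    (["p5", "p6", "p8"], ["p0"]),  # gold passage not retrieved
]

top1 = mean_recall_at_k(runs, 1)
top3 = mean_recall_at_k(runs, 3)
print(f"recall@1 = {top1:.2f}, recall@3 = {top3:.2f}")
```

In a multi-turn setting, the query for each run would typically be built from the conversation history, not from the last user turn alone. That is what makes the task harder than single-turn retrieval.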