2026年7月9日 周四晚上19:30,报名腾讯会议了解“如何构建自进化的动态知识库(Brain)”(限30人)
免费POC, 零成本试错
FDE知识库

FDE知识库

学习大模型的前沿技术与行业落地应用


收藏

Qilin-Med:多阶段知识注入先进的医学大型语言模型

发布日期:2024-06-30 19:08:48 浏览次数: 4274
作者:南极星医学AI笔记

微信搜一搜,关注“南极星医学AI笔记”


Abstract

Integrating large language models (LLMs) into healthcare presents potential but faces challenges. Directly pre-training LLMs for domains like medicine is resource-heavy and sometimes unfeasible. Sole reliance on Supervised Fine-tuning (SFT) can result in overconfident predictions and may not tap into domainspecific insights. Addressing these challenges,we present a multi-stage training method combining Domain-specific Continued Pre-training(DCPT), SFT, and Direct Preference Optimization (DPO). A notable contribution of our study is the introduction of a 3Gb Chinese Medicine(ChiMed) dataset, encompassing medical question answering, plain texts, knowledge graphs,and dialogues, segmented into three training stages. The medical LLM trained with our pipeline, Qilin-Med, exhibits significant performance boosts. In the CPT and SFT phases,it achieves 38.4% and 40.0% accuracy on the CMExam, surpassing Baichuan-7B’s 33.5%. In the DPO phase, on the Huatuo-26M test set, it scores 16.66 in BLEU-1 and 27.44 in ROUGE-1, outperforming the SFT’s 12.69 and 24.21.This highlights the strength of our training approach in refining LLMs for medical applications.

摘要

将大型语言模型(LLM)整合到医疗保健中具有潜力,但也面临挑战。直接为医学等领域预训练LLM资源消耗大,有时并不可行。仅依靠监督微调(SFT)可能导致过度自信的预测,可能无法充分利用特定领域的见解。为了解决这些挑战,我们提出了一种多阶段训练方法,结合了特定领域的持续预训练(DCPT)、SFT和直接偏好优化(DPO)。我们研究的显著贡献是引入了一个3Gb的中国医学(ChiMed)数据集,涵盖了医学问答、普通文本、知识图谱和对话,分为三个训练阶段。使用我们管道训练的医学LLM,Qilin-Med,在性能上有了显著提升。在CPT和SFT阶段,它在CMExam上实现了38.4%和40.0%的准确率,超过了Baichuan-7B的33.5%。在DPO阶段,在Huatuo-26M测试集上,它在BLEU-1得分为16.66,在ROUGE-1得分为27.44,优于SFT的12.69和24.21。这凸显了我们训练方法在改进LLM以适应医疗应用方面的优势。

1 Introduction

Incorporating LLMs such as GPT-4 (OpenAI,2023) and its open-source counterpart LLaMA(Touvron et al., 2023b) into healthcare and biomedicine marks a significant shift with broad implications. These models show promise to enhance the efficiency and effectiveness of clinical and research operations, potentially revolutionizing patient care (Yang et al., 2023b; Karabacak and Margetis, 2023). They offer diverse downstreamhealthcare applications, from automating medical coding (Tu et al., 2022; Suvirat et al., 2023) to analyzing unstructured data for predictive insights (Jiang et al., 2023; Wornow et al., 2023; Hua et al.,2023; Wu et al., 2023), from decision support (Qiu et al., 2023; Cheng et al., 2023; Chiesa-Estomba et al., 2023) to patient engagement improvement(Seth et al., 2023).

1 引言

将LLM(大型语言模型)如GPT-4(OpenAI,2023)及其开源对应物LLaMA(Touvron等人,2023b)整合到医疗保健和生物医学中标志着一个重大的转变,具有广泛的影响。这些模型承诺可以提高临床和研究操作的效率和效果,有可能革命性地改变患者护理(Yang等人,2023b;Karabacak和Margetis,2023)。它们为下游医疗保健应用提供了多样化的选择,从自动化医疗编码(Tu等人,2022;Suvirat等人,2023)到分析非结构化数据以获得预测洞察(Jiang等人,2023;Wornow等人,2023;Hua等人,2023;Wu等人,2023),从决策支持(Qiu等人,2023;Cheng等人,2023;Chiesa-Estomba等人,2023)到改善患者参与度(Seth等人,2023)。

While the advantages of LLMs in healthcare are compelling, these models still have considerable room for improvement, given that medical and healthcare tasks represent some of the most challenging domains of natural language processing (NLP) (Hendrycks et al., 2021; Gu et al.,2021) and that medical AI stakes are exceptionally high as errors can directly affect patient outcomes (Thirunavukarasu et al., 2023; Gu et al.,2021). One major limitation in current medical LLMs is their reliance on solely SFT during the training phase. While SFT is essential for acquiring domain-specific knowledge, it often results in limited knowledge infusion and can lead to overconfident generalizations if not curated meticulously(Luo et al., 2023). Reinforcement learning from human feedback (RLHF) is a popular method to counteract some of SFT’s limitations, but it’s complex and demands rigorous hyperparameter tuning.Consequently, current LLMs may be ill-equipped to handle the nuanced dynamics integral to actual medical consultations.

尽管LLM在医疗保健中的优势很有吸引力,但这些模型仍然有很大的改进空间,考虑到医疗和医疗保健任务是自然语言处理(NLP)中最具有挑战性的领域之一(Hendrycks等人,2021;Gu等人,2021),而医疗AI的赌注非常高,因为错误可以直接影响患者的结果(Thirunavukarasu等人,2023;Gu等人,2021)。当前医学LLM的主要局限性之一是它们在训练阶段完全依赖SFT(监督微调)。虽然SFT对于获取特定领域的知识至关重要,但它往往导致知识注入有限,如果不精心管理,可能会导致过度自信的泛化(Luo等人,2023)。从人类反馈中进行强化学习(RLHF)是克服SFT某些局限性的流行方法,但它复杂且需要严格的超参数调优。因此,当前的LLM可能无法处理实际医疗咨询中不可或缺的微妙动态。

In response to these challenges, our study introduces Qilin-Med, an advanced Chinese medical LLM, built upon a robust pipeline that integrates DCPT, SFT, and DPO. This comprehensive approach allows Qilin-Med to harness the power of expansive medical datasets, effectively transforming a general-purpose foundation model like Baichuan (Yang et al., 2023a) into a specialized medical expert proficient in understanding complex medical texts and capable of handling intricate medical tasks. In addition, we also curated a unique dataset, ChiMed, which consists of sub-datasets corresponding to each of these three training stages to ensure a balanced and comprehensive injection

为了应对这些挑战,我们的研究引入了Qilin-Med,一个先进的中文医学LLM,建立在整合了DCPT(特定领域的持续预训练)、SFT和DPO(直接偏好优化)的强大管道之上。这种全面的方法使Qilin-Med能够利用广泛的医疗数据集,有效地将通用基础模型(如Baichuan)转变为擅长理解复杂医学文本的专业医学专家,并能够处理复杂的医学任务。此外,我们还整理了一个独特的数据集,ChiMed,它包括与这三个训练阶段相对应的子数据集,以确保LLM中平衡且全面地注入医学知识。

The contributions of this study can be summarized as follows:

1. Construction of the ChiMed dataset, which encompasses sub-datasets for DCPT, SFT, and DPO training stages, offering a holistic source to medical knowledge integration.

2. Implementation of a multi-stage knowledge injection pipeline and development of a Chinese medical LLM named Qilin-Med, effectively improving general-domains models on medical text understanding, instruction following, and preference alignment.

3. Empirical validation of our method across multiple datasets, including CMExam (Liu et al., 2023), CEval (Huang et al., 2023), and Huatuo-26M (Li et al., 2023a), setting new benchmarks in the realm of medical LLMs.

本研究的贡献可以总结如下:

  1. 构建了ChiMed数据集,该数据集包含用于DCPT、SFT和DPO训练阶段的子数据集,为医学知识整合提供了全面的来源。

  2. 实施了一个多阶段知识注入管道,并开发了一个名为Qilin-Med的中文医学LLM,有效地提高了通用领域模型在医学文本理解、指令遵循和偏好对齐方面的能力。

  3. 在多个数据集上实证验证了我们的方法,包括CMExam(Liu等人,2023)、CEval(Huang等人,2023)和Huatuo-26M(Li等人,2023a),在医学LLM领域设立了新的基准。

2 Related Work

2.1 Large Language Models

LLMs’ effectiveness relies on large-scale pretraining, such as on datasets like CommonCrawl,Wiki, and Books (Zhao et al., 2023; Touvron et al.,2023a). They typically use next-token prediction as a key training objective to understand context and predict the next word (Zhao et al., 2023; Touvron et al., 2023a). This training objective has been widely used in existing LLMs, e.g., GPT-series models (OpenAI, 2023; Brown et al., 2020), PaLM (Chowdhery et al., 2022), LLaMA (Touvron et al.,2023a), LLaMA-2 (Touvron et al., 2023b), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023),and ChatGLM (Zeng et al., 2022a; Du et al., 2022).

2 相关工作

2.1 大语言模型

LLM的有效性依赖于大规模预训练,如在CommonCrawl、Wiki和书籍等数据集上进行预训练(Zhao等人,2023;Touvron等人,2023a)。它们通常使用下一个标记预测作为关键训练目标来理解上下文并预测下一个单词(Zhao等人,2023;Touvron等人,2023a)。这个训练目标已在现有的LLM中得到广泛应用,例如GPT系列模型(OpenAI,2023;Brown等人,2020)、PaLM(Chowdhery等人,2022)、LLaMA(Touvron等人,2023a)、LLaMA-2(Touvron等人,2023b)、Alpaca(Taori等人,2023)、Vicuna(Chiang等人,2023)和ChatGLM(Zeng等人,2022a;Du等人,2022)。

2.2 Large Language Models in Healthcare

Healthcare-oriented LLMs have gained research attention, but current medical LLMs are typically either trained entirely from scratch, incurring high costs, time, and environmental impact, or finetuned from general-purpose LLMs. As an alternative, SFT methods have been introduced to adapt general LLMs into medical contexts. For example, Xiong et al. (2023) and Li et al. (2023b) proposed to fine-tune ChatGLM and LLaMA on the physician-patient conversations to obtain the DoctorGLM and ChatDoctor, respectively; MedAlpaca(Han et al., 2023) is fine-tuned on Alpaca with over 160,000 medical question-answering pairs generated from various medical corpora. BianQue(Yirong et al., 2023) incorporated multi-turn doctor Q&A datasets to perform a Chain of Questioning; Clinicalcamel (Toma et al., 2023) simultaneously incorporated physician-patient conversations,clinical articles, and medical Q&A pairs for finetuning the LLaMA2 model. Additionally, instruction prompt tuning is also proposed to improve medical LLMs by aligning LLMs to the medical domain. For example, Med-PaLM (Singhal et al., 2023a) and Med-PaLM-2 (Singhal et al.,2023b) had qualified clinicians construct the instruction data to fine-tune the PaLM. Huatuo (Wang et al., 2023a) and ChatGLM-Med (Wang et al.,2023b) constructed the knowledge-based instruction data from the knowledge graph to inject the medical knowledge into the LLMs, thus improving the downstream performances. Among existing medical LLMs, Huatuo(Wang et al., 2023a),ChatGLM-Med (Wang et al., 2023b), DoctorGLM (Xiong et al., 2023), and BianQue (Yirong et al.,2023) stands out as Chinese medical LLMs, which are especially valuable given language inequality within the current NLP field (Bird, 2020; Zeng et al., 2022b).

2.2 医疗保健领域的LLM

面向医疗保健的LLM已引起研究关注,但目前的医学LLM通常是完全从头开始训练的,这需要高成本、时间和环境影响,或者从通用LLM中进行微调。作为替代方案,已引入SFT(监督微调)方法来将通用LLM适应医疗环境。例如,Xiong等人(2023)和李等人(2023b)建议在医生-患者对话上微调ChatGLM和LLaMA,以获得DoctorGLMChatDoctorMedAlpaca(Han等人,2023)在Alpaca上进行了微调,使用了来自各种医学语料库生成的超过160,000个医学问答对。BianQue(Yirong等人,2023)结合了多轮医生问答数据集进行链式提问;Clinicalcamel(Toma等人,2023)同时结合了医生-患者对话、临床文章和医学问答对,以微调LLaMA2模型。此外,也提出了通过使LLM与医疗领域对齐来改进医学LLM的指令提示调谐。例如,Med-PaLM(Singhal等人,2023a)和Med-PaLM-2(Singhal等人,2023b)让合格临床医生构建指令数据来微调PaLM。Huatuo(Wang等人,2023a)和ChatGLM-Med(Wang等人,2023b)从知识图中构建基于知识的指令数据,将医学知识注入LLM,从而提高下游性能。在现有的医学LLM中,Huatuo(Wang等人,2023a)、ChatGLM-Med(Wang等人,2023b)、DoctorGLM(Xiong等人,2023)和BianQue(Yirong等人,2023)作为中文医学LLM脱颖而出,鉴于当前NLP领域中的语言不平等(Bird,2020;Zeng等人,2022b),它们尤为宝贵。

A concurrent study (Yang et al., 2023c) also employed a multi-stage training approach to enhance a medical language model called Zhongjing.However, to align the medical LLM outputs with human preferences, Zhongjing focused on adopting RLHF, which requires medical expert labeling and demands rigorous hyperparameter tuning.Our approach adopted DPO instead, which can automatically and efficiently achieve the goal. We also benchmarked medical LLM performance on a broader set of medical applications, as opposed to Zhongjing’s mere focus on doctor-patient dialogues. In addition, we introduce a large-scale medical dataset ChiMed, which incorporates a diverse set of data types (QA, plain texts, knowledge graphs, and dialogues) for each step of the the proposed training strategy.

一项同时进行的研究(Yang等人,2023c)也采用了多阶段训练方法来增强一个名为Zhongjing的医学语言模型。然而,为了使医学LLM的输出与人类偏好对齐,Zhongjing专注于采用RLHF(强化学习从人类反馈),这需要医疗专家标注,并需要严格的超参数调优。我们的方法采用了DPO(直接偏好优化)代替,它可以自动且高效地实现目标。我们还对医学LLM性能进行了更广泛的医疗应用基准测试,而不是Zhongjing仅仅关注医生-患者对话。此外,我们还引入了一个大规模的医学数据集ChiMed,它包含了各种数据类型(问答、普通文本、知识图谱和对话)以适应提出的训练策略的每一步。

3 Method

Fig.1 presents our three-fold pipeline with DCPT(Sec. 3.1), SFT (Sec. 3.2), and DPO (Sec. 3.3).

3方法

图1展示了我们的三阶段管道,包括DCPT(第3.1节)、SFT(第3.2节)和DPO(第3.3节)。

3.1 Domain-specific Continued Pre-training

General-purpose LLMs struggle with medical texts due to specialized language and styles. Therefore,we started with further pre-training Baichuan, a Chinese foundation model, to strengthen its understanding of fundamental medical knowledge. As a first step, we constructed a medical pre-training dataset called ChiMed-CPT by integrating existing datasets and new data crawled from the internet.

3.1 特定领域的持续预训练

通用LLM(大型语言模型)在处理医学文本时会遇到专门的语言和风格问题。因此,我们从进一步预训练一个名为Baichuan的中文基础模型开始,以加强其对基础医学知识的理解。作为第一步,我们构建了一个名为ChiMed-CPT的医学预训练数据集,通过整合现有数据集和从互联网爬取的新数据。

Figure 1: The construction pipeline of Qilin-Med.

图1:麒麟医疗的施工管道。

3.1.1 Pre-training Dataset Construction

Medical Data Collection We collected four types of medical data: Question Answering, plain (i.e.,unstructured) text, knowledge graph, and dialogue. The Question Answering subset contains three publicly available datasets: Huatuo-26M-encyclopedias (Li et al., 2023a), Huatuo-26M-medical_knowledge (Li et al., 2023a), and CMExam (Liu et al., 2023). Among them, Huatuo-26M-encyclopedias was curated from plain texts in Chinese Wikipedia and the Qianwen Health website; Huatuo-26M-medical_knowledge was curated from three knowledge graphs - CPubMed-KG(Qingcai Chen), 39Health-KG (Chen, 2018), and Xywy-KG (Bai, 2019); CMExam was sourced from the Chinese National Medical Licensing Examination. The plain text subset contains the MedQAtextbooks dataset (Jin et al., 2020) derived from textual data in Chinese medical textbooks. The knowledge graph subset contains data we extracted from CPubMed-KG, 39Health-KG, and Xywy-KG. To ensure the knowledge graph is comprehensive, we aggregated various features related to a disease entity,such as causation, symptoms, and recommended drugs. For the medical dialogue subsets, For the medical dialogue subsets, we have compiled a dataset, named CMD. This dataset comprises over 392K multi-turn medical dialogues sourced from various medical websites and covers 196 subspecialties. Furthermore, we have incorporated resources from Chinese-medical-dialogue-data (Toyhom, 2019) and Medical-Dialogue-System (Chen et al., 2020). Finally, following the deduplicating method proposed by (Lee et al., 2022), we deduplicated the dataset, yielding the Chi totaling 3.0 GB of data.

3.1.1 预训练数据集构建
医学数据收集

医学数据收集我们收集了四种类型的医学数据:问答普通(即非结构化)文本知识图谱对话。问答子集包含三个公开可用的数据集:Huatuo-26M-encyclopedias(李等人,2023a)、Huatuo-26M-medical_knowledge(李等人,2023a)和CMExam(刘等人,2023)。其中,Huatuo-26M-encyclopedias是从中文维基百科和千问健康网站的普通文本中整理的;Huatuo-26Mmedical_knowledge是从三个知识图谱——CPubMed-KG(Qingcai Chen)、39Health-KG(Chen,2018)和Xywy-KG(Bai,2019)中整理的;CMExam来自中国国家医学执业资格考试。普通文本子集包含从中文医学教科书文本数据中整理的MedQA-textbooks数据集(金等人,2020)。知识图谱子集包含我们从CPubMed-KG、39Health-KG和Xywy-KG中提取的数据。为了确保知识图谱的全面性,我们汇总了与疾病实体相关的各种特征,例如病因、症状和推荐药物。对于医学对话子集,我们整理了一个名为CMD的数据集,包含来自各种医疗网站的超过392K个多轮医学对话,涵盖196个亚专科。此外,我们还整合了来自Chinese-medical-dialogue-data(Toyhom,2019)和Medical-Dialogue-System(Chen等人,2020)的资源。最后,我们遵循(Lee等人,2022年)提出的方法去重数据集,生成了ChiMed数据集,总计3.0 GB的数据。

Statistics of the dataset are summarized in Table 1.

数据集的统计信息总结在表1中。

Table 1: Statistics of ChiMed-CPT.

表1:ChiMed-CPT的统计数据。

3.1.2 Training Objective

We used a self-supervised objective, next-token prediction, for domain-specific continued pre-training.Given N sequences partitioned from ChiMed-CPT,where each sequencecontains T tokens, we defined the loss function as the sum of the negative log probabilities of the next token given the previous tokensin the sequence:

3.1.2 训练目标

我们为特定领域的持续预训练使用了自监督目标,即下一个标记预测。给定从ChiMed-CPT中分割出的N个序列,其中每个序列包含T个标记,我们定义的损失函数是序列中给定前T个标记的下一个标记的负对数概率之和:

where θ denotes the model parameters. In the following part, we denote the model obtained via DCPT as the "Medical Foundation Model," as it exhibits precise parsing capability and a fine understanding of medical texts.

其中,θ表示模型参数。在接下来的部分,我们将通过DCPT获得的模型称为“医学基础模型”,因为它展现了精确的解析能力和对医学文本的深入理解。

3.2 Supervised Fine-Tuning

While proficient in medical text comprehension,the medical foundation model can fall short in specific medical tasks due to a lack of task adherence. Frequent pre-training is also impractical due to resource constraints. In response, we conducted SFT on the model using a carefully curated dataset to improve its interpretive and responsive capabilities.

3.2 监督式微调

尽管医学基础模型在医学文本理解方面很熟练,但由于缺乏对特定医学任务的适应性,它在某些医学任务上可能表现不佳。由于资源限制,频繁的预训练也是不切实际的。因此,我们使用精心策划的数据集对模型进行了监督式微调(SFT),以改善其解释和响应能力。

3.2.1 Instruction Dataset Construction

We constructed ChiMed-SFT (statistics shown in Table 2), which consists of general and medical domain single-turn and multi-turn instructions (i.e.,prompts) along with their ground-truth responses.General domain instructions aim to enhance the LLM’s understanding and generation capabilities for instructions, while medical domain instructions focus on answering medical questions, simulating doctor-patient consultations, and explaining medical queries. The responses for the general domain instructions were primarily generated by ChatGPT,while medical domain instructions and expected responses were both real doctor-patient diagnostic dialogues collected from medical websites. To ensure stability in supervised fine-tuning, we standardized instructions from diverse sources within ChiMed-SFT into a uniform format.

3.2.1 指令数据集构建

我们构建了ChiMed-SFT(统计数据见表2),它包含了通用和医学领域的单轮和多轮指令(即提示)以及它们的真实响应。通用领域指令旨在增强LLM对指令的理解和生成能力,而医学领域指令则专注于回答医学问题、模拟医患咨询以及解释医学查询。通用领域指令的响应主要由ChatGPT生成,而医学领域指令和预期响应都是从医疗网站上收集的真实医患诊断对话。为了确保监督式微调的稳定性,我们将ChiMed-SFT中来自不同来源的指令标准化为统一格式。

Table 2: Statistics of ChiMed-SFT.

表2:ChiMed-SFT的统计数据。

3.2.2 Training Objective

Considering each prompt as well as its corresponding response from ChiMed-SFT, the loss function of SFT stage can be defined as follows:

where N denotes the total number of training instances and θ denotes model parameters.

3.2.2 训练目标

考虑到每个提示以及它对应的ChiMed-SFT中的响应,监督式微调阶段的损失函数可以定义如下:

其中 N 表示训练实例的总数,θ 表示模型参数。

We term the fine-tuned model as the "Medical Chat Model" capable of executing specific medical tasks via instructions or dialogues while staying updated with the latest medical knowledge without significant additional resources.

我们将微调后的模型称为“医学聊天模型”,它能够通过指令或对话执行特定的医学任务,同时在不消耗大量额外资源的情况下保持最新的医学知识更新。

3.3 Direct Preference Optimization

SFT encourages some responses but does not prevent undesirable ones, such as missing or inaccurate medical information. A popular solution is RLHF, which uses reward models from response rankings to guide LLM training. However, it is complex and unstable, requiring extensive hyperparameter tuning.

3.3 直接偏好优化

监督式微调(SFT)鼓励某些响应,但并不能阻止不理想的响应,比如遗漏或不准确的医学信息。一个流行的解决方案是RLHF(人类反馈强化学习),它使用来自响应排名的奖励模型来指导LLM的训练。然而,这种方法复杂且不稳定,需要广泛的超参数调整。

To improve stability, we used DPO (Rafailov et al., 2023) to align the medical chat model output with human preferences. DPO is simpler and more effective than RHLF as it doesn’t need explicit reward modeling or reinforcement learning.

为了提高稳定性,我们使用了DPO(Rafailov等人,2023年)来将医学聊天模型的输出与人类偏好对齐。DPO比RHLF更简单、更有效,因为它不需要显式的奖励建模或强化学习。

3.3.1 Preference Dataset Construction

We built ChiMed-DPO (statistics shown in Table 3from two public available preference datasets: (1)Zhongjing_rlhf (Yang et al., 2023c), which comprises 20,000 samples (10,000 in-distribution and 10,000 out-of-distribution) annotated by medical postgraduates/doctors, and (2) MedicalGPT (Xu,2023), which contains 4,000 samples from Chinesemedical-dialogue-data, with preferred responses from doctors and rejected ones from BenTsao(Wang et al., 2023a) model.

3.3.1 偏好数据集构建

我们构建了ChiMed-DPO(统计数据见表3),它来自两个公开可用的偏好数据集:(1) Zhongjing_rlhf(Yang等人,2023c),它包含了20,000个样本(10,000个分布内和10,000个分布外),由医学研究生/医生注释,以及(2) MedicalGPT(Xu,2023),它包含了4,000个来自Chinese-medical-dialogue-data的样本,其中偏好响应来自医生,拒绝响应来自BenTsao(Wang等人,2023a)模型。

Each training sample in ChiMed-DPO is a triplet consisting of a prompt, a preferred response, and a rejected response.

ChiMed-DPO中的每个训练样本都是一个三元组,包括一个提示、一个偏好响应和一个拒绝响应。

3.3.2 Training Objective

To enhance model performance, our primary goals were to calculate log probabilities for preferred and rejected responses within the current model,and subsequently fine-tune model parameters with the aim of elevating the likelihood of preferred responses while diminishing the likelihood of rejected responses. This optimization process was guided by a specific loss function, which can be succinctly outlined as follows:

Through this process, responses generated by QilinMed will better align with human preferences while avoiding unfavored ones, thus improving the quality and safety of medical dialogues.

3.3.2 训练目标

为了提升模型性能,我们的主要目标是计算当前模型内偏好响应和拒绝响应的对数概率,然后微调模型参数,目的是提高偏好响应的可能性同时降低拒绝响应的可能性。这个优化过程是由一个特定的损失函数指导的,可以简洁地概述如下:

通过这个过程,QilinMed生成的响应将更好地与人类偏好对齐,同时避免不受欢迎的响应,从而提高医学对话的质量和安全。

4 Experiments

4.1 Evaluation Datasets, Metrics andBaselines

4.1.1 Evaluation Datasets

We evaluated Qilin-Med in scenarios such as medical knowledge Question Answering and dialogue on the following datasets:

1. CMExam (Liu et al., 2023), a standardized medical exam and practice question dataset.It contains over 60,000 multiple-choice questions and provides question explanations.

2. CEval (Huang et al., 2023), a comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of LLMs. It contains 13,948 multiple-choice exam questions across 52 diverse disciplines,including three medical sub-disciplines: Clinical Medicine, Basic Medicine, and Physician.

3. Huatuo-26M (Li et al., 2023a), a Chinese medical dataset that consists of over 26 million medical question-answer pairs, covering topics including diseases, symptoms, treatments,and drug information.

4实验

4.1 评估数据集、指标和基线

4.1.1 评估数据集

我们在医学知识问答和对话等场景下对Qilin-Med进行了评估,使用了以下数据集:

  1. CMExam(Liu等人,2023年),一个标准化的医学考试和实践问题数据集。它包含了超过60,000个多选题,并提供问题解释。

  2. CEval(Huang等人,2023年),一个全面的中国评估套件,旨在评估LLMs的高级知识和推理能力。它包含了52个不同学科(包括三个医学子学科:临床医学、基础医学和医师)的13,948个多选题。

  3. Huatuo-26M(Li等人,2023a年),一个中文医学数据集,包含超过2600万个医学问答对,覆盖疾病、症状、治疗和药物信息等主题

4.1.2 Metrics

We assess model performance on multiple-choice questions using accuracy and weighted F1 score- metrics commonly employed in information retrieval and question-answering tasks. For medical dialogue tasks, BLEU (Papineni et al., 2002)and ROUGE (Lin and Hovy, 2003) were used to evaluate the discrepancy between model-generated responses and ground truth.

4.1.2 指标

我们使用准确率和加权F1分数来评估模型在选择题上的性能,这些指标在信息检索和问答任务中通常被采用。对于医学对话任务,我们使用BLEU (Papineni et al., 2002)和ROUGE (Lin and Hovy, 2003)来评估模型生成响应与真实响应之间的差异。

4.1.3 Baselines

We used Baichuan-7B (Yang et al., 2023a) as the base model. Baichuan-7B is an open-source, largescale pre-trained language model built on the Transformer architecture. It has 7 billion parameters and is trained on approximately 1.2 trillion tokens. It supports both Chinese and English with a context window length of 4096.

4.1.3 基线

我们使用Baichuan-7B(Yang等人,2023a)作为基础模型。Baichuan-7B是一个开源的大规模预训练语言模型,基于Transformer架构构建。它有70亿参数,在大约1200亿个token上进行训练。它支持中英文,上下文窗口长度为4096。

For baselines, we evaluated LLMs in both general scenarios and the medical domain across various tasks. For CMExam, we reported the performance of ChatGLM-6B, LLaMA (Touvron et al.,2023a), Vicuna (Chiang et al., 2023), Alpaca (Taori et al., 2023), Huatuo (Wang et al., 2023a), and DoctorGLM (Xiong et al., 2023) on both the prediction and reasoning tasks. For CEval, we evaluated the performance of ChatGLM (Du et al., 2022),Chinese-LLaMA2 (Cui et al., 2023), and ChineseAlpaca (Cui et al., 2023) on the prediction task.Since CMExam has a standardized training set, we also reported the performance of LLaMA, Alpaca,and Vicuna on CMExam after SFT. Additionally,we evaluated models such as T5 (Raffel et al., 2020)and GPT2 (Radford et al., 2019) on the test set of Huatuo-26M. However, since Huatuo-26M is not fully open-sourced, we were unable to run SFT with this dataset.

对于基线,我们评估了LLMs在一般场景和医学领域各种任务下的表现。对于CMExam,我们报告了ChatGLM-6B、LLaMA (Touvron等人,2023a)、Vicuna (Chiang等人,2023)、Alpaca (Taori等人,2023)、Huatuo (Wang等人,2023a)和DoctorGLM (Xiong等人,2023)在预测和推理任务上的性能。对于CEval,我们评估了ChatGLM (Du等人,2022)、Chinese-LLaMA2 (Cui等人,2023)和Chinese-Alpaca (Cui等人,2023)在预测任务上的性能。由于CMExam有一个标准化的训练集,我们还报告了LLaMA、Alpaca和Vicuna在SFT后的CMExam性能。此外,我们评估了T5 (Raffel等人,2020)和GPT2 (Radford等人,2019)在Huatuo-26M测试集上的表现。然而,由于Huatuo-26M没有完全开源,我们无法使用这个数据集进行SFT。

4.2 Implementation Details

For DCPT, Baichuan-7B was trained on eight A100 80G GPUs, with settings: batch size of 1/GPU,three epochs, a 2e-4 learning rate, 0.05 warmup ratio, 0.01 weight decay, and 1024 block size.

4.2 实现细节

对于DCPT,Baichuan-7B在八个A100 80G GPU上进行训练,设置如下:每个GPU的批处理大小为1,三个epochs,学习率为2e-4,0.05的预热比例,0.01的权重衰减,以及1024的块大小。

Table 3: Statistics of the ChiMed-DPO.

表3:ChiMed-DPO的统计数据。

Table 4: C-Eval results.

表4:C-Eval结果。

For SFT, A100 80G GPUs were used with a 64 batch size/GPU. Qilin-Med settings were: 2e-5 learning rate, 0.05 warmup ratio, 0.05 weight decay,and max_source_length and max_target_length both at 256. We accelerated training using DeepSpeed ZeRO-2 (Ren et al., 2021). We adopted the LoRA technique (Hu et al., 2021), a type of SFT, with lora_rank set at 8, lora_alpha at 32, and lora_dropout at 0.05 for enhanced performance.

对于SFT,我们使用了A100 80G GPUs,每个GPU的批处理大小为64。Qilin-Med的设置如下:2e-5的学习率,0.05的预热比例,0.05的权重衰减,max_source_length和max_target_length都设置为256。我们使用DeepSpeed ZeRO-2 (Ren et al., 2021)来加速训练。我们采用了LoRA技术(Hu et al., 2021),一种SFT,其中lora_rank设置为8,lora_alpha设置为32,lora_dropout设置为0.05以提升性能。

For DPO, 4 RTX 3090 GPUs were used with a batch size of 8/GPU. Settings were: 2e-5 learning rate, 0.05 warmup ratio, 0.05 weight decay, and both max_source_length and max_target_length at 256. The LoRA technique was again applied with lora_rank set at 8, lora_alpha at 16, and lora_dropout at 0.05.

对于DPO,我们使用了4个RTX 3090 GPUs,每个GPU的批处理大小为8。设置如下:2e-5的学习率,0.05的预热比例,0.05的权重衰减,max_source_length和max_target_length都设置为256。再次应用了LoRA技术,其中lora_rank设置为8,lora_alpha设置为16,lora_dropout设置为0.05。

For CMExam assessment, we used OpenAI’s GPT-3.5-turbo, GPT-4-0314, and models like LLaMA, Alpaca, and Vicuna, each with 7B parameters. ChatGLM was tested using its 6B parameter version and operated with P-Tuning V2 (Liu et al.,2021), using a prefix token length of 128 and a learning rate of 0.02 for SFT. For other models including LLaMA, Alpaca, Vicuna, and Huatuo, we used the LoRA technique (Hu et al., 2021) with a rank of 8, an alpha of 16, and a 0.05 dropout rate.

在CMExam评估中,我们使用了OpenAI的GPT-3.5-turbo、GPT-4-0314以及LLaMA、Alpaca和Vicuna等模型,每个模型的参数量为7B。ChatGLM使用了其6B参数版本,并使用P-Tuning V2 (Liu et al., 2021)进行操作,前缀令牌长度为128,SFT的学习率为0.02。对于其他模型,包括LLaMA、Alpaca、Vicuna和Huatuo,我们使用了LoRA技术(Hu et al., 2021),rank设置为8,alpha设置为16,dropout率为0.05。

During the Huatuo-26M evaluation, we compared T5 and GPT2 performances. Both models were set with maximum question and answer lengths of 256 and 512, respectively. We used the original 12-layer Chinese GPT2.

在Huatuo-26M评估期间,我们比较了T5和GPT2的性能。两个模型的问题和答案的最大长度分别设置为256和512。我们使用了原始的12层中文GPT2。

In the C-Eval phase, all models were evaluated using few-shot prompting. We opted for 5 shots and employed a greedy decoding strategy for answer prediction.

在C-Eval阶段,所有模型都使用少量样本提示进行评估。我们选择了5个样本,并采用贪婪解码策略进行答案预测。

4.3 Results and Discussion

On C-Eval Table 4 summarizes online evaluation results on the C-Eval benchmark. Among the five general LLMs compared in the upper part of the table, Baichuan-7B achieved the highest scores in both average and three medical subjects (namely Clinical Medicine, Physician and Basic Medicine),outperforming other models in instruction following as well as medical understanding. Specifically,Baichuan-7B achieved an accuracy of 45.1% in Basic Medicine, significantly surpassing ChatGLM-6B which scored only 36.6%. After the Domainspecific Continued Pre-training and Supervised Fine-tuning stages, the model enhanced its proficiency in medical knowledge and comprehension,better equipping it to address questions within medical domains. Notably, our Qilin models show a great performance boost compared to ZhongjingLLaMA. However, a decline in general language capabilities was noted, with average accuracy on CEval dropping from 42.8% to 40.1%. This decline suggests that while the model’s medical expertise grew, its broader linguistic abilities suffered due to its increased focus on the medical field.

4.3 结果与讨论

在C-Eval

在C-Eval上,表4总结了在线评估结果。在表的上半部分比较的五个通用LLM中,Baichuan-7B在平均分和三个医学科目(即临床医学、医师和基础医学)上都取得了最高的分数,超过了其他模型在遵循指令以及医学理解方面的表现。特别是,Baichuan-7B在基础医学上的准确率达到45.1%,显著超过了ChatGLM-6B的36.6%。经过领域特定持续预训练和监督式微调阶段后,模型在医学知识和理解方面的熟练度得到了提升,更好地装备了它来回答医学领域内的问题。值得注意的是,与Zhongjing-LLaMA相比,我们的Qilin模型显示出巨大的性能提升。然而,也注意到通用语言能力的下降,C-Eval的平均准确率从42.8%下降到40.1%。这种下降表明,尽管模型的医学专业知识增长了,但由于其增加了对医学领域的关注,其更广泛的语言能力受到了影响。

On CMExam Table 5 displays the evaluation outcomes on the CMExam benchmark. ChatGLM and Vicuna performed well in explanation generation, reflecting enhanced comprehension of medical knowledge and dialogue skills. Of the two, Vicuna had a weaker answer prediction rate at 5%, while ChatGLM reached 26%. After fine-tuning with CMExam’s training data (i.e.,LLaMA-CMExam, Alpaca-CMExam, and VicunaCMExam), we noted marked improvements in both tasks. Following the Domain-specific Continued pre-training and Supervised Fine-tuning using our data, our proposed Qilin-Med-7B-CPT and QilinMed-7B-SFT outperformed those fine-tuned on CMExam. This indicates our framework’s efficacy in enriching LLMs with medical knowledge and bolstering their problem-solving capabilities in the medical domain.

在CMExam

在CMExam上,表5展示了CMExam基准的评估结果。ChatGLM和Vicuna在解释生成方面表现良好,反映出对医学知识和对话技能的增强理解。在这两个模型中,Vicuna的答案预测率为5%,而ChatGLM达到了26%。在使用CMExam的训练数据进行微调后(即LLaMA-CMExam、Alpaca-CMExam和Vicuna-CMExam),我们在两个任务上都注意到了明显的改进。在采用我们的数据进行领域特定持续预训练和监督式微调后,我们提出的Qilin-Med-7B-CPT和Qilin-Med-7B-SFT在性能上超过了那些在CMExam上微调的模型。这表明我们的框架在丰富LLM的医学知识以及增强其在医学领域的问题解决能力方面是有效的。

Table 5: CMExam results

表5:CMExam结果。

Table 6: Huatuo-26M results.

表6:华佗-26M结果。

On Huatuo-26M Table 6 shows the evaluation re sults on Huatuo-26M. Among all three baseline methods (namely T5, GPT2, and Baichuan-7B),Baichuan-7B achieved the highest scores on most metrics, while T5 exhibited poor medical dialogue performance. Qilin-Med-7B-CPT outperformed Baichuan-7B in terms of BLEU-1 and ROUGE-1, proving that DCPT effectively injects medicalrelated knowledge into the model. Comparing Qilin-Med-7B-CPT and Qilin-Med-7B-SFT (10.63 vs. 12.69 in terms of BLEU-1), we see that SFT further strengthens model medical knowledge and instruction compliance capabilities. Finally, QilinMed-7B-DPO achieved higher scores in all metrics than Qilin-Med-7B-SFT, showing that DPO efficiently helps align the medical chat model output with human preferences and encourages the model to generate more preferred outputs.

在华佗-26M

在Huatuo-26M上,表6展示了Huatuo-26M的评估结果。在所有三种基线方法(即T5、GPT2和Baichuan-7B)中,Baichuan-7B在大多数指标上取得了最高的分数,而T5在医学对话性能上表现较差。Qilin-Med-7B-CPT在BLEU-1和ROUGE-1方面超过了Baichuan-7B,证明DCPT有效地将医学相关知识注入了模型。比较Qilin-Med-7B-CPT和Qilin-Med-7B-SFT(BLEU-1分别为10.63和12.69),我们看到SFT进一步强化了模型的医学知识和指令遵循能力。最后,Qilin-Med-7B-DPO在所有指标上的得分都高于Qilin-Med-7B-SFT,显示出DPO有效地帮助将医学聊天模型的输出与人类偏好对齐,并鼓励模型生成更受欢迎的输出。

4.4 Case Study

We examine the model outputs for Medical Dialogue and Medical Question Answering tasks using examples from Huatuo-26M and CMExam. As shown in Table 2, Baichuan-7B’s response appears detached from the conversation’s context, often leading to unnatural sentence transitions and run-on sentences in the Chinese generation. The incorporation of the CPT and SFT stages significantly refines Baichuan-7B’s medical acumen, leading to more relevant and informed responses, a trend further evident in Table 3. However, certain responses still exhibited run-on sentences, highlighting the need for further refinement. Notably, outputs from Qilin-7B-DPO stand out, aligning closely with human expectations in both accuracy and context, emphasizing the pivotal role and efficacy of the DPO stage in enhancing model outputs, while also addressing the aforementioned linguistic challenges.

4.4 案例研究

我们检查了模型在医学对话和医学问答任务中的输出,使用的例子来自Huatuo-26M和CMExam。如表2所示,Baichuan-7B的响应与对话的上下文脱节,常常导致中文生成中的句子过渡不自然和长句。引入CPT和SFT阶段显著提高了Baichuan-7B的医学知识,使得响应更加相关和知识渊博,这一趋势在表3中更为明显。然而,某些响应仍然存在长句,凸显了进一步改进的必要性。值得注意的是,Qilin-7B-DPO的输出在准确性和上下文方面与人类期望非常接近,强调了DPO阶段在提升模型输出方面的重要作用和效果,同时也解决了前述的语言学挑战。

5 Limitations

The introduction of Qilin-Med, trained on the ChiMed dataset, marks a significant advancement in medical LLMs. However, several limitations should be acknowledged. The ChiMed dataset,while comprehensive, primarily focuses on Chi-nese medical knowledge, potentially limiting the model’s global applicability. The multi-stage training pipeline, including the DPO stage, might introduce biases based on the preferences of the human evaluators involved. Furthermore, while metrics like BLEU and ROUGE provide insights into the model’s performance, they might not capture the complete picture, especially in nuanced medical scenarios. Future work should consider a more diverse set of evaluation metrics, including human evaluations, to ensure a holistic understanding of Qilin-Med’s capabilities

5 局限性

Qilin-Med的引入,在ChiMed数据集上的训练,标志着医学LLM的一个重大进步。然而,应该承认几个局限性。ChiMed数据集虽然全面,但主要关注中文医学知识,可能限制了模型的全球适用性。多阶段训练管道,包括DPO阶段,可能会引入基于人类评估者偏好的偏见。此外,虽然BLEU和ROUGE等指标提供了模型性能的洞见,但它们可能无法捕捉到完整的画面,特别是在微妙的医学场景中。未来的工作应考虑更多样化的评估指标集,包括人类评估,以确保对Qilin-Med能力的全面理解。

Figure 2: A case on Huatuo-26M dialogue dataset.

图2:华图-26M对话数据集上的一个案例。

6 Ethics and Societal Impacts

We do not recruit any human research participants for this study. To prepare the data, the information was made anonymous in accordance with the regulations set by the Health Insurance Portability and Accountability Act (HIPAA), ensuring that the protected health information was de-identified. The creation and utilization of the ChiMed dataset adhered to stringent ethical standards, ensuring the authenticity and accuracy of the medical knowl edge it encapsulates. However, it is crucial to em phasize that Qilin-Med and ChiMed are intended for research and academic purposes. Commercial exploitation or any use that deviates from this primary objective is strictly discouraged. Researchers and practitioners are urged to respect these guidelines, ensuring the ethical and responsible use of Qilin-Med and the associated dataset. The development of Qilin-Med aims to enhance the capabilities of LLMs in the medical domain. However, it is paramount to understand that Qilin-Med is not a replacement for human medical expertise. It should not be used for direct patient diagnosis or as a standalone tool for medical decision-making. Any conclusions or insights derived from Qilin-Med should be contextualized, considering the specific focus of ChiMed and the inherent limitations of LLMs. The primary intent behind Qilin-Med is to aid research, and its use should be confined to this scope to prevent potential misuse.

6 伦理和社会影响

本研究没有招募任何人类研究参与者。为了准备数据,根据健康保险可携带性和责任法案(HIPAA)的规定,对信息进行了匿名化处理,确保受保护的医疗信息被去识别化。ChiMed数据集的创建和使用遵循了严格的伦理标准,确保了其封装的医学知识的真实性和准确性。然而,必须强调的是,Qilin-Med和ChiMed旨在用于研究和学术目的。强烈反对任何偏离这一主要目标的商业利用或其他用途。研究人员和从业者被敦促遵守这些指导原则,确保对Qilin-Med和相关数据集的伦理和负责任的使用。Qilin-Med的发展旨在提高LLM在医学领域的 capabilities 能力。然而,至关重要的是要理解Qilin-Med不是人类医学专长的替代品。它不应用于直接的患者诊断或作为医疗决策的独立工具。从Qilin-Med得出的任何结论或见解都应结合ChiMed的具体关注点和LLM的固有局限性来考虑。Qilin-Med背后的主要意图是辅助研究,其使用应局限于这一范围,以防止潜在的滥用。

7 Conclusion & Future Work

This study introduces a multi-stage training approach, a large-scale Chinese medicine dataset -ChiMed, and Qilin-Med, a cutting-edge Chinese medical language model. It demonstrates the potential of domain-specific training in healthcare, with implications for patient care, clinical decisions, and medical research. Qilin-Med’s refined outputs, especially post-DPO stage, enable more accurate and context-aware medical dialogues, forerunning a new era of AI-driven medical insights and interventions.

7 结论与未来工作

本研究介绍了一种多阶段训练方法,一个大规模的中文医学数据集——ChiMed,以及Qilin-Med,一个先进的中文医学语言模型。它展示了特定领域训练在医疗保健方面的潜力,对患者护理、临床决策和医学研究具有影响。特别是经过DPO阶段后的Qilin-Med的优化输出,使医学对话更加准确和上下文感知,预示着AI驱动的医学见解和干预的新时代。

Figure 3: A case on CMExam dataset

图3:CMExam数据集上的一个案例。


53AI,企业落地大模型首选服务商

产品:场景落地咨询+大模型应用平台+行业解决方案

承诺:免费POC验证,效果达标后再合作。零风险落地应用大模型,已交付160+中大型企业

联系我们

售前咨询
186 6662 7370
预约演示
185 8882 0121

微信扫码

添加专属顾问

回到顶部

加载中...

扫码咨询

扫码登录
登录即表示您同意《53AI网站服务协议》
服务协议

欢迎您使用【53AI 官方网站】(以下简称“本网站”或“我们”)。本《会员服务协议》(以下简称“本协议”)是您(以下简称“会员”或“用户”)与【深圳市博思协创网络科技有限公司】之间关于注册、登录及使用本网站会员服务所订立的法律协议。

在您注册或登录前,请务必审慎阅读、充分理解各条款内容,特别是免除或限制责任的条款、知识产权条款、争议解决条款等。此类条款将以加粗形式提示您注意。 当您通过微信公众号授权、手机验证码验证或其他方式成功登录本网站时,即视为您已完全理解并同意接受本协议的全部内容。

一、 定义

本网站:指由【深圳市博思协创网络科技有限公司】运营的,域名为【53ai.com】的网站及相关移动端页面。

会员服务:指本网站向注册会员提供的知识库文章查阅、内容检索及其他相关增值服务。

知识库内容:指本网站发布的包括但不限于文字、图表、数据、研究报告、行业分析等数字化内容资源。

二、 账号注册与登录

登录方式:本网站支持以下登录方式,您可根据实际情况选择:

微信公众号授权登录:您同意将您的微信OpenID信息授权给本网站,用于创建或关联会员账号。

手机验证码登录:您需提供真实有效的手机号码,并通过短信验证码完成身份验证与登录/注册。

账号安全:您的账号仅限您本人使用,禁止赠与、借用、租用、转让或售卖。因您保管不善导致的账号被盗、密码泄露等损失,由您自行承担。

实名认证:根据相关法律法规要求,我们可能要求您在特定功能下完成实名认证。如您拒绝提供,可能无法使用部分或全部服务。

未成年人保护:若您未满18周岁,请在法定监护人的陪同下阅读本协议,并在征得监护人同意后使用本服务。

三、 服务内容与规范

知识库查阅权限:会员登录后,有权按照其会员等级对应的权限范围,在线浏览、检索本网站知识库中的相关文章及内容。

服务变更:我们有权根据业务发展需要,调整、变更或终止部分服务内容,并将以网站公告、公众号消息等方式提前通知。

禁止行为:您在使用服务时不得实施以下行为:

利用技术手段批量爬取、下载、转存知识库内容;

将知识库内容用于商业目的或未经授权地向第三方传播;

干扰本网站正常运行或侵犯其他用户合法权益;

发布违法违规信息或从事违反公序良俗的活动。

四、 知识产权声明

权利归属:本网站知识库中的排版设计、软件代码等内容的知识产权均归【公司全称】或原权利人所有,受《中华人民共和国著作权法》等法律保护。

有限许可:本网站授予会员一项非独占、不可转让、不可转授权的普通许可,仅限于个人学习、研究之目的在线查阅知识库内容。

侵权追责:未经书面许可,任何单位或个人不得以任何形式复制、转载、摘编、镜像、汇编或以其他方式使用上述内容。一经发现,我们保留追究其法律责任的权利。

五、 个人信息保护

我们重视对您个人信息的保护。关于我们如何收集、使用、存储和保护您的个人信息,请单独阅读 《隐私政策》。

您通过微信公众号授权或手机号验证所提供的信息,我们将严格按照《个人信息保护法》的规定处理,仅用于身份识别、服务提供及安全验证等必要用途。

您可以随时通过网站设置或联系客服行使查阅、更正、删除个人信息及撤回授权同意的权利。

六、 免责声明

内容准确性:知识库内容仅供参考,不构成专业建议。我们不对其完整性、准确性、时效性作任何明示或暗示的保证,您应自行判断并承担使用风险。

不可抗力:因自然灾害、政策法规变化、网络故障、第三方平台接口异常(如微信接口维护、运营商短信通道故障)等不可抗力导致的服务中断或延迟,我们不承担违约责任。

第三方链接:本网站可能包含指向第三方网站的链接,该等网站的内容和服务不受我们控制,请您自行甄别风险。

七、 违约责任

如您违反本协议约定,我们有权视情节采取警告、限制功能、暂停服务、注销账号等措施,并保留要求赔偿损失的权利。

如因您的违约行为导致我们遭受行政处罚、第三方索赔或商誉损失,您应承担全部赔偿责任(包括但不限于罚款、赔偿金、律师费、公证费等)。

八、 法律适用与争议解决

本协议的订立、执行和解释均适用中华人民共和国大陆地区法律。

因本协议产生的或与本协议有关的任何争议,双方应友好协商解决;协商不成的,任何一方均可向【公司所在地】有管辖权的人民法院提起诉讼。

九、 其他

本协议构成双方就本服务达成的完整协议,取代此前任何口头或书面约定。

本协议任一条款被认定为无效或不可执行的,不影响其他条款的效力。

我们对本协议享有最终解释权,并在法律允许的范围内保留随时修改的权利。修改后的协议一经公布即生效,继续使用服务即视为同意修订内容。


已查阅