LlamaIndex: An Enterprise-Grade Knowledge Assistant Where Everything Is Knowable

Published: 2025-01-11 15:10:04

Author: AI零壹白洞


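On January 9, LlamaIndex introduced a new architecture, Agentic Document Workflows (ADW). According to the conclusions and cases on its official site, ADW goes beyond retrieval-augmented generation (RAG) pipelines and makes agents more productive. As orchestration frameworks keep improving, this approach gives organizations an option for strengthening the decision-making ability of their agents.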

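LlamaIndex says ADW can help agents manage complex workflows that go beyond simple extraction or matching. Many agent frameworks on the market today are built on RAG systems, but a RAG system only supplies the agent with the information it needs to complete a task. It does not let the agent make decisions based on that information.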

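LlamaIndex gives some real-world examples of how ADW works. In contract review, for instance, a human analyst must extract key information, cross-reference regulatory requirements, identify potential risks, and make recommendations (the end of this article walks through this case). When deployed in that workflow, an AI agent would ideally follow the same pattern and make decisions based on the documents it reads for the contract review and on knowledge from other documents.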

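For more background on orchestration and agents, see how other products approach the problem:

Cloudflare: You write the workflow, we handle the rest
Dify: A workflow for low-code AI development
AI agents: Applying them to real enterprise scenarios
Agent Q: A next-generation AI agent with planning and self-healing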

(Original image from the appendix)

Agentic Document Workflows

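ADW is an architecture that combines document processing, retrieval, structured output, and agentic orchestration to automate knowledge work end to end. It goes beyond the traditional intelligent document processing (IDP) and RAG paradigms (see "An introduction to how RAG works" and "Document reranking techniques to improve RAG performance") and helps deliver on the promise that agents can substantially raise knowledge productivity.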

IDP (Intelligent Document Processing)

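Intelligent document processing is a technology that combines artificial intelligence (AI), machine learning (ML), natural language processing (NLP), computer vision (CV), and related techniques to automatically extract and process data from unstructured documents. It turns manual data entry from paper documents or document images into an automated process that can be integrated with other digital business processes.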

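In plain terms, it is a technology that automatically recognizes and extracts valuable data from all kinds of documents (scanned forms, PDF files, emails, and so on) and converts it into the required format. It is also known as cognitive document processing, intelligent document recognition, or intelligent document capture.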

(Original image from the appendix)
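
To make the "extract and convert to the required format" idea concrete, here is a minimal sketch. It assumes the document has already been turned into plain text (for example by OCR). The field names, the regular expressions, and the sample text are all illustrative assumptions; real IDP systems rely on ML models rather than hand-written patterns.

```python
import json
import re

# Hypothetical patterns for the fields we want to capture.
# A real IDP system would use ML models, not hand-written regular expressions.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.I),
    "total_amount": re.compile(r"Total\s*:?\s*\$?([0-9][0-9,]*\.[0-9]{2})", re.I),
    "due_date": re.compile(r"Due\s*date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
}


def extract_fields(text: str) -> dict:
    """Return one entry per field; fields that are not found map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


sample_text = """Hello, please find the details below.
Invoice No: INV-2025-001
Total: $1,234.50
Due date: 2025-02-01
"""

record = extract_fields(sample_text)
# Convert the extracted data into the "required format", here JSON.
print(json.dumps(record, indent=2))
```

The structured record can then be handed to downstream business systems, which is the integration step the paragraph above describes.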

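LlamaIndex has developed reference architectures that combine its LlamaCloud parsing capabilities with agents. The result is a system that can understand context, maintain state, and drive multi-step processes. To do this, each workflow has an orchestrator that directs agents to use LlamaParse to extract information from the data, maintains the state of the document context and the process, and then retrieves reference material from another knowledge base.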

(Original image from the appendix)

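"By maintaining state throughout the process, agents can handle complex multi-step workflows that go beyond simple extraction or matching. This approach allows them to build deep context about the documents they are processing while coordinating between different system components."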
LlamaIndex
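
The orchestrator idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `WorkflowState`, `Orchestrator`, `parse_step`, and `make_retrieve_step` are hypothetical names, the parsing step is a stand-in for a service such as LlamaParse, and the knowledge base is an in-memory dictionary. It is not LlamaIndex's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorkflowState:
    """State the orchestrator carries from one step to the next."""
    document: str
    extracted: Dict[str, str] = field(default_factory=dict)
    references: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)


def parse_step(state: WorkflowState) -> None:
    # Stand-in for a parsing service: split "key: value" lines into fields.
    for line in state.document.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            state.extracted[key.strip().lower()] = value.strip()


def make_retrieve_step(knowledge_base: Dict[str, str]) -> Callable[[WorkflowState], None]:
    # Stand-in for retrieval from a separate knowledge base, keyed by field name.
    def retrieve_step(state: WorkflowState) -> None:
        for key in state.extracted:
            if key in knowledge_base:
                state.references.append(knowledge_base[key])
    return retrieve_step


class Orchestrator:
    """Runs the steps in order and records which ones have completed."""

    def __init__(self, steps: List[Callable[[WorkflowState], None]]) -> None:
        self.steps = steps

    def run(self, document: str) -> WorkflowState:
        state = WorkflowState(document=document)
        for step in self.steps:
            step(state)
            state.history.append(step.__name__)
        return state


kb = {"governing law": "Guideline: contracts must name a governing law."}
orchestrator = Orchestrator([parse_step, make_retrieve_step(kb)])
final_state = orchestrator.run("Vendor: Acme\nGoverning law: Germany")
print(final_state.extracted, final_state.references, final_state.history)
```

Because every step reads and writes the same state object, later steps can build on what earlier steps learned about the document, which is the point the quote makes.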

Beyond Basic RAG

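RAG has become a powerful pattern for applying LLMs to enterprise data, but many real document workflows need more sophisticated orchestration. Consider a typical contract review workflow: the analyst needs to extract key clauses, cross-reference regulatory requirements, identify potential risks, and generate compliance recommendations. That takes structured reasoning and decision support, not just information retrieval. Pure token prediction over the context can hallucinate, and even with RAG the knowledge can be one-sided because the retrieved objects are not linked to each other. For example, the analyst may learn the amount stated in a key clause but cannot find out when that clause was last modified or how many people took part in negotiating it. (This kind of linkage can be addressed with Graph RAG, which is already available on the market; see "Graph RAG: The future of intelligent search".)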
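
The missing linkage can be illustrated with a tiny in-memory graph. This is a toy sketch of the linked-lookup idea only, not a Graph RAG implementation; the node names, relation names, and values are all made up for the example.

```python
from collections import defaultdict

# Edges are (source, relation, target) triples; every value here is invented.
edges = [
    ("clause:payment", "has_amount", "USD 50,000"),
    ("clause:payment", "modified_on", "2024-11-03"),
    ("clause:payment", "discussed_by", "Alice"),
    ("clause:payment", "discussed_by", "Bob"),
]

# graph[source][relation] -> list of targets
graph = defaultdict(lambda: defaultdict(list))
for source, relation, target in edges:
    graph[source][relation].append(target)


def related(node: str, relation: str) -> list:
    """Follow one relation from a node; returns [] when nothing is linked."""
    return list(graph.get(node, {}).get(relation, []))


print(related("clause:payment", "has_amount"))
print(related("clause:payment", "modified_on"))
print(related("clause:payment", "discussed_by"))
```

A plain chunk retriever would typically surface only the text containing the amount. With explicit edges, the modification date and the participants are reachable from the same clause node.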

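In a real enterprise, then, documents do not exist in isolation. Contracts, policies, emails, and spreadsheets work together, and decisions span multiple steps, from data extraction to validation, approval, and recommendation. Maintaining context and state throughout this process technically requires parsers, retrievers, and a business logic engine. An ADW system can maintain state across steps, apply business rules, coordinate different components, and act on document content rather than merely analyze it.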

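LlamaCloud is LlamaIndex's managed ingestion and indexing service. It is a cloud-based service that uploads, parses, and indexes documents, which can then be searched from a knowledge base using LlamaIndex. LlamaParse is a component of LlamaCloud that parses documents into structured data while maintaining the state of the document context and the process stage. It is available as a standalone REST API, a Python package, and a web UI.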

Contract Review Case

(Original image from the appendix)

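Reading the figure above from left to right, the flow is essentially built on the RAG idea. LlamaIndex uses LlamaParse to parse the document into chunks that are turned into embeddings. These embeddings are matched against related data in the knowledge base, forming a graph, and the contract is finally reviewed against the original text: is it reasonable, and has someone approved it?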
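
The "match embeddings against the knowledge base" step boils down to a top-k similarity search. The sketch below is a toy version under clear assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the guideline sentences are invented, and `retrieve` is a hypothetical helper rather than a LlamaIndex API.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Return the top_k corpus entries most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


guidelines = [
    "personal data transfer to third countries requires safeguards",
    "processing of personal data requires a lawful basis",
    "records must be kept for six years",
]
hits = retrieve("vendor may transfer personal data to third parties", guidelines, top_k=2)
for score, text in hits:
    print(f"{score:.2f}  {text}")
```

The `top_k` argument plays the same role as a retriever's similarity cutoff: only the closest few guideline chunks are passed on to the review step.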

The details are described in code below:

from typing import List, Optional

from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
from llama_index.llms.openai import OpenAI
from llama_parse import LlamaParse
from pydantic import BaseModel, Field

# Set up the index
index = LlamaCloudIndex(
    name="gdpr",
    project_name="llamacloud_demo",
    organization_id="cdcb3478-1348-492e-8aa0-25f47d1a3902",
    # api_key="llx-...",
)

# Set up the retriever used for RAG.
# similarity_top_k=2 returns the two most similar chunks.
retriever = index.as_retriever(similarity_top_k=2)

# Set up the parser
parser = LlamaParse(result_type="markdown")

# Set up the environment: the LLM, the parser, and the retriever.
# ContractReviewWorkflow is defined further below, so run this after the class definition.
llm = OpenAI(model="gpt-4o")
workflow = ContractReviewWorkflow(
    parser=parser,
    guideline_retriever=retriever,
    llm=llm,
    verbose=True,
    timeout=None,
)

# Set up the contract output schema
class ContractClause(BaseModel):
    clause_text: str = Field(..., description="The exact text of the clause.")
    mentions_data_processing: bool = Field(
        False, description="True if the clause involves personal data collection or usage."
    )
    mentions_data_transfer: bool = Field(
        False,
        description="True if the clause involves transferring personal data, especially to third parties or across borders.",
    )
    requires_consent: bool = Field(
        False, description="True if the clause explicitly states that user consent is needed for data activities."
    )
    specifies_purpose: bool = Field(
        False, description="True if the clause specifies a clear purpose for data handling or transfer."
    )
    mentions_safeguards: bool = Field(
        False, description="True if the clause mentions security measures or other safeguards for data."
    )


class ContractExtraction(BaseModel):
    vendor_name: Optional[str] = Field(None, description="The vendor's name if identifiable.")
    effective_date: Optional[str] = Field(None, description="The effective date of the agreement, if available.")
    governing_law: Optional[str] = Field(None, description="The governing law of the contract, if stated.")
    clauses: List[ContractClause] = Field(
        ..., description="List of extracted clauses and their relevant indicators."
    )

# Set up the contract check schema
class GuidelineMatch(BaseModel):
    guideline_text: str = Field(
        ..., description="The single most relevant guideline excerpt related to this clause."
    )
    similarity_score: float = Field(
        ...,
        description="Similarity score indicating how closely the guideline matches the clause, e.g., between 0 and 1.",
    )
    relevance_explanation: Optional[str] = Field(
        None, description="Brief explanation of why this guideline is relevant."
    )


# Per-clause check that nests the guideline match
class ClauseComplianceCheck(BaseModel):
    clause_text: str = Field(..., description="The exact text of the clause from the contract.")
    matched_guideline: Optional[GuidelineMatch] = Field(
        None, description="The most relevant guideline extracted via vector retrieval."
    )
    compliant: bool = Field(
        ..., description="Indicates whether the clause is considered compliant with the referenced guideline."
    )
    notes: Optional[str] = Field(None, description="Additional commentary or recommendations.")

# Set up the Contract Review Workflow
# 1. Extract structured data from the agreement.
# 2. For each clause, retrieve guidelines and check the clause against the check schema above.
# 3. Generate the final summary and verdict.
import json
import logging
import os
from pathlib import Path
from typing import List, Optional

from llama_index.core import SimpleDirectoryReader
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.llms import LLM, ChatMessage, MessageRole
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import Document
from llama_index.core.workflow import (
    Context,
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)
from pydantic import BaseModel

# Set up logging
_logger = logging.getLogger(__name__)
_logger.setLevel(logging.INFO)

# Prompts start here
# Extraction prompt
CONTRACT_EXTRACT_PROMPT = """\
You are given contract data below. \
Please extract out relevant information from the contract into the defined schema - the schema is defined as a function call.\

{contract_data}
"""

# Prompt for matching a clause against the knowledge base
CONTRACT_MATCH_PROMPT = """\
Given the following contract clause and the corresponding relevant guideline text, evaluate the compliance \
and provide a JSON object that matches the ClauseComplianceCheck schema.

**Contract Clause:**
{clause_text}

**Matched Guideline Text(s):**
{guideline_text}
"""

# System prompt for the final compliance report
COMPLIANCE_REPORT_SYSTEM_PROMPT = """\
You are a compliance reporting assistant. Your task is to generate a final compliance report \
based on the results of clause compliance checks against \
a given set of guidelines.

Analyze the provided compliance results and produce a structured report according to the specified schema.
Ensure that if there are no noncompliant clauses, the report clearly indicates full compliance.
"""

# User prompt describing the report output format
COMPLIANCE_REPORT_USER_PROMPT = """\
A set of clauses within a contract were checked against GDPR compliance guidelines for the following vendor: {vendor_name}.
The set of noncompliant clauses are given below.

Each section includes:
- **Clause:** The exact text of the contract clause.
- **Guideline:** The relevant GDPR guideline text.
- **Compliance Status:** Should be `False` for noncompliant clauses.
- **Notes:** Additional information or explanations.

{compliance_results}

Based on the above compliance results, generate a final compliance report following the `ComplianceReport` schema below.
If there are no noncompliant clauses, the report should indicate that the contract is fully compliant.
"""

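# Events passed between workflow steps
class ContractExtractionEvent(Event):
    contract_extraction: ContractExtraction


class MatchGuidelineEvent(Event):
    clause: ContractClause
    # Declared explicitly because the dispatch step passes vendor_name along with each clause
    vendor_name: Optional[str] = None


class MatchGuidelineResultEvent(Event):
    result: ClauseComplianceCheck


class GenerateReportEvent(Event):
    match_results: List[ClauseComplianceCheck]


class LogEvent(Event):
    msg: str
    delta: bool = False
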
# Core workflow code
class ContractReviewWorkflow(Workflow):
    """Contract review workflow."""

    def __init__(
        self,
        parser: LlamaParse,
        guideline_retriever: BaseRetriever,
        llm: LLM | None = None,
        similarity_top_k: int = 20,
        output_dir: str = "data_out",
        **kwargs,
    ) -> None:
        """Init params."""
        super().__init__(**kwargs)

        # Reuse the LlamaIndex components and environment configured earlier
        self.parser = parser
        self.guideline_retriever = guideline_retriever
        self.llm = llm or OpenAI(model="gpt-4o-mini")
        self.similarity_top_k = similarity_top_k

        # Create the output directory if it does not exist
        out_path = Path(output_dir) / "workflow_output"
        if not out_path.exists():
            out_path.mkdir(parents=True, exist_ok=True)
            os.chmod(str(out_path), 0o0777)
        self.output_dir = out_path

    @step
    async def parse_contract(self, ctx: Context, ev: StartEvent) -> ContractExtractionEvent:
        # Reuse a cached extraction from a previous run if one exists
        contract_extraction_path = Path(f"{self.output_dir}/contract_extraction.json")
        if contract_extraction_path.exists():
            if self._verbose:
                ctx.write_event_to_stream(LogEvent(msg=">> Loading contract from cache"))
            with open(contract_extraction_path, "r") as fp:
                contract_extraction_dict = json.load(fp)
            contract_extraction = ContractExtraction.model_validate(contract_extraction_dict)
        else:
            if self._verbose:
                ctx.write_event_to_stream(LogEvent(msg=">> Reading contract"))

            # Load the contract file. This snippet reads it with SimpleDirectoryReader;
            # self.parser (LlamaParse) is stored on the workflow but not used here.
            docs = SimpleDirectoryReader(input_files=[ev.contract_path]).load_data()

            # Build the extraction prompt
            prompt = ChatPromptTemplate.from_messages([("user", CONTRACT_EXTRACT_PROMPT)])
            # Wait for the LLM result; arguments are the output schema, the prompt,
            # and the document text
            contract_extraction = await self.llm.astructured_predict(
                ContractExtraction,
                prompt,
                contract_data="\n".join([d.get_content(metadata_mode="all") for d in docs]),
            )
            if not isinstance(contract_extraction, ContractExtraction):
                raise ValueError(f"Invalid extraction from contract: {contract_extraction}")
            # Cache the extraction on disk
            with open(contract_extraction_path, "w") as fp:
                fp.write(contract_extraction.model_dump_json())
        if self._verbose:
            ctx.write_event_to_stream(
                LogEvent(msg=f">> Contract data: {contract_extraction.model_dump()}")
            )

        return ContractExtractionEvent(contract_extraction=contract_extraction)

    @step
    async def dispatch_guideline_match(self, ctx: Context, ev: ContractExtractionEvent) -> MatchGuidelineEvent:
        """For each clause in the contract, find relevant guidelines.

        Use a map-reduce pattern.
        """
        await ctx.set("num_clauses", len(ev.contract_extraction.clauses))
        await ctx.set("vendor_name", ev.contract_extraction.vendor_name)

        # Fan out: one event per clause
        for clause in ev.contract_extraction.clauses:
            ctx.send_event(
                MatchGuidelineEvent(clause=clause, vendor_name=ev.contract_extraction.vendor_name)
            )

    # Match a clause against the knowledge base
    @step
    async def handle_guideline_match(self, ctx: Context, ev: MatchGuidelineEvent) -> MatchGuidelineResultEvent:
        """Handle matching clause against guideline."""

        # Build the retrieval query (an f-string, so the placeholders are filled in)
        query = f"""Please find the relevant guideline from {ev.vendor_name} that aligns with the following contract clause:

{ev.clause.clause_text}
"""
        # Query the knowledge base embeddings
        guideline_docs = self.guideline_retriever.retrieve(query)
        guideline_text = "\n\n".join([g.get_content() for g in guideline_docs])
        if self._verbose:
            ctx.write_event_to_stream(
                LogEvent(msg=f">> Found guidelines: {guideline_text[:200]}...")
            )

        # Build the compliance-check prompt
        prompt = ChatPromptTemplate.from_messages([("user", CONTRACT_MATCH_PROMPT)])
        # Wait for the LLM to judge the clause against the retrieved guidelines; arguments are
        # the output schema, the prompt, the clause as JSON, and the retrieved guideline text
        compliance_output = await self.llm.astructured_predict(
            ClauseComplianceCheck,
            prompt,
            clause_text=ev.clause.model_dump_json(),
            guideline_text=guideline_text,
        )

        if not isinstance(compliance_output, ClauseComplianceCheck):
            raise ValueError(f"Invalid compliance check: {compliance_output}")

        return MatchGuidelineResultEvent(result=compliance_output)

    # Gather the match results
    @step
    async def gather_guideline_match(self, ctx: Context, ev: MatchGuidelineResultEvent) -> GenerateReportEvent:
        """Collect the result for every clause, then hand them to report generation."""
        num_clauses = await ctx.get("num_clauses")
        events = ctx.collect_events(ev, [MatchGuidelineResultEvent] * num_clauses)
        if events is None:
            # Not every clause result has arrived yet
            return None

        match_results = [e.result for e in events]
        # Save the match results, one JSON object per line
        match_results_path = Path(f"{self.output_dir}/match_results.jsonl")
        with open(match_results_path, "w") as fp:
            for mr in match_results:
                fp.write(mr.model_dump_json() + "\n")

        return GenerateReportEvent(match_results=match_results)

    # Output
    @step
    async def generate_output(self, ctx: Context, ev: GenerateReportEvent) -> StopEvent:
        if self._verbose:
            ctx.write_event_to_stream(LogEvent(msg=">> Generating Compliance Report"))

        # Keep only the non-compliant clauses; an empty list means full compliance
        non_compliant_results = [r for r in ev.match_results if not r.compliant]

        # Build the compliance results string
        result_tmpl = """
1. **Clause**: {clause}
2. **Guideline:** {guideline}
3. **Compliance Status:** {compliance_status}
4. **Notes:** {notes}
"""
        non_compliant_strings = []
        for nr in non_compliant_results:
            non_compliant_strings.append(
                result_tmpl.format(
                    clause=nr.clause_text,
                    # matched_guideline is Optional, so guard against None
                    guideline=nr.matched_guideline.guideline_text if nr.matched_guideline else "N/A",
                    compliance_status=nr.compliant,
                    notes=nr.notes,
                )
            )
        non_compliant_str = "\n\n".join(non_compliant_strings)

        prompt = ChatPromptTemplate.from_messages([
            ("system", COMPLIANCE_REPORT_SYSTEM_PROMPT),
            ("user", COMPLIANCE_REPORT_USER_PROMPT),
        ])
        # NOTE: ComplianceReport is not defined in this article; it must be a
        # pydantic model declared before this step runs.
        compliance_report = await self.llm.astructured_predict(
            ComplianceReport,
            prompt,
            compliance_results=non_compliant_str,
            vendor_name=await ctx.get("vendor_name"),
        )

        return StopEvent(
            result={"report": compliance_report, "non_compliant_results": non_compliant_results}
        )
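
The dispatch_guideline_match, handle_guideline_match, and gather_guideline_match steps above follow a map-reduce pattern: fan out one check per clause, wait for all of them, then reduce the results into a report. The sketch below shows the same pattern with only the standard library. It is an illustration under stated assumptions: `check_clause` and `review` are hypothetical names, and a keyword test stands in for the LLM call.

```python
import asyncio
from typing import Dict, List


async def check_clause(clause: str, banned_terms: List[str]) -> Dict[str, object]:
    """'Map' step: check one clause. A keyword test stands in for the LLM call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    hits = [term for term in banned_terms if term in clause.lower()]
    return {"clause": clause, "compliant": not hits, "notes": hits}


async def review(clauses: List[str], banned_terms: List[str]) -> Dict[str, object]:
    # Fan out: one coroutine per clause, run concurrently; gather keeps input order.
    results = await asyncio.gather(*(check_clause(c, banned_terms) for c in clauses))
    # Reduce: keep only the non-compliant results and summarize.
    non_compliant = [r for r in results if not r["compliant"]]
    return {"overall_compliant": not non_compliant, "non_compliant": non_compliant}


clauses = [
    "Data is stored in the EU.",
    "Vendor may sell personal data to advertisers.",
]
report = asyncio.run(review(clauses, banned_terms=["sell personal data"]))
print(report)
```

In the workflow itself the framework plays the role of `asyncio.gather`: `ctx.send_event` fans out the clauses and `ctx.collect_events` waits until the expected number of results has arrived.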