Chunking Optimization Logic for RAG in an Enterprise Agent System

Published: 2026-06-10 09:43:35 Views: 1516

Author: 宋悦的爱码士

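Back after a long absence. I have been busy lately, not only optimizing this project but also exploring so-called voice-driven automated programming, which I hope to discuss with everyone later.

Chunking optimization

The simplest effective approach beyond a demo

Cloud vendors are always a partner you can trust. If spending a little money solves the problem, there is no need to write code.

This round of optimization

Three chunking strategies are implemented with the strategy pattern: sliding window, semantic, and proposition. Use them as a reference and extend them in real development.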

File	Path	Description
ChunkingStrategyType	`bootstrap/rag/chunking/ChunkingStrategyType.java`	Strategy type enum
ChunkingStrategy	`bootstrap/rag/chunking/ChunkingStrategy.java`	Strategy interface
ChunkingStrategyFactory	`bootstrap/rag/chunking/ChunkingStrategyFactory.java`	Strategy factory
ChunkingRequest	`bootstrap/rag/chunking/ChunkingRequest.java`	Chunking request
ChunkingResult	`bootstrap/rag/chunking/ChunkingResult.java`	Chunking result
DocumentChunk	`bootstrap/rag/chunking/DocumentChunk.java`	Document chunk
SlidingWindowChunkingStrategy	`bootstrap/rag/chunking/impl/SlidingWindowChunkingStrategy.java`	Sliding-window implementation
SemanticChunkingStrategy	`bootstrap/rag/chunking/impl/SemanticChunkingStrategy.java`	Semantic chunking implementation
PropositionChunkingStrategy	`bootstrap/rag/chunking/impl/PropositionChunkingStrategy.java`	Proposition chunking implementation
MultimodalChunkBuilder	`bootstrap/rag/chunking/MultimodalChunkBuilder.java`	Multimodal chunk builder
ChunkingService	`bootstrap/admin/service/ChunkingService.java`	Chunking business service
DocumentProcessingJobService	`bootstrap/admin/service/DocumentProcessingJobService.java`	Document processing pipeline
KnowledgeController	`bootstrap/admin/controller/KnowledgeController.java`	API controller

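1. Why chunking needs optimization

In a RAG (Retrieval-Augmented Generation) system, document chunking is the first gate that affects retrieval quality. Chunk quality directly determines:

Recall: chunks that are too coarse bury key information in noise; chunks that are too fine lose context
Precision: with badly placed boundaries, information that spans a boundary may be missed at retrieval time
Semantic completeness: ideally, each chunk is a semantically self-contained unit

Problems with traditional fixed-size splitting

The simplest approach cuts the text every fixed number of characters, but it causes serious problems:

Original: "...In 2024 company revenue reached 5 billion yuan. East China contrib|uted 32%, South Chi|na contributed 28%..."
                                                                                ↑ hard cut points   ↑

After the cut, the fact "East China contributed 32%" is split in half, so a query such as "East China revenue share" cannot hit it precisely.

How this project approaches the optimization

The project implements three chunking strategies, from coarse to fine:

Strategy	Core idea	Use case	Dependency
SLIDING_WINDOW	Fixed window + overlap + paragraph awareness	General use, no extra dependencies	None
SEMANTIC	Drops in embedding similarity serve as split boundaries	Documents with rich semantic structure	EmbeddingModel
PROPOSITION	An LLM decomposes text into atomic propositions, which are then regrouped	Fine-grained fact retrieval	ChatModel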

2. Overall architecture

2.1 Class diagram overview

                    ┌──────────────────────┐
                    │   ChunkingStrategy   │  ◄── strategy interface
                    │──────────────────────│
                    │ + type()             │
                    │ + chunk(request)     │
                    │ + validate(config)   │
                    │ + getDefaultConfig() │
                    │ + getVersion()       │
                    └──────────┬───────────┘
                               │ implements
              ┌────────────────┼────────────────┐
              │                │                │
   ┌──────────▼─────┐ ┌───────▼──────┐ ┌──────▼────────┐
   │ SlidingWindow  │ │  Semantic    │ │  Proposition  │
   │ ChunkingStrat. │ │  ChunkingStr.│ │  ChunkingStr. │
   └────────────────┘ └──────────────┘ └───────────────┘

                    ┌─────────────────────────┐
                    │ ChunkingStrategyFactory │ ◄── strategy factory
                    │─────────────────────────│
                    │ - strategyMap           │
                    │ + getStrategy(type)     │
                    │ + listStrategies()      │
                    │ + validateConfig()      │
                    └────────────┬────────────┘
                                 │ used by
                    ┌────────────▼────────────┐
                    │     ChunkingService     │ ◄── business service layer
                    │─────────────────────────│
                    │ + chunk()               │
                    │ + preview()             │
                    │ + validateConfig()      │
                    │ + listStrategies()      │
                    └─────────────────────────┘

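2.2 Design patterns: strategy + factory

The module uses the Strategy Pattern to define the family of algorithms and the Factory Pattern to manage strategy registration and lookup.

Key design decisions:

Spring auto-registration: every ChunkingStrategy implementation is annotated with @Component; Spring injects them as a List<ChunkingStrategy>, and the factory registers them in its constructor
Configuration-driven: each strategy receives its parameters through a Map<String, Object> config, which keeps configuration flexible
Strategy fallback: the PROPOSITION strategy automatically falls back to SLIDING_WINDOW when no ChatModel is available
Version tracking: each strategy has a version number, and generated chunks carry strategyVersion to ease later upgrades and migrations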

3. Core data model

3.1 ChunkingRequest: the chunking request

File: com.isy.rag.bootstrap.rag.chunking.ChunkingRequest

@Data
@Builder
public class ChunkingRequest {
    private String content;                    // document text
    private ChunkingStrategyType strategyType; // chunking strategy enum
    private Map<String, Object> config;        // strategy-specific configuration

    // Type-safe configuration reader
    public <T> T getConfigValue(String key, Class<T> type, T defaultValue) { ... }
}

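Design highlight: the getConfigValue method

The method reads configuration values in a type-safe way and supports:

Automatic conversion between Number subclasses (Integer/Long/Double/Float)
Automatic conversion from String
Lenient parsing of Boolean values
A default value as the last resort

As a result, a JSON number sent by the front end (for example 500, which Jackson may deserialize as Integer or Long) is handled correctly either way.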
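The article shows only the signature of getConfigValue, so the following is a minimal sketch of how such a reader could behave, not the project's actual implementation. The class name ConfigValues, the static form that takes the map as a parameter, and the exact Boolean rules ("true" or "1") are assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical helper; the original article only shows the method signature. */
final class ConfigValues {
    private ConfigValues() {}

    @SuppressWarnings("unchecked")
    static <T> T getConfigValue(Map<String, Object> config, String key, Class<T> type, T defaultValue) {
        Object raw = (config == null) ? null : config.get(key);
        if (raw == null) {
            return defaultValue;                 // missing key: use the default
        }
        if (type.isInstance(raw)) {
            return type.cast(raw);               // already the requested type
        }
        try {
            if (raw instanceof Number) {         // Integer/Long/Double/Float interconversion
                Number n = (Number) raw;
                if (type == Integer.class) return (T) Integer.valueOf(n.intValue());
                if (type == Long.class) return (T) Long.valueOf(n.longValue());
                if (type == Double.class) return (T) Double.valueOf(n.doubleValue());
                if (type == Float.class) return (T) Float.valueOf(n.floatValue());
            }
            String s = raw.toString().trim();    // fall back to parsing the string form
            if (type == String.class) return (T) s;
            if (type == Boolean.class) return (T) Boolean.valueOf(s.equalsIgnoreCase("true") || s.equals("1"));
            if (type == Integer.class) return (T) Integer.valueOf(s);
            if (type == Long.class) return (T) Long.valueOf(s);
            if (type == Double.class) return (T) Double.valueOf(s);
            if (type == Float.class) return (T) Float.valueOf(s);
        } catch (NumberFormatException e) {
            // unparsable value: fall through to the default
        }
        return defaultValue;
    }
}
```

With this sketch, a chunkSize that arrives as a Long is still readable as an Integer, and an unparsable value quietly yields the default instead of failing the request.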

3.2 DocumentChunk: a document chunk

File: com.isy.rag.bootstrap.rag.chunking.DocumentChunk

@Data
@Builder
public class DocumentChunk {
    // ===== Basic fields =====
    private String content;           // chunk text
    private int index;                // chunk index
    private int charCount;            // character count
    private Integer tokenCount;       // token count (estimated)
    private Integer startOffset;      // start offset in the source text
    private Integer endOffset;        // end offset in the source text
    private String chunkType;         // TEXT / PROPOSITION / SEMANTIC_SEGMENT / IMAGE / TABLE_IMAGE
    private String strategyVersion;   // strategy version
    private Map<String, Object> metadata; // strategy-specific metadata

    // ===== Multimodal extension fields =====
    private List<Long> assetIds;      // IDs of associated assets
    private Integer blockIndex;       // order within the document (used for mixed multimodal sorting)
    private String sectionTitle;      // title of the enclosing section
    private Integer pageNum;          // page number
    private String pipelineStatus;    // pipeline status
}

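chunkType values:

chunkType	Source strategy	Description
`TEXT`	SLIDING_WINDOW	Plain text chunk
`SEMANTIC_SEGMENT`	SEMANTIC	Semantic segment chunk
`PROPOSITION`	PROPOSITION	Proposition chunk
`IMAGE`	MultimodalChunkBuilder	Image description chunk
`TABLE_IMAGE`	MultimodalChunkBuilder	Table/chart description chunk

Strategy-specific metadata fields:

// Example metadata for the SEMANTIC strategy
{
    "breakpointType": "percentile",
    "boundaryScore": 0.35,        // similarity at the breakpoint
    "mergedFromSmallChunks": true  // whether the chunk was merged from smaller chunks
}

// Example metadata for the PROPOSITION strategy
{
    "propositionRange": "3-7",    // index range of the propositions
    "propositionCount": 5         // number of propositions in the chunk
}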

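3.3 ChunkingResult: the chunking result

@Data
public class ChunkingResult {
    private List<DocumentChunk> chunks;       // resulting chunks
    private ChunkingStrategyType strategyType; // strategy that was used
    private String strategyVersion;           // strategy version
    private long durationMs;                  // processing time in milliseconds
    private int totalChars;                   // total characters in the source document
    private Map<String, Object> metadata;     // additional metadata
}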

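3.4 ChunkingStrategyType: the strategy type enum

public enum ChunkingStrategyType {
    SLIDING_WINDOW("SLIDING_WINDOW", "Sliding-window chunking", "Fixed window plus overlap, so information spanning a boundary is not lost"),
    SEMANTIC("SEMANTIC", "Semantic chunking", "Splits at semantic boundaries, using drops in embedding similarity as split points"),
    PROPOSITION("PROPOSITION", "Proposition chunking", "Uses an LLM to decompose the document into atomic propositions, each an independently verifiable factual statement");

    // fromCode(): returns SLIDING_WINDOW by default when code is null or invalid
}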
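The defaulting behavior of fromCode() is only described in a comment, so here is a small sketch of one way it might look. The enum name StrategyTypeSketch is hypothetical, and the case-insensitive match is an assumption (the API examples later in the article pass lowercase codes such as semantic); the real method may compare differently.

```java
/** Hypothetical stand-in for ChunkingStrategyType; only the lookup rule is illustrated. */
enum StrategyTypeSketch {
    SLIDING_WINDOW, SEMANTIC, PROPOSITION;

    /** Returns the matching type, or SLIDING_WINDOW when the code is null, blank, or unknown. */
    static StrategyTypeSketch fromCode(String code) {
        if (code == null || code.isBlank()) {
            return SLIDING_WINDOW;
        }
        for (StrategyTypeSketch t : values()) {
            // Assumption: codes are matched case-insensitively.
            if (t.name().equalsIgnoreCase(code.trim())) {
                return t;
            }
        }
        return SLIDING_WINDOW;
    }
}
```

Defaulting instead of throwing keeps old or hand-edited configurations working, at the cost of hiding typos in the strategy code, so logging the fallback is worth considering.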

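4. The strategy pattern in detail

4.1 The ChunkingStrategy interface

File: com.isy.rag.bootstrap.rag.chunking.ChunkingStrategy

public interface ChunkingStrategy {
    ChunkingStrategyType type();                           // strategy type identifier
    ChunkingResult chunk(ChunkingRequest request);         // perform chunking (core method)
    void validate(Map<String, Object> config);             // validate configuration parameters
    Map<String, Object> getDefaultConfig();                // return the default configuration
    default String getVersion() { return "v1"; }           // strategy version
}

Interface design principles:

type() is used for factory registration and lookup; each implementation returns a unique identifier
validate() checks parameters before chunking runs, which prevents runtime exceptions
getDefaultConfig() lets the front end display a recommended configuration, lowering the barrier to use
getVersion() returns "v1" by default and is overridden when a strategy is upgraded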

4.2 ChunkingStrategyFactory: the strategy factory

File: com.isy.rag.bootstrap.rag.chunking.ChunkingStrategyFactory

@Slf4j
@Component
public class ChunkingStrategyFactory {
    private final Map<ChunkingStrategyType, ChunkingStrategy> strategyMap = new HashMap<>();

    // Spring injects every ChunkingStrategy implementation
    public ChunkingStrategyFactory(List<ChunkingStrategy> strategies) {
        for (ChunkingStrategy strategy : strategies) {
            strategyMap.put(strategy.type(), strategy);
            log.info("Registered chunking strategy: {} - {}", strategy.type().getCode(), strategy.type().getName());
        }
    }

    public ChunkingStrategy getStrategy(ChunkingStrategyType type) { ... }
    public ChunkingStrategy getStrategy(String strategyCode) { ... }
    public List<Map<String, Object>> listStrategies() { ... }
    public void validateConfig(String strategyCode, Map<String, Object> config) { ... }
    public boolean hasStrategy(ChunkingStrategyType type) { ... }
}

Auto-registration mechanism:

SlidingWindowChunkingStrategy, SemanticChunkingStrategy, and PropositionChunkingStrategy are all annotated with @Component. When the Spring container starts, it collects them into a List<ChunkingStrategy> and injects that list into the factory, whose constructor iterates over it and registers each strategy in strategyMap.

Benefit: a new strategy only needs to implement the interface and carry @Component; it is registered automatically with no change to the factory code.
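To make the registration idea concrete without a Spring container, the sketch below builds the same kind of type-to-strategy map from a plain list. All names here (Kind, Splitter, LineSplitter, SplitterRegistry) are hypothetical stand-ins, and the duplicate check is an addition that the article's factory does not show.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

enum Kind { SLIDING_WINDOW, SEMANTIC }

/** Hypothetical miniature of the strategy interface. */
interface Splitter {
    Kind type();
    List<String> split(String text);
}

/** Toy strategy that splits on newlines, registered under SLIDING_WINDOW for the demo. */
final class LineSplitter implements Splitter {
    public Kind type() { return Kind.SLIDING_WINDOW; }
    public List<String> split(String text) { return List.of(text.split("\n")); }
}

/** Registry built from whatever list of strategies it is handed (Spring would supply this list). */
final class SplitterRegistry {
    private final Map<Kind, Splitter> byType = new EnumMap<>(Kind.class);

    SplitterRegistry(List<Splitter> splitters) {
        for (Splitter s : splitters) {
            Splitter previous = byType.put(s.type(), s);
            if (previous != null) {
                // Two strategies claiming one type would silently shadow each other otherwise.
                throw new IllegalStateException("Duplicate strategy for type " + s.type());
            }
        }
    }

    Splitter get(Kind type) {
        Splitter s = byType.get(type);
        if (s == null) {
            throw new IllegalArgumentException("No strategy registered for " + type);
        }
        return s;
    }

    boolean has(Kind type) { return byType.containsKey(type); }
}
```

Because the registry depends only on the list it receives, it can be unit-tested by passing strategies directly, exactly as a container would at startup.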

4.3 ChunkingService: the business service layer

File: com.isy.rag.bootstrap.admin.service.ChunkingService

@Slf4j
@Service
@RequiredArgsConstructor
public class ChunkingService {
    private final ChunkingStrategyFactory chunkingStrategyFactory;

    public ChunkingResult chunk(String content, String strategyCode, String configJson) {
        ChunkingStrategyType strategyType = ChunkingStrategyType.fromCode(strategyCode);
        ChunkingStrategy strategy = chunkingStrategyFactory.getStrategy(strategyType);
        Map<String, Object> config = parseConfig(configJson, strategy);
        strategy.validate(config);  // validate first
        ChunkingRequest request = ChunkingRequest.builder()
                .content(content).strategyType(strategyType).config(config).build();
        return strategy.chunk(request);  // then execute
    }

    public ChunkingResult preview(String content, String strategyCode, String configJson) {
        return chunk(content, strategyCode, configJson);  // preview uses the same logic as real chunking
    }
}

Call chain:

Controller → ChunkingService.chunk() → Factory.getStrategy() → Strategy.chunk()
                                         ↓
                                    parseConfig() (JSON → Map)
                                         ↓
                                    Strategy.validate()

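5. The three chunking strategies in depth

5.1 Sliding-window chunking (SLIDING_WINDOW)

File: com.isy.rag.bootstrap.rag.chunking.impl.SlidingWindowChunkingStrategy

5.1.1 Core idea

Document: [AAAAAA|BBBBBB|CCCCCC|DDDDDD|EEEEEE]

Fixed window, no overlap:
  Chunk 0: [AAAAAA]
  Chunk 1: [BBBBBB]    ← information at the boundary may be lost
  Chunk 2: [CCCCCC]

Sliding window with overlap (overlap=2):
  Chunk 0: [AAAAAA]
  Chunk 1:   [AABBBBBB]   ← the overlap keeps information that spans the boundary
  Chunk 2:     [BBCCCCCC]
  Chunk 3:       [CCDDDDDD]

5.1.2 Two sub-modes

Mode 1: paragraph-aware (paragraphAware=true, the default)

private List<DocumentChunk> splitByParagraph(String content, int chunkSize, int overlap, int minChunkSize) {
    String[] paragraphs = content.split("\n\n+");  // split paragraphs on blank lines
    StringBuilder currentChunk = new StringBuilder();

    for (String paragraph : paragraphs) {
        // If adding this paragraph would exceed chunkSize, save the current chunk first
        if (currentChunk.length() + paragraph.length() > chunkSize && currentChunk.length() > 0) {
            // Save the chunk (omitted), then seed the next chunk with the overlap tail.
            // Math.max guards against a chunk that is shorter than the overlap.
            String overlapText = currentChunk.substring(Math.max(0, currentChunk.length() - overlap));
            currentChunk = new StringBuilder(overlapText);
        }
        currentChunk.append(paragraph).append("\n\n");

        // Force-split overlong paragraphs
        if (currentChunk.length() > chunkSize * 1.5) {
            // Force-split with a step of chunkSize - overlap (omitted)
        }
    }
}

Key logic:

Paragraph boundaries first: splits happen only between paragraphs, never in the middle of one
Fallback for overlong paragraphs: when the accumulated text exceeds chunkSize * 1.5, the strategy degrades to fixed-size splitting
Minimum chunk filter: minChunkSize filters out tail fragments that are too small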

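Mode 2: pure fixed size (paragraphAware=false)

private List<DocumentChunk> splitByFixedSize(String content, int chunkSize, int overlap, int minChunkSize) {
    List<DocumentChunk> chunks = new ArrayList<>();
    int chunkIndex = 0;
    int step = chunkSize - overlap;  // step = window size - overlap
    for (int i = 0; i < content.length(); i += step) {
        int end = Math.min(i + chunkSize, content.length());
        String chunkText = content.substring(i, end).trim();
        if (chunkText.length() >= minChunkSize) {
            chunks.add(buildChunk(chunkText, chunkIndex++, i, end));
        }
        if (end == content.length()) {
            break;  // the window reached the end; another step would only repeat the tail
        }
    }
    return chunks;
}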
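The window arithmetic is easiest to see on a tiny input. The following sketch is a simplified, standalone version of fixed-size splitting that returns plain strings; the class name FixedWindowSketch is hypothetical, and it omits trimming and the minChunkSize filter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical standalone illustration of a fixed-size sliding window. */
final class FixedWindowSketch {
    private FixedWindowSketch() {}

    static List<String> split(String content, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        int step = chunkSize - overlap;           // how far the window advances each time
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < content.length(); i += step) {
            int end = Math.min(i + chunkSize, content.length());
            chunks.add(content.substring(i, end));
            if (end == content.length()) {
                break;                            // last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

For the 18-character input AAAAAABBBBBBCCCCCC with chunkSize=6 and overlap=2, the windows start at offsets 0, 4, 8, and 12, giving AAAAAA, AABBBB, BBBBCC, and CCCCCC: each chunk repeats the last two characters of the previous one.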

5.1.3 Parameter validation rules

public void validate(Map<String, Object> config) {
    // chunkSize range: 100 ~ 8000
    // chunkOverlap range: 0 ~ chunkSize/2
    // Why is the overlap capped at chunkSize/2? Beyond half, most of each new window is repeated content
}
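The validation rules are given only as comments, so this is a sketch of how they could be enforced. The class name SlidingWindowConfigCheck, the helper intValue, and the choice of IllegalArgumentException are assumptions; the defaults 500 and 50 come from the article's configuration table.

```java
import java.util.Map;

/** Hypothetical validator for the two sliding-window size parameters. */
final class SlidingWindowConfigCheck {
    private SlidingWindowConfigCheck() {}

    static void validate(Map<String, Object> config) {
        int chunkSize = intValue(config, "chunkSize", 500);
        int overlap = intValue(config, "chunkOverlap", 50);
        if (chunkSize < 100 || chunkSize > 8000) {
            throw new IllegalArgumentException("chunkSize must be in [100, 8000], got " + chunkSize);
        }
        if (overlap < 0 || overlap > chunkSize / 2) {
            throw new IllegalArgumentException(
                    "chunkOverlap must be in [0, " + (chunkSize / 2) + "], got " + overlap);
        }
    }

    /** Reads a numeric value, falling back when the key is absent or not a number. */
    private static int intValue(Map<String, Object> config, String key, int fallback) {
        Object v = (config == null) ? null : config.get(key);
        return (v instanceof Number) ? ((Number) v).intValue() : fallback;
    }
}
```

The cap has a second benefit beyond limiting duplication: with overlap at most chunkSize/2 and chunkSize at least 100, the step chunkSize - overlap is always at least 50, so the fixed-size loop cannot stall.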

5.1.4 Token estimation

private int estimateTokens(String text) {
    return text.length() / 2;  // rough estimate: about 1.5 Chinese characters per token, about 4 English characters per token
}
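Dividing by 2 is a compromise between the two ratios named in the comment. If a slightly closer estimate is wanted, the two ratios can be applied separately, as in this sketch. The class name TokenEstimateSketch is hypothetical, and the ratios are the article's rough figures, not measurements from any particular tokenizer.

```java
/** Hypothetical estimator that applies the article's two rough ratios per script. */
final class TokenEstimateSketch {
    private TokenEstimateSketch() {}

    static int estimateTokens(String text) {
        int han = 0;
        int other = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                han++;      // Chinese characters: about 1.5 characters per token
            } else {
                other++;    // everything else: about 4 characters per token
            }
            i += Character.charCount(cp);
        }
        return (int) Math.ceil(han / 1.5 + other / 4.0);
    }
}
```

This remains an estimate; when exact budgets matter, counting with the tokenizer of the target model is the reliable option.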

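5.1.5 When to use it

General documents: technical documentation, contracts, reports, and other fairly regular documents
Fast processing: no external model is involved, so it is the fastest strategy
Constrained resources: the only option when neither an Embedding model nor an LLM is available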

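5.2 Semantic chunking (SEMANTIC)

File: com.isy.rag.bootstrap.rag.chunking.impl.SemanticChunkingStrategy

Reference: LangChain SemanticChunker

5.2.1 Core idea

Key insight: sentences on the same topic have high semantic similarity, and similarity drops sharply where the topic changes. Those drop points are the split boundaries.

Sentences:   S1 ──S2 ──S3 ──S4 ──S5 ──S6 ──S7 ──S8
Similarity:  0.92  0.88  0.85  0.31  0.90  0.87  0.28
                                ↑                 ↑
                           topic shift       topic shift

Result:      [S1 S2 S3 S4] | [S5 S6 S7] | [S8]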
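The grouping in the diagram can be reproduced in a few lines. This sketch takes precomputed adjacent similarities and a fixed threshold, both supplied by the caller; the class name BreakpointGroupingSketch and the threshold value 0.5 used in the example are assumptions, and no embedding model is involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: group sentences by cutting where adjacent similarity is low. */
final class BreakpointGroupingSketch {
    private BreakpointGroupingSketch() {}

    /** similarities[i] is the similarity between sentence i and sentence i + 1. */
    static List<List<String>> group(List<String> sentences, double[] similarities, double threshold) {
        if (similarities.length != Math.max(sentences.size() - 1, 0)) {
            throw new IllegalArgumentException("need exactly one similarity per adjacent sentence pair");
        }
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            current.add(sentences.get(i));
            boolean isLast = (i == sentences.size() - 1);
            // Close the group at the end of the text or when the link to the next sentence is weak.
            if (isLast || similarities[i] < threshold) {
                groups.add(current);
                current = new ArrayList<>();
            }
        }
        return groups;
    }
}
```

With the eight sentences and seven similarities from the diagram and a threshold of 0.5, the weak links after S4 (0.31) and S7 (0.28) close the groups, yielding [S1 S2 S3 S4], [S5 S6 S7], and [S8].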

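5.2.2 The algorithm in five steps

Step 1: split into sentences

private List<String> splitIntoSentences(String content) {
    // Handle Chinese and English punctuation together: 。！？； . ! ? ;
    String[] parts = content.split("(?<=[。！？；.!?;])\\s*");
    List<String> sentences = new ArrayList<>();
    for (String part : parts) {
        if (!part.isBlank()) sentences.add(part.trim());
    }
    // If too few sentences result (<= 1), fall back to splitting on newlines
    if (sentences.size() <= 1) {
        sentences.clear();
        for (String line : content.split("\\n+")) {
            if (!line.isBlank()) sentences.add(line.trim());
        }
    }
    return sentences;
}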

Step 2: generate an embedding for each sentence

List<double[]> embeddings = new ArrayList<>();
for (String sentence : sentences) {
    TextSegment segment = TextSegment.from(sentence);
    Embedding embedding = embeddingModel.embed(segment).content();
    float[] vector = embedding.vector();
    // convert float[] to double[]
    double[] doubleVector = new double[vector.length];
    for (int i = 0; i < vector.length; i++) {
        doubleVector[i] = vector[i];
    }
    embeddings.add(doubleVector);
}

Step 3: compute the cosine similarity of adjacent sentences

private double cosineSimilarity(double[] a, double[] b) {
    double dotProduct = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dotProduct += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    double denominator = Math.sqrt(normA) * Math.sqrt(normB);
    return denominator == 0.0 ? 0.0 : dotProduct / denominator;  // avoid NaN for zero vectors
}

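Step 4: determine split boundaries from the breakpoint threshold

Three breakpoint detection algorithms:

private List<Integer> findBreakpoints(List<Double> similarities, String type, int amount) {
    List<Integer> breakpoints = new ArrayList<>();
    if (similarities.isEmpty()) {
        return breakpoints;
    }
    // Sorted ascending once; used by the percentile and interquartile branches
    List<Double> sorted = similarities.stream().sorted().collect(Collectors.toList());
    double threshold;
    switch (type) {
        case "percentile": {
            // The threshold is the (100 - amount)th percentile of the similarities, so that
            // amount=95 marks only roughly the lowest 5% of similarity points as breakpoints.
            // (Taking the amount-th percentile instead would mark about 95% of the points.)
            int idx = (int) Math.ceil((100 - amount) / 100.0 * sorted.size()) - 1;
            threshold = sorted.get(Math.max(0, Math.min(idx, sorted.size() - 1)));
            break;
        }
        case "standard_deviation": {
            // mean - standard deviation × multiplier
            // amount=95 → multiplier 0.95; points well below the mean are breakpoints
            double mean = similarities.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double stdDev = Math.sqrt(similarities.stream()
                    .mapToDouble(s -> Math.pow(s - mean, 2)).average().orElse(0));
            threshold = mean - stdDev * (amount / 100.0);
            break;
        }
        case "interquartile": {
            // Q1 - IQR × multiplier
            // Based on the interquartile range, which is more robust to outliers
            double q1 = sorted.get(sorted.size() / 4);
            double q3 = sorted.get(sorted.size() * 3 / 4);
            double iqr = q3 - q1;
            threshold = q1 - iqr * (amount / 100.0);
            break;
        }
        default:
            throw new IllegalArgumentException("Unknown breakpoint type: " + type);
    }

    // Points whose similarity is at or below the threshold are breakpoints
    for (int i = 0; i < similarities.size(); i++) {
        if (similarities.get(i) <= threshold) breakpoints.add(i);
    }
    return breakpoints;
}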

Step 5: merge sentences into chunks

// Merge by breakpoints first → handle candidates below minChunkSize → re-split those above maxChunkSize
// Merge candidate chunks smaller than minChunkSize with their neighbors
for (int i = 0; i < candidateTexts.size(); i++) {
    String text = candidateTexts.get(i);
    mergeBuffer.append(text);
    if (mergeBuffer.length() >= minChunkSize || i == candidateTexts.size() - 1) {
        mergedTexts.add(mergeBuffer.toString().trim());
        mergeBuffer = new StringBuilder();
    }
}

// Re-split overlong chunks (preferring to break at a full stop)
private List<String> splitByMaxSize(String text, int maxSize, int minSize) {
    // Try to break at a full stop, semicolon, or similar punctuation
    int bestPunc = Math.max(
        Math.max(text.lastIndexOf("。", end), text.lastIndexOf(".", end)),
        text.lastIndexOf("；", end)
    );
    if (bestPunc > start + minSize) {
        end = bestPunc + 1;
    }
}

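5.2.3 Preconditions

if (embeddingModel == null) {
    throw new IllegalStateException(
        "Semantic chunking requires an EmbeddingModel, but none is configured. Choose the SLIDING_WINDOW strategy or configure an Embedding model."
    );
}

Semantic chunking depends on an Embedding model. When none is configured it throws an exception rather than falling back, because the user explicitly chose semantic chunking and a silent fallback could produce results that do not match expectations.

5.2.4 When to use it

Long documents: papers, technical white papers, and other documents with frequent topic changes
Rich semantic structure: documents whose sections discuss different topics
High retrieval precision: scenarios that need precise splits along semantic boundaries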

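5.3 Proposition chunking (PROPOSITION)

File: com.isy.rag.bootstrap.rag.chunking.impl.PropositionChunkingStrategy

Reference: NirDiamant/RAG_Techniques/proposition_chunking

5.3.1 Core idea

The document is decomposed into atomic propositions, each an independently verifiable factual statement, and the propositions are then reassembled into chunks suited to retrieval.

Source: "GPT-4 was released in March 2023, has about 1.8 trillion parameters, and cost more than 100 million US dollars to train."

Extracted propositions:
  - "GPT-4 was released in March 2023"
  - "GPT-4 has about 1.8 trillion parameters"
  - "GPT-4 cost more than 100 million US dollars to train"

Reassembled into chunks:
  Chunk 0: "GPT-4 was released in March 2023\nGPT-4 has about 1.8 trillion parameters\nGPT-4 cost more than 100 million US dollars to train"

Why is this better than splitting directly?

Traditional splitting may cut between "1.8 trillion" and "cost ... to train"
Proposition chunking keeps every fact complete on its own, so retrieval never returns half a sentence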

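5.3.2 The algorithm in three steps

Step 1: split the document into batches

private List<String> splitIntoBatches(String content, int maxChars) {
    String[] paragraphs = content.split("\n\n+");
    // Batch along paragraph boundaries; each batch stays within maxInputCharsPerBatch (default 3000 characters)
    // Overlong paragraphs are force-split at maxChars
}

Why batch? An LLM has an input length limit, so sending the whole document at once leads to truncation or timeouts.

Step 2: extract propositions with the LLM

private List<String> extractPropositions(String text, int maxProps) {
    String systemPrompt = "You are an expert at decomposing text into atomic propositions.\n"
        + "Rules:\n"
        + "1. Each proposition must be a complete, standalone fact.\n"
        + "2. Preserve specific details: numbers, dates, names, quantities.\n"
        + "3. Do not add information not present in the source text.\n"
        + "4. Output one proposition per line, prefixed with \"- \".\n"
        + "5. If the text contains no verifiable facts, return an empty response.";

    // Call the ChatModel (construction of chatRequest omitted)
    ChatResponse chatResponse = chatModel.chat(chatRequest);
    String responseText = chatResponse.aiMessage().text();

    // Parse the LLM output line by line
    List<String> propositions = new ArrayList<>();
    String[] lines = responseText.split("\\R");
    for (String line : lines) {
        line = line.trim();
        // Strip a "- " or "1. " prefix
        if (line.startsWith("- ")) {
            line = line.substring(2).trim();
        } else if (line.matches("^\\d+\\.\\s+.*")) {
            line = line.replaceFirst("^\\d+\\.\\s+", "").trim();
        }
        if (!line.isEmpty() && line.length() > 5) {
            propositions.add(line);
        }
    }
    return propositions;
}

Prompt design notes:

Rule 1: makes sure each proposition is complete and standalone
Rule 2: preserves key details such as numbers, dates, and names
Rule 3: forbids fabrication (to avoid hallucination)
Rule 5: returns nothing when there is no factual content (instead of forcing a decomposition)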
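Because Rule 4 fixes the output format, the parsing step can be exercised without calling a model. The sketch below isolates that step; the class name PropositionParserSketch is hypothetical, and enforcing maxProps as a simple cap is an assumption about how that parameter is used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical standalone parser for a "one proposition per line" LLM response. */
final class PropositionParserSketch {
    private PropositionParserSketch() {}

    static List<String> parse(String responseText, int maxProps) {
        List<String> out = new ArrayList<>();
        if (responseText == null) {
            return out;
        }
        for (String raw : responseText.split("\\R")) {       // \R matches any line break
            String line = raw.trim();
            if (line.startsWith("- ")) {
                line = line.substring(2).trim();              // "- fact"
            } else if (line.matches("^\\d+\\.\\s+.*")) {
                line = line.replaceFirst("^\\d+\\.\\s+", "").trim();  // "1. fact"
            }
            // Drop empty or very short lines; stop adding once the cap is reached.
            if (line.length() > 5 && out.size() < maxProps) {
                out.add(line);
            }
        }
        return out;
    }
}
```

Accepting numbered lines as well as the requested "- " prefix makes the parser tolerant of models that drift from the format, which in practice happens often enough to plan for.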

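Step 3: reassemble propositions

private List<DocumentChunk> reassemblePropositions(List<String> propositions, int maxChunkSize, int minChunkSize) {
    // Merge propositions in order until maxChunkSize is reached
    for (int i = 0; i < propositions.size(); i++) {
        String prop = propositions.get(i);
        if (currentChunk.length() + prop.length() + 2 > maxChunkSize && currentChunk.length() > 0) {
            // Save the current chunk and record propositionRange and propositionCount
            meta.put("propositionRange", propStartIndex + "-" + (i - 1));
            meta.put("propositionCount", i - propStartIndex);
        }
        currentChunk.append(prop);
    }
}

Why reassemble? A single proposition may be only a dozen or so characters long, which carries too little information once embedded as its own chunk. Regrouping several related propositions into one chunk keeps the facts atomic while providing enough context.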
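The greedy packing behind reassembly can be shown with plain strings. This is a simplified sketch under stated assumptions: the class name ReassembleSketch is hypothetical, propositions are joined with a single newline, and the minChunkSize handling and metadata from the excerpt are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical greedy packer: fill each chunk with propositions until the size cap is hit. */
final class ReassembleSketch {
    private ReassembleSketch() {}

    static List<String> reassemble(List<String> propositions, int maxChunkSize) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String prop : propositions) {
            // +1 accounts for the newline that would join this proposition to the chunk.
            if (current.length() > 0 && current.length() + 1 + prop.length() > maxChunkSize) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(prop);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());   // flush the final partial chunk
        }
        return chunks;
    }
}
```

A proposition longer than maxChunkSize still becomes a chunk of its own here, since a proposition is never cut in the middle; whether to split such outliers further is a separate design decision.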

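5.3.3 Fallback strategy

private ChunkingResult fallbackChunk(ChunkingRequest request) {
    log.warn("[KB retrieval] Proposition chunking is using the fallback strategy: SLIDING_WINDOW");
    SlidingWindowChunkingStrategy fallback = new SlidingWindowChunkingStrategy();
    // Use the same maxChunkSize as the chunkSize
    ChunkingResult result = fallback.chunk(fallbackRequest);
    result.setMetadata(Map.of(
        "fallbackFrom", "PROPOSITION",
        "fallbackReason", "ChatModel unavailable or produced no output"
    ));
    return result;
}

Fallback is triggered when:

chatModel == null (no LLM configured)
The LLM extraction produces nothing (every batch failed or contained no factual content)

On fallback, fallbackFrom and fallbackReason are recorded in metadata to help with later troubleshooting.

5.3.4 When to use it

Fact-dense documents: legal provisions, rules and regulations, product specifications, and the like
Fine-grained retrieval: when a query must hit one specific fact precisely
A high-quality Embedding model is available: propositions are short, so the Embedding model must capture fine-grained semantics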

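6. Multimodal chunk building

File: com.isy.rag.bootstrap.rag.chunking.MultimodalChunkBuilder

6.1 Responsibility

Turns image analysis results (VLM description + OCR text) into chunks of type IMAGE / TABLE_IMAGE, then merges and sorts them together with the text chunks.

6.2 Image chunk generation logic

For each DocumentAssetDO:
  if (analysisStatus == 2) {          // VLM analysis completed
      → generate an IMAGE or TABLE_IMAGE chunk
      → content uses embeddingText (used for vectorization)
      → metadata contains: asset_id, page_num, image_type, keywords, etc.
  }
  else if (analysisStatus == 3        // VLM failed
           && ocrText is not empty) { // but an OCR result exists
      → generate a short degraded IMAGE chunk
      → content = "[图片]" + ocrText   // the literal prefix means "[image]"
      → metadata marks degraded=true, degrade_reason="VLM_FAILED_WITH_OCR"
  }
  else {
      → generate no chunk (skip)
  }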
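The three-way decision above translates directly into code. The sketch below covers only the choice of chunk content; the class name ImageChunkDecisionSketch and the constant names are hypothetical, while the status values 2 and 3 and the "[图片]" prefix are taken from the pseudocode.

```java
import java.util.Optional;

/** Hypothetical illustration of the image chunk decision; returns the chunk content, if any. */
final class ImageChunkDecisionSketch {
    static final int ANALYSIS_DONE = 2;    // per the article: VLM analysis completed
    static final int ANALYSIS_FAILED = 3;  // per the article: VLM analysis failed

    private ImageChunkDecisionSketch() {}

    static Optional<String> chunkContent(int analysisStatus, String embeddingText, String ocrText) {
        if (analysisStatus == ANALYSIS_DONE) {
            // Normal path: the text prepared for vectorization becomes the content.
            return Optional.ofNullable(embeddingText);
        }
        if (analysisStatus == ANALYSIS_FAILED && ocrText != null && !ocrText.isBlank()) {
            // Degraded path: no VLM description, but OCR text is still searchable.
            return Optional.of("[图片]" + ocrText.trim());
        }
        return Optional.empty();            // nothing usable: skip this asset
    }
}
```

Returning an empty Optional for the skip case keeps the caller from having to distinguish "no chunk" from "chunk with empty content".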

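6.3 Mixed sorting of text and images

public List<DocumentChunk> mergeAndSortChunks(List<DocumentChunk> textChunks, List<DocumentChunk> imageChunks) {
    // 1. Merge
    List<DocumentChunk> allChunks = new ArrayList<>();
    allChunks.addAll(textChunks);
    allChunks.addAll(imageChunks);

    // 2. Assign an estimated blockIndex to text chunks that have none
    int maxBlockIndex = allChunks.stream()
            .filter(c -> c.getBlockIndex() != null)
            .mapToInt(DocumentChunk::getBlockIndex)
            .max().orElse(-1);
    for (DocumentChunk chunk : textChunks) {
        if (chunk.getBlockIndex() == null) {
            maxBlockIndex++;
            chunk.setBlockIndex(maxBlockIndex);
        }
    }

    // 3. Sort by blockIndex
    allChunks.sort(Comparator.comparingInt(c -> c.getBlockIndex() != null ? c.getBlockIndex() : Integer.MAX_VALUE));

    // 4. Reassign a globally increasing chunk index
    for (int i = 0; i < allChunks.size(); i++) {
        allChunks.get(i).setIndex(i);
    }
    return allChunks;
}

The role of blockIndex: image chunks come from the document parsing stage and carry the blockIndex assigned there (their physical order in the document), while text chunks lack this information. During mixed sorting, images keep their original positions and text chunks are assigned blockIndex values in sequence, with the aim that the final order matches the reading order of the source. Note that in the code above, text chunks without a blockIndex are numbered after the current maximum, so they sort after every chunk that already has one; true interleaving requires text chunks to carry a blockIndex from parsing.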

6.4 Chunk-Asset relation management

// PRIMARY relation: an IMAGE chunk and its associated asset
public void savePrimaryRelation(Long chunkId, Long assetId) {
    ChunkAssetRelDO rel = new ChunkAssetRelDO();
    rel.setChunkId(chunkId);
    rel.setAssetId(assetId);
    rel.setRelationType("PRIMARY");
    chunkAssetRelMapper.insert(rel);
}

// Cleanup on rebuild
public void cleanupChunksAndRelations(Long docId) {
    chunkAssetRelMapper.softDeleteByDocId(docId, System.currentTimeMillis());
}

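7. Integration with the document processing pipeline

File: com.isy.rag.bootstrap.admin.service.DocumentProcessingJobService

7.1 The complete pipeline

Upload file → Parse (PARSING) → Extract assets (EXTRACTING_ASSETS) → Analyze images (ANALYZING_IMAGES)
         → Chunk (CHUNKING) → Embed (EMBEDDING) → Index (INDEXING) → Done (DONE)

7.2 Integration code for the chunking stage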

// ---- Stage 4: chunking (small transaction) ----
if (recoveryService.shouldRun(resumeStage, "CHUNKING")) {
    // 1. Clean up old chunks and relations first
    multimodalChunkBuilder.cleanupChunksAndRelations(docId);
    chunkMapper.deleteByDocId(docId);

    // 2. Text chunking (calls ChunkingService)
    ChunkingResult chunkingResult = chunkingService.chunk(content, strategyCode, configJson);
    List<DocumentChunk> textChunks = chunkingResult.getChunks();

    // 3. Image chunking (calls MultimodalChunkBuilder)
    List<DocumentChunk> imageChunks = new ArrayList<>();
    if (multimodalEnabled) {
        List<DocumentAssetDO> analyzedAssets = assetMapper.selectByDocId(docId);
        imageChunks = multimodalChunkBuilder.buildImageChunks(analyzedAssets);
    }

    // 4. Merge and sort (mixed sorting by blockIndex)
    List<DocumentChunk> allChunks = multimodalChunkBuilder.mergeAndSortChunks(textChunks, imageChunks);

    // 5. Save to the database (small transaction)
    chunkDOs = saveChunksInTransaction(docId, document.getKbId(), allChunks);
}

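7.3 Strategy precedence

// Determine the chunking strategy and configuration (document level > knowledge base level)
String strategyCode = StrUtil.isNotBlank(document.getChunkStrategy())
        ? document.getChunkStrategy() : kb.getChunkStrategy();
String configJson = StrUtil.isNotBlank(document.getChunkConfigJson())
        ? document.getChunkConfigJson() : kb.getChunkConfigJson();

The document-level configuration takes precedence; when the document specifies none, the knowledge-base-level configuration is used. This lets different documents in the same knowledge base use different strategies.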

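7.4 Resume-from-checkpoint mechanism

The pipeline resumes from a checkpoint through ProcessRecoveryService:

ProcessRecoveryService.ResumeStage resumeStage = recoveryService.resolveResumeStage(docId);
if ("DONE".equals(resumeStage.getStartStage())) {
    return;  // already finished, skip
}
// Before each stage runs, check whether it needs to run
if (recoveryService.shouldRun(resumeStage, "CHUNKING")) { ... }

If processing is interrupted during the chunking stage, a restart continues from chunking without parsing the document again.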
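The article does not show how shouldRun decides, so the following is only one plausible reading: a stage runs when it is at or after the stage to resume from. The class name ResumeSketch and the String-based signature are assumptions; the stage names and their order come from the pipeline in section 7.1.

```java
import java.util.List;

/** Hypothetical sketch of an ordered-stage check; not the project's ProcessRecoveryService. */
final class ResumeSketch {
    // Stage order as listed in the pipeline overview.
    static final List<String> STAGES = List.of(
            "PARSING", "EXTRACTING_ASSETS", "ANALYZING_IMAGES",
            "CHUNKING", "EMBEDDING", "INDEXING", "DONE");

    private ResumeSketch() {}

    /** A stage should run if it is the resume stage or comes after it. */
    static boolean shouldRun(String startStage, String stage) {
        int start = STAGES.indexOf(startStage);
        int current = STAGES.indexOf(stage);
        if (start < 0 || current < 0) {
            throw new IllegalArgumentException("Unknown stage: " + (start < 0 ? startStage : stage));
        }
        return current >= start;
    }
}
```

Under this reading, resuming at CHUNKING skips PARSING, EXTRACTING_ASSETS, and ANALYZING_IMAGES, and reruns CHUNKING and everything after it, which is why the chunking stage begins by cleaning up old chunks.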

8. API usage guide

8.1 List all chunking strategies

GET /api/admin/kb/chunking-strategies

Example response:

{
  "code": "00000",
  "data": [
    {
      "type": "SLIDING_WINDOW",
      "name": "Sliding-window chunking",
      "description": "Fixed window plus overlap, so information spanning a boundary is not lost",
      "version": "v1",
      "defaultConfig": {
        "chunkSize": 500,
        "chunkOverlap": 50,
        "paragraphAware": true,
        "minChunkSize": 80
      }
    },
    {
      "type": "SEMANTIC",
      "name": "Semantic chunking",
      "description": "Splits at semantic boundaries, using drops in embedding similarity as split points",
      "version": "v1",
      "defaultConfig": {
        "breakpointThresholdType": "percentile",
        "breakpointThresholdAmount": 95,
        "minChunkSize": 120,
        "maxChunkSize": 900
      }
    },
    {
      "type": "PROPOSITION",
      "name": "Proposition chunking",
      "description": "Uses an LLM to decompose the document into atomic propositions, each an independently verifiable factual statement",
      "version": "v1",
      "defaultConfig": {
        "maxInputCharsPerBatch": 3000,
        "maxPropositionsPerBatch": 50,
        "fallbackStrategy": "SLIDING_WINDOW",
        "maxChunkSize": 500,
        "minChunkSize": 50
      }
    }
  ]
}

8.2 Chunking preview

POST /api/admin/kb/{kbId}/chunk-preview
Content-Type: application/json

{
  "text": "This is the text whose chunking should be previewed...",
  "chunkStrategy": "SLIDING_WINDOW",
  "chunkConfig": {
    "chunkSize": 200,
    "chunkOverlap": 50
  }
}

Example response:

{
  "code": "00000",
  "data": {
    "strategyType": "SLIDING_WINDOW",
    "strategyVersion": "v1",
    "totalChunks": 5,
    "totalChars": 1000,
    "durationMs": 12,
    "chunks": [
      {
        "index": 0,
        "content": "This is the text whose chunking should be previewed...",
        "charCount": 200,
        "tokenCount": 100,
        "chunkType": "TEXT",
        "strategyVersion": "v1",
        "metadata": null
      }
    ]
  }
}

8.3 Specify a strategy when creating a knowledge base

POST /api/admin/kb/save
Content-Type: application/json

{
  "name": "My knowledge base",
  "description": "Uses semantic chunking",
  "chunkStrategy": "semantic",
  "chunkConfigJson": "{\"breakpointThresholdType\":\"percentile\",\"breakpointThresholdAmount\":90}"
}

8.4 Override the strategy when uploading a document

POST /api/admin/kb/{kbId}/upload
Content-Type: multipart/form-data

file: <file>
chunkStrategy: proposition
chunkConfig: {"maxChunkSize": 600}

8.5 Rebuild the chunks of a document

POST /api/admin/kb/document/{docId}/rebuild-chunks

8.6 Get document processing progress

GET /api/admin/kb/document/{docId}/progress

Example response:

{
  "docId": 123,
  "parseStatus": 2,
  "vectorStatus": 1,
  "processStage": "DONE",
  "processProgress": 100,
  "chunkCount": 15
}

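9. Configuration parameter quick reference

9.1 SLIDING_WINDOW configuration

Parameter	Type	Default	Range	Description
`chunkSize`	Integer	500	100~8000	Target number of characters per chunk
`chunkOverlap`	Integer	50	0~chunkSize/2	Number of characters in the overlap
`paragraphAware`	Boolean	true	true/false	Whether to respect paragraph boundaries
`minChunkSize`	Integer	80	> 0	Minimum chunk size in characters; smaller fragments are filtered out

9.2 SEMANTIC configuration

Parameter	Type	Default	Range	Description
`breakpointThresholdType`	String	"percentile"	percentile / standard_deviation / interquartile	Breakpoint detection algorithm
`breakpointThresholdAmount`	Integer	95	> 0	Breakpoint threshold parameter (meaning depends on the type)
`minChunkSize`	Integer	120	> 0	Minimum chunk size in characters
`maxChunkSize`	Integer	900	> minChunkSize	Maximum chunk size in characters (overlong chunks are split again)

Meaning of breakpointThresholdAmount:

Algorithm	Meaning of amount=95
percentile	The threshold sits at the (100 - 95) = 5th percentile of the similarity distribution, so only roughly the lowest 5% of points are breakpoints
standard_deviation	Threshold = mean - 0.95 × standard deviation
interquartile	Threshold = Q1 - IQR × 0.95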
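To see how differently the three rules behave, the sketch below computes each threshold for one set of similarities. The class name ThresholdSketch is hypothetical, and the percentile rule is read as "the (100 - amount)th percentile of the similarities", which is the interpretation under which amount=95 selects only the lowest few points.

```java
import java.util.Arrays;

/** Hypothetical calculator for the three breakpoint thresholds over adjacent similarities. */
final class ThresholdSketch {
    private ThresholdSketch() {}

    private static double[] sortedCopy(double[] sims) {
        if (sims.length == 0) {
            throw new IllegalArgumentException("need at least one similarity");
        }
        double[] sorted = sims.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    /** Value at the (100 - amount)th percentile; points at or below it are breakpoints. */
    static double percentile(double[] sims, int amount) {
        double[] sorted = sortedCopy(sims);
        int idx = (int) Math.ceil((100 - amount) / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    /** mean - (amount / 100) * population standard deviation. */
    static double standardDeviation(double[] sims, int amount) {
        double mean = Arrays.stream(sims).average().orElse(0);
        double variance = Arrays.stream(sims).map(s -> (s - mean) * (s - mean)).average().orElse(0);
        return mean - Math.sqrt(variance) * (amount / 100.0);
    }

    /** Q1 - (amount / 100) * IQR, with quartiles taken by simple index as in the article. */
    static double interquartile(double[] sims, int amount) {
        double[] sorted = sortedCopy(sims);
        double q1 = sorted[sorted.length / 4];
        double q3 = sorted[sorted.length * 3 / 4];
        return q1 - (q3 - q1) * (amount / 100.0);
    }
}
```

For the similarities 0.92, 0.88, 0.85, 0.31, 0.90, 0.87, 0.28 with amount=95, the percentile rule picks 0.28 as the threshold, the standard-deviation rule lands between 0.31 and 0.85 (so both low points qualify), and the interquartile rule yields a negative threshold that no similarity can fall below. The same amount therefore calls for per-algorithm tuning.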

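9.3 PROPOSITION configuration

Parameter	Type	Default	Range	Description
`maxInputCharsPerBatch`	Integer	3000	500~10000	Characters sent to the LLM per batch
`maxPropositionsPerBatch`	Integer	50	> 0	Maximum propositions per batch
`fallbackStrategy`	String	"SLIDING_WINDOW"	-	Name of the fallback strategy
`maxChunkSize`	Integer	500	> 0	Maximum chunk size in characters after reassembly
`minChunkSize`	Integer	50	> 0	Minimum chunk size in characters after reassembly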

10. How to add a new chunking strategy

Adding a new chunking strategy takes only 3 steps:

Step 1: register the type in the enum

// ChunkingStrategyType.java
public enum ChunkingStrategyType {
    SLIDING_WINDOW(...),
    SEMANTIC(...),
    PROPOSITION(...),
    MY_NEW_STRATEGY("MY_NEW_STRATEGY", "New strategy name", "New strategy description");  // added
    ...
}

Step 2: implement the strategy interface

@Slf4j
@Component  // key point: @Component lets Spring register the strategy automatically
public class MyNewChunkingStrategy implements ChunkingStrategy {

    @Override
    public ChunkingStrategyType type() {
        return ChunkingStrategyType.MY_NEW_STRATEGY;
    }

    @Override
    public ChunkingResult chunk(ChunkingRequest request) {
        long startTime = System.currentTimeMillis();
        String content = request.getContent();

        // 1. Read the configuration (getConfigValue reads it safely)
        int myParam = request.getConfigValue("myParam", Integer.class, 100);

        // 2. Implement your chunking algorithm
        List<DocumentChunk> chunks = myChunkingAlgorithm(content, myParam);

        // 3. Return the result
        long durationMs = System.currentTimeMillis() - startTime;
        return ChunkingResult.of(chunks, type(), getVersion(), durationMs, content.length());
    }

    @Override
    public void validate(Map<String, Object> config) {
        // Validate configuration parameters
    }

    @Override
    public Map<String, Object> getDefaultConfig() {
        Map<String, Object> config = new LinkedHashMap<>();
        config.put("myParam", 100);
        return config;
    }
}

Step 3: done!

No change is needed in the factory, the service layer, or the controller: Spring injection plus factory auto-registration make the new strategy available immediately.

How to verify:

Call GET /api/admin/kb/chunking-strategies; the new strategy should appear in the list.

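11. Choosing a strategy

11.1 Decision flow

Start
 │
 ├─ Is an Embedding model available?
 │   ├─ No → SLIDING_WINDOW (the only option)
 │   └─ Yes ↓
 │
 ├─ Is a ChatModel (LLM) available?
 │   ├─ No ↓
 │   │   ├─ Is the document semantically rich (many topic changes)?
 │   │   │   ├─ Yes → SEMANTIC
 │   │   │   └─ No → SLIDING_WINDOW (simpler and faster)
 │   └─ Yes ↓
 │       ├─ Is atomic fact retrieval required?
 │       │   ├─ Yes → PROPOSITION
 │       │   └─ No ↓
 │       │       ├─ Is the document semantically rich?
 │       │       │   ├─ Yes → SEMANTIC
 │       │       │   └─ No → SLIDING_WINDOW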
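The decision flow collapses into three checks, shown here as a sketch. The class name StrategyChooserSketch and the four boolean inputs are hypothetical; the function simply mirrors the branches of the flow above and returns the strategy code as a string.

```java
/** Hypothetical encoding of the decision flow as a pure function. */
final class StrategyChooserSketch {
    private StrategyChooserSketch() {}

    static String choose(boolean hasEmbeddingModel,
                         boolean hasChatModel,
                         boolean needsAtomicFacts,
                         boolean richSemanticStructure) {
        if (!hasEmbeddingModel) {
            return "SLIDING_WINDOW";              // first branch of the flow: no Embedding model
        }
        if (hasChatModel && needsAtomicFacts) {
            return "PROPOSITION";                 // LLM available and fact-level retrieval wanted
        }
        // With or without an LLM, the remaining choice depends only on document structure.
        return richSemanticStructure ? "SEMANTIC" : "SLIDING_WINDOW";
    }
}
```

Writing the flow this way makes one property visible: once atomic fact retrieval is ruled out, the presence of a ChatModel no longer affects the outcome.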

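11.2 Performance and quality comparison

Dimension	SLIDING_WINDOW	SEMANTIC	PROPOSITION
Processing speed	Very fast (pure string operations)	Slower (calls the Embedding API)	Slowest (multiple LLM API calls)
External dependency	None	EmbeddingModel	ChatModel
Token consumption	0	One Embedding call per sentence	One Chat call per batch
Semantic completeness	Low (may cut through meaning)	High (splits at semantic boundaries)	Highest (atomic facts)
Explainability	High (explicit rules)	Medium (tunable breakpoint threshold)	Medium (depends on LLM output quality)
Recommended document length	Any	Medium to long documents (>2000 characters)	Short to medium documents (LLM input limits)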
