my own enterprise RAG double cache

2026-04-20

myRAG

LLM / Agent / LangGraph / RAG

Analysis of the original query path

  Query: "Token management"
    │
    ├─ Query Expansion → 5 sub-queries (Token management, Token, JWT, ...)
    │
    ├─ For EACH sub-query:
    │    ├─ DashScope embed()       ← network I/O, slow
    │    ├─ PostgreSQL BM25 FTS     ← DB I/O
    │    └─ PostgreSQL vector sim   ← DB I/O
    │
    ├─ RRF Fusion                   ← fast
    │
    └─ DashScope rerank()           ← network I/O, slow
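
The RRF Fusion step merges the BM25 and vector rankings from every sub-query into a single list. A minimal sketch, assuming the standard reciprocal rank fusion formula with the commonly used constant k = 60; the function name `rrf_fuse` and the chunk IDs are hypothetical and not taken from the original pipeline:

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each ranking is a list of IDs, best first. An ID's fused score is the
    sum of 1 / (k + rank) over every list it appears in (rank starts at 1).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]    # hypothetical BM25 ranking
vector_hits = ["c1", "c9", "c3"]  # hypothetical vector ranking
fused = rrf_fuse([bm25_hits, vector_hits])
```

Because the fusion is pure in-memory arithmetic over short lists, it is cheap compared with the embedding and rerank calls, which is why the caching effort goes to the other stages.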

Layered caching strategy

  Request
    │
    ▼
  ┌──────────────────────────┐
  │  L1: Query Result Cache  │  ← identical query is returned directly (TTL 5 min)
  │  (In-memory LRU)         │
  └──────────────────────────┘
    │ miss
    ▼
  ┌──────────────────────────┐
  │  L2: Embedding Cache     │  ← same text → embedding (persisted)
  │  (SQLite / Redis)        │
  └──────────────────────────┘
    │ miss
    ▼
  ┌──────────────────────────┐
  │  L3: BM25 Index Cache    │  ← loaded at startup, updated incrementally
  │  (In-memory)             │
  └──────────────────────────┘
    │
    ▼
  DashScope API + PostgreSQL (pgvector + FTS)

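Layer details

L1: Query Result Cache (highest hit rate)

Cache key: skill_point + top_k → RetrievalResult
TTL: 5 minutes (configurable)
Eviction: LRU, capped at 1000 entries
Hit scenario: a user repeatedly queries the same skill point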
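
The L1 parameters above can be sketched as a small in-memory cache. This is only an illustration, assuming a single process and results that are safe to share between requests; the class name `QueryResultCache` and the injectable `clock` argument are hypothetical additions, the latter only to make expiry testable:

```python
import time
from collections import OrderedDict


class QueryResultCache:
    """In-memory LRU cache with a per-entry TTL."""

    def __init__(self, max_entries=1000, ttl_seconds=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # (skill_point, top_k) -> (expires_at, value)

    def get(self, skill_point, top_k):
        key = (skill_point, top_k)
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired entries count as misses
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, skill_point, top_k, result):
        key = (skill_point, top_k)
        self._entries[key] = (self._clock() + self.ttl_seconds, result)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

If the API serves requests from several threads, the `get` and `put` methods would need a lock around them; that is left out here to keep the sketch short.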

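L2: Embedding Cache (reduces DashScope calls)

Cache key: hash(text) → embedding_vector
Persistence: SQLite file (embeddings.db), loaded at startup
Hit scenarios: the same document chunk is referenced by different queries; synonyms generated by query expansion are reused
Note: this still needs to be evaluated against DashScope's text-embedding-v3 API details: https://help.aliyun.com/zh/dashscope/developer-reference/api-details-for-text-embedding-v3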
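
A minimal sketch of the L2 idea, assuming SHA-256 as the hash and JSON text as the stored vector format. The class `EmbeddingCache` and the `embed_fn` callable are hypothetical; `embed_fn` stands in for whatever wraps the DashScope embedding call and is not its real signature. Unlike the description above, this version reads SQLite on demand instead of loading everything at startup:

```python
import hashlib
import json
import sqlite3


class EmbeddingCache:
    """Persistent text -> embedding cache backed by SQLite."""

    def __init__(self, path="embeddings.db"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(text_hash TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, embed_fn):
        key = self._key(text)
        row = self._db.execute(
            "SELECT vector FROM embeddings WHERE text_hash = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call
        vector = embed_fn(text)  # cache miss: call the embedding API once
        self._db.execute(
            "INSERT OR REPLACE INTO embeddings (text_hash, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self._db.commit()
        return vector
```

If the embedding model or its output dimension ever changes, the key would also need to include the model name, otherwise stale vectors would be served.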

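L3: BM25 Index (eliminates DB BM25 query latency)

At startup, load the full text of all chunks from PostgreSQL and build an in-memory BM25 index
Incremental updates: when a document changes, update only the affected part instead of rebuilding the whole index
Or, simpler: after each build_index.py run, serialize the BM25 data to a file and deserialize it when the API starts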
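
Both L3 options (incremental updates and serialize-then-load) can be illustrated with one small index. This is a sketch under stated assumptions, not the project's implementation: the class `BM25Index` is hypothetical, it uses the common BM25 defaults k1 = 1.5 and b = 0.75, and it tokenizes on whitespace, which would have to be replaced by a proper word segmenter for Chinese text:

```python
import math
import pickle
from collections import Counter


class BM25Index:
    """Small in-memory BM25 index supporting per-chunk updates."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_terms = {}        # chunk_id -> Counter of term frequencies
        self.doc_freq = Counter()  # term -> number of chunks containing it
        self.total_len = 0         # sum of chunk lengths in tokens

    @staticmethod
    def tokenize(text):
        # Assumption: whitespace tokenization; Chinese text needs a segmenter.
        return text.lower().split()

    def remove(self, chunk_id):
        terms = self.doc_terms.pop(chunk_id, None)
        if terms is None:
            return
        self.total_len -= sum(terms.values())
        for term in terms:
            self.doc_freq[term] -= 1
            if self.doc_freq[term] == 0:
                del self.doc_freq[term]

    def upsert(self, chunk_id, text):
        """Add or replace one chunk without rebuilding the rest of the index."""
        self.remove(chunk_id)
        terms = Counter(self.tokenize(text))
        self.doc_terms[chunk_id] = terms
        self.total_len += sum(terms.values())
        for term in terms:
            self.doc_freq[term] += 1

    def search(self, query, top_k=5):
        n = len(self.doc_terms)
        if n == 0:
            return []
        avg_len = self.total_len / n
        scores = Counter()
        for term in set(self.tokenize(query)):
            df = self.doc_freq.get(term, 0)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            for chunk_id, terms in self.doc_terms.items():
                tf = terms.get(term, 0)
                if tf == 0:
                    continue
                length = sum(terms.values())
                norm = tf + self.k1 * (1 - self.b + self.b * length / avg_len)
                scores[chunk_id] += idf * tf * (self.k1 + 1) / norm
        return [cid for cid, _ in scores.most_common(top_k)]

    def save(self, path):
        # Store plain state only; load pickle files from trusted sources only.
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        index = cls()
        with open(path, "rb") as f:
            index.__dict__.update(pickle.load(f))
        return index
```

The search loop scans every chunk per query term, which is fine for a small corpus; a larger one would want an inverted index from term to chunk IDs.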

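Minimum viable approach: L1 + L3

L2 (the embedding cache) can actually be dropped in the simplified version, because build_index.py already stores the chunk embeddings in pgvector, so search_similar_chunks can query pgvector directly.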

  Add only L1 (Query Result Cache) and L3 (BM25 Index):

  At startup:
    ├─ Load all chunk embeddings from pgvector → HybridRetriever
    └─ Load all chunk text from PostgreSQL → BM25 index

  Per request:
    ├─ Check L1 cache → hit? return
    ├─ Query Expansion
    ├─ BM25 (in-memory, fast)
    ├─ Vector search (pgvector, already indexed)
    ├─ RRF + Rerank
    └─ Store result in L1 cache
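
The per-request steps can be sketched as a single cache-aside function. Everything here is a stand-in: `retrieve` and its callable parameters (`expand`, `bm25_search`, `vector_search`, `fuse`, `rerank`) are hypothetical names for the stages in the diagram, and `cache` is assumed to be a plain dict rather than a real TTL/LRU cache:

```python
def retrieve(skill_point, top_k, cache, expand, bm25_search,
             vector_search, fuse, rerank):
    """Return the L1 entry if present; otherwise run the pipeline and store it."""
    key = (skill_point, top_k)
    if key in cache:
        return cache[key]  # L1 hit: every downstream stage is skipped

    rankings = []
    for sub_query in expand(skill_point):
        rankings.append(bm25_search(sub_query))    # in-memory BM25 (L3)
        rankings.append(vector_search(sub_query))  # pgvector lookup
    candidates = fuse(rankings)                    # e.g. RRF
    result = rerank(skill_point, candidates)[:top_k]
    cache[key] = result  # populate L1 for the next identical request
    return result
```

Passing the stages in as callables keeps the caching logic independent of DashScope and PostgreSQL, so the hit and miss paths can be exercised with stubs before wiring in the real services.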