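Configuration
This guide covers the configuration options for the semantic router. The system uses a single YAML configuration file to control signal-driven routing, plugin chain processing, and model selection.
Architecture Overview
The configuration defines three main layers:
- Signal extraction layer: defines 6 types of signals (keyword, embedding, domain, fact check, user feedback, preference)
- Decision engine: combines signals with AND/OR operators to make routing decisions
- Plugin chain: configures plugins for caching, security, and optimization
Configuration File
The configuration file is located at config/config.yaml. The structure below is based on the actual implementation: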
# config/config.yaml - Actual configuration structure
# BERT model for semantic similarity
bert_model:
model_id: sentence-transformers/all-MiniLM-L12-v2
threshold: 0.6
use_cpu: true
# Semantic caching
semantic_cache:
backend_type: "memory" # Options: "memory" or "milvus"
enabled: false
similarity_threshold: 0.8 # Global default threshold
max_entries: 1000
ttl_seconds: 3600
eviction_policy: "fifo" # Options: "fifo", "lru", "lfu"
# Tool auto-selection
tools:
enabled: false
top_k: 3
similarity_threshold: 0.2
tools_db_path: "config/tools_db.json"
fallback_to_empty: true
# Jailbreak protection
prompt_guard:
enabled: false # Global default - can be overridden per category
use_modernbert: true
model_id: "models/jailbreak_classifier_modernbert-base_model"
threshold: 0.7
use_cpu: true
# vLLM endpoints - your backend models
vllm_endpoints:
- name: "endpoint1"
address: "192.168.1.100" # Replace with your server IP address
port: 11434
models:
- "your-model" # Replace with your model
weight: 1
# Model configuration
model_config:
"your-model":
pii_policy:
allow_by_default: true
pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
preferred_endpoints: ["endpoint1"]
# Example: DeepSeek model with custom name
"ds-v31-custom":
reasoning_family: "deepseek" # Uses DeepSeek reasoning syntax
preferred_endpoints: ["endpoint1"]
# Example: Qwen3 model with custom name
"my-qwen3-model":
reasoning_family: "qwen3" # Uses Qwen3 reasoning syntax
preferred_endpoints: ["endpoint2"]
# Example: Model without reasoning support
"phi4":
preferred_endpoints: ["endpoint1"]
# Classification models
classifier:
category_model:
model_id: "models/category_classifier_modernbert-base_model"
use_modernbert: true
threshold: 0.6
use_cpu: true
pii_model:
model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
use_modernbert: true
threshold: 0.7
use_cpu: true
# Signals - Signal extraction configuration
signals:
# Keyword-based signals (fast pattern matching)
keywords:
- name: "math_keywords"
operator: "OR"
keywords:
- "calculate"
- "equation"
- "solve"
- "derivative"
- "integral"
case_sensitive: false
- name: "code_keywords"
operator: "OR"
keywords:
- "function"
- "class"
- "debug"
- "compile"
case_sensitive: false
# Embedding-based signals (semantic similarity)
embeddings:
- name: "code_debug"
threshold: 0.70
candidates:
- "how to debug the code"
- "troubleshooting steps for my code"
aggregation_method: "max"
- name: "math_intent"
threshold: 0.75
candidates:
- "solve mathematical problem"
- "calculate the result"
aggregation_method: "max"
# Domain signals (MMLU classification)
domains:
- name: "mathematics"
description: "Mathematical and computational problems"
mmlu_categories:
- "abstract_algebra"
- "college_mathematics"
- "elementary_mathematics"
- name: "computer_science"
description: "Programming and computer science"
mmlu_categories:
- "computer_security"
- "machine_learning"
# Fact check signals (verification need detection)
fact_check:
- name: "needs_verification"
description: "Queries requiring fact verification"
# User feedback signals (satisfaction analysis)
user_feedbacks:
- name: "correction_needed"
description: "User indicates previous answer was wrong"
# Preference signals (LLM-based matching)
preferences:
- name: "complex_reasoning"
description: "Requires deep reasoning and analysis"
llm_endpoint: "https://:11434"
# Categories - Define domain categories
categories:
- name: math
- name: computer science
- name: other
# Decisions - Combine signals to make routing decisions
decisions:
- name: math
description: "Route mathematical queries"
priority: 10
rules:
operator: "OR" # Match ANY of these conditions
conditions:
- type: "keyword"
name: "math_keywords"
- type: "embedding"
name: "math_intent"
- type: "domain"
name: "mathematics"
modelRefs:
- model: your-model
use_reasoning: true # Enable reasoning for math problems
# Optional: Decision-level plugins
plugins:
- type: "semantic-cache"
configuration:
enabled: true
similarity_threshold: 0.9 # Higher threshold for math
- type: "jailbreak"
configuration:
enabled: true
- type: "pii"
configuration:
enabled: true
threshold: 0.8
- type: "system_prompt"
configuration:
enabled: true
prompt: "You are a mathematics expert. Solve problems step by step."
- name: computer_science
description: "Route computer science queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "keyword"
name: "code_keywords"
- type: "embedding"
name: "code_debug"
- type: "domain"
name: "computer_science"
modelRefs:
- model: your-model
use_reasoning: true # Enable reasoning for code
plugins:
- type: "semantic-cache"
configuration:
enabled: true
similarity_threshold: 0.85
- type: "system_prompt"
configuration:
enabled: true
prompt: "You are a programming expert. Provide clear code examples."
- name: other
description: "Route general queries"
priority: 5
rules:
operator: "OR"
conditions:
- type: "domain"
name: "other"
modelRefs:
- model: your-model
use_reasoning: false # No reasoning for general queries
plugins:
- type: "semantic-cache"
configuration:
enabled: true
similarity_threshold: 0.75 # Lower threshold for general queries
default_model: your-model
# Reasoning family configurations - define how different model families handle reasoning syntax
reasoning_families:
deepseek:
type: "chat_template_kwargs"
parameter: "thinking"
qwen3:
type: "chat_template_kwargs"
parameter: "enable_thinking"
gpt-oss:
type: "reasoning_effort"
parameter: "reasoning_effort"
gpt:
type: "reasoning_effort"
parameter: "reasoning_effort"
# Global default reasoning effort level
default_reasoning_effort: "medium"
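Assign reasoning families in the same model_config block by setting reasoning_family on each model (see ds-v31-custom and my-qwen3-model in the example). Models without reasoning syntax simply omit the field (for example, phi4).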
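As a rough illustration of how a reasoning family could translate into request fields, the Python sketch below looks up a model's reasoning_family and builds the extra fields for it. The function name reasoning_fields and the exact shape of the returned fields are assumptions made for illustration; the router's actual behavior may differ.

```python
# Mirrors the reasoning_families and model_config entries from the example config.
REASONING_FAMILIES = {
    "deepseek": {"type": "chat_template_kwargs", "parameter": "thinking"},
    "qwen3": {"type": "chat_template_kwargs", "parameter": "enable_thinking"},
    "gpt-oss": {"type": "reasoning_effort", "parameter": "reasoning_effort"},
}
MODEL_CONFIG = {
    "ds-v31-custom": {"reasoning_family": "deepseek"},
    "my-qwen3-model": {"reasoning_family": "qwen3"},
    "phi4": {},  # no reasoning_family: no reasoning syntax
}


def reasoning_fields(model, use_reasoning, effort="medium"):
    """Return extra request fields for a model (hypothetical helper)."""
    family_name = MODEL_CONFIG.get(model, {}).get("reasoning_family")
    family = REASONING_FAMILIES.get(family_name)
    if family is None:
        # Models without a family get no reasoning fields at all.
        return {}
    if family["type"] == "chat_template_kwargs":
        # Assumed shape: a boolean flag inside chat_template_kwargs.
        return {"chat_template_kwargs": {family["parameter"]: use_reasoning}}
    if family["type"] == "reasoning_effort":
        # Assumed shape: an effort level, sent only when reasoning is on.
        return {family["parameter"]: effort} if use_reasoning else {}
    return {}
```

Under these assumptions, reasoning_fields("my-qwen3-model", True) yields a chat_template_kwargs entry with enable_thinking set, while reasoning_fields("phi4", True) yields an empty dict.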
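Configuration Recipes (Presets)
We provide curated, versioned presets that you can use directly or as a starting point:
- Accuracy-optimized: https://github.com/vllm-project/semantic-router/blob/main/config/config.recipe-accuracy.yaml
- Token-efficiency-optimized: https://github.com/vllm-project/semantic-router/blob/main/config/config.recipe-token-efficiency.yaml
- Latency-optimized: https://github.com/vllm-project/semantic-router/blob/main/config/config.recipe-latency.yaml
- Guide and usage: https://github.com/vllm-project/semantic-router/blob/main/config/RECIPES.md
Quick Usage
- Local: copy a recipe to config.yaml, then run:
- cp config/config.recipe-accuracy.yaml config/config.yaml
- make run-router
- Helm/Argo: reference the recipe file contents in your config map (see the guide above for examples).
Signal Configuration
Signals are the foundation of intelligent routing. The system supports 6 signal types that can be combined to make routing decisions.
1. Keyword Signals - Fast Pattern Matching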
signals:
keywords:
- name: "math_keywords"
operator: "OR" # OR: match any keyword, AND: match all keywords
keywords:
- "calculate"
- "equation"
- "solve"
case_sensitive: false
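Use Cases
- Deterministic routing for specific terms
- Compliance and security (PII keywords, prohibited terms)
- High-throughput scenarios that need <1ms latency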
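The OR/AND operator and the case_sensitive flag of a keyword signal can be pictured with the short Python sketch below. It assumes plain substring matching, which the source does not confirm, and keyword_signal_matches is a hypothetical helper name.

```python
def keyword_signal_matches(rule, text):
    """Evaluate one keyword rule against a query (illustrative only)."""
    case_sensitive = rule.get("case_sensitive", False)
    haystack = text if case_sensitive else text.lower()
    hits = [
        (kw if case_sensitive else kw.lower()) in haystack
        for kw in rule["keywords"]
    ]
    # OR: any keyword is enough; AND: every keyword must appear.
    return all(hits) if rule["operator"] == "AND" else any(hits)


math_keywords = {
    "name": "math_keywords",
    "operator": "OR",
    "keywords": ["calculate", "equation", "solve"],
    "case_sensitive": False,
}
```

With this rule, the query "Please SOLVE this equation" matches under OR, but would not match if the operator were AND, because "calculate" is absent.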
2. Embedding Signals - Semantic Understanding
signals:
embeddings:
- name: "code_debug"
threshold: 0.70 # Similarity threshold (0-1)
candidates:
- "how to debug the code"
- "troubleshooting steps"
aggregation_method: "max" # max, avg, or min
Use Cases
- Intent detection that is robust to paraphrasing
- Semantic similarity matching
- Handling varied user phrasing
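To make threshold and aggregation_method concrete, the Python sketch below scores a query vector against each candidate vector with cosine similarity and aggregates the scores with max, avg, or min. The toy vectors stand in for real embeddings, and the function names are assumptions for illustration rather than the router's API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aggregate(scores, method):
    """Combine per-candidate scores as named by aggregation_method."""
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown aggregation_method: {method}")


def embedding_signal_matches(query_vec, candidate_vecs, threshold, method="max"):
    """The signal fires when the aggregated similarity reaches the threshold."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return aggregate(scores, method) >= threshold
```

With "max", a single close candidate is enough to fire the signal; "avg" and "min" require the query to be close to most or all candidates.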
3. Domain Signals - MMLU Classification
signals:
domains:
- name: "mathematics"
description: "Mathematical problems"
mmlu_categories:
- "abstract_algebra"
- "college_mathematics"
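Use Cases
- Academic and professional domain routing
- Subject-matter-expert model selection
- Supports 14 MMLU categories
4. Fact Check Signals - Verification Need Detection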
signals:
fact_check:
- name: "needs_verification"
description: "Queries requiring fact verification"
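Use Cases
- Distinguishing factual queries from creative or coding tasks
- Routing to models with hallucination detection
- Triggering fact-check plugins
5. User Feedback Signals - Satisfaction Analysis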
signals:
user_feedbacks:
- name: "correction_needed"
description: "User indicates previous answer was wrong"
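Use Cases
- Handling follow-up corrections ("that's wrong", "try again")
- Detecting satisfaction levels
- Routing to a more capable model on retry
6. Preference Signals - LLM-Based Matching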
signals:
preferences:
- name: "complex_reasoning"
description: "Requires deep reasoning"
llm_endpoint: "https://:11434"
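Use Cases
- Complex intent analysis through an external LLM
- Nuanced routing decisions
- Use when other signals are insufficient
Decision Rules - Signal Fusion
Combine signals with AND/OR operators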
decisions:
- name: math
description: "Route mathematical queries"
priority: 10
rules:
operator: "OR" # Match ANY condition
conditions:
- type: "keyword"
name: "math_keywords"
- type: "embedding"
name: "math_intent"
- type: "domain"
name: "mathematics"
modelRefs:
- model: math-specialist
weight: 1.0
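Strategies:
- Priority-based: higher-priority decisions are evaluated first
- Confidence-based: the decision with the highest confidence score is selected
- Hybrid: combines priority and confidence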
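The priority-based strategy can be sketched in a few lines of Python. The sketch assumes that signal extraction has already produced a set of (type, name) pairs for the signals that fired; select_decision and rule_matches are hypothetical names, and confidence scores are left out.

```python
def rule_matches(rules, fired):
    """Apply a decision's AND/OR operator to its conditions."""
    results = [(c["type"], c["name"]) in fired for c in rules["conditions"]]
    return all(results) if rules["operator"] == "AND" else any(results)


def select_decision(decisions, fired):
    """Return the name of the highest-priority matching decision, or None."""
    matched = [d for d in decisions if rule_matches(d["rules"], fired)]
    if not matched:
        return None  # caller would fall back to default_model
    return max(matched, key=lambda d: d["priority"])["name"]


DECISIONS = [
    {"name": "math", "priority": 10, "rules": {"operator": "OR", "conditions": [
        {"type": "keyword", "name": "math_keywords"},
        {"type": "embedding", "name": "math_intent"},
        {"type": "domain", "name": "mathematics"},
    ]}},
    {"name": "other", "priority": 5, "rules": {"operator": "OR", "conditions": [
        {"type": "domain", "name": "other"},
    ]}},
]
```

If both the math_keywords keyword signal and the "other" domain signal fire, select_decision picks math because its priority (10) is higher than that of other (5).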
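Plugin Chain Configuration
Plugins process requests and responses in a chain. Each decision can override the global plugin settings.
Global Plugin Configuration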
# Global defaults
semantic_cache:
enabled: true
similarity_threshold: 0.8
prompt_guard:
enabled: true
threshold: 0.7
classifier:
pii_model:
enabled: true
threshold: 0.8
Decision-Level Plugin Overrides
decisions:
- name: math
description: "Route mathematical queries"
priority: 10
plugins:
- type: "semantic-cache"
configuration:
enabled: true
similarity_threshold: 0.9 # Higher for math
- type: "jailbreak"
configuration:
enabled: true
- type: "pii"
configuration:
enabled: true
threshold: 0.8
- type: "system_prompt"
configuration:
enabled: true
prompt: "You are a mathematics expert."
- type: "header_mutation"
configuration:
enabled: true
headers:
X-Math-Mode: "enabled"
- type: "hallucination"
configuration:
enabled: false # Optional real-time detection
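Plugin Types
| Plugin | Description | Configuration |
|---|---|---|
| Semantic cache (semantic-cache) | Caching based on semantic similarity | similarity_threshold, ttl_seconds |
| Jailbreak detection (jailbreak) | Adversarial prompt detection | threshold, model_id |
| PII (pii) | PII detection and redaction | threshold, pii_types_allowed |
| System prompt (system_prompt) | Dynamic prompt injection | prompt |
| Header mutation (header_mutation) | HTTP header manipulation | headers |
| Hallucination detection (hallucination) | Token-level hallucination detection | enabled |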
Key Configuration Sections
Backend Endpoints
Configure your LLM servers
vllm_endpoints:
- name: "my_endpoint"
address: "127.0.0.1" # Your server IP - MUST be IP address format
port: 8000 # Your server port
weight: 1 # Load balancing weight
# Model configuration - maps models to endpoints
model_config:
"llama2-7b": # Model name - must match vLLM --served-model-name
preferred_endpoints: ["my_endpoint"]
"qwen3": # Another model served by the same endpoint
preferred_endpoints: ["my_endpoint"]
Example: Llama / Qwen Backend Configuration
vllm_endpoints:
- name: "local-vllm"
address: "127.0.0.1"
port: 8000
model_config:
"llama2-7b":
preferred_endpoints: ["local-vllm"]
"qwen3":
preferred_endpoints: ["local-vllm"]
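Address Format Requirements
Important: the address field must contain a valid IP address (IPv4 or IPv6). Domain names and other formats are not supported.
✅ Supported formats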
# IPv4 addresses
address: "127.0.0.1"
# IPv6 addresses
address: "2001:db8::1"
❌ Not supported
# Domain names
address: "localhost" # ❌ Use 127.0.0.1 instead
address: "api.openai.com" # ❌ Use IP address instead
# Protocol prefixes
address: "http://127.0.0.1" # ❌ Remove protocol prefix
# Paths
address: "127.0.0.1/api" # ❌ Remove path, use IP only
# Ports in address
address: "127.0.0.1:8080" # ❌ Use separate 'port' field
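A quick way to pre-check addresses against the rules above is Python's standard ipaddress module, as in the sketch below. This is only an approximation of the router's own validation, which may be stricter or differ in detail; is_valid_endpoint_address is a hypothetical helper name.

```python
import ipaddress


def is_valid_endpoint_address(address):
    """Accept only bare IPv4/IPv6 literals (no scheme, path, port, or hostname)."""
    try:
        ipaddress.ip_address(address)
    except ValueError:
        return False
    return True


for candidate in ["127.0.0.1", "2001:db8::1", "localhost", "127.0.0.1:8080"]:
    print(candidate, is_valid_endpoint_address(candidate))
```

Running such a check before starting the router catches hostnames, protocol prefixes, paths, and embedded ports early.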
Model Name Consistency
Model names in model_config must exactly match the --served-model-name argument used when starting the vLLM server.
# vLLM server command (examples):
vllm serve meta-llama/Llama-2-7b-hf --served-model-name llama2-7b --port 8000
vllm serve Qwen/Qwen3-1.7B --served-model-name qwen3 --port 8000
# config.yaml must reference the model in model_config:
model_config:
"llama2-7b": # ✅ Matches --served-model-name
preferred_endpoints: ["your-endpoint"]
"qwen3": # ✅ Matches --served-model-name
preferred_endpoints: ["your-endpoint"]
Model Settings
Configure model-specific settings
model_config:
"llama2-7b":
pii_policy:
allow_by_default: true # Allow PII by default
pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
preferred_endpoints: ["my_endpoint"] # Optional: specify which endpoints can serve this model
"gpt-4":
pii_policy:
allow_by_default: false
# preferred_endpoints omitted - router will not set endpoint header
# Useful when external load balancer handles endpoint selection
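Notes on preferred_endpoints
- Optional field: if omitted, the router does not set the x-vsr-destination-endpoint header
- When specified: the router selects the best endpoint by weight and sets the header
- When omitted: an upstream load balancer or service mesh handles endpoint selection
- Validation: models used in categories or as default_model must have preferred_endpoints configured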
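The validation rule in the last bullet can be approximated offline with the Python sketch below. It assumes the config has already been loaded into a dict and that models are referenced through decisions[].modelRefs and default_model; missing_endpoint_models is a hypothetical helper, not part of the router.

```python
def missing_endpoint_models(config):
    """List referenced models that lack a non-empty preferred_endpoints."""
    used = set()
    if config.get("default_model"):
        used.add(config["default_model"])
    for decision in config.get("decisions", []):
        for ref in decision.get("modelRefs", []):
            used.add(ref["model"])
    model_config = config.get("model_config", {})
    return sorted(
        m for m in used
        if not model_config.get(m, {}).get("preferred_endpoints")
    )


example = {
    "default_model": "your-model",
    "model_config": {
        "your-model": {"preferred_endpoints": ["endpoint1"]},
        "gpt-4": {"pii_policy": {"allow_by_default": False}},
    },
    "decisions": [{"name": "math", "modelRefs": [{"model": "gpt-4"}]}],
}
print(missing_endpoint_models(example))
```

In this example gpt-4 is reported, because a decision references it while its model_config entry omits preferred_endpoints.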
Pricing (Optional)
If you want the router to compute request cost and expose Prometheus cost metrics, add a per-1M-token price and a currency to each model under model_config.
model_config:
phi4:
pricing:
currency: USD
prompt_per_1m: 0.07
completion_per_1m: 0.35
"mistral-small3.1":
pricing:
currency: USD
prompt_per_1m: 0.1
completion_per_1m: 0.3
"gemma3:27b":
pricing:
currency: USD
prompt_per_1m: 0.067
completion_per_1m: 0.267
- Cost formula: (prompt_tokens * prompt_per_1m + completion_tokens * completion_per_1m) / 1,000,000 (in the given currency).
- When pricing is not configured, the router still reports token and latency metrics; cost is treated as 0.
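The cost formula above can be written directly as a small Python function. The function name request_cost is illustrative; the prices come from the phi4 entry in the example.

```python
def request_cost(prompt_tokens, completion_tokens, pricing=None):
    """Cost of one request in the configured currency; 0 when no pricing is set."""
    if not pricing:
        return 0.0
    total = (
        prompt_tokens * pricing["prompt_per_1m"]
        + completion_tokens * pricing["completion_per_1m"]
    )
    return total / 1_000_000


phi4_pricing = {"currency": "USD", "prompt_per_1m": 0.07, "completion_per_1m": 0.35}
cost = request_cost(1000, 500, phi4_pricing)
print(f"{cost:.6f} {phi4_pricing['currency']}")
```

For 1,000 prompt tokens and 500 completion tokens on phi4, this works out to (70 + 175) / 1,000,000, or about 0.000245 USD.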
Classification Models
Configure the BERT classification models
classifier:
category_model:
model_id: "models/category_classifier_modernbert-base_model"
use_modernbert: true
threshold: 0.6 # Classification confidence threshold
use_cpu: true # Use CPU (no GPU required)
pii_model:
model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
threshold: 0.7 # PII detection threshold
use_cpu: true
Categories and Routing
Define how different types of queries are handled by the decision-based routing system
# Categories define domains for classification
categories:
- name: math
- name: computer science
- name: other
# Decisions define routing logic with rules and model selection
decisions:
- name: math
description: "Route mathematical queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "math"
modelRefs:
- model: your-model
use_reasoning: true # Enable reasoning for this model on math problems
- name: computer science
description: "Route computer science queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "computer science"
modelRefs:
- model: your-model
use_reasoning: true # Enable reasoning for code
- name: other
description: "Route general queries"
priority: 5
rules:
operator: "OR"
conditions:
- type: "domain"
name: "other"
modelRefs:
- model: your-model
use_reasoning: false # No reasoning for general queries
default_model: your-model # Fallback model
Model-Specific Reasoning
The use_reasoning field is configured per model inside each decision's modelRefs, which allows fine-grained control.
decisions:
- name: math
description: "Route mathematical queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "math"
modelRefs:
- model: gpt-oss-120b
use_reasoning: true # GPT-OSS-120b supports reasoning for math
- model: phi4
use_reasoning: false # phi4 doesn't support reasoning mode
- model: deepseek-v31
use_reasoning: true # DeepSeek supports reasoning for math
Model Reasoning Configuration
Configure how different models handle reasoning-mode syntax. This lets you add new models without changing code.
# Model reasoning configurations - define how different models handle reasoning syntax
model_reasoning_configs:
- name: "deepseek"
patterns: ["deepseek", "ds-", "ds_", "ds:", "ds "]
reasoning_syntax:
type: "chat_template_kwargs"
parameter: "thinking"
- name: "qwen3"
patterns: ["qwen3"]
reasoning_syntax:
type: "chat_template_kwargs"
parameter: "enable_thinking"
- name: "gpt-oss"
patterns: ["gpt-oss", "gpt_oss"]
reasoning_syntax:
type: "reasoning_effort"
parameter: "reasoning_effort"
- name: "gpt"
patterns: ["gpt"]
reasoning_syntax:
type: "reasoning_effort"
parameter: "reasoning_effort"
# Global default reasoning effort level (when not specified per category)
default_reasoning_effort: "medium"
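Model Reasoning Configuration Options
Configuration structure
- name: unique identifier for the model family
- patterns: array of patterns used to match model names
- reasoning_syntax.type: how the model expects reasoning mode to be specified
  - "chat_template_kwargs": uses chat template parameters (for models such as DeepSeek and Qwen3)
  - "reasoning_effort": uses the OpenAI-compatible reasoning_effort field (for GPT models)
- reasoning_syntax.parameter: the specific parameter name the model uses
Pattern matching: the system supports simple string patterns and regular expressions for flexible model matching
- Simple string match: "deepseek" matches any model containing "deepseek"
- Prefix pattern: "ds-" matches models that start with "ds-" or are exactly "ds"
- Regular expression: "^gpt-4.*" matches models that start with "gpt-4"
- Wildcard: "*" matches all models (for a catch-all configuration)
- Multiple patterns: ["deepseek", "ds-", "^phi.*"] matches any one of the patterns
Regular expression pattern examples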
patterns:
- "^gpt-4.*" # Models starting with "gpt-4"
- ".*-instruct$" # Models ending with "-instruct"
- "phi[0-9]+" # Models like "phi3", "phi4", etc.
- "^(llama|mistral)" # Models starting with "llama" or "mistral"
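One possible reading of the pattern rules described above is sketched in Python below. How the router actually tells plain strings, prefix patterns, and regular expressions apart is not stated in this guide, so the detection heuristics here (a trailing separator means prefix, any regex metacharacter means regex, case-sensitive matching) are assumptions, and all function names are hypothetical.

```python
import re

REGEX_META = set("^$.*+?[](){}|\\")

MODEL_REASONING_CONFIGS = [
    {"name": "deepseek", "patterns": ["deepseek", "ds-", "ds_", "ds:", "ds "]},
    {"name": "qwen3", "patterns": ["qwen3"]},
    {"name": "gpt-oss", "patterns": ["gpt-oss", "gpt_oss"]},
    {"name": "gpt", "patterns": ["gpt"]},
]


def pattern_matches(pattern, model):
    """Match one pattern against a model name (assumed semantics)."""
    if pattern == "*":
        return True  # wildcard: catch-all
    if any(ch in REGEX_META for ch in pattern):
        return re.search(pattern, model) is not None
    if pattern and pattern[-1] in "-_: ":
        # Prefix pattern: starts with it, or equals it without the separator.
        return model.startswith(pattern) or model == pattern[:-1]
    return pattern in model  # simple substring match


def first_matching_family(model, configs=MODEL_REASONING_CONFIGS):
    """Return the first family whose patterns match, or None for unknown models."""
    for cfg in configs:
        if any(pattern_matches(p, model) for p in cfg["patterns"]):
            return cfg["name"]
    return None
```

Because the first matching entry wins in this sketch, gpt-oss-120b resolves to the gpt-oss family even though the broader "gpt" pattern would also match; more specific families therefore need to be listed first.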
Adding new models: to support a new model family (for example, Claude), add a new configuration entry
model_reasoning_configs:
- name: "claude"
patterns: ["claude"]
reasoning_syntax:
type: "chat_template_kwargs"
parameter: "enable_reasoning"
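Unknown models: models that match no configured pattern get no reasoning fields when reasoning mode is enabled. This prevents problems with models that do not support reasoning syntax.
Default reasoning effort: sets the global default reasoning effort level, used when a category does not specify its own effort level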
default_reasoning_effort: "high" # Options: "low", "medium", "high"
Decision-specific reasoning effort: override the default effort level per decision
decisions:
- name: math
description: "Route mathematical queries"
priority: 10
reasoning_effort: "high" # Use high effort for complex math
rules:
operator: "OR"
conditions:
- type: "domain"
name: "math"
modelRefs:
- model: your-model
use_reasoning: true # Enable reasoning for this model
- name: general
description: "Route general queries"
priority: 5
reasoning_effort: "low" # Use low effort for general queries
rules:
operator: "OR"
conditions:
- type: "domain"
name: "general"
modelRefs:
- model: your-model
use_reasoning: true # Enable reasoning for this model
Security Features
Configure PII detection and jailbreak protection
# PII Detection
classifier:
pii_model:
threshold: 0.7 # Higher = more strict PII detection
# Jailbreak Protection
prompt_guard:
enabled: true # Enable jailbreak detection
threshold: 0.7 # Detection sensitivity
use_cpu: true # Runs on CPU
# Model-level PII policies
model_config:
"your-model":
pii_policy:
allow_by_default: true # Allow most content
pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"] # Specific allowed types
Optional Features
Configure additional features
# Semantic Caching
semantic_cache:
enabled: true # Enable semantic caching globally
backend_type: "memory" # Options: "memory" or "milvus"
similarity_threshold: 0.8 # Global default cache hit threshold
max_entries: 1000 # Maximum cache entries
ttl_seconds: 3600 # Cache expiration time
eviction_policy: "fifo" # Options: "fifo", "lru", "lfu"
# Decision-Level Cache Configuration (New)
# Override global cache settings for specific decisions
categories:
- name: health
- name: general_chat
- name: troubleshooting
decisions:
- name: health
description: "Route health queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "health"
modelRefs:
- model: your-model
use_reasoning: false
plugins:
- type: "semantic-cache"
configuration:
enabled: true
similarity_threshold: 0.95 # Very strict - medical accuracy critical
- name: general_chat
description: "Route general chat queries"
priority: 5
rules:
operator: "OR"
conditions:
- type: "domain"
name: "general_chat"
modelRefs:
- model: your-model
use_reasoning: false
plugins:
- type: "semantic-cache"
configuration:
similarity_threshold: 0.75 # Relaxed for better cache hits
- name: troubleshooting
description: "Route troubleshooting queries"
priority: 5
rules:
operator: "OR"
conditions:
- type: "domain"
name: "troubleshooting"
modelRefs:
- model: your-model
use_reasoning: false
# No cache plugin - uses global default (0.8)
# Tool Auto-Selection
tools:
enabled: true # Enable automatic tool selection
top_k: 3 # Number of tools to select
similarity_threshold: 0.2 # Tool relevance threshold
tools_db_path: "config/tools_db.json"
fallback_to_empty: true # Return empty on failure
# BERT Model for Similarity
bert_model:
model_id: sentence-transformers/all-MiniLM-L12-v2
threshold: 0.6 # Similarity threshold
use_cpu: true # CPU-only inference
# Batch Classification API Configuration
api:
batch_classification:
max_batch_size: 100 # Maximum texts per batch request
concurrency_threshold: 5 # Switch to concurrent processing at this size
max_concurrency: 8 # Maximum concurrent goroutines
# Metrics configuration for monitoring
metrics:
enabled: true # Enable Prometheus metrics collection
detailed_goroutine_tracking: true # Track individual goroutine lifecycle
high_resolution_timing: false # Use nanosecond precision timing
sample_rate: 1.0 # Collect metrics for all requests (1.0 = 100%)
# Batch size range labels for metrics (OPTIONAL - uses sensible defaults)
# Default ranges: "1", "2-5", "6-10", "11-20", "21-50", "50+"
# Only specify if you need custom ranges:
# batch_size_ranges:
# - {min: 1, max: 1, label: "1"}
# - {min: 2, max: 5, label: "2-5"}
# - {min: 6, max: 10, label: "6-10"}
# - {min: 11, max: 20, label: "11-20"}
# - {min: 21, max: 50, label: "21-50"}
# - {min: 51, max: -1, label: "50+"} # -1 means no upper limit
# Histogram buckets - choose from presets below or customize
duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
size_buckets: [1, 2, 5, 10, 20, 50, 100, 200]
# Preset examples for quick configuration (copy values above)
preset_examples:
fast:
duration: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
size: [1, 2, 3, 5, 8, 10]
standard:
duration: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
size: [1, 2, 5, 10, 20, 50, 100]
slow:
duration: [0.1, 0.5, 1, 5, 10, 30, 60, 120]
size: [10, 50, 100, 500, 1000, 5000]
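How to Use the Preset Examples
The configuration includes preset examples for quick setup. Here is how to use them:
Step 1: Choose your scenario
- fast - for real-time APIs (microsecond to millisecond response times)
- standard - for typical web APIs (millisecond to second response times)
- slow - for batch processing or heavy computation (seconds to minutes)
Step 2: Copy the preset values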
# Example: Switch to fast API configuration
# Copy from preset_examples.fast and paste to the actual config:
duration_buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
size_buckets: [1, 2, 3, 5, 8, 10]
Step 3: Restart the service
pkill -f "router"
make run-router
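Default Batch Size Ranges
The system provides sensible default batch size ranges that fit most use cases:
- "1" - single-text requests
- "2-5" - small batch requests
- "6-10" - medium batch requests
- "11-20" - large batch requests
- "21-50" - extra-large batch requests
- "50+" - maximum batch requests
You do not need to configure batch_size_ranges unless you have special requirements. The defaults are used automatically when it is omitted.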
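How a batch size maps onto one of these range labels can be sketched in Python as follows. The ranges mirror the commented-out defaults in the configuration, where max: -1 means no upper limit; the helper name batch_size_label and the "unknown" fallback are assumptions for illustration.

```python
DEFAULT_RANGES = [
    {"min": 1, "max": 1, "label": "1"},
    {"min": 2, "max": 5, "label": "2-5"},
    {"min": 6, "max": 10, "label": "6-10"},
    {"min": 11, "max": 20, "label": "11-20"},
    {"min": 21, "max": 50, "label": "21-50"},
    {"min": 51, "max": -1, "label": "50+"},  # -1 means no upper limit
]


def batch_size_label(size, ranges=DEFAULT_RANGES):
    """Return the label of the first range that contains the batch size."""
    for r in ranges:
        if size >= r["min"] and (r["max"] == -1 or size <= r["max"]):
            return r["label"]
    return "unknown"  # hypothetical fallback for sizes outside every range


print(batch_size_label(7), batch_size_label(51))
```

Custom ranges passed as the second argument follow the same min/max/label shape as the batch_size_ranges setting.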
Configuration Examples by Use Case
Real-time chat API (fast preset)
# Copy these values to your config for sub-millisecond monitoring
duration_buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
size_buckets: [1, 2, 3, 5, 8, 10]
# batch_size_ranges: uses defaults (no configuration needed)
E-commerce API (standard preset)
# Copy these values for typical web API response times
duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
size_buckets: [1, 2, 5, 10, 20, 50, 100]
# batch_size_ranges: uses defaults (no configuration needed)
Data processing pipeline (slow preset)
# Copy these values for heavy computation workloads
duration_buckets: [0.1, 0.5, 1, 5, 10, 30, 60, 120]
size_buckets: [10, 50, 100, 500, 1000, 5000]
# Custom batch size ranges for large-scale processing (overrides defaults)
batch_size_ranges:
- {min: 1, max: 50, label: "1-50"}
- {min: 51, max: 200, label: "51-200"}
- {min: 201, max: 1000, label: "201-1000"}
- {min: 1001, max: -1, label: "1000+"}
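Available Metrics
- batch_classification_requests_total - total number of batch requests
- batch_classification_duration_seconds - processing duration histogram
- batch_classification_texts_total - total number of texts processed
- batch_classification_errors_total - error count by type
- batch_classification_concurrent_goroutines - number of active goroutines
- batch_classification_size_distribution - batch size distribution
Access metrics at: https://:9190/metrics
Category-Level Cache Configuration
New: configure semantic cache settings at the category level for fine-grained control over caching behavior.
Why use category-level cache settings?
Different categories tolerate semantic variation differently:
- Sensitive categories (health, psychology, legal): small wording changes can cause significant differences in meaning. These need high similarity thresholds (0.92-0.95).
- General categories (chat, troubleshooting): less sensitive to small wording changes. Lower thresholds (0.75-0.82) can be used for a higher cache hit rate.
- Privacy categories: caching may need to be disabled entirely for compliance or security reasons.
Configuration Examples
Example 1: Mixed thresholds for different decisions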
semantic_cache:
enabled: true
backend_type: "memory"
similarity_threshold: 0.8 # Global default
categories:
- name: health
- name: psychology
- name: general_chat
- name: troubleshooting
decisions:
- name: health
description: "Route health queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "health"
modelRefs:
- model: your-model
use_reasoning: false
plugins:
- type: "system_prompt"
configuration:
enabled: true
system_prompt: "You are a health expert..."
mode: "replace"
- type: "semantic-cache"
configuration:
enabled: true
similarity_threshold: 0.95 # Very strict - "headache" vs "severe headache" = different
- name: psychology
description: "Route psychology queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "psychology"
modelRefs:
- model: your-model
use_reasoning: false
plugins:
- type: "system_prompt"
configuration:
enabled: true
system_prompt: "You are a psychology expert..."
mode: "replace"
- type: "semantic-cache"
configuration:
similarity_threshold: 0.92 # Strict - clinical nuances matter
- name: general_chat
description: "Route general chat queries"
priority: 5
rules:
operator: "OR"
conditions:
- type: "domain"
name: "general_chat"
modelRefs:
- model: your-model
use_reasoning: false
plugins:
- type: "system_prompt"
configuration:
enabled: true
system_prompt: "You are a helpful assistant..."
mode: "replace"
- type: "semantic-cache"
configuration:
similarity_threshold: 0.75 # Relaxed - "how's the weather" = "what's the weather"
- name: troubleshooting
description: "Route troubleshooting queries"
priority: 5
rules:
operator: "OR"
conditions:
- type: "domain"
name: "troubleshooting"
modelRefs:
- model: your-model
use_reasoning: false
plugins:
- type: "system_prompt"
configuration:
enabled: true
system_prompt: "You are a tech support expert..."
mode: "replace"
# No cache plugin - uses global threshold of 0.8
Example 2: Disable caching for sensitive data
categories:
- name: personal_data
decisions:
- name: personal_data
description: "Route personal data queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "personal_data"
modelRefs:
- model: your-model
use_reasoning: false
plugins:
- type: "system_prompt"
configuration:
enabled: true
system_prompt: "Handle personal information..."
mode: "replace"
- type: "semantic-cache"
configuration:
enabled: false # Disable cache entirely for privacy
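Configuration Options
Decision-level plugin fields
- plugins[].type: "semantic-cache" - semantic cache plugin configuration
  - configuration.enabled (optional, boolean): enables or disables caching for this decision. If unspecified, inherits the global semantic_cache.enabled.
  - configuration.similarity_threshold (optional, float 0.0-1.0): minimum similarity score for a cache hit in this decision. If unspecified, inherits the global semantic_cache.similarity_threshold.
Fallback hierarchy
1. Decision-specific plugin similarity_threshold (if set)
2. Global semantic_cache.similarity_threshold (if set)
3. bert_model.threshold (final fallback)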
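The fallback hierarchy can be expressed as a short lookup over a parsed config dict. The Python sketch below is illustrative only; effective_cache_threshold is a hypothetical helper, and the sample values are taken from the examples in this guide.

```python
def effective_cache_threshold(config, decision_name):
    """Resolve the cache similarity threshold that applies to one decision."""
    decision = next(
        (d for d in config.get("decisions", []) if d["name"] == decision_name),
        {},
    )
    # 1. Decision-level semantic-cache plugin, if it sets a threshold.
    for plugin in decision.get("plugins", []):
        if plugin.get("type") == "semantic-cache":
            value = plugin.get("configuration", {}).get("similarity_threshold")
            if value is not None:
                return value
    # 2. Global semantic_cache threshold.
    value = config.get("semantic_cache", {}).get("similarity_threshold")
    if value is not None:
        return value
    # 3. Final fallback: bert_model.threshold.
    return config.get("bert_model", {}).get("threshold")


CONFIG = {
    "bert_model": {"threshold": 0.6},
    "semantic_cache": {"enabled": True, "similarity_threshold": 0.8},
    "decisions": [
        {"name": "health", "plugins": [
            {"type": "semantic-cache",
             "configuration": {"enabled": True, "similarity_threshold": 0.95}},
        ]},
        {"name": "troubleshooting", "plugins": []},
    ],
}
```

Here health resolves to 0.95 from its plugin, troubleshooting falls back to the global 0.8, and a config without a global cache threshold would fall through to bert_model.threshold (0.6).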
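Best Practices
Threshold selection guidelines
- High precision (0.92-0.95): healthcare, psychological counseling, legal, finance
- Medium precision (0.85-0.90): technical documentation, education
- Low precision (0.75-0.82): casual chat, FAQ, troubleshooting
Privacy and compliance
- Disable caching (set the plugin's enabled: false) for decisions that handle:
  - Personally identifiable information (PII)
  - Financial data
  - Health records
  - Sensitive business information
Performance tuning
- Start with conservative (higher) thresholds
- Monitor the cache hit rate of each decision
- Lower the threshold for decisions with low hit rates
- Raise the threshold for decisions that show incorrect cache hits
Common Configuration Examples
Enable All Security Features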
# Enable PII detection
classifier:
pii_model:
threshold: 0.8 # Strict PII detection
# Enable jailbreak protection
prompt_guard:
enabled: true
threshold: 0.7
# Configure model PII policies
model_config:
"your-model":
pii_policy:
allow_by_default: false # Block all PII by default
pii_types_allowed: [] # No PII allowed
Performance Optimization
# Enable caching
semantic_cache:
enabled: true
backend_type: "memory"
similarity_threshold: 0.85 # Higher = stricter matching, fewer cache hits
max_entries: 5000
ttl_seconds: 7200 # 2 hour cache
eviction_policy: "fifo" # Options: "fifo", "lru", "lfu"
# Enable tool selection
tools:
enabled: true
top_k: 5 # Select more tools
similarity_threshold: 0.1 # Lower = more tools selected
Development Environment Setup
# Disable security for testing
prompt_guard:
enabled: false
# Disable caching for consistent results
semantic_cache:
enabled: false
# Lower classification thresholds
classifier:
category_model:
threshold: 0.3 # Lower = more specialized routing
Configuration Validation
Test Your Configuration
Validate your configuration before starting the router
# Test configuration syntax
python -c "import yaml; yaml.safe_load(open('config/config.yaml'))"
# Test the router with your config
make build
make run-router
Common Configuration Patterns
Multi-Model Configuration
vllm_endpoints:
- name: "math_endpoint"
address: "192.168.1.10" # Math server IP
port: 8000
weight: 1
- name: "general_endpoint"
address: "192.168.1.20" # General server IP
port: 8000
weight: 1
categories:
- name: math
- name: other
decisions:
- name: math
description: "Route mathematical queries"
priority: 10
rules:
operator: "OR"
conditions:
- type: "domain"
name: "math"
modelRefs:
- model: math-model
use_reasoning: true # Enable reasoning for math
- name: other
description: "Route general queries"
priority: 5
rules:
operator: "OR"
conditions:
- type: "domain"
name: "other"
modelRefs:
- model: general-model
use_reasoning: false # No reasoning for general queries
Load Balancing
vllm_endpoints:
- name: "endpoint1"
address: "192.168.1.30" # Primary server IP
port: 8000
weight: 2 # Higher weight = more traffic
- name: "endpoint2"
address: "192.168.1.31" # Secondary server IP
port: 8000
weight: 1
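To get a feel for what the weights imply, the Python sketch below computes each endpoint's expected share of traffic. It assumes traffic is distributed in proportion to weight, which goes beyond what this guide states ("higher weight = more traffic"); traffic_shares is a hypothetical helper.

```python
def traffic_shares(endpoints):
    """Expected fraction of requests per endpoint, assuming weight-proportional routing."""
    total = sum(e["weight"] for e in endpoints)
    return {e["name"]: e["weight"] / total for e in endpoints}


endpoints = [
    {"name": "endpoint1", "address": "192.168.1.30", "port": 8000, "weight": 2},
    {"name": "endpoint2", "address": "192.168.1.31", "port": 8000, "weight": 1},
]
for name, share in traffic_shares(endpoints).items():
    print(f"{name}: {share:.0%}")
```

Under that assumption, weights of 2 and 1 correspond to roughly two thirds and one third of requests.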
Best Practices
Security Configuration
For production environments
# Enable all security features
classifier:
pii_model:
threshold: 0.8 # Strict PII detection
prompt_guard:
enabled: true # Enable jailbreak protection
threshold: 0.7
model_config:
"your-model":
pii_policy:
allow_by_default: false # Block PII by default
Performance Tuning
For high-traffic scenarios
# Enable caching
semantic_cache:
enabled: true
backend_type: "memory"
similarity_threshold: 0.85 # Higher = stricter matching, fewer cache hits
max_entries: 10000
ttl_seconds: 3600
eviction_policy: "lru"
# Optimize classification
classifier:
category_model:
threshold: 0.7 # Balance accuracy vs speed
Development vs. Production
Development
# Relaxed settings for testing
classifier:
category_model:
threshold: 0.3 # Lower threshold for testing
prompt_guard:
enabled: false # Disable for development
semantic_cache:
enabled: false # Disable for consistent results
Production
# Strict settings for production
classifier:
category_model:
threshold: 0.7 # Higher threshold for accuracy
prompt_guard:
enabled: true # Enable security
semantic_cache:
enabled: true # Enable for performance
Troubleshooting
Common Issues
Invalid YAML syntax
# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('config/config.yaml'))"
Missing model files
# Check if models are downloaded
ls -la models/
# If missing, run: make download-models
Endpoint connectivity
# Test your backend server
curl -f http://your-server:8000/health
Configuration changes not taking effect
# Restart the router after config changes
make run-router
Testing the configuration
# Test with different queries
make test-auto-prompt-reasoning # Math query
make test-auto-prompt-no-reasoning # General query
make test-pii # PII detection
make test-prompt-guard # Jailbreak protection
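Model Reasoning Configuration Issues
Model does not receive reasoning fields
- Check that the model name matches a pattern in model_reasoning_configs
- Verify the pattern syntax (exact match vs. prefix)
- Unknown models get no reasoning fields (this is by design)
Wrong reasoning syntax applied
- Make sure reasoning_syntax.type matches the format your model expects
- Check that the reasoning_syntax.parameter name is correct
- DeepSeek models typically use chat_template_kwargs with "thinking"
- GPT models typically use reasoning_effort
Adding support for a new model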
# Add a new model configuration
model_reasoning_configs:
- name: "my-new-model"
patterns: ["my-model"]
reasoning_syntax:
type: "chat_template_kwargs" # or "reasoning_effort"
parameter: "custom_parameter"
Testing the model reasoning configuration
# Test reasoning with your specific model
curl -X POST https://:8801/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
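Configuration Generation
The semantic router supports automatic configuration generation based on model performance benchmarks. This workflow uses MMLU-Pro evaluation results to determine the best model routing for each category.
Benchmark Workflow
- Run the MMLU-Pro evaluation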
# Evaluate models using MMLU-Pro benchmark
python src/training/model_eval/mmlu_pro_vllm_eval.py \
--endpoint https://:8000/v1 \
--models phi4,gemma3:27b,mistral-small3.1 \
--samples-per-category 5 \
--use-cot \
--concurrent-requests 4 \
--output-dir results
- Generate the configuration
# Generate config.yaml from benchmark results
python src/training/model_eval/result_to_config.py \
--results-dir results \
--output-file config/config.yaml \
--similarity-threshold 0.80
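Generated Configuration Features
The generated configuration includes:
- Model performance ranking: models are ranked by their performance in each category
- Reasoning settings: reasoning requirements are configured automatically for each category
  - use_reasoning: whether to use step-by-step reasoning
  - reasoning_effort: the required effort level (low/medium/high)
- Default model selection: the best overall performer is set as the default model
- Security and performance settings: preconfigured optimal values for:
  - PII detection thresholds
  - Semantic cache settings
  - Tool selection parameters
Customizing the Generated Configuration
You can customize the generated config.yaml by:
- Editing category-specific settings in result_to_config.py
- Adjusting thresholds and parameters through command-line arguments
- Manually modifying the generated config.yaml
Workflow Example
Here is a complete example workflow for generating and testing a configuration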
# Run MMLU-Pro evaluation
# Option 1: Specify models manually
python src/training/model_eval/mmlu_pro_vllm_eval.py \
--endpoint https://:8000/v1 \
--models phi4,gemma3:27b,mistral-small3.1 \
--samples-per-category 5 \
--use-cot \
--concurrent-requests 4 \
--output-dir results \
--max-tokens 2048 \
--temperature 0.0 \
--seed 42
# Option 2: Auto-discover models from endpoint
python src/training/model_eval/mmlu_pro_vllm_eval.py \
--endpoint https://:8000/v1 \
--samples-per-category 5 \
--use-cot \
--concurrent-requests 4 \
--output-dir results \
--max-tokens 2048 \
--temperature 0.0 \
--seed 42
# Generate initial config
python src/training/model_eval/result_to_config.py \
--results-dir results \
--output-file config/config.yaml \
--similarity-threshold 0.80
# Test the generated config
make test
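This workflow ensures that your configuration is:
- Based on actual model performance
- Properly tested before deployment
- Version-controlled to track changes
- Optimized for your specific use case
Next Steps
The configuration system is designed to be simple yet powerful. Start with a basic configuration and enable advanced features gradually as needed.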