【Agent】Building a Harness | hermes-agent framework components

张小明 (front-end engineer) · 2026/4/30

note

  • hermes-agent implements a complete closed loop of "experience extraction → knowledge storage → intelligent retrieval → context injection → execution verification → automatic improvement": a project with a built-in self-learning loop. It does not just produce task summaries; it closes the loop of persistent memory + skill induction + retrieval + user modeling, mostly through engineering optimization.
  • The Skills system lets an AI agent accumulate experience like a human expert: successful approaches are written up as SOPs, continuously revised through use, and shareable with others.
  • Background review: memory review, skill review, etc. A lightweight review agent instance is forked asynchronously in the background to judge what in the conversation history can be distilled into valuable skills/memories. Code: hermes-agent/run_agent.py
  • Agentic RL reward functions:
    • e.g. a binary reward: 1 if the code runs / the retrieval is correct, 0 otherwise
    • the Combined (RLVR + OPD) method from OpenClaw-RL, where the RLVR reward is a weighted sum of three signals (correctness 70% + efficiency 15% + tool usage 15%)

Contents

  • note
  • 1. Building a Harness: the six components
  • 2. hermes-agent
    • 2.1 Background review agent
    • 2.2 Agentic RL
  • Reference

1. Building a Harness: the six components

【On Harness】How to build a Harness: the six components explained, https://mp.weixin.qq.com/s/HwqEaXSGkcYgUNrzB2okuA

The six components:
1. File system (the workbench)
Role: not just file storage, but the agent's "external brain". Used to store intermediate results, enable multi-agent collaboration (sharing state through files), and integrate with Git for version control and rollback.
2. Bash + sandbox (hands and feet)
Role: enables the self-verifying "write → run → fix" loop. The sandbox provides resource isolation (e.g. Docker) and stops the agent from running dangerous operations (e.g. rm -rf); it is what turns the agent from an "advisor" into an "engineer".
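The "write → run → fix" loop above can be sketched with a resource-limited subprocess standing in for a real sandbox; `run_in_sandbox` and the 10-second timeout are illustrative assumptions, not how any particular harness isolates execution (production setups use containers or VMs):

```python
# Minimal sketch: run a shell command with a hard timeout as a stand-in
# for a sandboxed bash tool. A real harness would add filesystem and
# network isolation (e.g. Docker) on top of this.
import subprocess

def run_in_sandbox(cmd: str, timeout: int = 10) -> tuple[int, str]:
    """Run a shell command with a timeout; return (exit_code, combined output)."""
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        # A hung command is treated as a failure rather than blocking the agent
        return -1, "timed out"
```

The agent can then feed the returned output back into its next turn to decide whether to fix and retry.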
3. Memory (AGENTS.md, the plug-in brain)
Role: a neat way to "add knowledge without changing weights". The agent writes project conventions and architecture decisions into a Markdown file that is automatically injected into context at the next startup. This is cheaper than fine-tuning, and humans can read and edit it.
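The injection step can be sketched in a few lines; `load_context` and the prompt layout here are illustrative assumptions, not hermes-agent's actual API:

```python
# Sketch: prepend an AGENTS.md memory file (if present) to the system prompt
# at startup, so accumulated project knowledge rides along with every request.
from pathlib import Path

BASE_PROMPT = "You are a coding agent."

def load_context(workdir: str) -> str:
    """Return the base system prompt, extended with AGENTS.md when it exists."""
    notes = Path(workdir, "AGENTS.md")
    if notes.exists():
        return f"{BASE_PROMPT}\n\n# Project notes\n{notes.read_text()}"
    return BASE_PROMPT
```

Because the file is plain Markdown, a human can edit or delete entries at any time, which is exactly the property that fine-tuning lacks.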
4. Web search + MCP
Role: web search addresses freshness (e.g. looking up the latest docs); MCP (Model Context Protocol), introduced by Anthropic as "the USB port of the AI world", lets the agent plug into databases, Jira, and other internal tools, upgrading the agent from "search" to "connect".
5. Context engineering (attention management)
Role: fights context rot. Compression (summarization), offloading (storing large outputs to files and keeping only a summary in context), and layered management keep important information from being drowned out and keep the model "clear-headed".
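Offloading in particular is easy to sketch; `offload` and the 2000-character threshold are illustrative assumptions, not a real harness's policy:

```python
# Sketch: keep small tool outputs in context verbatim, but spill large ones
# to a file and leave only a stub (length, path, and a short head) behind.
from pathlib import Path

def offload(output: str, path: str, limit: int = 2000) -> str:
    """Return the full output if small, else save it and return a stub."""
    if len(output) <= limit:
        return output
    Path(path).write_text(output)
    head = output[:200]
    return f"[output truncated: {len(output)} chars saved to {path}]\n{head}"
```

The agent can later read the file back with its file tools if it actually needs the full content, so nothing is lost, only moved out of the attention window.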

6. Orchestration + hooks (scheduling and quality control)
Role: orchestration decomposes large tasks and dispatches them to different agents (e.g. small models for simple tasks, large models for complex ones); hooks are quality gates that use deterministic rules (e.g. lint checks, format validation) to intercept faulty model output and guarantee a quality floor.
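A deterministic quality-gate hook can be sketched as follows; `lint_gate` and the use of the stdlib `py_compile` module (in place of a full linter) are illustrative assumptions:

```python
# Sketch: a hook that rejects generated Python unless it at least compiles.
# A real gate would chain more rules (lint, format, tests) the same way.
import subprocess
import sys
import tempfile

def lint_gate(code: str) -> bool:
    """Return True iff the snippet byte-compiles cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # py_compile is stdlib, so the gate needs no external linter installed
    result = subprocess.run([sys.executable, "-m", "py_compile", path],
                            capture_output=True)
    return result.returncode == 0
```

The point of a hook is that it is a rule, not a model: it never hallucinates, so it can serve as the hard floor beneath probabilistic output.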

2. hermes-agent

2.1 Background review agent

Whenever the main agent finishes replying, the interaction appears to be over from the user's point of view. In the background, however, Hermes calls _spawn_background_review to launch a review agent asynchronously: the system immediately forks a new lightweight agent instance dedicated to a deep post-mortem of the conversation that just ended. This background agent never interferes with the foreground user experience; instead it reviews the interaction along three dimensions, one prompt each:

  • Memory review (_MEMORY_REVIEW_PROMPT): what in this conversation is worth remembering? Judge whether the conversation contains key experience or facts worth keeping long-term, distill them into long-term memories, and store them in the agent's memory bank.
  • Skill review (_SKILL_REVIEW_PROMPT): is this task pattern worth turning into a skill? Analyze whether the current solution path generalizes and deserves to be abstracted and frozen into a reusable skill.
  • Combined review (_COMBINED_REVIEW_PROMPT): what could be improved? Reflect on whether the execution contained room for optimization or latent error patterns.

See the prompts in the source (as of 20260429). Note the instruction "THINK CLASS-FIRST. What general pattern of task did the user just complete": skills are generated at the level of task classes, not concrete tasks; memories store a user profile, e.g. preferences.

```python
# ------------------------------------------------------------------
# Background memory/skill review
# ------------------------------------------------------------------
_MEMORY_REVIEW_PROMPT = (
    "Review the conversation above and consider saving to memory if appropriate.\n\n"
    "Focus on:\n"
    "1. Has the user revealed things about themselves — their persona, desires, "
    "preferences, or personal details worth remembering?\n"
    "2. Has the user expressed expectations about how you should behave, their work "
    "style, or ways they want you to operate?\n\n"
    "If something stands out, save it using the memory tool. "
    "If nothing is worth saving, just say 'Nothing to save.' and stop."
)

_SKILL_REVIEW_PROMPT = (
    "Review the conversation above and consider whether a skill should be saved or updated.\n\n"
    "Work in this order — do not skip steps:\n\n"
    "1. SURVEY the existing skill landscape first. Call skills_list to see what you "
    "have. If anything looks potentially relevant, skill_view it before deciding. "
    "You are looking for the CLASS of task that just happened, not the exact task. "
    "Example: a successful Tauri build is in the class \"desktop app build "
    "troubleshooting\", not \"fix my specific Tauri error today\".\n\n"
    "2. THINK CLASS-FIRST. What general pattern of task did the user just complete? "
    "What conditions will trigger this pattern again? Describe the class in one "
    "sentence before looking at what to save.\n\n"
    "3. PREFER GENERALIZING AN EXISTING SKILL over creating a new one. If a skill "
    "already covers the class — even partially — update it (skill_manage patch) "
    "with the new insight. Broaden its \"when to use\" trigger if needed.\n\n"
    "4. ONLY CREATE A NEW SKILL when no existing skill reasonably covers the class. "
    "When you create one, name and scope it at the class level "
    "(\"react-i18n-setup\", not \"add-i18n-to-my-dashboard-app\"). The trigger "
    "section must describe the class of situations, not this one session.\n\n"
    "5. If you notice two existing skills that overlap, note it in your response "
    "so a future review can consolidate them. Do not consolidate now unless the "
    "overlap is obvious and low-risk.\n\n"
    "Only act when something is genuinely worth saving. "
    "If nothing stands out, just say 'Nothing to save.' and stop."
)

_COMBINED_REVIEW_PROMPT = (
    "Review the conversation above and consider two things:\n\n"
    "**Memory**: Has the user revealed things about themselves — their persona, "
    "desires, preferences, or personal details? Has the user expressed expectations "
    "about how you should behave, their work style, or ways they want you to operate? "
    "If so, save using the memory tool.\n\n"
    "**Skills**: Was a non-trivial approach used to complete a task that required trial "
    "and error, changing course due to experiential findings, or a different method "
    "or outcome than the user expected? If so, work in this order:\n"
    " a. SURVEY existing skills first (skills_list, then skill_view on candidates).\n"
    " b. Identify the CLASS of task, not the specific task "
    "(\"desktop app build troubleshooting\", not \"fix my Tauri error\").\n"
    " c. PREFER UPDATING/GENERALIZING an existing skill that covers the class.\n"
    " d. ONLY CREATE A NEW SKILL if no existing one covers the class. Scope at "
    "the class level, not this one session.\n"
    " e. If you notice overlapping skills during the survey, note it so a future "
    "review can consolidate them.\n\n"
    "Only act if there's something genuinely worth saving. "
    "If nothing stands out, just say 'Nothing to save.' and stop."
)
```
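The fire-and-forget fork described above can be sketched with asyncio; `run_review` and `spawn_background_review` here are illustrative stand-ins, not hermes-agent's actual `_spawn_background_review` implementation:

```python
# Sketch: after the foreground reply is sent, spawn a reviewer coroutine with
# the transcript plus one of the review prompts, without blocking the user.
import asyncio

async def run_review(transcript: str, prompt: str) -> str:
    # In the real system this would call a model equipped with memory/skill
    # tools; here we just yield control and return the no-op outcome.
    await asyncio.sleep(0)
    return "Nothing to save."

def spawn_background_review(transcript: str, prompt: str) -> asyncio.Task:
    """Fork the reviewer as a task; the caller does not await it inline."""
    return asyncio.create_task(run_review(transcript, prompt))

async def main() -> str:
    task = spawn_background_review("user: hi", "MEMORY_REVIEW")
    # The foreground turn is already finished; the review runs concurrently.
    return await task
```

Because the task is created rather than awaited at the call site, the reviewer's latency and failures stay invisible to the foreground conversation.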

2.2 Agentic RL

Reward functions:
(1) e.g. a binary reward: 1 if the code runs / the retrieval is correct, 0 otherwise
(2) the Combined (RLVR + OPD) method from OpenClaw-RL, where the RLVR reward is a weighted sum of three signals (correctness 70% + efficiency 15% + tool usage 15%)
Code: hermes-agent/environments/
Paper: OpenClaw-RL: Train Any Agent Simply by Talking, https://arxiv.org/pdf/2603.10165

The core of OpenClaw-RL:

  • Collect the "next-state signals" (the user following up / correcting / expressing satisfaction, tool results / errors, etc.) as online training data: every piece of post-interaction user feedback, tool result, and environment change is turned into an online RL signal, so the agent keeps improving through real use.
  • The OpenClaw-RL combined advantage:

    A_t^{\text{combined}} = w_{\text{binary}}\, r_{\text{final}} + w_{\text{opd}} \left( \log \pi_{\text{teacher}}(a_t \mid s_{\text{enhanced}}) - \log \pi_\theta(a_t \mid s_t) \right)

  • The binary reward and the teacher-student distribution gap are weighted into a new advantage estimate:
    • run one forward pass of the teacher model on the hint-enhanced prompt to obtain its logprobs
    • this alleviates the credit-assignment problem
| Method | Advantage source | Granularity |
| --- | --- | --- |
| RLVR / Binary RL | r_final given by the PRM | response-level / sequence-level |
| OPD | teacher-student logprob gap | token-level |
| Combined | weighted sum of the two | mixed |
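The combined advantage can be sketched per token as follows; the weights `w_binary` and `w_opd` are illustrative placeholders, since the paper's actual values are not given here:

```python
# Sketch: per-token combined advantage = weighted binary reward plus the
# OPD term (teacher logprob minus student logprob at each token).
def combined_advantage(r_final, teacher_logps, student_logps,
                       w_binary=0.5, w_opd=0.5):
    """A_t = w_binary * r_final + w_opd * (logp_teacher_t - logp_student_t)."""
    return [w_binary * r_final + w_opd * (t - s)
            for t, s in zip(teacher_logps, student_logps)]
```

Note how the sequence-level signal (r_final) is broadcast to every token, while the OPD term varies per token, which is what gives the combined estimate its finer-grained credit assignment.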

The OpenClaw-RL RLVR reward mentioned above, for reference:

```python
async def compute_reward(
    self,
    item: dict,
    result: AgentResult,
    ctx: ToolContext,
) -> float:
    """
    Multi-signal reward:
    - correctness (0.7): Did the tests pass?
    - efficiency (0.15): Fewer turns = better
    - tool_usage (0.15): Did the agent actually write + run code?
    """
    cfg = self.config

    # ---- Signal 1: Test correctness ----
    # Check if test_solution.py exists and passes in the agent's sandbox
    correctness = 0.0
    try:
        test_result = ctx.terminal("python test_solution.py 2>&1", timeout=30)
        output = test_result.get("output", "")
        exit_code = test_result.get("exit_code", 1)
        if exit_code == 0 and "passed" in output.lower():
            correctness = 1.0
        elif exit_code == 0:
            correctness = 0.8  # Ran without error but no explicit "passed"
        elif "assert" in output.lower() and "error" in output.lower():
            correctness = 0.2  # Partial — code runs but assertions fail
        else:
            correctness = 0.1  # Code errors out entirely
    except Exception as e:
        logger.debug("Test execution failed in reward: %s", e)
        correctness = 0.0

    # ---- Signal 2: Efficiency ----
    max_turns = cfg.max_agent_turns
    turns_used = result.turns_used
    if turns_used <= 3:
        efficiency = 1.0
    elif turns_used <= max_turns // 2:
        efficiency = 0.8
    elif turns_used <= max_turns * 3 // 4:
        efficiency = 0.5
    else:
        efficiency = 0.2

    # ---- Signal 3: Tool usage ----
    tools_used = set()
    for msg in result.messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.add(name)
    # Good: used both terminal and file tools
    if "terminal" in tools_used and ("write_file" in tools_used or "patch" in tools_used):
        tool_usage = 1.0
    elif "terminal" in tools_used:
        tool_usage = 0.6
    elif tools_used:
        tool_usage = 0.3
    else:
        tool_usage = 0.0

    # ---- Combine ----
    reward = (
        cfg.correctness_weight * correctness
        + cfg.efficiency_weight * efficiency
        + cfg.tool_usage_weight * tool_usage
    )
    reward = min(1.0, max(0.0, reward))

    # Track metrics
    self._reward_buffer.append(reward)
    self._correctness_buffer.append(correctness)
    self._efficiency_buffer.append(efficiency)
    self._tool_usage_buffer.append(tool_usage)

    logger.debug(
        "Reward: correctness=%.2f, efficiency=%.2f, tool_usage=%.2f → %.3f",
        correctness, efficiency, tool_usage, reward,
    )
    return reward
```

Reference

[1] Understanding Hermes in one article: how the trending agent self-evolves from experience
[2] https://github.com/NousResearch/hermes-agent
[3] A deep dive into how Hermes Agent achieves "self-evolution", and its Prompt / Context / Harness design practices
[4] https://hermes-agent.nousresearch.com/docs/user-guide/features/rl-training
