【Agent】Building a Harness | hermes-agent framework components

张小明 (front-end engineer) · 2026/4/30

note

  • hermes-agent implements a complete closed loop of "experience extraction → knowledge storage → intelligent retrieval → context injection → execution verification → automatic improvement": a project with a built-in self-learning loop. It does not just produce task summaries; it closes the loop of persistent memory + skill induction + retrieval + user modeling, mostly through engineering optimization.
  • The Skills system lets an AI agent accumulate experience like a human expert: successful approaches are written up as SOPs, continuously revised through use, and shareable with others.
  • Background review: memory review, skill review, etc. A lightweight review agent instance is forked asynchronously in the background to judge what in the conversation history can be distilled into valuable skills/memories. Code: hermes-agent/run_agent.py
  • Agentic RL reward functions:
    • e.g. a binary reward: 1 if the code runs / the retrieval is correct, 0 otherwise
    • the Combined (RLVR + OPD) method from OpenClaw-RL, where the RLVR reward is a weighted sum of three signals (correctness 70% + efficiency 15% + tool usage 15%)

Contents

  • note
  • 1. Building a Harness: the six components
  • 2. hermes-agent
    • 2.1 Background review agent
    • 2.2 Agentic RL
  • Reference

1. Building a Harness: the six components

【On Harness】How to build a Harness: the six components explained, https://mp.weixin.qq.com/s/HwqEaXSGkcYgUNrzB2okuA

The six components:
1. File system (the workbench)
Role: not just file storage, but the agent's "external brain". Used to store intermediate results, enable multi-agent collaboration (sharing state through files), and integrate with Git for version control and rollback.
2. Bash + sandbox (hands and feet)
Role: enables the self-verifying "write → run → fix" loop. The sandbox provides resource isolation (e.g. Docker) and stops the agent from running dangerous operations (e.g. rm -rf); it is what turns the agent from an "advisor" into an "engineer".
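The "write → run → fix" loop above can be sketched with a resource-limited subprocess standing in for a real sandbox; `run_in_sandbox` and the 10-second timeout are illustrative assumptions, not how any particular harness isolates execution (production setups use containers or VMs):

```python
# Minimal sketch: run a shell command with a hard timeout as a stand-in
# for a sandboxed bash tool. A real harness would add filesystem and
# network isolation (e.g. Docker) on top of this.
import subprocess

def run_in_sandbox(cmd: str, timeout: int = 10) -> tuple[int, str]:
    """Run a shell command with a timeout; return (exit_code, combined output)."""
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        # A hung command is treated as a failure rather than blocking the agent
        return -1, "timed out"
```

The agent can then feed the returned output back into its next turn to decide whether to fix and retry.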
3. Memory (AGENTS.md, the plug-in brain)
Role: a neat way to "add knowledge without changing weights". The agent writes project conventions and architecture decisions into a Markdown file that is automatically injected into context at the next startup. This is cheaper than fine-tuning, and humans can read and edit it.
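The injection step can be sketched in a few lines; `load_context` and the prompt layout here are illustrative assumptions, not hermes-agent's actual API:

```python
# Sketch: prepend an AGENTS.md memory file (if present) to the system prompt
# at startup, so accumulated project knowledge rides along with every request.
from pathlib import Path

BASE_PROMPT = "You are a coding agent."

def load_context(workdir: str) -> str:
    """Return the base system prompt, extended with AGENTS.md when it exists."""
    notes = Path(workdir, "AGENTS.md")
    if notes.exists():
        return f"{BASE_PROMPT}\n\n# Project notes\n{notes.read_text()}"
    return BASE_PROMPT
```

Because the file is plain Markdown, a human can edit or delete entries at any time, which is exactly the property that fine-tuning lacks.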
4. Web search + MCP
Role: web search addresses freshness (e.g. looking up the latest docs); MCP (Model Context Protocol), introduced by Anthropic as "the USB port of the AI world", lets the agent plug into databases, Jira, and other internal tools, upgrading the agent from "search" to "connect".
5. Context engineering (attention management)
Role: fights context rot. Compression (summarization), offloading (storing large outputs to files and keeping only a summary in context), and layered management keep important information from being drowned out and keep the model "clear-headed".
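Offloading in particular is easy to sketch; `offload` and the 2000-character threshold are illustrative assumptions, not a real harness's policy:

```python
# Sketch: keep small tool outputs in context verbatim, but spill large ones
# to a file and leave only a stub (length, path, and a short head) behind.
from pathlib import Path

def offload(output: str, path: str, limit: int = 2000) -> str:
    """Return the full output if small, else save it and return a stub."""
    if len(output) <= limit:
        return output
    Path(path).write_text(output)
    head = output[:200]
    return f"[output truncated: {len(output)} chars saved to {path}]\n{head}"
```

The agent can later read the file back with its file tools if it actually needs the full content, so nothing is lost, only moved out of the attention window.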

6. Orchestration + hooks (scheduling and quality control)
Role: orchestration decomposes large tasks and dispatches them to different agents (e.g. small models for simple tasks, large models for complex ones); hooks are quality gates that use deterministic rules (e.g. lint checks, format validation) to intercept faulty model output and guarantee a quality floor.
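A deterministic quality-gate hook can be sketched as follows; `lint_gate` and the use of the stdlib `py_compile` module (in place of a full linter) are illustrative assumptions:

```python
# Sketch: a hook that rejects generated Python unless it at least compiles.
# A real gate would chain more rules (lint, format, tests) the same way.
import subprocess
import sys
import tempfile

def lint_gate(code: str) -> bool:
    """Return True iff the snippet byte-compiles cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # py_compile is stdlib, so the gate needs no external linter installed
    result = subprocess.run([sys.executable, "-m", "py_compile", path],
                            capture_output=True)
    return result.returncode == 0
```

The point of a hook is that it is a rule, not a model: it never hallucinates, so it can serve as the hard floor beneath probabilistic output.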

2. hermes-agent

2.1 Background review agent

Whenever the main agent finishes replying, the interaction appears to be over from the user's point of view. In the background, however, Hermes calls _spawn_background_review to launch a review agent asynchronously: the system immediately forks a new lightweight agent instance dedicated to a deep post-mortem of the conversation that just ended. This background agent never interferes with the foreground user experience; instead it reviews the interaction along three dimensions, one prompt each:

  • Memory review (_MEMORY_REVIEW_PROMPT): what in this conversation is worth remembering? Judge whether the conversation contains key experience or facts worth keeping long-term, distill them into long-term memories, and store them in the agent's memory bank.
  • Skill review (_SKILL_REVIEW_PROMPT): is this task pattern worth turning into a skill? Analyze whether the current solution path generalizes and deserves to be abstracted and frozen into a reusable skill.
  • Combined review (_COMBINED_REVIEW_PROMPT): what could be improved? Reflect on whether the execution contained room for optimization or latent error patterns.

See the prompts in the source (as of 20260429). Note the instruction "THINK CLASS-FIRST. What general pattern of task did the user just complete": skills are generated at the level of task classes, not concrete tasks; memories store a user profile, e.g. preferences.

```python
# ------------------------------------------------------------------
# Background memory/skill review
# ------------------------------------------------------------------
_MEMORY_REVIEW_PROMPT = (
    "Review the conversation above and consider saving to memory if appropriate.\n\n"
    "Focus on:\n"
    "1. Has the user revealed things about themselves — their persona, desires, "
    "preferences, or personal details worth remembering?\n"
    "2. Has the user expressed expectations about how you should behave, their work "
    "style, or ways they want you to operate?\n\n"
    "If something stands out, save it using the memory tool. "
    "If nothing is worth saving, just say 'Nothing to save.' and stop."
)

_SKILL_REVIEW_PROMPT = (
    "Review the conversation above and consider whether a skill should be saved or updated.\n\n"
    "Work in this order — do not skip steps:\n\n"
    "1. SURVEY the existing skill landscape first. Call skills_list to see what you "
    "have. If anything looks potentially relevant, skill_view it before deciding. "
    "You are looking for the CLASS of task that just happened, not the exact task. "
    "Example: a successful Tauri build is in the class \"desktop app build "
    "troubleshooting\", not \"fix my specific Tauri error today\".\n\n"
    "2. THINK CLASS-FIRST. What general pattern of task did the user just complete? "
    "What conditions will trigger this pattern again? Describe the class in one "
    "sentence before looking at what to save.\n\n"
    "3. PREFER GENERALIZING AN EXISTING SKILL over creating a new one. If a skill "
    "already covers the class — even partially — update it (skill_manage patch) "
    "with the new insight. Broaden its \"when to use\" trigger if needed.\n\n"
    "4. ONLY CREATE A NEW SKILL when no existing skill reasonably covers the class. "
    "When you create one, name and scope it at the class level "
    "(\"react-i18n-setup\", not \"add-i18n-to-my-dashboard-app\"). The trigger "
    "section must describe the class of situations, not this one session.\n\n"
    "5. If you notice two existing skills that overlap, note it in your response "
    "so a future review can consolidate them. Do not consolidate now unless the "
    "overlap is obvious and low-risk.\n\n"
    "Only act when something is genuinely worth saving. "
    "If nothing stands out, just say 'Nothing to save.' and stop."
)

_COMBINED_REVIEW_PROMPT = (
    "Review the conversation above and consider two things:\n\n"
    "**Memory**: Has the user revealed things about themselves — their persona, "
    "desires, preferences, or personal details? Has the user expressed expectations "
    "about how you should behave, their work style, or ways they want you to operate? "
    "If so, save using the memory tool.\n\n"
    "**Skills**: Was a non-trivial approach used to complete a task that required trial "
    "and error, changing course due to experiential findings, or a different method "
    "or outcome than the user expected? If so, work in this order:\n"
    " a. SURVEY existing skills first (skills_list, then skill_view on candidates).\n"
    " b. Identify the CLASS of task, not the specific task "
    "(\"desktop app build troubleshooting\", not \"fix my Tauri error\").\n"
    " c. PREFER UPDATING/GENERALIZING an existing skill that covers the class.\n"
    " d. ONLY CREATE A NEW SKILL if no existing one covers the class. Scope at "
    "the class level, not this one session.\n"
    " e. If you notice overlapping skills during the survey, note it so a future "
    "review can consolidate them.\n\n"
    "Only act if there's something genuinely worth saving. "
    "If nothing stands out, just say 'Nothing to save.' and stop."
)
```
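The fire-and-forget fork described above can be sketched with asyncio; `run_review` and `spawn_background_review` here are illustrative stand-ins, not hermes-agent's actual `_spawn_background_review` implementation:

```python
# Sketch: after the foreground reply is sent, spawn a reviewer coroutine with
# the transcript plus one of the review prompts, without blocking the user.
import asyncio

async def run_review(transcript: str, prompt: str) -> str:
    # In the real system this would call a model equipped with memory/skill
    # tools; here we just yield control and return the no-op outcome.
    await asyncio.sleep(0)
    return "Nothing to save."

def spawn_background_review(transcript: str, prompt: str) -> asyncio.Task:
    """Fork the reviewer as a task; the caller does not await it inline."""
    return asyncio.create_task(run_review(transcript, prompt))

async def main() -> str:
    task = spawn_background_review("user: hi", "MEMORY_REVIEW")
    # The foreground turn is already finished; the review runs concurrently.
    return await task
```

Because the task is created rather than awaited at the call site, the reviewer's latency and failures stay invisible to the foreground conversation.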

2.2 Agentic RL

Reward functions:
(1) e.g. a binary reward: 1 if the code runs / the retrieval is correct, 0 otherwise
(2) the Combined (RLVR + OPD) method from OpenClaw-RL, where the RLVR reward is a weighted sum of three signals (correctness 70% + efficiency 15% + tool usage 15%)
Code: hermes-agent/environments/
Paper: OpenClaw-RL: Train Any Agent Simply by Talking, https://arxiv.org/pdf/2603.10165

The core of OpenClaw-RL:

  • Collect the "next-state signals" (the user following up / correcting / expressing satisfaction, tool results / errors, etc.) as online training data: every piece of post-interaction user feedback, tool result, and environment change is turned into an online RL signal, so the agent keeps improving through real use.
  • The OpenClaw-RL combined advantage:

    A_t^{\text{combined}} = w_{\text{binary}}\, r_{\text{final}} + w_{\text{opd}} \left( \log \pi_{\text{teacher}}(a_t \mid s_{\text{enhanced}}) - \log \pi_\theta(a_t \mid s_t) \right)

  • The binary reward and the teacher-student distribution gap are weighted into a new advantage estimate:
    • run one forward pass of the teacher model on the hint-enhanced prompt to obtain its logprobs
    • this alleviates the credit-assignment problem
| Method | Advantage source | Granularity |
| --- | --- | --- |
| RLVR / Binary RL | r_final given by the PRM | response-level / sequence-level |
| OPD | teacher-student logprob gap | token-level |
| Combined | weighted sum of the two | mixed |
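The combined advantage can be sketched per token as follows; the weights `w_binary` and `w_opd` are illustrative placeholders, since the paper's actual values are not given here:

```python
# Sketch: per-token combined advantage = weighted binary reward plus the
# OPD term (teacher logprob minus student logprob at each token).
def combined_advantage(r_final, teacher_logps, student_logps,
                       w_binary=0.5, w_opd=0.5):
    """A_t = w_binary * r_final + w_opd * (logp_teacher_t - logp_student_t)."""
    return [w_binary * r_final + w_opd * (t - s)
            for t, s in zip(teacher_logps, student_logps)]
```

Note how the sequence-level signal (r_final) is broadcast to every token, while the OPD term varies per token, which is what gives the combined estimate its finer-grained credit assignment.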

The OpenClaw-RL RLVR reward mentioned above, for reference:

```python
async def compute_reward(
    self,
    item: dict,
    result: AgentResult,
    ctx: ToolContext,
) -> float:
    """
    Multi-signal reward:
    - correctness (0.7): Did the tests pass?
    - efficiency (0.15): Fewer turns = better
    - tool_usage (0.15): Did the agent actually write + run code?
    """
    cfg = self.config

    # ---- Signal 1: Test correctness ----
    # Check if test_solution.py exists and passes in the agent's sandbox
    correctness = 0.0
    try:
        test_result = ctx.terminal("python test_solution.py 2>&1", timeout=30)
        output = test_result.get("output", "")
        exit_code = test_result.get("exit_code", 1)
        if exit_code == 0 and "passed" in output.lower():
            correctness = 1.0
        elif exit_code == 0:
            correctness = 0.8  # Ran without error but no explicit "passed"
        elif "assert" in output.lower() and "error" in output.lower():
            correctness = 0.2  # Partial — code runs but assertions fail
        else:
            correctness = 0.1  # Code errors out entirely
    except Exception as e:
        logger.debug("Test execution failed in reward: %s", e)
        correctness = 0.0

    # ---- Signal 2: Efficiency ----
    max_turns = cfg.max_agent_turns
    turns_used = result.turns_used
    if turns_used <= 3:
        efficiency = 1.0
    elif turns_used <= max_turns // 2:
        efficiency = 0.8
    elif turns_used <= max_turns * 3 // 4:
        efficiency = 0.5
    else:
        efficiency = 0.2

    # ---- Signal 3: Tool usage ----
    tools_used = set()
    for msg in result.messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.add(name)
    # Good: used both terminal and file tools
    if "terminal" in tools_used and ("write_file" in tools_used or "patch" in tools_used):
        tool_usage = 1.0
    elif "terminal" in tools_used:
        tool_usage = 0.6
    elif tools_used:
        tool_usage = 0.3
    else:
        tool_usage = 0.0

    # ---- Combine ----
    reward = (
        cfg.correctness_weight * correctness
        + cfg.efficiency_weight * efficiency
        + cfg.tool_usage_weight * tool_usage
    )
    reward = min(1.0, max(0.0, reward))

    # Track metrics
    self._reward_buffer.append(reward)
    self._correctness_buffer.append(correctness)
    self._efficiency_buffer.append(efficiency)
    self._tool_usage_buffer.append(tool_usage)

    logger.debug(
        "Reward: correctness=%.2f, efficiency=%.2f, tool_usage=%.2f → %.3f",
        correctness, efficiency, tool_usage, reward,
    )
    return reward
```

Reference

[1] Understanding Hermes in one article: how the trending agent self-evolves from experience
[2] https://github.com/NousResearch/hermes-agent
[3] A deep dive into how Hermes Agent achieves "self-evolution", and its Prompt / Context / Harness design practices
[4] https://hermes-agent.nousresearch.com/docs/user-guide/features/rl-training
