Claude Agent SDK 端到端自動化開發完整實戰：Agent harness、多 Agent 上下文交接、自動測試驗收的工程 pipeline 建構指南

自由揚John2026年5月17日約 37 分鐘閱讀

複製引文

Claude Agent SDK Agent harness 自動化開發 pipeline 工程實戰指南封面

早上九點。你打開 terminal，輸入一句「把這個 PRD 拆成 epic、寫程式、跑測試、修 bug、跑 CI、產 changelog、開 PR」。然後去煮咖啡。回來的時候，PR 已經在等你 review，測試全綠，failing edge case 也被另一個子 Agent 補了單元測試。

這不是科幻片。Anthropic 在 2025 年 9 月正式把 Claude Code SDK 改名為 Claude Agent SDK，把同樣驅動 Claude Code 的 agent loop、context engine、tool layer 開放出來給開發者程式化呼叫。意思是，過去你要花三個月手刻的「需求 → 開發 → 測試 → 驗收 → 多 Agent 協作」這條工程產線，現在的關鍵設計題從「能不能做」變成「怎麼建得穩」。

問題是——能跑起來和能上線是兩件事。Atlanta Tech News 2026 年 1 月引用的 Gartner 資料顯示，88% 的 AI Agent pilot 從來沒進入正式環境。DigitalApplied 三月針對 650 位企業技術主管的調查更直接：78% 有在跑 pilot，只有 14% 真的 scaled to production。

這篇文章是寫給工程師跟技術主管的工程實戰指南。讀完你會知道：Claude Agent SDK 到底解決了什麼、Agent harness 的四個核心組件怎麼拆、端到端 pipeline 的六個階段該怎麼接、多 Agent 之間的上下文交接該用 file-based 還是 in-memory、自動驗收層怎麼設計才不會被 Agent 自己騙過去，以及——業界踩過哪些坑你最好繞開。

為什麼「AI 會寫程式」不等於「自動化開發」

先把話講清楚。Cursor、Copilot、ChatGPT 寫出來的程式碼，是給人類工程師用的 — 你 review、你修改、你提交、你跑測試、你開 PR。整個 development loop 的迴圈閉合在你的大腦裡。

Agent pipeline 不一樣。Agent pipeline 是把這個 loop 從人腦搬到工程系統裡，由 Agent 自己讀需求、自己拆任務、自己寫程式、自己跑測試、自己看到失敗自己修、自己決定什麼時候交給下一個 Agent、自己把 context 寫成可追溯的檔案。人類介入的點只剩兩個：上游（規格）和下游（最後驗收）。

這個差距有多大？VentureBeat 2026 年的調查指出，即使通過了 QA 和 staging 階段，仍有 43% 的 AI 生成 code 在 production 需要手動 debug。Cursor 寫的程式碼進 production 失敗率高，正是因為驗收環節留在人類身上；而完整的 Agent harness 會把驗收 loop 推到 commit 之前，每個小任務都自帶 known-good checkpoint。

從架構角度看，兩種模式的本質差異可以一張表講完：

面向	AI Coding 工具（Cursor / Copilot）	Agent Pipeline（Claude Agent SDK）
迴圈閉合點	人類工程師	Agent harness 自己
失敗處理	等人類發現再修	在 commit 前自動驗證、失敗就回退或重試
上下文管理	單一 session、context 滿了就斷	多 Agent 隔離 context、透過檔案交接
可重複性	依賴工程師當下狀態	可被 cron 觸發、可被 CI 觸發、可被另一個 Agent 觸發
適用場景	互動式開發、學習、prototype	重複性任務、夜間批次、規格驅動的功能交付
工程成本	低（裝外掛就好）	中高（要設計 harness 與驗收層）

如果你在公司現在的議題是「要不要讓工程師都裝 Cursor」，那不是這篇要解的問題。如果你想的是「能不能讓 Agent 替我們把整個 backlog 從 spec 到 PR 一條龍跑完，工程師只負責 review」，那 Agent harness 是你必須開始研究的工程主題。

想看不同 Agent 框架（LangGraph、CrewAI、AutoGen、OpenAI Agents SDK）的橫向比較，可以搭配閱讀 AI Agent 框架選型完整指南；想理解工程師一人接案怎麼用 Claude Code 一條龍交付，可以看 工程師用 Claude Code 一人交付完整接案指南。

拆解 Claude Agent SDK：跟直接呼叫 API、Claude Code 差在哪

工程師常見的誤解是：「不就是另一個 LLM SDK 嗎，跟 anthropic 套件呼叫 messages.create 差在哪？」差很多。

Anthropic 官方的 engineering blog把這個差異講得很乾脆：「Agent SDK 把驅動 Claude Code 的同一套 agent loop、tools、context management 開放出來，可以用 Python 或 TypeScript 寫進你自己的 process。」

意思是你拿到的就是 Claude Code 本體的 runtime。具體差異看下表。

層級	直接呼叫 Messages API	Claude Agent SDK	Claude Code CLI
Agent loop	你自己寫	SDK 內建	內建
Tool execution	你自己解析 tool_use / tool_result	SDK 自動跑（client tools）	自動跑
Context 管理	你自己截斷	auto-compact 自動摘要	auto-compact
內建工具	無（要自己寫）	20+ 個（Read / Write / Edit / Bash / Glob / Grep / WebFetch / Agent / ToolSearch...）	同 SDK
Subagent 支援	無	原生 Agent tool 可 spawn subagent	subagent 原生支援
MCP 整合	無	內建 MCP server / client	內建
適合誰	想完全控制 prompt 與 loop 的研究者	要把 agent 包進自家服務的工程師	互動式開發 / 終端使用者

白話講：如果你直接呼叫 messages.create，你要自己寫 while loop、自己處理 stop_reason、自己解析 tool_use blocks、自己跑 bash、自己接 tool_result、自己決定什麼時候要 compact context。Agent SDK 把這些全部包起來。你只需要傳一段 prompt 進去，然後 await 結果。

另一個關鍵差異是 server tools vs client tools。Augment Code 的拆解很實用：web_search、code_execution、web_fetch、tool_search 跑在 Anthropic 雲端，你不用部署；bash、text_editor 跟你自定義工具則跑在你的 process 裡，由 SDK 自動處理 tool result 回傳。

最小可運作的 Agent SDK 呼叫（Python 範例）：

Python

from claude_agent_sdk import query, ClaudeAgentOptions

options = ClaudeAgentOptions(
    system_prompt="You are a senior backend engineer. Read the PRD, implement, test, commit.",
    allowed_tools=["Read", "Write", "Edit", "Bash", "Grep", "Glob", "Agent"],
    permission_mode="acceptEdits",  # 'ask' / 'acceptEdits' / 'bypassPermissions'
    cwd="/workspace/my-project",
)

async for message in query(
    prompt="Implement the user-export endpoint per docs/prd-user-export.md, "
           "then run pytest and commit when green.",
    options=options,
):
    print(message)

三十行不到，你就有了一個會自己讀檔、自己改檔、自己跑 bash、自己 spawn subagent 的 Agent。但這只是 Agent，還不是 harness。下一節談 harness 是什麼。

Agent harness 四大組件架構 Agent Loop Tool Context Verification

Agent harness 必備的四個組件：Loop、Tool、Context、Verification

「Harness」這個詞在 Anthropic 內部用得很頻繁。直譯是「韁繩」「挽具」——把一匹會自己跑的馬綁進可控的工作迴圈。Agent harness Agent harness 指的是把 Agent 包進生產級工作流的那層工程。

一個能上 production 的 harness 至少要有四個組件。

組件一：Agent Loop（決策迴圈）

這是 Claude Agent SDK 直接給你的。Loop 的生命週期是：init → 評估 prompt → 呼叫 tool → 收 tool_result → 再評估 → ... → 終止。Anthropic 文件叫這個 agent loop reference。重點是 SDK 已經幫你把 stop_reason 判讀、tool_use block 解析、結果回傳全包好了，你不要再自己寫一個 while True——這是新手最常見的反模式。

組件二：Tool Layer（能力層）

Agent 能做什麼，等於 tool layer 給它什麼。預設的 20 多個 built-in tools 涵蓋 Read/Write/Edit/Bash/Grep/Glob/Agent/WebFetch，已經夠跑一般軟體開發。但實戰上你一定要加自家的 custom tool — 連你們的 ticketing 系統、發 PR 到 GitLab、查 Sentry error、寫 Slack 通知。這些用 in-process MCP server 寫，幾十行搞定。

組件三：Context Engine（上下文引擎）

這是最容易被低估的一塊。Reddit 的 agent builder 社群把它列為 2026 上半年最痛的議題：每次新 session 都要燒 token 重新「認識」repo，agents waste expensive output tokens narrating their way back into repo context。Claude Agent SDK 的 auto-compact 解決了單一 Agent 的 context overflow 問題（接近上限時自動摘要），但跨 Agent 的 context 交接還是 harness 層要解的事，下一節會深入講。

組件四：Verification Layer（驗收層）

缺了這層，整套 harness 就是個更貴的 Cursor。Verification 不是「最後跑一次 test」這麼簡單，它要在 每個 commit 之前、每個 subtask 完成時、每個 Agent handoff 點 都觸發。AWS Prescriptive Guidance 把這個叫 evaluator reflect-refine loop——一個獨立 Agent 跑 verification，跟寫程式的 Agent 用不同的 prompt、不同的 context、有權力把工作打回票。

工程實作建議

永遠不要讓同一個 Agent 同時負責「寫程式」跟「驗證自己寫的程式」。LLM 有強烈的 confirmation bias，會把自己剛寫的 code 評為正確。Verification 必須用獨立 subagent、獨立 system prompt、可獨立失敗。

端到端 Pipeline 的六個階段：從 spec 到驗收歸檔

端到端自動化開發 pipeline 六階段示意從需求到驗收歸檔

好，把四個組件拼起來，端到端 pipeline 長什麼樣？我們在實作客製化系統交付時，把 pipeline 拆成六個階段。每個階段都有獨立的 Agent，有明確的 input / output contract，失敗會落到下一輪。

圖表載入中…

階段一：Spec Agent — 需求收斂

Input 是 PM 寫的（往往不完整的）PRD 或一張 ticket。這個 Agent 的工作是把模糊需求變成 BDD 格式的 acceptance criteria（Given/When/Then），並 spawn 出每個 user story 對應的測試規格。TestQuality 把這個叫 Plan-Act-Verify Reasoning Loop，重點是讓 Agent 把「acceptance criteria 是否完整」也列為驗收項目。輸出寫到 `docs/specs/.md`，下一個 Agent 從這裡接手。

階段二：Architect Agent — 任務拆解

讀 spec，產出 task graph：每個任務的 input file、output file、相依關係、優先順序。實戰經驗：task 越小越好，每個任務都應該是「可以獨立驗收的最小單位」，最好 30 分鐘以內能跑完。把工作切成小 subtask、實作一個、驗證一個、commit 一個，再進下一個 — 每個 commit 是 known-good checkpoint，失敗 blast radius 只有一個 subtask。

階段三：Coder Agent — 實作

拿到單一 task 的 spec，寫 code，跑 unit test，commit。這裡的關鍵設計是 permission_mode：development 時開 acceptEdits，CI 時開 bypassPermissions，永遠不要在 production 操作 secret 的場景開 bypass。權限粒度可以細到 tool 等級（allowed_tools 白名單）。

階段四：Test Agent — 驗證

跑完整 test suite，並且補 Coder Agent 沒覆蓋的 edge case test。這是兩個獨立工作：一是 regression（跑既有測試）；二是 mutation testing 風格的補 case，由 Test Agent 主動找出沒覆蓋到的分支。失敗會把任務踢回 Coder Agent，附上 stack trace 跟假設原因。

階段五：Reviewer Agent — Code Review

獨立 context、獨立 prompt、明確的 checklist：security 紅線、效能 anti-pattern、命名、是否符合 spec。一旦不通過，整個 task 回到 Coder。AddyOsmani 的 self-improving agents 文章強調這一點：「The validator is a separate agent with a separate prompt, separate context, and explicit permission to fail the work.」

階段六：Archive Agent — 歸檔與交接

把整個 task 的 spec、code diff、test result、review comment、context 摘要寫成 handoff report，存到 `.claude/reports/handoff/-.md`。下一輪相關任務啟動時，新 Agent 從這裡讀歷史，避免重新發現整個 repo。這個檔案就是 multi-Agent 的「記憶」。

ℹ️Pipeline 觸發方式

整條 pipeline 可以被三種來源觸發：(1) cron 排程定時跑（夜間 backlog batch）、(2) git webhook（PR 開啟自動跑 review Agent）、(3) 另一個 Agent 主動 spawn。Claude Agent SDK 的 query() 是 async，包進你家的 worker pool 就能水平擴展。

多 Agent 上下文交接：file-based 還是 in-memory，怎麼選

多 Agent 上下文交接 file-based handoff 工程實作示意

前一節提到「Archive Agent 寫 handoff report」，這就是多 Agent 上下文交接的具體做法。但為什麼是 file-based，不是把 context 在 memory 裡直接傳給下一個 Agent？這是 harness 層最重要的工程選擇之一。

面向	In-memory handoff	File-based handoff（推薦）
實作方式	把上一個 Agent 的 transcript 塞進下一個 prompt	上一個 Agent 寫檔，下一個 Agent 讀檔
Token 成本	高（前一個 Agent 的所有對話都進 context）	低（只讀摘要後的 report）
Context 污染	嚴重（不相關的 tool call 都跟進來）	低（只留結論性資訊）
可追溯性	差（runtime 結束就沒了）	好（檔案可以 commit 進 repo）
Debug 難度	高	低（直接打開檔案看）
適用場景	極短鏈、即時對話	production pipeline、長鏈、批次

Claude Agent SDK 官方文件講得非常直接：「Subagent 的 context 從零開始，唯一從 parent 到 subagent 的通道是 Agent tool 的 prompt 字串」。換句話說，subagent 不知道 parent 在幹嘛 — 你必須在啟動 subagent 的 prompt 裡，明確告訴它要讀哪個檔、有什麼錯誤、做過什麼決定。

這是設計上的選擇，不是 bug。Context 隔離是 multi-Agent 系統能 scale 的前提。如果 parent 跟 subagent 共用 context，spawn 三個 subagent 就會撐爆。隔離後，主 Agent 只看到 subagent 回報的結論（幾百字），不會被它的 tool 操作細節（幾萬字）污染。

Handoff report 的最小可用模板：（存成 `.claude/reports/handoff/-.md`）

Markdown

# Handoff: user-export-endpoint → reviewer
date: 2026-05-17
from: Coder Agent (claude-sonnet-4-6)
to: Reviewer Agent
status: ready-for-review

## What was implemented
- POST /api/users/export endpoint (src/api/users.ts:120-180)
- New service UserExportService (src/services/export.ts)
- Background job via BullMQ for >10k rows

## Test results
- pytest: 47 passed, 0 failed
- Coverage: 92% (target: 90%)
- E2E: user-export.spec.ts passes

## Decisions made (need reviewer to confirm)
- Used streaming JSON not full buffer (memory ceiling: 256MB)
- Rate limit at 5 req/min/user (spec didn't specify, defaulted to existing pattern)

## Known open questions
- Should the export include soft-deleted rows? (spec ambiguous, defaulted to NO)
- CSV vs JSON format? (implemented both, default JSON via Accept header)

## Files touched
src/api/users.ts (+58 -2)
src/services/export.ts (new, 220 lines)
tests/api/user-export.spec.ts (new, 180 lines)
docs/api.md (+12 -0)

## Next Agent should
1. Run `pnpm test:integration` before approving
2. Verify the rate limit assumption against PRD section 3.2
3. Check security: PII fields in export are masked

看起來很「人類」是吧？這就是重點。Handoff report 不只是 Agent 之間用，工程主管半夜被 page 時也能用同一份檔案知道 pipeline 卡在哪。File-based 的好處不只是 token 便宜，是把 Agent 的決策變成可追溯的工程文件。

想看更深的多 Agent 通訊理論基礎，可以參考 FIPA ACL 完整解析：multi-agent 系統的通訊標準；想看視覺化監控 subagent 的工具，可以看 Claude Code Agent View subagent 監控介面教學。

自動驗收層怎麼做：把驗收條件變成可執行測試

驗收層是整條 pipeline 的守門員。設計不好，Agent 會用一堆「看起來在做事」的 tool call 騙過你，最後產出一坨跑得動但功能不對的 code。業界踩過這個坑：一個 Agent 把 spec 中的「使用者刪除後 30 天內可復原」實作成「刪除後 30 秒可復原」，所有 unit test 都過了，因為它自己寫的 test 也只測 30 秒。

教訓很直接：acceptance criteria 必須跟測試分開，由不同的人（或 Agent）產生。

驗收層的三道關卡

關卡一：Spec-to-test mapping。Spec Agent 階段就把每條 acceptance criterion 對應到一個或多個自動化測試 ID（例：AC-3.2 → tests/user-export/test_soft_delete_window.py::test_30_day_window）。Coder Agent 在實作時，被禁止 modify 這些 test，只能加 fixture。
關卡二：Reviewer Agent 的獨立 checklist。它不看 code，只看「spec 中的每條 AC 是否有 passing test」「test 是否真的測 AC 描述的條件」「有沒有 trivially passing test（assert True 之類）」。
關卡三：Production-like smoke test。在 staging 環境跑一輪真實流量 replay。43% AI 生成 code 在 production 翻車就是因為跳過這層。即使 unit test 100% 過，沒在類 production 環境驗過就不能放行。

Spec 與 test mapping 的最小格式：（YAML 或 JSON 都可以，這裡用 YAML 範例）

YAML

feature: user-export
acceptance_criteria:
  - id: AC-1
    given: 一個 admin 使用者
    when: 呼叫 POST /api/users/export
    then: 回傳 202 + job_id
    tests:
      - tests/api/test_export.py::test_admin_can_trigger_export
      - tests/api/test_export.py::test_returns_job_id

  - id: AC-2
    given: 一個 non-admin 使用者
    when: 呼叫 POST /api/users/export
    then: 回傳 403
    tests:
      - tests/api/test_export.py::test_non_admin_forbidden

  - id: AC-3.2
    given: 一個被 soft-delete 的使用者
    when: 在刪除後 30 天內呼叫 GET /api/users/restore
    then: 使用者狀態恢復為 active
    tests:
      - tests/api/test_export.py::test_30_day_restore_window
    forbidden_to_change_by_coder: true

最後一個欄位 `forbidden_to_change_by_coder: true` 是 harness 層用權限機制守住的 — Coder Agent 的 allowed_tools 在跑這個 feature 時，Edit 工具被 hook 過濾，碰到這個 spec 列出的 test file 直接拒絕修改。簡單但極其有效。

⚠️Agent 騙人的常見模式

(1) 加 try/except 把錯誤吞掉，讓測試「通過」；(2) 把 test assertion 改寬鬆（assertEqual → assertIsNotNone）；(3) 跳過 failing test 加 @skip decorator；(4) 直接 mock 掉真正在驗證的那個函數。Reviewer Agent 的 checklist 必須明確列這四個模式，並要求 grep 整個 diff 查證。

最小可運作 Pipeline 的程式碼骨架

理論講完，直接給你一段能跑的骨架。這是一個用 Claude Agent SDK 串起來的四段 pipeline：spec → code → test → review。把它當起點，往上加 Architect / Archive 階段就是 production-ready 的版本。

Python

import asyncio
from pathlib import Path
from claude_agent_sdk import query, ClaudeAgentOptions

WORKSPACE = Path("/workspace/my-project")
REPORTS = WORKSPACE / ".claude/reports/handoff"
REPORTS.mkdir(parents=True, exist_ok=True)

def make_options(role: str, allowed_tools: list[str], permission="acceptEdits"):
    return ClaudeAgentOptions(
        system_prompt=Path(f"prompts/{role}.md").read_text(),
        allowed_tools=allowed_tools,
        permission_mode=permission,
        cwd=str(WORKSPACE),
    )

async def run_agent(role: str, prompt: str, options: ClaudeAgentOptions) -> str:
    """Run an agent and return the final text result."""
    final_text = ""
    async for msg in query(prompt=prompt, options=options):
        if msg.type == "result":
            final_text = msg.text
    return final_text

async def pipeline(feature: str, prd_path: str):
    # Stage 1: Spec Agent — turn PRD into BDD acceptance criteria
    spec_prompt = f"Read {prd_path}. Output BDD acceptance criteria to docs/specs/{feature}.yaml"
    spec = await run_agent(
        "spec",
        spec_prompt,
        make_options("spec", ["Read", "Write", "Grep"]),
    )

    # Stage 2: Coder Agent — implement
    code_prompt = (
        f"Read docs/specs/{feature}.yaml. Implement, run pytest, commit when green. "
        f"Do NOT modify any test file marked forbidden_to_change_by_coder."
    )
    coded = await run_agent(
        "coder",
        code_prompt,
        make_options("coder", ["Read", "Write", "Edit", "Bash", "Grep", "Glob"]),
    )

    # Stage 3: Test Agent — verify and add edge cases
    test_prompt = (
        f"Run full test suite. For each AC in docs/specs/{feature}.yaml, "
        f"verify at least one test exists and actually asserts the described behaviour. "
        f"Add edge case tests for uncovered branches. Output verification report."
    )
    tested = await run_agent(
        "tester",
        test_prompt,
        make_options("tester", ["Read", "Write", "Bash", "Grep"]),
    )

    # Stage 4: Reviewer Agent — independent code review
    review_prompt = (
        f"Independent code review for feature {feature}. "
        f"Check for: silent except, weakened assertions, skipped tests, mocked-out logic. "
        f"Reject and write rejection report if any found, else write approval to "
        f".claude/reports/handoff/{feature}-approved.md"
    )
    reviewed = await run_agent(
        "reviewer",
        review_prompt,
        make_options("reviewer", ["Read", "Grep", "Glob", "Bash", "Write"], "ask"),
    )

    return reviewed

if __name__ == "__main__":
    asyncio.run(pipeline("user-export", "docs/prd-user-export.md"))

幾個重點要看懂：

每個 Agent 有獨立的 system prompt（放在 `prompts/.md`），這是把 Agent 「specialized」的關鍵，不要全部塞同一個 prompt。
每個 Agent 有不同的 allowed_tools，例如 Reviewer 沒有 Edit 工具，物理上不可能改 code。
Reviewer 用 permission="ask" 而非 acceptEdits — 任何寫入動作都會擋下來等人類確認，這是 production safety 的最後一道防線。
`forbidden_to_change_by_coder` 透過 in-process hook 實作 — Coder 嘗試 Edit 那些 test file 時，hook 直接 deny。Claude Agent SDK 1.0+ 的 hooks 系統原生支援。

六個常見的坑：工程陷阱與資安紅線

下面這些都是業界在實際接案中踩過的坑。寫出來給你繞開。

坑一：把 secret 直接暴露給 Bash 工具

新手最容易犯的錯。Agent 一旦能跑 bash，預設就能讀 `.env`、能跑 `aws s3` 用你的 credential。務必用 permission hooks 把敏感檔案 blocklist，或者整個 process 用降權使用者跑。

我們專門寫過一篇 別讓 Claude Code 看到你的 .env：四道防線完整守住敏感檔案 講這部分，連 hook 設定範例都有，自建 harness 一定要看。

坑二：context 不隔離導致整條 pipeline 失憶

把所有階段的對話塞在同一個 session，跑到第三 stage 就會 hit context limit，前面的決策被 auto-compact 摘掉，後面 Agent 開始亂做。正解：每個 stage spawn 獨立 Agent，handoff 走 file。前一節給的骨架就是這樣設計。

坑三：silent failure — Agent 在「忙著」但什麼都沒做

Reddit Agent builder 社群把這個列為日常痛：「silent failures — agents burn tokens without producing results」。Agent 可能跑了 50 個 tool call，最後產出零 commit。Harness 必須有 progress checkpoint：每 N 個 tool call、每 M 分鐘，強制要求 Agent 輸出「目前進度 / 預計剩餘步驟」，沒輸出就 abort。

坑四：成本失控

Multi-agent 燒 token 的速度是 single agent 的 5-10 倍。Pockit 的 2026 比較數據顯示，同一個「research and summarize」任務，AutoGen 平均吃 8000 tokens，LangGraph 只要 2000。Pipeline 上線前一定要在 staging 跑壓力測試，用 Anthropic 的 usage tracking API 算出 per-task 成本，找老闆批預算才不會被 30 倍超支砸頭。

坑五：沒有 rollback 機制

Agent commit 了一段壞 code，然後又跑了 20 個 task 在它上面。發現問題時整個 git history 一團糟。每個階段必須是 atomic commit，且每個 commit 都標 tag (例：pipeline/stage-3/user-export/attempt-2)。失敗就 git reset 到上一個 tag 重來，不要用 fixup commit 蓋。

坑六：人類驗收環節被自己嫌煩然後跳過

Pipeline 跑順之後最大的誘惑是「Reviewer Agent 我都同意，乾脆自動 merge 吧」。這是把公司營運押在 LLM 一張臉的舉動。我們的硬規定是：production-bound code 永遠有人類最後 approve，且 PR 必須附完整 handoff report 鏈，這條鏈缺一段就不能 merge。

🚨資安紅線一次說清

(1) Agent 不能跑在 prod credential 環境；(2) sensitive file (.env / secrets / *.key) 永遠在 hook blocklist；(3) external API 呼叫走代理伺服器、所有外連 log 起來；(4) permission_mode=bypassPermissions 永遠不上 production；(5) Agent 寫的程式碼進 main branch 前必須通過 SAST 掃描。這五條不能談判。

工程主管的決策框架：自建 vs 採購 vs 接案

這節給技術主管。前面講了這麼多工程細節，最後一個現實問題是：你公司該不該投資建這套 pipeline？答案不總是「該」。

情境	自建 Agent harness	採購現成 SaaS	找外包接案開發
適合公司規模	工程團隊 10 人以上	5 人以下、不想養工程	5-15 人、要客製但缺人
前期投入	高（3-6 個月、2-4 位工程師）	低（按月訂閱）	中（一次性 80-300 萬）
客製化程度	完全自主	受限於 vendor	高（合約定義）
資料主權	完全在自家	看 vendor 政策	需要合約紅線
失敗風險	中（自己人，可調整）	低（vendor 已驗過）	中高（要看廠商實力）
六個月 ROI	成功者平均 171%（DigitalApplied 2026 統計）	通常 6-12 個月見效	需驗收條款明確

做選擇前，先回答三個問題：

你的核心 workflow 是不是「可以被規格化的重複任務」？ 如果是 — 例如「每週把 50 個客服 ticket 分類派工」、「每天從 5 個資料源更新一份報表」 — 自建或外包都值得；如果工作高度創意、規格無法明確寫，先別碰 Agent pipeline。
你有沒有人能維護？ Harness 不是裝完就跑十年的系統，模型更新、tool 改版、prompt 漂移，都需要工程師持續調整。沒人維護就採購 SaaS，把維護外包給 vendor。
失敗成本能不能承受？ 如果是內部工具、出包頂多重跑，自建學習成本值得；如果是面對客戶的關鍵流程，先用 SaaS 驗證價值，確認有效再自建。

想看完整的採購決策框架（包括報價區間、合約紅線、廠商評估），可以搭配閱讀 AI Agent 系統採購完整框架：老闆視角的 Workflow vs Agent 判斷。

我們團隊的實際做法是：自己用 Claude Agent SDK 建內部 harness（給工程團隊跑 daily backlog、客服分派、文件歸檔），同時把這套經驗變成客戶接案的核心能力。如果你的公司也想沿這條路走，可以從一個「重複性最高、痛點最明顯、失敗成本最低」的內部工作流開始，例如夜間自動 dependency update 跟 changelog 生成。三個月驗證可行，再往生產級任務擴展。

找專業團隊一起建 Agent pipeline，可以直接聯繫我們的 客製化 AI 系統開發服務。我們我們站在 Claude Agent SDK 這個工程平台上，幫你把 pipeline 從 design doc 跑到 production。

常見問題

QClaude Agent SDK 跟 Claude Code 是同一個東西嗎？

不是同一個但共用底層。Claude Code 是 Anthropic 出的終端工具（CLI），給工程師互動式使用；Claude Agent SDK 是把 Claude Code 背後的同一套 agent loop、tool layer、context engine 開放出來給開發者用 Python / TypeScript 程式化呼叫的函式庫。你可以把 SDK 想成「Claude Code 的引擎」，CLI 是其中一種介面。SDK 在 2025 年 9 月從 Claude Code SDK 改名為 Claude Agent SDK，明確定位成通用 agent 開發平台，不只是寫程式用。

Q為什麼要用 file-based handoff 而不是直接把上一個 Agent 的記憶傳給下一個？

三個原因：(1) 成本 — in-memory 把整段 transcript 塞進下一個 Agent 的 context，token 燒得超快；(2) 隔離 — Agent 多了之後共享 context 會互相污染，subagent 看到 parent 的不相關 tool call 會誤判；(3) 可追溯 — 檔案能 commit 進 repo、能被人類 review、能跨 session 復用，runtime memory 結束就沒了。Claude Agent SDK 官方設計就是「subagent context 從零開始，唯一通道是啟動 prompt」，這不是限制是刻意的工程選擇。

Q自建 Agent harness 要多久？需要幾個人？

從零到 production-ready 約 3-6 個月，2-4 位資深工程師。第一個月做 PoC（單一 pipeline 跑通）、第二三個月做四大組件（loop / tool / context / verification）、第四個月做監控與 fallback、第五六個月做 production hardening 與成本優化。如果只想跑一個內部工具（例如夜間 backlog），可以縮到 4-6 週、1-2 位工程師。重點是先選一個「規格化、失敗成本低、痛點明顯」的工作流當起點，不要一開始就想做 universal agent。

Q如果我已經在用 LangGraph / CrewAI / AutoGen，還要換成 Claude Agent SDK 嗎？

看你目前痛點。LangGraph 強在 stateful graph 跟 checkpointing，CrewAI 強在角色化的 multi-agent，AutoGen 強在 conversation 風格的協作。Claude Agent SDK 的差異化在於：(1) 跟 Claude 模型深度整合，agent loop 是 Claude 自己驅動不是你寫；(2) 內建 20+ 個工具，包含完整檔案操作；(3) subagent / hooks / MCP 都是一級公民。如果你的 stack 已經圍繞另一個框架穩定運作，沒必要硬換；如果你正在從零選型，且主要場景是 coding agent / 開發工作流，Claude Agent SDK 起步成本是最低的。

QAgent 寫的 code 進 production 安全嗎？

Verification 層做對了就安全，做錯了會出大事。最低底線是：(1) 每個 commit 在 staging 跑類 production 流量；(2) Reviewer Agent 跟 Coder Agent 完全隔離（不同 prompt / context / 模型版本）；(3) main branch 永遠需要人類最後 approve；(4) production 環境的 secret 永遠不暴露給 Agent process；(5) 整條 pipeline 的 commit 都用 tag 標記，失敗能 git reset 回退。VentureBeat 2026 年的調查顯示 43% AI 生成 code 在 production 翻車，幾乎全部是跳過 staging smoke test 的案例。

QAgent pipeline 跑壞了怎麼 debug？

三層 debug 策略：(1) 看 handoff report — 因為每個 stage 都會寫檔，先去看哪一段 report 不對勁；(2) 看 SDK 的 messages stream — 用 verbose log 模式把每個 tool call 跟 result 印出來；(3) Re-run with seed — Claude 模型支援 temperature=0 跟 fixed prompt，把出問題的 stage 在隔離環境用同樣 input 重跑，看是不是 deterministic 出錯。最常見的問題是 Agent 在某個 tool call 後做了反直覺決定，這時直接拿那個決策點的 prompt 跟 context 去人工 trace，通常 30 分鐘內能定位。

從一個小 pipeline 開始，三個月內看見成果

讀到這裡你應該已經理解：Agent harness 不是一個套件，是一套工程設計。Claude Agent SDK 給了你引擎，但 loop 怎麼接、context 怎麼交、驗收怎麼守、資安怎麼分層，全部是工程主管要做的決策。

把這篇文章存起來，下次團隊討論「要不要建 Agent pipeline」時當共同語言。如果你想要一個真的能跑、跟你既有 stack 整合、能扛住資安稽核的 production-grade harness，我們做的事就是這個。

下一步

預約 30 分鐘技術討論，我們會根據你的工程團隊規模、現有 CI/CD、痛點工作流，給你一份「自建 vs 採購 vs 外包」的客製化建議書。沒有銷售簡報，只談工程：預約諮詢

想看更多技術深度的 Agent 與自動化主題，可以順路看 Claude Code /loop 完整教學：把 Claude 變成定時自動執行 agent、Multi-Agent Debate 架構設計與實作指南，這兩篇跟本文形成完整的 Agent 工程知識鏈。

分享文章

自

AUTHOR

自由揚John

查看作者頁

留言(0)

尚無留言，成為第一個留言的人吧！

SERVICES

GET IN TOUCH

需要網站系統架設或軟體開發？

無論是品牌官網、客製化系統還是應用程式，我們的團隊擁有豐富經驗，歡迎聯繫我們，讓專業為您的事業加分。

免費諮詢看我們做過的案例 →