Before you build anything, answer one question
What specific workflow are you trying to eliminate, accelerate, or scale?
If your answer starts with "we want AI that can..." — you're not ready. Successful first agents solve a specific, measurable problem. "Reduce the 3-hour daily research workflow to 20 minutes" is a good brief. "Use AI to be more productive" is not.
We refuse engagements that start with technology choices instead of workflow choices. Every agent we've seen fail in production failed for the same reason: scope was too broad, and nobody could tell whether it was working.
The minimum viable agent architecture
Every production-grade agent has the same five components:
- A reasoning model — Claude, GPT, or Gemini for most use cases. This is the brain.
- A tool registry — a list of things the agent is allowed to do (query a database, call an API, draft an email, run a shell command).
- A memory layer — short-term context for the current run, optionally long-term memory across runs.
- Guardrails — budget caps, permission tiers, approval checkpoints, and denylists that prevent the agent from going off-rails.
- Observability — logs of every decision, for auditing and debugging.
If your build doesn't have all five of these, it's not an agent — it's a prompt with delusions.
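The five components above can be sketched as a single structure. This is a minimal illustration, not a prescribed design; all names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    """Hypothetical container for the five components of a minimum viable agent."""
    model: str                                                   # reasoning model (the brain)
    tools: dict[str, Callable] = field(default_factory=dict)     # tool registry: allowed actions
    memory: list[str] = field(default_factory=list)              # short-term context for this run
    guardrails: dict[str, float] = field(default_factory=dict)   # hard caps and limits
    log: list[dict] = field(default_factory=list)                # observability: every decision

agent = Agent(model="claude", guardrails={"budget_usd_per_run": 2.00})
```

If any of the five fields would be empty in your design, that's the gap to close before shipping.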
Choosing the reasoning model
For most first agents, Claude from Anthropic is the right default. Its instruction-following is currently the best for multi-step work, and its refusal behavior is sensible for business contexts. GPT is close behind and sometimes better for specific structured output tasks. Gemini wins on massive-context work.
Don't lock into one model. Abstract the model call so you can swap providers as capabilities change. We've seen teams burn months re-engineering after a vendor change that could have been a one-line swap with the right abstraction.
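One way to get that one-line swap is a thin provider registry: each backend is a plain function behind a shared signature, and the rest of the agent only ever calls the abstraction. The stub backends below stand in for real SDK calls (anthropic, openai, etc.); the function names are illustrative.

```python
from typing import Callable

# Registry of model backends, keyed by provider name.
_PROVIDERS: dict[str, Callable[[str], str]] = {}

def register(name: str, fn: Callable[[str], str]) -> None:
    """Register a backend under a provider name."""
    _PROVIDERS[name] = fn

def complete(prompt: str, provider: str = "claude") -> str:
    """The only model call the rest of the agent sees.
    Switching vendors is a one-line change to the default."""
    return _PROVIDERS[provider](prompt)

# Stand-ins for real SDK calls.
register("claude", lambda p: f"[claude] {p}")
register("gpt", lambda p: f"[gpt] {p}")
```

When a vendor changes, you re-register one function instead of hunting call sites across the codebase.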
Tool registry: keep it small
The temptation is to give the agent access to everything. Resist. Each tool you add is more surface area for the agent to do something dumb.
Your first agent should have 3-8 tools, max. Typical starter set:
- Search (over your knowledge base or the web, depending on workflow)
- Read from one or two data sources (your CRM, your ticket system, etc.)
- Draft output (email, summary, report)
- Escalate / flag for human review
- Log / record what happened
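A small registry makes the "3-8 tools, max" rule enforceable in code: the agent can only call names that were explicitly registered, so every new capability is a deliberate decision. The tool bodies here are placeholders for real integrations.

```python
TOOLS = {}

def tool(fn):
    """Decorator: register a function as an agent-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search(query: str) -> str:
    return f"results for {query!r}"           # stand-in for a real search call

@tool
def escalate(reason: str) -> str:
    return f"flagged for human review: {reason}"

def call_tool(name: str, **kwargs):
    """The only path from the agent to any tool."""
    if name not in TOOLS:
        raise PermissionError(f"tool {name!r} is not in the registry")
    return TOOLS[name](**kwargs)
```

Anything not in `TOOLS` simply cannot be invoked, which is the point.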
Guardrails are the hard part
Most teams underestimate guardrails. They build a demo that works 80% of the time, ship it, and then spend three months firefighting the other 20%. Front-load the guardrails and you save that time.
- Budget caps — dollar amount per run, per day, per week. Hard limits, not soft warnings.
- Permission tiers — the agent can read customer data but not write; can propose actions but not execute high-stakes ones.
- Approval checkpoints — before the agent takes action in specific categories (sending money, public communications, irreversible changes), a human signs off.
- Denylists — topics, tools, or actions the agent is explicitly prohibited from touching.
- Kill switch — one toggle that halts the agent immediately.
- Rate limits — prevent a runaway agent from making 10,000 API calls in a minute.
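Three of the guardrails above (budget cap, kill switch, rate limit) can be combined into one check that runs before every agent action. This is a simplified sketch, assuming costs are known per call; a real build would persist spend and expose the kill switch outside the process.

```python
import time

class Guardrails:
    """Hard limits checked before every action. Failures raise; they don't warn."""

    def __init__(self, budget_usd: float, max_calls_per_min: int):
        self.budget_usd = budget_usd
        self.spent = 0.0
        self.max_calls_per_min = max_calls_per_min
        self.calls: list[float] = []   # timestamps of recent calls
        self.killed = False            # the kill switch: one toggle, immediate halt

    def check(self, cost_usd: float) -> None:
        if self.killed:
            raise RuntimeError("kill switch engaged")
        if self.spent + cost_usd > self.budget_usd:
            raise RuntimeError("budget cap exceeded")      # hard limit, not a soft warning
        now = time.monotonic()
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max_calls_per_min:
            raise RuntimeError("rate limit exceeded")      # stops a runaway loop
        self.calls.append(now)
        self.spent += cost_usd
```

The agent loop calls `check()` before each tool or model call; if it raises, the run stops.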
The guardrail test
For every tool you expose, ask: what's the worst thing the agent could do with it? If you can't accept that outcome in production, the guardrail isn't tight enough.
Observability: log everything
Every decision the agent makes should be logged with timestamp, reasoning, tools called, data accessed, and output generated. This isn't a nice-to-have — it's the only way to debug, improve, or defend the agent.
Think of it as black-box recording for your agent. When something goes wrong (and something will), you need to be able to replay exactly what the agent saw, thought, and did.
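A log entry per decision can be as simple as an appended dict with the fields listed above. The schema here is a hypothetical minimum; in production you'd write to durable storage rather than an in-memory list.

```python
import datetime

def log_decision(log: list, reasoning: str, tools_called: list, output: str) -> None:
    """Append one replayable record: what the agent thought, called, and produced."""
    log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "reasoning": reasoning,        # why the agent chose this action
        "tools_called": tools_called,  # which tools ran, in order
        "output": output,              # what was generated
    })

run_log: list[dict] = []
log_decision(run_log, "ticket mentions refund; searching policy docs",
             ["search"], "drafted reply citing refund policy")
```

Replaying a failure then means reading the log in order, not reconstructing state from memory.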
A realistic timeline
- Weeks 1-2: Scoping. Shadow the workflow, document edge cases, define success metrics.
- Weeks 3-6: Build + test. Implement the agent, run against historical data, iterate on prompts.
- Weeks 7-8: Pilot with mandatory human review. Every output reviewed before it ships.
- Weeks 9-12: Gradual rollout. Expand automation surface as accuracy proves out on real data.
- Ongoing: Monthly tuning, quarterly audits, annual model re-evaluation.
What you'll need from your team
- An executive sponsor who owns the outcome (not the technology)
- A subject-matter expert who deeply understands the workflow being automated
- An IT or security contact who can approve data access and tool integrations
- An ops lead who owns the human-review loop during pilot
Five common failure modes
- Scope too broad — "automate all of sales" fails; "automate the 20-minute research step before each outbound call" works.
- Skipping pilot review — you'll miss edge cases and lose team trust the first time the agent does something unexpected.
- No measurement baseline — without data on how long the workflow used to take, you can't prove the agent helped.
- Treating AI output as ground truth — it's a draft, not a deliverable. Build a human approval step for anything that matters.
- Not planning for model drift — models change. Test your agent against a gold-standard data set monthly and compare to baseline.
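The monthly drift test in the last point can be automated: run the agent over a fixed gold-standard set and flag when accuracy falls below baseline minus a tolerance. The gold set and tolerance below are purely illustrative.

```python
def drift_check(agent_fn, gold_set, baseline_accuracy, tolerance=0.05):
    """Compare agent accuracy on a fixed gold set against a recorded baseline.

    Returns (accuracy, ok) where ok is False when accuracy has drifted
    more than `tolerance` below the baseline.
    """
    correct = sum(1 for prompt, expected in gold_set
                  if agent_fn(prompt) == expected)
    accuracy = correct / len(gold_set)
    return accuracy, accuracy >= baseline_accuracy - tolerance

# Illustrative gold set and a stub agent that answers from a lookup table.
gold = [("2+2", "4"), ("capital of France", "Paris")]
answers = {"2+2": "4", "capital of France": "Paris"}
acc, ok = drift_check(lambda p: answers.get(p, ""), gold, baseline_accuracy=0.95)
```

Run it on a schedule; a failing `ok` is the signal to re-tune prompts or re-evaluate the model.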
What comes after the first agent
Once the first agent is live and producing measurable ROI, the natural next steps are: scaling the same agent to adjacent workflows, adding new agents for related problems, and eventually composing agents into multi-agent systems where one agent hands off to another.
We cover multi-agent systems in a separate guide. For now: nail your first agent before you dream about the second. Most businesses that skip this step end up with a graveyard of half-finished agents that nobody trusts.