AI Agent Skills: Real Token Costs and 4 Failure Patterns From 17 Production Skills

The pitch for AI agent skills is simple: write the instructions once, load them when needed, save tokens compared to repeating prompts. The reality is messier. Savings depend on how you structure skills, how often they fire, and whether your tooling auto-loads them when they shouldn't.

Over six months of running 17 skills in a production Django REST Framework healthcare codebase, the token math worked out well for about 13 of them. The other four cost more than they saved. Here are the real numbers, the four patterns that failed, and the honest case for when a plain prompt is the better call.

What AI Agent Skills Actually Cost in Tokens

The code review skill is the lightest in the set. At 112 lines of Markdown, it loads at ~790 tokens and covers secrets scanning, PHI in log calls, naming conventions, import order, test coverage, migration safety, and output format. It loads once per session, no matter how many files you review.

A comparable inline checklist runs 300-500 tokens per request. But the comparison isn't 790 vs. 300. It's "790 once" vs. "300 every request, abbreviated differently each time, with checks dropped under deadline pressure."

Review five files in one session: the inline approach repeats five times (1,500-2,500 tokens total), each repetition slightly shorter than the last. The skill loads once and runs identically on all five.

Approach	Tokens	Consistency
Inline prompt	~300-500 per request + the diff	Drifts across sessions
SKILL.md	~790 once per session + the diff	Same checklist every time
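The arithmetic behind the table can be sketched in a few lines. This assumes, as described above, that the skill loads once per session and that the diff costs the same either way, so it is left out. The function names are illustrative.

```python
import math

SKILL_TOKENS = 790          # one-time load of the code review skill (from the text)
INLINE_RANGE = (300, 500)   # inline checklist tokens per request (range from the text)


def inline_total(per_request: int, requests: int) -> int:
    """Tokens spent repeating an inline checklist on every request."""
    return per_request * requests


def break_even_requests(skill_tokens: int, per_request: int) -> int:
    """Smallest request count at which the inline total exceeds the one-time skill load."""
    return math.floor(skill_tokens / per_request) + 1


for requests in (1, 2, 3, 5):
    low, high = (inline_total(p, requests) for p in INLINE_RANGE)
    print(f"{requests} request(s): inline {low}-{high} tokens vs skill {SKILL_TOKENS}")

print("break-even at", break_even_requests(SKILL_TOKENS, 500), "requests (500-token checklist)")
print("break-even at", break_even_requests(SKILL_TOKENS, 300), "requests (300-token checklist)")
```

On these numbers the skill pays for itself by the second or third request in a session, before counting the consistency benefit.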

The real return isn't the arithmetic. It's that the checklist runs complete every time, with no drift and no forgetting. That consistency is how the skill caught a silent data loss bug in a medical chat system that a hastily abbreviated prompt would have missed.

On the heavier end, HIPAA-Guardian loads at roughly 2,050 tokens across 253 lines. That's expensive enough to feel in the context window, which is why it only fires on explicit invocation or when context clearly calls for a PHI scan, never on routine code changes. The architecture follows progressive disclosure (as documented in the Claude Code skills overview): the core workflow remains in SKILL.md, while detailed reference material (HIPAA Safe Harbor identifiers, risk-scoring methodology) lives in separate references/ files loaded only when needed. Without that split, the skill would exceed 4,000 tokens. Nobody should load that for a routine commit.
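To see what the split buys, it helps to compare the always-loaded core with the cost of inlining every reference file. The sketch below assumes a skill directory containing SKILL.md plus a references/ folder of Markdown files. `rough_count` is a stand-in word counter, not a tokenizer, so pass a real token counter for real numbers.

```python
from pathlib import Path
from typing import Callable


def rough_count(text: str) -> int:
    """Stand-in counter (whitespace-separated words). Swap in a real tokenizer."""
    return len(text.split())


def disclosure_costs(skill_dir: Path, count: Callable[[str], int] = rough_count) -> dict:
    """Compare the core file's cost with the cost of inlining every reference file."""
    core = count((skill_dir / "SKILL.md").read_text(encoding="utf-8"))
    references = {
        path.name: count(path.read_text(encoding="utf-8"))
        for path in sorted((skill_dir / "references").glob("*.md"))
    }
    return {
        "core": core,                                   # paid on every load
        "references": references,                       # paid only when a file is read
        "if_inlined": core + sum(references.values()),  # cost without the split
    }
```

With tiktoken installed, something like `disclosure_costs(path, lambda t: len(enc.encode(t)))` would give token counts instead of word counts.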

The sweet spot across our library: under 150 lines for skills that might auto-load in Claude Code, and up to 250 lines for skills that only fire on explicit invocation.

Measure Your Own Skills

Every figure above came from running this against the actual SKILL.md files. Run it on yours for directly comparable numbers:

python

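import tiktoken

# cl100k_base is the encoding used for the figures in this article.
enc = tiktoken.get_encoding("cl100k_base")

# Read as UTF-8 explicitly so the count does not depend on the platform default.
with open("SKILL.md", encoding="utf-8") as f:
    content = f.read()

tokens = len(enc.encode(content))
lines = len(content.splitlines())
# max(lines, 1) avoids a ZeroDivisionError on an empty file.
print(f"{tokens} tokens, {lines} lines, {tokens / max(lines, 1):.2f} tokens/line")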

4 Failure Patterns That Cost More Than They Save

Token math only works if your skills fire at the right time. These four patterns broke that assumption.

1. The Skill That Loaded on Everything

I built a general "backend guidance" skill with eight reference files covering Django patterns, API conventions, query structure, and deployment. Its description was broad enough to match nearly any backend question. So it loaded constantly, including for "what does this error message mean."

Each load added over 1,500 tokens of overhead. Across a full day of coding, that came to 10,000-15,000 wasted tokens for a skill being used as an expensive search engine.

Fix: Split into three focused skills: migration-safety, query-optimizer, and permissions-audit. Each has a narrow description matching exactly one problem category. Token waste dropped to near zero.

This pattern matters most in runtimes that auto-discover skills based on description matching - Claude Code, Cursor 2.4 and later, GitHub Copilot, and OpenCode. In runtimes that only fire on explicit invocation, broad descriptions waste your time but not your tokens. The invocation comparison across runtimes covers these differences in detail.

2. The Flat Output Nobody Read

hipaa-guardian scanned a codebase and returned 47 items as a flat list. SSN in a test fixture. Email in a README. Real SSN in a production .env file. All presented at identical visual weight, identical formatting.

A developer on the team ran it, got 47 items, scrolled through the first ten, and closed the report. The .env result, the only one that mattered, was number 31.

The tokens were spent. The skill ran correctly. And the critical result was invisible because it looked the same as a test fixture match.

Fix: Risk scoring. Critical items surface first. The .env SSN is now item #1. Same 47 issues, but the one that matters is at the top. No extra tokens, just better output design.
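A minimal sketch of the reordering, for illustration only. The article does not publish HIPAA-Guardian's scoring methodology, so the weights, the `Finding` fields, and the location labels below are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights: not the skill's real risk-scoring methodology.
LOCATION_WEIGHT = {"production_config": 3, "source": 2, "docs": 1, "test_fixture": 0}
DATA_WEIGHT = {"ssn": 3, "email": 1}


@dataclass(frozen=True)
class Finding:
    path: str
    data_type: str   # e.g. "ssn", "email"
    location: str    # e.g. "production_config", "test_fixture"

    @property
    def risk(self) -> int:
        # Multiply so that sensitive data in a harmless location still scores low.
        return DATA_WEIGHT.get(self.data_type, 1) * LOCATION_WEIGHT.get(self.location, 1)


def prioritise(findings: list[Finding]) -> list[Finding]:
    """Return findings with the highest risk first; ties keep their scan order."""
    return sorted(findings, key=lambda f: f.risk, reverse=True)


findings = [
    Finding("tests/fixtures/patients.json", "ssn", "test_fixture"),
    Finding("README.md", "email", "docs"),
    Finding(".env", "ssn", "production_config"),
]
for f in prioritise(findings):
    print(f.risk, f.path)
```

The sort adds no tokens to the report. It only changes which item the reader sees first.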

3. The Skill That Silently Drifted From the Codebase

scaffold-api had hardcoded assumptions about what the base ViewSet class supported. When I added a new mixin to that base class, the skill kept generating scaffolding without it.

The output looked right. No errors. Just missing functionality that showed up in code review two weeks later. Fourteen days of scaffolded endpoints missing a mixin, because the skill assumed static source code instead of reading it.

Fix: The skill now reads core/viewsets.py at runtime before generating anything. It can't fall behind the codebase because it checks the source on every invocation. Costs one extra file read per run, but prevents silent errors that waste far more developer time than tokens.
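The article does not show how the skill inspects core/viewsets.py. One way a helper script could do it is to parse the file with Python's `ast` module and list the base classes as written. The class name `BaseViewSet` and the mixin names in the example source are hypothetical.

```python
import ast


def class_bases(source: str, class_name: str) -> list[str]:
    """Return the base-class expressions of `class_name` as written in `source`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return [ast.unparse(base) for base in node.bases]
    raise LookupError(f"class {class_name!r} not found")


# Hypothetical stand-in for the contents of core/viewsets.py.
EXAMPLE_SOURCE = '''
class BaseViewSet(AuditLogMixin, SoftDeleteMixin, viewsets.ModelViewSet):
    pass
'''

print(class_bases(EXAMPLE_SOURCE, "BaseViewSet"))
```

Because the source is only parsed and never imported, this works without Django installed and reflects whatever mixins the file declares at the time of the run.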

4. Two Skills That Fired at Once

The code review skill and hipaa-guardian both mentioned "HIPAA compliance" in their descriptions. Touching a serializer triggered both in Claude Code. Two overlapping reports. Confused output. Double the token bill.

Fix: Exclusive trigger domains. The code review skill owns structural quality (naming, duplication, tests). HIPAA-Guardian owns PHI/PII detection. No shared keywords. Test by running common tasks and checking which skills auto-load.
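A cheap static check can flag shared keywords before you test in the runtime. This is only a rough sketch: the runtime, not keyword matching, decides what actually auto-loads, and the descriptions and stopword list below are made up for illustration.

```python
import re
from itertools import combinations

# Small illustrative stopword list; extend it for real descriptions.
STOPWORDS = {"a", "an", "and", "the", "for", "in", "of", "on", "to", "or", "with", "when", "use"}


def keywords(description: str) -> set[str]:
    """Lowercased words from a description, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", description.lower()) if w not in STOPWORDS}


def shared_triggers(descriptions: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Map each pair of skills to the keywords their descriptions share (empty pairs omitted)."""
    overlaps = {}
    for (name_a, desc_a), (name_b, desc_b) in combinations(sorted(descriptions.items()), 2):
        common = keywords(desc_a) & keywords(desc_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical descriptions before and after the fix.
before = {
    "code-review": "Review code for naming, duplication, tests and HIPAA compliance",
    "hipaa-guardian": "Scan for PHI and PII to check HIPAA compliance",
}
after = {
    "code-review": "Review code for naming, duplication and test coverage",
    "hipaa-guardian": "Scan for PHI and PII",
}
print(shared_triggers(before))
print(shared_triggers(after))
```

An empty result does not prove two skills will never fire together, so the manual test of running common tasks still applies.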

When a Prompt Is the Better Choice

Not every repeated task justifies a skill. Of the 17 we built, four shouldn't exist. They ran rarely, the requirements kept shifting, or no one but me needed them. Each one costs time to write, time to maintain, and tokens every time it auto-loads on a request it didn't need to handle.

Use a prompt when you'll do this fewer than five times total, when requirements are still shifting (a stale skill gives confident wrong answers that are harder to catch than an agent that admits uncertainty), when the task is conversational and exploratory, when the instruction fits in under 200 tokens, or when you can't write a one-sentence description that matches exactly the right situations.

The last one is the practical test: try to write the description field before you write anything else. If you struggle to make it specific, the pattern isn't stable enough yet - scope differently or wait.

Use a skill when you've copy-pasted the same instructions three or more times with inconsistent results, when multiple people need identical output from the same process, when requirements are stable and the output format is defined, or when you want the instructions versioned in Git and reviewable by the team.

The decision isn't about whether the task is "important enough" for a skill. It's about whether consistency and repeatability matter more than the overhead of maintaining a file.
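The prompt-side criteria above are concrete enough to write down as a checklist. The sketch below encodes them with hypothetical field names. It reports which criteria apply rather than making the decision for you.

```python
from dataclasses import dataclass


@dataclass
class Task:
    expected_uses: int             # how many times you expect to run it in total
    requirements_stable: bool
    exploratory: bool              # conversational, open-ended work
    instruction_tokens: int
    has_precise_description: bool  # can you write the one-sentence description?


def reasons_to_stay_with_prompt(task: Task) -> list[str]:
    """List which of the prompt criteria apply; an empty list means none do."""
    checks = [
        (task.expected_uses < 5, "fewer than five uses"),
        (not task.requirements_stable, "requirements still shifting"),
        (task.exploratory, "conversational or exploratory task"),
        (task.instruction_tokens < 200, "instruction fits in under 200 tokens"),
        (not task.has_precise_description, "no precise one-sentence description"),
    ]
    return [reason for applies, reason in checks if applies]


one_off = Task(expected_uses=2, requirements_stable=False, exploratory=True,
               instruction_tokens=120, has_precise_description=False)
review = Task(expected_uses=30, requirements_stable=True, exploratory=False,
              instruction_tokens=790, has_precise_description=True)
print(reasons_to_stay_with_prompt(one_off))
print(reasons_to_stay_with_prompt(review))
```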

Rules for Keeping AI Agent Skills Lean on Tokens

A few rules from managing a growing skill library:

Cap the core file at 200 lines. If it runs longer, it's doing two jobs. Two focused skills that fire independently are cheaper than one large skill that fires on both contexts.

Push reference material into separate files. Core workflow in SKILL.md. Scoring methodologies, vulnerability databases, and style guides go in references/, loaded on demand. This keeps the base token cost low for every invocation.

Add a complete output example. A full example in examples/ is worth hundreds of tokens of formatting instructions. The model reads the example and matches the structure, which means you can skip verbose output specifications in the main file.

Audit auto-loading quarterly. Check which skills load automatically and when. If a skill fires on conversations where it adds no value, narrow the description. Every unnecessary load is wasted context. If you're running many skills in Claude Code, understanding how auto-loading differs from Cursor's manual invocation will save you from the most common token waste patterns.
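The line cap in the first rule is easy to check mechanically. The sketch below assumes one directory per skill, each holding a SKILL.md, under a common root, so adjust the glob if your layout differs. It does not audit auto-loading, which still needs observing the runtime.

```python
from pathlib import Path

CORE_LINE_CAP = 200  # the cap suggested in the rules above


def oversized_skills(skills_root: Path, cap: int = CORE_LINE_CAP) -> dict[str, int]:
    """Return {skill directory name: line count} for every SKILL.md longer than `cap` lines."""
    report = {}
    for skill_file in sorted(skills_root.glob("*/SKILL.md")):
        line_count = len(skill_file.read_text(encoding="utf-8").splitlines())
        if line_count > cap:
            report[skill_file.parent.name] = line_count
    return report
```

Running it in CI with a stricter `cap` for skills that may auto-load would catch growth before it shows up in the context window.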

For the full build process, from writing a SKILL.md to testing and publishing, see How to Build AI Agent Skills That Work in Production. For the security implications of skill design choices, including how to prevent skills from leaking the sensitive data they process, see Prompt Injection and Supply Chain Risks.

At Procedure, our engineers build and deploy AI agent systems that run in production with MCP connections to external tools. If you're building skill-based agent workflows and need engineering help, talk to our team.

Token counts measured with tiktoken's cl100k_base encoding (cross-checked with o200k_base, within 1%). Per-line ratios across the 19 skills in the production repo ranged from 5.62 to 9.88, averaging 7.07 - too wide for a single rule of thumb to be reliable, so every figure above is a measured value rather than an estimate. Session measurements are approximate and vary by model and context window management.
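A short worked example shows why the per-line spread matters. It applies the minimum, average, and maximum ratios to the two skills whose line counts and approximate token counts appear in this article. The helper names are illustrative.

```python
RATIO_MIN, RATIO_AVG, RATIO_MAX = 5.62, 7.07, 9.88  # tokens per line, from the note above


def estimate_range(lines: int) -> tuple[int, int, int]:
    """Token estimates for a file of `lines` lines at the minimum, average, and maximum ratio."""
    return tuple(round(lines * r) for r in (RATIO_MIN, RATIO_AVG, RATIO_MAX))


def relative_error(estimate: float, measured: float) -> float:
    """Signed error of an estimate relative to the measured value."""
    return (estimate - measured) / measured


for name, lines, measured in [("code review", 112, 790), ("HIPAA-Guardian", 253, 2050)]:
    low, avg, high = estimate_range(lines)
    print(f"{name}: estimated {low}-{high} (avg {avg}), measured ~{measured}, "
          f"avg error {relative_error(avg, measured):+.1%}")
```

The average ratio lands close for the code review skill but underestimates HIPAA-Guardian by roughly 13%, which is the gap that measuring each file avoids.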

Mangesh Bide

Guest Author

Mangesh Bide is a software engineer working on Django backends, AI agent tooling, and cloud infrastructure for healthcare. Day-to-day, that's migrations, audit trails, Terraform, and the bugs that only show up in production.