Code-native agents (Claude Code) at research scale
§ Tutorial: Pre-class checklist
Four things, all done before you walk in:
- claude --version returns a version.
- claude opens without an auth prompt.
- ~/github/aem7010-ai exists, with a clean working tree. Clone an empty one if you missed the previous session.
- tidyverse, rvest, readr, ellmer. ellmer is recent and unlikely to be installed by default. install.packages("ellmer") once, before class.

The first ten minutes are for stragglers. If any of the four failed, say so immediately.
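The package item can be checked (and fixed) with a few lines of R. This is a sketch, not part of the course materials; run it once before class.

```r
# Check the four required packages; install any that are missing.
pkgs <- c("tidyverse", "rvest", "readr", "ellmer")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing)) install.packages(missing)
```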
We will use VS Code’s Source Control panel and the terminal interchangeably for Git. Whichever is closer to where you are working.
§ Tutorial: Where we are
The ladder so far:
Mode A vs Mode B, one more time:
Each rung adds capability. Each rung makes verification more your job. Mode B is the rule that carries through all three.
§ Tutorial: What changes
Three shifts from Cowork to Claude Code.
The diff is the only audit trail you should trust. The terminal transcript above the diff describes what the agent says it did. The diff tells you what it actually did. When they disagree, the diff wins.
§ Tutorial: Chat, Cowork, and Claude Code, side by side
| Dimension | Chat | Cowork | Claude Code |
|---|---|---|---|
| Where it lives | Web browser | Desktop app | Terminal, in the repo |
| Workspace unit | Copy-paste buffer | A folder you grant | A Git project |
| Default action | Text in the chat panel | Edits a file directly | Proposes a diff you accept |
| Audit trail | Git commit of pasted code | Git diff of folder | Git diff of accepted edits |
| Best at | One-off questions, prose | Exploratory file work | Multi-file project work |
| Review surface | The chat panel | Chat + file tree | Diff and commit history |
| Friction | Copy-paste fatigue | Permission dialogs | Per-edit accept or reject |
Capability up at each step. Workspace from buffer → folder → project. Risk up at each step. Git turns risk from catastrophic to annoying.
§ Tutorial: What Claude Code is good at · What Claude Code is bad at
Good:
- repo-aware questions (e.g., “… placement parsed in this repo?”)

Bad:
§ Tutorial: Subagents and templates
Claude Code can delegate to subagents: specialized agents with a defined role, a constrained tool kit, and their own system prompt. Each lives as a markdown file in .claude/agents/<name>.md.
Mental image. The set of agents in .claude/agents/ is a small army of specialists. Each one has a single job, a constrained tool kit, a standing order in its system prompt. Some are producers (they do the work). Some are critics (they inspect what was produced). The army travels with you from project to project; the units you actually deploy on a given project, you tailor for that project.
Why this matters for research:
§ Tutorial: What a subagent is
---
name: fact-checker
description: Verifies every numeric claim in a markdown report against the data it cites.
tools: Read, Grep, Bash
---
You are a fact-checker for descriptive reports built from CSV data.
Your job: read a markdown report and verify that every numeric claim
in the prose appears in the report's tables, figure captions, or notes
section, OR can be recomputed from the data file the report cites.
For each claim: PASS, ROUNDED, or FAIL. End with a one-line summary.
Do not edit any file.

Four parts: name, description, tools (an allowlist), and the body (the system prompt). That is the entire anatomy. Agents are markdown files. They are committed with the repo.
§ Tutorial: How does an LLM “check” anything? · A second example: hallucinated citations
The fact-checker is itself an LLM. So how does it verify?
Mechanism (same shape for numbers, citations, anything):
- Read / Bash / WebFetch (whatever ground truth lives where).

The LLM’s role: translation (English → query) and adjudication (does the prose match what came back?). Not memory.
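A concrete instance of the mechanism: the check the fact-checker might run through its Bash tool is just ordinary R against the cited file. This is a hypothetical sketch (the agent decides its own commands at run time); the file path is this project's.

```r
# Recompute a claimed number straight from the cited data,
# then adjudicate: does the prose match what came back?
library(readr)
library(dplyr)
d <- read_csv("data/placements_all.csv", show_col_types = FALSE)
d |> count(dept)  # compare these counts against the numbers in the report
```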
Hallucinated citations are the same pattern, different domain. Mata v. Avianca (2023): lawyer sanctioned for a court brief containing six ChatGPT-generated citations to cases that did not exist. Fix: a citation-checker agent with Read, Bash, WebFetch tools. Query CrossRef per citation. Return VERIFIED / MISMATCH / HALLUCINATED.
General lesson. LLMs fail at problems where the answer lives in a database they have not seen. The fix is not a smarter LLM; it is a critic with tools that reaches the database.
§ Tutorial: Are agents implicit in chat and Cowork?
Agents are not new today. They have been in chat and Cowork all along, just hidden.
| Implicit (chat / Cowork) | Explicit (Claude Code) |
|---|---|
| Low setup, no agent literacy | You write or copy a spec |
| Provider updates centrally | Updates are your job |
| Role is invisible | Role is a markdown file you can read |
| Cannot customize for your project | Tailorable to project conventions |
| Not part of your repo | Versioned, committed, citable in methods |
| Behavior shifts between releases | Pinned to your committed file |
Concrete example. A coauthor opens your repo two years from now to extend your placements study. If your classifier ran in a Cowork chat session: the prompt is gone, the model variant is undocumented, the labels are not reproducible. If it ran through code/classify_llm.R plus .claude/agents/classifier.md: agent file in the repo, prompt in the script, model pinned, labels reproducible. That is why explicit matters for research.
§ Tutorial: A wider catalog of agent types
The pattern is not unique to applied economics. The same shape, different roles.
| Domain | Examples |
|---|---|
| Applied economics | data-validator, regression-reviewer, replication-tester, table-formatter, fact-checker |
| Other research | irb-reviewer (social sci), dataset-citer (climate), protocol-reviewer (bio), experiment-tracker (ML), figure-auditor |
| Outside academia | code-reviewer (software), test-writer, pipeline-monitor (data eng), contract-reviewer (legal), model-risk-reviewer (finance) |
The portable lesson is the shape, not the role. A markdown file in .claude/agents/, a tight system prompt, a constrained tool allowlist, a clear job. The army you assemble reflects the domain you work in.
§ Tutorial: Where did the idea come from, and is it here to stay?
The word “agent” in AI is not new. Minsky’s Society of Mind (1986). Actor-critic in RL (1980s-90s). Multi-agent systems as a research field for decades.
The specific LLM-agent pattern crystallized 2022-23:
Why specialization helps: focused prompts get better attention; constrained tools prevent specific failures; producer + critic catches errors. Same logic as function decomposition in programming, applied to a new substrate.
Is it here to stay? The specific form (markdown files in .claude/agents/) may evolve. The underlying discipline is permanent: separating concerns, scoping responsibilities, building audit trails. Software engineering, not AI fad.
§ Tutorial: A note on Constitutional AI, and how the critic pattern differs across models
The producer-critic pattern you use today at inference time has a direct mirror in how Claude is trained.
Constitutional AI (CAI), Anthropic 2022. Replaces much of the human ranking step in alignment with AI feedback driven by an explicit constitution. The training loop: a helpful-but-uncritical model generates a response; a critic instance is shown the response plus a principle and asked to identify problems and revise. RLAIF (reinforcement learning from AI feedback) replaces RLHF’s human ranking step.
Differences across model families:
| Family | Post-training | Critic in training? |
|---|---|---|
| Claude (Anthropic) | RLAIF + Constitutional AI | Yes, explicitly |
| GPT (OpenAI), Gemini (Google) | Mostly RLHF, some rule-based components | Implicit in labeler decisions |
| Open-weight (Llama, Mistral, Qwen) | SFT, RLHF, DPO, some CAI | Inconsistent across models |
Why this matters today. When you ask Claude to play a fact-checker role, you activate a behavior shape it has practiced thousands of times during training. That does not make it immune to error; it does make role-following more natural for this family.
The portable claim. Whichever model you use as a critic, the discipline of separating roles holds. The specific reliability of each role on each model is something you check with your own examples.
§ Tutorial: Templates: blueprints for whole armies
A subagent codifies what role to play. A template is the level above: a blueprint for a whole army plus conventions for a class of projects.
Three templates worth knowing:
- scrape-placement-page (skill): how to handle these specific PhD pages.
- paper-bootstrap (project template): scaffolds a new paper repo with folders, a starter CLAUDE.md, a Makefile, and a small army of stub agents already installed.
- replication-package (template): turns a finished chapter into a reviewer-ready bundle.

A copied army is rarely the right army for your fight. Use the template as a starting line, not a finish line. Tailoring tip: open Cowork (last session’s tool), point it at your CLAUDE.md and your data, ask it to propose tighter wording for an inherited agent’s system prompt. Iterate, save back to .claude/agents/, commit. One mode (Cowork) maintaining the infrastructure of another (Claude Code).
§ Tutorial: Today’s arc
- CLAUDE.md (~15 min). Launch claude, run /init, replace the auto-generated file with the project’s standing conventions. Commit. CLAUDE.md carries the context.
- Classifier: ellmer + cached responses. No API key needed today.
- Analysis: code/analysis.R + output/analysis.md. Verify the numbers.
- Critic: the fact-checker subagent reviews the report you just produced.

The point is the workflow: a Git project, a code-native agent reading CLAUDE.md for standing context, small scripts each verifiable in isolation, a write-up where every number traces to code, one critic agent that reviews the result before you do.
§ Tutorial: Two prompt styles
Each task block today shows two ways to drive Claude Code:
- Short style: lean on CLAUDE.md. The conventions live in one file; the per-task ask stays focused.

Both produce the same artifacts.
The short style is what you will use in your own research once CLAUDE.md is in place. The long style is the fallback for one-off tasks, for collaborators who do not have the standing context, or for the first time you build a project from scratch. Today, default to short. The Cowork-style alternative lives in a collapsed callout next to each short prompt.
§ Tutorial: Setup and first run
From a terminal at the repo root:
First sanity check inside Claude Code:
List the top-level files and folders in this repo. Do not edit anything.
Two slash commands worth knowing:
- /init reads the repo and writes a starter CLAUDE.md. We will use it next.
- /clear clears conversation history but keeps working directory and Git state.

Per-edit approval is the default. Today, choose yes per edit. The auto-accept mode (Shift+Tab) exists. Day one is not the day to use it.
CLAUDE.md as Standing Context
§ Tutorial: What CLAUDE.md is · Beyond project conventions: encoding your personal style
CLAUDE.md is a markdown file Claude Code reads automatically on every session in this project. It is the standing context for the repo:
- the output/ versus paper/ boundary
- personal coding style (here::here(), naming conventions, plotting defaults, logging patterns)

Once CLAUDE.md is in place, every prompt you write becomes shorter. The rules do not need to be re-stated. The prompt is part of the artifact: versioned, diffable, reviewable like code.
Your personal style travels with you. Copy your personal-style block from project to project. Over a year it becomes a battle-tested document. The tutorial has a list of style rules worth encoding (paths, tidyverse vs base, naming, assignment, errors, returns, plotting, logging).
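A hypothetical slice of such a personal-style block, drawing on the rule categories the tutorial lists. The specific choices below are illustrative, not the course's actual file; adapt each line to your own taste.

```markdown
## Personal style
- Paths: build with here::here(); never setwd().
- Dialect: tidyverse over base R for data wrangling.
- Assignment: <-, not =.
- Errors: fail fast with stop() and an informative message.
- Plotting: ggplot2; save figures explicitly with ggsave().
- Logging: message() for progress, never print().
```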
§ Tutorial: Author CLAUDE.md together
Step 1. In Claude Code, run /init to scaffold a starter file based on the current repo state.
Step 2. Replace it with the project conventions. Two paths:
Step 3. Read the file. Roughly fifty lines, covering folder layout, the five departments, schemas, scraping rules, LLM classifier rules, the reports/papers boundary, and scope rules.
Step 4. Commit. Add project conventions for Claude Code. Sync.
⟶ Switch to the tutorial: The full CLAUDE.md for this project. Read it line by line.
§ Tutorial: The data goal · The five departments
Goal CSV at data/placements_all.csv:
| Column | Example |
|---|---|
| dept | dyson |
| name | Sharan Banerjee |
| year | 2025 |
| placement | Postdoctoral Fellow at KAPSARC, Riyadh |
| source_url | https://dyson.cornell.edu/programs/graduate/placements/ |
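The schema above can be enforced at read time. A sketch of a reader that pins the five columns and their types (assumes the stacked CSV exists; adjust types if year is stored differently):

```r
# Read the goal CSV with an explicit column spec, so a schema drift
# (missing column, reordered columns, non-numeric year) fails loudly.
library(readr)
placements <- read_csv(
  "data/placements_all.csv",
  col_types = cols(
    dept       = col_character(),
    name       = col_character(),
    year       = col_integer(),
    placement  = col_character(),
    source_url = col_character()
  )
)
```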
The five departments (in CLAUDE.md):
| dept | Department |
|---|---|
| dyson | Cornell Dyson |
| berkeley | UC Berkeley ARE |
| davis | UC Davis ARE |
| minnesota | Minnesota Applied Economics |
| wisconsin | Wisconsin AAE |
§ Tutorial: The shared scrape prompt
With CLAUDE.md in place, the scrape prompt is short:
For each of the five departments listed in CLAUDE.md, write code/scrape_<dept>.R
following the scraping conventions in CLAUDE.md. Then write code/stack.R that
produces data/placements_all.csv per the schema in CLAUDE.md. Run all scripts
and report per-department row counts and the total.
What is not in the prompt: the URLs, the column names, the selector strategy, the package allowlist. All in CLAUDE.md. The standing context handles the rest.
⟶ Switch to the tutorial: The shared scrape prompt (and the collapsed Cowork-style alternative).
§ Tutorial: Paste and wait · Verification checklist (scrape)
Paste the prompt. Claude Code writes five code/scrape_<dept>.R files plus code/stack.R, runs them, reports row counts. Do not trust the row counts yet.
Six-step verification (all must pass before commit):
- Rscript code/stack.R runs cleanly from a fresh R session?
- data/placements_all.csv has the right schema (5 columns, right order)?

If a check fails, ask for a patch on the one script that broke. Do not accept a full rewrite.
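The schema check is one line of shell. A minimal sketch, demonstrated on a stand-in file so it is self-contained; in the repo, point it at data/placements_all.csv instead of the demo path.

```shell
# Verify the CSV header matches the five-column schema exactly.
printf 'dept,name,year,placement,source_url\n' > /tmp/placements_demo.csv
expected="dept,name,year,placement,source_url"
actual="$(head -n 1 /tmp/placements_demo.csv)"
[ "$actual" = "$expected" ] && echo "schema OK" || echo "schema MISMATCH: $actual"
```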
§ Tutorial: Commit and push #1
VS Code Source Control or terminal. Either works.
Commit message: Five-department scrape and stack.
Refresh github.com/<your-handle>/aem7010-ai. Click data/placements_all.csv. Spot-check the table.
Two minutes of staging-and-pushing is part of the rhythm. Get it into muscle memory now; you will do it for the rest of your career.
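The terminal version of that rhythm is three commands: git add, git commit, git push. Sketched here in a throwaway repository so it runs anywhere (the --allow-empty and identity flags exist only to make the demo self-contained); in class you stage the real files and push.

```shell
# Demonstrate the commit step of the stage-commit-push rhythm.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=you@example.com -c user.name=demo \
  commit -q --allow-empty -m "Five-department scrape and stack"
git log --oneline -n 1   # shows the commit you just made
```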
§ Tutorial: Why move from rules to a model
The previous session classified placements by keyword rules. Today the rules go away.
Free-text strings are messy:
Keyword rules flip a coin on the edge cases. A model reads the role and the institution together and decides.
The trade is real. Rules are transparent, cheap, deterministic. A model is opaque, costs money, drifts when the version changes. Worth it for tasks where the edge cases dominate.
§ Tutorial: Reproducibility duties when the classifier is a model
Five new responsibilities. Skip any one and the pipeline stops being a research artifact.
- temperature = 0. Default temperature is non-zero; same prompt twice → different labels.
- The model version, pinned in both CLAUDE.md and the script header.

Honest reproducibility profile: the cached labels are reproducible byte-for-byte. The live API behavior is almost reproducible (model pinning + temperature 0) but not perfectly: providers retire models, infrastructure shifts marginally. The cache is what survives all of that.
§ Tutorial: A 30-second introduction to ellmer
ellmer is Posit’s R package for LLM APIs. One interface, multiple providers (Anthropic, OpenAI, Google, Groq).
Two patterns, both visible in the script:
- purrr::map_chr() over a vector of placements.

In class today, no live API calls. The cache covers every row. No ANTHROPIC_API_KEY needed.
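The map_chr() pattern looks roughly like the sketch below. The helper name and prompt wording are hypothetical, and the constructor call assumes ellmer's chat_anthropic() with an api_args list (check your installed version's documentation; the class script's actual structure may differ). The model string is deliberately left as a placeholder: pin the real one.

```r
# One label per placement, via a single chat object mapped over the vector.
library(ellmer)
library(purrr)

classify_one <- function(placement, chat) {
  chat$chat(paste(
    "Classify this placement as academic, government, industry, or other.",
    "Reply with exactly one word:", placement
  ))
}

# chat   <- chat_anthropic(model = "claude-<pinned-version>",
#                          api_args = list(temperature = 0))
# labels <- map_chr(placements$placement, classify_one, chat = chat)
```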
§ Tutorial: Download the cache · The shared classifier prompt
Download the cache:
Short prompt (paste into Claude Code):
Write code/classify_llm.R that produces data/placements_all_classified.csv
following the LLM classifier conventions in CLAUDE.md. Read placements from
data/placements_all.csv and the cache from data/cache/llm_responses.csv. The
script must NOT call the API when no ANTHROPIC_API_KEY is set: instead, label
any cache-miss row as "uncached" and warn with message(). Run the script and
report per-label counts.
The defensive clause matters. Even if the live page added rows since the pre-flight, the script does not blow up on a missing API key. It labels the new rows uncached, warns, and keeps going. The verification checklist catches anything uncached.
§ Tutorial: Verification checklist (classifier)
Five checks. All must pass before the second commit.
- Rscript code/classify_llm.R runs cleanly from a fresh R session?
- data/placements_all_classified.csv has the right schema (6 columns) and the same row count as the input?
- Labels fall in academic, government, industry, other, possibly uncached. Anything else → parse failure. Any uncached rows → flag the cache as incomplete.

If a check fails, ask Claude Code for a patch on the specific failure.
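The label-set check mechanizes as a grep over the label column. A sketch, demonstrated on a stand-in file so it is self-contained; in the repo, feed it the class_llm column extracted from data/placements_all_classified.csv.

```shell
# Count labels outside the allowed set; any such label is a parse failure.
printf 'academic\nindustry\ngovernment\nother\n' > /tmp/labels_demo.txt
bad=$(grep -c -v -x -E 'academic|government|industry|other|uncached' /tmp/labels_demo.txt)
[ "$bad" -eq 0 ] && echo "labels OK" || echo "parse failure: $bad stray labels"
```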
§ Tutorial: Commit and push #2
Stage three files: code/classify_llm.R, data/placements_all_classified.csv, data/cache/llm_responses.csv.
Commit message: LLM classifier with cached responses.
Why the cache is committed. It is data the script needs to be reproducible. Without the cache, the script either falls back to uncached or (with a key set) calls the API. With the cache, anyone can run the pipeline end-to-end without an API key.
output/ vs paper/
§ Tutorial: A convention worth introducing
| Folder | Purpose | Who writes the prose |
|---|---|---|
| output/ | AI-drafted intermediate reports, memos, summaries | The agent. You verify the numbers. |
| paper/ | Research papers, dissertation chapters | You. The agent helps with code, not prose. |
The discipline lives at the paper boundary, not at every numeric document. Today’s write-up is output/. Your dissertation chapters are paper/.
The boundary is also encoded in CLAUDE.md. The agent reads it on every prompt. A short output/AGENTS.md and paper/AGENTS.md make the boundary explicit at the folder level too.
§ Tutorial: The shared analysis prompt
Write code/analysis.R that produces a counts table (dept x class_llm) at
output/analysis_table.csv, a horizontal bar chart of academic-share by dept
with a mean reference line at output/figures/academic_share_by_dept.png
(8x5 in, 150 dpi), and a message() block summarizing total rows, year range,
per-dept counts, the table, and the model+date from the cache.
Then write output/analysis.md as an AI-drafted report (per CLAUDE.md) with
sections: ## Description (two paragraphs, every number sourced from the
message block), ## Counts by department and class, ## Academic share by
department, ## Notes. Run the script first; use its message output as the
source of truth for every number in the prose.
The agent drafts the prose because CLAUDE.md says output/ is for AI-drafted reports. Verification is on the numbers.
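The counts table the prompt asks for is a short dplyr pipeline. A sketch under the project's assumed column names (dept, class_llm) of what code/analysis.R might contain; the agent's actual script may structure it differently.

```r
# dept x class_llm counts, one row per department, one column per label.
library(readr)
library(dplyr)
library(tidyr)

counts <- read_csv("data/placements_all_classified.csv", show_col_types = FALSE) |>
  count(dept, class_llm) |>
  pivot_wider(names_from = class_llm, values_from = n, values_fill = 0)

write_csv(counts, "output/analysis_table.csv")
```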
§ Tutorial: Verification: every number in the prose traces to the script
Open output/analysis.md. For each number in the description paragraphs, find it in the table or in the notes block. Three things to watch for:
If you find an issue, edit the prose by hand. Do not re-run the agent for a wording fix. That is the kind of edit you make yourself.
§ Tutorial: Final commit and push
Stage four files: code/analysis.R, output/analysis_table.csv, output/figures/academic_share_by_dept.png, output/analysis.md.
Commit message: Descriptive analysis of multi-department placements.
To mark the analysis complete:
Refresh github.com. Click output/analysis.md. The report renders with description, table, figure, and notes.
§ Tutorial: Critic in action: the fact-checker demo
The conceptual framing is in place from the agents block earlier. Here we deploy one specialist on the report we just produced.
Step 1. Download the agent:
Step 2. In Claude Code:
Use the fact-checker agent to review output/analysis.md against
data/placements_all_classified.csv. Report the result. Do not edit anything.
The agent reads the report, extracts numeric claims, verifies each against the CSV, and reports PASS / ROUNDED / FAIL. Almost all should PASS. Anything that FAILs is the critic catching what the producer missed.
If the natural-language invocation does not spawn the subagent, use /agents to open the picker, select fact-checker, and provide the same instruction. The triggering UX is version-dependent; the file format is not.
Why we are not designing new agents today. Designing a new agent (system prompt iteration, tool allowlist tuning, eval against examples) is its own skill. Future session.
Then commit: git add .claude/agents/fact-checker.md && git commit -m "Add fact-checker subagent". The army travels with the repo.
§ Tutorial: The three modes, lived
The module is one ladder. You have now climbed all three rungs and added a critic on top of the highest one.
| Rung | Scope | What you watched | Where the discipline lived |
|---|---|---|---|
| Chat | one school, one script | You pasted code, it answered | The script you saved by hand |
| Cowork | one school, small project | The agent built scrape+pipeline | The git diff of the folder |
| Claude Code | five schools + LLM classifier + write-up + critic | The agent built and ran a project, edit by edit; a critic reviewed the result | Diff of accepted edits + commit history + critic’s report |
The bottleneck moved at each rung. Chat made you type. Cowork made you watch. Claude Code makes you read diffs. The agent layer shifts part of the diff-reading to a second model.
§ Tutorial: Where each mode fits in your research life
You will not use code-native agents for everything. You should not.
The one rule that carries forward, across all three modes plus the agent layer:
Mode B. The artifact is code, including CLAUDE.md and the agent files. The audit trail is Git. The verification is yours, even when a critic agent does some of it for you.
The tools will keep getting more capable. The rule will keep getting more important.
§ Tutorial: After class
- github.com/<your-handle>/aem7010-ai now holds: CLAUDE.md, scrape, classifier, analysis, fact-checker.
- Write your own agent: .claude/agents/<name>.md with a tight system prompt and a minimal tool allowlist. Iterate against three or four examples until it works. Now you have one piece of permanent infrastructure that did not exist this morning.

The bottleneck is not typing. It is verification. The cache, the diff, the table, the prose, the critic’s report: all of them are objects you read.
Companion site: arielortizbobea.github.io/aem7010