Code-native agents (Claude Code) at research scale
§ Tutorial: Pre-class checklist
Four things, all done before you walk in:
- claude --version returns a version.
- claude opens without an auth prompt.
- ~/github/aem7010-ai exists, with a clean working tree. Clone an empty one if you missed the previous session.
- tidyverse, rvest, readr, ellmer. ellmer is recent and unlikely to be installed by default. install.packages("ellmer") once, before class.

The first ten minutes are for stragglers. If any of the four failed, say so immediately.
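The package item can be checked (and fixed) with a few lines of R. This is a sketch, not part of the course materials; run it once before class.

```r
# Check the four required packages; install any that are missing.
pkgs <- c("tidyverse", "rvest", "readr", "ellmer")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing)) install.packages(missing)
```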
We will use VS Code’s Source Control panel and the terminal interchangeably for Git. Whichever is closer to where you are working.
§ Tutorial: Where we are
The ladder so far:
Mode A vs Mode B, one more time:
Each rung adds capability. Each rung makes verification more your job. Mode B is the rule that carries through all three.
§ Tutorial: What changes
Three shifts from Cowork to Claude Code.
The diff is the only audit trail you should trust. The terminal transcript above the diff describes what the agent says it did. The diff tells you what it actually did. When they disagree, the diff wins.
§ Tutorial: Chat, Cowork, and Claude Code, side by side
| Dimension | Chat | Cowork | Claude Code |
|---|---|---|---|
| Where it lives | Web browser | Desktop app | Terminal, in the repo |
| Workspace unit | Copy-paste buffer | A folder you grant | A Git project |
| Default action | Text in the chat panel | Edits a file directly | Proposes a diff you accept |
| Audit trail | Git commit of pasted code | Git diff of folder | Git diff of accepted edits |
| Best at | One-off questions, prose | Exploratory file work | Multi-file project work |
| Review surface | The chat panel | Chat + file tree | Diff and commit history |
| Friction | Copy-paste fatigue | Permission dialogs | Per-edit accept or reject |
Capability up at each step. Workspace from buffer → folder → project. Risk up at each step. Git turns risk from catastrophic to annoying.
§ Tutorial: What Claude Code is good at · What Claude Code is bad at
Good:
- repo-aware questions (e.g., “… placement parsed in this repo?”)

Bad:
§ Tutorial: Subagents and templates
Claude Code can delegate to subagents: specialized agents with a defined role, a constrained tool kit, and their own system prompt. Each lives as a markdown file in .claude/agents/<name>.md.
Mental image. The set of agents in .claude/agents/ is a small army of specialists. Each one has a single job, a constrained tool kit, a standing order in its system prompt. Some are producers (they do the work). Some are critics (they inspect what was produced). The army travels with you from project to project; the units you actually deploy on a given project, you tailor for that project.
Why this matters for research:
§ Tutorial: What a subagent is
---
name: fact-checker
description: Verifies every numeric claim in a markdown report against the data it cites.
tools: Read, Grep, Bash
---
You are a fact-checker for descriptive reports built from CSV data.
Your job: read a markdown report and verify that every numeric claim
in the prose appears in the report's tables, figure captions, or notes
section, OR can be recomputed from the data file the report cites.
For each claim: PASS, ROUNDED, or FAIL. End with a one-line summary.
Do not edit any file.

Four parts: name, description, tools (an allowlist), and the body (the system prompt). That is the entire anatomy. Agents are markdown files. They are committed with the repo.
§ Tutorial: How does an LLM “check” anything? · A second example: hallucinated citations
The fact-checker is itself an LLM. So how does it verify?
Mechanism (same shape for numbers, citations, anything):
- Read / Bash / WebFetch (whatever ground truth lives where).

The LLM’s role: translation (English → query) and adjudication (does the prose match what came back?). Not memory.
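A concrete instance of the mechanism: the check the fact-checker might run through its Bash tool is just ordinary R against the cited file. This is a hypothetical sketch (the agent decides its own commands at run time); the file path is this project's.

```r
# Recompute a claimed number straight from the cited data,
# then adjudicate: does the prose match what came back?
library(readr)
library(dplyr)
d <- read_csv("data/placements_all.csv", show_col_types = FALSE)
d |> count(dept)  # compare these counts against the numbers in the report
```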
Hallucinated citations are the same pattern, different domain. Mata v. Avianca (2023): lawyer sanctioned for a court brief containing six ChatGPT-generated citations to cases that did not exist. Fix: a citation-checker agent with Read, Bash, WebFetch tools. Query CrossRef per citation. Return VERIFIED / MISMATCH / HALLUCINATED.
General lesson. LLMs fail at problems where the answer lives in a database they have not seen. The fix is not a smarter LLM; it is a critic with tools that reaches the database.
§ Tutorial: Are agents implicit in chat and Cowork?
Agents are not new today. They have been in chat and Cowork all along, just hidden.
| Implicit (chat / Cowork) | Explicit (Claude Code) |
|---|---|
| Low setup, no agent literacy | You write or copy a spec |
| Provider updates centrally | Updates are your job |
| Role is invisible | Role is a markdown file you can read |
| Cannot customize for your project | Tailorable to project conventions |
| Not part of your repo | Versioned, committed, citable in methods |
| Behavior shifts between releases | Pinned to your committed file |
Concrete example. A coauthor opens your repo two years from now to extend your placements study. If your classifier ran in a Cowork chat session: the prompt is gone, the model variant is undocumented, the labels are not reproducible. If it ran through code/classify_llm.R plus .claude/agents/classifier.md: agent file in the repo, prompt in the script, model pinned, labels reproducible. That is why explicit matters for research.
§ Tutorial: A wider catalog of agent types
The pattern is not unique to applied economics. The same shape, different roles.
| Domain | Examples |
|---|---|
| Applied economics | data-validator, regression-reviewer, replication-tester, table-formatter, fact-checker |
| Other research | irb-reviewer (social sci), dataset-citer (climate), protocol-reviewer (bio), experiment-tracker (ML), figure-auditor |
| Outside academia | code-reviewer (software), test-writer, pipeline-monitor (data eng), contract-reviewer (legal), model-risk-reviewer (finance) |
The portable lesson is the shape, not the role. A markdown file in .claude/agents/, a tight system prompt, a constrained tool allowlist, a clear job. The army you assemble reflects the domain you work in.
§ Tutorial: Where did the idea come from, and is it here to stay?
The word “agent” in AI is not new. Minsky’s Society of Mind (1986). Actor-critic in RL (1980s-90s). Multi-agent systems as a research field for decades.
The specific LLM-agent pattern crystallized 2022-23:
Why specialization helps: focused prompts get better attention; constrained tools prevent specific failures; producer + critic catches errors. Same logic as function decomposition in programming, applied to a new substrate.
Is it here to stay? The specific form (markdown files in .claude/agents/) may evolve. The underlying discipline is permanent: separating concerns, scoping responsibilities, building audit trails. Software engineering, not AI fad.
§ Tutorial: A note on Constitutional AI, and how the critic pattern differs across models
The producer-critic pattern you use today at inference time has a direct mirror in how Claude is trained.
Constitutional AI (CAI), Anthropic 2022. Replaces much of the human ranking step in alignment with AI feedback driven by an explicit constitution. The training loop: a helpful-but-uncritical model generates a response; a critic instance is shown the response plus a principle and asked to identify problems and revise. RLAIF (reinforcement learning from AI feedback) replaces RLHF’s human ranking step.
Differences across model families:
| Family | Post-training | Critic in training? |
|---|---|---|
| Claude (Anthropic) | RLAIF + Constitutional AI | Yes, explicitly |
| GPT (OpenAI), Gemini (Google) | Mostly RLHF, some rule-based components | Implicit in labeler decisions |
| Open-weight (Llama, Mistral, Qwen) | SFT, RLHF, DPO, some CAI | Inconsistent across models |
Why this matters today. When you ask Claude to play a fact-checker role, you activate a behavior shape it has practiced thousands of times during training. That does not make it immune to error; it does make role-following more natural for this family.
The portable claim. Whichever model you use as a critic, the discipline of separating roles holds. The specific reliability of each role on each model is something you check with your own examples.
§ Tutorial: Templates: blueprints for whole armies
A subagent codifies what role to play. A template is the level above: a blueprint for a whole army plus conventions for a class of projects.
Three templates worth knowing:
- scrape-placement-page (skill): how to handle these specific PhD pages.
- paper-bootstrap (project template): scaffolds a new paper repo with folders, a starter CLAUDE.md, a Makefile, and a small army of stub agents already installed.
- replication-package (template): turns a finished chapter into a reviewer-ready bundle.

A copied army is rarely the right army for your fight. Use the template as a starting line, not a finish line. Tailoring tip: open Cowork (last session’s tool), point it at your CLAUDE.md and your data, ask it to propose tighter wording for an inherited agent’s system prompt. Iterate, save back to .claude/agents/, commit. One mode (Cowork) maintaining the infrastructure of another (Claude Code).
§ Tutorial: Today’s arc
- CLAUDE.md (~15 min). Launch claude, run /init, replace the auto-generated file with the project’s standing conventions. Commit. CLAUDE.md carries the context.
- Classifier: ellmer + cached responses. No API key needed today.
- Analysis: code/analysis.R + output/analysis.md. Verify the numbers.
- Critic: the fact-checker subagent reviews the report you just produced.

The point is the workflow: a Git project, a code-native agent reading CLAUDE.md for standing context, small scripts each verifiable in isolation, a write-up where every number traces to code, one critic agent that reviews the result before you do.
§ Tutorial: Two prompt styles
Each task block today shows two ways to drive Claude Code:
- Short style: lean on CLAUDE.md. The conventions live in one file; the per-task ask stays focused.

Both produce the same artifacts.
The short style is what you will use in your own research once CLAUDE.md is in place. The long style is the fallback for one-off tasks, for collaborators who do not have the standing context, or for the first time you build a project from scratch. Today, default to short. The Cowork-style alternative lives in a collapsed callout next to each short prompt.
§ Tutorial: Setup and first run
From a terminal at the repo root:
First sanity check inside Claude Code:
List the top-level files and folders in this repo. Do not edit anything.
Two slash commands worth knowing:
- /init reads the repo and writes a starter CLAUDE.md. We will use it next.
- /clear clears conversation history but keeps working directory and Git state.

Per-edit approval is the default. Today, choose yes per edit. The auto-accept mode (Shift+Tab) exists. Day one is not the day to use it.
CLAUDE.md as Standing Context
§ Tutorial: What CLAUDE.md is · Beyond project conventions: encoding your personal style
CLAUDE.md is a markdown file Claude Code reads automatically on every session in this project. It is the standing context for the repo:
- the output/ versus paper/ boundary
- personal coding style (here::here(), naming conventions, plotting defaults, logging patterns)

Once CLAUDE.md is in place, every prompt you write becomes shorter. The rules do not need to be re-stated. The prompt is part of the artifact: versioned, diffable, reviewable like code.
Your personal style travels with you. Copy your personal-style block from project to project. Over a year it becomes a battle-tested document. The tutorial has a list of style rules worth encoding (paths, tidyverse vs base, naming, assignment, errors, returns, plotting, logging).
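A hypothetical slice of such a personal-style block, drawing on the rule categories the tutorial lists. The specific choices below are illustrative, not the course's actual file; adapt each line to your own taste.

```markdown
## Personal style
- Paths: build with here::here(); never setwd().
- Dialect: tidyverse over base R for data wrangling.
- Assignment: <-, not =.
- Errors: fail fast with stop() and an informative message.
- Plotting: ggplot2; save figures explicitly with ggsave().
- Logging: message() for progress, never print().
```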
§ Tutorial: Author CLAUDE.md together
Step 1. In Claude Code, run /init to scaffold a starter file based on the current repo state.
Step 2. Replace it with the project conventions. Two paths:
Step 3. Read the file. Roughly fifty lines, covering folder layout, the five departments, schemas, scraping rules, LLM classifier rules, the reports/papers boundary, and scope rules.
Step 4. Commit. Add project conventions for Claude Code. Sync.
⟶ Switch to the tutorial: The full CLAUDE.md for this project. Read it line by line.
§ Tutorial: The data goal · The five departments
Goal CSV at data/placements_all.csv:
| Column | Example |
|---|---|
| dept | dyson |
| name | Sharan Banerjee |
| year | 2025 |
| placement | Postdoctoral Fellow at KAPSARC, Riyadh |
| source_url | https://dyson.cornell.edu/programs/graduate/placements/ |
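The schema above can be enforced at read time. A sketch of a reader that pins the five columns and their types (assumes the stacked CSV exists; adjust types if year is stored differently):

```r
# Read the goal CSV with an explicit column spec, so a schema drift
# (missing column, reordered columns, non-numeric year) fails loudly.
library(readr)
placements <- read_csv(
  "data/placements_all.csv",
  col_types = cols(
    dept       = col_character(),
    name       = col_character(),
    year       = col_integer(),
    placement  = col_character(),
    source_url = col_character()
  )
)
```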
The five departments (in CLAUDE.md):
| dept | Department |
|---|---|
| dyson | Cornell Dyson |
| berkeley | UC Berkeley ARE |
| davis | UC Davis ARE |
| minnesota | Minnesota Applied Economics |
| wisconsin | Wisconsin AAE |
§ Tutorial: The shared scrape prompt
With CLAUDE.md in place, the scrape prompt is short:
For each of the five departments listed in CLAUDE.md, write code/scrape_<dept>.R
following the scraping conventions in CLAUDE.md. Then write code/stack.R that
produces data/placements_all.csv per the schema in CLAUDE.md. Run all scripts
and report per-department row counts and the total.
What is not in the prompt: the URLs, the column names, the selector strategy, the package allowlist. All in CLAUDE.md. The standing context handles the rest.
⟶ Switch to the tutorial: The shared scrape prompt (and the collapsed Cowork-style alternative).
§ Tutorial: Paste and wait · Verification checklist (scrape)
Paste the prompt. Claude Code writes five code/scrape_<dept>.R files plus code/stack.R, runs them, reports row counts. Do not trust the row counts yet.
Six-step verification (all must pass before commit):
- Rscript code/stack.R runs cleanly from a fresh R session?
- data/placements_all.csv has the right schema (5 columns, right order)?

If a check fails, ask for a patch on the one script that broke. Do not accept a full rewrite.
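The schema check is one line of shell. A minimal sketch, demonstrated on a stand-in file so it is self-contained; in the repo, point it at data/placements_all.csv instead of the demo path.

```shell
# Verify the CSV header matches the five-column schema exactly.
printf 'dept,name,year,placement,source_url\n' > /tmp/placements_demo.csv
expected="dept,name,year,placement,source_url"
actual="$(head -n 1 /tmp/placements_demo.csv)"
[ "$actual" = "$expected" ] && echo "schema OK" || echo "schema MISMATCH: $actual"
```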
§ Tutorial: Commit and push #1
VS Code Source Control or terminal. Either works.
Commit message: Five-department scrape and stack.
Refresh github.com/<your-handle>/aem7010-ai. Click data/placements_all.csv. Spot-check the table.
Two minutes of staging-and-pushing is part of the rhythm. Get it into muscle memory now; you will do it for the rest of your career.
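The terminal version of that rhythm is three commands: git add, git commit, git push. Sketched here in a throwaway repository so it runs anywhere (the --allow-empty and identity flags exist only to make the demo self-contained); in class you stage the real files and push.

```shell
# Demonstrate the commit step of the stage-commit-push rhythm.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=you@example.com -c user.name=demo \
  commit -q --allow-empty -m "Five-department scrape and stack"
git log --oneline -n 1   # shows the commit you just made
```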
§ Tutorial: Why move from rules to a model
The previous session classified placements by keyword rules. Today the rules go away.
Free-text strings are messy:
Keyword rules flip a coin on the edge cases. A model reads the role and the institution together and decides.
The trade is real. Rules are transparent, cheap, deterministic. A model is opaque, costs money, drifts when the version changes. Worth it for tasks where the edge cases dominate.
§ Tutorial: Reproducibility duties when the classifier is a model
Five new responsibilities. Skip any one and the pipeline stops being a research artifact.
- temperature = 0. Default temperature is non-zero; same prompt twice → different labels.
- The model version, pinned in both CLAUDE.md and the script header.

Honest reproducibility profile: the cached labels are reproducible byte-for-byte. The live API behavior is almost reproducible (model pinning + temperature 0) but not perfectly: providers retire models, infrastructure shifts marginally. The cache is what survives all of that.
§ Tutorial: A 30-second introduction to ellmer
ellmer is Posit’s R package for LLM APIs. One interface, multiple providers (Anthropic, OpenAI, Google, Groq).
Two patterns, both visible in the script:
- purrr::map_chr() over a vector of placements.

In class today, no live API calls. The cache covers every row. No ANTHROPIC_API_KEY needed.
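The map_chr() pattern looks roughly like the sketch below. The helper name and prompt wording are hypothetical, and the constructor call assumes ellmer's chat_anthropic() with an api_args list (check your installed version's documentation; the class script's actual structure may differ). The model string is deliberately left as a placeholder: pin the real one.

```r
# One label per placement, via a single chat object mapped over the vector.
library(ellmer)
library(purrr)

classify_one <- function(placement, chat) {
  chat$chat(paste(
    "Classify this placement as academic, government, industry, or other.",
    "Reply with exactly one word:", placement
  ))
}

# chat   <- chat_anthropic(model = "claude-<pinned-version>",
#                          api_args = list(temperature = 0))
# labels <- map_chr(placements$placement, classify_one, chat = chat)
```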
§ Tutorial: Download the cache · The shared classifier prompt
Download the cache:
Short prompt (paste into Claude Code):
Write code/classify_llm.R that produces data/placements_all_classified.csv
following the LLM classifier conventions in CLAUDE.md. Read placements from
data/placements_all.csv and the cache from data/cache/llm_responses.csv. The
script must NOT call the API when no ANTHROPIC_API_KEY is set: instead, label
any cache-miss row as "uncached" and warn with message(). Run the script and
report per-label counts.
The defensive clause matters. Even if the live page added rows since the pre-flight, the script does not blow up on a missing API key. It labels the new rows uncached, warns, and keeps going. The verification checklist catches anything uncached.
§ Tutorial: Verification checklist (classifier)
Five checks. All must pass before the second commit.
- Rscript code/classify_llm.R runs cleanly from a fresh R session?
- data/placements_all_classified.csv has the right schema (6 columns) and the same row count as the input?
- Labels fall in academic, government, industry, other, possibly uncached. Anything else → parse failure. Any uncached rows → flag the cache as incomplete.

If a check fails, ask Claude Code for a patch on the specific failure.
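The label-set check mechanizes as a grep over the label column. A sketch, demonstrated on a stand-in file so it is self-contained; in the repo, feed it the class_llm column extracted from data/placements_all_classified.csv.

```shell
# Count labels outside the allowed set; any such label is a parse failure.
printf 'academic\nindustry\ngovernment\nother\n' > /tmp/labels_demo.txt
bad=$(grep -c -v -x -E 'academic|government|industry|other|uncached' /tmp/labels_demo.txt)
[ "$bad" -eq 0 ] && echo "labels OK" || echo "parse failure: $bad stray labels"
```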
§ Tutorial: Commit and push #2
Stage three files: code/classify_llm.R, data/placements_all_classified.csv, data/cache/llm_responses.csv.
Commit message: LLM classifier with cached responses.
Why the cache is committed. It is data the script needs to be reproducible. Without the cache, the script either falls back to uncached or (with a key set) calls the API. With the cache, anyone can run the pipeline end-to-end without an API key.
output/ vs paper/
§ Tutorial: A convention worth introducing
| Folder | Purpose | Who writes the prose |
|---|---|---|
| output/ | AI-drafted intermediate reports, memos, summaries | The agent. You verify the numbers. |
| paper/ | Research papers, dissertation chapters | You. The agent helps with code, not prose. |
The discipline lives at the paper boundary, not at every numeric document. Today’s write-up is output/. Your dissertation chapters are paper/.
The boundary is also encoded in CLAUDE.md. The agent reads it on every prompt. A short output/AGENTS.md and paper/AGENTS.md make the boundary explicit at the folder level too.
§ Tutorial: The shared analysis prompt
Write code/analysis.R that produces a counts table (dept x class_llm) at
output/analysis_table.csv, a horizontal bar chart of academic-share by dept
with a mean reference line at output/figures/academic_share_by_dept.png
(8x5 in, 150 dpi), and a message() block summarizing total rows, year range,
per-dept counts, the table, and the model+date from the cache.
Then write output/analysis.md as an AI-drafted report (per CLAUDE.md) with
sections: ## Description (two paragraphs, every number sourced from the
message block), ## Counts by department and class, ## Academic share by
department, ## Notes. Run the script first; use its message output as the
source of truth for every number in the prose.
The agent drafts the prose because CLAUDE.md says output/ is for AI-drafted reports. Verification is on the numbers.
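The counts table the prompt asks for is a short dplyr pipeline. A sketch under the project's assumed column names (dept, class_llm) of what code/analysis.R might contain; the agent's actual script may structure it differently.

```r
# dept x class_llm counts, one row per department, one column per label.
library(readr)
library(dplyr)
library(tidyr)

counts <- read_csv("data/placements_all_classified.csv", show_col_types = FALSE) |>
  count(dept, class_llm) |>
  pivot_wider(names_from = class_llm, values_from = n, values_fill = 0)

write_csv(counts, "output/analysis_table.csv")
```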
§ Tutorial: Verification: every number in the prose traces to the script
Open output/analysis.md. For each number in the description paragraphs, find it in the table or in the notes block. Three things to watch for:
If you find an issue, edit the prose by hand. Do not re-run the agent for a wording fix. That is the kind of edit you make yourself.
§ Tutorial: Final commit and push
Stage four files: code/analysis.R, output/analysis_table.csv, output/figures/academic_share_by_dept.png, output/analysis.md.
Commit message: Descriptive analysis of multi-department placements.
To mark the analysis complete:
Refresh github.com. Click output/analysis.md. The report renders with description, table, figure, and notes.
§ Tutorial: Critic in action: the fact-checker demo
The conceptual framing is in place from the agents block earlier. Here we deploy one specialist on the report we just produced.
Step 1. Download the agent:
Step 2. In Claude Code:
Use the fact-checker agent to review output/analysis.md against
data/placements_all_classified.csv. Report the result. Do not edit anything.
The agent reads the report, extracts numeric claims, verifies each against the CSV, and reports PASS / ROUNDED / FAIL. Almost all should PASS. Anything that FAILs is the critic catching what the producer missed.
If the natural-language invocation does not spawn the subagent, use /agents to open the picker, select fact-checker, and provide the same instruction. The triggering UX is version-dependent; the file format is not.
Why we are not designing new agents today. Designing a new agent (system prompt iteration, tool allowlist tuning, eval against examples) is its own skill. Future session.
Then commit: git add .claude/agents/fact-checker.md && git commit -m "Add fact-checker subagent". The army travels with the repo.
§ Tutorial: The three modes, lived
The module is one ladder. You have now climbed all three rungs and added a critic on top of the highest one.
| Rung | Scope | What you watched | Where the discipline lived |
|---|---|---|---|
| Chat | one school, one script | You pasted code, it answered | The script you saved by hand |
| Cowork | one school, small project | The agent built scrape+pipeline | The git diff of the folder |
| Claude Code | five schools + LLM classifier + write-up + critic | The agent built and ran a project, edit by edit; a critic reviewed the result | Diff of accepted edits + commit history + critic’s report |
The bottleneck moved at each rung. Chat made you type. Cowork made you watch. Claude Code makes you read diffs. The agent layer shifts part of the diff-reading to a second model.
§ Tutorial: Where each mode fits in your research life
You will not use code-native agents for everything. You should not.
The one rule that carries forward, across all three modes plus the agent layer:
Mode B. The artifact is code, including CLAUDE.md and the agent files. The audit trail is Git. The verification is yours, even when a critic agent does some of it for you.
The tools will keep getting more capable. The rule will keep getting more important.
§ Tutorial: After class
- github.com/<your-handle>/aem7010-ai now holds: CLAUDE.md, scrape, classifier, analysis, fact-checker.
- Write your own agent: .claude/agents/<name>.md with a tight system prompt and a minimal tool allowlist. Iterate against three or four examples until it works. Now you have one piece of permanent infrastructure that did not exist this morning.

The bottleneck is not typing. It is verification. The cache, the diff, the table, the prose, the critic’s report: all of them are objects you read.
Companion site: arielortizbobea.github.io/aem7010