Session 8: AI Tools III

Code-native agents (Claude Code) at research scale

Slides for this session: View the slide deck (opens in your browser; press F for fullscreen). The slides are a lean anchor to the concepts below. The walkthrough on this page is the substantive material and the reference you will come back to.

Want a PDF for note-taking? Open the slides in your browser, append ?print-pdf to the URL, and use File → Print → Save as PDF. Reveal.js handles the layout. Works in Chrome, Edge, and Firefox.

Pre-class checklist. Four things, all done before you walk in.

  1. Claude Code installed and logged in. From a terminal, claude --version returns a version string and claude opens an interactive prompt without asking you to log in. The browser login completes once and persists.
  2. The repo ~/github/aem7010-ai exists, with a clean working tree. If you completed the previous session, you already have it. If you did not, clone an empty aem7010-ai repo from GitHub now (see the previous session’s tutorial for the steps; we will not redo them in class).
  3. R installed, with the four packages we will use. From R or RStudio, run install.packages(c("tidyverse", "rvest", "readr", "ellmer")) once before class. ellmer is recent and unlikely to be installed by default; the other three may already be there from previous sessions. The class is not the place to discover that a library() call fails because the package is missing.
  4. Internet works on your laptop. Two small files get downloaded during class (the LLM cache and the fact-checker agent). Confirm curl https://www.google.com returns HTML, or just open a browser tab.

The first ten minutes of class are reserved for stragglers. If any of the four above failed, say so immediately.

Where we are in the course

Today closes the AI-tools module. The previous two sessions established a ladder. Chat is a probabilistic pattern-completer with no file access and no code execution. Cowork is the same kind of model wrapped in a desktop interface that can read your files and run code in a sandbox. Claude Code is the third rung: an agent that lives in your terminal, treats your repository as the primary interface, and edits and runs code by writing diffs that you accept or reject.

Two things change today, and one stays the same. The scope grows from one department to five. The classifier moves from a keyword rule to a language model called from inside a script. The discipline that keeps you safe stays the same. Mode B, plus Git, plus pushing to GitHub. Today you will exercise that discipline at a scale that makes it count.

This session does not assume you completed the previous one. If your aem7010-ai repo is mostly empty, the prompts below will produce everything Session 8 needs. If your repo already holds a Cowork-drafted Dyson scraper and pipeline, the new files Claude Code writes today will land beside them in the same code/, data/, and output/ folders. The previous work does not have to be perfect for today to land cleanly.

Note: These sessions assume Sessions 4 and 5

Every exercise starts with a clean Git working tree and ends with a git push. Without Git, a code-native agent is dangerous. Today we will use both VS Code’s Source Control panel and the terminal for Git, depending on which is closer to where you are working.

Recap and the third rung

Mode A vs Mode B, one more time

The single most important distinction from the previous two sessions carries straight into today.

Mode A: AI as runtime. You ask the AI to do the thing. The output is data. The reasoning lives inside the model.

Mode B: AI as code author. You ask the AI to write code that does the thing. The output is a script. The reasoning lives in the script, visible and rerunnable.

Mode B is the reproducible path. Mode A almost never belongs in a paper.

The temptation in Claude Code looks different from the temptation in Cowork. Cowork wanted to give you a table in the chat panel. Claude Code wants to give you a finished result you accept without reading. The diff view exists precisely so you can resist that temptation. Every prompt today is written so the script is the artifact, the diff is the audit trail, and your eyes are the verifier.

What changes when the agent reads your repo

Three capabilities shift between Cowork and Claude Code.

First, the repository becomes the primary interface. Cowork sees a folder. Claude Code sees a Git project. It can read your branch, your commits, your tracked and untracked files, and your .gitignore. When it proposes a change, the change is presented as a diff against your current commit, not as a free-floating file edit. That diff is the only object you should trust.

Second, the loop is faster and tighter. Claude Code writes, runs, reads the result, and writes again, all without leaving the terminal session. Cowork could do this too, but its UI is conversational. Claude Code’s UI is the file tree and the diff. The friction of switching between tools disappears, which means the discipline of stopping to verify must come from you.

Third, the actions compose better. A single prompt to Claude Code can produce ten files, modify three, and run two scripts, in a few minutes. The model’s plan is shown step by step before it executes. You can interrupt, you can reject any single edit, and you can rewind to a prior commit. None of this matters if you do not look. All of it matters when you do.

Git as the safety net, even more so

Every Claude Code session in this course follows the same skeleton, with two small additions for the terminal-native workflow.

  1. Before you prompt: confirm the working tree is clean. git status shows “nothing to commit, working tree clean”.
  2. Prompt Claude Code. Watch each tool call appear in the terminal. Read each diff before accepting.
  3. After Claude Code finishes: review the full set of changes. In VS Code’s Source Control panel, click each changed file to see its diff. Or use git diff from the terminal.
  4. Stage, commit, push. Either VS Code’s three clicks (the +, the checkmark, the cloud-with-arrow) or the three-line terminal pattern (git add, git commit, git push). Use whichever is closer to where you are.

The point is the same one as the previous session, with one extra emphasis. The diff view is the only reliable record of what the agent did. The terminal transcript above the diff describes what Claude Code says it did. The diff tells you what it actually did. When the two disagree, the diff wins.

Warning: Never point Claude Code at a dirty working tree

Same rule as before. If git status is not clean, commit or stash first. Otherwise you cannot tell which changes came from you and which came from the agent. Today this matters more than last week, because the agent can edit ten files in one prompt.

Claude Code as a category

The category, not the brand

Claude Code is a current example of code-native agentic AI: a chat interface attached to a model that runs inside your terminal, treats a Git project as its workspace, and presents every file change as a diff you accept or reject. The brand names will rotate. At the time of writing the category includes Claude Code, Cursor’s CLI, Aider, and a handful of open-source agents in the same shape. In two years the list will look different. The category will not.

What you should remember is the category profile, because that is what transfers.

Chat, Cowork, and Claude Code, side by side

The clearest way to place Claude Code is against the two tools you have already used. Same underlying class of model in all three. Different surface, different unit of work, different review pattern.

| Dimension | Chat (two sessions ago) | Cowork (previous session) | Claude Code (today) |
|---|---|---|---|
| Where it lives | Web browser | Desktop app | Terminal, run from inside the repo |
| Workspace unit | None; a copy-paste buffer | A folder you grant | A Git project |
| Default action | Returns text in the chat panel | Edits a file directly | Proposes a diff you accept |
| Audit trail | Your Git commit of pasted code | Git diff of touched folder | Git diff of explicit accepted edits |
| Best at | One-off questions, prose drafts | Exploratory work with files | Multi-file, multi-step project work |
| Review surface | The chat panel | The chat panel + file tree | The diff and the commit history |
| Loop speed | Slow (manual context transfer) | Fast | Faster, less context-switching |
| Friction | Copy-paste fatigue | Permission dialogs | Per-edit accept or reject |

Read the table left to right. Capability goes up at each step. The unit of work moves from a copy-paste buffer to a folder to a Git project. The review object moves from “what code did I paste back” to “what did the agent change in my folder” to “do I accept this specific diff before it lands”. Risk goes up at each step because the agent can act on more of your machine at once. Git, again, is what turns that risk from catastrophic to annoying.

What Claude Code is good at

Claude Code is strong in five areas.

  • Multi-file refactors. Splitting a script into helpers, renaming a variable across files, lifting shared logic into one place. Tasks that are tedious by hand and that you can verify with git diff afterwards.
  • Project scaffolding at scale. Five scrapers, one stacker, one classifier, one analysis script, one README, in a few minutes. Today’s session leans heavily on this.
  • Tasks that require running and reading. Run the script, read the error, patch the script, run again. Claude Code does this loop without you typing the run command yourself.
  • Working from a written plan. A prompt that lists six numbered steps will be executed in order, with one tool call per step. This is the most controllable mode of any of the three tools we have used.
  • Repository-aware questions. “Where in this repo is the column placement parsed?” or “What does code/stack.R actually do?” Claude Code can read the files, find the answer, and quote the lines.

What Claude Code is bad at

Claude Code is a poor fit for four kinds of work.

  • Outputs you cannot sanity-check by eye. Same rule as before, with extra force. A code-native agent will happily produce an estimate that looks right, in a script that runs, with units that are wrong. You have to read the code.
  • One-off conversational questions. “What is a good ggplot theme for academic figures?” Open the chat in your browser. Claude Code is the wrong tool for a question that does not touch your repo.
  • Sensitive data you have not isolated. Claude Code reads everything in the project unless you tell it otherwise. If your repo contains a private dataset under embargo, do not point Claude Code at the project root without first reviewing what the agent will see.
  • Long stretches without review. The “auto-accept edits” mode exists, and it is useful for a trusted refactor. It is not appropriate for the first week of using the tool. Today, accept each edit explicitly.

Examples: when to reach for which

The good-at and bad-at lists are general. The harder skill is recognizing which list applies to the specific task in front of you. Three short examples drawn from the kind of work you actually do.

Chat

Example 1: Choose between two estimators in a paragraph

You are deciding whether to use OLS with clustered standard errors or a multilevel model in your placement-data analysis. You want a paragraph that lays out the trade-offs.

Why chat? No data, no code, no repo. The task is reasoning over text. Claude Code’s repo access adds nothing. You also do not want this paragraph to land in your manuscript directly: the chat output is one input to a paragraph you will write yourself, not the paragraph itself. There is no script to commit and no Mode B discipline to apply, because there is no artifact.

Cowork

Example 2: Eyeball thirty CSVs from a coauthor

Same example as last session, deliberately. A coauthor sent thirty USDA county-year CSVs with column names that drift across years. You want to see which columns exist in which year before deciding how to clean them.

Why Cowork? This is exploratory inspection, not project building. Cowork’s chat-plus-folder surface is the right shape: list the folder, peek at three files, propose a mapping, run it once, report. The committed artifact at the end is one cleaning script. Claude Code would do this fine too, but the project framing is overkill for an inspection task.

Claude Code

Example 3: Build the multi-school placements study

You want to scrape five PhD placement pages, stack them, classify each row with an LLM, and produce a write-up. The work spans many files and several scripts, and depends on Git for safe iteration.

Why Claude Code? This is a project. Five scrapers in code/, one classifier, one analysis script, one figure, one markdown report. The work happens across many files, the output of one step is the input to the next, and the reproducible artifact is the entire repo. Claude Code is the only one of the three tools that treats the repo as the unit of work. Today’s session is exactly this task.

Subagents and templates

Most of today’s hands-on work runs through one main Claude Code session. The agent reads files, writes scripts, runs them, and proposes diffs. But Claude Code can also delegate to subagents, and that capability is worth front-loading before the work starts.

A mental image worth carrying. The set of agents in your repo’s .claude/agents/ folder is a small army of specialists. Each one has a single job, a constrained tool kit, and a standing order in its system prompt. Some are producers; they do the work. Some are critics; they inspect what was produced. The army travels with you from project to project, but the units you actually deploy are tailored to the project at hand. We will use one specific specialist at the end of class, and the framing here is what makes the demo land.

What a subagent is

A subagent is a specialized agent invoked by the main Claude Code session with a defined role, a constrained tool allowlist, and its own system prompt. It lives as a single markdown file in .claude/agents/<name>.md. The frontmatter declares the agent’s name, description, allowed tools, and optionally a model. The body is the prompt that describes what this agent is for and how it should behave.

A simple example, the kind of file we will use at the end of class:

---
name: fact-checker
description: Verifies every numeric claim in a markdown report against the data it cites.
tools: Read, Grep, Bash
---

You are a fact-checker for descriptive reports built from CSV data.

Your job: read a markdown report and verify that every numeric claim in the prose
appears in the report's tables, figure captions, or notes section, OR can be
recomputed from the data file the report cites.

For each claim you check, return one line: PASS, ROUNDED (the prose rounds a
number from the table; report what was rounded and to what), or FAIL (the prose
contains a number that does not appear in or follow from the data; show the
prose claim and what the data actually says).

End with a one-line summary of how many claims you checked and how many failed.
Do not edit any file. Reading and reporting only.

The agent has a name (fact-checker), a description that tells the main agent when to invoke it, a tool allowlist (read-only here), and a system prompt that describes its job. That is the entire anatomy. Agents are markdown files. They are committed with the repo.

How does an LLM “check” anything?

A reasonable question to pause on. The fact-checker is itself a language model. Language models generate tokens. So how can one verify anything?

The answer is that subagents are not just text generators. They are text generators with tools. The tools: Read, Grep, Bash line in the frontmatter is doing the heavy lifting. When you ask the fact-checker to verify a numeric claim, the agent does not look the answer up in its training data, and it does not “remember” the contents of your data file. It does this:

  1. Read the prose into context. The Read tool returns the bytes of output/analysis.md. The text of every numeric claim now sits in the agent’s context window, verbatim.
  2. Reach for the data. Either Read the CSV into context, or, for files too large to fit, run a Bash command like head or wc to get a usable slice. Either way the data the agent will reason about comes from disk, not from training.
  3. Recompute each claim. For each numeric claim from step 1, the agent generates a small Bash command that recomputes the same number from the data file. For example: awk -F, '$1=="dyson" && $6=="academic" {n++} END {print n}' data/placements_all_classified.csv. The shell runs the command. The shell returns a real integer.
  4. Compare and report. The prose number (from step 1) and the recomputed number (from step 3) are now both in the agent’s context, side by side. The agent reports PASS, ROUNDED, or FAIL by comparing the two values.

The LLM’s role is translation and adjudication: turn an English claim into a query, run the query, compare the result against the prose. The actual numbers come from the file, not from the model. The model never has to “remember” that Berkeley placed fourteen students; it recomputes that number from the CSV every time, and only then compares.
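
If you want to reproduce the same kind of check by hand, the R equivalent of the awk one-liner above looks like this. A minimal sketch, assuming data/placements_all_classified.csv exists with the schema from CLAUDE.md:

library(readr)
library(dplyr)

# Recompute a prose claim such as "Dyson placed N students in academic jobs"
# directly from the classified panel instead of trusting the report.
placements <- read_csv("data/placements_all_classified.csv", show_col_types = FALSE)

placements |>
  filter(dept == "dyson", class_llm == "academic") |>
  nrow()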

This is why the tool allowlist matters. Without Read and Bash, a fact-checker would be a language model hallucinating about numbers it has never seen. With them, the agent has a route to ground truth that does not pass through training data. Retrieval-and-recompute is the mechanism. Pure generation is not.

Where this still fails, and what to watch for.

  • Misreading the prose. The agent extracts a claim it did not see correctly. Catch by reading the agent’s list of extracted claims; if it looks wrong, the verification of that claim is not meaningful.
  • A wrong query. The agent generates a Bash or Grep command that does not actually correspond to the prose claim (a wrong filter, a wrong column). The output is a real number, but it is the wrong real number. Catch by skimming the tool-call trace: each PASS or FAIL should be accompanied by a visible command whose logic you can follow.
  • A claimed result without a real query. Rare but possible. The agent reports a verification outcome without having actually run the tool. Catch by checking the tool-call trace: every numeric claim should have a corresponding tool invocation. If a number appears in the report without a query behind it, treat that line as unverified.

The discipline carries through, in the same shape as everywhere else in this course. The artifact (the agent’s report) is reviewable. The audit trail (the tool calls) is visible. The verification of the verifier is yours.

A second example: hallucinated citations and the citation-checker

Numbers are not the only thing language models hallucinate. Citations are arguably the most notorious case for academic readers, and the fix follows the same pattern as the fact-checker. Worth walking through, because the failure mode is one every PhD student writing with AI assistance will encounter.

A language model asked to support a claim with a citation will routinely produce a reference that looks real. Plausible authors. A plausible journal. A reasonable year. A DOI that follows the right format. None of it exists. The model has seen the shape of citations across its training data; it produces something with that shape; and the shape is convincing even when the content is invented.

This is not hypothetical. In Mata v. Avianca (2023), a US lawyer filed a brief in federal court containing six citations to court cases that did not exist; ChatGPT had generated them. The lawyer was sanctioned. Editors of academic journals report similar issues with rejected submissions, and several PhD-program guidelines have started naming citation hallucination explicitly. If you write with AI assistance, this is a risk you actively manage.

The mechanism behind the failure is the same one we just discussed for numbers. The model has seen many citations. It has not seen a database of which citations are real. Asked to produce one, it generates structure-conforming text. Asked from memory whether a citation is real, it generates a confident yes. There is no ground truth in the model’s weights for this question.

The fix is structurally identical to the numeric fact-checker. Build a critic with tools that route around the model’s memory. The agent’s job is not to remember whether a paper exists. The agent’s job is to:

  1. Read the draft and extract every citation.
  2. For each citation, query an actual citation database (CrossRef, OpenAlex, PubMed for biomedical work). The query is constructed from the citation’s metadata: title plus first author last name, or DOI when present.
  3. Compare what the database returns to what the draft cites.
  4. Report VERIFIED, MISMATCH (the paper exists but the cited metadata is wrong, for example a year that disagrees), or HALLUCINATED (no plausible match in the database).

A minimal version of this agent’s frontmatter:

---
name: citation-checker
description: Verifies every citation in a markdown or LaTeX draft against CrossRef.
tools: Read, Bash, WebFetch
---

You are a citation checker for academic drafts.

Your job: read a draft and verify each citation against CrossRef (or
OpenAlex, or another canonical citation database accessible from the
shell).

Workflow:
1. Read the draft. Extract every citation, including in-text citations
   and references.
2. For each citation, query CrossRef using its API. The query is the
   title plus first author last name, or the DOI when one is present.
   Compare returned metadata to the citation in the draft.
3. Return one line per citation: VERIFIED, MISMATCH (the paper exists
   but the citation is wrong somewhere; show the disagreement), or
   HALLUCINATED (no plausible match).

Constraints. You are read-only. Do not edit the draft.

The shape is the fact-checker again, with WebFetch added so the agent can hit a public API. The LLM is still the translator and adjudicator: extract the claim, generate a query, compare what comes back. The ground truth lives in CrossRef, not in the model.

This is the general lesson that hallucinated references make especially vivid. The class of problems language models fail at by themselves is the class of problems where the answer is in a database the model has not seen. The fix is not to make the model bigger, smarter, or more careful in its instructions. The fix is to give the model a tool that reaches the database, and to constrain the agent’s job to translation and comparison. Numbers in a CSV, citations in CrossRef, DOIs in DataCite, gene names in NCBI, drug interactions in DrugBank: same shape every time. Different domain, same critic-with-tools pattern.
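
To see the retrieval step concretely, here is a minimal R sketch of the kind of query a citation-checker generates behind the scenes. It hits CrossRef’s public REST API through the query.bibliographic and query.author parameters; jsonlite is not one of today’s four packages, so treat this as illustration rather than part of the session’s pipeline, and note that the exact shape of the parsed response can vary:

library(jsonlite)   # assumed installed; not one of the four course packages

# One citation extracted from the draft, expressed as metadata.
title  <- "ReAct: Synergizing Reasoning and Acting in Language Models"
author <- "Yao"

# Ask CrossRef for the closest bibliographic match.
url <- paste0(
  "https://api.crossref.org/works?rows=1",
  "&query.bibliographic=", utils::URLencode(title, reserved = TRUE),
  "&query.author=", utils::URLencode(author, reserved = TRUE)
)
best <- fromJSON(url)$message$items

# Compare what the database returned to what the draft cites.
best$title[[1]]   # title as CrossRef records it, if a match exists
best$DOI[1]       # the registered DOI for that match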

Tip: In your own writing, before any submission

If you used AI to draft prose anywhere in a manuscript, run a citation-checker before you submit. The cost is a few minutes per draft. The cost of submitting a paper with a hallucinated reference can be a desk rejection, a retraction, or worse. The asymmetry favors checking.

Why agents matter for research

Three reasons, each tied directly to the verification reflex that has been the spine of this course.

First, specialization beats generalist prompts. A fact-checker agent that does only one job does that job better than the same model doing it as task seven of nine. The constrained system prompt cuts noise. The tool allowlist prevents drift.

Second, composition makes verification tractable. A producer agent writes the analysis. A critic agent reviews it. Your eyes confirm. Three layers of checking that previously lived only in your eyes. The bottleneck moves, but the bottleneck always moves; what matters is that the verification work gets done at all.

Third, reusability turns one careful agent into infrastructure. A code-reviewer written for this course’s style works on any of your future projects with minimal edits. The agent file is portable. Your future self, on a different project, drops the same fact-checker.md into .claude/agents/ and gets the same critic.

Templates: blueprints for whole armies

A subagent codifies what role to play. A template is the level above: a blueprint for assembling a whole army of agents and conventions for a class of projects. The course has been showing you templates implicitly all along, in the functional folder layout, the verification checklists, and the CLAUDE.md you will write today. A formal template lives where the agent can read it: a .claude/skills/<name>/SKILL.md file, or an entire project-template repo on GitHub.

By way of example, three templates worth knowing. A scrape-placement-page skill that captures how to handle these specific PhD pages, including selector strategies, common quirks, and the verification checks. A paper-bootstrap template that scaffolds a new research-paper repo with the right folders, a starter CLAUDE.md, a Makefile, a .gitignore, and a small army of stub agents. A replication-package template that turns a finished chapter into a reviewer-ready bundle.

The way templates work in practice. You start a new paper. You clone the paper-bootstrap template. The first commit of the new project is not from scratch but from a tested baseline: the layout is already right, three or four agents are already in .claude/agents/, and a starter CLAUDE.md is in place.

The catch is the army metaphor again. A copied army is rarely the right army for your fight. The agents that came in the template are general. Your project has specifics: a particular data shape, a particular verification rule, a methodological convention you want enforced. The work that makes the template useful rather than ceremonial is tailoring the inherited agents to your project.

A practical workflow for tailoring. Open the inherited agent’s markdown file in VS Code. Read it. Then open Cowork (the agentic desktop tool from the previous session), point it at your repo, and ask it to read your CLAUDE.md and a sample of your data and propose tighter wording for the agent’s system prompt and a tighter tool allowlist. Iterate against two or three real examples in the chat. Save the result back to .claude/agents/ and commit. This is one of the few places where Cowork’s chat surface beats Claude Code’s terminal: writing the prose that is an agent’s system prompt is conversational work, and Cowork’s loop is built for that. The output is still a committed markdown file. Mode B holds; the discipline does not change. You are using one mode (Cowork) to maintain the infrastructure of another (Claude Code).

Critics as a class

Among the agent types you might write, the critic family is the one most directly tied to research discipline. A critic does not produce the artifact. It reviews it. Its job is to catch errors before you do.

Three caveats worth keeping in mind. First, a critic from the same model as the producer shares blind spots. The same model can be wrong about the same things in the same direction. Diversification across providers (one model produces, a different model criticizes) is the eventual move; today, both layers are Claude. Second, a critic does not replace your eyes. It catches what it was told to catch. Errors outside the critic’s scope still fall to you. Third, a critic adds tokens and time. Run it where the cost of a missed error is high.

Are agents implicit in chat and Cowork?

Yes, and the difference between implicit and explicit is the lesson worth carrying out of class.

In chat, you have one continuous conversation with one general-purpose agent. There is no visible agent abstraction. The model adapts to the task by reading your prompt, but you cannot see, edit, or commit any specification of what role the model is playing.

In Cowork, the system goes a step further. It loads skills and plugins automatically based on the task. Ask Cowork to “make a slide deck” and it routes through a presentation-building specialization. Ask it to “review this PDF” and it loads PDF skills. The specialization exists; the user does not write the spec.

Claude Code’s subagents are the same idea made explicit. The spec lives as a markdown file in your repo. You write it (or copy and tailor someone else’s). It is versioned, diffable, and committed alongside the code.

The trade-offs separate cleanly:

| Implicit (chat / Cowork) | Explicit (Claude Code subagents) |
|---|---|
| Low setup, no agent literacy required | Setup cost; you write or copy a spec |
| The provider can update behavior centrally | Updates are your job and your responsibility |
| The role the model plays is invisible to you | The role is a markdown file you can read |
| Cannot customize for your project | Tailorable to project conventions |
| Not part of your repo | Versioned, committed, citable in methods |
| Behavior can shift between releases without warning | Pinned to your committed file |

For day-to-day exploration, the implicit path is faster and lighter. For research that has to survive peer review, the explicit path is the only one that survives.

Why explicit, in one example

A coauthor opens your repo two years from now

They want to extend your placements study to a sixth department, and they want to reproduce the labels you generated for the original five.

If your classifier ran in a Cowork chat session. The transcript is gone. The exact prompt you used is gone. The model variant that was loaded behind the scenes is undocumented. Your coauthor can rerun a classifier, but only after reverse-engineering what method you actually used. The labels they get may not match the labels you committed.

If your classifier ran through code/classify_llm.R plus .claude/agents/classifier.md. The script is in the repo. The agent file is in the repo. The system prompt is in the agent file. The model is pinned in the script. Your coauthor opens the agent, reads it, runs the script, and gets the same labels.

That difference is the entire reason explicit agents matter for research. The implicit path is faster while you are working. The explicit path is faster two years later, when reproducing the work is the actual job.

A wider catalog of agent types

The pattern of capturing a specialized role in a markdown file is not unique to applied economics. The same shape is used across research and industry; only the role changes. Three brief catalogs to show how transferable the move is.

In applied economics and adjacent fields:

  • data-validator: checks a panel CSV against a schema (column types, ranges, missingness). After every scrape or merge.
  • regression-reviewer: reads a regression script and flags missing controls, ad-hoc clustering, or specifications without clear identification.
  • replication-tester: clones a repo, runs the Makefile or script chain, compares outputs to committed expected outputs.
  • table-formatter: takes raw R or Stata output and produces a publication-ready LaTeX table.
  • fact-checker: verifies numeric claims in prose against the data. Today’s demo.

In other research fields:

  • irb-reviewer (social sciences): reads a survey instrument and flags wording that may need IRB review or that risks introducing bias.
  • dataset-citer (climate, earth sciences): checks that every dataset cited in a draft has a valid DOI and a matching license.
  • protocol-reviewer (biology, biomedical): reviews a wet-lab protocol against safety and method standards before a run.
  • experiment-tracker (machine learning): logs hyperparameters and evaluation results across runs into a structured ledger.
  • figure-auditor (any quantitative field): checks figures for missing axis labels, illegible legends, or stale data.

Outside academia:

  • code-reviewer (software engineering): the canonical critic. Reads a diff before commit, flags style and correctness issues. Mature open-source examples exist for most languages.
  • test-writer (software): writes unit tests for new functions, often as a partner to the producer agent.
  • pipeline-monitor (data engineering): watches production data pipelines for schema drift or output anomalies, summarizes for the on-call.
  • contract-reviewer (legal): flags potentially problematic clauses in contract drafts, points the human at the sections that need attention.
  • model-risk-reviewer (finance): audits a quantitative model’s assumptions, stress tests, and documentation against internal model-risk standards.

The portable lesson is the shape, not the role. A markdown file in .claude/agents/, a tight system prompt, a constrained tool allowlist, and a clear job. What you encode reflects the domain you work in. The fact-checker we use at the end of class is one specific instance of a pattern you will reach for in every project that follows.
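
As one concrete instance from the first list, a minimal data-validator for this project could be written in the same one-file format as the fact-checker above. The file below is illustrative, not something we run today; its checks come straight from the schemas and scraping conventions in CLAUDE.md.

---
name: data-validator
description: Checks data/placements_all.csv against the schema in CLAUDE.md after every scrape or merge.
tools: Read, Bash
---

You are a data validator for this repository's placement panel.

Your job: read data/placements_all.csv and check it against the schema in
CLAUDE.md: columns dept, name, year, placement, source_url in that order;
dept limited to the five department codes; year a four-digit number; no rows
where every data cell is empty.

Return one line per check: PASS, or FAIL with a count and up to three example
rows. Do not edit any file. Reading and reporting only.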

Tip: A note on where these lists came from

Most of the catalog above is composite: a mix of agents that exist in production at large companies, agents described in research-tooling threads, and agents I have seen colleagues build for their own labs. None of them is canonical. The point is the shape of the pattern. When you read about a new “AI agent for X” in your own field over the next year, look for the markdown-file underneath. If it is there, the lesson from today applies. If it is not, the tool is something else.

Where did the idea come from, and is it here to stay?

A short intellectual history, because students will hear the word “agent” used a lot of different ways over the next few years, and the term has a longer arc than the current LLM moment.

The word “agent” in AI is not new. Marvin Minsky’s Society of Mind in the 1980s framed intelligence as the emergent product of many small specialized agents. Multi-agent systems have been a research area in computer science for decades. In machine learning specifically, actor-critic methods (a producer policy paired with a critic that scores its outputs) are textbook material from the late 1980s and early 1990s.

The specific LLM-agent pattern we use today crystallized around 2022 and 2023. The ReAct paper (Yao and co-authors, 2022) showed that interleaving reasoning steps with tool calls produced better answers than either pure reasoning or pure tool use, and that paper is the most direct ancestor of how subagents work today. AutoGPT and BabyAGI in early 2023 popularized the idea that an LLM could decompose a task into subtasks and tackle each in sequence. Frameworks like LangChain, AutoGen, and CrewAI made the pattern accessible to ordinary developers. By 2024 and 2025, the convention of “specialized agent defined as a markdown file with a system prompt and a tool allowlist” had stabilized across several tools, including Claude Code.

Note: A note on Constitutional AI, and how the critic pattern differs across models

The producer-critic pattern you are using today at inference time has a closer ancestor than ReAct: it has a direct mirror in how Anthropic trains the Claude family in the first place. The approach is called Constitutional AI (Bai and co-authors at Anthropic, 2022), and it is worth a brief explanation because the differences matter when you reach for any model as a critic.

Constitutional AI in one paragraph. Most large models are post-trained with reinforcement learning from human feedback (RLHF): humans rank model outputs, a reward model is fitted to those rankings, and the base model is fine-tuned to produce outputs humans prefer. Constitutional AI replaces much of the human ranking step with AI feedback driven by an explicit set of principles, a “constitution”. The training loop has two key moves. First, a helpful-but-uncritical model generates a response; then a critic instance of the model is shown the response together with a principle from the constitution and asked to identify problems and rewrite the response. The revised data is used for supervised fine-tuning. Second, a preference model is trained on AI-generated comparisons (this is the RLAIF step: reinforcement learning from AI feedback). The result is a model that has been shaped by an explicit, inspectable set of principles and that has had a critic in its training loop from the start.

Why this matters for what you are doing. The producer-critic pattern is not just a clever inference-time trick. It is built into the training of the models we use as critics today. When you ask Claude to play a fact-checker role, you are activating a behavior shape the model has practiced thousands of times during training. That does not make the model immune to error. It does make the role-following more natural for this family of models than it might be for a model trained primarily with RLHF and no critic step.

Differences with other model families. The contrast is more about training methodology than about raw capability, and any specific claim about which model is “better at being a critic” has to be earned with empirical evaluation on a real task. That said, two patterns are worth knowing.

GPT and Gemini families are post-trained primarily with variants of RLHF, sometimes with constitutional or rule-based components added later. The principles guiding alignment are mostly implicit, encoded in the labelers’ decisions. These models are generally strong at the producer role; the critic role is reachable through prompting but is not a first-class part of training.

Open-weight models (Llama, Mistral, Qwen, others) are fine-tuned with various combinations of supervised fine-tuning, RLHF, DPO (direct preference optimization), and increasingly some constitutional-style methods. The diversity of approaches means the agent role-following can be inconsistent across these models, and a critic prompt that works on Claude may need careful adaptation on a different model. This is one reason why diversification across providers (one model produces, a different model criticizes) requires evaluation rather than blind trust.

The portable claim. The producer-critic separation we deploy in .claude/agents/ is the inference-time analog of a training-time pattern. The pattern works because criticism is a different cognitive task from production, and forcing the separation, whether at training time or at inference time, catches errors that a single pass would miss. Whichever model family you use, the discipline of separating roles holds; the specific reliability of each role on each model is something you check with your own examples.

Why specialization helps is partly empirical and partly engineering. Empirically, focused prompts get better attention than long generalist ones; constrained tool kits prevent specific failure modes (an agent that cannot write files cannot corrupt data); separating producer from critic catches errors the producer alone would miss. Engineering-wise, the same logic that makes function decomposition useful in programming makes agent decomposition useful in LLM work. Smaller, scoped units are easier to test, replace, and reason about. None of this is unique to language models. It is software engineering applied to a new substrate.

Will the abstraction stay? Two answers, both true. The specific form may evolve. Today’s subagent in .claude/agents/<name>.md is one implementation choice. Other tools use other abstractions: graph nodes, role classes, mixtures of experts at the model level. As models become more capable at following long instructions, some of the specialization that today requires separate agents may collapse into different prompts within a single agent context. The exact name “agent” may shift again before the decade is out. The underlying discipline is permanent. Separating concerns, scoping responsibilities, building audit trails, composing small pieces into larger systems: this is software engineering, not an AI fad. As long as you are building research code that needs to survive scrutiny, some version of “this thing did this work, here is its scope, here is its log” will be there. Whether it is called an agent, a tool, a skill, or a job, the shape transfers.

For your purposes in this course. Use the agent pattern when it earns the verification it adds. Do not use it where one prompt would do. The cost is real: more files, more tokens, more cognitive overhead. The benefit is real: auditable specialization, composition, reusability across projects. The tradeoff resolves the same way most engineering tradeoffs do, case by case, with a gentle preference for the more disciplined option as a project matures.

Today’s arc

With the framing in place, here is the arc for the next 75 minutes.

  1. Setup, then author CLAUDE.md. Launch claude, run /init, and replace the auto-generated CLAUDE.md with the project’s standing conventions: folder layout, the five-department list, schemas, scraping rules, LLM classifier rules, and the output/ versus paper/ boundary. Commit the file.
  2. Scale to five departments. Drive Claude Code with a short prompt to write code/scrape_<dept>.R for five departments and code/stack.R. The prompt is short because CLAUDE.md carries the context.
  3. Classify with an LLM. Drive Claude Code to write code/classify_llm.R using ellmer, with a cache layer. Hand-label twenty rows to verify.
  4. Descriptive write-up. Drive Claude Code to write code/analysis.R and output/analysis.md. Verify every number against the script’s output.
  5. Critic in action. A short live demo of the fact-checker subagent reviewing the report you just produced. The conceptual framing is already in place from the Subagents and templates block earlier in this tutorial; the demo is where the framing becomes a concrete tool you have used.
  6. Debrief across the three modes you have now used, plus the new agent layer.

The point is not the placements data. The point is the workflow: a Git project, a code-native agent reading a CLAUDE.md for standing context, a stack of small scripts each verifiable in isolation, a write-up where every number traces to code, and one critic agent that reviews the result before you do.

Note: Two prompt styles, both will appear today

Each task block below shows two ways to drive Claude Code. The short prompt is two or three sentences that lean on the conventions in CLAUDE.md. The Cowork-style long prompt is the all-in-one specification that does not require any standing context. The short style is what you will use in your own research once CLAUDE.md is in place. The long style is the fallback for one-off tasks, for collaborators who do not have the standing context, or for the first time you build a project from scratch.

Setup and first run

Open the repo

Open a terminal at the repo root.

cd ~/github/aem7010-ai
git status

git status should report “nothing to commit, working tree clean”. If it does not, commit or stash first. Today we will accumulate a number of edits across several Claude Code sessions; starting from a clean slate is what makes each commit point useful.

If you also want VS Code open in parallel for Source Control, run code . from the same folder. We will use VS Code’s diff view at the commit points, and the terminal for Claude Code itself.

Launch claude

From the same terminal, run:

claude

You should see a Claude Code prompt with the project name and the current branch. The first thing to do is a tiny sanity check that the agent can see the repo. At the prompt, type:

List the top-level files and folders in this repo. Do not edit anything.

Claude Code will run a tool call to list the directory, print the result, and stop. Read the output. The list should match what you see in the file tree on the left of VS Code or in ls.

Two commands worth knowing

Inside Claude Code’s prompt, two slash commands matter today.

  • /init reads the repo and writes a short CLAUDE.md describing the project. We will use this in a moment.
  • /clear clears the conversation history but keeps the working directory and the Git state. Useful when you want to start a fresh prompt without restarting Claude Code.

To exit, press Ctrl+C twice or type exit.

Per-edit approval is the default

When Claude Code wants to write or modify a file, it stops and asks. The prompt shows the file name, a preview of the diff, and three options: yes, yes for this kind of action this session, or no. Today, choose yes per edit. The auto-accept mode (Shift+Tab) exists, but using it on day one defeats the verification reflex this course is built around.

Quick package check

The pre-class checklist asked you to install four R packages: tidyverse, rvest, readr, ellmer. If you skipped that step, run this one-liner in R or RStudio now, before any script runs. It is the only command in today’s session that touches the package manager.

install.packages(setdiff(c("tidyverse", "rvest", "readr", "ellmer"), rownames(installed.packages())))

The line installs only what is missing. If everything is already installed, it is a no-op. If ellmer is the only thing missing (the most common case), you will see one short install pass and nothing else. Once it returns, you are ready.

Author CLAUDE.md together

What CLAUDE.md is, and why it matters

CLAUDE.md is a markdown file Claude Code reads automatically on every session in this project. It is the standing context for the repo: what folders exist, what schemas the data follows, which packages we use, what the rules are. Once it exists, every prompt you write becomes shorter, because the rules do not need to be re-stated.

Think of it as the “house style” of the repo. A new collaborator (human or agent) reads this file first and then knows how to work here.

For a research course, CLAUDE.md carries one extra weight. The prompt is part of the artifact. It is versioned, it shows up in git diff, and it is reviewable like any other code. When a reviewer asks “how were these labels produced”, the answer is in the script and in the conventions file the agent was reading when the script was written.

Run /init to scaffold a starting CLAUDE.md

In Claude Code, type:

/init

The agent will read the current state of the repo and write a CLAUDE.md describing what it found. For a near-empty repo this will be short. Read the file in VS Code. It is a starting point, not the final version.

Replace it with the course conventions

Now we replace the auto-generated CLAUDE.md with the conventions for this project. Two ways to do this. Pick whichever is faster on your machine.

Option A: download the file. The course repo ships a known-good CLAUDE.md for this project. From a separate terminal at the repo root:

curl -L -o CLAUDE.md \
  https://raw.githubusercontent.com/arielortizbobea/aem7010/main/ai-tools/seeds/CLAUDE.md

Open CLAUDE.md in VS Code and read it line by line. Make sure every section makes sense. If any rule looks wrong for your machine, edit it now.

Option B: paste it through Claude Code. In the claude prompt, paste the content shown below as one message after a sentence like “Replace CLAUDE.md with this exact content:”. Claude Code will write the file. Accept the diff.

The content of CLAUDE.md is reproduced here so you can see the shape and edit it on the page if you need to.

The full CLAUDE.md for this project

# AEM 7010: AI-tools running exercise

This repository holds the multi-department PhD placements study built across the AI-tools module of AEM 7010.

## Project structure

Functional folders, no chronology.

- `code/`: R scripts, one per task. Each script must run end-to-end from a fresh R session.
- `data/`: input data and processed CSVs. Schemas below.
- `data/cache/`: committed cache files for LLM-in-the-loop steps.
- `output/`: AI-drafted intermediate reports, tables, and figures.
- `paper/`: researcher-authored prose. The agent assists with code only here. (Empty for now.)

## The five departments

| dept code | URL |
|---|---|
| `dyson` | https://dyson.cornell.edu/programs/graduate/placements/ |
| `berkeley` | https://are.berkeley.edu/graduate/job-market-placement |
| `davis` | https://are.ucdavis.edu/graduate/phd-program/placement |
| `minnesota` | https://apec.umn.edu/graduate/job-placements |
| `wisconsin` | https://aae.wisc.edu/graduate-programs/placement/ |

## Data schemas

`data/placements_<dept>.csv` (per-department scraper output): `name`, `year`, `placement`, `source_url` in this order. The `placement` column is the job title joined to the institution by " at ", e.g. "Assistant Professor at University of Illinois Urbana-Champaign". The `source_url` column is the dept's page URL repeated on every row.

`data/placements_all.csv` (stacked): `dept`, `name`, `year`, `placement`, `source_url`.

`data/placements_all_classified.csv` (classified): the columns above plus `class_llm` with values in {`academic`, `government`, `industry`, `other`}.

`data/cache/llm_responses.csv` (LLM cache, one row per unique placement string): `placement`, `model`, `date_run`, `raw_response`, `label`.

## Scraping conventions

- Use only `rvest` and `readr`.
- Anchor selectors on stable text (heading text, recognizable column header), not CSS class names.
- Drop rows where all data cells are empty.
- End each script with a `message()` reporting the row count.

## LLM classifier conventions

- Use only `ellmer`.
- Pin the model to `claude-haiku-4-5-20251001`.
- Set sampling to deterministic: `params = params(temperature = 0)`.
- Always go through the cache at `data/cache/llm_responses.csv`. Cache hit means no API call. Cache miss means one API call and one new cache row.
- Store the raw response, not just the label.
- Commit the cache.

## Reports versus papers

- `output/`: AI-drafted intermediate reports. Agent writes prose; you verify numbers.
- `paper/`: researcher-authored prose. The agent does not draft prose that survives into the manuscript. It assists with code, tables, and figures only. The voice and the argument are the researcher's.

## Scope rules for any task

- Stay inside the task. Do not modify files outside the requested scope.
- Do not edit `.gitignore` or `README.md` unless the task says to.
- Do not install packages beyond `tidyverse`, `rvest`, `readr`, `ellmer`.
- Every script must run end-to-end from a fresh R session.

Tip: Why this file is short

Long agent-instruction files are tempting, especially the first time you write one. They are also a known failure mode. The agent reads CLAUDE.md on every prompt; long files cost tokens and add noise. The version above is roughly fifty lines. That is enough to encode the conventions and short enough that the agent reads it at the start of every interaction without losing focus.
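
One convention above is worth previewing in code before we commit the file: the LLM-classifier cache. A minimal sketch of the cache-then-call core that code/classify_llm.R will need, assuming ellmer’s chat_anthropic() constructor and chat() method; the exact names can differ across ellmer versions, and the cache write-back is elided:

library(readr)
library(ellmer)

cache_path <- "data/cache/llm_responses.csv"
cache <- if (file.exists(cache_path)) read_csv(cache_path, show_col_types = FALSE) else NULL

classify_placement <- function(placement) {
  # Cache hit: no API call, return the stored label.
  if (!is.null(cache) && placement %in% cache$placement) {
    return(cache$label[match(placement, cache$placement)])
  }
  # Cache miss: one API call, pinned model, deterministic sampling.
  chat <- chat_anthropic(
    model  = "claude-haiku-4-5-20251001",
    params = params(temperature = 0),
    system_prompt = "Classify a PhD placement as academic, government, industry, or other. Reply with one word."
  )
  raw <- chat$chat(placement)
  # ...append placement, model, date_run, raw_response, and the parsed label
  # to the cache and write it back to cache_path (omitted here)
  raw
}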

Commit CLAUDE.md

This is your first commit of the day. Either path works. In VS Code’s Source Control panel:

  1. Stage CLAUDE.md.
  2. Commit message: Add project conventions for Claude Code.
  3. Commit, then Sync Changes.

Or, from the terminal:
git add CLAUDE.md
git commit -m "Add project conventions for Claude Code"
git push

Refresh github.com/<your-handle>/aem7010-ai. The CLAUDE.md file is now the first thing a visitor sees on the repo page. That is what you want: anyone who lands on the repo, including future you and any future agent, reads the conventions before reading the code.

Beyond project conventions: encoding your personal style

The CLAUDE.md we just committed is heavy on project-specific information: the five departments, the schemas, the LLM rules. The file is also the right place to encode your personal style, the choices that travel with you across projects rather than living in any one repo. This is the answer to a question every student asks within a week of using a code-native agent: can I make it write code in my own style? Yes. You write the style down once. The agent reads it on every prompt.

A useful way to think about what to put in your personal-style section. Organize by dimension, not by ad-hoc rule. Below is a working taxonomy of eleven dimensions of code style. Pick the ones that matter to you and write one or two specific rules under each. The taxonomy is more durable than any specific rule, because it gives you somewhere to put a new rule when you discover one.

  • Project structure. Folder layout (functional or chronological), where data lives, where outputs land, naming conventions for script files. Example rule: “Functional folders only: code/, data/, output/. No session-numbered subfolders.”

  • File-level conventions. What appears at the top of every script (purpose, inputs, outputs, dependencies), how to structure section dividers, maximum length before splitting. Example rule: “Every script begins with a four-line header: title, purpose, depends on, produces. Section dividers use # ---- section name ----.”

  • Naming. Variables, functions, files, data-frame columns, constants. The most-asked dimension; the easiest to encode. Example rule: “snake_case for variables and functions. UPPER_SNAKE for constants. File names that mirror their main function.”

  • Formatting. Indentation, line length, alignment, blank-line conventions. Most of this is handled by an autoformatter; what is left is taste. Example rule: “Two-space indent. Hard wrap at 100 characters. Two blank lines between top-level functions, one between logical blocks within a function.”

  • When to write a function. The factoring rules. Example rule: “Three repetitions or thirty lines, whichever comes first. Helpers used by more than one script live in code/helpers.R. Helpers used in one script live near the top of that script.”

  • Function design. Argument order, default values, the shape of return values, function-level documentation. Example rule: “Data first, options second. Functions returning multiple values return a named list, never a parallel-position vector. Roxygen documentation for any function called from another file.”

  • Control flow and language idioms. Pipes versus nesting, loops versus map, base versus tidyverse. Example rule: “Use the native R pipe |>. Prefer purrr::map_*() over for loops when iteration is not for side effects. dplyr and tidyr for data manipulation; base R for inner loops where speed matters.”

  • Errors, messages, logging. When to use which mechanism, and what to print at the start and end of long-running steps. Example rule: “rlang::abort() for errors. cli::cli_inform() for user-facing messages. Print row counts before and after every data-transformation step.”

  • Comments and documentation. Density, format, function-level docs, READMEs. Example rule: “Roxygen for any exported function. Inline comments only when the why is not obvious from the code. One paragraph at the top of every script explaining what it does and what it produces.”

  • Interactive workflow. Whether the script can be run line by line in the REPL with useful intermediate state, and whether you can drop a debugger in cleanly. This is the dimension easiest to lose to AI-generated code if you do not call it out. Example rule: “Break long pipes into intermediate named objects when each step is something you might want to inspect. Avoid anonymous lambdas longer than two lines inside map_*(). Anything you might browser() into should be its own named function.”

  • Reproducibility hygiene. Paths, seeds, package management, runtime checks, inline assertions. Example rule: “All paths via here::here(). All set.seed() calls in one block at the top of an analysis. renv.lock committed. Every function that accepts a data frame asserts the schema with stopifnot() at the top.”
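
To see several of these rules in one place, here is how the file-header, naming, function-design, and interactive-workflow rules above might look in a small R script. The helper and its purpose are invented for illustration; the style is the point:

# clean_placements.R ----------------------------------------------------------
# Purpose:    illustrate the style rules above (not part of today's pipeline)
# Depends on: data/placements_all.csv
# Produces:   nothing; example only

# ---- helpers ----
# Data first, options second; schema asserted on entry; named-list return.
split_placement <- function(placements, sep = " at ") {
  stopifnot(is.data.frame(placements), "placement" %in% names(placements))
  parts <- strsplit(placements$placement, sep, fixed = TRUE)
  list(
    title       = vapply(parts, function(p) p[1], character(1)),
    institution = vapply(parts, function(p) paste(p[-1], collapse = sep), character(1))
  )
}

# ---- interactive-friendly flow ----
# Intermediate named objects instead of one long pipe, so each step is inspectable.
placements_raw  <- readr::read_csv("data/placements_all.csv", show_col_types = FALSE)
placement_parts <- split_placement(placements_raw)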

Two further notes on using this taxonomy. First, you do not need a rule under every dimension. Some dimensions you will not care about; leave them blank. The point is that when you find yourself fixing the same little thing in three of your scripts, you ask which dimension it lives under, and you add a rule there. The taxonomy is a thinking tool, not a checklist to fill in.

Second, the dimensions interact. A rule about interactive workflow (“break long pipes into intermediate variables”) changes the kind of code Claude Code writes more than a rule about formatting ever will, because it shifts the shape of the script and not just its appearance. When you weigh which dimension to encode next, prefer the ones that change shape over the ones that change appearance. Autoformatters can handle appearance; only CLAUDE.md can teach the agent to write scripts you can step through line by line.

A small habit worth forming. When you start a new project, open the most recent project’s CLAUDE.md, copy your personal-style section to the new repo, then add the project-specific conventions on top. Over a year, the personal-style block becomes a refined and battle-tested document. You bring it to coauthors. You hand it to your future students. It is one of the longest-lived artifacts in your research workflow, and it costs almost nothing to maintain.

Tip: Other places code style can live

For style rules that are not specific to LLM agents, keep using the standard tools too. An .editorconfig file at the repo root tells most editors about indentation and line endings. R has the styler package for automatic formatting and lintr for static checks. Python has black and ruff. The agent will respect these conventions when they exist; CLAUDE.md is for the rules that go beyond what an autoformatter can enforce.

Scale to five departments

The data goal

Five PhD placement pages, scraped into one tidy panel.

| Column | Example |
| --- | --- |
| dept | dyson |
| name | Sharan Banerjee |
| year | 2025 |
| placement | Postdoctoral Fellow at KAPSARC School of Public Policy, Riyadh |
| source_url | https://dyson.cornell.edu/programs/graduate/placements/ |

The deliverable for this section is a working scraper per department under code/, a stacker script, one CSV per department under data/, and a stacked data/placements_all.csv that adds the dept column. Six scripts, six CSVs, one commit.

The five departments

| dept code | Department | URL |
| --- | --- | --- |
| dyson | Cornell Dyson | https://dyson.cornell.edu/programs/graduate/placements/ |
| berkeley | UC Berkeley ARE | https://are.berkeley.edu/graduate/job-market-placement |
| davis | UC Davis ARE | https://are.ucdavis.edu/graduate/phd-program/placement |
| minnesota | Minnesota Applied Economics | https://apec.umn.edu/graduate/job-placements |
| wisconsin | Wisconsin AAE | https://aae.wisc.edu/graduate-programs/placement/ |

The instructor pre-flighted these the weekend before. They render placement tables in roughly comparable shapes. Roughly. Each page has at least one quirk that the scraper will need to handle.

How we work this together

We do this in lockstep. The instructor projects the same screen you have. Each step happens on every laptop in the room. We pause at the verification checkpoints. Do not skip ahead, and do not lag silently.

The rhythm is the same as the previous session, adapted to the diff-driven loop:

  1. Working tree clean? Check.
  2. Paste the shared prompt into claude.
  3. Watch each tool call. Accept each edit explicitly.
  4. Run the verification checklist.
  5. Commit. Push. Refresh github.com on a side tab to confirm.

The shared scrape prompt

With CLAUDE.md in place, the prompt to drive the scrape is three sentences. The agent already knows the department list, the schema, the selector strategy, the package constraints, and the scope rules. We just need to ask for the work.

Prompt to paste into Claude Code:

For each of the five departments listed in CLAUDE.md, write code/scrape_<dept>.R
following the scraping conventions in CLAUDE.md. Then write code/stack.R that
produces data/placements_all.csv per the schema in CLAUDE.md. Run all scripts
and report per-department row counts and the total.

That is the whole prompt. Three sentences, no schema repetition, no scope clauses, no list of packages. The standing context handles the rest. Notice what is not in the prompt: the URLs (in CLAUDE.md), the column names (in CLAUDE.md), the selector strategy (in CLAUDE.md), the package allowlist (in CLAUDE.md), the run order (implied by “run all scripts”).

This is the workflow you should expect to use in your own research once a project’s CLAUDE.md is in place. The conventions live in one file. The per-task prompts stay short, focused, and reusable.

If your CLAUDE.md is incomplete, or you want the all-in-one specification visible at the moment of pasting, this is the long-form prompt that does not require any standing context. It is the same shape we used in the previous session’s Cowork prompt, scaled up.

I want you to write five R scrapers and one stacker, then run them once. Work in this repo.

Department list. Use exactly these five departments and their URLs:
- dyson: https://dyson.cornell.edu/programs/graduate/placements/
- berkeley: https://are.berkeley.edu/graduate/job-market-placement
- davis: https://are.ucdavis.edu/graduate/phd-program/placement
- minnesota: https://apec.umn.edu/graduate/job-placements
- wisconsin: https://aae.wisc.edu/graduate-programs/placement/

Per-department scrapers. For each department, write code/scrape_<dept>.R that scrapes its placement table and saves a CSV at data/placements_<dept>.csv with exactly these columns, in this order: name, year, placement, source_url. The placement column is the job title joined to the institution by the word " at ". The source_url column is the URL above for that department, repeated on every row. Drop any row in the table where all four cells are empty. Use only rvest and readr.

Selector strategy. For each page, anchor your selector on a stable text feature of the page (a heading, a section title, or a recognizable column header) rather than on a CSS class name.

Stacker. Write code/stack.R that reads each data/placements_<dept>.csv, prepends a dept column with the dept code, binds the rows, and writes data/placements_all.csv with columns dept, name, year, placement, source_url. Print the row count per department and the total at the end with message().

Run order. After writing the scripts, run each code/scrape_<dept>.R in turn, then code/stack.R. Stop after stacking. Report the per-department row counts and the total.

Scope. Do not modify any existing file in code/, data/, output/, .gitignore, or README.md other than the new files listed above. Do not create any other files.

Both prompts produce the same artifacts. The short version composes with CLAUDE.md; the long version stands alone. Today, use the short version. The lesson is the workflow.
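Before pasting, it helps to know what a reasonable answer looks like. Here is a hedged sketch of one scrape_<dept>.R following the conventions above; the table position, column headers, and renames are illustrative and will differ page by page, so treat it as the shape you expect to see in the diff, not code that will run unchanged on any given department.

# code/scrape_dyson.R (sketch): one placement page into one tidy CSV
library(rvest)
library(readr)

source_url <- "https://dyson.cornell.edu/programs/graduate/placements/"
page <- read_html(source_url)

# Selector strategy: anchor on a recognizable column header, not a CSS class.
# Parse every table, keep the first whose headers mention "Name".
tables <- lapply(html_elements(page, "table"), html_table)
hit <- which(vapply(tables, function(tb) any(grepl("Name", names(tb))), logical(1)))
stopifnot(length(hit) >= 1)   # fail loudly if the anchor text is not found
target <- tables[[hit[1]]]

# Column names here are illustrative; the real page dictates the renames
out <- data.frame(
  name       = target[["Name"]],
  year       = target[["Year"]],
  placement  = paste(target[["Position"]], target[["Employer"]], sep = " at "),
  source_url = source_url
)

# Drop rows where every scraped cell is empty
empty <- apply(out[, c("name", "year", "placement")], 1, function(r) all(is.na(r) | r == ""))
out <- out[!empty, ]

write_csv(out, "data/placements_dyson.csv")
message(nrow(out), " rows written for dyson")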

Paste and wait

Time: ~20 minutes. Paste the prompt. Read each tool call as it appears. Accept each edit. Claude Code will write five scrapers and one stacker, run them, and report row counts.

When Claude Code reports the totals, do not yet trust them. Move to the verification checklist.

Verification checklist (scrape)

Six checks, in order. All must pass before you commit. If any fails, fix it first.

  1. Do all six scripts exist at the expected paths? code/scrape_dyson.R, code/scrape_berkeley.R, code/scrape_davis.R, code/scrape_minnesota.R, code/scrape_wisconsin.R, code/stack.R. If any is named differently, rename it. The names are part of the contract.

  2. Does code/stack.R run cleanly from a fresh R session? From the terminal:

    Rscript code/stack.R

    Read the row counts in the message output. If stack.R errors, the problem is upstream; run each scrape_<dept>.R individually to find which one failed.

  3. Does each per-department CSV have a plausible row count? All five should have at least 30 rows. If any has fewer than 10, the scraper either failed silently or the page changed.

  4. Does data/placements_all.csv have the right schema? In R: readr::read_csv("data/placements_all.csv") then names() and nrow(). The columns must be dept, name, year, placement, source_url, in that order. A snippet covering checks 3 through 5 follows this checklist.

  5. Pick three random rows from three different departments and verify them against the live pages. This is the only check that catches silent parsing errors. If even one row is wrong, the scraper for that department is wrong, even if the row count looks right.

  6. Read the diff for each script. In VS Code’s Source Control panel, click each new file. Skim the code. Anything you do not understand, ask Claude Code to explain inline. Then decide whether to keep it.

If a check fails, prompt Claude Code to fix the specific failure. Do not accept a full rewrite: ask for a patch on the one script that broke.
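Checks 3 through 5 can be run from a single R session. A sketch, assuming the standard file names and the five dept codes from the table above:

library(readr)

depts <- c("dyson", "berkeley", "davis", "minnesota", "wisconsin")

# Check 3: per-department row counts
for (d in depts) {
  n <- nrow(read_csv(file.path("data", paste0("placements_", d, ".csv")), show_col_types = FALSE))
  message(d, ": ", n, " rows")
}

# Check 4: schema and row count of the stacked panel
panel <- read_csv("data/placements_all.csv", show_col_types = FALSE)
print(names(panel))   # expect dept, name, year, placement, source_url
print(nrow(panel))

# Check 5: three random rows to verify against the live pages
set.seed(1)
print(panel[sample(nrow(panel), 3), ])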

Commit and push #1

When all six checks pass, commit and push.

  1. Stage all six new R scripts and the six new CSVs.
  2. Commit message: Five-department scrape and stack.
  3. Click the checkmark (or Cmd+Enter / Ctrl+Enter) to commit.
  4. Click Sync Changes at the bottom.
git add code/scrape_*.R code/stack.R data/placements_*.csv
git commit -m "Five-department scrape and stack"
git push

Open github.com/<your-handle>/aem7010-ai and refresh. The new scripts and CSVs should appear. Click data/placements_all.csv; GitHub renders it as a table you can spot-check.

Replace the keyword classifier with an LLM classifier

Time: ~20 minutes. This is the new piece. The previous session classified placements with keyword rules. Today the rules go away. The classifier is a language model called from inside an R script, with a cache that makes the script reproducible.

Why move from rules to a model

Free-text placement strings are messy. The previous session’s keyword rules (“Professor”, “Postdoctoral”, “Bureau”, “Department of”) catch the obvious cases and miss the interesting ones. “Research Economist at the Federal Reserve Bank of Kansas City” is a research role at a public-sector institution. Is that academic? Government? Something else? The keyword classifier flips a coin based on which word matched first. A model can read the full string and decide based on the role and the institution together.

The trade-off is real. Rules are transparent, cheap, and deterministic. A model is opaque, costs money, and gives different answers tomorrow if the model version changes. The trade is worth making for tasks where the edge cases dominate. Today’s task is one of those.

Reproducibility duties when the classifier is a model

If you put a model in your pipeline, you accept five new responsibilities. Skip any of them and the pipeline stops being a research artifact.

  • Pin the model. Use a specific version string, not “the latest Claude”. The classifications change when the model changes.
  • Set sampling to deterministic. Pass temperature = 0 (or the closest equivalent in the package you use). Default temperature is non-zero, which means the same prompt sent twice can return different labels.
  • Cache the responses to disk. One file with one row per unique placement string, including the model name, the date, the prompt, and the raw response. Commit it. The cache is the boundary between the script and the model.
  • Store the raw response, not just the label. The label can be re-derived from the response. The reverse is not true.
  • Document the model and the prompt in the README and the script header. A reviewer in two years should be able to read the README and tell what model and what prompt produced the labels.

If you cannot do all five, do not use an LLM classifier in a paper. Use rules and document their limits.

Note: What “reproducible” actually means here, honestly

The labels in data/placements_all_classified.csv are fully reproducible from the committed cache file. Anyone who clones the repo and runs Rscript code/classify_llm.R gets the same labels, byte-for-byte, without calling any API. That is the strong claim, and it is the one downstream analysis depends on.

The model behavior itself, in the sense of “the API will return the same answer if I call it tomorrow on a new placement string”, is weaker. Pinned model versions can be retired. Provider-side infrastructure changes can shift outputs at the margin. Even with temperature = 0, two providers can disagree about exactly how deterministic “deterministic” is. The cache is what protects the artifact from all of this. The script’s API path is the canonical one; the cache is the version that survives.

This is the real reproducibility profile of any LLM-in-the-loop method: tight on the cached labels, looser on the live API. The job of the cache, the model pinning, and the documentation is to keep the looser part out of the analysis.

A 30-second introduction to ellmer

ellmer is Posit’s R package for talking to language model APIs. It supports Anthropic, OpenAI, Google, Groq, and others under one consistent interface. The two patterns we will use today are simple.

library(ellmer)

chat <- chat_anthropic(model = "claude-haiku-4-5-20251001")
chat$chat("Classify this placement as academic, government, industry, or other: ...")

The first line creates a chat object pinned to a specific model version. The second line sends a prompt and returns the model’s response as a string. There is no boilerplate to set up an API client by hand. The model name is part of the script, which is what we want for reproducibility.
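The classifier conventions add one thing to this pattern: deterministic sampling, pinned alongside the model. Following the parameters named in the long-form classifier prompt below, the same call looks roughly like this:

library(ellmer)

# Pin the model version and set sampling to deterministic
chat <- chat_anthropic(
  model  = "claude-haiku-4-5-20251001",
  params = params(temperature = 0)
)

chat$chat("Classify this placement as academic, government, industry, or other: ...")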

In class today, you will not call the API live. The script is real, and the API path is the canonical one, but a precomputed cache covers every placement we have. The cache is what you read.

Download the cache

Before we ask Claude Code to write the classifier, we need the cache file in the repo. From the terminal at the repo root, run:

mkdir -p data/cache
curl -L -o data/cache/llm_responses.csv \
  https://raw.githubusercontent.com/arielortizbobea/aem7010/main/ai-tools/seeds/llm_responses.csv

The cache is a CSV with one row per unique placement string, containing the model name, the date the call was made, the raw model response, and the parsed label. It covers every row in the five-department panel. Confirm with:

wc -l data/cache/llm_responses.csv

You should see at least 300 rows.
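Before handing the cache to Claude Code, it is worth a thirty-second look from R. A sketch, with the column names taken from the long-form classifier prompt below:

library(readr)

cache <- read_csv("data/cache/llm_responses.csv", show_col_types = FALSE)

names(cache)         # expect placement, model, date_run, raw_response, label
nrow(cache)          # should match the wc -l count, minus the header line
table(cache$label)   # academic, government, industry, other
unique(cache$model)  # one pinned model version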

Tip: Why this is a faithful Mode B move

The cache is data, but the script is the artifact. The script defines the prompt, the model, and the parse logic. The cache is the precomputed evaluation of that script against today’s placement strings. A reviewer who runs the script with their own API key gets the same answers (modulo model drift, which the model pinning bounds). A reviewer who runs the script without an API key reads the cache. Both paths produce the same labelled CSV. That is what reproducibility looks like for an LLM in the loop.

The shared classifier prompt

CLAUDE.md already encodes the LLM classifier conventions: the package, the model pin, the temperature, the cache contract, the rule about storing raw responses. The prompt is short.

Prompt to paste into Claude Code:

Write code/classify_llm.R that produces data/placements_all_classified.csv
following the LLM classifier conventions in CLAUDE.md. Read placements from
data/placements_all.csv and the cache from data/cache/llm_responses.csv. The
script must NOT call the API when no ANTHROPIC_API_KEY is set: instead, label
any cache-miss row as "uncached" and warn with message(). Run the script and
report per-label counts.

The defensive clause matters. The course-provided cache covers every placement we will see today, so in practice no API calls fire. But if the live page changed since the pre-flight and a new row slipped in, you do not want the script to error on a missing API key in the middle of class. Telling the agent up front to treat cache misses as a soft failure with the label uncached keeps the script running, makes the gap visible, and pushes the question of “what about that row?” to where it belongs: the verification checklist.

Note: Do you need an API key today?

No. The committed cache covers every placement string in data/placements_all.csv. The script reads labels from the cache without calling the API. No ANTHROPIC_API_KEY is required to run the classifier in class.

When would you need one? When you scrape a new department, or the live page adds rows the cache has not seen, or you want to change the prompt and re-classify. At that point you set ANTHROPIC_API_KEY in your environment (Anthropic Console → API Keys, then export ANTHROPIC_API_KEY=... or set it in ~/.Renviron), and the script’s API path activates automatically. That is a homework conversation, not a class one.

The all-in-one prompt that does not lean on CLAUDE.md:

Write code/classify_llm.R, then run it. Work in this repo.

Inputs. The script reads data/placements_all.csv (columns: dept, name, year, placement, source_url) and data/cache/llm_responses.csv (columns: placement, model, date_run, raw_response, label).

What the script does.

1. Load ellmer. Define a function that creates a chat object pinned to claude-haiku-4-5-20251001 with deterministic sampling (params = params(temperature = 0)). Do NOT create the chat object eagerly: only create it when a cache miss requires it.

2. Define the classification prompt as a single string variable at the top of the script. The prompt asks the model to classify a placement into one of four labels: academic, government, industry, other. The prompt explains each label in one sentence and asks for the label as the first line of the response, lowercase, no extra text.

3. Load the cache. For every unique placement string in data/placements_all.csv, look it up in the cache by exact match. If a row is in the cache, use the cached label. If a row is missing from the cache and ANTHROPIC_API_KEY is set, call the chat object once per missing string, parse the first line of the response as the label, and append a new row to the cache with model, date_run, raw_response, and label. If a row is missing and no ANTHROPIC_API_KEY is available, assign the label "uncached" and emit a message() warning naming the row. Save the updated cache back to data/cache/llm_responses.csv at the end.

4. Join the labels back to data/placements_all.csv on placement, producing data/placements_all_classified.csv with the original five columns plus a new class_llm column.

5. Print the count by class_llm at the end with message().

Constraints. Use only ellmer, dplyr, and readr. Do not install other packages. Do not modify any other file. The script and the two CSV outputs are the artifacts.

After writing the script, run it once. Stop and report the per-label counts.

Both prompts produce the same script. The short version composes with CLAUDE.md; the long version stands alone.
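So that you recognize it in the diff, here is a hedged sketch of the cache-lookup core the script needs. Object names are illustrative, and the live-API branch is reduced to a comment; the logic to look for is the exact-match join, the uncached fallback, and the write-out.

library(readr)
library(dplyr)

placements <- read_csv("data/placements_all.csv", show_col_types = FALSE)
cache      <- read_csv("data/cache/llm_responses.csv", show_col_types = FALSE)

# Exact-match lookup of every unique placement string against the cache
labelled <- placements |>
  distinct(placement) |>
  left_join(select(cache, placement, label), by = "placement")

# Defensive clause: with no API key, cache misses become "uncached", with a warning
if (any(is.na(labelled$label)) && !nzchar(Sys.getenv("ANTHROPIC_API_KEY"))) {
  for (p in labelled$placement[is.na(labelled$label)]) {
    message("Not in cache, labelling as uncached: ", p)
  }
  labelled <- mutate(labelled, label = if_else(is.na(label), "uncached", label))
}
# (If the key is set, this is where the pinned chat object classifies each miss
#  and appends the new rows to the cache. Omitted here; the prompt spells it out.)

out <- placements |>
  left_join(rename(labelled, class_llm = label), by = "placement")

write_csv(out, "data/placements_all_classified.csv")

counts <- count(out, class_llm)
message(paste(counts$class_llm, counts$n, sep = ": ", collapse = "; "))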

Paste, wait, verify

Time: ~10 minutes. Paste the prompt. Read each tool call. Accept each edit. Claude Code writes the script, runs it, and reports the per-label counts. Because the cache covers every placement, no API call should fire. If Claude Code’s run output mentions “calling the API” or “missing in cache”, stop and re-read the cache download step.

Verification checklist (classifier)

Five checks, in order. All must pass before the commit.

  1. Does code/classify_llm.R exist and run cleanly from a fresh R session?

    Rscript code/classify_llm.R

    Read the per-label counts in the message output.

  2. Does data/placements_all_classified.csv have the right schema? Columns dept, name, year, placement, source_url, class_llm, in that order. Same row count as data/placements_all.csv.

  3. Are the labels well-formed? In R: table(read_csv("data/placements_all_classified.csv")$class_llm). The expected values are academic, government, industry, other, and possibly uncached. Anything else is a parse failure. If you see any uncached rows, the cache is incomplete: the live page added placements after the pre-flight. Flag it now. The instructor will refresh the cache and push an updated seed; you re-download and re-run.

  4. Hand-label twenty random rows. Open the classified CSV. Pick twenty rows at random across departments (a snippet for drawing the sample follows this checklist). Read each placement string. Decide for yourself which of the four labels you would assign. Compare to class_llm. Note disagreements. We will discuss them at the debrief.

  5. Read the prompt template inside the script. It is the most important variable in the file. If you cannot defend the wording, change it. The prompt is part of your method section.

If a check fails, prompt Claude Code for a patch on the specific failure. Do not accept a full rewrite.
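For the hand-labelling check, a reproducible way to draw the twenty rows, so that you and your neighbour argue about the same ones:

library(readr)
library(dplyr)

classified <- read_csv("data/placements_all_classified.csv", show_col_types = FALSE)

set.seed(2025)
classified |>
  slice_sample(n = 20) |>
  select(dept, placement, class_llm) |>
  print(n = 20)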

Commit and push #2

When all five checks pass, commit and push. The cache file goes in too: it is data the script needs to be reproducible.

  1. Stage code/classify_llm.R, data/placements_all_classified.csv, and data/cache/llm_responses.csv.
  2. Commit message: LLM classifier with cached responses.
  3. Commit. Sync Changes.
git add code/classify_llm.R data/placements_all_classified.csv data/cache/llm_responses.csv
git commit -m "LLM classifier with cached responses"
git push

Refresh github.com/<your-handle>/aem7010-ai. Click data/placements_all_classified.csv. Spot-check the labels.

Descriptive write-up

Time: ~10 minutes. The deliverable is a small output/analysis.md with one table, one figure, and two paragraphs of prose. Claude Code writes the script and drafts the prose. You verify the numbers.

A convention worth introducing: output/ versus paper/

Before we ask Claude Code to draft a report, name the line. This module has been writing into output/. From now on, treat that folder as the home for AI-drafted intermediate reports. Tables, figures, descriptive memos, and short summaries that exist to inform you or your coauthors live there. The agent can write them end-to-end. The discipline is on the numbers: every number in the prose must trace to the script’s output.

A separate folder, paper/, holds research papers and dissertation chapters. There the rules are stricter. The agent does not draft prose that survives into your manuscript. You write the words; the agent helps with code, tables, and figures. The voice and the argument are yours, because the byline is yours.

We are not creating a paper/ folder today. The point is to know that the boundary exists. When you build a research project this summer, place each document on the correct side of that boundary at the moment you create it.

Tip: A .cursorrules-style note for output/ and paper/

Some students like to commit a short output/AGENTS.md (or output/README.md) saying “this folder is for AI-drafted reports” and a paper/AGENTS.md saying “this folder is for researcher-authored prose; the agent assists with code only”. Future agents and future you will read those notes and stay on the right side of the line.

What we will produce

Four artifacts, all in output/.

  • A table in output/analysis.md: counts by dept and class_llm, with row totals.
  • A figure at output/figures/academic_share_by_dept.png: the share of academic placements per department, as a bar chart, with the across-department mean as a reference line.
  • A two-paragraph descriptive section at the top of output/analysis.md, drafted by Claude Code from the script’s output. One paragraph on data and method, one paragraph on patterns.
  • A notes section at the bottom of output/analysis.md: total rows, year range, model name, classification date.

The shared analysis prompt

The output/ versus paper/ boundary is in CLAUDE.md, so the agent already knows it can draft the prose for an output/ document. The prompt names the artifacts and the structure of the report, and that is enough.

Prompt to paste into Claude Code:

Write code/analysis.R that produces a counts table (dept x class_llm) at
output/analysis_table.csv, a horizontal bar chart of academic-share by dept
with a mean reference line at output/figures/academic_share_by_dept.png
(8x5 in, 150 dpi), and a message() block summarizing total rows, year range,
per-dept counts, the table, and the model+date from the cache.

Then write output/analysis.md as an AI-drafted report (per CLAUDE.md) with
sections in this order: ## Description (two paragraphs, every number sourced
from the message block), ## Counts by department and class (the table),
## Academic share by department (the figure), ## Notes (total rows, year
range, model, classification date). Run the script first; use its message
output as the source of truth for every number in the prose.
The all-in-one prompt that does not lean on CLAUDE.md:

Write code/analysis.R, then run it, then write output/analysis.md. Work in this repo.

Inputs. The script reads data/placements_all_classified.csv (columns: dept, name, year, placement, source_url, class_llm) and data/cache/llm_responses.csv (for model name and classification date).

What code/analysis.R does.

1. Compute a table of counts by dept and class_llm, with row totals. Save it as output/analysis_table.csv.

2. Compute the share of academic placements per department. Plot it as a horizontal bar chart with department on the y-axis and share on the x-axis, using ggplot2. Add a vertical reference line at the across-department mean. Save the plot to output/figures/academic_share_by_dept.png at 8 by 5 inches, 150 dpi.

3. Print a summary block with message() that contains, on separate lines: total rows, year range (min to max), per-department row counts, the table from step 1, and the model name and date from the cache. This block is the source of every number that goes into the write-up.

4. Use only tidyverse and base R.

Then write output/analysis.md. This is an AI-drafted intermediate report, not a research paper. Use the message block above as the source of truth for every number. Sections in this order:

1. ## Description. Two paragraphs. The first describes what was scraped, from how many departments, over what years, with the total row count, the model used for the classifier, and the classification date. The second describes one or two patterns visible in the table or figure.
2. ## Counts by department and class. The markdown table from step 1.
3. ## Academic share by department. A markdown image link to the figure.
4. ## Notes. Four lines: total rows, year range, model name, classification date.

Constraints on the prose. Every number must come directly from the message block. Do not invent numbers. Keep each paragraph to three to five sentences.

Run the script first, then write the markdown using its message output. Stop and report file paths.
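The figure is the step most worth recognizing in the diff. A minimal sketch of what step 2 of the script might look like, with illustrative object names:

library(dplyr)
library(ggplot2)

classified <- readr::read_csv("data/placements_all_classified.csv", show_col_types = FALSE)

shares <- classified |>
  group_by(dept) |>
  summarise(academic_share = mean(class_llm == "academic"), .groups = "drop")

p <- ggplot(shares, aes(x = academic_share, y = dept)) +
  geom_col() +
  geom_vline(xintercept = mean(shares$academic_share), linetype = "dashed") +
  labs(x = "Share of academic placements", y = NULL)

dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
ggsave("output/figures/academic_share_by_dept.png", p, width = 8, height = 5, dpi = 150)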

Paste and run

Time: ~7 minutes. Paste the prompt. Read the diffs for code/analysis.R and output/analysis.md. Accept the edits. Confirm the four artifacts exist.

Verification: every number in the prose traces to the script

Open output/analysis.md in VS Code. Read the two paragraphs in the ## Description section. For each number you find, locate it in the table or in the notes block at the bottom. If a number does not appear elsewhere in the document, it should not be in the prose. The agent may have drafted the words, but the numbers are checked by you.

Three small things to watch for, in order of how often they happen:

  • Rounded numbers that disagree with the table. “About 60 percent” when the table says 57.3 percent is fine; “62 percent” when the table says 57.3 percent is not. Edit the prose to match.
  • A claim that compares two departments without the comparison appearing in the figure or table. If the prose says “Berkeley places more academically than Davis”, confirm both shares are in the table.
  • A pattern statement that overreaches the data. “Academic placements have declined over the decade” is a claim about a year-by-year trend that the figure does not show. Soften, qualify, or remove.

If you find an issue, edit the prose directly in output/analysis.md. Do not re-run the agent for a wording fix; this is the kind of edit you make by hand.

Warning: The boundary, one more time

This is output/. The agent drafts the prose because the document is an intermediate report. If this were paper/, the rules would be different: you would write the prose, the agent would help only with code, and the verification would happen line by line on every sentence. Knowing which folder you are in changes which rules apply.

Final commit and push

Commit the analysis script and the four artifacts in output/.

  1. Stage code/analysis.R, output/analysis_table.csv, output/figures/academic_share_by_dept.png, and output/analysis.md.
  2. Commit message: Descriptive analysis of multi-department placements.
  3. Commit. Sync Changes.
git add code/analysis.R output/analysis_table.csv output/figures/academic_share_by_dept.png output/analysis.md
git commit -m "Descriptive analysis of multi-department placements"
git push

If you want to mark this commit as the end of the AI-tools module, tag it. From the terminal:

git tag -a v1.0-ai-tools -m "End of AI-tools module"
git push origin v1.0-ai-tools

Refresh github.com/<your-handle>/aem7010-ai. Click output/analysis.md. It renders as a small report with a heading, a table, a figure, and two paragraphs of prose.

Critic in action: the fact-checker demo

Time: ~5 minutes. The conceptual framing is in the Subagents and templates block earlier in the tutorial. Here we deploy one specific specialist from that army on the report you just produced.

The course repo ships a pre-built fact-checker.md agent. Two steps, then we watch it work.

Step 1. Download the agent file into your repo’s .claude/agents/ folder.

mkdir -p .claude/agents
curl -L -o .claude/agents/fact-checker.md \
  https://raw.githubusercontent.com/arielortizbobea/aem7010/main/ai-tools/seeds/fact-checker.md

Step 2. In Claude Code, type:

Use the fact-checker agent to review output/analysis.md against
data/placements_all_classified.csv. Report the result. Do not edit anything.

Claude Code will spawn the subagent, hand it the system prompt from fact-checker.md, and let it loose with read-only tools. The agent reads the report, extracts numeric claims, looks each one up in the table or recomputes it from the CSV with a small Bash command, and reports PASS / ROUNDED / FAIL per claim. The mechanism (an LLM doing retrieval-and-recompute through its tools, not “checking from memory”) is described in detail in How does an LLM “check” anything? earlier in this tutorial.

You should see roughly five to ten claims checked and almost all PASS. If any FAIL, that is the critic catching something the producer missed. Read the report. Decide whether to edit the prose, edit the script, or both. Whatever you decide, the critic was doing the job your eyes would otherwise have done in the same minute.

Tip: If the natural-language invocation does not trigger the subagent

Subagent invocation in Claude Code is robust most of the time but version-dependent at the margin. If after the prompt above Claude Code reads the files itself rather than spawning the fact-checker subagent (you can tell from the tool-call trace; a subagent invocation shows up explicitly), use the slash command instead:

/agents

This opens the agent picker. Select fact-checker, then provide the same instruction (review output/analysis.md against data/placements_all_classified.csv).

The lesson is the same either way: the agent is a markdown file in .claude/agents/, the invocation is a one-line ask, and the critic adds a verification layer for almost no incremental friction. The exact triggering UX evolves between Claude Code releases; the file format does not.

Tip: Why we are not designing new agents today

The demo runs in one minute because the agent file already exists. Designing a new agent from scratch (writing the system prompt, choosing the tool allowlist, iterating against examples until the agent is reliable) is its own skill, and the right place to learn it is a future session or your first independent research use. Today the lesson is that the artifact is a markdown file, the invocation is a one-line ask, and the critic adds a verification layer for almost no incremental friction.

Commit the agent

The agent file is part of the repo, just like a script.

  1. Stage .claude/agents/fact-checker.md.
  2. Commit message: Add fact-checker subagent.
  3. Commit. Sync Changes.
git add .claude/agents/fact-checker.md
git commit -m "Add fact-checker subagent"
git push

When you start a new research project this summer, copy this same fact-checker.md into the new repo’s .claude/agents/ folder. That is what reusability looks like in practice.

Debrief

What we did

In one class period, you went from an empty terminal session to a five-department study with an LLM classifier, a CLAUDE.md that encodes the project’s standing conventions, a descriptive write-up where every number traces to code, and a critic agent that double-checks the report. The repository at the end has roughly twenty files spread across code/, data/, output/, and .claude/agents/. Each file has a single purpose. Each step is reproducible from the previous step’s CSV.

That repository is small. The shape is not. Every applied paper you will write looks like this: a few scripts that scrape or load data, a few that clean and merge, a few that estimate or classify, a few that produce tables and figures, one short document that interprets the result, a CLAUDE.md that holds the rules, and one or two agents that act as critics. Your dissertation chapters can look exactly like this.

The three modes, lived

You have now used all three rungs of the AI-tools ladder, on the same running task.

| Rung | Task | What you watched it do | Where the discipline lived |
| --- | --- | --- | --- |
| Chat | one school, one script | You pasted code, it answered with code | The script you saved by hand |
| Cowork | one school, small project | The agent built scrape-plus-pipeline in your folder | The git diff of the folder |
| Claude Code | five schools, study with LLM classifier and critic | The agent built and ran a project, edit by edit, then a critic reviewed the result | The diff of every accepted edit, plus commit history, plus the critic’s report |

The bottleneck was different at each rung. Chat made you type. Cowork made you watch. Claude Code makes you read diffs. Adding a critic agent shifts part of the diff-reading to a second model. Each rung is faster than the last only if you keep doing the verification work that the constraints of the previous rung were doing for you for free, plus whatever the new layer adds.

Where each mode fits in your research life

You will not use code-native agents for everything. You should not. Use chat for paragraphs and quick questions. Use Cowork for exploratory work over real files when you do not yet know the shape of the project. Use Claude Code when the task is multi-file, multi-step, and Git is going to track the work. Use critic agents wherever the cost of a missed error is high. The choice depends on the task, not on which tool was newest.

The one rule that carries forward

Mode B. The artifact is code, including CLAUDE.md and the agent files. The audit trail is Git. The verification is yours, even when a critic agent does some of it for you. The tools will keep getting more capable. The rule will keep getting more important.

After class

  1. Confirm at least five new commits visible at github.com/<your-handle>/aem7010-ai from today’s session: CLAUDE.md, the five-department scrape, the LLM classifier with cache, the descriptive analysis, and the fact-checker agent.
  2. Pull the repo onto any other machine you work on and confirm Rscript code/stack.R, Rscript code/classify_llm.R, and Rscript code/analysis.R reproduce the outputs from a clean checkout. If they do not, the missing piece is somewhere in your dependencies; today is a good day to find it.
  3. Pick one current research project of yours. Identify one step in it that looks like one of the three rungs you used today. Try the right tool for that step this week. Commit the code and record the prompt in the project’s README so your future self can rerun it.

Stretch: design your own agent

Today’s demo used a pre-built fact-checker.md. Designing a new agent is the natural next step. Pick one job in your research where you reliably catch (or miss) errors by hand: a unit-conversion sanity check, a column-name validator, a check that every figure caption mentions a sample size. Write that role as .claude/agents/<name>.md with a tight system prompt and a minimal tool allowlist. Iterate against three or four real examples until it reliably catches the error you already know is there. Commit the file. Now you have one piece of permanent infrastructure that did not exist this morning.
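If it helps to have the shape in front of you, here is a sketch of what such a file can look like. The agent name and system prompt are invented for illustration; the frontmatter fields mirror the fact-checker.md you downloaded today, and you should confirm them against that file and the current Claude Code documentation before relying on them.

---
name: unit-checker
description: Reviews tables and reports for unit and magnitude errors. Use after any script that produces output/ artifacts.
tools: Read, Grep, Bash
---

You are a careful reviewer of empirical results. Given a table or report,
check every quantity for plausible units and magnitudes: shares between 0 and 1,
years in a sensible range, counts that are non-negative integers. Recompute
anything you can from the underlying CSV. Report PASS or FAIL per check,
with one line of evidence each. Do not edit any file.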

Tip: Where to look for inspiration

The fact-checker.md you used today is one example. The community around Claude Code maintains a small library of agent templates that map cleanly to common research roles. The course repo’s ai-tools/seeds/agents/ folder is a good starting point; bring your own modifications back to office hours.