Session 6: AI Tools I

How LLMs work, and the limits of chat

Slides for this session: View the slide deck (opens in your browser; press F for fullscreen). The slides are a lean anchor to the concepts below. The walkthrough on this page is the substantive material and the reference you will come back to.

Want a PDF for note-taking? Open the slides in your browser, append ?print-pdf to the URL, and use File → Print → Save as PDF. Reveal.js handles the layout. Works in Chrome, Edge, and Firefox.

Where we are in the course

Sessions 6, 7, and 8 cover AI tools for applied economics research. They are sequenced on purpose.

Session 6 (today) is about chat: the interface you already know from Claude.ai, ChatGPT, and Gemini. We build a mental model of how large language models work, so you can predict where they will help and where they will fail. The hands-on exercise uses chat to write an R script. The lesson is not “chat is bad.” It is “chat is a constrained tool, and those constraints shape how you can use it responsibly.”

Session 7 (Wednesday) moves up a level to agentic desktop AI: Cowork, and its growing category of peers. These tools see your files and run code on your machine. The guardrails change.

Session 8 (next Monday) moves one more level to code-native agents: Claude Code running in your terminal. This is closest to how professional software teams use AI. The project structure and verification habits scale up.

Three tools, three sessions, one framework. By next Monday you will be able to pick the right mode for a given research task, and explain why.

Note: These sessions assume Sessions 4 and 5

Every exercise in this module starts with git init in a clean folder and ends with a git commit. Without Git, you cannot safely let an AI touch your working directory. If you skipped the version-control sessions, review Session 4 before running the hands-on exercise.

A mental model of LLMs

You will use large language models for the rest of your research life. The single most important thing you can carry away from today is a sturdy, non-mystical picture of what they are doing. You do not need the math. You need four facts.

Fact 1: They predict the next token

Your phone predicts the next word as you type. A large language model does the same thing, at scale: given a sequence of tokens, it predicts the next token. A token is a chunk of text, often a word, sometimes a fragment. A long or rare word like understanding may split into two or three tokens; a short common word like the is a single token. Numbers and rare words split into more pieces.

The model takes everything in the conversation so far, turns it into tokens, and outputs a probability distribution over the next token. It samples one, appends it, and repeats. That is the entire architecture at the level of behavior. Text in, next token out, loop until a stop condition.

This sounds small. It is small. The magic comes from training on very large volumes of text, which teaches the model patterns that look remarkably like reasoning. But it is still pattern-completion at heart.
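
To see the loop in miniature, here is a toy next-token generator in R. It is a deliberate caricature: the "model" is a hand-written lookup table over five words, where a real LLM conditions on the entire context and covers a vocabulary of roughly a hundred thousand tokens. The shape of the loop, sample, append, repeat until a stop condition, is the part that carries over.

# Toy next-token generator. The transition table is invented for
# illustration; a real model computes these probabilities from the
# whole conversation so far.
transitions <- list(
  the      = c(model = 0.5, data = 0.3, END = 0.2),
  model    = c(predicts = 0.7, the = 0.1, END = 0.2),
  predicts = c(the = 0.6, END = 0.4),
  data     = c(END = 1.0)
)

generate <- function(start, max_tokens = 10) {
  out <- start
  current <- start
  for (i in seq_len(max_tokens)) {
    probs <- transitions[[current]]
    nxt <- sample(names(probs), size = 1, prob = probs)  # the sampling step
    if (nxt == "END") break                              # the stop condition
    out <- c(out, nxt)
    current <- nxt
  }
  paste(out, collapse = " ")
}

generate("the")  # e.g. "the model predicts the data"; varies run to run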

Tip: Why this matters for research

A tool that predicts plausible continuations is useful for drafting prose, explaining concepts, and writing code in common patterns. It is risky when the “plausible continuation” is a numeric result, a citation, or a statistical claim. Plausible is not the same as correct. The researcher’s job is to tell the difference.

Fact 2: They generate, they do not retrieve

A chat interface is not a database. When you ask it to recall the author of a paper, it does not look the citation up. It generates a string that looks like a citation, token by token. Much of the time the generated string matches a real reference because the reference appeared in training. Sometimes it does not, and you get a fabricated author, title, or DOI that is internally consistent but refers to nothing.

The clearest public example is Mata v. Avianca (2023): two New York attorneys filed a legal brief with six invented case citations that ChatGPT had generated. The cases looked plausible. They did not exist. The court sanctioned the attorneys.

Economics has its own version of this. Ask any current chat model to summarize the literature on a narrow topic. You will get a paragraph with citations. Paste each citation into Google Scholar. A nontrivial fraction will not return a hit, or will return a paper by different authors. This is not a bug. It is the architecture working as designed.

Warning: Rule of thumb for citations

If an AI tool gives you a citation, verify it before using it. Paste the title into Scholar. Check the DOI. Read the paper. “The AI said this paper exists” is not a sufficient reason to cite it.

Fact 3: They are non-deterministic

The model outputs a probability distribution over possible next tokens. A sampling step picks one. Temperature is a knob that scales that distribution before sampling. At temperature 0, the model picks the single most likely token every time. As temperature rises, the distribution flattens, lower-probability tokens get a real chance, and outputs become more varied. Past 1.5 or so, they tend to become incoherent. A value near 1 is a typical default.
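
A ten-line R sketch makes the scaling concrete. The logits below are invented numbers over a five-token vocabulary; only the mechanics of dividing by temperature and renormalizing are the point.

# Temperature scaling over a made-up five-token distribution.
logits <- c(cat = 4.0, dog = 3.2, pizza = 1.0, the = 0.5, run = 0.1)

softmax_at_temp <- function(logits, temp) {
  scaled <- logits / temp
  exp(scaled - max(scaled)) / sum(exp(scaled - max(scaled)))  # numerically stable softmax
}

round(softmax_at_temp(logits, 0.1), 3)  # near-greedy: almost all mass on "cat"
round(softmax_at_temp(logits, 1.0), 3)  # typical default: "dog" gets a real chance
round(softmax_at_temp(logits, 2.0), 3)  # flattened: even "pizza" is sampled sometimes

sample(names(logits), size = 1, prob = softmax_at_temp(logits, 1.0))  # one draw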

Whether you can set temperature depends on the product, not on the underlying model.

  • Chat web apps (chatgpt.com, Claude.ai, Gemini in the browser): locked by the product at a value near 1. Not user-adjustable.
  • APIs, playgrounds, and consoles (OpenAI, Anthropic Workbench, Google AI Studio): temperature is a parameter or slider. Anthropic accepts 0 to 1; OpenAI accepts 0 to 2.
  • Coding agents and Custom GPTs: pre-set internally, typically low. Sometimes configurable at build time; usually not surfaced in the UI.

Even at temperature 0, the serving infrastructure does not guarantee bit-for-bit identical output across calls. In chat products, determinism is neither the default nor something the user can choose.

The practical consequence: two students who type the same prompt into the same tool at the same time can get different answers. Sometimes the difference is just wording. Sometimes the difference is a different R function, a different selector, a different numerical claim. We will see this live in class.

This is why a chat transcript is not a replication artifact. You cannot cite “this was the answer the model gave me” as a stable reference. The output from a chat session is a moment in time that cannot be reproduced, even by you, even five minutes later.

Fact 4: The context window is finite

Everything the model can “see” in a conversation lives in its context window: your prompts, its replies, any files you pasted in, any system instructions. The window has a hard size limit. When you fill it, the oldest content starts falling out silently. The model does not announce this. It just starts behaving as if the beginning of the conversation never happened.

Long conversations with chat models degrade for this reason. You set up careful framing early, work on several subproblems, and then by the thirtieth message the model “forgets” the constraints you established at the start. If you have felt this and assumed the tool got worse, the tool did not get worse: it lost context.
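
There is no exact client-side way to count another model's tokens, but a crude rule of thumb, roughly four characters per English token, is enough to tell whether a file you are about to paste is a few hundred tokens or a few hundred thousand. A sketch under that assumption:

# Rough token estimate before pasting a file into chat.
# The 4-characters-per-token ratio is a rule of thumb, not a tokenizer;
# real counts vary by model and content (code and numbers run higher).
approx_tokens <- function(path) {
  n_chars <- sum(nchar(readLines(path, warn = FALSE)))
  round(n_chars / 4)
}

# approx_tokens("data/placements_davis.csv")
# Compare against the advertised window, and remember the window also
# holds every earlier message in the conversation.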

Note: Three implications for research
  1. Probabilistic outputs require verification. Always. You verify code by running it; you verify prose by reading it carefully; you verify citations by checking them.
  2. Context is fragile. Know what the model can actually see. If a long conversation matters, summarize state into a single message and start a new conversation with that summary.
  3. What you cannot verify, you cannot cite. This is the strictest rule of the module. We will come back to it in every session.

Chat as a category

The brand names are ChatGPT, Claude.ai, Gemini, and Copilot Chat today. They will rotate. The category, a turn-based conversation with a large language model, will persist. The reasoning in this section applies to whichever interface is current when you graduate.

The copy-paste workflow

Chat is the lowest-integration AI tool. You manually bring context into the conversation by pasting it, and you manually take results out by copying them. The friction is the point. You are the bridge between the chat window and your codebase. Nothing moves without you.

That friction has a side effect that turns out to be useful: chat cannot secretly do something to your files, because it has no access to them. Every change you make is one you chose to paste back into your editor. The audit trail is your commit history.

What chat is good at

Chat is strong in four areas.

  • Drafting prose and pseudocode. A first paragraph, a section summary, a skeleton for a methods paragraph. You refine it.
  • Explaining concepts at your level. Ask for an intuition behind clustered standard errors aimed at a first-year PhD student. The explanation will be a useful starting point, though you still check it against a textbook.
  • Translating code between languages. R to Python, Python to Stata, Stata to R. The translation is usually almost right and needs careful inspection.
  • Explaining error messages. Paste the error, ask what it means, get a plain-English diagnosis and suggested fix. Verify by trying the fix.

What chat is bad at

Chat has four hard limitations.

  • It cannot run code. The R script you just got might look correct and fail immediately. You do not know until you run it in your own environment.
  • It cannot see your files. Unless you paste them in. The model’s idea of your data is whatever you described in text, and descriptions lie.
  • It does not remember across sessions. Close the tab, come back tomorrow, and you are starting from zero.
  • It does not know what it does not know. The model will give you confident code for a package version that does not exist. The tone of certainty is the same whether the content is right or invented.

Mode A vs Mode B

This is the single most important distinction in the module. Write it down.

Mode A: AI as runtime. You ask the AI to do the thing. “Scrape this page and give me a table.” “Summarize this dataset.” “Classify these placements as academic or non-academic.” The output is data. The reasoning lives inside the model, hidden.

Mode B: AI as code author. You ask the AI to write code that does the thing. “Write an R script that scrapes this page and saves a CSV.” “Write an R script that produces summary statistics.” “Write an R script that classifies these placements, with commented keyword rules.” The output is code. The reasoning lives in the script, visible.

Mode A is faster in the moment. It is almost always wrong for research. Mode B is the reproducible path.
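
To make the contrast concrete, here is the shape of a Mode B artifact for the summary-statistics example. This is a hypothetical sketch, not output from any model; the file paths, column names, and the keyword rule are placeholders.

# code/summary_stats.R -- Mode B: the reasoning lives here, visible.
library(dplyr)
library(readr)

placements <- read_csv("data/placements.csv")  # hypothetical input

placements |>
  group_by(year) |>
  summarise(
    n_grads    = n(),
    n_academic = sum(grepl("University|College", placement)),  # crude, auditable keyword rule
    share_acad = n_academic / n_grads
  ) |>
  write_csv("data/summary_by_year.csv")

Every choice a referee might question, the keyword rule above most of all, is on the page, in the diff, and open to criticism. That is exactly what Mode A hides.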

Important: The reproducible research artifact is always code

A chat transcript is not a replication package. A session log is not a method. An R script that your co-author or referee can run end-to-end is. When an AI produces results for you, your question is always: what is the code that generated this? If the code does not exist, the result does not belong in a paper.

Chat happens to force Mode B. It cannot run anything, so whatever it produces as code is something you have to paste into RStudio and run yourself. The AI tools in Session 7 and 8 can run code. We will still insist on Mode B. The artifact is still the script, not the session.

Why Mode B matters

Five reasons.

First, reproducibility. Another researcher, including your future self, can rerun an R script. They cannot rerun a chat session.

Second, auditability. A reviewer can read your script and understand what you did. They cannot audit an LLM’s internal reasoning, because there is no such thing to audit.

Third, debuggability. When the result looks wrong, you can read the code line by line. In Mode A, there is nothing to read.

Fourth, toolchain independence. If you switch from Claude to ChatGPT, your R script still works. In Mode A, your result was produced by a specific model at a specific time, and cannot be regenerated.

Fifth, the Git workflow. Code fits into Git commits. Chat logs do not. Every Mode B interaction becomes a diff you can accept, reject, or revert. This is where Sessions 4-5 and Sessions 6-8 connect.

Live demo: chat writes a scraper, chat fails

We now do, on the projector, exactly the task you will do on your own laptop. The task is deliberately small and the failure modes are instructive.

The task

Scrape the Cornell Dyson Applied Economics and Management PhD placement page into a tibble with columns name, year, and placement. This is the simplest version of the exercise that will carry through Sessions 7 and 8.

The page we target is Career Outcomes: MPS, MS, & PhD in Applied Economics at https://dyson.cornell.edu/programs/graduate/placements/. Open it in a browser now. Scroll through. Notice that placements are grouped by year, and each entry contains a name and a destination. The job-market candidates for the current cycle live at a separate page, PhD Job Market Candidates, which we ignore for this exercise.

The prompt

A student prompt, typed live:

“Write an R script that scrapes the Cornell Dyson PhD placement page at https://dyson.cornell.edu/programs/graduate/placements/ and returns a tibble with columns name, year, and placement. Use rvest. Save the result to data/placements_dyson.csv.”

Simple, direct, under-specified. This is representative of how researchers actually prompt when they are trying to get unstuck.

What the model returns

The model returns an R script. Read it on the projector before running it. A typical first-draft response:

  • Loads rvest and dplyr.
  • Reads the HTML with read_html().
  • Extracts nodes with a CSS selector the model guessed from the URL (it did not load the page).
  • Parses the results into a tibble.
  • Writes a CSV.

We do not show a fixed version of the script here. The demo is live precisely because the output varies. Two points we always make on the board, regardless of what the model produces:

  1. The model did not load the page. Its selector is a guess from prior web structures, not from the actual HTML.
  2. The model will present the code with confidence. That confidence is not evidence of correctness.

Three likely failure modes

Any of the following will surface; we use whichever one comes up first.

Wrong selector. The CSS selector returns nodes, but not the nodes we want. The row count may match by accident, but the content is off by one category or one section. This is the worst failure because it is silent: the script runs without error and produces a CSV. Only a row-by-row check against the live page catches it.

Outdated or hallucinated function argument. The model writes against an old rvest API: it passes an argument that does not exist in your installed version, or uses a call pattern from before rvest 1.0.0, when html_nodes() gave way to html_elements(). The call errors out. You read the error, discover the API changed, and either upgrade your mental model or paste the error back to the model. This is the easy kind of failure because R tells you something is wrong.

Fabricated citation in the preamble. The model’s opening prose refers to a paper or blog post that does not exist. If you do not test this, you do not notice. If you do (paste the title into Scholar), you learn not to trust the preamble any more than the code.

The verification reflex

Every time you run AI-generated code, you apply the same five checks. We call this the verification reflex. It will recur in every session of this module.

  1. Does the script run end-to-end without errors? Run it. Do not paper over errors with silent fixes unless you understand them.
  2. Does the row count match the live page? Open the browser, count visible entries in the section you scraped. Compare.
  3. Can you explain every line of the script? Not “I trust the model.” You should be able to say what each line does and why.
  4. Is the .R file committed to your repo? If not, it is not a research artifact.
  5. Is the output data file either committed or explicitly gitignored? No ambiguity.

If any of these fails, do not commit until it passes. Build this habit now. It compounds across your research life.
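
Checks 1 and 2 can be a thirty-second console session. A minimal sketch, assuming the davis tag; substitute your own file names:

# Check 1: does the script run end-to-end?
source("code/scrape_davis_chat.R")

# Check 2: does the output match the live page?
out <- readr::read_csv("data/placements_davis.csv")
nrow(out)        # compare to the entries you counted in the browser
head(out)        # eyeball a few rows against the page
anyNA(out$year)  # missing years are a classic symptom of a wrong selector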

Hands-on exercise: scrape with chat

Time: ~20 minutes. Work in a new folder inside your course repo.

You are going to do the same kind of task the instructor demonstrated, but on a different department’s placement page. We reserve the Cornell Dyson page for Wednesday’s Cowork exercise so it stays fresh. Spreading the class across different pages also makes the Gallery more interesting: each HTML structure breaks in its own way.

The point is not to produce a perfect scraper. The point is to run the verification reflex and feel, first-hand, where chat is helpful and where it is not.

Setup

Prepare a clean working directory inside your course Git repo. We follow a functional layout: code/ for R scripts, data/ for inputs and processed CSVs, notes/ for short markdown notes. Folder names describe content, not the date or session number. In a terminal:

cd ~/github/my-research
mkdir -p code data notes
git status

Work inside your own research repo from Sessions 4-5 (~/github/my-research if you followed the Session 4 naming; substitute your actual folder name otherwise). On Wednesday we will spin up a separate repo for the agentic tools. Note that git status will not list the three new folders yet: Git tracks files, not empty directories, so they appear only once they contain files. You will commit after you have something worth committing.

In RStudio: open your course repo as an RStudio Project. In the Files pane, create three folders at the repo root: code, data, and notes. Work in code/ for the script and data/ for the CSV.

In VS Code: open your course repo folder. In the Explorer sidebar, create three folders at the repo root: code, data, and notes. Work in code/ and data/.

Pick a department

Claim one department from the list below. Try to spread evenly across the room: if three people near you are already doing Berkeley, pick something else. All seven pages are applied-economics PhD programs with a public placement record.

Department Page
UC Berkeley ARE https://are.berkeley.edu/job-candidates/past-placements
UC Davis ARE https://are.ucdavis.edu/phd/past-placements/
Minnesota Applied Economics https://apec.umn.edu/graduate/placement-recent-graduates
Wisconsin AAE https://aae.wisc.edu/grad/placement/
Maryland AREC https://www.arec.umd.edu/graduate/recent-placements
Michigan State AFRE https://www.canr.msu.edu/afre/graduate/recent_placements
Purdue Ag Econ https://ag.purdue.edu/department/agecon/graduate_program/career_placement.html

Before you prompt the model, open the page in your browser. Scroll through. Identify how the placements are structured (a table? a repeating block with a header year and a list of names? a drop-down by year?). You will need to recognize success or failure when the script runs.
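
You can do this reconnaissance from R as well as from the browser. A sketch, using the UC Davis URL as the example; the selectors are generic probes, not answers:

library(rvest)

page <- read_html("https://are.ucdavis.edu/phd/past-placements/")
length(html_elements(page, "table"))       # real HTML tables present?
html_text2(html_elements(page, "h2, h3"))  # do headings carry the years?

If tables exist, html_table() may get you most of the way. If placements live in repeated list items or paragraphs, the model's selector guess matters much more, and pasting a snippet of the real HTML into your prompt pays off.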

The task

Write code/scrape_<dept>_chat.R that produces data/placements_<dept>.csv with columns name, year, and placement. Use a short department tag in the filename: berkeley, davis, minnesota, wisconsin, maryland, msu, purdue.

The rules

These rules are what make this a Mode B exercise rather than a Mode A one.

  1. The final artifact is an R script, not pasted data. Even if your chat session renders a table, the research output is the .R file.
  2. You must be able to explain every line of the script. If you cannot, rewrite that line or ask the model to explain it until you can.
  3. Commit the script before you leave class, even if it does not work. A failed attempt committed with a note in the README is more useful than no record of having tried.
  4. Apply the verification reflex (above). Document what passed and what failed in the commit message or in a short notes/<dept>_chat.md.

Prompts to try

Start with a short prompt. If it fails, iterate. Try all three styles if time permits.

  • Basic: “Write an R script that scrapes <URL> and saves a CSV with columns name, year, placement.”
  • With page structure hints: paste in a snippet of the page HTML (view source in your browser and copy 20-30 relevant lines) and ask the model to target that structure.
  • With explicit Mode B framing: “Write an R script. Do not paste the scraped data into this conversation. The R script is the artifact. Use rvest. Comment every section.”

Note which prompt produced better code. The comparison is the lesson.

Commit your work

When your script works (or when you decide you have spent enough time on it), commit it. Replace <dept> below with your tag.

git add code/scrape_<dept>_chat.R
# Decide what to do with the CSV:
#   commit it (transparent, small)
#   or gitignore it (if it is large or sensitive)
# For this exercise, commit it.
git add data/placements_<dept>.csv
git commit -m "Session 6: chat-drafted scraper for <dept> placements"

In RStudio: Git pane (top-right) → check the boxes next to your scrape_<dept>_chat.R and the matching CSV → click Commit → message “Session 6: chat-drafted scraper for <dept> placements” → Commit.

In VS Code: Source Control panel → stage both files with the + icon → type the same message → press Cmd+Enter (Mac) or Ctrl+Enter (Windows).

If your script did not work, still commit it with a message that says so:

git add code/scrape_<dept>_chat.R notes/<dept>_chat.md
git commit -m "Session 6: chat-drafted scraper; script errors, see notes/<dept>_chat.md"

A committed failure is research evidence. A lost failure is not.

Debrief and bridge to Session 7

What we learned

Chat is a probabilistic pattern-completer with a copy-paste interface. It is useful for drafting, explaining, and translating. It is unreliable for citations, for numerical claims, and for anything that depends on seeing your actual data. It forces Mode B as a byproduct of its limitations, and that is a feature you should preserve in the more powerful tools we look at next.

The verification reflex is not a chat-specific habit. It is the AI-research habit. We will repeat it in Sessions 7 and 8 until it is automatic.

The ceiling of chat

What you hit today was the ceiling of a tool that cannot see your files or run your code. You worked around it by pasting in HTML snippets and copying scripts back and forth. That workflow has a cost in time and a cost in errors (you forgot to paste the new column, the model lost context after the tenth exchange).

The next two sessions lift that ceiling.

  • Session 7 (Wednesday): Cowork. Same model, same chat interface, but it can see the files in a folder you designate and run code on your machine. We all target the Cornell Dyson page together, which becomes a 10-minute task. New risks appear.
  • Session 8 (next Monday): Claude Code. A terminal-native agent that is tightly coupled to Git and to multi-file projects. We scale the scraping exercise to five departments and add a classification step.

The three-modes table

Keep this in your head for the next two sessions.

Mode 1 — Chat (Claude.ai, ChatGPT). Context: manual paste. Action: text only. Best at: first drafts, explanations.
Mode 2 — Agentic desktop (Cowork). Context: auto from a designated folder. Action: files + code. Best at: exploratory work with files.
Mode 3 — Code-native agent (Claude Code). Context: auto from the repo. Action: files + code + git. Best at: project-level work.

Today you used mode 1. Wednesday you use mode 2. Next Monday you use mode 3. Each step trades some friction for some risk, and the discipline that keeps you safe is the same across all three: Mode B, commit before, commit after, verify everything.

For Wednesday (Session 7)

Four items. Item 2 has its own setup block, below the list.

  1. Install Cowork if you do not already have it. Cowork is currently a feature of the Claude desktop app, in research preview. Download the desktop app, sign in, and enable Cowork mode. If you hit install issues we will troubleshoot in the first ten minutes of class.
  2. Create a dedicated repo for the AI module named aem7010-ai. See the setup block below. Takes about five minutes; do it before class.
  3. Bring a laptop with working R, RStudio, and Git. We confirmed the setup in Sessions 4 and 5.
  4. Think about one research task in your own work that you would want an AI to do if it could see your files. Bring it to class. We will discuss these briefly in the opening block.

Setup block for item 2: creating aem7010-ai

The Cowork and Claude Code exercises go into a separate repo, aem7010-ai, kept apart from your my-research repo so the agent’s file access stays narrow.

On GitHub. Go to https://github.com/new. Name: aem7010-ai. Public. Leave “Add a README”, “Add .gitignore”, and “Choose a license” unchecked. Click Create repository. Copy the SSH URL from the quick-setup block.

On your laptop.

cd ~/github
mkdir aem7010-ai && cd aem7010-ai
git init -b main
echo "# AEM 7010 AI module" > README.md
printf ".Rhistory\n.RData\n.Ruserdata\n.Rproj.user/\n.DS_Store\n" > .gitignore
git add README.md .gitignore
git commit -m "Initialize AI module repo"
git remote add origin git@github.com:YOURNAME/aem7010-ai.git
git push -u origin main

Replace YOURNAME with your GitHub handle. If the git push fails because SSH is not set up on this machine, use the HTTPS URL from the same quick-setup block on GitHub (https://github.com/YOURNAME/aem7010-ai.git) in place of the SSH URL.

In RStudio: File → New Project → New Directory → New Project. Name it aem7010-ai, parent ~/github/, tick Create a git repository. In the Files pane, create README.md and .gitignore with the content shown in the terminal commands above. Stage and commit from the Git pane. Then open Tools → Terminal → New Terminal and run the git remote add and git push -u origin main lines.

In VS Code: open ~/github/ as a folder, create a new folder aem7010-ai, and open it as a workspace. In the integrated terminal, run the commands from the terminal block above.

Refresh the GitHub page. You should see README.md and .gitignore. If you cannot finish this before class, flag it in the first ten minutes Wednesday.

Note: Reading that is useful but optional

If you want a longer treatment of why LLMs hallucinate, Stephen Wolfram’s 2023 essay What Is ChatGPT Doing… and Why Does It Work? is a readable non-technical explanation. For the academic integrity angle, the Nature and Science AI-disclosure policies (both updated 2023-2024) are short and worth reading before any paper submission.