Session 6: AI Tools I

How LLMs work, and the limits of chat

Prof. Ariel Ortiz-Bobea

2026-04-27

Where We Are in the Course

§ Tutorial: Where we are in the course

Sessions 4 and 5 gave you Git. Sessions 6, 7, and 8 give you AI.

  • Today (Session 6): chat. Claude.ai, ChatGPT, Gemini. Build the mental model. Feel the ceiling.
  • Wednesday (Session 7): agentic desktop AI. Cowork. The agent sees your files.
  • Next Monday (Session 8): code-native agents. Claude Code. Terminal-native, git-aware.

Three tools, three sessions, one framework. The discipline is the same across all three.

These sessions assume Sessions 4 and 5. Every exercise starts with git init and ends with git commit.

A Mental Model of LLMs

§ Tutorial: A mental model of LLMs

You will use these tools for the rest of your research life. You do not need the math. You need four facts.

  1. They predict the next token.
  2. They generate, they do not retrieve.
  3. They are non-deterministic.
  4. The context window is finite.

Each fact has a consequence for research.

Fact 1: They Predict the Next Token

§ Tutorial: Fact 1

Your phone guesses the next word. A large language model does the same thing, at scale. Text in, next token out, loop until stop.

  • A token is a chunk of text. “the” is one token. “understanding” is usually three.
  • The model outputs a probability distribution over the next token. Samples one. Appends. Repeats (sketched below).
  • Training yields patterns that look like reasoning. At heart: pattern completion.
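
A minimal sketch of that loop in R. The vocabulary and probabilities here are toy assumptions; a real model computes a fresh distribution over tens of thousands of tokens at every step.

    # Toy next-token loop: predict a distribution, sample, append, repeat
    vocab <- c("the", "model", "predicts", "tokens", ".")
    next_token_probs <- function(context) {
      # A real model would condition on the full context; these are fixed toys
      c(0.10, 0.25, 0.30, 0.25, 0.10)
    }
    context <- "the"
    repeat {
      tok <- sample(vocab, size = 1, prob = next_token_probs(context))
      context <- c(context, tok)                     # append the sampled token
      if (tok == "." || length(context) > 20) break  # stop token or length cap
    }
    cat(paste(context, collapse = " "))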

Research implication. Plausible continuations are useful for prose, common-pattern code, and explanations. They are risky for numbers, citations, and statistical claims. Plausible is not the same as correct.

Fact 2: They Generate, They Do Not Retrieve

§ Tutorial: Fact 2

A chat interface is not a database. Asked for a citation, it generates a string that looks like a citation, token by token.

  • Mata v. Avianca (2023): two NY attorneys filed a brief with six fabricated case citations from ChatGPT. Sanctioned.
  • Economics has its own version. Ask for a literature summary, paste the citations into Google Scholar, and a nontrivial fraction will not exist.
  • This is the architecture working as designed. It is not a bug.

Rule of thumb. If an AI gives you a citation, verify it before you use it. Paste the title into Scholar. Check the DOI. Read the paper.
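
The DOI check can be scripted. A minimal sketch, assuming the httr package and Crossref's public REST API; a 200 means Crossref knows the DOI, a 404 means it does not (a real DOI could still live in another registry, e.g. DataCite). The example DOI is a placeholder.

    # Does this DOI resolve at Crossref? (sketch, not a full verification)
    library(httr)
    doi_exists <- function(doi) {
      resp <- GET(paste0("https://api.crossref.org/works/", doi))
      status_code(resp) == 200
    }
    doi_exists("10.1000/example-doi")  # placeholder: substitute the cited DOI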

Fact 3: They Are Non-Deterministic

§ Tutorial: Fact 3

The model samples from a probability distribution. Temperature rescales that distribution before sampling: low temperature concentrates probability on the likeliest tokens; high temperature flattens it.

  • Temperature 0: most likely token every time. Still not bit-for-bit reproducible across calls.
  • Temperature ~1: the default for chat products. Varied but coherent.
  • Temperature > 1.5: often incoherent.
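
What "rescales the distribution" means, concretely. A sketch with toy logits; the softmax-with-temperature form is standard, and temperature 0 is the greedy (argmax) limit, approximated here by 0.1.

    # Temperature divides the logits before softmax (toy logits)
    softmax_t <- function(logits, temp) {
      z <- logits / temp
      exp(z - max(z)) / sum(exp(z - max(z)))  # numerically stable softmax
    }
    logits <- c(a = 2.0, b = 1.0, c = 0.2)
    round(softmax_t(logits, 0.1), 3)  # near one-hot: almost always token a
    round(softmax_t(logits, 1.0), 3)  # chat-product default: varied, coherent
    round(softmax_t(logits, 2.0), 3)  # flat: more randomness, less coherence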

Where it is exposed:

  • Chat web apps (Claude.ai, ChatGPT, Gemini): locked near 1 by the product. Not user-adjustable.
  • APIs and consoles (Anthropic, OpenAI, Google AI Studio): exposed as a parameter.

Two students, same prompt, same tool, same minute → different outputs. We will see this live in class.

A chat transcript is not a replication artifact. You cannot rerun it. Even five minutes later. Even by yourself.

Fact 4: The Context Window Is Finite

§ Tutorial: Fact 4

Everything the model “sees” lives in its context window: your prompts, its replies, any files pasted in.

  • The window has a hard size limit.
  • When you fill it, the oldest content falls out silently. No warning.
  • Long conversations degrade: the model “forgets” the framing you set at the start.
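
You can estimate how much of the window a file will eat before pasting it. A rough sketch: the ~4 characters per token figure is a common rule of thumb for English, not a property of any particular tokenizer, and the file name is hypothetical.

    # Rough token estimate for a file you are about to paste (heuristic only)
    est_tokens <- function(path, chars_per_token = 4) {
      n_chars <- sum(nchar(readLines(path, warn = FALSE)))
      ceiling(n_chars / chars_per_token)
    }
    est_tokens("scrape_placements.R")  # hypothetical file name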

Three implications:

  1. Verify every probabilistic output.
  2. Summarize long sessions and restart fresh.
  3. What you cannot verify, you cannot cite. The strictest rule of the module.

Chat as a Category

§ Tutorial: Chat as a category

Brand names rotate; the category persists: turn-based conversation, copy-paste workflow.

Good at:

  • Drafting prose and pseudocode
  • Explaining concepts at your level
  • Translating code between languages
  • Explaining error messages

Bad at:

  • Running code (cannot)
  • Seeing your files (cannot, unless pasted)
  • Remembering across sessions (does not)
  • Knowing what it does not know (tone of certainty is the same either way)

Mode A vs Mode B

§ Tutorial: Mode A vs Mode B

The single most important distinction in the module. Write it down.

  Mode A            vs          Mode B
  AI as runtime                 AI as code author
  “Do the thing.”               “Write code that does the thing.”
  Output: data                  Output: script
  Reasoning: hidden             Reasoning: visible

Mode A is faster. Mode A is almost always wrong for research.

The reproducible artifact is always code. A chat transcript is not a method. An R script your co-author can run is.
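
A concrete contrast on a hypothetical task. Mode A: paste a column of salaries into chat and ask for the mean; you get back a number you cannot re-derive. Mode B: ask for the script instead, and the artifact looks like this (file and column names are stand-ins):

    # Mode B artifact: the method is inspectable and rerunnable
    df <- read.csv("placements.csv")          # hypothetical file
    mean(df$starting_salary, na.rm = TRUE)    # hypothetical column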

Live Demo: Chat Writes a Scraper

§ Tutorial: Live demo

We scrape the Cornell Dyson PhD placements page on the projector. Three failure modes, at least one of which will surface:

  • Wrong selector. Script runs. CSV is silently wrong (see the sketch below).
  • Hallucinated function argument. Script errors out. The easiest kind of failure to catch.
  • Fabricated citation in the preamble. Looks plausible. Is not real.
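
The shape of the script chat will hand you, as a sketch. The URL and CSS selector are hypothetical stand-ins, which is the point of failure mode one: if the selector is wrong, nothing errors and the CSV is silently wrong.

    # Sketch of a chat-written scraper with rvest (URL and selector hypothetical)
    library(rvest)
    page <- read_html("https://example.edu/phd-placements")  # placeholder URL
    rows <- html_elements(page, ".placement-row")  # wrong selector -> zero rows, no error
    df <- data.frame(placement = html_text2(rows))
    nrow(df)                                       # compare against the live page
    write.csv(df, "placements.csv", row.names = FALSE)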

The tool’s confidence is not evidence of correctness.

⟶ Switch to the tutorial for the live demo on the projector (~10 min). You will see one of these three happen in real time.

The Verification Reflex

§ Tutorial: The verification reflex

Five checks, every time. The habit that recurs in every session.

  1. Does the script run end-to-end without errors?
  2. Does the row count match the live page?
  3. Can you explain every line?
  4. Is the .R file committed?
  5. Is the output either committed or explicitly gitignored?

If any fails, do not commit until it passes.
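
Checks 1 and 2 can be scripted rather than eyeballed. A minimal sketch; the file names and the expected row count are hypothetical.

    # Check 1: fresh end-to-end run; a non-zero exit status means it errored
    status <- system("Rscript scrape_placements.R")
    # Check 2: row count against what you counted on the live page
    df <- read.csv("placements.csv")
    expected_rows <- 42                 # hypothetical count from the live page
    stopifnot(status == 0, nrow(df) == expected_rows)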

Hands-On: Scrape With Chat (~20 min)

§ Tutorial: Hands-on exercise

You each pick a different department’s placement page. Cornell Dyson is reserved for Wednesday.

  • Berkeley ARE, Davis ARE, Minnesota APEC, Wisconsin AAE
  • Maryland AREC, MSU AFRE, Purdue Ag Econ

Three rules:

  • The artifact is the .R script. Not pasted data.
  • You can explain every line.
  • Commit your work before leaving, even if it does not run. A committed failure is research evidence.

⟶ Switch to the tutorial: Hands-on exercise (~20 min). Claim a department. Write the script. Run the verification reflex. Commit.

What’s Next

Wednesday (Session 7): Cowork. Same model, new interface. The agent sees your files and runs code.

Before Wednesday, four things:

  1. Install Cowork (Claude desktop app, research preview).
  2. Create aem7010-ai — a dedicated repo for the AI module. Full setup block is in the tutorial’s “For Wednesday” section. Takes five minutes.
  3. Bring a laptop with R, RStudio, and Git.
  4. Think of one research task you would want an AI to do if it could see your files.

Full walkthrough with copy-paste commands on the companion site: arielortizbobea.github.io/aem7010