Session 4: Git Fundamentals

Tracking your own work

Slides for this session: View the slide deck (opens in your browser; press F for fullscreen). The slides are a lean anchor to the concepts below; the walkthrough on this page is the substantive material.

Want a PDF for note-taking? Open the slides in your browser, append ?print-pdf to the URL, and use File → Print → Save as PDF. Reveal.js handles the layout. Works in Chrome, Edge, and Firefox.

Why Version Control?

Version control is a systematic record of how your code changes. It matters in three situations every applied economist now faces.

Yourself over time

If your project folder has ever looked like this:

analysis_v1.R
analysis_v2.R
analysis_FINAL.R
analysis_FINAL_v2.R
analysis_REALLY_FINAL.R
analysis_FINAL_USE_THIS.R

…then you already know the cost. You made choices you cannot reconstruct. You changed something that worked and cannot remember what.

Git replaces this chaos with a clean history of snapshots. Each snapshot (called a commit) records exactly what changed, when, and why. You can return to any previous version at any time.

This matters especially for managing paper submissions. Every paper moves through several versions of code: the first submission, the R&R revision, the final accepted version, the posted replication package. A reviewer asks, three months after you submitted, “what exactly did you do for the coefficient in Table 2?” Without version control, answering that question means digging through emails and folders. With Git, you mark each submission with a tag (covered in Session 5) and return to it in one command, even years later.

You and AI coding tools

Increasingly you will write research code alongside AI coding assistants: Claude Code, GitHub Copilot, Cursor, Codex, and similar tools. These systems produce dozens of lines in seconds. Most of it is useful; some of it is subtly wrong in ways you will not catch on first read.

Version control is what makes this workflow safe. Every AI-assisted change becomes a commit you can review before accepting, revert cleanly when a test breaks three hours later, or roll back selectively while keeping the parts that worked. It also produces a written record of what came from a model and what you wrote yourself. That record is increasingly expected by journal disclosure policies and by AEA replication review.

You and co-authors

Applied economics papers increasingly have two, three, or four co-authors writing code together. Someone edits the cleaning script on Monday; someone else reworks the regression specification on Tuesday. Without version control, the only way to combine those edits is to email files back and forth. That process silently destroys work.

Git (and GitHub, which we cover next session) makes collaboration explicit. Every change is attributed to a specific author, every conflict between edits is surfaced rather than overwritten, and every co-author sees the same history.

These three motivations connect the whole course

Yourself over time is the reproducibility theme (Sessions 2–3 with Lars). AI tools are Sessions 6–7. Co-authors are Session 5 (GitHub). Version control is the connective infrastructure that makes all three workable.

What Is Git?

Git is a distributed version control system. It takes a snapshot of your entire project every time you commit. Think of it as a timeline:

v1: Load data  →  v2: Clean variables  →  v3: Run regression

Each point on the timeline is a commit. Git stores the full history locally on your computer.

Important

Git ≠ GitHub. Git is the tool that runs on your computer. GitHub is a cloud hosting service where you can store and share your Git repositories online. We cover GitHub in Session 5.

Situations Where Git Helps

Before we set up Git, it helps to see what using it looks like in practice. These scenarios introduce the vocabulary you will meet throughout both sessions, in context rather than abstractly.

Working solo, months later. You wrote clean_data.R in February. It is now August, and a coefficient changed since the results you reported at a seminar in April. What did you change? With Git, every save-point is a commit. The list of commits is your log. The difference between any two versions is a diff. You find the change, understand it, and decide whether it was right.

Refactoring clunky code. Your analysis.R has grown to 600 lines and does everything: data cleaning, regressions, tables, figures. Adding anything new is painful. You want to split it into three cleaner scripts with one job each, but the refactor touches nearly every line of the project and risks introducing bugs that quietly change your results. With Git, you create a branch called restructure-scripts, do the surgery there, and verify that the refactored code produces identical output before committing to it. If it works, you merge the branch back. If it breaks something you cannot diagnose, you delete the branch and your old working code is untouched.

Working with a co-author. You are cleaning data; your co-author is writing the robustness checks. Both of you need to edit the same project simultaneously. Each of you keeps your own clone of the project. You push your changes to a shared remote on GitHub. Your co-author pulls your changes into their copy. If you both edited the same line, Git surfaces a merge conflict so one of you decides the final version. Nobody’s work is silently overwritten.

Returning to a paper revision. You submitted to a journal in March. The R&R comes back in August. You need to run new analyses but also reproduce the figures from the submitted version if a referee asks. With Git, you tag the submitted code with a name like v1.0-first-submission. A single git checkout brings it back to your screen, years later if needed.

Recovering from a mistake. You edit run_regression.R for three hours, save, close, and then realize you deleted something you needed. If the earlier version was committed, Git lets you restore the file to that commit. If you committed the deletion too, you undo the commit with reset. Git is forgiving, but only for states it knows about, which is why committing often matters.

Working with AI coding tools. You ask Claude Code or Copilot to refactor a function. It returns 40 lines of new code: some better than yours, some subtly wrong. With Git, you review the AI’s suggestion as a diff, stage only the lines that are good, and revert the commit cleanly if something breaks hours later. Without Git, AI-assisted edits are a gamble.

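Each Git term introduced in these scenarios (commit, log, diff, branch, merge, clone, remote, push, pull, tag, reset) is covered in detail in this session or the next. For now, the point is that Git is not one workflow: it is a toolkit for different research situations.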
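To make the "recovering from a mistake" scenario concrete, here is a minimal sketch run in a throwaway folder. The file contents, the Demo User identity, and the commit message are placeholders chosen for illustration, and git restore assumes Git 2.23 or newer.

```shell
# Work in a throwaway sandbox so nothing touches a real project
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"            # local to this sandbox only
git config user.email "demo@example.com"

# Commit a working version of the script
echo 'wages <- wooldridge::wage1' > run_regression.R
git add run_regression.R
git commit -q -m "Add regression script"

# Simulate the mistake: overwrite the file with something broken
echo 'oops' > run_regression.R

# See what differs from the last commit, then restore the committed version
git diff --stat
git restore run_regression.R
restored=$(cat run_regression.R)
echo "$restored"
```

The restore only works because the good version was committed first, which is the point of the scenario.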

Setup

You should already have Git installed from the pre-class setup on the Preliminaries page. We now need to verify the installation and tell Git who you are, so it can attach your name to every change you make. We go step by step.

Step 1: Open your terminal

Every command in this session is typed into a terminal: a program that lets you send instructions to your computer as text.

On Mac: open the Terminal app. Find it by pressing Cmd+Space to open Spotlight, typing “Terminal”, and pressing Enter.
On Windows: open Git Bash (installed together with Git). Find it by pressing the Windows key, typing “Git Bash”, and pressing Enter.

A window opens with a few lines of text ending in a prompt that looks like yourname@yourlaptop ~ % on Mac (or ~ $ if your Mac uses the older bash shell) or yourname@yourlaptop MINGW64 ~ $ on Windows. The $ or % marks where your typing goes. You write a command after it and press Enter to run it. The rest of this page shows the prompt as $.

Keep this window open for the rest of the session.

Step 2: Confirm Git is installed

At the prompt, type the following and press Enter:

git --version

You should see output like:

git version 2.39.0

The exact version number will differ. As long as the output begins with git version, Git is installed and you can move on. If instead you see command not found or a similar error, Git is not installed on this machine. Go back to the Preliminaries page and follow the install instructions before continuing.
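If you want a check that reports both outcomes explicitly, the following sketch may help. It relies only on the standard shell built-in command -v; the git_status variable name is an illustrative choice.

```shell
# Report whether the git program can be found on this machine
if command -v git >/dev/null 2>&1; then
  git_status="installed"
  echo "Git found at: $(command -v git)"
  git --version
else
  git_status="missing"
  echo "Git not found; see the Preliminaries page for install instructions."
fi
```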

Step 3: Check whether your identity is already set

Git attaches a name and email to every change you make. Before setting these values, check whether they are already set from earlier coursework or another project.

At the prompt, type the following, pressing Enter after each line:

git config --global user.name
git config --global user.email

If both commands print a name and an email (one per command), Git already knows who you are. You can skip to The Three Areas of Git below.

If either command prints nothing (a blank line, or an empty prompt), continue to Step 4.
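The same check can be written as a small loop that says in words which values are missing. This is a sketch, not a required step; it only reads your configuration and changes nothing. The checked counter is an illustrative name.

```shell
checked=0
for key in user.name user.email; do
  # git config --get exits with a non-zero status when the key is not set
  if value=$(git config --global --get "$key"); then
    echo "$key is set to: $value"
  else
    echo "$key is not set; continue to Step 4"
  fi
  checked=$((checked + 1))
done
```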

Step 4: Set your identity

Both commands use the --global flag, which means the setting is stored once for your user account on this computer and applied automatically to every Git project you work on here. You do this once; you do not repeat it for each project.

Replace "Your Name" and "you@email.com" with your actual name and email, then run each command:

git config --global user.name "Your Name"
git config --global user.email "you@email.com"

You will not see any output after either command. Silence means success. Git now knows who you are.

Privacy: your email becomes public on every commit

If you plan to push your work to public GitHub repositories, the email you set above will be visible in the commit log to anyone who views the repo. If you prefer to keep your real email private, GitHub provides a stable noreply address you can use instead. We cover the setup in Session 5.

Step 5: Verify the settings

Run the two check commands from Step 3 again:

git config --global user.name
git config --global user.email

Both should now return the name and email you just set. Setup is complete.

Want to see everything Git has configured?

git config --global --list

This prints your full global configuration (name, email, default editor, default branch name, and any other settings). Useful for diagnosing “why is Git acting strange” later.
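When diagnosing odd behavior, it can also help to see which file a setting comes from. The sketch below uses a throwaway repository and a made-up email address to show that a setting written without --global lives in that project's own .git/config file. The --show-origin flag is available in reasonably recent Git versions.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

# No --global flag: this value is stored in this repository only
git config user.email "project-specific@example.com"

# --show-origin prefixes the value with the file it was read from
origin=$(git config --show-origin --get user.email)
echo "$origin"
```

A project-level value like this takes precedence over the global one inside that project, which is occasionally the explanation for a commit carrying an unexpected email.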

The Three Areas of Git

Every Git project has three areas. Understanding them is the key mental model:

| Area              | What it is                         | How you interact              |
|-------------------|------------------------------------|-------------------------------|
| Working Directory | The files you see and edit         | You edit files normally       |
| Staging Area      | A holding zone for the next commit | `git add` moves files here    |
| Repository        | The permanent history of commits   | `git commit` saves a snapshot |

The workflow is: edit → stage → commit.

A mental model: the shopping cart

Think of Git’s workflow as online shopping.

Working directory is browsing the store and dropping items into your cart. You can add, remove, change your mind freely.
Staging area is the checkout page where you review your cart before ordering. You can still remove things or add more.
Repository is your order history. Every order is timestamped and permanent.

The command mapping is straightforward:

git add moves a file to the checkout page.
git restore --staged removes a file from the checkout page (we cover this later).
git commit presses Place Order. You get a confirmation number (the commit hash), and the order joins your permanent history.
git log is your order history page.

The staging step exists for the same reason stores have a checkout review: before you make anything permanent, you want to see exactly what you are about to commit to.
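You can watch a file move through the three areas in a throwaway folder. The sketch below uses git status --short, which prints a compact two-character code per file; notes.R, its contents, and the Demo User identity are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"            # local to this sandbox only
git config user.email "demo@example.com"

# 1. Working directory: the file exists but Git is not tracking it
echo "x <- 1" > notes.R
in_working_dir=$(git status --short)
echo "$in_working_dir"                      # ?? means untracked

# 2. Staging area: the file is queued for the next commit
git add notes.R
in_staging=$(git status --short)
echo "$in_staging"                          # A means added (staged)

# 3. Repository: the snapshot is saved and nothing is pending
git commit -q -m "Add notes script"
after_commit=$(git status --short)
git log --oneline
```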

Your First Repository

Keep Git repositories out of cloud-sync folders

Before creating your first repository, make sure you are not inside Dropbox, Box, iCloud Drive, OneDrive, or Google Drive. These services can race against Git’s internal file writes and silently corrupt your repository. GitHub itself is the cloud backup for Git; layering another cloud sync on top creates conflicts.

A common convention is to keep all Git projects in a single dedicated folder in your home directory, for example ~/github/. Each project is a subfolder of that.

If you do not have this folder yet, create it now:

mkdir -p ~/github   # -p: no error if the folder already exists
cd ~/github

From now on, all mkdir some-new-project commands in this tutorial assume you are inside ~/github/.
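If you are unsure whether your current folder sits inside a synced location, a rough check like the one below can help. It is only a heuristic sketch: in_cloud_sync_folder is a hypothetical helper that looks for common folder names in the path, and your sync service may use a different name.

```shell
# Hypothetical helper: succeeds if the path contains a common cloud-sync folder name
in_cloud_sync_folder() {
  case "$1" in
    *Dropbox*|*OneDrive*|*"Google Drive"*|*GoogleDrive*|*iCloud*|*"Mobile Documents"*|*/Box|*/Box/*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

if in_cloud_sync_folder "$PWD"; then
  echo "Warning: $PWD looks like a cloud-sync folder; consider working in ~/github/ instead"
else
  echo "OK: $PWD does not look like a cloud-sync folder"
fi
```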

Step 1: Create a project folder

At the prompt (still inside ~/github/ from the callout above), type the following two commands, pressing Enter after each:

mkdir my-research
cd my-research

mkdir my-research creates a new empty folder called my-research.
cd my-research changes your current location to inside that folder.

Both commands produce no output. Your prompt should now end with my-research, something like yourname@yourlaptop my-research $. That ending signals you are now inside the project folder.

Step 2: Initialize Git

At the prompt, type:

git init

You should see output similar to:

Initialized empty Git repository in /Users/yourname/github/my-research/.git/

This command created a hidden .git/ folder inside my-research. That folder is where Git stores all version history, configurations, and internal bookkeeping. You never open it or touch its contents directly.

Caution

Do not delete the .git/ folder. Deleting it erases all the history of your project. The folder is hidden by default, so you are unlikely to find it by accident, but if you ever see it in a file browser with “show hidden files” turned on, leave it alone.
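To see the hidden folder without opening it, you can list hidden files from the terminal. The sketch below does this in a throwaway folder rather than in my-research, and also asks Git whether the current folder is inside a repository.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

# ls -a includes hidden entries, so .git shows up in the listing
ls -a

# Ask Git directly whether this folder is part of a repository
inside=$(git rev-parse --is-inside-work-tree)
echo "$inside"
```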

Step 3: Check the status

At the prompt, type:

git status

You should see output that looks roughly like this:

On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Three things to notice:

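You are on branch main. Every Git repository has at least one branch. This course assumes the default branch is called main; if your output says master instead, your Git installation uses the older default name, and every command in this session works the same way.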
No commits yet. Git knows the folder is now a repository, but you have not saved any snapshots of work yet.
Nothing to commit. The folder contains no files yet, so Git has nothing to track.

This is your starting point. In the next section, you will create a file and take the first snapshot.
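If you want to check or change the branch name of a brand-new repository from the command line, the following sketch shows one way, run in a throwaway folder. It uses git symbolic-ref because that works even before the first commit exists; treat it as one option among several, not the required procedure for this course.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

# Which branch does HEAD point to right after git init?
before=$(git symbolic-ref --short HEAD)
echo "Branch right after git init: $before"

# Point HEAD at a branch called main (safe here because no commits exist yet)
git symbolic-ref HEAD refs/heads/main
after=$(git symbolic-ref --short HEAD)
echo "Branch now: $after"
```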

Staging and Committing

Create a file

Do not use TextEdit or Notepad for code files

macOS TextEdit defaults to rich text (RTF), which embeds invisible formatting markup in the file, and even in plain-text mode it tends to append .txt to your filenames, so clean_data.R can become clean_data.R.txt. Windows Notepad has similar encoding and line-ending issues that break code files.

Use RStudio (File → New File → R Script) or VS Code (File → New File) instead. Both are installed as part of the Preliminaries setup and handle plain-text code files correctly. For Terminal users, running code . inside your project folder opens the whole folder in VS Code in one step.

We create the file in two steps: first open the project in your editor, then create the file. Opening the project first is what lets your editor see the Git state live. Pick whichever editor you want to use:

Open the folder in VS Code. Launch VS Code (from Applications, Spotlight, or the Start menu). Use File → Open Folder, browse to my-research, and click Open.

VS Code opens a new window showing the contents of my-research in the left file explorer. The Source Control panel (the three-node fork icon in the left activity bar) now displays the Git state for this repository.

Optional shortcut: opening a folder from the terminal with code .

Once you work with Git often, clicking through File → Open Folder each time becomes tedious. VS Code provides a shortcut: from your terminal, inside any folder, type code . and press Enter. The . means “this folder”, and VS Code opens directly into it.

Why this is useful now. You are already navigating to folders in your terminal to run Git commands. Being able to jump into the editor from that same spot, without switching to the File menu, keeps your workflow in one place. It also guarantees you open the exact folder you are currently in, so you cannot accidentally open the wrong one.

Why this pays off later. Once you eventually SSH into Cornell’s CAC cluster, FSRDC secure computing environments, or a cloud VM to run analyses too large for your laptop, VS Code’s Remote SSH extension lets you type code . inside the remote terminal and have your local VS Code window show and edit files that actually live on the remote machine. The editor feels local; the files are remote. This workflow only works if the code command is on your PATH. Setting it up now means it just works the first time you need it, possibly years from now.

One-time setup:

Open VS Code.
Press Cmd+Shift+P (Mac) or Ctrl+Shift+P (Windows/Linux) to open the Command Palette.
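Start typing Shell Command. Select Shell Command: Install ‘code’ command in PATH. (This entry appears on macOS. On Windows and Linux the VS Code installer normally adds code to your PATH already, so you can skip this step if code . works.)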
Close and reopen your terminal window (the old one caches the PATH).

Now, from any terminal, code . opens the current folder in VS Code. If you skip this setup, File → Open Folder still works fine.

Create the file. In VS Code, use File → New File (or Cmd+N / Ctrl+N). Paste the content below. Use File → Save As and name the file clean_data.R, making sure it saves inside the my-research folder.

Open the folder as an RStudio Project. RStudio needs a Project (a .Rproj file) for Git integration to appear. Go to File → New Project → Existing Directory, browse to your my-research folder, and click Create Project. RStudio restarts in project mode. The Git pane appears in the top-right, connected to your local repository.

Create the file. Use File → New File → R Script. Paste the content below. Save with File → Save, naming the file clean_data.R. RStudio saves it inside the project folder by default.

If you are using neither VS Code nor RStudio, create the file in any other plain-text editor you have (not TextEdit or Notepad; see the warning above). Paste the content below and save as clean_data.R inside my-research.

The content for the file:

The script uses the wooldridge R package, which bundles Jeffrey Wooldridge’s teaching datasets. Install it once (in R or RStudio’s console) before running the script:

install.packages("wooldridge")

# clean_data.R
# Load and clean the wage dataset
# Uses wooldridge::wage1 (Jeffrey Wooldridge's teaching dataset; 526 obs)

wages <- wooldridge::wage1

# Remove observations with missing wages (none in wage1, but real data always needs this)
wages <- wages[!is.na(wages$wage), ]

# Log transformation
wages$log_wage <- log(wages$wage)

After saving, verify back in your Terminal (still inside my-research):

ls

You should see clean_data.R listed. Now run:

git status

You should see something like:

On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    clean_data.R

nothing added to commit but untracked files present (use "git add" to track)

clean_data.R appears under Untracked files. Git has noticed the file exists but is not yet tracking it. The next step stages it.

If you have my-research open in VS Code, clean_data.R appears in the file tree marked with a green U (untracked). In RStudio, the Git pane lists it with a yellow ? icon. You do not need to do anything in the editor; it already reflects what git status just told you.

Stage the file

Three interfaces, same Git

From here on, command blocks appear in tabs: Terminal, RStudio, and VS Code. All three do exactly the same thing. In class we walk through the Terminal tab on the projector.

Why Terminal for teaching? Four reasons worth naming up front.

The mental model is cleaner. edit → stage → commit is three explicit commands. In a GUI, staging is often just a checkbox, and students never really register what the operation is until something breaks. A rigorous researcher needs to understand the operation, not just the button.
Terminal Git works everywhere. GUIs do not. You will eventually use Cornell’s CAC cluster, FSRDC secure computing environments, AWS or Google Cloud VMs, or any other remote machine. None of those have a GUI. A researcher who only knows the RStudio Git pane is stuck the first time they work with a dataset that does not fit on their laptop.
Documentation, error messages, and AI tools all speak command-line. Every answer on Stack Overflow, every chapter of the Pro Git book and Happy Git with R, every response from Claude Code, GitHub Copilot, and Cursor will tell you to run a Git command. If the only Git you know is “I clicked the button in RStudio”, you cannot use any of those resources to get unstuck.
Git is scriptable. Researchers use system("git log -1 --pretty=format:%h", intern = TRUE) inside R to capture the current commit hash and embed it in an output file for reproducibility. They run git archive inside a Makefile to package data. They call git tag automatically from a submission script at paper-submission time. None of this is available through a GUI.

The bottom line. We teach commands in class so you have the vocabulary, the portability, and the safety net when things go wrong. Once you understand what each command does, feel free to use RStudio or VS Code for daily work. The tabs on this page show the equivalent actions in each interface.
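As a small illustration of the scripting point, the sketch below records the current commit hash in a text file from the shell. The repository, analysis.R, and results_header.txt are placeholders; in a real project you would run only the last three lines inside your own repository.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"            # local to this sandbox only
git config user.email "demo@example.com"
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Capture the short hash of the current commit and stamp it into an output file
hash=$(git rev-parse --short HEAD)
echo "Results generated from commit $hash" > results_header.txt
cat results_header.txt
```

Anyone reading the output file later can check out that exact commit to see the code that produced it.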

Terminal. At the prompt, run:

git add clean_data.R

RStudio. Click Commit in the Git pane (top-right of the IDE). The Review Changes dialog opens. Select clean_data.R in the top pane. The bottom pane shows your diff. Click the Stage button, or tick the checkbox next to the file. It moves from the Unstaged list to the Staged list.

Quick shortcut. If you already know what you are staging and do not need to see the diff, tick the Staged checkbox directly in the Git pane without opening the dialog.

No Git pane? The project must be an RStudio Project (.Rproj) inside a Git repository. Tools → Project Options → Git/SVN → Version control system: Git, then restart RStudio.

VS Code. Open the Source Control panel (the three-node fork icon in the left activity bar, or ⌃⇧G / Ctrl+Shift+G). Under Changes, click clean_data.R. A diff view opens in the editor showing your changes. In the Source Control panel, hover over clean_data.R and click the + icon to stage. The file moves from Changes to Staged Changes.

Quick shortcut. If you already know what you are staging, click the + directly without previewing the diff.

All three do the same thing: they tell Git “include this file in the next commit.”

Check the status again

At the prompt, run:

git status

You should see output like:

On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   clean_data.R

The key line is new file: clean_data.R under “Changes to be committed”. This confirms the file is staged, that is, ready to be included in the next commit.
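Before committing, you may also want to see the staged content itself, not just the file name. The sketch below uses git diff --staged in a throwaway folder; the one-line file content is a placeholder for the real script.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

printf 'wages <- wooldridge::wage1\n' > clean_data.R
git add clean_data.R

# List the staged files, then show exactly what the next commit would contain
staged=$(git diff --staged --name-only)
echo "$staged"
git diff --staged
```

This is the terminal counterpart of the checkout-page review in the shopping-cart analogy.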

Your editor shows the same state in real time

If you have VS Code or RStudio open on this folder right now, they are already displaying everything git status just told you. You do not need to run the command to know the state of your repo; your editor is watching.

VS Code. The Source Control panel (branch icon, left activity bar) lists all modified and staged files, updating live. The file tree shows letters next to each file: U for untracked, M for modified, A for added/staged. Open a file and the gutter next to the line numbers shows green/blue/red bars for added, modified, and deleted lines compared to the last commit.

RStudio. The Git pane (top-right) shows the same information with status icons. It refreshes automatically every time you save a file or run a Git command in the Terminal.

This is one of the underrated benefits of keeping an editor open on your project: the Git state is ambient, not something you have to query. You can run Terminal commands in class and watch the editor reflect each change instantly.

Commit

At the prompt, run:

git commit -m "Add data cleaning script"

You should see output like:

[main (root-commit) a1b2c3d] Add data cleaning script
 1 file changed, 11 insertions(+)
 create mode 100644 clean_data.R

A few details to notice:

The first line shows the branch (main), a note that this is the root commit (the very first commit in the repository), and a short commit hash (a1b2c3d in the example; yours will differ).
The following lines summarize what changed: one file added with eleven new lines (blank lines count).
The -m flag lets you write the commit message inline without opening an editor. Every commit needs a message. The message explains why the change happened.

If you forget -m, Git opens its default text editor (usually vi, sometimes nano) so you can type a multi-line message. If you are unfamiliar with these editors, the terminal can feel stuck. Escape routes:

In vi / vim: press Esc, then type :wq and press Enter to save and quit. Or :q! Enter to quit without saving, which cancels the commit because the message is empty.
In nano: press Ctrl+X, then Y, then Enter.

Next time, include -m "your message" to avoid the editor entirely.
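If you want a longer message without opening an editor, one option is to pass -m more than once: each -m becomes its own paragraph. The sketch below shows this in a throwaway folder; the message text and file content are illustrative.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"            # local to this sandbox only
git config user.email "demo@example.com"
echo 'wages <- wooldridge::wage1' > clean_data.R
git add clean_data.R

# First -m is the subject line; second -m becomes the body paragraph
git commit -q -m "Add data cleaning script" -m "Loads wage1 and adds a log wage column."

subject=$(git log -1 --format=%s)
body=$(git log -1 --format=%b)
echo "Subject: $subject"
echo "Body:    $body"
```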

Run git status again:

git status

You should now see:

On branch main
nothing to commit, working tree clean

“Working tree clean” means every change in your project folder has been committed. This is the state you want to leave your project in at the end of each work session.

In VS Code, clean_data.R disappears from the Source Control panel. The Staged and Changes lists are both empty. In RStudio, the Git pane empties out. Empty panels mean there is nothing pending, which matches “working tree clean” in the Terminal.

Make another change and commit

The workflow repeats for every change. Let us practice once.

Open clean_data.R in your editor (RStudio or VS Code). Add these two lines at the end of the file:

# Keep only workers with at least a high school education
wages <- wages[wages$educ >= 12, ]

Save the file.

Back at the Terminal prompt, stage the change and commit it:

git add clean_data.R
git commit -m "Filter to workers with high school education or more"

You should see a commit message similar to before, this time reporting one file changed and two lines added. Run git status to confirm the working tree is clean again.

You now have two commits in the project history. You can browse them visually in your editor.

RStudio. In the Git pane (top-right), click the History button (clock icon). A dialog opens listing every commit with its author, date, and message. Click any commit to see its diff in the lower pane. Both of your commits are there, newest on top.

VS Code. Click on clean_data.R in the file explorer to open it. In the explorer sidebar, expand the Timeline section at the bottom. It lists every version of this file in the commit history, with your commit messages as labels. Click any entry to open the diff between that version and the one before it. For a full repo-wide commit graph, install the free Git Graph or GitLens extension from the Extensions marketplace.

Writing good commits

A commit has two parts: what changed (the files you staged) and why (the message). Both matter.

Be specific in the message. “Fix bug” is unhelpful. “Fix off-by-one error in sample selection” is useful. Your future self, three months from now, will thank you. Your co-authors will thank you even more.

One logical change per commit. If you need the word “and” to describe what a commit does, it is probably two commits. Good messages read like a research changelog:

Add control for state fixed effects
Switch to winsorized outcome at 1%
Fix sample filter for pre-2000 observations

Mixing unrelated changes in one commit makes it impossible to revert just one of them later.

Back to the shopping cart. Each commit is an order. If an order contains one coherent purchase (all the groceries for the week), returning it makes sense. If it contains a book, a stapler, and a bag of flour lumped together, returning just one item is painful. Before committing, ask yourself: if I had to undo just this commit six months from now, would that make sense as a unit?

Different styles exist. Researchers differ in how granular their commits are:

Atomic commits. Every small change is a commit. Very detailed history. Easy to review and revert. Standard in open source software. Recommended by Jenny Bryan in Happy Git with R, the canonical Git guide for R users.
Feature commits. One commit per completed task, for example “baseline regression table done”. Bigger commits but still coherent. More common in solo research.
End-of-session commits. One commit per day, bundling everything. Tempting but loses the ability to audit what you did and where.

For PhD research, aim between atomic and feature. Commit after each logical step in your workflow: data pulled, sample selected, variables built, model estimated, figure produced. Roughly one commit per line you would write in a research diary.

When to commit. Whenever the code runs and produces the intended result after your change. Also before starting something risky (a refactor, an alternative specification), and before you stop working for the day.

Avoid these messages. "Fixed stuff", "WIP", "Lots of changes", ".". They mean you skipped thinking about what you changed. Either split the commit into pieces with meaningful messages, or think for another ten seconds about what to write.
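The "one logical change per commit" rule can be sketched as follows: two unrelated edits in two files are staged and committed separately, so either one can be undone later without touching the other. File contents and messages are placeholders, run in a throwaway folder.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"            # local to this sandbox only
git config user.email "demo@example.com"
echo '# cleaning' > clean_data.R
echo '# regression' > run_regression.R
git add clean_data.R run_regression.R
git commit -q -m "Add project scripts"

# Two unrelated edits made in the same work session
echo '# winsorize outcome at 1%' >> clean_data.R
echo '# add state fixed effects' >> run_regression.R

# Stage and commit each change on its own
git add clean_data.R
git commit -q -m "Switch to winsorized outcome at 1%"
git add run_regression.R
git commit -q -m "Add control for state fixed effects"

count=$(git rev-list --count HEAD)
last_files=$(git diff-tree --no-commit-id --name-only -r HEAD)
git log --oneline
```

When two unrelated edits land in the same file, git add -p lets you stage them piece by piece interactively; it is not run here because it waits for keyboard input.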

Exercise 1: Add a Second Script and Commit It

Time: ~5 minutes. Work in your existing my-research project.

So far you have been typing along with the walkthrough. In this exercise you apply the same loop independently with a new file. Pick the interface you want to use for the rest of the course and practice the full workflow in it: create a file, stage it, commit it, check the log.

The task

Add a second R script to my-research called run_regression.R with the content below. Stage it, commit it with a meaningful message, then view the resulting history.

# run_regression.R
# Estimate a Mincer wage equation on wooldridge::wage1

wages <- wooldridge::wage1

# Basic OLS
model1 <- lm(log(wage) ~ educ + exper + tenure, data = wages)
summary(model1)

How to do it in each interface

Terminal. Create run_regression.R in your editor (RStudio or VS Code) and save it inside my-research with the content above.
Back at the prompt, confirm Git sees it:

git status

You should see run_regression.R under Untracked files.

Stage, commit, and check the log:

git add run_regression.R
git commit -m "Add baseline Mincer wage regression"
git log --oneline

You should see your new commit on top, followed by the commits from the walkthrough. Three commits total (or more, if you made extra commits along the way).

RStudio. With my-research open as an RStudio Project, use File → New File → R Script. Paste the content. File → Save and name it run_regression.R.
The Git pane (top-right) now lists run_regression.R as untracked (yellow ?).
Click the Commit button in the Git pane. The Review Changes dialog opens.
Tick the Staged checkbox next to run_regression.R (or click the Stage button).
In the message box at the top, type Add baseline Mincer wage regression.
Click Commit. A dialog confirms the commit was made; close it.
Click the History button (clock icon) in the Git pane. You should see your new commit on top, followed by the walkthrough commits.

VS Code. With my-research open as a VS Code folder, use File → New File. Paste the content. Save with File → Save As and name it run_regression.R inside the my-research folder.
In the Source Control panel (branch icon, left activity bar), run_regression.R appears under Changes.
Hover over run_regression.R and click the + icon to stage it. It moves to Staged Changes.
In the message box at the top of the Source Control panel, type Add baseline Mincer wage regression.
Press Cmd+Enter on Mac or Ctrl+Enter on Windows/Linux (or click the checkmark icon) to commit.
Open run_regression.R and expand the Timeline section in the file explorer. Your new commit appears at the top.

Expected result

However you did it, at the end your repository has one additional commit. Running git log --oneline in the Terminal should produce something like:

a9b8c7d Add baseline Mincer wage regression
7f3a2b1 Filter to workers with high school education or more
2c5e8a1 Add data cleaning script

Newest on top. Three commits, one per logical change, each with a meaningful message. This is the habit we want.

Viewing History

You now have three commits in your project. Git offers two commands to explore that history: git log lists commits, and git diff shows changes between them.

git log: the commit list

At the prompt, run:

git log

You should see output like:

commit 7f3a2b1c9e5d4f6a8b2c1d3e4f5a6b7c8d9e0f1a (HEAD -> main)
Author: Your Name <you@email.com>
Date:   Fri Apr 18 15:30:22 2026 -0400

    Filter to workers with high school education or more

commit 2c5e8a1b3d9f4e7a6b2c1d3e4f5a6b7c8d9e0f1a
Author: Your Name <you@email.com>
Date:   Fri Apr 18 15:25:10 2026 -0400

    Add data cleaning script

Each commit entry shows four fields: the commit hash (a 40-character unique ID), the author, the date, and the message. Your hashes will differ; they are generated from the content and metadata of each commit.

What is HEAD?

In the output above, the newest commit is annotated (HEAD -> main). HEAD is Git’s pointer to your current commit. Think of it as a bookmark: it marks where you are in the repository’s history.

When you commit, HEAD moves forward to point at the new commit.
When you undo a commit (git reset --soft HEAD~1, covered below), HEAD moves backward.
HEAD~1 means “one commit before HEAD”; HEAD~2 means two before, and so on.

HEAD -> main means “HEAD points to the tip of the branch called main.” In the undoing-mistakes diagrams below, ← HEAD marks the commit you are currently on.
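To watch these names resolve to actual commits, the sketch below builds a throwaway repository in a temporary folder and asks Git what HEAD and HEAD~1 refer to. The folder, the commit messages, and the "Sandbox User" identity are placeholders invented for the illustration; git rev-parse is the standard command for translating a name like HEAD into a hash. The snippet changes directory into the scratch folder, so run cd ~/github/my-research afterwards to return to your project.

```shell
# Throwaway repository in a temporary folder; my-research is not touched.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q
git config user.name "Sandbox User"            # placeholder identity, local to this repo
git config user.email "sandbox@example.com"

printf '%s\n' '# step 1' > clean_data.R
git add clean_data.R
git commit -q -m "First commit"

printf '%s\n' '# step 2' >> clean_data.R
git add clean_data.R
git commit -q -m "Second commit"

# HEAD resolves to the newest commit; HEAD~1 to the one before it.
git rev-parse --short HEAD
git rev-parse --short HEAD~1

# The same names work wherever Git expects a commit.
git --no-pager log --oneline -n 1 HEAD~1       # prints the hash and "First commit"
```

The two hashes printed by git rev-parse differ, and the second one matches the hash shown in front of "First commit".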

The output may open in a pager

If your history fills more than one screen, Git opens the output in a pager program called less. The screen fills and the prompt disappears. Navigate with the arrow keys, Page Down, or Space. To exit and return to your prompt, press q. This is one of the most common “I am stuck in Git” moments for new users.

For a compact one-line-per-commit view:

git log --oneline

Output (showing only the two walkthrough commits; an Exercise 1 commit would appear above them):

7f3a2b1 Filter to workers with high school education or more
2c5e8a1 Add data cleaning script

The seven-character hashes are short prefixes of the full ID. They are unique enough to refer to a specific commit in a small repository and much easier to read.
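A few variations on git log are worth knowing once the history grows. The sketch below, run in a scratch repository with made-up commits, limits the list with -n, bypasses the pager with git --no-pager, and prints a custom one-line layout with --pretty=format (%h is the short hash, %ad the author date, %s the message). The scratch folder and the "Sandbox User" identity are assumptions for the demonstration only.

```shell
# Scratch repository with three small commits.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q
git config user.name "Sandbox User"            # placeholder identity, local to this repo
git config user.email "sandbox@example.com"

for step in one two three; do
  printf '%s\n' "# $step" >> clean_data.R
  git add clean_data.R
  git commit -q -m "Change $step"
done

# Only the two most recent commits, printed directly (no pager, no need for q).
git --no-pager log --oneline -n 2

# Short hash, date, and message on one line per commit.
git --no-pager log --date=short --pretty=format:'%h  %ad  %s'
echo    # the format: form does not end the last line with a newline
```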

git diff: what changed

git diff shows what you have changed in the working directory but not yet staged. With nothing staged, that is everything you have edited since the last commit. Run it now:

git diff

You should see no output, and the prompt returns right away. Empty output means the working tree is clean: nothing has been modified since the last commit, so there is nothing to diff.

To see git diff in action, open clean_data.R in your editor and add one line at the end of the file:

cat("Rows after cleaning:", nrow(wages), "\n")

Save the file. Back in the Terminal, run git diff again:

git diff

You should now see output like:

diff --git a/clean_data.R b/clean_data.R
index 1a2b3c4..5d6e7f8 100644
--- a/clean_data.R
+++ b/clean_data.R
@@ -9,3 +9,5 @@ wages <- wages[!is.na(wages$wage), ]

 # Log transformation
 wages$log_wage <- log(wages$wage)
+
+cat("Rows after cleaning:", nrow(wages), "\n")

Lines starting with + are added. Lines starting with - are removed (none in this example). The @@ line identifies where in the file the change begins. Git calls this a unified diff; it is the same format used by Stack Overflow, email patches, and GitHub pull requests.

git diff vs. git diff --staged

Git separates the two halves of your change: what is in the working directory (edits you have saved on disk) versus what is in the staging area (edits you have marked with git add). Each half has its own diff command.

git diff shows the changes in the working directory that are not yet staged.
git diff --staged shows the changes that are staged and will go into the next commit.

Run the staged-diff command now, before staging anything:

git diff --staged

You will see no output. That is expected: nothing has been staged yet. The difference between the last commit and what is currently staged is zero.

Now stage the line you added:

git add clean_data.R

Re-run both diff commands:

git diff

Empty. The working directory now matches the staging area, so there is nothing unstaged left to show; your edit has moved to the staging area.

git diff --staged

Now you see the + line that is waiting to be committed. The exact same diff you saw a moment ago, but it moved from the “unstaged” view to the “staged” view as soon as you ran git add.

This is the practical value of git diff --staged: it is the command you run right before git commit to see exactly what will go into the next commit. Final review.
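The hand-off between the two views can be reproduced in a few lines. This sketch uses a scratch repository and the --name-only option, which lists just the changed file names, so the movement from "unstaged" to "staged" is easy to see. The scratch folder, file contents, and "Sandbox User" identity are placeholders.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q
git config user.name "Sandbox User"            # placeholder identity, local to this repo
git config user.email "sandbox@example.com"

printf '%s\n' 'wages$log_wage <- log(wages$wage)' > clean_data.R
git add clean_data.R
git commit -q -m "Add data cleaning script"

# Edit the file, but do not stage it yet.
printf '%s\n' 'cat("Rows after cleaning:", nrow(wages), "\n")' >> clean_data.R
echo "unstaged: $(git diff --name-only)"            # unstaged: clean_data.R
echo "staged:   $(git diff --staged --name-only)"   # staged:   (nothing)

# Staging moves the same change from one view to the other.
git add clean_data.R
echo "unstaged: $(git diff --name-only)"            # unstaged: (nothing)
echo "staged:   $(git diff --staged --name-only)"   # staged:   clean_data.R

# Final review of exactly what the next commit would contain.
git --no-pager diff --staged
```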

Commit it

Since we like this edit, commit it:

git commit -m "Log number of rows after cleaning"

You should now have four commits: the three from before plus this one. Verify:

git log --oneline

Four entries, newest on top.

The same history and diffs are available in your editor, in visual form.

RStudio. The History button in the Git pane opens the same commit list, with click-to-diff on each entry. The Review Changes dialog shows diffs side-by-side and color-coded, which is easier to read than the Terminal output for anything longer than a few lines.

VS Code. The Timeline section in the file explorer shows per-file version history. Click any entry to see the diff. Any open file also displays a +/- gutter on the left, marking modified, added, or deleted lines compared to the last commit in real time as you type.

.gitignore

Not every file belongs in version control. Large data files, sensitive credentials, and system files should be excluded. Git reads a special file called .gitignore to learn which files to skip.

Create the file

The file must be named exactly .gitignore, starting with a dot and with no extension after it. It lives in the top level of your project folder (the same folder as clean_data.R).

From the prompt, inside my-research, create an empty .gitignore file with touch:

touch .gitignore

touch creates an empty file with the given name. You will see no output; silence means success. Confirm with:

ls -a

The -a flag shows hidden files (those starting with a dot). You should see .gitignore in the listing. Now open it in your editor to add content:

code .gitignore

If you did not set up the code command, open it manually through RStudio or VS Code’s file explorer.

In RStudio (with my-research open as a Project), use File → New File → Text File. A blank untyped document opens in the editor. Use File → Save As and enter the filename as .gitignore (starting with a dot, nothing else). Save inside my-research. RStudio may warn about files whose names begin with a dot; accept the warning and save.

In VS Code’s Explorer sidebar (with my-research open as the workspace), hover over the top row showing the folder name MY-RESEARCH. A row of small icons appears to the right. Click the New File icon (a page with a +). Type .gitignore as the name and press Enter. The file is created and opens for editing.

Hidden files starting with a dot

On macOS, files whose names begin with a dot are hidden in Finder by default. Do not be alarmed if .gitignore does not appear when you browse the project folder in Finder. The Terminal (ls -a) and your code editors both show it normally.

Add the patterns

Before pasting patterns, three bits of .gitignore syntax:

* is a wildcard that matches any characters in a filename, so *.csv means “any file ending in .csv.”
A trailing / marks a directory; data/ ignores the folder data and everything inside it.
A pattern with no / in it (or only a trailing /) matches at any depth in the project. A pattern beginning with / is anchored to the folder that contains the .gitignore file, here the project root.

In the .gitignore file you just opened, paste the following content and save:

# Data files (too large or sensitive for Git)
*.csv
*.dta
data/

# R artifacts
.Rhistory
.RData
.Rproj.user/

# System files
.DS_Store
Thumbs.db

Each line is a pattern:

*.csv matches any .csv file anywhere in the project.
data/ matches any folder called data, including all its contents.
.DS_Store matches exactly that filename (macOS folder metadata).
Lines starting with # are comments, ignored by Git.
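If you are unsure whether a pattern catches a particular file, git check-ignore can tell you. The sketch below recreates a shortened version of the patterns above in a scratch repository and tests a few paths against them. The folder output/ and the file names are hypothetical, chosen only to exercise the patterns.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q

# Shortened copy of the patterns listed above.
printf '%s\n' '*.csv' '*.dta' 'data/' '.Rhistory' '.DS_Store' > .gitignore

# Hypothetical project contents.
mkdir -p data output
touch data/wages.dta output/results.csv clean_data.R .Rhistory

# -v reports which .gitignore line is responsible for each ignored path.
git check-ignore -v output/results.csv data/wages.dta .Rhistory

# For a path that no pattern matches, the command prints nothing and fails.
git check-ignore -q clean_data.R || echo "clean_data.R is not ignored"

# Ignored files vanish from the untracked list:
# only .gitignore and clean_data.R remain.
git status --short
```

Running git status --short after editing .gitignore is the quickest everyday check; git check-ignore -v is for the cases where a file is ignored and you cannot tell why.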

Commit the .gitignore file

Back in the Terminal (still in my-research), stage and commit:

git add .gitignore
git commit -m "Add .gitignore for data and R artifacts"

.gitignore is itself a tracked file in your repository. Unlike the files it excludes, it belongs in Git: your collaborators need to see the same ignore patterns you do.

Warning

Data files do not belong in Git. Git is designed for code (small text files), not for large datasets. If you commit a 500 MB CSV, every collaborator will have to download it with the full repository. Use a shared drive, Dropbox, or a data repository for large files.

Linking to data that lives elsewhere

You have told Git to ignore your data folder. But your code still needs to read the data. In practice most applied economists split their project files across two places:

Code in ~/github/my-research/ (this Git project, not on Dropbox).
Data in a cloud-synced folder like ~/Dropbox/projects/my-research/data/ (backed up, shared across machines, not version-controlled with Git).

The cleanest bridge between the two is a symbolic link (or symlink): a tiny pointer file that looks like a folder but actually redirects to another location. You put a symlink called data inside your project that points to your Dropbox data folder. Your code reads from data/wages.csv, and the operating system transparently serves it from Dropbox.

Think of it as a signpost: data lives here (in the project folder) but actually points there (to Dropbox). R, Python, Stata, and almost every other tool follow symlinks automatically without knowing anything is unusual.

Set up the symlink (optional, can do this later)

The exact command depends on your operating system. Run this once per project.

On macOS or Linux, first navigate to your project folder from the Terminal. This is essential: ln -s creates the symlink in whatever folder you are currently in.

cd ~/github/my-research

Then create the symlink:

ln -s ~/Dropbox/projects/my-research/data data

This creates a file called data inside my-research that points to ~/Dropbox/projects/my-research/data. Verify:

ls data/

You should see the contents of the Dropbox folder listed, as if they were inside my-research.

On Windows, the equivalent is a junction, which does not require administrator rights (unlike true symlinks on Windows). Open Command Prompt (not Git Bash) and first navigate to your project folder. Junctions are created in the current directory, so this step matters.

cd %USERPROFILE%\github\my-research

Then create the junction:

mklink /J data "%USERPROFILE%\Dropbox\projects\my-research\data"

mklink /J creates a directory junction called data inside my-research, pointing to the Dropbox folder. It behaves identically to a symlink for reading files.

Both operating systems offer menu-based alternatives: Make Alias on Mac, Create Shortcut on Windows. Do not use these for this purpose. Aliases and shortcuts are recognized only by the graphical file manager; programs like R do not follow them. Use the Terminal commands above instead.

A few notes.

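First, make sure Git ignores the link itself. The symlink is a small file, but you do not want Git to track a pointer to a Dropbox path that will differ on every collaborator’s machine. The data/ line in the .gitignore above is not enough on macOS or Linux: a trailing / matches only real directories, and Git treats a symlink as a file. Add a second line containing just data (no trailing slash), then run git status to confirm the link is not listed as untracked.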

Second, each collaborator creates their own symlink with their own Dropbox path. The symlink is not shared through the repo; that is by design. You and your co-author may both have a data/ folder in your projects, but pointing to different absolute locations on your laptops.

Third, a symlink makes the project work only on the machine where it was created. If a stranger clones the repo, they need both the code and the data, and a way to connect them. For published replication packages, include the data directly in the archive or point to it with a data citation, not a symlink.
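The whole arrangement can be rehearsed safely before touching your real Dropbox folder. The sketch below is for macOS or Linux and uses two temporary folders as stand-ins for the Dropbox data folder and for my-research; every path in it is an assumption made for the demonstration. It also checks how the link interacts with .gitignore: Git treats a symlink as a file, so the directory-only pattern data/ does not cover it, while a plain data line does.

```shell
# Temporary stand-ins for the Dropbox data folder and the Git project.
sandbox="$(mktemp -d)"
mkdir "$sandbox/dropbox-data" "$sandbox/project"
printf '%s\n' 'wage,educ' > "$sandbox/dropbox-data/wages.csv"

cd "$sandbox/project"
git init -q

# Same form as the command above: ln -s <where the data really lives> data
ln -s "$sandbox/dropbox-data" data
ls data/                              # wages.csv, served through the link

# A directory-only pattern does not match the link itself...
printf '%s\n' 'data/' > .gitignore
git status --short                    # lists both .gitignore and data as untracked

# ...but a pattern without the trailing slash does.
printf '%s\n' 'data' >> .gitignore
git status --short                    # lists only .gitignore
```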

Undoing Mistakes

You will make mistakes. Git has tools to recover from the common ones. Four operations cover almost every situation you will hit in the first year.

Each operation below opens with a small before/after diagram using A → B → C ← HEAD notation. Letters are commits in chronological order (A is oldest, C is newest), and each arrow points from an earlier commit to the next one. HEAD is Git’s pointer to your current commit; see the HEAD callout in Viewing History for a full explanation if you skipped it.

Operation	When to use it	Command
Unstage a file	You ran `git add` by accident; want to pull back from staging	`git restore --staged <file>`
Discard local changes	You edited a file and want to throw the edit away before committing	`git restore <file>`
Undo the last commit	You committed too early; want to un-commit but keep the work staged	`git reset --soft HEAD~1`
Return to an earlier version	You want the version of a specific file from a previous commit	`git checkout <hash> -- <file>`

The first three operate on recent state (staging area, working directory, last commit). The fourth reaches into history. Below we walk through each with Terminal, RStudio, and VS Code versions.

Unstage a file

Scenario. You ran git add on a file you did not mean to include in the next commit. The file itself is fine; you just want to pull it out of the staging area so it does not get committed.

Before:  A → B → C ← HEAD
After:   A → B → C ← HEAD       (history unchanged)
         + file moves OUT of staging, back to working directory
         + your edits are kept

git restore --staged clean_data.R

No output. Run git status to confirm: the file is now listed as “modified” or “untracked” rather than under “Changes to be committed”.

In RStudio’s Git pane (top-right), find the file in the list and uncheck the box under the Staged column. The file moves from Staged back to Modified.

Alternatively, from the Review Changes dialog (click Commit to open), select the file in the top pane and click the Unstage button.

In VS Code’s Source Control panel, find the file under Staged Changes. Hover over the row; a - icon (minus) appears to the right. Click it. The file moves back down to Changes.

Alternatively, right-click the staged file and select Unstage Changes.

Discard local changes

Scenario. You edited a file, decided the change is wrong, and want to revert to the last committed version.

Before:  A → B → C ← HEAD
After:   A → B → C ← HEAD       (history unchanged)
         + working directory reverts to match C
         + your uncommitted edits are gone (destructive)

Caution

This operation is destructive. Uncommitted edits cannot be recovered. If you might want the edit later, commit it first (even with a placeholder message like “WIP: exploring alternative spec”). Once committed, you can always roll back to a previous version.

git restore clean_data.R

Git overwrites the working-directory version of the file with the last committed version. Your edits are gone.

In RStudio’s Git pane, select the file. Click More → Discard in the pane’s toolbar, or right-click the file and choose Revert. RStudio asks you to confirm. Click Yes. The file returns to its last committed state.

In VS Code’s Source Control panel, find the file under Changes. Hover over it; a ↺ icon (discard) appears. Click it. Confirm in the dialog. The edits are discarded.

Alternatively, right-click the file and select Discard Changes.
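The difference between unstaging and discarding is easiest to see side by side. In the scratch-repository sketch below, the same edit is first staged, then unstaged (the edit survives on disk), then discarded (the edit is gone). The folder, file contents, and "Sandbox User" identity are placeholders; nothing in my-research is affected.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q
git config user.name "Sandbox User"            # placeholder identity, local to this repo
git config user.email "sandbox@example.com"

printf '%s\n' '# cleaning script' > clean_data.R
git add clean_data.R
git commit -q -m "Add data cleaning script"

# Make an edit and stage it.
printf '%s\n' '# experimental edit' >> clean_data.R
git add clean_data.R
git status --short                    # "M  clean_data.R": staged

# Unstage: the edit leaves the staging area but stays in the file.
git restore --staged clean_data.R
git status --short                    # " M clean_data.R": modified, not staged
grep -c "experimental" clean_data.R   # 1

# Discard: the file is overwritten with the last committed version.
git restore clean_data.R
git status --short                    # no output: working tree is clean
```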

Undo the last commit (keep the changes)

Scenario. You committed too early. You want to pull the commit back so you can edit more, then recommit with a better message or a different scope. The file changes should stay in your staging area, ready to commit again.

Before:  A → B → C ← HEAD
After:   A → B ← HEAD           (HEAD moves back one step)
         + changes from C return to staging, ready to re-commit

Run:

git reset --soft HEAD~1

HEAD~1 means “one commit before the current commit”. The --soft flag tells Git: undo the commit, but leave all the file changes exactly as they were before the commit, staged and ready.

Run git status after. You should see the changes from the undone commit listed under “Changes to be committed”. No file content is lost.

RStudio’s Git pane does not expose soft reset as a button, but RStudio includes a built-in Terminal tab where you can run the command directly.

Open the Terminal tab. Tools → Terminal → New Terminal, or keyboard shortcut Shift+Alt+T (Windows) / Shift+Option+T (Mac).
The terminal opens inside your project folder. Run:

git reset --soft HEAD~1

The Git pane refreshes automatically and shows the files from the undone commit back under the Staged section.

VS Code’s Source Control panel does not expose soft reset as a button, but VS Code has a built-in integrated terminal where you can run the command.

Open the integrated terminal. View → Terminal, or keyboard shortcut Ctrl+` (backtick).
The terminal opens inside your project folder. Run:

git reset --soft HEAD~1

The Source Control panel refreshes automatically and shows the files from the undone commit back under the Staged Changes section.

Alternative with GitLens extension. If you install the popular free GitLens extension, it adds an Undo Commit command to the commit context menu. Right-click any commit in the Source Control graph and choose Undo. Same result, no terminal needed.

Note

Across all three paths, the underlying command is identical: git reset --soft HEAD~1. The only difference is where you type it.
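If you would rather rehearse the soft reset before using it on real work, the sketch below does so in a scratch repository. It makes two commits, undoes the second, and then checks that the log is one entry shorter while the change itself is still staged. Commit messages, file contents, and the "Sandbox User" identity are invented for the demonstration.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q
git config user.name "Sandbox User"            # placeholder identity, local to this repo
git config user.email "sandbox@example.com"

printf '%s\n' '# cleaning script' > clean_data.R
git add clean_data.R
git commit -q -m "Add data cleaning script"

printf '%s\n' '# half-finished idea' >> clean_data.R
git add clean_data.R
git commit -q -m "Committed too early"

git --no-pager log --oneline          # two commits

# Undo the last commit, keeping its changes staged.
git reset --soft HEAD~1

git --no-pager log --oneline          # one commit: "Committed too early" is gone
git status --short                    # "M  clean_data.R": the change is staged, not lost
```

From here you can keep editing, run git add again, and commit with a better message.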

Return to an earlier version of a file

Scenario. Three months ago, clean_data.R had a variable construction you now want back. Since then you have replaced it with something else and committed the replacement. You want the old version of that one file back, without rewinding the rest of the project.

This operation reaches into history and pulls out a past version of a specific file. Your other files are untouched.

Before:  A → B → C ← HEAD
After:   A → B → C → D ← HEAD   (history grows forward)
         + D is a NEW commit containing the restored file
         + both old and new versions stay reachable in log

First, find the commit hash where the old version of the file still existed:

git log --oneline clean_data.R

You should see something like:

a9b8c7d Switch to winsorized outcome
7f3a2b1 Add quadratic experience specification
e4d5c6b Filter to workers with high school education or more
2c5e8a1 Add data cleaning script

Pick the hash of the commit whose version you want (say 2c5e8a1). Then restore just that file from that commit:

git checkout 2c5e8a1 -- clean_data.R

The file in your working directory is now the version from that commit, and git checkout has already staged it for you. Run git status to confirm, then commit if you want to keep the restored version as the current state (the git add below is harmless but not strictly needed):

git add clean_data.R
git commit -m "Restore clean_data.R to earlier version"

The -- file at the end is a filter

The part after -- tells Git which file(s) to restore. Other files in that commit, and in the rest of your repo, are not touched. If the commit 2c5e8a1 had modified clean_data.R, run_regression.R, and descriptive_stats.R, the command above restores only clean_data.R. The other two stay as they currently are.

You can target more than one file or a whole folder:

git checkout 2c5e8a1 -- clean_data.R run_regression.R   # two specific files
git checkout 2c5e8a1 -- code/                           # everything in code/
git checkout 2c5e8a1 -- code/*.R                        # all R files in code/

The double-dash -- tells Git: “everything after this is a file path, not an option.” It removes ambiguity if a filename happens to match a branch name. Safe habit: always include it.

RStudio’s Git pane does not expose a per-file history restore. Use the Terminal tab inside RStudio:

Open the Terminal tab. Tools → Terminal → New Terminal, or the keyboard shortcut.
Find the commit hash with git log --oneline clean_data.R.
Run git checkout <hash> -- clean_data.R.
The file in the Files pane (and in any open editor) updates to the restored version. The Git pane shows it as already staged; commit through the Git pane as usual.

VS Code’s built-in Timeline lets you view a file’s full history as a diff viewer, but it does not have a one-click “Restore” option on its own. The smoothest built-in path combines Timeline browsing with the integrated terminal.

Open clean_data.R in the editor.
In the Explorer sidebar, scroll down and expand the Timeline section. Every commit that touched this file is listed, newest on top.
Click any entry. A diff opens in the editor showing that version vs. the current one. Browse until you find the version you want to restore.
Right-click that Timeline entry and select Copy Commit Hash.
Open the integrated terminal (Ctrl+`) and paste the hash into this command:

git checkout <paste-hash-here> -- clean_data.R

The file is now the restored version. The Source Control panel lists it under Staged Changes, because git checkout stages the file it restores. Commit to make this the current state.

One-click restore with the GitLens extension

If you install the free GitLens extension, right-clicking a file in any commit view exposes a direct Restore command. No copy-paste-terminal dance. GitLens is widely used in the VS Code community and adds many other Git features beyond this one. Worth installing if you work with Git in VS Code often.

What happens to your history

Restoring a file is a new commit forward, not a move backward. Your history grows; it does not rewind.

Before checkout:    A → B → C ← HEAD
After checkout:     A → B → C ← HEAD      (clean_data.R now staged, history unchanged)
After commit:       A → B → C → D ← HEAD  (D has a new hash, contains the restored version)

The old commits still contain the old versions. Nothing is destroyed. You added a new commit (D) that happens to contain an old file state. Both the old commit and the new commit are in your log, each with its own hash, each fully reachable.

Caution

This operation restores a single file to an older state. It does not rewrite history or change other files. If the restored version was broken for a reason you forgot, commit as usual and then decide later. Git’s history has both versions; nothing is lost.
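The restore-then-commit sequence from this section can be tried end to end in a scratch repository. The sketch below also uses git show <hash>:<file>, which prints the old version of a file without changing anything, as a way to preview what you are about to bring back. The two file versions, the commit messages, and the "Sandbox User" identity are placeholders.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q
git config user.name "Sandbox User"            # placeholder identity, local to this repo
git config user.email "sandbox@example.com"

printf '%s\n' '# version 1: original variable construction' > clean_data.R
git add clean_data.R
git commit -q -m "Add data cleaning script"
old_hash="$(git rev-parse --short HEAD)"       # remember the commit to return to

printf '%s\n' '# version 2: replacement' > clean_data.R
git add clean_data.R
git commit -q -m "Switch to winsorized outcome"

# Preview the old file; nothing on disk changes.
git --no-pager show "$old_hash:clean_data.R"

# Restore that one file. Git updates the working copy and stages it.
git checkout "$old_hash" -- clean_data.R
git status --short                    # "M  clean_data.R"

git commit -q -m "Restore clean_data.R to earlier version"
git --no-pager log --oneline          # three commits: history only grew
```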

Exercise 2: Practice the Safety Rails

Time: ~10 minutes. Continue in my-research.

You have the add-and-commit loop down. This exercise practices the “undo” tools: unstaging, discarding changes, and viewing the diff before committing. These are the safety rails that make committing often feel low-stakes.

Use whichever interface you prefer

Do these operations in your editor of choice. The Terminal version is shown below for concreteness, but every step has an RStudio and VS Code equivalent in the earlier sections: staging is in Staging and Committing, unstaging and discarding are in Undoing Mistakes, and the log view is in Viewing History. Pick your preferred interface and scroll back to any section if you need the clicks.

1. Modify run_regression.R

Open run_regression.R in your editor and add a second specification at the end of the file:

# Add experience squared
model2 <- lm(log(wage) ~ educ + exper + I(exper^2) + tenure, data = wages)
summary(model2)

Save the file.

2. See your changes before staging

At the Terminal prompt:

git diff

You should see your new lines marked with +. The diff is your preview of what you are about to commit.

3. Stage and commit

git add run_regression.R
git commit -m "Add quadratic experience specification"

4. Practice unstaging

Create a junk file you do not want to commit, stage it by accident, then unstage it. The shell > below redirects the output of echo into a new file, creating junk.R with one line of text:

echo "temporary scratch" > junk.R
git add junk.R
git status

You should see junk.R under “Changes to be committed”. Unstage it:

git restore --staged junk.R
git status

Now junk.R appears as untracked again. The file still exists on disk, but Git no longer plans to include it in the next commit. Delete the junk file (rm is the shell command to remove a file):

rm junk.R

5. Practice discarding an edit

Open clean_data.R in your editor. Add any line (say, a silly comment: # this is a test edit). Save the file.

Check the diff:

git diff

Your edit shows as a + line. Now discard it:

git restore clean_data.R

Open clean_data.R in your editor again. The edit is gone. This is a destructive operation: uncommitted edits cannot be recovered. Git only remembers things you committed.

6. View your full history

git log --oneline

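You should see six commits, newest on top, if you followed every step of the walkthrough and both exercises (your count may differ if you skipped a commit or practiced the undo operations along the way). Each one describes a single logical change. Each one is a point you can return to.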

Your RStudio Git pane and VS Code Source Control panel mirrored every step of this exercise in real time: showing you which files were staged, what the diff looked like, and when the working tree returned to clean. Over time you will develop a sense for whether to consult the Terminal or the editor for any given question. Both show the same underlying Git state.

What’s Next

In Session 5 we will connect your local repository to GitHub, learn to push and pull, create branches, and collaborate with pull requests.

Before next session: create a free account at github.com.
