%%{init: { 'theme': 'neutral' }}%%
flowchart LR
laptop["Your Laptop"] -- git push --> github["GitHub (origin)"]
github -- git pull --> laptop
Session 5: GitHub & Collaboration
Working with remotes and teams
Want a PDF for note-taking? Open the slides in your browser, append ?print-pdf to the URL, and use File → Print → Save as PDF. Reveal.js handles the layout. Works in Chrome, Edge, and Firefox.
Quick Recap
In Session 4 we learned the core local Git workflow:
- git init: create a repository
- git add / git commit: stage and save snapshots
- git log / git diff: view history and changes
- .gitignore: exclude files from tracking
- git restore / git reset: undo mistakes
Today we take those skills online.
What Is GitHub?
GitHub is a cloud hosting service for Git repositories. You still run Git on your laptop and do your work there. GitHub stores a mirror of your repository on its servers. Two commands keep the two copies in sync: git push uploads your commits to GitHub, git pull downloads what others (or your other machines) have pushed.
GitHub is free for public and private repositories. You can also apply for GitHub Education benefits as a student or faculty member.
Why Put Your Code on GitHub?
Version control on your laptop tracks your changes over time. GitHub adds a cloud copy of that history. The motivations below make this worth doing for every research project.
Backup and recovery
Hard drives fail. Laptops get stolen. Coffee spills. A local Git repository lives in one folder on one machine; lose the machine and you lose the project. A remote on GitHub is a continuously updated backup of both your code and its full history. Restoring onto a new laptop takes one command.
Working across your own machines
Many applied economists work on more than one computer: a laptop for travel, an office desktop, sometimes a cluster for heavy computation. Without a cloud remote, synchronizing them means emailing scripts to yourself or copying folders onto USB drives. With GitHub, each machine pulls the latest version and pushes changes when finished. The history is the same everywhere.
Connect to GitHub: SSH Keys
Before you can push or pull, your laptop needs to prove to GitHub that you are who you claim to be. GitHub supports two authentication methods.
- HTTPS with a credential helper. Your laptop talks to GitHub over the same protocol your browser uses. A helper (osxkeychain on Mac, Git Credential Manager on Windows, libsecret on Linux) stores a Personal Access Token so you are not re-prompted on every push.
- SSH keys. Your laptop holds a private cryptographic key; GitHub holds the matching public key. Every connection signs a challenge instead of sending a password or token. No secret material crosses the network.
Both work. The differences that matter in practice:
| | HTTPS + credential helper | SSH keys |
|---|---|---|
| Setup time | ~2 minutes | ~5 minutes (one-time) |
| Expiration | Personal Access Tokens expire and need rotation (typically every 90 days) | Does not expire |
| Firewalls | Port 443, always open | Port 22, sometimes blocked on managed-IT or campus networks |
| Convenience across repos | One token authenticates all | One key authenticates all |
| Same setup across machines | Slightly different helper on each OS | Identical everywhere (Mac, Linux, Windows via Git Bash) |
For a research workflow, SSH is the more common long-term choice. You set it up once and forget about it: the same configuration works on your laptop, your office desktop, and a compute cluster, and you never have to rotate a token. HTTPS is a reasonable fallback if port 22 is blocked on your network.
We use SSH in this tutorial. If you prefer HTTPS, see the HTTPS alternative at the end of this section; the Git commands (git push, git pull, and the rest) are identical either way, only the authentication layer differs.
What SSH actually is
SSH (Secure Shell) is a protocol for secure communication between two computers. Every time you run git push or git pull against an SSH remote, your laptop opens an SSH connection to GitHub and the commands travel over that encrypted channel.
The authentication mechanism SSH uses is public-key cryptography. You generate two paired files on your laptop: a private key and a public key. A signature produced with the private key can be checked by anyone who holds the public key, but it cannot be produced without the private key. The two are linked by math, and the private key cannot feasibly be worked out from the public key.
The rule that makes this secure is simple. Your private key stays on your laptop and is never shared. Your public key is safe to hand out. You give the public key to GitHub once. From then on, when you connect, your laptop proves it holds the matching private key without ever sending the private key itself. No password crosses the network.
The upside over passwords is twofold. First, you never type a password again. Second, an attacker who intercepts your traffic cannot steal your private key, because the private key never travels.
You generate the key pair once, give GitHub the public half, and keep the private half on your laptop.
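To make the pairing concrete, here is a minimal sketch that creates a throwaway key pair in a temporary folder and confirms that both files describe the same key. The file name demo_key, the comment address, and the temporary folder are assumptions for illustration only; this is not the key you will register with GitHub, and nothing is written to ~/.ssh.

```shell
set -eu
demo_dir="$(mktemp -d)"

# Generate a disposable Ed25519 pair: empty passphrase (-N ""), no prompts (-q).
ssh-keygen -q -t ed25519 -N "" -C "demo@example.com" -f "$demo_dir/demo_key"

# Fingerprint of the public key file (the half you would share).
pub_fp="$(ssh-keygen -l -f "$demo_dir/demo_key.pub" | awk '{print $2}')"

# Re-derive the public key from the private key, then fingerprint that copy.
ssh-keygen -y -f "$demo_dir/demo_key" > "$demo_dir/derived.pub"
derived_fp="$(ssh-keygen -l -f "$demo_dir/derived.pub" | awk '{print $2}')"

echo "public key file fingerprint: $pub_fp"
echo "derived from private key:    $derived_fp"
```

The two fingerprints should be identical. The derivation only runs in this direction: the public half can be recomputed from the private file, which is why the private file is the one that must never leave your laptop.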
If you already use GitHub over SSH on this laptop (from a previous course, research project, or setup elsewhere), there is no need to repeat Steps 1–6. Run this one command in your terminal:
ssh -T git@github.com
If you see something like Hi yourusername! You've successfully authenticated..., your setup is good. Skip ahead to Keep Your Email Private or straight to the Remotes and Cloning sections.
If the command errors (Permission denied, Could not resolve hostname, etc.) or you have never done this before, continue with Steps 1–6 below.
All commands in this section run in a terminal.
- Mac / Linux: use the Terminal app (Applications → Utilities → Terminal). The Terminal tab inside RStudio and the Integrated Terminal in VS Code also work; they run the same shell.
- Windows: use Git Bash, the terminal that shipped with Git for Windows. Not PowerShell and not Command Prompt. Open it from the Start menu, or select Git Bash in VS Code’s terminal dropdown.
Step 1: Check whether you already have a key
Before generating a new key, see if one already exists on this machine:
ls -al ~/.ssh
Two possible outcomes:
- You see files called id_ed25519 and id_ed25519.pub (or an older id_rsa / id_rsa.pub pair) in the listing. You already have an SSH key pair, which GitHub will accept. Skip ahead to Step 4 (copy the public key, substituting id_rsa.pub if that is what you have).
- You see an error like ls: /Users/yourname/.ssh: No such file or directory, or the folder exists but does not contain those files. You do not have an SSH key on this machine yet. Continue with Step 2. This is the expected state for most first-timers. A GitHub account by itself does not create local SSH keys; the .ssh folder is only created the first time you run ssh-keygen.
~/.ssh is an absolute path. The ~ expands to your home directory, so the command looks in the same place regardless of your current working directory.
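If you would rather not read the listing by eye, the same check can be sketched as a short loop. It only looks for the two default file names mentioned above; a key saved under a custom name would not be detected, and the variable name found is an assumption of this sketch.

```shell
# Look for a complete default key pair (private file plus matching .pub file).
found="none"
for name in id_ed25519 id_rsa; do
  if [ -f "$HOME/.ssh/$name" ] && [ -f "$HOME/.ssh/$name.pub" ]; then
    found="$name"
    break
  fi
done
echo "existing key pair: $found"
```

If it reports none, continue with Step 2; otherwise the reported name is the pair to use from Step 4 onward.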
Step 2: Generate a key pair
Replace the email with the one tied to your GitHub account.
ssh-keygen -t ed25519 -C "you@email.com"
The command is interactive. It will prompt you three times.
Prompt 1 — where to save the key:
Generating public/private ed25519 key pair.
Enter file in which to save the key (/Users/yourname/.ssh/id_ed25519):
Press Enter to accept the default location.
Prompts 2 and 3 — passphrase and confirmation:
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Either type a passphrase at both prompts (slightly more secure, recommended on a shared machine) or press Enter twice to leave it empty (simpler, defensible on a personal laptop). The passphrase does not appear on screen while you type; that is expected.
What you see when it finishes. A confirmation message and a block of ASCII art:
Your identification has been saved in /Users/yourname/.ssh/id_ed25519
Your public key has been saved in /Users/yourname/.ssh/id_ed25519.pub
The key fingerprint is:
SHA256:aB3cD4eF... you@email.com
The key's randomart image is:
+--[ED25519 256]--+
| .o.+*=O+ |
| . .=.ooX* |
| . . *..+o=. |
| . o .o + |
| . . So |
| .... |
| ...E |
| o.o.. |
| .o.++o |
+----[SHA256]-----+
The randomart image is a visual fingerprint of your key. OpenSSH prints one after every key generation as a human-friendly way to spot mismatches later (a radically different picture means a different key). It is decorative: you do not need to record it or act on it.
Verify the files now exist:
ls -al ~/.ssh
You should see both id_ed25519 (private key: keep this secret) and id_ed25519.pub (public key: the one you share).
Step 3: Start the SSH agent and add your key
The ssh-agent is a small background process that holds your unlocked private key in memory so Git does not re-prompt you for the passphrase on every push or pull. You start the agent, then tell it which key to load.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
The first line starts the agent (the eval part sets the environment variables Git will use to find it). The second registers your new key with the running agent.
On a Mac, also tell the agent to store your passphrase in the macOS Keychain so you do not need to unlock the key again after a reboot:
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
Step 4: Copy the public key
Display the public key in the terminal:
cat ~/.ssh/id_ed25519.pub
The output looks like ssh-ed25519 AAAAC3Nz... you@email.com. Select and copy the entire line.
Only share the public key (id_ed25519.pub, the one ending in .pub). Never paste the private key (id_ed25519, no extension) anywhere.
Shortcut (Mac): pipe directly to the clipboard instead of copying manually:
pbcopy < ~/.ssh/id_ed25519.pub
Shortcut (Windows, Git Bash):
clip < ~/.ssh/id_ed25519.pub
Step 5: Add the key to GitHub
This step happens in your web browser, not the terminal.
Go to github.com. Click your profile picture (top right) → Settings.
In the left sidebar, click SSH and GPG keys.
Click New SSH key (green button, top right of the SSH keys section).
Paste your key in the Key field. Give it a Title that identifies the laptop, for example “My MacBook” or “Dyson office desktop”. Leave Key type as Authentication Key.
Click Add SSH key. GitHub may ask you to confirm your password, and then require a second factor to verify it is really you (adding an SSH key grants push access, so GitHub re-verifies on sensitive actions).
GitHub has required 2FA for all contributors since 2024, so you almost certainly have one of the following set up:
- Authenticator app (Google Authenticator, Authy, 1Password, Microsoft Authenticator, etc.): enter the current 6-digit code. These rotate every 30 seconds, so enter promptly.
- Duo: Cornell’s single sign-on uses Duo. If your GitHub account is linked to a Cornell or institutional identity, you may receive a push notification on your phone (tap “Approve”) or be shown a 6-digit Duo passcode to enter.
- Security key (YubiKey, Titan, or your laptop’s fingerprint / Touch ID): plug in and tap, or authenticate with the built-in biometric.
- SMS or email (legacy): enter the code GitHub sends to your phone or inbox.
If you have never set up 2FA on GitHub and it prompts you to do so before accepting the SSH key, follow the wizard: installing an authenticator app (Authy, 1Password, or Google Authenticator) is the most portable choice for a personal account.
Step 6: Test the connection
Back in your terminal:
ssh -T git@github.com
The first time you connect, you will see a warning asking you to confirm GitHub’s host fingerprint. This is SSH’s trust-on-first-use mechanism: it records github.com’s identity in ~/.ssh/known_hosts so future connections can verify they are still talking to the real GitHub. Type yes (the full word; y is rejected) and press Enter. If the setup worked, you will see:
Hi yourusername! You've successfully authenticated, but GitHub does not provide shell access.
That message is what you want. “Does not provide shell access” is normal — GitHub never lets anyone log in interactively.
HTTPS alternative (if SSH is blocked)
If SSH does not work on your machine (managed-IT restrictions, firewalls closing port 22), you can authenticate over HTTPS instead. The set-up varies slightly by operating system:
- Mac: Git uses the osxkeychain helper by default. Push once, enter your GitHub username, paste a Personal Access Token as the password, and Git remembers it.
- Windows: Git for Windows ships with Git Credential Manager. It opens a browser for GitHub sign-in the first time you push. No token needed.
- Linux: install git-credential-libsecret and run git config --global credential.helper libsecret.
Everywhere in this tutorial we use an SSH URL (git@github.com:user/repo.git). If you are on HTTPS, substitute the HTTPS URL (https://github.com/user/repo.git) in the same commands. The rest of Git works identically.
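If a repository is already set up with one URL style and you need the other, the remote can be repointed without recloning. The sketch below runs in a throwaway folder and uses the placeholder user/repo from the URLs above; nothing is contacted over the network, because changing a remote's URL only edits local configuration.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Register the SSH form of the URL (placeholder user/repo).
git remote add origin git@github.com:user/repo.git
git remote get-url origin

# Repoint the same remote at the HTTPS form.
git remote set-url origin https://github.com/user/repo.git
https_url="$(git remote get-url origin)"
echo "$https_url"
```

In a real project you would run only the git remote set-url line, inside the project folder, with your own URL.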
Keep Your Email Private
As flagged in Session 4 (Setup, Step 4), every commit records the name and email you set with git config. Once you push to a public repo, that email is searchable by anyone browsing GitHub. The fix is a GitHub-provided noreply address.
You should already be signed into GitHub from Step 5 of the SSH setup. Stay in the browser.
- Go to github.com/settings/emails.
- Under Primary email address, check the box for Keep my email addresses private.
- GitHub shows a noreply address shaped like 12345678+yourusername@users.noreply.github.com. Copy it.
- Switch to your terminal and update Git’s global configuration:
git config --global user.email "12345678+yourusername@users.noreply.github.com"
Replace the address with the one shown on your GitHub settings page.
This only affects commits made after you change the setting. Past commits keep whatever email they were created with. See GitHub’s documentation if you need to rewrite old ones.
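One way to confirm which email a commit actually records is to ask git log for the author email field. The sketch below does this in a throwaway repository and sets the name and email at the repository level (no --global), so it does not alter your real configuration; the noreply address is the placeholder from above, and notes.txt is a hypothetical file.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Repository-local identity, so the sketch leaves your global settings alone.
git config user.name "Demo User"
git config user.email "12345678+yourusername@users.noreply.github.com"

echo "notes" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# %ae prints the author email stored in the most recent commit.
recorded_email="$(git log -1 --format='%ae')"
echo "$recorded_email"
```

Running git log -1 --format='%ae' inside one of your own projects after the next commit shows whether the noreply address took effect.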
Where a Repository Comes From
Before we push or pull, it helps to know the three ways a project ends up connected to GitHub. Your path depends on whether the project already exists and who owns it.
1. Start from scratch
You have a new research idea with no code yet. Two variants arrive at the same end state.
- GitHub-first. Create an empty repository on github.com/new. Clone it to your laptop. Your local folder comes pre-connected to the remote. This is the simpler path for a project you know from the start you want on GitHub.
- Laptop-first. The folder and code already exist (the session 4 flow: git init, local commits). Create an empty repository on GitHub, then run git remote add origin <URL> to link the two. The right path when you started locally and only later decided to push.
2. Clone an existing repository
The repository already exists on GitHub. git clone <URL> downloads the full project and its history to your laptop, pre-connected to the remote. Whether you can push back depends on whether the owner added you as a collaborator. Typical cases in applied economics: joining a co-author’s project, following along with a course starter, pulling a public replication package.
3. Fork, then clone
You want your own independent copy on GitHub of someone else’s repo, typically because you cannot push to theirs. Click Fork on GitHub to create a copy under your account, then clone your fork. You have full push access to your fork, and can propose changes back to the original via a pull request. We revisit this pattern in Sessions 6–7 when working with Claude Code project templates.
Which path, when?
| Scenario | Entry point |
|---|---|
| Your own new research project | Start from scratch |
| Joining your advisor’s or co-author’s repo | Clone |
| Following a course starter (including this session’s exercise) | Use this template, then clone |
| Contributing to an R package or replication repo you do not own | Fork, then clone |
| Basing your own work on someone else’s code, diverging from their version | Fork, then clone |
In this tutorial we walk through cloning in detail, since it is the most common path for joining an existing project. Starting from scratch and forking each get their own short subsections below.
Cloning a Repository
git clone <URL> downloads a full repository from GitHub to your laptop, including the full commit history. The command also sets up the remote connection automatically, so git push and git pull work immediately.
What you need
- An SSH or HTTPS URL for the repo. On GitHub, click the green Code button on the repository’s main page, choose SSH (or HTTPS), and copy the URL.
- A parent folder on your laptop where the clone should live. The clone command creates a new subfolder inside your current location.
git clone refuses to proceed if a non-empty folder with the same name already exists at the destination. This is the common case when students come from session 4 and already have a my-research/ folder, or when they retry a clone after an earlier attempt.
Symptoms:
- Terminal: fatal: destination path 'my-research' already exists and is not an empty directory.
- RStudio and VS Code: the clone dialog reports the target directory is not empty.
Fix: rename or remove the existing folder before cloning. From your terminal:
cd ~/github
mv my-research my-research-old # safe: keep the old version
# or, if you do not need the old folder:
rm -rf my-research                # destructive and irreversible
Then retry the clone below.
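The refusal and the rename fix can be rehearsed offline before touching real work. In this sketch an empty bare repository in a temporary folder stands in for the GitHub repository, and old.txt is a hypothetical leftover file; both are assumptions for illustration.

```shell
set -eu
work="$(mktemp -d)"
git init -q --bare "$work/my-research.git"   # local stand-in for the GitHub repo
cd "$work"

mkdir my-research
echo "old work" > my-research/old.txt        # a leftover, non-empty folder

# First attempt: the destination exists and is not empty, so clone refuses.
if git clone -q "$work/my-research.git" 2>/dev/null; then
  clone_result="cloned"
else
  clone_result="refused"
fi
echo "first attempt: $clone_result"

# Keep the old version under a new name, then retry.
mv my-research my-research-old
git clone -q "$work/my-research.git" 2>/dev/null   # hides the empty-repository warning
```

After the rename, the old files sit in my-research-old/ and the fresh clone occupies my-research/.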
Three interfaces, same operation
In the terminal, navigate to the parent folder where you want the project to live:
cd ~/github
Then clone (replace YOURUSERNAME with your GitHub username, the handle shown in your GitHub profile URL):
git clone git@github.com:YOURUSERNAME/my-research.git
Git creates a new my-research/ folder, downloads everything, and reports progress. When it finishes, step into the project:
cd my-research
In RStudio:
- File → New Project → Version Control → Git.
- In the dialog:
  - Repository URL: paste the SSH (or HTTPS) URL from GitHub.
  - Project directory name: auto-fills from the URL; leave it or edit.
  - Create project as subdirectory of: choose the parent folder (for example ~/github).
- Click Create Project.
RStudio clones the repo and opens it as a new RStudio Project. The Git pane (top right) appears, already connected to the remote.
In VS Code:
- Open the Command Palette (Cmd+Shift+P on Mac, Ctrl+Shift+P on Windows/Linux).
- Type Git: Clone and press Enter.
- Paste the SSH or HTTPS URL in the input box and press Enter.
- When prompted, choose the parent folder where the repo should live.
- When VS Code asks whether to open the cloned repo, click Open.
Verify the remote is set
After cloning, confirm the remote connection. In your terminal, inside the cloned folder:
git remote -v
You should see something like:
origin git@github.com:YOURUSERNAME/my-research.git (fetch)
origin git@github.com:YOURUSERNAME/my-research.git (push)
origin is the default name Git assigns to the remote you cloned from. Future push and pull commands target origin unless you say otherwise.
Push and Pull
Once a remote is set up (either because you cloned, or because you added one with git remote add), two commands keep your laptop and GitHub in sync.
- git push uploads your new local commits to GitHub.
- git pull downloads new commits from GitHub into your local copy.
git push -u origin main
The command has four pieces:
- git push: the command.
- -u (or --set-upstream): a flag. It tells Git to record the linkage between your local branch and its counterpart on the remote, so future pushes and pulls on this branch do not need arguments.
- origin: the remote name, a label on your laptop that points to a URL. Clone names it origin by default. See all your remotes with git remote -v. Rename with git remote rename origin newname.
- main: the branch name. Shorthand for “push my local main to the remote main.” The fully explicit form is main:main (source:destination). Names usually match.
| Command | What it means |
|---|---|
| git push origin main | Push local main to origin/main. Do not set upstream. |
| git push -u origin main | Same, and set origin/main as the upstream of local main. |
| git push -u origin branch1 | Push a new branch branch1 and set its upstream. |
| git push | Bare. Only works once the current branch has an upstream. |
After the first -u push on a given branch, plain git push and git pull become shorthand for the configured upstream. Set once per branch, use forever.
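The effect of -u can be observed without a GitHub account. In the sketch below a bare repository in a temporary folder plays the role of GitHub; the folder names remote.git and laptop, the demo identity, and the file contents are all assumptions for illustration.

```shell
set -eu
work="$(mktemp -d)"

# A bare repository in a temporary folder stands in for GitHub.
git init -q --bare "$work/remote.git"

git init -q "$work/laptop"
cd "$work/laptop"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "x <- 1" > clean_data.R
git add clean_data.R
git commit -q -m "First commit"
git branch -M main            # make sure the branch is called main

git remote add origin "$work/remote.git"
git push -q -u origin main    # first push: upload and record the upstream

# Ask Git which remote branch the current branch now tracks.
upstream="$(git rev-parse --abbrev-ref '@{upstream}')"
echo "main tracks: $upstream"

# With the upstream recorded, a bare push needs no arguments.
echo "x <- 2" > clean_data.R
git commit -q -am "Second commit"
git push -q
```

Against GitHub the only difference is the URL given to git remote add.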
About the name “origin”. A remote is a label, not the repository. It points to a URL, like a contact in your phone points to a number. Git names it origin by convention because it is the origin of your clone. You can rename it, but nearly every Git tutorial, book, and answer on Stack Overflow uses origin, so keeping it avoids friction. Multiple remotes become useful in specific patterns: an upstream pointing to a repo you forked from, a backup pointing to a mirror on a different host. Most research projects have just one remote.
Push your commits
git push
If this is the first push from a freshly cloned repo, this works immediately because clone already set the upstream to origin/main. If you started from scratch with git init, your first push needs the -u flag to set the upstream:
git push -u origin main
After the first push, future pushes are just git push.
In RStudio, go to the Git pane (top-right of the IDE) and click the green up-arrow ⬆ Push button. A dialog opens and shows the output of the underlying git push.
If this is the first push on a new branch, RStudio may ask you to confirm setting upstream tracking; accept.
In VS Code, click the Sync icon in the status bar (bottom of the window, next to the branch name; a cloud with up/down arrows). Alternatively, open the Source Control panel (Source Control icon in the left activity bar, or ⌃⇧G / Ctrl+Shift+G) → click the … (More Actions) menu → Push.
The first time you push a new branch, VS Code prompts you to publish the branch; click OK to set up tracking.
Pull changes from GitHub
git pull
This fetches any new commits from origin/main and merges them into your local main. If you are on another branch, specify it: git pull origin branch-name.
In RStudio’s Git pane, click the blue down-arrow ⬇ Pull button. A dialog shows the output of the underlying git pull.
In VS Code, click the Sync icon in the status bar (bottom, next to the branch name). Sync does both pull and push in one step. Alternatively, in the Source Control panel, click the … menu → Pull for pull only.
What you should see on the first push
When a push succeeds, GitHub’s website refreshes to show your files and commit history. Common errors and their fixes:
- Permission denied (publickey): your SSH key is not recognized by GitHub. Run ssh -T git@github.com to check the connection (see Step 6). If that also fails, revisit the SSH setup.
- rejected (non-fast-forward): the remote has commits your local copy does not. A fast-forward push is one where your history is a simple linear extension of GitHub’s; when it is not, the push is rejected to avoid accidentally overwriting a co-author’s work. Run git pull first to bring the remote commits down (resolving any conflicts), then push again.
- fatal: The current branch main has no upstream branch: you started from git init and have not set tracking. Use git push -u origin main for the first push.
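The non-fast-forward rejection is easiest to understand by provoking it on purpose. The sketch below simulates you and a co-author with two local clones of a bare repository that stands in for GitHub; the folder names, identities, and file names are assumptions for illustration. The pull uses --no-rebase and --no-edit so that it merges without prompting, which a plain git pull may or may not do depending on your Git version and configuration.

```shell
set -eu
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/main

# You: create the project and push it.
git init -q "$work/you"
cd "$work/you"
git config user.name "You"; git config user.email "you@example.com"
echo "line 1" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"
git branch -M main
git remote add origin "$work/remote.git"
git push -q -u origin main

# Co-author: clone, add a different file, push.
git clone -q "$work/remote.git" "$work/coauthor"
cd "$work/coauthor"
git config user.name "Coauthor"; git config user.email "coauthor@example.com"
echo "table" > tables.txt
git add tables.txt
git commit -q -m "Add tables"
git push -q

# You: commit without pulling first, so the push is rejected.
cd "$work/you"
echo "line 2" >> notes.txt
git commit -q -am "Extend notes"
if git push -q 2>/dev/null; then first_push="accepted"; else first_push="rejected"; fi
echo "push before pulling: $first_push"

# Pull (explicit merge, default merge message), then push again.
git pull -q --no-rebase --no-edit
git push -q
echo "push after pulling: accepted"
```

Because the two commits touch different files, the merge completes without conflicts; edits to the same lines would stop at the pull and ask you to resolve them first.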
Starting From Scratch
If you began a project on your laptop without a remote (the session 4 flow: git init, commits, no GitHub connection yet), attaching a GitHub remote takes two commands.
Create a new empty repository on github.com/new. Name it, leave Initialize with README unchecked, click Create repository. Leaving it unchecked matters: if you check it, GitHub creates an initial commit on the remote that will diverge from your local history, and your first git push will be rejected as non-fast-forward.
In your terminal, inside the local project folder:
git remote add origin git@github.com:YOURUSERNAME/project-name.git
git push -u origin main
The first line registers GitHub as the remote called origin. The second pushes your existing commits and sets up upstream tracking.
GitHub also displays these exact commands on the empty repository’s welcome page, under …or push an existing repository from the command line. You can copy-paste directly from there.
Forks and Templates
Forking a repository
A fork is a server-side copy of someone else’s repository under your own GitHub account. You fork when you want your own independent version of a repo you do not own.
On GitHub:
- Go to the source repository’s page.
- Click Fork (top-right of the page).
- Confirm the fork destination (your own account or an organization you belong to).
- GitHub creates YOURUSERNAME/repo-name with the full history and a link back to the source.
- Clone your fork locally, the same way you would any other repo.
You can push freely to your fork. To propose changes back to the source, open a pull request on GitHub (covered later in this session).
We revisit forks in Sessions 6–7 when working with Claude Code project templates. The fork-and-clone pattern is how you get your own working copy of a template maintained by someone else.
Using a template
GitHub also supports template repositories. The owner marks a repo as a template, and anyone with a GitHub account can generate a fresh copy in one click. Unlike a fork, the copy has no link back to the source and starts with a clean commit history.
On GitHub:
- Go to the template repository’s page. Templates display a Use this template button where the Fork button would normally appear.
- Click Use this template → Create a new repository.
- Give your copy a name (for example my-research). This creates YOURUSERNAME/my-research under your account.
- Clone your new repository locally, the same way you would any other.
Fork vs. template: when to use which
- Use a fork when you may contribute changes back to the source, or when the linked history is valuable (for example, when working on an open-source package).
- Use a template when you want a clean starting point with no connection back and no history baggage. Research starter kits and course scaffolds are the typical case.
The starter repo for this course, arielortizbobea/aem7010-starter, is a template. Exercise 3 below walks you through using it.
Exercise 3: Clone Your Copy of the Starter Repo
Time: ~15 minutes
The goal is to practice the full workflow end to end: create a repository on GitHub from a template, clone it to your laptop, make a change, push it back, and pull a change made on GitHub.
1. Create your copy of the starter
- Open arielortizbobea/aem7010-starter in your browser.
- Click the green Use this template button → Create a new repository.
- Name your new repository my-research. Leave it public (or private, your choice). Click Create repository.
GitHub creates YOURUSERNAME/my-research under your account with the starter files and a fresh history.
YOURUSERNAME throughout this exercise means your GitHub username (sometimes called your handle): the short name you chose when you created your GitHub account. It is not your email or your full name.
To find it, look at the URL of your new repository page: https://github.com/YOURUSERNAME/my-research. The part between github.com/ and /my-research is your username. For example, if the URL shows github.com/arielortizbobea/my-research, then arielortizbobea is the username you substitute in the commands below.
2. Free up the my-research folder (if needed)
If you worked through session 4, you may already have a ~/github/my-research folder on your laptop. The clone below will fail with destination path 'my-research' already exists unless you move or remove it first.
Check in your terminal:
ls -d ~/github/my-research 2>/dev/null
If the command prints /Users/YOU/github/my-research, the folder exists and you need to choose one of the two options below. If it prints nothing, you do not have the folder and can skip to step 3.
Option A (recommended): keep the session 4 work by renaming.
cd ~/github
mv my-research my-research-session4
Your session 4 history is preserved in my-research-session4/ in case you want to come back to it.
Option B: delete the old folder. Only choose this if you are sure you do not need any commits from the session 4 folder. The operation is irreversible.
cd ~/github
rm -rf my-research
Either option frees up ~/github/my-research for the fresh clone in step 3.
3. Clone it to your laptop
Replace YOURUSERNAME with your GitHub handle throughout.
cd ~/github
git clone git@github.com:YOURUSERNAME/my-research.git
cd my-research
In RStudio:
- File → New Project → Version Control → Git.
- Repository URL: paste git@github.com:YOURUSERNAME/my-research.git.
- Project directory name: leave as my-research.
- Create project as subdirectory of: choose ~/github.
- Click Create Project.
RStudio clones the repo and opens it as a new RStudio Project.
In VS Code:
- Open the Command Palette (Cmd+Shift+P on Mac, Ctrl+Shift+P on Windows/Linux).
- Type Git: Clone and press Enter.
- Paste git@github.com:YOURUSERNAME/my-research.git and press Enter.
- When prompted, choose ~/github as the parent folder.
- When VS Code asks whether to open the cloned folder, click Open.
You should now see the starter files: README.md, .gitignore, clean_data.R, run_regression.R.
4. Make a change and commit
Open clean_data.R in your editor. Add a comment at the top:
# Modified by YOUR NAME for session 5 exercise
Save the file. Then stage and commit:
git add clean_data.R
git commit -m "Add modification note to clean_data.R"
In RStudio:
- In the Git pane (top-right of the IDE), tick the checkbox next to clean_data.R to stage it. The status changes from M (modified) to a green A or checkmark.
- Click Commit. The Review Changes dialog opens.
- Type Add modification note to clean_data.R in the commit message box (top-right of the dialog).
- Click Commit. The dialog shows the commit output.
In VS Code:
- Open the Source Control panel (Source Control icon (three-node fork shape) in the left activity bar, or ⌃⇧G / Ctrl+Shift+G).
- Under Changes, hover over clean_data.R and click the + icon to stage. The file moves to Staged Changes.
- In the message box above the staged changes, type Add modification note to clean_data.R.
- Click the ✓ Commit button.
5. Push your commit
Because cloning set the upstream automatically, you do not need any arguments to push.
git push
In RStudio’s Git pane, click the green up-arrow ⬆ Push button. A dialog shows the output of the push.
In VS Code, click the Sync icon in the status bar (bottom of the window, next to the branch name). Alternatively, open the Source Control panel and click the … menu → Push.
If you see Permission denied (publickey), revisit SSH setup.
6. Verify on GitHub
Go to https://github.com/YOURUSERNAME/my-research. Refresh. You should see your comment in clean_data.R and your commit message in the history.
7. Test pulling
Simulate a co-author making a change on GitHub:
- On GitHub, click on clean_data.R → pencil icon (Edit this file) → add a second comment line at the bottom, for example # Additional note added from the GitHub web editor.
- Commit the change via the green Commit changes button (default options are fine).
Now pull the change back to your laptop:
git pull
In RStudio’s Git pane, click the blue down-arrow ⬇ Pull button. A dialog shows the fetched commits.
In VS Code, click the Sync icon in the status bar (bottom, next to the branch name) to pull and push in one step. Alternatively, in the Source Control panel, click the … menu → Pull for pull only.
Open clean_data.R and confirm the edit made on GitHub now appears locally.
Branches
A branch is a parallel version of your code. You can experiment on a branch without affecting the main version. When the work is ready, you merge it back.
A picture of the concept:
%%{init: {
'theme': 'base',
'themeVariables': {
'git0': '#8EC6E8',
'git1': '#8FC88E',
'gitBranchLabel0': '#1B1B1B',
'gitBranchLabel1': '#1B1B1B',
'tagLabelBackground': '#F7F4E9',
'tagLabelColor': '#1B1B1B',
'tagLabelBorder': '#BBBBBB'
},
'gitGraph': { 'parallelCommits': true, 'showCommitLabel': false }
}}%%
gitGraph
commit tag: "A"
commit tag: "B"
branch add-iv-analysis
checkout add-iv-analysis
commit tag: "D"
commit tag: "E"
checkout main
commit tag: "C"
merge add-iv-analysis tag: "F"
Reading left to right: A and B are commits on main. At B, branching creates add-iv-analysis (the fork). The branch accumulates its own commits D and E (the IV-analysis work) while main independently continues with C (something else you or a co-author did). At F, merging brings add-iv-analysis back into main. F is a merge commit whose history includes everything from both lines.
The two actions in this picture map directly to the two Git commands introduced below: branching is git checkout -b add-iv-analysis, and merging is git merge add-iv-analysis (run from main). Everything between those two actions is just ordinary committing on one side or the other.
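The picture can be reproduced in a throwaway repository. This is a minimal sketch, assuming empty commits (via --allow-empty) are an acceptable stand-in for real work; the commit messages reuse the letters from the diagram.

```shell
# Build the A..F history from the diagram in a scratch repository
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version
git config user.name "Demo User"
git config user.email "demo@example.com"

git commit -q --allow-empty -m "A"
git commit -q --allow-empty -m "B"
git checkout -q -b add-iv-analysis         # fork at B
git commit -q --allow-empty -m "D"
git commit -q --allow-empty -m "E"
git checkout -q main
git commit -q --allow-empty -m "C"         # main moves on independently
git merge -q -m "F" add-iv-analysis        # merge commit F joins the two lines

git log --graph --oneline                  # text version of the picture
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
total_commits=$(git rev-list --count HEAD)
echo "F has $parent_count parents; main now contains $total_commits commits"
```

The merge commit has two parents, one on each line of work, which is what lets main's history include everything from both.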
When to branch
A branch earns its keep when the work has at least one of these properties:
- Exploratory. You might abandon it. If you do, the branch disappears and main stays clean. No trace of the dead end.
- Multi-step. It takes several commits to finish and makes sense as a reviewable unit only when assembled.
- Risky. It could break a currently-working main. You want main runnable end-to-end while you work.
- Parallel with someone else. Two people cannot edit main simultaneously without stepping on each other.
- Reviewed before merging. A co-author wants to see the changes before they land. This is the motivation for pull requests.
If none of these apply, commit on main. Branches for two-minute changes are ceremony without payoff.
main is the canvas you intend to sign. A branch is a sketchpad where you try variations. If a sketch works, you copy it to the canvas (merge the branch). If not, you close the sketchpad (delete the branch) and nothing contaminates the final work.
Research examples where a branch is clearly worth it
- Responding to a referee report (R&R). The classic case. main stays at the state of the submitted paper. You create a branch for each revision round or for each specific concern (rr-round-1, rr-referee2-sample-selection). Work proceeds on the branch: re-running regressions, updating tables, editing the text. You merge back when each item is resolved, tag the state at resubmission (v0.2-first-rr, covered later in this session), and you end up with a full record of how the paper evolved from submission to acceptance.
- Cascading specification change. Adding state fixed effects triggers rerunning every diagnostic, updating three tables, and editing the results discussion. Branch add-state-fe. If the result is worth keeping, merge; if not, delete the branch and nothing downstream is touched.
- Dataset extension. Updated data arrive through 2023. Ingestion, validation, recomputation, and table re-rendering are multi-commit work. Branch extend-sample-to-2023.
- AI-assisted rewrite. Claude Code rewrites your cleaning pipeline in 200 lines. Some of it may be subtly wrong. Branch ai-refactor-cleaning gives you a review buffer before the rewrite touches main.
- Co-author collaboration. You rerun regressions while your co-author edits the introduction. Each works on a branch. Neither blocks the other, and merges surface conflicts in the one-or-two places they exist.
What not to branch for: typo fixes, one-line adjustments, obvious bug fixes. The ceremony costs more than the benefit on trivial changes.
Create and switch to a branch
git checkout in session 4; it does two different things
git checkout is an old, overloaded Git command that plays two distinct roles:
- File-restoration role (session 4, Return to an earlier version of a file). git checkout <commit> -- clean_data.R restored the version of a file from an older commit. In session 4 we used git restore for the simpler “discard my uncommitted changes” case; the git checkout form was reserved for pulling an older version of a file back to the working directory.
- Branch-navigation role (this session). git checkout main or git checkout -b add-iv-analysis switches branches (with -b, creates a new branch first). Think of it as a navigation operation between branches.
Same command, different contexts. The two uses are easy to distinguish in practice because the argument tells you what you are acting on: a file path (with --) for the restoration case, a branch name for the switch case. Git 2.23+ split these roles into two single-purpose commands to reduce confusion:
- git restore for discarding changes, and git restore --source=<commit> -- clean_data.R for restoring from an older commit.
- git switch main or git switch -c add-iv-analysis for switching branches.
Both the old and new forms still work. Most tutorials (including this one) use git checkout because it is what you will encounter most often in Stack Overflow answers, books, and older scripts.
Terminal:
git checkout -b add-iv-analysis
This single command creates a new branch and switches to it. It combines two commands that Git also accepts separately:
- git branch add-iv-analysis creates a new branch at the current commit. A branch in Git is just a named pointer to a commit. Creating one does not switch you to it. You stay on whichever branch you were already on. Running git branch with no arguments lists all branches and marks the current one with an asterisk.
- git checkout add-iv-analysis switches to an existing branch. Your working directory updates to reflect that branch’s files. Historically git checkout was also used to restore individual files, which is why Git 2.23+ introduced git switch and git restore as cleaner single-purpose replacements; git switch -c add-iv-analysis is the modern equivalent of git checkout -b add-iv-analysis.
The -b flag on git checkout stands for “branch”: it tells Git to create the new branch before switching, collapsing the two steps into one.
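The "named pointer" idea can be checked directly. The following is a small sketch in a scratch repository (the empty initial commit is an assumption made to keep the example short): right after git branch, both branch names resolve to the same commit, and you are still on main.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

# Step 1: create the branch only; you stay where you were
git branch add-iv-analysis
git branch                                  # the asterisk still marks main
on_after_create=$(git rev-parse --abbrev-ref HEAD)

# Both names point at the same commit
main_commit=$(git rev-parse main)
branch_commit=$(git rev-parse add-iv-analysis)

# Step 2: switch to the existing branch
git checkout -q add-iv-analysis
on_after_checkout=$(git rev-parse --abbrev-ref HEAD)

echo "after create: $on_after_create, after checkout: $on_after_checkout"
```

Running git checkout -b add-iv-analysis from main would reach the same end state in one step.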
RStudio:
- In the Git pane (top-right of the IDE), click the New Branch button (branch icon next to the current branch dropdown). The New Branch dialog opens.
- Branch name: type add-iv-analysis.
- Leave Sync branch with remote checked if you plan to push this branch to GitHub.
- Click Create.
RStudio creates the branch and switches to it. The branch dropdown in the Git pane now shows add-iv-analysis.
VS Code:
- Click the current branch name in the bottom-left status bar (it shows main with a small branch icon).
- In the dropdown that opens at the top of the window, select + Create new branch…
- Type add-iv-analysis and press Enter.
VS Code creates the branch and checks it out. The status bar now shows add-iv-analysis.
Work on the branch
Commits made while on the branch exist only on the branch; they do not touch main until you merge. For demonstration, create a file called iv_analysis.R containing one comment line, then stage and commit it.
Terminal:
echo '# IV regression using 2SLS' > iv_analysis.R
git add iv_analysis.R
git commit -m "Add IV analysis script"
The echo ... > file redirect writes the string to iv_analysis.R, creating the file if it does not exist and overwriting it if it does.
RStudio:
- Create the file: File → New File → R Script. Paste # IV regression using 2SLS into the editor.
- File → Save As, name it iv_analysis.R, save inside the project folder.
- In the Git pane (top-right), tick the checkbox next to iv_analysis.R to stage it.
- Click Commit. The Review Changes dialog opens.
- Type Add IV analysis script in the message box, click Commit.
VS Code:
- Create the file: File → New File, name it iv_analysis.R, press Enter.
- In the editor, type # IV regression using 2SLS. Save with Cmd+S / Ctrl+S.
- Open the Source Control panel (⌃⇧G / Ctrl+Shift+G).
- Under Changes, hover over iv_analysis.R and click + to stage.
- Type Add IV analysis script in the commit message box, click the ✓ Commit button.
Switch back to main
Terminal:
git checkout main
RStudio: In the Git pane, click the branch dropdown (showing the current branch name) and select main.
VS Code: Click the current branch name in the bottom-left status bar and select main from the dropdown.
Notice that iv_analysis.R disappears from your working directory. It only exists on the other branch. Your working directory reflects whichever branch you are on.
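To watch the file come and go, here is a minimal sketch in a scratch repository. It assumes a starter file clean_data.R on main, created here only so that main has a first commit.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version
git config user.name "Demo User"
git config user.email "demo@example.com"
echo '# Clean the raw data' > clean_data.R
git add clean_data.R
git commit -q -m "Add cleaning script"

# Commit a new file on the branch
git checkout -q -b add-iv-analysis
echo '# IV regression using 2SLS' > iv_analysis.R
git add iv_analysis.R
git commit -q -m "Add IV analysis script"

# On main the file is absent from the working directory
git checkout -q main
ls
if [ -e iv_analysis.R ]; then on_main=yes; else on_main=no; fi

# Back on the branch it reappears
git checkout -q add-iv-analysis
ls
if [ -e iv_analysis.R ]; then on_branch=yes; else on_branch=no; fi

echo "present on main: $on_main, present on branch: $on_branch"
```

Nothing is lost when the file vanishes: it is stored in the branch's commit and is written back to disk each time you check that branch out.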
Merge the branch
When you are satisfied with the work on your branch, merge it back to main.
Terminal:
git checkout main
git merge add-iv-analysis
RStudio: RStudio’s Git pane does not expose a merge action. Open RStudio’s built-in terminal (Tools → Terminal → New Terminal, or use the Terminal tab at the bottom of the IDE) and run the Terminal commands above. The Terminal tab is a real shell that inherits your project’s working directory.
VS Code: Two equivalent paths. Pick whichever you prefer.
Keyboard-driven (Command Palette).
- Make sure you are on main (click the branch name in the bottom-left status bar and select main).
- Open the Command Palette (Cmd+Shift+P / Ctrl+Shift+P).
- Type Git: Merge Branch and press Enter.
- Select add-iv-analysis from the list of branches.
Mouse-driven (no typing).
- Click the branch name in the bottom-left status bar and select main.
- Open the Source Control panel (Source Control icon (three-node fork shape) in the left activity bar, or ⌃⇧G / Ctrl+Shift+G).
- Click the … (More Actions) menu at the top of the Source Control panel → Branch → Merge Branch…
- Click add-iv-analysis in the list.
Either way, VS Code runs the merge and reports the result.
Now iv_analysis.R appears on main and all the branch’s commits are part of main’s history.
Delete the branch (optional)
After merging, you can clean up.
Terminal:
git branch -d add-iv-analysis
RStudio: Same pattern as merge: RStudio does not expose a delete-branch action. Open the Terminal tab (Tools → Terminal → New Terminal) and run the Terminal command above.
VS Code:
- Open the Command Palette and type Git: Delete Branch.
- Select add-iv-analysis from the list.
Alternative GUI path: click the branch picker in the bottom-left status bar, locate add-iv-analysis in the list, and click the trash icon that appears on hover.
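Before deleting, it can help to confirm that the branch's work really is in main. The sketch below uses a scratch repository and one extra, hypothetical branch (abandoned-idea) to show that the lowercase -d flag refuses to delete a branch that has not been merged.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version
git config user.name "Demo User"
git config user.email "demo@example.com"
echo '# Clean the raw data' > clean_data.R
git add clean_data.R
git commit -q -m "Add cleaning script"

# A branch that gets merged
git checkout -q -b add-iv-analysis
echo '# IV regression using 2SLS' > iv_analysis.R
git add iv_analysis.R
git commit -q -m "Add IV analysis script"
git checkout -q main
git merge -q add-iv-analysis

# A branch that never gets merged (hypothetical name)
git checkout -q -b abandoned-idea
echo '# Half-finished experiment' > experiment.R
git add experiment.R
git commit -q -m "Start experiment"
git checkout -q main

git branch --merged main                   # branches whose commits are already in main
git branch -d add-iv-analysis              # succeeds: fully merged

# -d refuses to throw away unmerged commits; -D would force the deletion
if git branch -d abandoned-idea 2>/dev/null; then
  refused=no
else
  refused=yes
fi
merged_left=$(git branch --list add-iv-analysis)
unmerged_left=$(git branch --list abandoned-idea)
echo "deletion of unmerged branch refused: $refused"
```

The refusal is a safety net: deleting a merged branch loses nothing, while force-deleting an unmerged one discards the only name pointing at those commits.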
Pull Requests
A pull request (PR) is GitHub’s way of proposing changes. Instead of merging locally, you push a branch to GitHub and ask for it to be reviewed before merging. Pull requests are the standard collaboration workflow in both industry and academic research. They create a written record of what changed and why, which is valuable for reproducibility.
git pull and pull requests are different tools that work together
A git pull downloads new commits from the remote into your local copy. No review, no approval. You run it whenever you want to sync.
A pull request (PR) proposes that your branch be merged into another branch (usually main). It is a review gateway: a co-author reads the diff, leaves comments, and clicks merge when satisfied.
They typically run in sequence, not as alternatives:
- You branch, commit, and push your branch to GitHub.
- You open a PR asking for your branch to be merged into main.
- A reviewer approves and merges. The merge happens on GitHub, server-side.
- Other collaborators (and your other machines) run git pull to bring the merged work into their local main.
So the PR is the gateway for work entering main. git pull is how everyone’s local copies catch up after work has passed through that gateway.
When a PR is essential: multi-author projects, any project where main should always run end-to-end, and open-source contributions where you do not have push access to the target repo.
When you can skip the PR: solo research projects (no reviewer needed; commit directly or merge locally), and trivial fixes like typos where the review ceremony costs more than it adds. git pull remains useful even in solo work if you move across a laptop, office desktop, and cluster.
The pull request workflow
The steps below describe the full PR cycle from the first push to a merged branch. You will put this into practice in the take-home pair exercise at the end of this session. Read through for reference now; no action required on your own repo at this point.
Step 1: Create a branch and make your commits. Use the three-interface pattern from the Branches section above.
Step 2: Push the branch to GitHub. The first push on a new branch needs -u to set the upstream.
Terminal:
git push -u origin add-iv-analysis
RStudio: In the Git pane, confirm you are on the add-iv-analysis branch (check the dropdown). Click the green up-arrow ⬆ Push button. If this is the first push on this branch, RStudio prompts you to set upstream tracking; accept.
VS Code: Click the current branch name (add-iv-analysis) in the bottom-left status bar. In the dropdown at the top, select Publish Branch (this option appears whenever a local branch has no remote counterpart yet). VS Code pushes the branch and sets upstream in one step.
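What "setting the upstream" buys you can be seen in a scratch setup. This is a sketch under stated assumptions: a local bare repository (origin.git, a hypothetical stand-in for GitHub) plays the remote, and the file contents are placeholders.

```shell
scratch=$(mktemp -d)
git init -q --bare "$scratch/origin.git"   # stand-in for GitHub
git init -q "$scratch/my-research"
cd "$scratch/my-research"
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version
git config user.name "Demo User"
git config user.email "demo@example.com"
echo '# Clean the raw data' > clean_data.R
git add clean_data.R
git commit -q -m "Add cleaning script"
git remote add origin "$scratch/origin.git"
git push -q -u origin main

# New branch: the first push names the remote and branch, and -u records the link
git checkout -q -b add-iv-analysis
echo '# IV regression using 2SLS' > iv_analysis.R
git add iv_analysis.R
git commit -q -m "Add IV analysis script"
git push -q -u origin add-iv-analysis

upstream=$(git rev-parse --abbrev-ref 'add-iv-analysis@{upstream}')
git branch -vv                             # shows [origin/add-iv-analysis] next to the branch

# Later pushes on this branch need no arguments
echo '# First stage' >> iv_analysis.R
git commit -q -am "Add first-stage placeholder"
git push -q
unpushed=$(git rev-list --count origin/add-iv-analysis..add-iv-analysis)
echo "upstream: $upstream, unpushed commits: $unpushed"
```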
Step 3: Open a PR on github.com. Visit your repository’s page. You will usually see a yellow banner reading “add-iv-analysis had recent pushes” with a green Compare & pull request button. Click it. If the banner is gone (it disappears after a few minutes), go to the Pull requests tab → New pull request → set the compare branch to add-iv-analysis → click Create pull request.
Step 4: Write a description explaining what you changed and why. One paragraph is usually enough for a research PR. Reference specific tables, figures, or robustness checks when relevant.
Step 5: Review. Your co-author reads the diff under Files changed, leaves inline comments on specific lines, and approves the PR (or requests changes).
Step 6: Merge. Click Merge pull request on GitHub. GitHub then offers to delete the branch; accept unless you have a reason to keep it.
Handling Merge Conflicts
A merge conflict happens when Git cannot automatically combine two sets of changes because they touch the same lines (or lines directly next to each other) in the same file. Git pauses the operation, marks the affected file, and asks you to decide which version to keep.
When conflicts arise
You hit a conflict in three common situations:
- After git pull when a co-author pushed a commit that edited the same line you edited locally.
- After git merge when two branches modified the same line.
- While rebasing. Rebasing is advanced and not covered in this course.
In each case, Git stops mid-operation and leaves your working directory in a conflicted state. You cannot commit or push again until you resolve the conflict.
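The paused state can be inspected from the terminal, and a merge you are not ready to resolve can be backed out with git merge --abort. The sketch below stages a conflict in a scratch repository; the baseline regression line is an assumption made so that both sides have something to edit.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'lm(wage ~ educ, data = df)' > run_regression.R
git add run_regression.R
git commit -q -m "Add baseline regression"

# Two branches rewrite the same line differently
git checkout -q -b add-iv-analysis
echo 'lm(log(wage) ~ educ + exper + tenure, data = df)' > run_regression.R
git commit -q -am "Use log wage and add tenure"
git checkout -q main
echo 'lm(wage ~ educ + exper, data = df)' > run_regression.R
git commit -q -am "Add experience"

# The merge stops with a non-zero exit code; "|| true" lets a script continue
git merge add-iv-analysis || true
git status --short                         # UU marks a file with unresolved conflicts
conflicted=$(git diff --name-only --diff-filter=U)

# Not ready to decide? Abort restores the pre-merge state
git merge --abort
after_abort=$(git status --porcelain)
echo "conflicted file: $conflicted"
```

After the abort, main is exactly as it was before the merge attempt, and you can retry once you have agreed on a resolution.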
What a conflict looks like
Open the affected file in your editor. Git inserts conflict markers where the disagreement is:
<<<<<<< HEAD
lm(wage ~ educ + exper, data = df)
=======
lm(log(wage) ~ educ + exper + tenure, data = df)
>>>>>>> add-iv-analysis
- Between <<<<<<< HEAD and ======= is your current branch’s version.
- Between ======= and >>>>>>> add-iv-analysis is the incoming version from the other branch.
Two concrete examples
The right resolution depends on whether the two edits can be combined or express genuinely incompatible choices.
Example 1: combinable changes. You added nonwhite as a control on your branch; your co-author added female on main. Pulling their work triggers:
<<<<<<< HEAD
model1 <- lm(log(wage) ~ educ + exper + tenure + nonwhite, data = wages)
=======
model1 <- lm(log(wage) ~ educ + exper + tenure + female, data = wages)
>>>>>>> main
Both changes are additive and compatible. Edit the file to keep both controls, removing the markers:
model1 <- lm(log(wage) ~ educ + exper + tenure + nonwhite + female, data = wages)
Stage and commit. The merge is done.
Example 2: incompatible changes. You changed the sample filter in clean_data.R on your branch to keep only positive wages. Your co-author pushed a change to main that tightens the filter to workers with at least high-school education. The two edits land on the same line:
<<<<<<< HEAD
wages <- wages[wages$wage > 0 & !is.na(wages$wage), ]
=======
wages <- wages[wages$educ >= 12 & !is.na(wages$wage), ]
>>>>>>> main
Here you cannot blindly combine; the two edits express different sample definitions. Two paths:
Pick one. Decide with your co-author which filter reflects the current analysis, delete the other block, remove the markers.
Combine the intents explicitly, if both filters should apply:
wages <- wages[wages$wage > 0 & wages$educ >= 12 & !is.na(wages$wage), ]
Either way, save, stage, and commit. The choice is substantive; Git can surface the disagreement but not settle it.
How to resolve
Terminal:
1. Open the conflicted file in a text editor (e.g., code clean_data.R to open in VS Code, or nano clean_data.R).
2. Decide which version to keep, or combine them into something new.
3. Delete the <<<<<<<, =======, and >>>>>>> marker lines.
4. Save the file.
5. Stage and commit:
git add clean_data.R
git commit -m "Resolve merge conflict in regression specification"
The commit completes the merge Git had paused.
RStudio:
- The conflicted file appears in the Git pane with an orange U (unmerged) icon.
- Open the file. The conflict markers are visible in the editor.
- Edit the file to keep the version you want. Delete the <<<<<<<, =======, and >>>>>>> lines manually.
- Save.
- Back in the Git pane, tick the checkbox next to the file to stage it. The U becomes a checkmark.
- Click Commit, type a message (e.g., Resolve merge conflict in regression specification), click Commit.
VS Code has the most polished conflict UI of the three.
- The conflicted file shows colored highlighting. Above each conflict block, VS Code offers inline action links: Accept Current Change, Accept Incoming Change, Accept Both Changes, Compare Changes.
- Click the action that matches your decision, or edit manually if you want a hybrid version.
- Save the file.
- In the Source Control panel, the file moves from Merge Changes to Staged Changes.
- Type a commit message and click the ✓ Commit button.
Merge conflicts are normal, not dangerous. They happen whenever two people edit the same line. The fix is always the same in spirit: decide which version to keep, remove the markers, stage and commit.
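A file with several conflict blocks makes it easy to fix one and overlook another. As a hedged safeguard, a plain grep for lines that begin with a marker can be run before staging. The sketch below writes a sample file (its contents are illustrative) in which one block was left unresolved.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# One conflict was resolved (line 1); a second block was overlooked
cat > clean_data.R <<'EOF'
wages <- wages[wages$wage > 0 & wages$educ >= 12 & !is.na(wages$wage), ]
<<<<<<< HEAD
lm(wage ~ educ + exper, data = df)
=======
lm(log(wage) ~ educ + exper + tenure, data = df)
>>>>>>> add-iv-analysis
EOF

# Print every line that starts with a conflict marker, with its line number
grep -n -E '^(<<<<<<<|=======|>>>>>>>)' clean_data.R

# Count them; zero is what you want to see before git add
leftover=$(grep -c -E '^(<<<<<<<|=======|>>>>>>>)' clean_data.R)
echo "marker lines still in file: $leftover"
```

Git does not stop you from committing a file that still contains markers, so a check like this is worth the few seconds it takes.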
Exercise 4: Stage and resolve a merge conflict
Time: ~10 minutes. Work solo on your own my-research repo (the one from Exercise 3).
The goal is to experience a real merge conflict and resolve it. You will deliberately edit the same line of run_regression.R on two different branches, attempt the merge, and work through the resolution.
1. Confirm you are on main with a clean working tree
cd ~/github/my-research
git status
You should see On branch main and nothing to commit, working tree clean. If you have uncommitted work from Exercise 3, commit or discard it first.
2. Create a branch and add a female control
Create a branch named add-female-control:
Terminal:
git checkout -b add-female-control
RStudio: Git pane → New Branch (branch icon) → type add-female-control → Create.
VS Code: Click the branch name in the bottom-left status bar → + Create new branch… → type add-female-control → Enter.
Open run_regression.R. Find the first regression line:
model1 <- lm(log(wage) ~ educ + exper + tenure, data = wages)
Change it to add female as a control:
model1 <- lm(log(wage) ~ educ + exper + tenure + female, data = wages)
Save. Stage and commit:
git add run_regression.R
git commit -m "Add female as a control"
3. Switch back to main and add a different control to the same line
Switch back:
git checkout main
Open run_regression.R again. The line is back to the original (no female). Now change the same line to add nonwhite instead:
model1 <- lm(log(wage) ~ educ + exper + tenure + nonwhite, data = wages)
Save. Stage and commit:
git add run_regression.R
git commit -m "Add nonwhite as a control"
At this point, main and add-female-control each have a commit that edits the same line of the same file differently. This is the ingredient for a conflict.
4. Attempt the merge, hit the conflict
From main, try to merge the branch:
git merge add-female-control
Git responds:
Auto-merging run_regression.R
CONFLICT (content): Merge conflict in run_regression.R
Automatic merge failed; fix conflicts and then commit the result.
This is the conflict you designed for. Git has paused the merge.
5. Resolve the conflict by combining both changes
Open run_regression.R. You will see conflict markers:
<<<<<<< HEAD
model1 <- lm(log(wage) ~ educ + exper + tenure + nonwhite, data = wages)
=======
model1 <- lm(log(wage) ~ educ + exper + tenure + female, data = wages)
>>>>>>> add-female-control
Both changes are additive, so combine them. Edit the file to read (and remove the three marker lines):
model1 <- lm(log(wage) ~ educ + exper + tenure + nonwhite + female, data = wages)
Save. Stage and commit:
git add run_regression.R
git commit -m "Resolve conflict: keep both nonwhite and female"
Git completes the merge.
6. Verify
Check the log:
git log --oneline -5
You should see the merge commit at the top, followed by the main-side commit (“Add nonwhite…”), the branch-side commit (“Add female…”), and the initial starter commits. Open run_regression.R to confirm the line has both controls.
Optional cleanup:
git branch -d add-female-control
What you just practiced
- Creating a branch, committing, switching back, and merging: the full local loop.
- The exact thing that produces a conflict: two commits on different branches that edit the same line of the same file.
- Resolving a conflict is a substantive choice, not a mechanical one. Git highlights the disagreement. You decide which version is correct (or how to combine them).
Additional practice (take-home): Pair PR workflow
Time: ~20 minutes. Work with a partner, any time after class. This exercise lets you experience the collaboration side of Git — pushing a branch, opening a pull request, reviewing someone else’s diff, merging, and pulling the result.
1. Person A invites Person B as collaborator
Person A: go to your my-research repo on GitHub → Settings → Collaborators → Add people. Enter your partner’s GitHub username and send the invitation.
2. Person B clones Person A’s repo
Person B: accept the invitation (check your email or GitHub notifications), then clone. Throughout, replace PARTNER_USERNAME with Person A’s actual GitHub handle (visible in their GitHub profile URL).
Terminal:
cd ~/github
git clone git@github.com:PARTNER_USERNAME/my-research.git
cd my-research
If you already have your own ~/github/my-research from the earlier exercises, the clone will refuse to overwrite it. Give the clone a different folder name by adding it after the URL (for example my-research-partner) and cd into that folder instead.
RStudio:
- File → New Project → Version Control → Git.
- Repository URL: git@github.com:PARTNER_USERNAME/my-research.git.
- Project directory name: leave as my-research (or change to my-research-partner to avoid collision with your own repo of the same name).
- Create project as subdirectory of: ~/github.
- Click Create Project.
VS Code:
- Open the Command Palette (Cmd+Shift+P / Ctrl+Shift+P) → Git: Clone.
- Paste git@github.com:PARTNER_USERNAME/my-research.git, press Enter.
- Choose ~/github as the parent folder (or rename the target folder to avoid collision).
- Click Open when VS Code prompts.
3. Both partners create a branch
Create a branch named after yourself.
Terminal:
git checkout -b yourname-feature
RStudio: In the Git pane, click the New Branch button (branch icon). Type yourname-feature, leave Sync with remote checked, click Create.
VS Code: Click the current branch name in the bottom-left status bar → + Create new branch… → type yourname-feature → Enter.
4. Both partners add a file and commit
Add a new .R file with a few lines of R code (for example descriptive_stats.R or robustness_check.R), stage, and commit.
Terminal:
# After creating your_file.R in any editor:
git add your_file.R
git commit -m "Add descriptive statistics"
RStudio: Create the file (File → New File → R Script, paste your content, save as your_file.R inside the project). In the Git pane, tick the checkbox next to the file, click Commit, type the message, click Commit.
VS Code: Create the file (File → New File, paste your content, save as your_file.R inside the cloned folder). In the Source Control panel, click + next to the file, type the commit message, click ✓ Commit.
5. Both partners push the branch
Terminal:
git push -u origin yourname-feature
RStudio: Click the green up-arrow ⬆ Push button in the Git pane. Accept the “set upstream” prompt on the first push.
VS Code: In the bottom-left status bar, click the branch name and select Publish Branch from the dropdown.
6. Both partners open a pull request
Go to GitHub and open a pull request from your branch into main. See the Pull Requests workflow above if you need the steps.
7. Review each other’s PR
Click Files changed to see the diff. Leave a comment (for example “Looks good!” or “Add a header comment?”). When you are satisfied, click Merge pull request.
8. Pull the merged changes locally
Now that both PRs are merged, update your local copy of main so your laptop has both new files.
Terminal:
git checkout main
git pull
RStudio: In the Git pane, switch to main via the branch dropdown, then click the blue down-arrow ⬇ Pull button.
VS Code: Click the branch name in the status bar and select main, then click the Sync icon in the status bar (or open the Source Control panel → … menu → Pull).
Both partners now have main containing both of your contributions.
When the Data Is Too Large or Restricted
GitHub is not a data archive. Git warns you when you push a file larger than 50 MB, and GitHub rejects files larger than 100 MB outright. Many applied-economics replication packages exceed these limits, and some datasets cannot be shared publicly at all. The standard solution is to keep the code on GitHub and the data in a dedicated archive that mints a DOI (Digital Object Identifier), then link the two in your replication documentation.
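A quick scan for oversized files before the first commit can save a rejected push later. The sketch below is one way to do it with find; the 50 MB threshold is a deliberately conservative assumption you can adjust, and the project layout and the file name cps_full.dta are hypothetical.

```shell
# Build a small hypothetical project with one oversized data file
scratch=$(mktemp -d)
mkdir -p "$scratch/project/data"
cd "$scratch/project"
echo '# Clean the raw data' > clean_data.R
head -c 62914560 /dev/zero > data/cps_full.dta   # 60 MB of zeros as a stand-in

# List regular files larger than 50 MB, skipping Git's own .git folder
big_files=$(find . -path ./.git -prune -o -type f -size +50M -print)
echo "$big_files"
```

Anything this prints is a candidate for the data/ folder and a dedicated archive rather than for the Git repository.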
Where to put the data.
- openICPSR / AEA Data and Code Repository (openicpsr.org/openicpsr/aea): the official archive for AEA journals. If you publish in the AER or an AEJ, this is where your replication package must live.
- Zenodo (zenodo.org): free, CERN-operated, accepts up to 50 GB per record by default, issues DOIs. Integrates with GitHub so a Release can be archived in one click.
- Harvard Dataverse (dataverse.harvard.edu): widely used in social sciences, free, issues DOIs.
- ICPSR (icpsr.umich.edu): long-established social-science archive, often required by journals and funders.
If the data cannot be shared. Confidential administrative records, licensed commercial data, and some Census microdata cannot be posted publicly. Journal policies accept a data availability statement that describes the data, its source, the access restrictions, and how an authorized researcher can obtain it. You still post the code so that anyone with equivalent access can reproduce your analysis.
How code on GitHub and data in an archive stay linked
The mechanical pattern is just .gitignore plus a clear README:
- Keep data in a data/ folder on your laptop (or cross-synced via Dropbox). The session 5 starter’s .gitignore already excludes data/, so Git never tracks the files inside it.
- Your code paths reference files by relative path: read_csv("data/cps_2020.csv"), readRDS("data/clean_wages.rds"), etc.
- Upload the data itself to openICPSR, Zenodo, or Dataverse (as appropriate for your journal) to get a DOI.
- Your README tells the replicator: “Download the data from [DOI], place it in a folder called data/ at the repository root, then run master.R.”
The replicator clones the repo, follows the README, and the relative paths in the code resolve to the data they downloaded.
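To confirm that Git really is ignoring the data folder, git check-ignore reports which rule matches a given path. This is a minimal sketch in a scratch repository; the one-line .gitignore and the tiny CSV are stand-ins for the starter files, not copies of them.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version

echo 'data/' > .gitignore                  # assumed rule, mirroring the starter's exclusion
mkdir data
echo 'id,wage' > data/cps_2020.csv
echo '# wages <- readr::read_csv("data/cps_2020.csv")' > master.R

git add .
git status --short                         # only .gitignore and master.R are staged
tracked_data=$(git ls-files data)          # empty: nothing under data/ is tracked

# Which rule is responsible? -v prints the file, line number, and pattern
git check-ignore -v data/cps_2020.csv
ignored_path=$(git check-ignore data/cps_2020.csv)
echo "ignored: $ignored_path"
```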
GitHub integrations with data archives
The automation story varies widely.
- Zenodo has fully automated GitHub integration. Connect once in your Zenodo → Settings → GitHub panel, toggle the repository on, and every future GitHub Release is automatically archived on Zenodo with a DOI. This is the gold-standard workflow for small-data or code-only replication packages. The DOI is citable and permanent.
- openICPSR / AEA has no automatic integration. For AEA-journal replication packages you typically upload code and data together as one bundle via openICPSR’s web interface. The manual step is: export your code from GitHub at the submission tag, combine it with the data files in a folder, and upload. openICPSR mints a DOI. Your GitHub README links to that DOI.
- Harvard Dataverse has a REST API that supports semi-automated syncing, though there is no one-click workflow like Zenodo’s.
- ICPSR is a curated archive (staff review your deposit), so there is no “push from GitHub” workflow by design.
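For the manual openICPSR route, the "export your code at the submission tag" step can be done locally with git archive, which writes out the files exactly as they were at a given tag. The sketch below is an illustration under assumptions: the tag name v0.2-first-rr is borrowed from the referee-report example, and the package/code folder layout is hypothetical.

```shell
scratch=$(mktemp -d)
git init -q "$scratch/my-research"
cd "$scratch/my-research"
git symbolic-ref HEAD refs/heads/main      # name the first branch main on any Git version
git config user.name "Demo User"
git config user.email "demo@example.com"
echo '# Runs the full analysis' > master.R
git add master.R
git commit -q -m "Version sent to the journal"
git tag v0.2-first-rr

# Work continues after the tag; it must not leak into the package
echo '# Post-submission experiment' >> master.R
git commit -q -am "Work after submission"

# Export the tagged snapshot into package/code/
mkdir -p "$scratch/package"
git archive --prefix=code/ v0.2-first-rr | tar -xf - -C "$scratch/package"
# (Data files would be copied next to code/ here before uploading the folder.)

if grep -q 'Post-submission' "$scratch/package/code/master.R"; then
  has_later_work=yes
else
  has_later_work=no
fi
ls "$scratch/package/code"
echo "later work included: $has_later_work"
```

Because the export comes from the tag rather than from your working directory, uncommitted edits and later commits stay out of the bundle.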
Avoid Git LFS (Large File Storage) for replication archives. On GitHub, LFS storage and bandwidth are metered and cost money past a small free quota, and the files can become unavailable if the repository owner stops paying. LFS-hosted files are also difficult to cite. Use a dedicated archive instead.
Summary
You now have the tools to:
| Task | Commands |
|---|---|
| Track changes locally | git init, add, commit |
| View history | git log, diff |
| Back up to GitHub | git remote add, push, pull |
| Collaborate | git branch, checkout, merge |
| Review changes | Pull requests on GitHub |
| Mark reproducible versions | git tag |
Next steps
- Put your current research project on GitHub today
- Use branches for every new analysis or robustness check
- Write meaningful commit messages
- Bookmark this companion website for reference
Resources
- Pro Git book (free online)
- GitHub Docs
- Happy Git with R by Jenny Bryan: an excellent guide for R users