SSDS 15-16

SSDS Syllabus

Seminar for the Study of Development Strategies

General Information

The course focuses on close reading and re-analysis of emerging research in the political economy of development, broadly construed. The emphasis is on well-identified research, whether based on experimental or observational data. It is intended for advanced graduate students (3rd – 4th year) who already have strong analytic skills. Auditors are welcome as long as they put in the work. Second time takers/auditors are also welcome.

The overall structure is that in most weeks an external speaker comes to discuss new or in-progress research. The speaker does not present the work, however; instead they share their papers, data and code in advance with the class, and a “replication team” has a week to put together a detailed discussion of the work. In other weeks we do something similar with work in progress by students in the class.

Note this course has an unusual format, meeting roughly once every two weeks over the course of a year. This course meets in room 711 of the IAB building on Wednesdays from 4:10 – 6:00, generally followed by a dinner for a group of participants. It is led by Macartan Humphreys (mh2245@columbia.edu). If you want to see how this document was made, you can see the code here. Thanks to Jasper Cooper and Tara Slough, who have done enormous work on the schedule and on thinking through the structure and workflow of the class.

Expectations

Reading

The reading loads are not especially heavy; typically the speaker will provide 1 or 2 readings that give a sense of their research agenda. You should read these carefully. You should also look at the data whether or not you are on the “rep” team. There is no point coming to the class unprepared. My thoughts on reading and discussanting are here http://www.macartan.nyc/teaching/how-to-read/ and here http://www.macartan.nyc/teaching/a-checklist-for-discussants/.

Participation

The course will alternate between External and Internal weeks. During external weeks, guest speakers will present research to the class. Student research will be presented during Internal weeks.

External weeks:

Guest speakers will be asked to share data in advance, and students are encouraged to replicate results and subject the results to robustness checks before each class.

  • Every registered student will be expected to write a one-page response paper in advance of the talk each week. This is due in the class Dropbox by midnight on the Monday before class. If you are presenting in a given week this is not required.
  • A “rep” team of two students will be assigned a formal role as discussants and prepare oral and written commentary for the guest speaker.

Key elements of this are:

  1. Be in touch with authors and be sure you have the data, papers, and all you need at least a week in advance.
  2. Make sure you can make sense of the data and run a basic replication.
  3. When you have a feel for things, jot down a brief pre-replication plan. What do you plan to look at? What do you expect to find? Archive this on Dropbox.
  4. Then there are two ways to expand the analysis:
    • One is to check for robustness. How much do things depend on the particular models or measurements?
    • The second is to go more deeply into the logic of the explanation. This might sometimes require assembling more data, constructing new tests and so on.
  5. Meet me briefly on the Monday before class to go over your main material.
  6. Generate a presentation that
    • presents the paper in general
    • uses DeclareDesign (if it works; see below) to characterize the research design in abstract terms
    • goes through the results and replication
    • goes through robustness and extensions
    • does all this in rmarkdown so that the speaker has content and code in a single file
  7. Note that while we focus a lot on statistical replication and re-analysis there are many sides to a paper. Your presentation should not shy from discussing more fundamental conceptual or interpretational issues as appropriate.
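
To make step 4 concrete, here is a minimal sketch of a robustness check: the same treatment coefficient is re-estimated under several model specifications and the results are laid side by side. Everything in it is made up for illustration (the simulated data, the variable names treat, age, income and outcome, and the three specifications); with a real replication you would swap in the speaker's data and models.

```r
set.seed(123)

# Hypothetical data standing in for a speaker's replication file
n <- 200
dat <- data.frame(
  treat  = rbinom(n, 1, 0.5),
  age    = round(runif(n, 18, 70)),
  income = rnorm(n)
)
dat$outcome <- 0.3 * dat$treat + 0.02 * dat$age + 0.1 * dat$income + rnorm(n)

# Alternative specifications to compare (assumed, for illustration only)
specs <- list(
  baseline = outcome ~ treat,
  with_age = outcome ~ treat + age,
  full     = outcome ~ treat + age + income
)

# Fit each model and keep the estimate and standard error on treat
robustness <- t(sapply(specs, function(f) {
  fit <- lm(f, data = dat)
  est <- coef(summary(fit))["treat", c("Estimate", "Std. Error")]
  c(estimate = unname(est["Estimate"]), se = unname(est["Std. Error"]))
}))

round(robustness, 3)
```

If the estimate moves a lot from one row to the next, that is worth raising with the speaker.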

Internal weeks

During Internal weeks, student research will be presented.

  • I strongly encourage participation from students returning from the field with main results in hand. The student will provide data and replication files to the class in advance but will not present his or her own research.
  • Students who are not at that stage will be expected to provide an advanced draft of a research design by the end of the year. An advanced design means not only theory, hypotheses and identification strategy but also draft instruments and protocols and a dummy dataset and analysis.
  • In internal weeks, two students will be assigned to present the research. The first will be assigned to act as the defender of the research and will prepare a presentation and defense of the research. The second student will serve as a devil’s advocate, preparing a critique of the presented research.

Each student should expect to serve as a discussant for a guest speaker once per semester, to have his or her research presented once in the year, and to act as both a defender and a devil’s advocate for another student’s research once in the year.

Writing requirement

You will be expected to write a paper displaying original research to be presented during one of the internal weeks. These research papers will contain (i) a theoretical argument or motivation, (ii) an empirical test of that argument and (iii) a discussion of policy prescriptions resulting from the argument. A draft of this paper should be the paper used for your “internal” week; it does not have to have been written for this class specifically. The final paper, however, should be the paper as revised in light of the internal week discussions. Some thoughts on writing here http://www.macartan.nyc/teaching/on-writing/.

The Speakers

The Agenda

Our current speaker line up is as follows:

Date     Speaker              Provisional Topic
16-Sep   Shira Mitchell       Millennium Development Villages
23-Sep   Rich Nielsen         Violent Extremism
14-Oct   Eli Berman           Economics and Conflict
28-Oct   Donald Green         Vote-buying
4-Nov    Pablo Querubin       Accountability
18-Nov   Leonard Wantchekon   Deliberation
9-Dec    Graeme Blair         Nollywood or oil in the delta
3-Feb    Jessica Gottlieb     TBC
10-Feb   Gwyneth McClendon    TBC
24-Feb   Thomas Fujiwara      TBC
9-Mar    Peter Bergman        Education
23-Mar   Daniel Hidalgo       TBC
6-Apr    Maarten Voors        Health Systems Sierra Leone
13-Apr   Jens Hainmueller     TBC
27-Apr   Francesco Trebbi     TBC

The Rules

It is a very unusual thing for speakers to come and share data on unpublished work. It makes for terrific feedback and learning, but can also bring some risks to speakers. This cannot be thought of as a public presentation of research in the usual way and different rules apply. In particular:

  • If a speaker requests that data not be shared outside the group, or perhaps even outside the replication team, this has to be adhered to strictly on pain of permanent ostracism.
  • Any new findings from the analyses do not belong to the class or the students that engaged in the replication. You are working with the data for training purposes not for research purposes; you might see amazing patterns in the data but they don’t belong to you.
  • Any public commentary has to be bland at best. If you have to tweet or post elsewhere after sessions, your posts should give speakers no cause for embarrassment.

Workflow and Tools

We are going to be pretty hardcore about the workflow and using a set of very recent research tools to make sure all the work in the class is transparent and replicable.

The main tools that we will employ are:

  • GitHub – for collaborating on code, publishing replications and raising issues
  • Dropbox – for sharing data with one another
  • R – for conducting statistical analysis and authoring documents in…
  • Markdown – for authoring replications and pages on GitHub

GitHub

GitHub will serve four main purposes:

  1. Collaborating on code together
    • Unlike Dropbox, GitHub allows for non-simultaneous editing of the same document, whether it is an R script, a .tex file or an .Rmd (Markdown) file.
    • Each and every change is labeled, explained, and displayed in a simple interface. Reverting to previous versions or undoing certain changes is extremely easy. Three people can all make different changes to the same document on their own computers and then sync them whenever they want later.
    • How it works: you make changes on your computer to a file, say an .R script. When you save, the GitHub desktop app records which changes you made. You label the changes with an explanation and ‘commit’ them, but at that point nothing has changed on GitHub yet. To change the document on GitHub, you must ‘push’ or ‘sync’ your commits. To get your commits, others must ‘pull’ them from GitHub. The whole process becomes very easy and intuitive with a little familiarization.
    • To push all of your commits and pull everyone else’s in the desktop app, simply click the sync button.
  2. Publishing replications as web pages
    • When you submit and present replications you will write them in Markdown and compile them, then publish them to our GitHub page under your own subdirectory. This very page was created in R, using this file in 00_Admin.
    • Publishing a page like this in GitHub is pretty easy.
      1. Firstly, create a new folder, for example in External Weeks, to host all the code for your replication.
      2. Secondly, write the publishable version of your code in a Markdown file in R, saving it as an .Rmd file. For example, mvp_replication.Rmd.
      3. Thirdly, compile that file into a file called readme.md using knitr in R (see Rmd_to_md.R for an example – feel free to add to this script).
    • You’re done: GitHub automatically converts any file called readme into a webpage. When you convert an .Rmd file to an .md file, you’ve told R to take the .Rmd, compile all the R code, and make a Markdown file out of it. In each subdirectory, GitHub reads the readme file and turns it into a webpage which everyone in the class can read and which you can use for the presentations.
  3. Discussing and managing issues in the course using the ‘issues’ feature
    • A range of issues will arise during our course. It could be anything from coding problems to trying to find a partner for a replication. You can post, label and assign issues here.
    • All comments on issues can be formatted in Markdown!
  4. Sharing code, functions, packages
    • Everyone who contributes to the SSDS repository on GitHub can add code and other files to it. It can be a great incubator for new functions and other helpful general purpose tools.
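
The publishing steps above (write an .Rmd, compile it to readme.md with knitr) can be sketched roughly as follows. This is not the class's Rmd_to_md.R script, whose contents are not shown here; it is a stand-alone illustration that writes a tiny .Rmd into a temporary folder and knits it. The folder location and the contents of the .Rmd are assumptions made for the example; knit() with its input and output arguments is the standard knitr function.

```r
library(knitr)

# Hypothetical replication folder; a temporary directory is used so the
# sketch can be run anywhere without touching the class repository.
rep_dir <- file.path(tempdir(), "mvp_replication")
dir.create(rep_dir, showWarnings = FALSE, recursive = TRUE)

# A minimal .Rmd: one heading and one R chunk.
fence <- strrep("`", 3)  # builds the three-backtick chunk delimiter
rmd_file <- file.path(rep_dir, "mvp_replication.Rmd")
writeLines(c(
  "# MVP replication",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
), rmd_file)

# Compile the .Rmd (text plus code) into readme.md (text plus results).
md_file <- knit(input = rmd_file,
                output = file.path(rep_dir, "readme.md"),
                quiet = TRUE)
md_file
```

In a real replication you would point rmd_file at your own .Rmd inside your subdirectory of the repository and then commit and sync the resulting readme.md.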

To get started with GitHub, you will need two things:

  1. a GitHub account – you will use this to share code and make comments on the SSDS GitHub repository
  2. the GitHub desktop app – you will use this to label and sync any changes you make to one another’s code

Markdown

Please write all class reports in Markdown. Information on this is here: http://rmarkdown.rstudio.com/. R Markdown is fairly simple but has the advantage of letting you a) write $\LaTeX$ as needed, b) integrate your R code directly, and c) compile to a PDF, HTML or even Word file. For transparency and error reduction, b) is particularly important since we want to stay close to the data and set things up so that everyone in the class plus other presenters can follow your code and analysis.

To create a Markdown document in R:

  • open RStudio
  • click File/New File/R Markdown…
  • this creates an .Rmd file, containing both your code and text
  • to compile the document, you can either use the knit() function in R (this is how we make .md files), or simply click the “Knit HTML / PDF / Word” button on the top panel of RStudio

R

Analysis should be done in R. If you don’t know R you should teach yourself. There are various online courses which you can take; have a look at http://tryr.codeschool.com/ and https://www.datacamp.com/. If you love your Stata or Excel and just cannot get on top of R, make sure you are on a team with someone who can so that final analyses can be implemented in R.

We will keep an updated list of packages that you will need in the install_packages.R script. Run this script to get the new version of all of them.
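
For a sense of what a script like this does, here is a minimal sketch that checks which packages from a list are not yet installed. The package list and the helper name missing_packages() are assumptions for illustration; the actual install_packages.R in the repository may look quite different.

```r
# Hypothetical package list; the real install_packages.R may differ.
required_packages <- c("knitr", "rmarkdown", "repmis", "devtools")

# Return the packages in pkgs that are not installed on this machine.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required_packages)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install them
} else {
  message("All required packages are installed.")
}
```

The install line is commented out so that running the sketch only reports what is missing.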

Using Dropbox

We will keep all data on dropbox so that it can be sourced in from a single location. This is good practice and means that everything has to run off core data and not from individually customized files.

The easiest way to share data on Dropbox is to:

  • put the data on your own Dropbox account
  • get the link to the data document
  • the link will have a “key” component (a random sequence of numbers and letters), and the filename (e.g. mydata.csv)
  • see below for the use of the source_DropboxData() function, from the repmis package

Using this method we don’t have to each store the data on our own computer, but can just temporarily use it in R. This avoids over-burdening our hard drives with large datasets.
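
To see how the key and the filename fit together, here is a small sketch that rebuilds the download address from the two pieces. The URL pattern is taken from the download message printed in the example below; the helper name dropbox_url() is made up for illustration, and Dropbox link formats may well have changed since this was written.

```r
# Build the direct-download URL from a Dropbox key and a file name.
# dropbox_url() is a hypothetical helper, not part of repmis.
dropbox_url <- function(key, file) {
  paste0("https://dl.dropboxusercontent.com/s/", key, "/", file)
}

url <- dropbox_url(key = "5zqvxaz6evtc16d", file = "dummydata.csv")
url

# With the repmis package installed, the same two pieces are passed directly:
# library(repmis)
# dat <- source_DropboxData(file = "dummydata.csv", key = "5zqvxaz6evtc16d")
```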

Using R, Markdown and Dropbox together

So for example here is some data:

rm(list = ls(all = TRUE))
library(repmis)
data <- source_DropboxData(key = "5zqvxaz6evtc16d", file = "dummydata.csv")
## Downloading data from: https://dl.dropboxusercontent.com/s/5zqvxaz6evtc16d/dummydata.csv 
## 
## SHA-1 hash of the downloaded data file is:
## 3e82bde102084dc6cac7f14558ab1add4c4cf786

It looks like this

data
##   ID Age Voted
## 1  1  21     1
## 2  2  NA     1
## 3  3  25     0
## 4  4  60     1
## 5  5  30     0
## 6  6  15     0

And here is some analysis:

# Age difference between voters and non voters:
t.test(Age ~ Voted, data = data)
## 
## 	Welch Two Sample t-test
## 
## data:  Age by Voted
## t = -0.85866, df = 1.1034, p-value = 0.5372
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -221.2860  186.9526
## sample estimates:
## mean in group 0 mean in group 1 
##        23.33333        40.50000

All claims in your text should come from the data. For example the average age is 30.2.
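
As a sketch of what "coming from the data" means in practice, the number quoted above can be computed rather than typed. The data frame below is copied by hand from the printout so that the example runs without Dropbox access; the object names dat, avg_age and group_means are chosen for illustration. In an .Rmd file the same expression would sit in an inline R chunk so the text updates whenever the data do.

```r
# The dummy data, typed in from the printout above
dat <- data.frame(
  ID    = 1:6,
  Age   = c(21, NA, 25, 60, 30, 15),
  Voted = c(1, 1, 0, 1, 0, 0)
)

# Average age, ignoring the one missing value
avg_age <- mean(dat$Age, na.rm = TRUE)
avg_age  # 30.2

# Mean age by voting status, matching the t-test's sample estimates
group_means <- tapply(dat$Age, dat$Voted, mean, na.rm = TRUE)
round(group_means, 2)
```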

See here for an example of what a replication published to GitHub might look like.

Using DeclareDesign to formally characterize the research designs

For each analysis we want to try to formally characterize the design and wring it through the alpha version (alpha as in struggling, not as in tough) of DeclareDesign. DeclareDesign is a package that I am working on with Graeme Blair, Jasper Cooper and Alex Coppock. It is designed to let you describe the core elements of a research design in an abstract way and then you get a set of outputs that provide information on the features of the design — bias, power, coverage — as well as objects useful for registration such as dummy data and mock analyses.

In the DeclareDesign framework, there are six core elements of a research design. You should be able to identify each of these for each replication:

  1. The population. The set of units about which inferences are sought;
  2. The potential outcomes function. The outcomes that each unit might exhibit depending on how the causal process being studied changes the world;
  3. The sampling strategy. The strategy used to select units to include in the study sample;
  4. The estimands. The specification of the things that we want to learn about the world, described in terms of potential outcomes;
  5. The assignment function. The manner in which units are assigned to reveal one potential outcome or another;
  6. The estimator function. The procedure for generating estimates of quantities we want to learn about.
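
To fix ideas, the six elements can be written out by hand in base R for a toy experiment. This is a sketch only: it does not use the DeclareDesign API, and the population size, sample size, noise and coefficients (a constant effect of 0.2 and an income slope of 0.1) are assumed values chosen for illustration.

```r
set.seed(1)

# 1. Population: the units about which inferences are sought
N_pop <- 1000
population <- data.frame(id = seq_len(N_pop), income = rnorm(N_pop))

# 2. Potential outcomes: what each unit would show under control (Z0)
#    and under treatment (Z1)
noise <- rnorm(N_pop)
population$Y_Z0 <- 0.01 + 0.1 * population$income + noise
population$Y_Z1 <- population$Y_Z0 + 0.2

# 3. Sampling strategy: a simple random sample of 500 units
sampled <- population[sample(N_pop, 500), ]

# 4. Estimand: the population average treatment effect
estimand <- mean(population$Y_Z1 - population$Y_Z0)

# 5. Assignment: complete random assignment, half treated; each unit
#    reveals only the potential outcome for its assigned condition
sampled$Z <- sample(rep(c(0, 1), each = nrow(sampled) / 2))
sampled$Y <- ifelse(sampled$Z == 1, sampled$Y_Z1, sampled$Y_Z0)

# 6. Estimator: difference in means between treated and control
estimate <- mean(sampled$Y[sampled$Z == 1]) - mean(sampled$Y[sampled$Z == 0])

c(estimand = estimand, estimate = estimate)
```

Writing a design out this way, even roughly, is a good check that you can actually name each of the six pieces for the paper you are replicating.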

In a replication, you will typically already have the data. The instructions below demonstrate how DeclareDesign can be used with pre-existing data.

To install the package, use devtools in combination with the access key. Please do not share the key during this alpha phase.

# Use this code here to install the DeclareDesign package 
rm(list=ls())
devtools::install_github(repo = "egap/DeclareDesign", 
                         auth_token = "7c4a0e3d05e33bd9bc15eae4a198a69f614e77ac"
                         )

We generate some example data using DeclareDesign DGP functions. You should already have data, so this step will not be necessary.

population_user <- declare_population(
  individuals = list(
    income = declare_variable()),
  villages = list(
    development_level = declare_variable(multinomial_probabilities = 1:5/sum(1:5))
  ),
  group_sizes_per_level = list(
    individuals = rep(1,1000), 
    villages = rep(5,200)
  ))

user_data <- draw_population(population = population_user)

save(user_data, file = "baseline_data.RData")

First, we load the baseline data created by the user, and then define a set of covariates that will be simulated to conduct power analysis and for simulated analyses.

load("baseline_data.RData")

kable(head(user_data), digits = 3)
      villages_ID   income   individuals_ID   development_level
1               1   -0.939                1                   5
314            63   -0.042                2                   4
636           128    0.829                3                   5
681           137   -0.439                4                   2
627           126   -0.314                5                   5
692           139   -2.129                6                   5

Second, we define the potential outcomes, which will be simulated based on the baseline covariate data.

potential_outcomes     <-  declare_potential_outcomes(
  condition_names = c("Z0","Z1"),
  outcome_formula = Y ~ .01 + 0*Z0 + .2*Z1 + .1*income
)

Third, we resample (bootstrap) from the user data, respecting levels.

population <- declare_population(
  individuals = list(),
  villages = list(),
  N_per_level = c(500, 10),
  data = user_data)

Fourth, we define one or more analyses we will run based on simulated data. This analysis will also be used for power analysis.

estimand <- declare_estimand(declare_ATE(), target = "population", label = "ATE")

Then we declare the design of the experiment, in this case a simple one without clusters or blocking.

assignment <- declare_assignment(potential_outcomes = potential_outcomes)

Then declare the estimator.

estimator <- declare_estimator(formula = Y ~ Z, estimates = difference_in_means, estimand = estimand)

Before finalizing the design, we conduct a power analysis to determine whether 500 units and 10 clusters (villages) are sufficient. To do this, we use the diagnose function.

The output of the diagnose() function is a summary of important statistical properties of the design, including the statistical power, bias, and frequentist coverage (among other uses, an indicator of whether the statistical power is calculated correctly). Here is the diagnosis summary for our simple experiment:

diagnosis <- diagnose(population = population, assignment = assignment, 
                      estimator = estimator, potential_outcomes = potential_outcomes, sims = 1000)
kable(summary(diagnosis), digits = 3)
                                  PATE   sd(SATE)   Power   RMSE    Bias   Coverage
Y~Z1-Z0_diff_in_means_estimator   0.2    0          1       0.006   0      0.96

The information that diagnose outputs can be very useful for characterizing designs ex post.

The output has six important pieces of information. The first is the population average treatment effect, or PATE, the causal effect of the treatment on those in a finite population from which we have sampled. The sample average treatment effect, or SATE, is different: when we sample a particular set of units, the true average difference in potential outcomes might deviate from the PATE. In this example, we are treating the sample as the population, so there is no deviation of the SATE. Power in this simulation is defined as the probability of obtaining a statistically significant difference-in-means — this occurred in 100% of the simulations. Reassuringly, the difference-in-means estimator does not exhibit any bias. Moreover, the coverage is very close to the theoretical target of 0.95, implying that the estimated confidence interval covers the true effect roughly 95% of the time, as it should.
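
The quantities in a diagnosis can also be approximated by brute force, which helps in understanding what they mean. The sketch below is written in base R and does not use DeclareDesign: it fixes a finite population, re-randomizes treatment many times, and summarizes the resulting estimates. The population size, noise level, number of simulations and the 0.05 significance threshold are all assumptions, so the numbers it produces will not match the table above.

```r
set.seed(42)

# A fixed finite population; the sample is treated as the population,
# so the estimand does not vary across simulations.
N <- 500
income <- rnorm(N)
noise  <- rnorm(N)
Y_Z0 <- 0.01 + 0.1 * income + noise   # untreated potential outcome
Y_Z1 <- Y_Z0 + 0.2                    # treated potential outcome
ate  <- mean(Y_Z1 - Y_Z0)             # the estimand

# One simulated experiment: assign, reveal outcomes, estimate.
one_run <- function() {
  Z <- sample(rep(c(0, 1), each = N / 2))   # complete random assignment
  Y <- ifelse(Z == 1, Y_Z1, Y_Z0)           # revealed outcomes
  fit <- t.test(Y[Z == 1], Y[Z == 0])       # Welch difference in means
  c(estimate = unname(fit$estimate[1] - fit$estimate[2]),
    ci_low   = fit$conf.int[1],
    ci_high  = fit$conf.int[2],
    p        = fit$p.value)
}

sims <- t(replicate(500, one_run()))

diagnosis <- c(
  ATE      = ate,
  Power    = mean(sims[, "p"] < 0.05),
  RMSE     = sqrt(mean((sims[, "estimate"] - ate)^2)),
  Bias     = mean(sims[, "estimate"]) - ate,
  Coverage = mean(sims[, "ci_low"] <= ate & ate <= sims[, "ci_high"])
)

round(diagnosis, 3)
```

Power is the share of runs with a significant estimate, bias is the average gap between estimate and estimand, and coverage is the share of confidence intervals that contain the estimand.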

For more details on how to use DeclareDesign, visit the alpha version of the website here.