Make the imputed datasets
From InterSciWiki
#September 22, 2009, Anthon Eff
Contents |
[edit] 1 Background
Return to Imputing data for two-stage OLS (2SLS) Regression Analysis. The problem of network effects in causal analysis first surfaced in 1889 when statistician Sir Francis Galton critiqued anthropologist Sir Edward B. Tylor's use of correlation, hence it is known as Wikipedia:Galton's problem. See Galton's problem and autocorrelation for previous attempts at solution and "The Big Lie" that it is only a problem of random sampling. It took 110 years for the statistically valid 2SLS solution with instrumental (IV) variables to emerge (starting with Kelejian and Prucha (1998), as developed further and programmed here by econometrician Anthon Eff and anthropologist Malcolm Dow. They take the additional step of compensating for missing data by inferential statistical estimation for a distribution of probable values for each item of missing data, i.e., multiple imputation (MI), which restores the sample size and accuracy of regression analysis while retaining the uncertainties associated with missing data.
- Doug White: Anthon, is there just a single IV for each of distance and language? And would it not be useful to include date of observation, v838 for example (v1714 is only coded for half the cases), among the independent variables for missing data imputation, and in the set of independent variables in the models?
Anthon: In this program, there are two IV variables for cultural transmission: one for distance (fydd), and the other for language (fyll). One simply checks the pvalue in the displayed OLS regression results to see which is significant.
There is an identification problem with using separate variables for each effect: fydd and fyll correlate with each other. So in the work that Malcolm and I have been doing, we do in fact create a composite cultural transmission variable. The programming with that is much more complicated, so it didn't feel appropriate to include the composite variable technique in this primer. There is actually a bunch of additional stuff that would be good to disseminate (how to do logistic regression with network-lagged variables, tests for endogeneity, etc.), and maybe we'll do a more advanced primer in a couple of years. For now, the methods presented here represent a significant advance on the current practice. And the primer is already overly long...
I agree, the date of observation would have been a good variable for the auxiliary data--perhaps as an interaction term with some consolidated set of Burton regions. I should have kept it. I'm not so sure about using it as an independent variable... but it wouldn't hurt to try.
- Doug: I added dateobs=SCCS$v838, to program 1, but it made no difference to the results.
[edit] 2 Instructions
To download the R files for your analysis, if C:My Documents/MI does not already have the R files in it: Make directory C:My Documents/MI, and download http://intersci.ss.uci.edu/wiki/pw/2SLS.zip to this subdirectory. Open "My computer", navigate to C:My Documents/MI, click 2SLS (the zip file), move cursor to within the zip file, press "control-A" to highlight all files, click "copy", place cursor within the C:My Documents/MI directory (outside the zip file) and press "control-paste." This should copy all the files needed for the programs in the C:My Documents/MI subdirectory.
[edit] 3
If this program doesnt run, copy the first line that generates an error and the error message, and make sure all the R files are installed in the C:My Documents/MI subdirectory. If it does run, go on to program 2. Both take some time to execute. Copy your results and compare them to those at the top of program 2. Your term paper project depends on cross-cultural readings about the variables in the SCCS codebook whose designations in the R code are shown in SCCS Variables in R for your assignment. Find the dependent variables that you want to use with 2SLS, these two program, for 2-stage least squares. You make copies of programs 1 and 2 in work documents, save them, and then save under a new name for your project. The instructor will help you define the new dependent variable. (At a later stage you can add new independent variables. The idea is to get a good multivariate causal model for the dependent variable, e.g., level of crime as predicted by a set of variables that include the influences of father and mother in child training, for example. Getting those variables right will take some time and testing). Concentrate on getting your new program 1 running. Then edit your new program 2 to get the dependent variable (depvar) defined in the same way as in your program 1. When you get results that make sense to you then you are halfway home and you have the basis for writing up the project paper to satisfy the writing requirement. Turn in at various stages:
- Outline: Will define the problem that interests you, the dependent and independent variables.
- Draft: your findings and what they mean
- Presentation (powerpoint)
- Final paper (due end of term, Wedneday of exam week)
[edit] Results (Program 2)
> bbb
coef Fstat ddf pvalue VIF
(Intercept) -10.221 0.830 1663963 0.362 NA
fyll 1.409 8.143 1635530 0.004 1.323
cultints 0.793 5.654 92317113 0.017 1.897
roots -2.288 3.980 2884787518 0.046 1.210
fish 0.578 5.316 2704133416 0.021 1.239
exogamy -0.978 6.601 555752727 0.010 1.134
settype -0.451 4.039 1549237446 0.044 1.685
femsubs 0.631 4.046 2267773469 0.044 1.241
> r2
R2:final model R2:IV(distance) R2:IV(language)
0.1063426 0.9236338 0.9664695
> ccc
Fstat df pvalue
RESET 0.684 2320453.88 0.408
Wald on restrs. 0.383 740.25 0.536
NCV 1.079 38929798.47 0.299
SWnormal 0.480 108755460.06 0.488
lagll 1.715 3260507.47 0.190
lagdd 3.419 47700055.70 0.064
[edit] Program 1 (9-19-09) obsolete
#MI--make the imputed datasets
#--change the following path to the directory with your data and program--
#setwd("d:/projects/MI")
setwd("c:/My Documents/MI")
rm(list=ls(all=TRUE))
options(echo=TRUE)
#--you need the following two packages--you must install them first--
library(foreign)
library(mice)
#--To find the citation for a package, use this function:---
citation("mice")
#Click image to enlarge
#-----------------------------
#--Read in data, rearrange----
#-----------------------------
#--Read in auxilliary variables---
load("vaux.Rdata",.GlobalEnv)
row.names(vaux)<-NULL
#--Read in the SCCS dataset---
load("SCCS.Rdata",.GlobalEnv)
#--look at first 6 rows of vaux--
head(vaux)
#--look at field names of vaux--
names(vaux)
#--check to see that rows are properly aligned in the two datasets--
#--sum should equal 186---
sum((SCCS$socname==vaux$socname)*1)
#--remove the society name field--
vaux<-vaux[,-28]
names(vaux)
#--Two nominal variables: brg and rlg----
#--brg: consolidated Burton Regions-----
#0 = (rest of world) circumpolar, South and Meso-America, west North America
#1 = Subsaharan Africa
#2 = Middle Old World
#3 = Southeast Asia, Insular Pacific, Sahul
#4 = Eastern Americas
#--rlg: Religion---
#'0 (no world religion)'
#'1 (Christianity)'
#'2 (Islam)'
#'3 (Hindu/Buddhist)'
#--check to see number of missing values in vaux,
#--whether variables are numeric,
#--and number of discrete values for each variable---
vvn<-names(vaux)
pp<-NULL
for (i in 1:length(vvn)){
nmiss<-length(which(is.na(vaux[,vvn[i]])))
numeric<-is.numeric(vaux[,vvn[i]])
numDiscrVals<-length(table(vaux[,vvn[i]]))
pp<-rbind(pp,cbind(data.frame(numeric),nmiss,numDiscrVals))
}
row.names(pp)<-vvn
pp
#--extract variables to be used from SCCS, put in dataframe fx--
fx<-data.frame(
socname=SCCS$socname,socID=SCCS$"sccs#",
valchild=(SCCS$v473+SCCS$v474+SCCS$v475+SCCS$v476),
cultints=SCCS$v232,roots=(SCCS$v233==5)*1,
cereals=(SCCS$v233==6)*1,gath=SCCS$v203,hunt=SCCS$v204,
fish=SCCS$v205,anim=SCCS$v206,femsubs=SCCS$v890,
pigs=(SCCS$v244==2)*1,milk=(SCCS$v245>1)*1,plow=(SCCS$v243>1)*1,
bovines=(SCCS$v244==7)*1,tree=(SCCS$v233==4)*1,
foodtrade=SCCS$v819,foodscarc=SCCS$v1685,
dateobs=SCCS$v838, #ADDED BY DRW
ecorich=SCCS$v857,popdens=SCCS$v156,pathstress=SCCS$v1260,
CVrain=SCCS$v1914/SCCS$v1913,rain=SCCS$v854,temp=SCCS$v855,
AP1=SCCS$v921,AP2=SCCS$v928,ndrymonth=SCCS$v196,
exogamy=SCCS$v72,ncmallow=SCCS$v227,famsize=SCCS$v80,
settype=SCCS$v234,localjh=(SCCS$v236-1),superjh=SCCS$v237,
moralgods=SCCS$v238,fempower=SCCS$v663,
sexratio=1+(SCCS$v1689>85)+(SCCS$v1689>115),
war=SCCS$v1648,himilexp=(SCCS$v899==1)*1,
money=SCCS$v155,wagelabor=SCCS$v1732,
migr=(SCCS$v677==2)*1,brideprice=(SCCS$v208==1)*1,
nuclearfam=(SCCS$v210<=3)*1,pctFemPolyg=SCCS$v872
)
#--look at first 6 rows of fx--
head(fx)
#--check to see number of missing values--
#--also check whether numeric--
vvn<-names(fx)
pp<-NULL
for (i in 1:length(vvn)){
nmiss<-length(which(is.na(fx[,vvn[i]])))
numeric<-is.numeric(fx[,vvn[i]])
pp<-rbind(pp,cbind(nmiss,data.frame(numeric)))
}
row.names(pp)<-vvn
pp
#--identify variables with missing values--
z<-which(pp[,1]>0)
zv1<-vvn[z]
zv1
#--identify variables with non-missing values--
z<-which(pp[,1]==0)
zv2<-vvn[z]
zv2
#-----------------------------
#----Multiple imputation------
#-----------------------------
#--number of imputed data sets to create--
nimp<-10
#--one at a time, loop through those variables with missing values--
for (i in 1:length(zv1)){
#--attach the imputand to the auxiliary data--
zxx<-data.frame(cbind(vaux,fx[,zv1[i]]))
#--in the following line, the imputation is done--
aqq<-complete(mice(zxx,maxit=100,m=nimp),action="long")
#--during first iteration of the loop, create dataframe impdat--
if (i==1){
impdat<-data.frame(aqq[,c(".id",".imp")])
}
#--the imputand is placed as a field in impdat and named--
impdat<-cbind(impdat,data.frame(aqq[,NCOL(zxx)]))
names(impdat)[NCOL(impdat)]<-zv1[i]
}
#--now the non-missing variables are attached to impdat--
gg<-NULL
for (i in 1:nimp){
gg<-rbind(gg,data.frame(fx[,zv2]))
}
impdat<-cbind(impdat,gg)
#--take a look at the top 6 and bottom 6 rows of impdat--
head(impdat)
tail(impdat)
#--impdat is saved as an R-format data file--
save(impdat,file="impdat.Rdata")
[edit] Program 1 (9-22-09) current
#MI--make the imputed datasets
#--change the following path to the directory with your data and program--
###setwd("d:/projects/MI")
setwd("C:/My Documents/MI")
rm(list=ls(all=TRUE))
options(echo=TRUE)
#--you need the following two packages--you must install them first--
library(foreign)
library(mice)
#--To find the citation for a package, use this function:---
citation("mice")
#-----------------------------
#--Read in data, rearrange----
#-----------------------------
#--Read in auxilliary variables---
load("vaux.Rdata",.GlobalEnv)
row.names(vaux)<-NULL
#--Read in the SCCS dataset---
load("SCCS.Rdata",.GlobalEnv)
#--look at first 6 rows of vaux--
head(vaux)
#--look at field names of vaux--
names(vaux)
#--check to see that rows are properly aligned in the two datasets--
#--sum should equal 186---
sum((SCCS$socname==vaux$socname)*1)
#--remove the society name field--
vaux<-vaux[,-28]
names(vaux)
#--Two nominal variables: brg and rlg----
#--brg: consolidated Burton Regions-----
#0 = (rest of world) circumpolar, South and Meso-America, west North America
#1 = Subsaharan Africa
#2 = Middle Old World
#3 = Southeast Asia, Insular Pacific, Sahul
#4 = Eastern Americas
#--rlg: Religion---
#'0 (no world religion)'
#'1 (Christianity)'
#'2 (Islam)'
#'3 (Hindu/Buddhist)'
#--check to see number of missing values in vaux,
#--whether variables are numeric,
#--and number of discrete values for each variable---
vvn<-names(vaux)
pp<-NULL
for (i in 1:length(vvn)){
nmiss<-length(which(is.na(vaux[,vvn[i]])))
numeric<-is.numeric(vaux[,vvn[i]])
numDiscrVals<-length(table(vaux[,vvn[i]]))
pp<-rbind(pp,cbind(data.frame(numeric),nmiss,numDiscrVals))
}
row.names(pp)<-vvn
pp
#--extract variables to be used from SCCS, put in dataframe fx--
fx<-data.frame(
socname=SCCS$socname,socID=SCCS$"sccs#",
valchild=(SCCS$v473+SCCS$v474+SCCS$v475+SCCS$v476),
cultints=SCCS$v232,roots=(SCCS$v233==5)*1,
cereals=(SCCS$v233==6)*1,gath=SCCS$v203,hunt=SCCS$v204,
fish=SCCS$v205,anim=SCCS$v206,femsubs=SCCS$v890,
pigs=(SCCS$v244==2)*1,milk=(SCCS$v245>1)*1,plow=(SCCS$v243>1)*1,
bovines=(SCCS$v244==7)*1,tree=(SCCS$v233==4)*1,
foodtrade=SCCS$v819,foodscarc=SCCS$v1685,
ecorich=SCCS$v857,popdens=SCCS$v156,pathstress=SCCS$v1260,
CVrain=SCCS$v1914/SCCS$v1913,rain=SCCS$v854,temp=SCCS$v855,
AP1=SCCS$v921,AP2=SCCS$v928,ndrymonth=SCCS$v196,
exogamy=SCCS$v72,ncmallow=SCCS$v227,famsize=SCCS$v80,
settype=SCCS$v234,localjh=(SCCS$v236-1),superjh=SCCS$v237,
moralgods=SCCS$v238,fempower=SCCS$v663,
sexratio=1+(SCCS$v1689>85)+(SCCS$v1689>115),
war=SCCS$v1648,himilexp=(SCCS$v899==1)*1,
money=SCCS$v155,wagelabor=SCCS$v1732,
migr=(SCCS$v677==2)*1,brideprice=(SCCS$v208==1)*1,
nuclearfam=(SCCS$v210<=3)*1,pctFemPolyg=SCCS$v872
)
#--look at first 6 rows of fx--
head(fx)
#--check to see number of missing values--
#--also check whether numeric--
vvn<-names(fx)
pp<-NULL
for (i in 1:length(vvn)){
nmiss<-length(which(is.na(fx[,vvn[i]])))
numeric<-is.numeric(fx[,vvn[i]])
pp<-rbind(pp,cbind(nmiss,data.frame(numeric)))
}
row.names(pp)<-vvn
pp
#--identify variables with missing values--
z<-which(pp[,1]>0)
zv1<-vvn[z]
zv1
#--identify variables with non-missing values--
z<-which(pp[,1]==0)
zv2<-vvn[z]
zv2
#-----------------------------
#----Multiple imputation------
#-----------------------------
#--number of imputed data sets to create--
nimp<-10
#--one at a time, loop through those variables with missing values--
for (i in 1:length(zv1)){
#--attach the imputand to the auxiliary data--
zxx<-data.frame(cbind(vaux,fx[,zv1[i]]))
#--in the following line, the imputation is done--
aqq<-complete(mice(zxx,maxit=100,m=nimp),action="long")
#--during first iteration of the loop, create dataframe impdat--
if (i==1){
impdat<-data.frame(aqq[,c(".id",".imp")])
}
#--the imputand is placed as a field in impdat and named--
impdat<-cbind(impdat,data.frame(aqq[,NCOL(zxx)]))
names(impdat)[NCOL(impdat)]<-zv1[i]
}
#--now the non-missing variables are attached to impdat--
gg<-NULL
for (i in 1:nimp){
gg<-rbind(gg,data.frame(fx[,zv2]))
}
impdat<-cbind(impdat,gg)
#--take a look at the top 6 and bottom 6 rows of impdat--
head(impdat)
tail(impdat)
#--impdat is saved as an R-format data file--
save(impdat,file="impdat.Rdata")

