LRB Dow Eff Functions

From InterSciWiki
Jump to: navigation, search
library(mice)
library(foreign)
library(stringr)
library(psych)
library(AER)
library(spdep)
library(geosphere)
library(relaimpo)
#The Dow-Eff functions, as well as the four ethnological datasets, are contained in an R-workspace, located in the cloud.
load(url("http://dl.dropbox.com/u/9256203/DEf01.Rdata"), .GlobalEnv)
ls()  #-can see the objects contained in DEf01.Rdata
##  [1] "addesc"    "chK"       "chkpmc"    "CSVwrite"  "doMI"     
##  [6] "doOLS"     "EA"        "EAcov"     "EAfact"    "EAkey"    
## [11] "gSimpStat" "kln"       "llm"       "LRB"       "LRBcov"   
## [16] "LRBfact"   "LRBkey"    "mkdummy"   "SCCS"      "SCCScov"  
## [21] "SCCSfact"  "SCCSkey"   "setDS"     "WNAI"      "WNAIcov"  
## [26] "WNAIfact"  "WNAIkey"
#The setDS( xx ) command sets one of the four ethnological datasets as the source for the subsequent analysis. The four valid options for xx are: “WNAI”, “LRB”, “EA”, “SCCS”. The setDS() command creates objects:
#object name	description
#cov	Names of covariates to use during imputation step
#dx	The selected ethnological dataset is now called dx
#dxf	The factor version of dx
#key	A metadata file for dx
#wdd	A geographic proximity weight matrix for the societies in dx
#wee	An ecological similarity weight matrix for the societies in dx
#wll	  A linguistic proximity weight matrix for the societies in dx
setDS("LRB")
#The next step in the workflow is to create any new variables and add them to the dataset dx. New variables can be created directly, as in the following example. When created in this way, one should also record a description of the new variable, using the command addesc(). The syntax takes first the name of the new variable, and then the description.
dx$lnarea <- log(dx$area)
addesc("lnarea", "log of total land area occupied by the group")
#Dummy variables (variables taking on the values zero or one) should be added using the command mkdummy(). This command will, in most cases, automatically record a variable description. Dummy variables are appropriate for categorical variables. The syntax of mkdummy() takes first the categorical variable name, and then the category number (these can be found in the codebook for each ethnological dataset). Note that the resulting dummy variable will be called variable name+“.d”+category number.
mkdummy("systate3", 1)
## [1] "This dummy variable is named systate3.d1"
mkdummy("systate3", 2)
## [1] "This dummy variable is named systate3.d2"
#After making any new variables, list the variables you intend to use in your analysis in the following form.
evm <- c("group2", "hunting", "gatherin", "fishing", "war1", "reven", "nomov", "dismov", "store", "subdiv2", "systate3.d2", "systate3.d1", "lnarea", "nagp")
#Missing values of these variables are then imputed, using the command doMI(). Below, the number of imputed datasets is 2, and 3 iterations are used to estimate each imputed value (these values are too low: nimp=10 and maxit=7 are the defaults and are reasonable for most purposes). The stacked imputed datasets are collected into a single dataframe which here is called smi.
#This new dataframe smi will contain not only the variables in evm, but also a set of normalized (mean=0, sd=1) variables related to climate, location, and ecology (these are used in the OLS analysis to address problems of endogeneity). In addition, squared values are calculated automatically for variables with at least three discrete values and maximum absolute values no more than 300. These squared variables are given names in the format variable name+“Sq”.
#Finally, smi contains a variable called “.imp”, which identifies the imputed dataset, and a variable called “.id” which gives the society name.
smi <- doMI(evm, nimp = 2, maxit = 3)
## [1] "group2"
## [1] "nomov"
## [1] "dismov"
## [1] "store"
## [1] "systate3.d2"
## [1] "systate3.d1"
## Time difference of 2.65 secs
dim(smi)  # dimensions of new dataframe smi
## [1] 678  85
smi[1:2, ]  # first two rows of new dataframe smi
##   .imp   .id group2 nomov dismov store systate3.d2 systate3.d1 hunting
## 1    1 Punan     30    45    240     1           0           0      30
## 2    1 Batek     58     6     50     2           1           0      30
##   gatherin fishing war1 reven subdiv2 lnarea nagp mht.name.d2 mht.name.d8
## 1       65       5    1  1.26   69.86  3.388 4738           0           0
## 2       65       5    1  2.10   69.86  2.282 3852           0           0
##   mht.name.d11 koeppengei.d4 koeppengei.d13 koeppengei.d18 continent.d3
## 1            0             0              0              0            0
## 2            0             1              0              0            0
##   continent.d4 region.d2 region.d7 bio.1  bio.2 bio.3  bio.4  bio.5 bio.6
## 1            0         0         0 1.240 -1.339 2.971 -1.571 0.2588 1.566
## 2            0         0         0 1.258 -1.327 2.079 -1.489 0.4143 1.538
##   bio.8  bio.9 bio.10 bio.11 bio.12 bio.13  bio.14  bio.15 bio.16   bio.17
## 1 1.125 0.9913 0.8121  1.403  4.036 1.9917  7.7985 -1.5882  2.163  7.68694
## 2 1.097 0.9936 0.9058  1.399  1.322 0.9502 -0.2036  0.1907  1.126 -0.05763
##   bio.18 bio.19 meanalt mnnpp  sdalt     x       y    x2     y2     xy
## 1 3.6274  2.635 -0.4907 3.250 0.2440 1.500 -0.7185 2.251 0.5163 -1.078
## 2 0.2692  1.765 -0.3770 0.378 0.7853 1.549 -0.5004 2.398 0.2504 -0.775
##   Australian NaDene UtoAztecan nomovSq storeSq huntingSq gatherinSq
## 1          0      0          0    2025       1       900       4225
## 2          0      0          0      36       4       900       4225
##   fishingSq war1Sq revenSq subdiv2Sq lnareaSq bio.1Sq bio.2Sq bio.3Sq
## 1        25      1   1.588      4880   11.477   1.538   1.792   8.828
## 2        25      1   4.410      4880    5.209   1.581   1.762   4.324
##   bio.4Sq bio.5Sq bio.6Sq bio.8Sq bio.9Sq bio.10Sq bio.11Sq bio.12Sq
## 1   2.469 0.06697   2.451   1.266  0.9826   0.6596    1.968   16.291
## 2   2.217 0.17168   2.364   1.202  0.9872   0.8205    1.958    1.747
##   bio.13Sq bio.14Sq bio.15Sq bio.16Sq  bio.17Sq bio.18Sq bio.19Sq
## 1    3.967 60.81624  2.52237    4.680 59.089006 13.15782    6.941
## 2    0.903  0.04144  0.03638    1.269  0.003321  0.07247    3.115
##   meanaltSq mnnppSq sdaltSq
## 1    0.2408 10.5599 0.05954
## 2    0.1421  0.1429 0.61662
#All of the variables selected to play a role in the model must be found in the new dataframe smi. Below, the variables are organized according to the role they will play.
# --dependent variable--
dpV <- "group2"
# --independent variables in UNrestricted model--
UiV <- c("hunting", "gatherin", "fishing", "war1", "reven", "nomov", "dismov", "store", "subdiv2", "systate3.d2", "systate3.d1", "lnarea")
# --additional exogenous variables (use in Hausman tests)--
oxog <- c("nagp")
# --independent variables in restricted model (all must be in UiV above)--
RiV <- c("hunting", "gatherin", "fishing", "war1", "reven")
#The command doOLS() estimates the model on each of the imputed datasets, collecting output from each estimation and processing them to obtain final results. To control for Galton's Problem, a network lag model is used, with the user able to choose a combination of geographic proximity (dw), linguistic proximity (lw), and ecological similarity (ew) weight matrices. In most cases, the user should choose the default of dw=TRUE, lw=TRUE, ew=FALSE.
#There are several options that increase the time doOLS() takes to run: stepW runs a background stepwise regression to find which variables perform best over the set of estimations; relimp calculates the relative importance of each variable in the restricted model, using a technique to partition R2; slmtests calculates LaGrange multiplier tests for spatial dependence using the three weight matrices. All of these should be set to FALSE if one wishes to speed up estimation times.
h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = oxog, dw = TRUE,  lw = TRUE, ew = FALSE, stepW = TRUE, relimp = TRUE, slmtests = FALSE)
## [1] "--finding optimal weight matrix------"
## [1] "--looping through the imputed datasets--"
## [1] 1
## [1] 2
## Time difference of 23 secs
names(h)
##  [1] "DependVarb"       "URmodel"          "Rmodel"          
##  [4] "EndogeneityTests" "Diagnostics"      "OtherStats"      
##  [7] "DescripStats"     "totry"            "didwell"         
## [10] "dfbetas"          "data"