SCCS R package
- Dow and Eff Simple Functions -- Principal Components
- EduB - 2SLS-IS - Sccs 2SLS-IS downloads - SFI2011 project -
- Brown and Eff, 2009. http://intersci.ss.uci.edu/wiki/pdf/eScholarshipBrown&Eff.pdf Science magazine
- CAUTION: create MODELS .R files may have an error at run time: the fix is to comment data(sccs) --> #data(sccs)
- This change obviates the need for R CMDR as a batch installation as per Scott's original installation instructions.
- R Software for Regression with Inferential Statistics (2sls-is) - older MPI version: R Software for 2sls inferential statistics (2sls-is) - SFI The Evolution of Moralizing Gods
- EduR-0 has an introduction to the series. Edu-Mod 2009-10: The Individual Studies - Supplementary_Online_Materials:_T-T2sls
- SFI2010 project instructions Manual for Dataset Construction and Causal Analysis - NSF World Cultures Project work plan - EduMac-1
Plots for subsets
http://www.gardenersown.co.uk/Education/Lectures/R/correl.htm#correlation Generalization G11.11 p 420A +cond Pop x <- cbind(LRB$pop,LRB$subdiv,LRB$packinx) #subsistence diversity x.df <- data.frame(x) x.sub <- subset(x.df, X3 = 2) plot(x.sub$X1,x.sub$X2, cex=.5) x.sub <- subset(x.df, X3 = 1) plot(x.sub$X1,x.sub$X2, cex=.5) OR: z <- which(LRB$packing<2)# & LRB$subsp<2) plot(log(LRB$pop[z]),LRB$subdiv[z], cex=0.5) cor.test(LRB$pop[z],LRB$subdiv[z], cex=0.5)
Generalization G12.02 p 434 conditional: exclusively terrestrial resources conditional: population density has not reached reduced mobility options to movement within a small foraging area
- (not reached packing 1) packing < 1
subdiv increases linearly with log10 of pop #subsistence diversity #population density thus: x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$) #subsistence diversity x.df <- data.frame(x) x.sub <- subset(x.df, X3 = 2) plot(x.sub$X1,x.sub$X2, cex=.5)
this works x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$subsp) #subsistence diversity #subsp 1=terrestrial animals 2= terrestrial plants 3=aquatic x.df <- data.frame(x) x.sub <- subset(x.df, X3 < 2) #, X4 < 2) plot(x.sub$X1,x.sub$X2, cex=.5)
this doesnt but "should" how do I get an x.sub dataset with two conditions x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$subsp) #subsistence diversity #subsp 1=terrestrial animals 2= terrestrial plants 3=aquatic x.df <- data.frame(x) x.sub <- subset(x.df, X3 < 2 & X4 < 2) plot(x.sub$X1,x.sub$X2, cex=.5) For two conditions, use x.sub <- subset(x.df, X3 < 2 & X4 < 2) "&" ==and "|"==or
EFF: Either of these will work. The first uses different symbols for the two types; the second plots only one of the two types.
plot(LRB$pop,LRB$subdiv, cex=LRB$packing/2,pch=LRB$packing+20,col=LRB$packing)
z<-which(LRB$packing<2) plot(LRB$pop[z],LRB$subdiv[z], cex=0.5)
z<-which(LRB$packing<2 & LRB$subsp<2) plot(log(LRB$pop[z]),LRB$subdiv[z], cex=0.5)
Maps in R
http://journal.r-project.org/archive/2011-1/RJournal_2011-1_South.pdf
Where we go with SEM
- John Fox
- Re: Non-intertible matrices
- Dear Doug,
- Do you have at least as many instrumental variables as coefficients to estimate?
- John
- SEM#Build_data_for_Fox_SEM_of_HiGods_following_model_of_Klein ITS WORKING
- SEM OpenBUGS in R with MCMC
data() in its most common usage just tells R to load a stored data object from an attached package. if you do help("data") you will see there is another application if package=NULL, so that it will look in a folder data of the current working directory. that might be simplest for you.
do not confuse it with the parameter name data= in functions -- that generally just allows you to specify a data frame that gives bindings for variables.
the solution for you is to have in your working .GlobalEnv a data frame of any name, say "foo", and then you can use data=foo as one of the parameter settings for your sem call This worked
neff almost working
DW: I see the problem here, its the p742 (click the small blue "pdf" below) upper right formula Y = ... e*. That's probably lm(Y~V^.5), where e* is Y=mu + (V^.5)e* which is slightly out of my pay grade because e* is multiplicative. There's another step where sigma^2V^-1 is his covariance V not just the matrix V=W that I was using, which ought to bring the n* result into the proper range 1 < n* < n. Could one of you make those equations for me, starting with any old Y and W? With that addition I think the little R program will work. And n* is different for each variable so it can be listed in the output of our 2SLS scripts, so if this works feel free to use it.
Working Calculation for n*, neff, effective sample size with W matrix
- Griffith, Daniel A. 2005. Effective Geographic Sample Size in the Presence of Spatial Autocorrelation pdf. Annals of the Association of American Geographers 95(4): 740–760.
- p742 Suppose the n x n matrix V contains the covariation structure among n georeferenced observations (more precisely, matrix
is the covariance matrix), such that
, where Y denotes … attribute values, and mu denotes the population mean of variable Y…, and e and e* … denote the vectors of spatially autocorrelated and unautocorrelated errors…. If V = I, the n x n identity matrix, then the n observations are uncorrelated.
#A. Eff helped out here
ww<-as.matrix(read.dta("http://dl.dropbox.com/u/37813961/dist25wm.dta")[,c(-1,-2,-1,-189)])
#source(url('http://intersci.ss.uci.edu/wiki/R/source/Wlink.R'))
#ww<-Wlink
dim(ww)
round(ww[1:3,1:3],3)
round(rowSums(ww)[1:3])
ww<-ww/rowSums(ww) #making sure that ww is row normalized
round(ww[1:3,1:3],3)
round(rowSums(ww)[1:18])
n<-NROW(ww)
ww=cov(ww) # THIS IS V <--- NEW (THE FIX)
diag(ww)=0
round(ww[1:7,1:7],9)
round(ww[185:186,185:186],9)
w_inv<-solve(ww) #this is computationally singular unless the diag(ww)=0
round(w_inv[1:3,1:3],3) chk<-ww%*%w_inv # verifying that w_inv is indeed the inverse table(diag(chk));table(colSums(chk)) TRw_inv<-sum(diag(w_inv)) #trace of inverse one<-matrix(1,n,1) #column vector of ones nstar<-TRw_inv/(t(one)%*%w_inv%*%one)*n #formula nstar
n* whether or not ww is row normalized to 1, ww<-ww/rowSums(ww) 141.7 for Wlink 212.9 for Eff-Dow Distance matrix #inverse of V <--- where does Y come in? Y = mu + e = mu + V-1/2e*, where Y denotes … attribute values, and mu denotes the population mean of variable Y…, and e and e* … denote the vectors of spatially autocorrelated and unautocorrelated errors….
Construct dummy variables for FxCmtyW, Fratgrpstr
FxCmtyWages=((sccs$v2010>=2)*1)*((sccs$v61==6)*1) FxCmtyW=as.numeric(is.na(FxCmtyWages))
Fratgrpstr=sccs$v570 misFratgrpstr=as.numeric(is.na(Fratgrpstr)) table(FxCmtyW) FxCmtyW 0 1 112 74 so 1 means missing data.
Which cases?
pastName <- sccs$socnam[which(p == 1)]
p <- which(sccs$v1=='No Trade')
which(sccsdata$v1=='No Trade')
[ 1] 7 55 93 170 180 183 186
Try using the function 'which' with your expression inside.
which(Islam==2 & sccs$v238==1) [1] 83 http://eclectic.ss.uci.edu/~drwhite/worldcul/Sccs34.htm Javanese IslamXian=((sccs$v2002==2)*1)+((sccs$v2002==3)*1) table(sccs$v238,IslamXian) IslamXian 0 1 1 67 1 Javanese 2 47 0 3 13 0 4 17 23 IslamXian=((sccs$v2002==4)*1)+((sccs$v2002==5)*1) table(sccs$v238,IslamXian) IslamXian 0 1 1 58 10 2 41 6 3 10 3 4 31 9
Make a copy of sccs.Rdata named sccs2.Rdata or Strip some variables from comb.Rdata
- exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]
setwd('/Users/drwhite/Documents/3sls/sccs/')
load('comb.Rdata')
myvars <- names(comb) %in% c("Missions.f", "HiGod4.f") ##THESE .f are key
sccsdata <- comb[!myvars]
save(sccsdata,file='sccsdata.Rdata')
load('sccsdata.Rdata')
[1857] v1913 v1914 v1915 v1916 [1861] v1917 v19_22 v24_25 v26_27 [1865] v53_54 complex v159_177 fortitud [1869] aggressi competiv selfrely achievem [1873] industry responsi obedienc selfrest [1877] sexrestr id66 v2001_1806 v2002_1807 [1881] v713rev Islam Christian Buddhist [1885] Hindu HinduBuddhist Shaman2 GamesOfChance [1889] GamesOfSkill GamesOfStrategy NomEurasia NomEurasiaPacific [1893] NomAfroEurasia NomOldWorld NomNewWorldAfroPacific ExternalWar13 [1897] ExternalWar50 ExternalWar50Bands InterrnalWar46 InterrnalWar46Bands [1901] ExtWarFactor IntWarFactor FratIntGroups60 FratIntGroups60Bands [1905] FemalePower107 FemalePower107Bands MaleAggression50 MaleAggression50Bands [1909] Modernization102 Modernization102Bands PoliticalIntegration82 PoliticalIntegration82Band [1913] v1260 Rain.f Terrain.f Water.f [1917] Missions.f HiGod4.f
setwd('/Users/drwhite/Documents/3sls/sccs/')
load('comb.Rdata')
myvars <- names(comb) %in% c("HiGod4.f") ##THESE .f are key
sccs2 <- comb[!myvars]
save(sccs2,file='sccs2.Rdata')
load('sccs2.Rdata')
setwd('/Users/drwhite/Documents/3sls/sccs/')
load('comb.Rdata')
myvars <- names(comb) %in% c("Missions.f") ##THESE .f are key
sccs1 <- comb[!myvars]
save(sccs1,file='sccs1.Rdata')
load('sccs1.Rdata')
dy/dx, elasticity and derivatives
> D(expression(sqrt(1 - x^2)), 'x')
-(0.5 * (2 * x * (1 - x^2)^-0.5)) Doug: For example, if one estimates: y=a0+a1*x+a2*x^2+a3*ln(x) then dy/dx=a1+2*a2*x+a3/x In other words, this is just like any other calculation of a first derivative. No need to do any special estimation.
xxxx
Doug: OK, maybe I didn’t give such a great example. So imagine that one estimates y=f(x,z,v):
y=a0+a1*x+a2*x^2+a3*x*z +a4*ln(z)+a5*v
then
dy/dx=a1+2*a2*x+a3*z
dy/dz=a3*x+a4/z
dy/dv=a5
These are all partial derivatives, which give the ceteris paribus effect of a change in one variable at a time.
Note that only the last (dy/dv) is a scalar that one can introduce into a table next to a regression coefficient. And, since it is identical to the regression coefficient, it would be redundant to introduce it into the table.
The other two (dy/dx and dy/dz) are not scalars that can be introduced as a single number in a table--they are vectors, with as many values as there are observations (they are functions of x and z). That's why the best way to understand these is to plot them (against x and z, respectively)
In the code for the Farming and Fighting paper, the calculation of the marginal effects is done when we create the figures. For example, the calculation of the marginal effect of SCCS$v814 is done in this line:
dyd814<-axp*g[,"pdens"]+ax1+2*ax2*g[,"v814"]
#now apply that to fitting a smooth lowess function to f(y,x)
plot(cars, main = "lowess(cars)")
lines(lowess(cars), col = 2)
#lines(lowess(cars, f=.2), col = 3)
legend(5, 120, c(paste("f = ", c("2/3", ".2"))), lty = 1, col = 2:3)
lines(lowess(g[,"pdens"],jj*g[,"pdens"]),col="red") lines(lowess(sccsA[zdv,"v814"],y814),lty=2,col="red")
setwd('/Users/drwhite/Documents/3sls/sccs/')
load('comb.Rdata')
sccs<-data.frame(sapply(comb,function(x) as.numeric(x)))
plot(sccs$v814,sccs$v815) #YES
vars=data.frame(sccs$v814,sccs$v815) #missing data
#vars=data.frame(sccs$v5,sccs$v3) #no missing data
#plot(vars) #YES
varNoNA <- na.omit(vars)
sum(table(varNoNA))
plot(varNoNA) #YES
lines(lowess(varNoNA),lty=2,col="red")
##YES, PLOTS OK
xy= lines(lowess(vars),lty=2,col="red") ##NOT WORK
- Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74, 829–836.
- Cleveland, W. S. (1981) LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician, 35, 54.
Reff and Elasticity
Hi Doug --
I think I now understand that you are interested in measuring effects in percentage change terms. In economics, this is the concept of "elasticity": the percent change in one variable resulting from a given percent change in another. Here are some wikipedia articles on this notion:
http://en.wikipedia.org/wiki/Elasticity_(economics)
http://en.wikipedia.org/wiki/Price_elasticity_of_demand
As the latter item notes, the concept goes back at least to Alfred Marshall 1890, so you are in august company.
Simply put, elasticity = (dy/y) / (dx/x). Whenever dy / dx has a causal interpretation, so does the elasticity. Whether you use dy / dx or elasticity is purely a matter of convenience or interpretability in a given context.
In case it may be of interest, I attach a recent update of my paper with Karim where we define a variety of effects without recourse to the "do" operator.
I hope this helps.
Best, Hal
marginal effect marginal effects for binary or ordinal responses, computed as partial derivatives, are available from SAS/ETS PROC QLIM by specifying the MARGINAL option in the OUTPUT statement. While no procedure in SAS/STAT software directly computes marginal effects, the partial derivative can be computed using results from the procedure used to fit the model.
can we run GLM inside 2SLS-IS?
can we run the wikipedia:Generalized_linear_model inside 2SLS-IS? see Sccs 2SLS-IS downloads
Changed (lm in two_stage_ols.R to compute_stats <- function(glm.unrestricted,lm.restricted,errors,prox_list) { and got this error
8 3 data_to_impute...vars_missing_data.i.. 8 4 data_to_impute...vars_missing_data.i.. 8 5 data_to_impute...vars_missing_data.i.. 8 6 data_to_impute...vars_missing_data.i.. 8 7 data_to_impute...vars_missing_data.i.. 8 8 data_to_impute...vars_missing_data.i..
Error in linearHypothesis(lm.unrestricted, drop_vars, test = "Chisq") :
object 'lm.unrestricted' not found
In addition: Warning messages: 1: In min(y[test_idxs], na.rm = TRUE) :
no non-missing arguments to min; returning Inf
2: In max(y[test_idxs], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
Changed (lm in two_stage_ols.R to compute_stats TO function(glm.unrestricted,lm.restricted,errors,prox_list)
Changed (lm to glm here: linearHypothesis(glm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2]
and got a RESULT - can this be right? YES, because I then restarted R, and did not include the earlier two_stage_old.R
coef range effect ratio Fstat ddf pvalue VIF abs.ratio %coded tot.cor part.cor
(Intercept) -9.512 NA NA NA 0.727 2849.468 0.39399615 NA NA 0.000 0.000 0.000
language 1.346 2 2.691 0.096 7.484 2200.414 0.00627649 1.306 0.096 0.000 0.207 -0.023
cultints 0.805 5 4.027 0.144 5.691 755.965 0.01730165 1.868 0.144 1.000 0.162 0.091
roots -1.995 1 -1.995 -0.071 2.770 233.726 0.09741558 1.196 0.071 1.000 -0.081 -0.097
fish 0.558 9 5.019 0.179 4.864 2579.736 0.02750355 1.205 0.179 1.000 0.051 0.078
exogamy -0.860 4 -3.438 -0.123 5.160 592.550 0.02347549 1.115 0.123 0.995 -0.111 -0.167
settype -0.450 7 -3.152 -0.113 3.858 858.149 0.04983512 1.678 0.113 1.000 0.005 -0.069
femsubs 0.458 8 3.667 0.131 2.189 1030.255 0.13933123 1.224 0.131 0.995 0.094 0.040
Train R2:final model Train R2:IV_language
0.09759433 0.97002048
Fstat df pvalue
RESET 0.001 73.273 0.974
Wald.on.restrs 1.138 228.137 0.287
NCV 0.406 58.725 0.527
SW.normal 0.763 2455.566 0.382
lag..language 1.568 316525.108 0.211
coef range effect ratio Fstat ddf pvalue VIF abs.ratio %coded tot.cor part.cor
(Intercept) -13.190 NA NA NA 0.940 130.471 0.33415460 NA NA 0.000 0.000 0.000
language 1.498 2 2.996 0.107 6.245 121.638 0.01378760 1.309 0.107 0.000 0.209 -0.023
cultints 0.762 5 3.810 0.136 5.156 854.501 0.02341128 1.860 0.136 1.000 0.154 0.084
roots -1.969 1 -1.969 -0.070 2.816 416.877 0.09406955 1.196 0.070 1.000 -0.078 -0.095
fish 0.517 9 4.654 0.166 4.143 1609.501 0.04197396 1.205 0.166 1.000 0.041 0.070
exogamy -0.815 4 -3.259 -0.116 4.507 318.360 0.03453283 1.112 0.116 0.995 -0.102 -0.159
settype -0.441 7 -3.090 -0.110 3.495 289.899 0.06257948 1.681 0.110 1.000 0.002 -0.072
femsubs 0.482 8 3.855 0.138 2.392 964.099 0.12227872 1.233 0.138 0.995 0.108 0.049
Train R2:final model Train R2:IV_language
0.08700948 0.96158666
Fstat df pvalue
RESET -0.023 143.632 1.000
Wald.on.restrs 1.278 185.277 0.260
NCV 0.640 81.212 0.426
SW.normal 0.545 3139.683 0.461
lag..language 1.613 84002.356 0.204
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
How to recover Imputed data values and dummy for missing data
Dichotomize NA versus Others to make dummy coded/missing cases for networks of variables
dich=is.na(sccs$v287) di287=as.numeric(is.na(sccs$v287)) table(dich,di287)
di=is.na(sccs$v287) di287=as.numeric(is.na(sccs$v287)) table(di,di287) http://escholarship.org/uc/item/7cm1f10b retrieve Anthon's file
SEE RESULTS AT Draft#HiGod4 for FxCmtyWages, Full imputation test and gives better results, "more positive cases" Meaning of Positive correlation of imputed with Coded variables
SFI Support
Bug Reports
here I attach a detailed bug description, and a R source to replicate it.
- Giorgio Ubuntu bug reports W %*% X (following data imputation) produces an error in an R version for Ubuntu linux
- But it was one compiled by Giorgio so the solution may not be general
- new Ubuntu_run_model.R
change in two_stage_ols.R and two_stage_ols_full.R imputed_data <- cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2])) to imputed_data <- cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2], stringsAsFactors=FALSE))
Anthon Eff codebook NA
The dropbox codebook at http://dl.dropbox.com/u/9256203/ccc.txt correctly indicates a number of variables in sccsfac.Rdata that lack NA
- BUT SOMETIMES THE MISSING DATA ARE NOT CORRECTLY CODED as NA
http://dl.dropbox.com/u/9256203/sccsfac.Rdata #Eff dropbox load('sccsfac.Rdata') sccs<-data.frame(sapply(sccsfac,function(x) as.numeric(x)))
I searched in the codebook for "don't know" etc.
sccs$v271:5 sccs$v273:5 sccs$v917:4 sccs$v917:7 sccs$v1648 sccs$v1649 sccs$v1650 sccs$v1651 sccs$v1652 sccs$v1653 sccs$v1654 sccs$v1659 sccs$v1676 sccs$v1677 sccs$v1678 sccs$v1683 sccs$v1684 sccs$v1685 sccs$v1694
Variable to delete
sccs$externalwar13
Spelling changes
sccs$v763 Ude -> Use sccs$v908 evan -> even sccs$v918 nt -> not sccs$v1249:6 prp -> pro
These are sample N=coded for numerically summarized variables
> sum(table(sccs$v871)) [1] 147 > sum(table(sccs$v872)) [1] 145 > sum(table(sccs$v1753)) [1] 96 > sum(table(sccs$v1848)) [1] 135 > sum(table(sccs$v1849)) [1] 135 > sum(table(sccs$v1861)) [1] 186 > sum(table(sccs$v1862)) [1] 93
Other Ns to do:
sccs$v1865 to 1873 sccs$v1873 to 1880 sccs$v1911 sccs$v1913 to 1917 sccs$v1975 sccs$v1977 to 1978 sccs$v2000
NaNs not NA
sccs$v1805.1 sccs$v1805.2 sccs$v1805.3 sccs$v1805.4 sccs$v1805.5 sccs$v1805.6 sccs$v1805.7 sccs$v1805.8
How to do 1st PCA for variables before my_sccs
#Compute the PCAP variable for Brown and Eff (courtesy of Eff)
is.na(sccs$v237[c(115,172)])<-TRUE
is.na(sccs$v63[88])<-TRUE
twoVARS<-sccs[,c("v63","v237")]
z<-which(!is.na(rowSums(twoVARS)))
pcdat2<-princomp(twoVARS[z,])
pc1<-pcdat2$scores[,1]
sccs$pc1[z]<-pc1
length(sccs$pc1) #186
#pc1[172]=NA
#pc1[115]=NA
#pc1[88]=NA
$FORGET THE REST BELOW
#Compute the PCAP variable for Brown and Eff (courtesy of Eff)
is.na(sccs$v237[c(115,172)])<-TRUE
is.na(sccs$v63[88])<-TRUE
twoVARS<-sccs[,c("v63","v237")]
z<-which(!is.na(rowSums(twoVARS)))
pcdat1<-princomp(twoVARS[z,])
pc1<-pcdat1$scores[,1] #---first principal component
length(pc1)
pc1[172]=NA
pc1[115]=NA
pc1[88]=NA
#FORGET THE REST BELOW http://stat.ethz.ch/R-manual/R-devel/library/stats/html/princomp.html #--loop through cases for missing data and do 1st PCA -- twoVARS=cbind(sccs$v921,sccs$v928) #--loop through cases for missing data -- for (i in 1:nimp){ #z<-which(my_sccs$.imp==i) z<-which(twoVARS==NA) #???? newvar[which(newvar==88)] = NA z<-which(my_sccs$v921==NA) #???? newvar[which(newvar==88)] = NA #princomp(twoVARS) pcdat=princomp(twoVARS) #pcdat$scores PCcap1=round(pcdat$scores[1:186],4) sccs$v1918=PCcap1
see source("examples/create/create_EduR_1/create_EduR_1.5DistB_EffHiddenVariablesV238AP1_2.R")
sccs$v237[172]=NA sccs$v237[115]=NA sccs$v63[88]=NA v237= sccs$v237 z<-which(v237==NA) v63= sccs$v63 z<-which(v63==NA) twoVARS=cbind(v63,v237) z=which(twoVars=NA) #princomp(twoVARS) pcdat2=princomp(z,twoVARS)
#newvar[which(newvar==0)] = NA #newvar[which(newvar==88)] = NA #newvar<-sccs$v237 #sccs$v237=newvar sccs$v237[172]=NA sccs$v237[115]=NA v237= sccs$v237 z<-which(v237==NA) v237= sccs$v237 sccs$v63[88]=NA z<-which(twoVARS==NA) twoVARS=cbind(sccs$v63,sccs$v237) z<-which(twoVARS==NA) pcdat2=princomp(z,twoVARS) PCsize1=round(pcdat2$scores[1:186],4) sccs$v1919=PCsize1
Compute R2 between significance and relative effect ratios
x=c(176,237,96,208,102)
y=c(43,34,102,8,41)
table(x,y)
y
x 8 34 41 43 102
96 0 0 0 0 1 102 0 0 1 0 0 176 0 0 0 1 0 208 1 0 0 0 0 237 0 1 0 0 0 cor(x,y) 1] -0.6998707 cor(x,y)^2 1] 0.4898189 xL=c(237,293,364,187,252,212,143) yL=c(91,15,21,2,45,39,65) cor(xL,yL)^2 1] 0.09697337 xM=c(156,5,216,72,5,99,36,279,186,106,174,85) yM=c(303,975,261,244,975,384,664,92,250,80,53,97) cor(xM,yM)^2 1] 0.5300817 .9^3 1] 0.729 .95^3 1] 0.857375 .95^2 1] 0.9025
Drop box
- http://dl.dropbox.com/u/9256203/sccsfac.Rdata
- https://dl.dropbox.com/u/37813961/sccsfac.Rdata #same as above
- https://dl.dropbox.com/u/37813961/sccs.Rdata #same as above
- https://dl.dropbox.com/u/37813961/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R
- https://dl.dropbox.com/u/37813961/two_stage_ols.R
- https://dl.dropbox.com/u/37813961/run_model_79.R
- https://dl.dropbox.com/u/37813961/run_model_100.R
- https://dl.dropbox.com/u/37813961/averageAll.R
- http://rss.acs.unt.edu/Rdoc/library/base/html/source.html
- source("https://dl.dropbox.com/u/37813961/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
- https://dl.dropbox.com/u/37813961/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R
- https://dl.dropbox.com/u/37813961/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2Missions.R
- https://dl.dropbox.com/u/37813961/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R
Dropbox
- http://dl.dropbox.com/u/9256203/ccc.txt #Anthon Eff codebook for the R data file sccs.Rdata
- http://dl.dropbox.com/u/37813961/dist25wm.dta.zip (put in root) sccs/examples/data -RightClick to download, dbl click zip file, dist25wm.dta appears in that directory (made by mac FINDER - FILE - Compress)
- http://dl.dropbox.com/u/37813961/langwm.dta.zip (put in root) sccs/examples/data --RightClick to download, dbl click zip file, langwm.dta appears in that directory (made by mac FINDER - FILE - Compress)
- http://dl.dropbox.com/u/9256203/langwm.dta #MISSING Anthon Eff (stata file) move to
- https://dl.dropbox.com/u/37813961/two_stage_ols.R #DRWHITE (root) sccs/R_3_s_ols/ ------------- script
- https://dl.dropbox.com/u/37813961/average.R #DRWHITE (root) sccs/R_3_s_ols/ ------------- script
- http://dl.dropbox.com/u/37813961/vaux.Rdata #DRWHITE (root) sccs/examples/data --------- auto download
- http://dl.dropbox.com/u/37813961/sccs.Rdata #DRWHITE (root) sccs/examples/data --------- auto download
- https://dl.dropbox.com/u/37813961/run_model_100.R #DRWHITE (root) sccs/examples/create/R_run --- script
- https://dl.dropbox.com/u/37813961/run_model_79.R #DRWHITE (root) sccs/examples/create/R_run --- script
- https://dl.dropbox.com/u/37813961/create #DRWHITE (root) sccs/examples/create/create_EduR_1/ --- script
- http://dl.dropbox.com/u/9256203/vaux.Rdata #Anthon Eff
- http://dl.dropbox.com/u/9256203/dist25wm.dta #MISSING Anthon Eff (stata file)
- https://www.dropbox.com/home/Public?select=two_stage_ols.R#/Public:::
- public
- The Public Folder lets you easily share single files in your Dropbox. Any file you put in this folder gets its own Internet link so that you can share it with others-- even non-Dropbox users! These links work even if your computer’s turned off.
- Step 1: Drop a file into the Public folder.
- Step 2: Right-click/control-click this file, then choose Dropbox > Copy Public Link. This copies the Internet link to your file so that you can paste it somewhere else.
- That's it! You can now share this file with others: just paste the link into e-mails, instant message conversations, blogs, etc.!
I searched in Eff’s R codebook for "don't know" and other indicators of missing data without NA.
sccs$v271:5 sccs$v273:5 sccs$v917:4 sccs$v917:7 sccs$v1648 sccs$v1649 sccs$v1650 sccs$v1651 sccs$v1652 sccs$v1653 sccs$v1654 sccs$v1659 sccs$v1676 sccs$v1677 sccs$v1678 sccs$v1683 sccs$v1684 sccs$v1685 sccs$v1694
Variable to delete
sccs$externalwar13 (too few variables)
Spelling changes in codebook needed for these words
sccs$v763 Ude -> Use (ok in main codebook) sccs$v908 evan -> even (ditto) sccs$v918 nt -> not (ditto) sccs$v1249:6 prp -> pro (ditto)
NaNs not NA
sccs$v1805.1 sccs$v1805.2 sccs$v1805.3 sccs$v1805.4 sccs$v1805.5 sccs$v1805.6 sccs$v1805.7 sccs$v1805.8
Source
Install or start R: go to R/Package & Data/Package installer, search for "igraph", install,
then Cut and paste into R as a command
setwd('/Users/drwhite/Documents/3sls/sccs/') # library(foreign)
load('sccs.Rdata')
sccs<-data.frame(sapply(comb,function(x) as.numeric(x))) #data(sccs)
#source("http://intersci.ss.uci.edu/wiki/R/source/model/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
source("http://intersci.ss.uci.edu/wiki/R/source/model/0CreateModelValueSDWchildren.R") #vars problem, sccs not found
source("http://intersci.ss.uci.edu/wiki/R/source/model/
source("http://intersci.ss.uci.edu/wiki/R/source/model/
source("http://intersci.ss.uci.edu/wiki/R/source/model/create_EduR_1.5DistBrownEffSign.R")
source("http://intersci.ss.uci.edu/wiki/R/source/model/create_EduR_1.5DistB_EffHiddenVariables.R")
source("http://intersci.ss.uci.edu/wiki/R/source/two_stage_ols.R")
source("http://intersci.ss.uci.edu/wiki/R/source/run_model_100.R")
source("http://intersci.ss.uci.edu/wiki/R/source/averageAll.R") #"output" is the name of the active output file
On Mac:
source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
#source("dropbox create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
source("R_3_s_ols/two_stage_ols.R")
source("examples/create/R_run/run_model_100.R")
source("examples/create/R_run/averageAll.R") #"output" is the name of the active output file
source("examples/create/R_run/averageAll.R")
Alternate urls for drw home pages
- https://dl.dropbox.com/u/37813961/DouglasR.White_Anthropologist_and_SociologistBritishHomePage.webarchive http://eclectic.anthrosciences.org/~drwhite/
- https://dl.dropbox.com/u/37813961/Douglas%20R.%20White,%20Anthropologist%20UCI.webarchive
- http://intersci.ss.uci.edu/wiki/pdf/Getting_Started.pdf Dropbox explanation] - SCCS data and R code - Visit MAIN SITE
Crucial change for SCOTT to make: adjust relimp to R2 for 2SLS Oct 2011
Brown and Eff 2010 code - Mac See: http://intersci.ss.uci.edu/wiki/index.php/Brown_and_Eff_2010_code_-_Mac #--adjust relimp to R2 for 2SLS-- xz<-bbb$relimp[2:NROW(bbb)] wt<-r2[1]/sum(xz) bbb$relimp[2:NROW(bbb)]<-xz*wt
OLS reminder
lmout=lm(sccs$v238 ~ output$no_rain_dry +output$superjh +output$milk +output$AnimXbwealth +output$FxCmntyWage +output$logdate) #) #output$fratgrpstr +output$pigs +output$eeextwar +output$Missions +output$socplex + lmout summary(lmout) vif(lmout)
Correction for missing data in v1650 and v1685
comp.Rdata and comb2.Rdata updated - always check table(comp$v1650) and table(comp$v1685)
#convert factor -->as.numeric newvar<-sccs$v1650 newvar[which(newvar==0)] = NA newvar[which(newvar==88)] = NA sccs$v1650=newvar #convert numeric -->as.factor newvar<-comb$v1650 newvar[which(newvar==0)] = NA newvar[which(newvar==88)] = NA comb$v1650=newvar
#convert factor -->as.numeric newvar<-sccs$v1685 newvar[which(newvar==0)] = NA newvar[which(newvar==8)] = NA sccs$v1685=newvar #convert numeric -->as.factor newvar<-comb$v1685 newvar[which(newvar==0)] = NA newvar[which(newvar==8)] = NA comb$v1685=newvar
save(comb, file='comb.Rdata')
lmout=lm(dep_var~PopdenEffect) lmout summary(lmout) vif(lmout)
How to normalize and average data in a data frame
PLOTTING http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html http://stat.ethz.ch/R-manual/R-patched/library/stats/html/scatter.smooth.html http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/plot.html http://www.astrostatistics.psu.edu/su07/R/html/graphics/html/plot.html http://rss.acs.unt.edu/Rdoc/library/tutoR/html/plot.html
Ideally I want to scale the values for each column in the range of (-1,1)
?scale
Ideally I want to scale the values for each column in the range of (-1,1)
?scale #A <- matrix(c(2,3,-2,1,2,2),3,2)
popscale153=scale(sccs$v153) popscale156=scale(sccs$156) popscale1130=scale(sccs$1130) popscale=matrix(c(popscale153,popscale156,popscale1130),186,3)
popscale=matrix(c(scale(sccs$v153),scale(sccs$v156),scale(sccs$v1130)),186,3)
popdenmean=round(1.5*rowMeans(popscale),0)
popdenmean
-2 -1 0 1 2
Absent 17 21 13 9 9
Present, inactive, unconcerned 11 9 12 11 8
Present, active, unconcerned with humans 4 4 7 4 4
Present, active, supportive of morality 2 8 8 16 9
newvar=sccs$v1650 newvar[which(newvar==88)] = NA Replace 0 with NA plot(popdenmean,scale(sccs$v1650)) plot(popdenmean,scale(newvar))
popscale=matrix(c(scale(sccs$v153),scale(sccs$v156),scale(sccs$v1130)),186,3) popmean=rowMeans(popscale) max(popmean) min(popmean) mean(popmean) #~zero v1650=sccs$v1650 v1650[which(v1650==0)] = NA plot(popmean,scale(v1650)) #sccs$v1650 #http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html Scatter Plot Smoothing #lowess(x, y = NULL, f = 2/3, iter = 3, delta = 0.01 * diff(range(xy$x[o]))) x = popmean y = scale(v1650) lowess(x, y = NULL, f = 2/3, iter = 3, delta = 0.01 * diff(range(xy$x[o])))
AgpotScale=matrix(c(scale(sccs$v921),scale(sccs$v928)),186,2) #AgriPot 1&2 Agpotmean=round(rowMeans(AgpotScale),1)
SizeJHscale=matrix(c(scale(sccs$v63),scale(sccs$v237)),186,2) #CmtySize & Superjh SizeJHmean=round(rowMeans(SizeJHscale),1)
Replace PCA from Psychometric package
twoVARS=cbind(sccs$v921,sccs$v928) #princomp(twoVARS) pcdat=princomp(twoVARS) #pcdat$scores PCcap1=round(pcdat$scores[1:186],3) sccs$v1918=PCcap1
sccs$v237[172]=5 sccs$v237[115]=3 sccs$v63[88]=3 twoVARS=cbind(sccs$v63,sccs$v237) #average CmtySize with Superjh #princomp(twoVARS) pcdat2=princomp(twoVARS) #pcdat$scores PCsize1=round(pcdat2$scores[1:186],3) sccs$v1919=PCsize1
White-Murdock alignment
from Murdock and White 1969 EduMod84#Coords Wlink=matrix(data = 0, nrow = 186, ncol = 186) for(i in 1:185) { Wlink[i,i+1]=1 Wlink[i+1,i]=1} for(i in 1:184) { Wlink[i,i+2]=.8 Wlink[i+2,i]=.8} for(i in 1:183) { Wlink[i,i+3]=.4 Wlink[i+3,i]=.4} for(i in 1:182) { Wlink[i,i+4]=.2 Wlink[i+4,i]=.2} Wlink=Wlink/rowSums(Wlink)
New problems and possibilities e.g., Replace, Dichotomize NA / Others
- replace(x,is.na(x),0)) [R] how to replace NA values in a list [http://stackoverflow.com/questions/2767219/r-how-to-replace-elements-of-a-data-frame
- data[which(data==0)] = NAReplace]
sccs$v2009=sccs$v1009
sccs$v2011=3-(sccs$v1009<=3)*1+(-1+(sccs$v1009==5)*1)
sccs$v2011= replace(sccs$v2009,is.na(sccs$v2009),0)
sccs$v1732= replace(sccs$v1732,is.na(sccs$v1732),0)
newvar<-apply(sccs[,c("v2011","v1732")],1,max)
newvar[which(newvar==0)] = NA
sccs$v2010=newvar
Replace 0 with NA
table(sccs$v1732,sccs$v2011)
0 1 2 3 4 5 6 7
0 16 7 2 0 4 5 2 0
1 10 3 1 2 1 2 0 0
2 0 0 0 0 1 1 5 2
3 2 1 0 0 3 2 0 1
table(sccs$v2009,sccs$v2011)
1 2 3
1 13 0 0 #No wage labor
2 3 0 0 #No wage labor or coerced
3 2 0 0 #No wage labor or coerced
4 0 10 0 #Hired/Wage
5 0 0 12 #Migrant wage
6 0 9 0 #Hired/Wage
7 0 3 0 #Hired/Wage
#analysis v1009 is WorldSys labor
sccs$v2009=sccs$v1009 #Proxy for worldsys labor 1-7 N=52
sccs$v2732=sccs$v1732 #Proxy for Wage labor 1-3 N=89
sccs$v2011=3-(sccs$v1009<=3)*1+(-1+(sccs$v1009==5)*1) #Proxy for a labor variable 1-7 N=52 NOW AS 1 2 3 values
sccs$v2011= replace(sccs$v2011,is.na(sccs$v2011),0) #Replace NA with 0 keep 1-2-3
sccs$v2732= replace(sccs$v2732,is.na(sccs$v2732),0) #Same for v1732 labor variable
newvar<-apply(sccs[,c("v2011","v2732")],1,max) #Take max (i.e., nonzero)
newvar[which(newvar==0)] = NA #Replace 0 with NA N=112 WHY 1-7???
sccs$v2010=newvar #N=112
FxCmtyWages 1 2 3 v1732
0 20 8 15 <-coding 0 error here hard to fathom FxCmtyWages SUM IS N=112 v1732
1 13 0 0
2 1 14 0
3 2 0 16
table(sccs$v1732,sccs$v2010) #boosts overlap of 112 and 89 from to 89? GOOD PROXY
1 2 3
1 31 2 3
2 0 21 1
3 0 0 31
GOOD TO HERE
#analysis v1009 is WorldSys labor
sccs$v2009=sccs$v1009 #Proxy for worldsys labor 1-7 N=52
sccs$v2732=sccs$v1732
sccs$v2011=3-(sccs$v1009<=3)*1+(-1+(sccs$v1009==5)*1) #Proxy for a labor variable 1-7 N=52 NOW AS 1 2 3 values
sccs$v2011= replace(sccs$v2011,is.na(sccs$v2011),0) #Replace NA with 0 keep 1-2-3
sccs$v2732= replace(sccs$v2732,is.na(sccs$v2732),0) #Same for v1732 labor variable
newvar<-apply(sccs[,c("v2011","v2732")],1,max) #Take max (i.e., nonzero)
newvar[which(newvar==0)] = NA #Replace 0 with NA N=112 WHY 1-7???
sccs$v2010=newvar #N=112
sccs$v200==1 #Use 84% no wage labor for region 1 (Africa) #
dich=is.na(sccs$v287)
ndich=as.numeric(dich)
ndich=as.numeric(is.na(sccs$v287))
table(dich,ndich)
ndich
dich 0 1
FALSE 177 0
TRUE 0 9
- Do this for FxCmtyWages
di287=as.numeric(is.na(sccs$v287)) table(dich,di287)
- Use imputed data compared to real data for dep_vars in causal graph?
- Then create dummy variables for missing v. coded 1/0 subsamples for each dep_var in influence graph
- Laura might be able to do phylogentic modeling on HiGod4
- What of cases like Superjh that are not PRIMARY but come out only when other variables are present in th model
Using the 2sls-is program
Maps and Plots
Brown-Eff Maps source("http://intersci.ss.uci.edu/~drwhite/0/world_countries_24.rda")
- Extra_code#Brown-Eff_map
- special symbols for missing data
- Moral vs. Punative High Gods vs. Indifferent or No High gods
Plots
Link for forums: [URL=http://www.zshare.net/download/94271729236d1be1/]SCCS.zip - 0.90MB[/URL]
Week2: Breaking out the EduR_1.5 model for a causal graph
- EduS_1 Depvar: Moral Gods
- EduS_2 Depvar: Missions
- EduS_3 Depvar: Animals (Pastoral
- EduS_4 Depvar: Bridewealth
- EduS_5 Depvar: AnimBwealth
- EduS_6 Depvar: FxCmty
- EduS_7 Depvar: WageLabor
- EduS_8 Depvar: FxCmtyWageLabor
- EduS_9 Depvar: Superjh
Controls for Islam and Xianity
No run HiGods without Snarey codes
HiGod4 - > sccs$v238 no_rain_dry=sccs$Rain+(sccs$v855>4)*1, wet=1+((sccs$v929>8)*1) +((sccs$v929>3)*1) ##+((sccs$v929>7)*1) #DOES NOT CORRELATE with SNAREY$Rain sccs$v1913 # NoRain=3 - (sccs$v1913>24)*1) + (sccs$v1913>101.6)*1) [1] 11.155 35.425 86.950 69.790 128.050 165.890 123.740 106.625 85.725 100.525 107.050 146.270 193.155 [14] 180.015 263.695 175.779 240.685 104.220 142.300 370.240 100.385 93.480 115.110 22.820 55.072 132.085 [27] 35.515 140.315 31.295 84.780 73.260 123.120 145.335 120.220 74.420 20.699 151.620 37.755 0.175 [40] 4.055 4.515 56.790 0.140 58.845 12.220 12.702 35.661 127.505 80.654 34.700 99.617 52.625 [53] 39.305 52.881 172.292 59.708 36.794 33.668 69.512 159.650 159.079 153.555 104.037 167.888 57.365 [66] 8.635 118.979 40.900 240.914 253.535 84.120 183.220 185.858 161.188 131.469 152.819 263.390 296.760 [79] 296.760 149.565 271.965 185.433 178.310 178.059 330.675 206.810 265.060 214.060 92.995 129.056 25.870 [92] 448.100 447.380 244.500 246.420 330.500 237.571 212.786 366.063 398.167 310.360 271.260 199.565 139.445 [105] 113.015 273.550 303.625 578.967 363.800 303.976 373.205 328.980 164.560 111.360 49.991 107.475 177.705 [118] 103.579 59.456 29.054 34.602 51.790 83.560 25.465 83.794 105.772 49.742 31.314 40.090 259.290 [131] 131.769 121.357 321.934 103.848 90.974 14.437 29.210 34.558 51.640 33.624 43.635 57.105 83.880 [144] 78.140 113.646 131.885 54.425 39.725 36.797 60.771 21.752 83.996 68.951 266.653 82.590 319.786 [157] 372.220 245.580 46.472 131.722 198.440 120.614 246.085 224.531 303.106 271.825 344.485 202.070 48.700 [170] 148.030 62.220 58.915 142.695 121.593 182.300 107.461 118.015 159.780 163.379 192.237 164.705 121.579 [183] 129.545 56.626 12.966 55.375
over 40" = over 101.6 cm 10" = 25.4 cm
NoRain=3 - ((sccs$v1913>24)*1) + ((sccs$v1913>101.6)*1) NoRain= 1+((sccs$v1913>24)*1) + ((sccs$v1913>101.6)*1) wet=1+((sccs$v929>8)*1) +((sccs$v929>3)*1) ##+((sccs$v929>7)*1) #DOES NOT CORRELATE with SNAREY$Rain
Source files
source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R") source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2Missions.R") source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R")
# THE DATA in comb.Rdata are not yet released for publication
# IF YOU DONT HAVE comb.Rdata edit out $Missions and $lo_water from your create file and use a new Which(sccs$v....)
setwd('/Users/drwhite/Documents/3sls/sccs/') #change to YOUR directory # includes library(foreign)
load('comb.Rdata') #alt: just load('sccs.Rdata') #for comb.Rdata you have added new variables such as Missions
sccs<-data.frame(sapply(comb,function(x) as.numeric(x))) #data(sccs)
dep_var<-sccs$HiGod4 #sccs$v238 if you start with load('sccs.Rdata')
z<-which(sccs$Missions!=2) #option to modify for a subsample dependent on selection variable
is.na(dep_var[z])<-TRUE #option to set chosen selection variable values to missing
dep_var #option to use this modified dep_var in
#sccs$v238=dep_var #optional but woulf change sccs until reloaded
# (73 observations will be deleted due to missionization)
#Access to cached software
source("http://intersci.ss.uci.edu/wiki/R/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R")
source("http://intersci.ss.uci.edu/wiki/R/two_stage_ols.R")
#source("http://intersci.ss.uci.edu/wiki/R/run_model_79.R") #option for Train-Test inferential states
source("http://intersci.ss.uci.edu/wiki/R/run_model_100.R") #OR: full sample
source("http://intersci.ss.uci.edu/wiki/R/averageAll.R")
#OR# as configured on your own directories
source("examples/create/create_EduR_1/create_EduR_1.5DistLangBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R")
source("R_3_s_ols/two_stage_ols.R")
source("examples/create/R_run/run_model_100.R")
source("examples/create/R_run/averageAll.R") ##averages
average(ols_stats$imputed_data) #independent variables average for imputed missing data output=average(ols_stats$imputed_data) #output but not in data frame 'class(output)' #output now in a class to make data frame output = as.data.frame(output) #Once it's a data frame it should [almost] behave just like my_sccs. lmout=lm(sccs$v238 ~ output$no_rain_dry +output$superjh +output$milk +output$AnimXbwealth +output$FxCmntyWage +output$logdate) #) #output$fratgrpstr +output$pigs +output$eeextwar +output$Missions +output$socplex + lmout summary(lmout) vif(lmout)
Science article
- Comments of Simon Dedeo
- population structure needed for "group selection"
- run longitudinal gini indices in Europe, USA re: rise and fall in religious memberships
- Outline
- Science style lit.
- Chamberlain mult. hyp.
- Background
- Multiple hypotheses (table 1)
- Stat. Results (discuss)
- Tolga to look at how Science articles report and format their results
- Comparison of 2sls and OLS
- Unexpected and Serendipitous findings
- Using backloops and control variables (SOM)
- Need for influence diagrams and controls
- Conclusions
- Supp. Online Materials
- Giorgio and Doug to work on the use of effect ratios to compare OLS with 2sls
- New diagram
- Science issue
- 4500 words: research article (we stated with 5200)
- 2500 words: research report
- 3500 survey, solicited by editor
- ... also perspectives
Readings for the SFI working group Sept 11, 2011
Discussion issues
- Prosociality and Punishment the big issue in new evolution of religion articles -- basically game theoretic not biological
- Is there any selection operating at the genetic level?
- What exactly does David Sloan Wilson mean by group selection (review his major book), get Duran Bell reaction from email
- Social evolution are likely for CmntyWageLabor because cause is episodic
- Genetic condition of Lactose tolerence with AnimXbwealth, after, cause is episodic
- Ecological dryness is long-term however.
- Write the abstract model as Do operators, filled in by Bayesian results
- Idea of a simulation (following Cosma) with a time dimension
- Can we follow the Imai, et al. Causal Mediation in R in the format of treatment and antecedent conditions?
- We can create new variables from MI of missing data for Paige and Paige
Working R Causal mediation package
- R causal mediation package
- Imai, E., L. Keele, D. Dingley, and T. Yamamoto. 2010. mediationR.pdf article Chapter 8 Causal Mediation Analysis Using R. In, Introduction to statistical mediation analysis By David Peter MacKinnon
- EduR-1.9
- http://imai.princeton.edu/research/BaronKenny.html Imai, Kosuke, Luke Keele, and Dustin Tingley. (2010). ``A General Approach to Causal Mediation Analysis. Psychological Methods, Vol. 15, No. 4 (December), pp. 309-334. (lead article)
- The companion paper: Imai, Kosuke, Luke Keele and Teppei Yamamoto. ``Identification, Inference, and Sensitivity Analysis for Causal Mediation Effects."
- We have developed easy-to-use software and have written a paper that explains its use with some examples: Imai, Kosuke, Luke Keele, Dustin Tingley and Teppei Yamamoto. Causal Mediation Analysis Using R.
- R causal mediation package - MediationR.pdf - -Mediation manual
- Causal Mediation using R
- Judea Pearl#2011_Bayesia Pearl 2011: and citing papers by [http://imai.princeton.edu/research/index.html Imai et al. (Research papers)
See also
Ilya Shpitser, Tyler J Vanderweele. 2010. A Complete Graphical Criterion for the Adjustment Formula in Mediation Analysis. The international journal of biostatistics 7, 2, 1. http://www.mendeley.com/research/complete-graphical-criterion-adjustment-formula-mediation-analysis/
Readings
2SLS
- White, Oztan, Gosti, and Snarey, 2011 (SFI) Explaining the Evolution of Moralizing Gods (draft, not for circulation)
- Brown and Eff, 2009. http://intersci.ss.uci.edu/wiki/pdf/eScholarshipBrown&Eff.pdf
- White, Ren Feng, Gosti, Oztan, 2011 (Leipzig) R Software for Regression with Inferential Statistics (2sls-is) http://intersci.ss.uci.edu/wiki/pdf/2sLS-isReOrg6aShort4.pdf
- Eff and Dow, 2010. http://intersci.ss.uci.edu/wiki/pdf/eScholarshipEff&Dow2009.pdf
Sequential Bonferroni tests
- Rice 1989 http://www.jstor.org/stable/2409177
- Holm 1969 http://www.jstor.org/stable/4615733
- http://intersci.ss.uci.edu/wiki/xls/Bonferroni.xlsx
Multiple working hypotheses: Required reading
- Chamberlain, Thomas C. 1890, The method of multiple working hypotheses: Science, v. 15, p. 92–96. Find this article online [1] clean version in pdf Commentary
- Platt, John R. 1964. Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 46(3642):347-353.
http://pages.cs.wisc.edu/~markhill/science64_strong_inference.pdf
Confounding and Simulations
- Shalizi, Cosma R., and Andrew C. Thomas. 2011. Homophily and Contagion Are Generically Confounded in Observational Social Network Studies. Sociological Methods & Research 40: 211-239. http://smr.sagepub.com/content/40/2.toc.
- Snijders "Spread of Evidence-Poor Medicine" http://www.lists.ufl.edu/cgi-bin/wa?A2=ind1106&L=SOCNET&P=37951
SEM
- Fox, John. 2006. Structural Equation Modeling with the sem Package in R. Structural Equation Modeling 13(3): 465–486. http://socserv.socsci.mcmaster.ca/jfox/Misc/sem/SEM-paper.pdf
- http://jeromyanglim.blogspot.com/. 2009. Structural equation Models(SEM) from John Fox on 2009-12 ...
Bayesian
- Shalizi, Cosma R. 2009/2011. Graphical Models blog
- Shalizi, Cosma R. 2011. Causality blog - Beware his misunderstanding of Don Rubin's propensity score methods - see Controversy under Judea Pearl#Articles.
- Causal Graphs Laboratory paper: http://www.santafe.edu/education/schools/complex-systems-summer-schools/2011-program-info/
- Nihat Ay. 2008. A Refinement of the Common Cause Principle santa fe institute workingpaper
- Nihat Ay, Jessica C. Flack, and David C. Krakauer. 2007. Robustness and Complexity Co-constructed in Multimodal Signaling Networks. Philosophical Transactions of the Royal Society. London. 362: 441-447. http://www.jstor.org/stable/20209856
- Nihat Ay and David C. Krakauer. 2007. Geometric robustness theory and biological networks. Theory in Biosciences 125 (2): 93-121.
- Nihat Ay and Daniel Polani. 2006. Information Flows in Causal Networks santa fe institute workingpaper
- Wolfgang Löhr and Nihat Ay. 2008. On the Generative Nature of Prediction santa fe institute workingpaper
PEARL, Greenland, Halbert White and others
- Lots of papers at hand and at Judea Pearl, Sander Greenland and James M. Robins
- H.White, Lu and Chalak
- Lots of papers at hand and at Halbert White, Karim Chalak and Xun Lu
- White, Halbert, and Xun Lu. 2010. "Robustness Checks and Robustness Tests in Applied Economics". Journal of Financial Econometrics, 8: 193-243. Matlab code
- White and Lu 2009. Non-robustness is signaled by failure of conditional exogeneity. See Wooldridge 2006: 532-534, also http://www.oga-lab.net/RGM2/func.php?rd_id=systemfit:hausman.systemfit which might not be relevant. Lets look more into Hausman.
- and more to come
Pod A Conference room in the main building for you for September 7
It's a good space for 4-6 people and has better internet connectivity than the Gatehouse.
Restricted samples based on values of a control variable
Restricted samples based on values of a control variable can be obtained by blanking those values in a new depvar.
input = as.data.frame(sccs$HiGod4) z=which(output$Missions==2)#1) input[z,] depvar= input[z,] #N=73
Doing OLS on the Imputed data
- It appears that the imputed data is already being passed
out of the function. All you need to say is ols_stats$imputed_data and
- To get a specific instance you can say
'imputed_data[which(imputed_data$.imp==i),]' where is the ith dataset.
- To extract a column you just say:
ols_stats$imputed_data[,j] where j is the jth column or you can say ols_stats$imputed_data[,"column_name"]
- But its not that name I want
out=lm(sccs$v238 ~ ols_stats$imputed_data[,"Missions"]) Error in model.frame.default(formula = sccs$v238 ~ ols_stats$imputed_data[, : variable lengths differ (found for 'ols_stats$imputed_data[, "Missions"]')
- Make sure output is a data frame by saying
'class(output)' output = as.data.frame(output) #Once it's a data frame it should [almost] behave just like my_sccs.
lmout=lm(sccs$v238 ~ output$fratgrpstr +output$Missions +output$socplex+output$milk +output$AnimXbwealth +output$pigs) lmout summary(lmout)
- After summary do vif(lmout)
vif(lmout)
Handling files with factor categories (no longer numeric)
comb=cbind(sccsfac,outDat) comb$Missions comb$v1 table(comb$Missions,comb$v1)
setwd('/Users/drwhite/Documents/3sls/sccs')
load('outDat.Rdata') #Snarey factors
table(outDat$Missions,outDat$HiGod4)
sccsout<-data.frame(sapply(outDat,function(x) as.numeric(x))) #Convert factor to numeric
table(sccsout$Missions,sccsout$HiGod4)
1 2 3 4
1 33 26 5 9
2 36 25 18 34
table(sccsnew$Mission,sccsnew$v238) #Missions more high gods
1 2 3 4
1 33 22 3 7
2 35 25 10 33
save(sccsnew, file = "sccsnew.RData")
http://stat.ethz.ch/R-manual/R-patched/library/base/html/factor.html (try ?base) http://www.ats.ucla.edu/stat/R/modules/factor_variables.htm
- The only required argument is a vector of values which can be either string or numeric.
- Optional arguments include the levels argument, which determines the categories of the factor variable, and the default is the sorted list of all the distinct values of the data vector.
- The labels argument is another optional argument which is a vector of values that will be the labels of the categories in the levels argument.
- The exclude argument is also optional; it defines which levels will be classified as NA in any output using the factor variable.
- Graphics are another area that benefits from the use of factor variables. As in the tables the factor variable will indicate a better ordering of the graphs as well as add useful labels.
comb=cbind(sccsfac,outDat)
comb$Mission
comb$v1
table(comb$Mission,comb$v1)
New: You can convert every field to numeric this way:
sccs<-data.frame(sapply(sccsfac,function(x) as.numeric(x)))
But some fields will be nonsense (society name, etc.)
result:
sccsnew<-data.frame(sapply(comb,function(x) as.numeric(x)))
> table(sccs$v1,sccs$v2)
1 2 3 4 5
1 0 0 0 0 0
2 0 0 0 0 0
3 3 0 1 0 0
4 36 5 15 22 2
5 10 5 11 12 0
7 0 0 1 0 1
Yay! With this we can use sccsfac.Rdata to replace sccs.Rdata
I attached a file that created the factors like I was creating a new variable. A command for creating the Rain factor would look like such: FactorRain= factor(Snarey186$Rain,levels = c(1,2,3),labels = c(">40in/year", "Moderate", "<10in/year")) You can also factor out Rain by just dropping FactorRain= from the above command, however this does not create values. However, when I create these new values as mentioned in the first command above, they do not show up in the dataset when viewing it in RStudio.
Snarey factor code for category names
New data with factors
- we have a new version of sccsfac.Rdata that has categories at http://eclectic.ss.uci.edu/~drwhite/sccs. Cross-tabs can be done with category labels.
- factors are categories that go with numeric codes. Email: Here's something to start with: factors to add to a csv to Rdata conversion
example
> data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
> fdata = factor(data)
> fdata
[1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Levels: 1 2 3
> rdata = factor(data,labels=c("I","II","III"))
> rdata
[1] I II II III I II III III I II III III I
Levels: I II III
Here is the code that I used, just so that you have it: Victoria
Moralizing_gods#Codebook_for_Moralizing_gods MUST ADD #1. Raininch #7. Snarey Year
# To make factors for variables: Rain, Terrain, Water, Missions, HiGod4
Rain.f <- factor(Rain, labels = c(">40in/year", "Moderate", "<10in/year"))
Terrain.f <- factor(Terrain, labels = c("Surface Water", "Moderate","Surface H20 Scarce"))
Water.f <- factor(Water, labels = c("Neither Rain nor Surface H20 high","Either Rain nor Surface H20 high"))
Missions.f <- factor(Missions, labels = c("Premissionized", "Missionized"))
HiGod4.f <- factor(HiGod4, labels = c("Absent", "Present, inactive,unconcerned","Present, active, unconcerned with humans","Present, active,supportive of morality"))
#TO OUTPUT THE FILE outDat<-cbind(Rain.f,Terrain.f,Water.f,Missions.f,data.frame(HiGod4.f)) write.csv(outDat,"SnareyFactors.csv") save(outDat,file="outDat.Rdata")
-----------------------------
Create factors for AddSnarey,csv conversion to AddSnarey.Rdata (no factors for SCCS Raininch or YearSnarey Rain Terrain Water Missions HiGod4 Rain 1= >40"/year 2= Moderate 3= <10"/year Terrain 1= Surface water 2= Moderate 3= Surface H2O scarce Water budget 1= Neither rain nor surface H2O high 2= Either rain or surface H2O high Missions 1=Premissionized 2=Missionized HiGod4 1= Absent 2= Present, inactive, unconcerned 3= Present, active, unconcerned with humans 4= Present, active, supportive of morality
Do you think that you can take a look at this and see if what I did was what
you wanted?
####sccs<-sapply(sccsfac,function(x) as.numeric(x)) #### Error in sccs$v1 : $ operator is invalid for atomic vectors
Splice new variables, e.g., Premissionized
setwd('/Users/drwhite/Documents/3sls')
aux_data = outData
#sccs<-load("sccsfac.Rdata",.GlobalEnv)
load('sccs.Rdata')
sccsmsn<-data.frame(cbind(aux_data,sccs))
#save(newsccs2, file ="sccsmsn.Rdata")
save(sccsmsn, file ="sccsmsn.Rdata") #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott)
sccsmsn$Missions
NOT WORKING
3. I am trying to splice the AddSnarey.csv to the end of your sccs.Rdata It has those V1-V9 variables and those are the variable labels underneath and the result is weird. I had to take off the variable labels to make a 186-186 congruence between the two data files. But when done I didnt know the way to insert my variable names. So my problem is HOW DO I GET THE VARIABLE LABELS INTO THE NEW MERGED FILE, either the first row or the second....?
do ?read.table
V1____V2______V3______V4____V5______V6_______V7___V8 SCCS Raininch Rain Terrain Water Missions HiGod4 Year
setwd('/Users/drwhite/Documents/3sls')
###aux_data = read.table("AddSnarey.csv", header = TRUE, sep = ",", col.names) #Snarey186.csv
aux_data = read.csv("AddSnarey.csv", header = TRUE, sep = ",") #Snarey186.csv
sccs<-load("SCCS.Rdata",.GlobalEnv)
#newsccs2<-data.frame(cbind(aux_data,sccs))
sccsmsn<-data.frame(cbind(aux_data,sccs))
#save(newsccs2, file ="sccsmsn.Rdata")
save(sccsmsn, file ="sccsmsn.Rdata") #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott)
sccsmsn$Missions
Splice in WldRelig=cbind(sccs$v2001,sccs$v2002) -- Its already in sccs!!!! Forget Var1918-2000.cvs Grief and Mourning World Religion v2001 v2002 is in sccs
Select rows in my_sccs
z<-which(my_sccs$Missions==1) #nonMissionized my_sccs[z,] zz<-my_sccs[z,] my_sccs<-zz
Select rows in my_sccs as above -Can be done more easily by having a data name and value like Missions==1 in the create file.
If this is not null, then instead of taking a 79% or whatever size subsample in TRAIN you use this variable to select the subsample, measure the %cases and use the remaining sample to report on in TEST. This gives forr ways to select samples:
total % select by variable (gives imputation) m_sccs (no imputation)
data(sccs)
data(sccs) only lives in the source files that was under the 'examples" sub-directory. There is no reference to sccs in the actual two stage ols code. So the refactoring I did earlier this year was already supposed to do this. So the examples code can be thought of as 'client code' and the two stage ols as 'library code.' How clients wish to refer to their data is up to them.
My understanding is that you started making copies of the create model value children file which used data(sccs) and then you found it hard to modify all of them not to depend on data(sccs). But this does not mean other people need to do it that way.
Does that help?
To extract a column you just say
ols_stats$imputed_data[,j] #where j is the jth column or you can say ols_stats$imputed_data[,"column_name"] > ols_stats$imputed_data[,"anim"] [1369] 4 4 2 2 3 3 3 2 4 2 1 2 1 1 3 2 3 4 3 1 3 1 2 1 1 2 1 3 2 2 2 2 3 1 2 1 2 1 [1407] 2 2 2 1 1 2 2 2 2 2 3 3 2 1 1 1 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [1445] 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 1 8 3 1 1 1 1 1 1 1 2 1 1 3 4 1 1 1 1 1 1 1 1 [1483] 1 1 2 3 1 1
1488/8=186 cases with imp=8, but N=168 for the nonmissing cases on dep_var which are those to impute if data were missing.
source("examples/create/create_EduR_1/create_EduR_1.5DistB_Eff.DryMoneyLE4LowPopden&ANIMxBWEALh2oMSNS_.R")
source("R_3_s_ols/two_stage_ols.R")
source("examples/create/R_run/run_model_100.R")
New functions
- Many of these will be worked out with Victoria Pevarnic, borrowed from Brown and Eff and from Eff "Prosociality" (Moran's i)
- [http://stat.ethz.ch/R-manual/R-patched/library/base/html/factor.html factor {base}
Doug TODO
- Scott: Install package library(sccs) to NewMac
- Merge Snarey.cvs with sccs.Rdata, requires variable labels and add to a newsccs.Rdata.
- Run on Missions only, using z<-which(sccs$vMissions==1) # No missions
- THESE LINES DO FIND my_sccs
###z<-which(sccs$Missions==1) #No z<-which(my_sccs$Missions==1) #nonMissionized my_sccs[z,] zz<-my_sccs[z,] my_sccs<-zz
- Run without autocorrelation (for paper) - BUT WHAT ABOUT SIGNIFICANCE? Effective Sample size? Bonferroni corrected significance tests
?
- Use effective sample size for pairs of variables using covariance -- Project for Victoria Pevarnic? --needs covariance in the equation
- http://www.r-tutor.com/elementary-statistics/numerical-measures/covariance
0) rename file to have extension 'tgz' (for security reasons Google won't let me attach a file with extension ".tgz") 1) Decompress sccs.tgz 2) cd Statistical-Inference-With-SCCS
Documents/sccs-1.tgz
3 ) R
4) load('examples/data/sccs.Rdata')
5) load('examples/data/vaux.Rdata')
6) source('examples/src/create_model_value_children.R')
7) source('sccs/R/two_stage_ols.R')
8) source('examples/src/run_model.R')
Scott TO CONSIDER
the new program could be generalized to include subsample selection just after my_sccs if you set, just after my_sccs (and thus whatever row selection follows), not only
data=my_sccs,
but
aux_data=my_aux,
prox_list=list(distance=Wdd), #distancelang=distlang),
#,language=Wll,distance=Wdd,language=Wll
To do this there would be a default zz = all rows, with an alternative set by rows numbers selected just after my_sccs
z<-which(my_sccs$Missions==1) my_sccs[z,] zz<-my_sccs[z,] my_sccs<-zz
The zz rows would be used to similarly reset the rows in my_aux and the rows and columns in the prox_list w matrices
A 2nd ALTERNATIVE is that I could write a little program that could take whatever *.Rdata is used, along with the W matrices and Vaux.Rdata and write more generic matrices for these sets of inputs
z<-which(my_sccs$Missions==1) my_sccs[z,] zz<-my_sccs[z,] my_sccs<-zz
That would require keeping different subdirectories of the three reset datasets in different subdirectories.
Seems to me the first alternative is simpler and more generic.
Scott Possible TODO
1. Scott: After
4) load('examples/data/sccs.Rdata')
can sccs be used to make function data(sccs)??? If so then we will have a version for installation without R CMD... in PC and a package installation version.
Victoria and I, however, need installations, she on her PC and me on my NewMac.
Hi Dr. White,
To fix the problem, you have to do the following:
- 1. Open the following document in a text reader: two_stage_ols which is found in Statistical-Inference-With-SCCS->sccs->R
- 2. Copy the text into R into a new script frame
- 3. Save the new script with a .R extension named slightly differently
- 4. Edit the following line in the script:
#--the imputand is placed as a field in imputed_data and named-- imputed_data <-cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)])) Change it to the following: #--need to add +2 after NCOL(combine_input) and stringsAsFactors=FALSE)) imputed_data <-cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2],stringsAsFactors=FALSE))
- 5. Save the new script and close R
- 6. Re run the programs with a change in the last line of the code as follows:
This is the original code:
load('examples/data/sccs.Rdata')
load('examples/data/vaux.Rdata')
source('examples/src/create_model_value_children.R')
source('sccs/R/two_stage_ols_vp.R') #Change this line to include the name of the new script file that you created above
source('examples/src/run_model.R')
The program should work now. I was able to write this up based on some corrections that Giorgio had previously made but never written up.
2&3 have high priority for the article I am writing
Select subsample
2. z<-which(sccs$V7==2) is a statement that, if z is the data selected because the depvar is not-missing, can further restrict the data to be used to those cases with certain values (V7=2) on a control variable. We are able to make this selection in create.model and pass this into two_stage_ols then run_model.
NOW run_model has 73 rows for incoming my_sccs but 186 for a version called otherwise.
Splice new variables, e.g., Premissionized
3. I am trying to splice the AddSnarey.csv to the end of your sccs.Rdata It has those V1-V9 variables and those are the variable labels underneath and the result is weird. I had to take off the variable labels to make a 186-186 congruence between the two data files. But when done I didnt know the way to insert my variable names. So my problem is HOW DO I GET THE VARIABLE LABELS INTO THE NEW MERGED FILE, either the first row or the second....?
do ?read.table
V1____V2______V3______V4____V5______V6_______V7___V8 SCCS Raininch Rain Terrain Water Missions HiGod4 Year
setwd('/Users/drwhite/Documents/3sls')
###aux_data = read.table("AddSnarey.csv", header = TRUE, sep = ",", col.names) #Snarey186.csv
aux_data = read.csv("AddSnarey.csv", header = TRUE, sep = ",") #Snarey186.csv
sccs<-load("SCCS.Rdata",.GlobalEnv)
#newsccs2<-data.frame(cbind(aux_data,sccs))
sccsmsn<-data.frame(cbind(aux_data,sccs))
#save(newsccs2, file ="sccsmsn.Rdata")
save(sccsmsn, file ="sccsmsn.Rdata") #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott)
sccsmsn$Missions
Splice in WldRelig=cbind(sccs$v2001,sccs$v2002) -- Its already in sccs!!!! Forget Var1918-2000.cvs Grief and Mourning World Religion v2001 v2002 is in sccs
table(sccsmsn$Missions,sccsmsn$v2001)
?read.table
newsccs2
4. This item has very high priority for SFI and the article and it would be good to implement. I thought of a solution for the 100% option.
Just allow a choice of 100%. But when the program comes to the test sample you just reset the 0 to 30%.
The program will run as usual. No major reprogramming needed. What I really need is to that that option soon. Victoria can take care of the
rest I think. It's a suggestion that might entail minimum program change. If you see a problem lets discuss.
5. Output the full inputed dataset.
Splitting the sample various ways, i.e., USING CONTROL VARIABLES
Great!! And what if I want to split based on the value of a dichotomized variable
z<-which(((SCCS$v238==4)*1)==TRUE) for example? (Doug)
Even easier: z<-which(SCCS$v238==4)
If you want to run just on the odd-numbered (as so many coders did), then you can do the following:
#z<-which(odd(SCCS$"sccs#")==FALSE) #gives even numbered z<-which(odd(SCCS$"sccs#")==TRUE) #gives odd numbered oddSCCS<-SCCS[z,]
Now oddSCCS will be a data frame with only the odd-numbered societies.
If you want to randomly split it:
z<-which(rank(rnorm(186))<=93) randSCCS<-SCCS[z,]
And randSCCS will contain 93 societies.
Investigate adjust
http://www.stata.com/support/faqs/stat/adjust.html
ANSWER: "Controlling for some set of variables" would just mean including those as independent variables in the regression (so the effect of the variable of interest is not obscured by omitted variable bias). This is always done in a good regression model. (DRW: See H. White on core variables vs. controls)
"Fixing certain variables at their mean values", in their paper, simply means that when they calculated the values for their charts, they used the mean values of all the variables EXCEPT the variable they wanted to show the effect of--for that variable they used the actual values. This is an issue of making charts, not an estimation issue.
Another option for "adjust" is to have an auxiliary program that takes a set of independent variables and constructs a subsNsccs.Rdata, either selecting a sample with certain values, or setting certain variables to their means or other values. -- NOT NEEDED
---------------------------- I put this question to Anthon. But "adjust" is a standard option in Stata and other regression programs http://www.stata.com/support/faqs/stat/adjust.html If you have a look at that page what I want to discuss with you briefly is how to implement controls and adjustments in our program. Having the type of selection where z<-which(SCCS$v238==4) is a start and will do for now but we should think ahead ---------------------------- Original Message ---------------------------- Subject: [Fwd: Re: splitting the SCCS] From: drwhite@uci.edu Date: Sun, July 24, 2011 6:59 am To: "E. Anthon Eff" <eaeff@mtsu.edu> --------------------------------------------------------------------------
Thanks : In addition to making it possible to model for a selected control category (like modeling polygyny for the large subsample of Missionized societies) I am trying to get me head around this:
what is meant when a regression is reported (as in Henrich, et al. 2010. Markets, Religion, Community Size, and the Evolution of Fairness and Punishment Science 19 March 2010: 1480-1484.) as fixing certain variables at their mean values? Or controlling for some set of variables? How would that be done in Eff and Dow, generally speaking?
Original Message ----------------------------
Subject: Re: splitting the SCCS From: "E. Anthon Eff" <eaeff@mtsu.edu> Date: Sat, July 23, 2011 1:37 pm To: drwhite@uci.edu
Even easier:
z<-which(SCCS$v238==4)
Moralizing gods
Add a variable in .csv format to the sccs.Rdata file
What we need now
Save variables after imputation e.g. for PCA 100% sample' Align print labels Hausman test Output results from stage 1
- Instructions include ols_stats$
- DW should code Paige and Paige
Load(sccs.Rdata) load(vaux,Rdata) rather than *gz installation
For PCA can this be passed to two_stage_ols - so it does not have to be changed? E.g., put the variables in create and the code in 2sls?
Hi Doug. We're back. Still haven't found the time to look at the problem Scott mentioned...
The principal components are all done within the loop over the imputed data sets. It's in the script we submitted with the paper. Looks like this:
#--composite variable for size--
o<-princomp(m1[z,c("superjh","commsize")],cor=TRUE)
o1<-data.frame(psychometric::alpha(m1[z,names(o$center)]))
names(o1)<-"sizAlpha"
oj<-cor(m1[z,names(o$center)])
o1<-cbind(o1,oj[1,2])
names(o1)[NCOL(o1)]<-paste(rownames(oj)[1],colnames(oj)[2],sep="|")
siz<-(o$sdev/sum(o$sdev))[1]
o4<-cbind(o4,siz,o1)
m1[z,"PCsize"]<-sign(o$loadings[1,1])*o$scores[,1]
Not in the main program:
- Johnson, Dominic D. P. 2005. God’s punishment and public goods: A Test of the supernatural punishment hypothesis in 186 world cultures. Human Nature 16:410-446. Dominic D. P. Johnson. 2005. Bonferroni corrected significance tests
Preprint: "R Software for 2sls inferential statistics (2sls-is)" pdf and doc
- 2stageLS-is3b.pdf with live links June 12, 2011 for the Leibnitz seminar http://intersci.ss.uci.edu/wiki/pdf/2stageLS-is3b.pdf Comments by Giorgio: I think this document may be a very helpful introduction also for people that are not experts. Nevertheless its very detailed, and gives all the information needed to start using 2sls-is.
- 2stageLS-is3b.doc
Latest attempt at change in the code (see: John Fox) If you added the lines I showed you then lm.restricted is a part of ols_stats, so try ols_stats$lm.restricted. You can also try names(ols_stats) to see all the sub-components of ols_stats.
Hal: The challenge for me as an outsider to anthropolgy and as a semi-outsider to causal analysis is that I can only comment on the science and not on the crucial sociological/anthropological issues of what folks embedded in this area will find appealing or exciting. The latter issues are unfortunately the most determinative factors in the funding decision, so I fear I can't be of much help in that crucial dimension.
As far as the science goes, I can offer the following. 2sls-iv is only one of a considerable number of approaches to recovering causal/structural information. I attach a copy of a paper I think I sent you earlier that encompasses quite a few of these. ("An Extended Class ...") Proposing to extend the software to these contexts might have some appeal.
Another extension is to consider nonparametric structural systems. I attach two papers that examine the nonparametric versions of the standard instrument (2sls) and conditioning instrument cases. ("Local Indirect Least Squares ..." - LILS_SUPP.pdf and "Estimating Nonseparable Models ...".)
Your use of regression diagnostics is appealing. The attached paper "Robustness Checks and Robustness Tests ..." could be extended to the 2sls framework. This might also be appealing to reviewers.
Finally, although I have a great deal of respect for Judea and his contributions, I find the do operator awkward and unnecessary. Rather than focusing on which variables are fixed and which are subject to intervention, as the do operator requires, if one instead focuses on which variables are jointly free to vary as functions of all other system variables, as the partitioning device of settable systems allows, one gains access to causal discourse in a variety of contexts where it is not otherwise possible. This also make distinctions between partial equilibrium effects and full equilibrium effects natural, which is especially important in economics. I attach one of my latest items ( "Causal Discourse ...") that pursues this approach and applies it to an auction game.
I apologize for sending along such a wad of papers, but hopefully some of the ideas here may be useful in your nsf proposal or elsewhere.
There is a concert with the Big Band and Jazz Hall of Fame Orchestra coming up at the end of August where I am slated to play. I will let you know the details soon. Also, I sometimes sub in the big band at Clay's (the old Elario's); if so I will let you know.
We still need to get together for lunch, and now that I am finished traveling for a bit, that is feasible. Are you in town? Or are you in Santa Fe? I'd love to get together!
Warm regards,
Hal (see Halbert_White#On Causal modeling and Xun Lu for Matlab code on testing robustness that could be translated into R)
Curating the code for successful models
Tolga and Giorgio: if not already, you should download textpad to edit results pages on the wiki by insertion of a first column of spaces
- http://intersci.ss.uci.edu/wiki/0/create_EduR_1.5DistBrownEff.R (EduR_1.2) EduR_1.2
- Comments: Replicates http://intersci.ss.uci.edu/wiki/pdf/eScholarshipBrown&Eff.pdf
- http://intersci.ss.uci.edu/wiki/0/create_model_value_childrenLangOnly.R (EduR_2)
- Comments: Does not replicate http://intersci.ss.uci.edu/wiki/pdf/eScholarshipEff&Dow2009.pdf
- Moral gods: new model (Startup Only)
- http://intersci.ss.uci.edu/wiki/0/create_EduR_1.5DistLangBrownEff.R (EduR_1.3) same as below:
- http://intersci.ss.uci.edu/wiki/0/create_EduR_1.5DistLangBrownEffNewMoralGods.R (EduR_1.1)
- Comments: continued below
- Moral gods moves to (EduR_1.8)
- Comments: incomplete: one case to be deleted Society 186, Yahgan
- Comments: Compared to Brown and Eff (2010), substitutes ecorich (which groups sccs$v857 categories to “harsher” codes 1&2&6, “temperate or rainforest” codes 3&4, and “savannah” code 5) for PCAP and foodscarc, substitutes animXbridewealth for anim, and has 24 of 28 significance tests at p<.10. That model, with an average R2 =.494, 6.5% higher than the seven-variable model, posits a poor environment, small communities, a pastoral economy with exchange of brides for animals, and caste stratification (the weakest variable), such as occupational castes in pastoral economies, as predictors of high gods that also tend to be more concerned with human morality.
- Evil eye (EduR_1.3)
- Comments: animXbridewealth predicts evil eye
- Money (EduR_1.4)
- Comments: not a good model
- PastoralExch Or: animXbridewealth ? (EduR_1.5)
- Comments: needs to be run and modified
- Fratgrpstr Fraternal Group Strength (EduR_1.6)
- Comments: (problem with prestate sample)
- General polygyny v860 (EduR_1.7)
- Comments: Good model, replicates at 79%
- Comments:
- [
- Comments:
- [
April 2011 instructions from Github (temporarily inactive)
Basics
This short guide will walk you through the steps of installing the package, building a model, and fitting it.
- Install sccs_1.0.tar.gz to your working directory
- On linux type, ./install.sh; on windows type, ./install.bat
- That requires the "R CMD INSTALL sccs_1.0.tar.gz" command
- which requires that R commander be installed
- Load R, type: library(sccs)
- In R, set the working directory by typing: setwd("<working dir>") as in the example below. You create your model in the file that you will name in the first source line below. Once it is finished you copy and paste the nine lines below into R, either from R in the unix terminal or the R Gui.
NEW
setwd('/Users/drwhite/Documents/3sls/sccs/') #Macbook
setwd('C:/My Documents/2sls-is/') #PC
library(sccs)
data(sccs)
source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff.R")
source("R_3_s_ols/two_stage_ols.R")
#source("examples/create/R_run/run_model_79.R")
source("examples/create/R_run/run_model.R")
OLD
setwd('/Users/drwhite/Documents/sccs-latest')
library(sccs)
data(sccs)
source("examples/src/Create-EduR-S.R")
source("sccs/R/two_stage_ols.R")
source("examples/src/run_model.R")
ols_stats$restrict_stats
ols_stats$r2
ols_stats$restrict_diagnostics
- Create-EduR-S.R - start modifying a sample program (change "S" to an unused number in this wiki)
- EduR-0 outlines how the file system is organized at installation and WHERE TO PUT EduR-n.R programs where n is a number so as to keep track of models in this installation.
Creating a system
- DRW: In Create-EduR-S.R I am trying to give the new user a starting point for the simplest possible working model. IT DOESNT WORK AS YET. NOT CLEAR HOW TO DEBUG IT. E.g., debug(run_model.R).
- EduR-0 outlines how the file system is organized at installation and WHERE TO PUT EduR-n.R programs where n is a number so as to keep track of models in this installation.
- BELOW there are some working models.
setwd('/Users/drwhite/Documents/sccs/sccs-2011/models/src')
library(sccs)
data(sccs)
#You do not debug files but functions defined within files, type ?debug at the command line.
setwd('/Users/drwhite/Documents/sccs/sccs-2011')
library(sccs)
data(sccs)
source("models/src/Edu-3.R")
debug(two_stage_ols_mi)
source("sccs/R/two_stage_ols.R")
source("examples/src/run_model.R")
warnings
Spatial weights matrix not row standardized
6: In lm.LMtests(lm.restricted, W_mat, test = c("LMlag")) :
setwd('/Users/drwhite/Documents/sccs/sccs-2011) #/models/src')
library(sccs)
data(sccs)
#You do not debug files but functions defined within files, type ?debug at the command line.
setwd('/Users/drwhite/Documents/sccs/sccs-2011')
library(sccs)
data(sccs)
source("models/src/create-edumac-1.R")
debug(two_stage_ols_mi)
source("sccs/R/two_stage_ols.R")
isdebugged(two_stage_ols_mi)
source("examples/src/run_model.R")
EduR-0 - the new series setwd('/Users/drwhite/Documents/sccs-latest') library(sccs) data(sccs) source("examples/src/create_model_value_children.R") source("sccs/R/two_stage_ols.R") source("examples/src/run_model.R") ols_stats$restrict_stats ols_stats$r2 ols_stats$restrict_diagnostics
EduR-0 - the new series setwd('/Users/drwhite/Documents/sccs-latest') library(sccs) data(sccs) source("examples/src/create_model_value_children.R") source("sccs/R/two_stage_ols_split_halves.R") source("examples/src/run_model.R") ols_stats$restrict_stats ols_stats$r2 ols_stats$restrict_diagnostics
EduR-1 - the new series 1 setwd('/Users/drwhite/Documents/sccs/sccs-2011') library(sccs) data(sccs) source("models/src/create-edumac-1.R") source("sccs/R/two_stage_ols.R") source("examples/src/run_model.R") ols_stats$restrict_stats ols_stats$r2 ols_stats$restrict_diagnostics
DEBUGGED THE NEW two_stage_ols.R WITH o <- linear.hypothesis(lm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2] was linear.Hypothesis or linearhypothesis
setwd('/Users/drwhite/Documents/sccs/sccs-2011) #/models/src')
library(sccs)
data(sccs)
debug(create-edumac-1.R)
#You do not debug files but functions defined within files, type ?debug at the command line.
setwd('/Users/drwhite/Documents/sccs/sccs-2011')
library(sccs)
data(sccs)
source("models/src/create-edumac-1.R")
debug(two_stage_ols_mi)
source("sccs/R/two_stage_ols.R")
source("examples/src/run_model.R")
ols_stats$restrict_stats ols_stats$r2 ols_stats$restrict_diagnostics
setwd('/Users/drwhite/Documents/sccs/sccs-2011')
library(sccs)
data(sccs)
source("models/src/create-edumac-1dichAliased.R")
source("sccs/R/two_stage_ols.R")
source("examples/src/run_model.R") ##the error will occur here, put in debug()
DEBUGGED THE NEW two_stage_ols.R WITH o <- linear.hypothesis(lm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2]
- In R, load an example model by typing: source("examples/src/create_model_value_children.R")
- To source the latest code, you can type: source("sccs/R/two_stage_ols.R')
- In R, run two stage ordinary least squares by typing: source("examples/src/run_model.R")
- Examine the summary statistics generated from the fitted model by inspecting the file: summary_model_value_children.csv
top top top priority to fix in two_stage_ols_split_halves
- see Debug correct r2s errors to big, use DEBUG to check that W matrices are row normalized
- YES Done with option to add/ignore the distance/language matrices. See Run_model.R#User_instructions:_Language_and_Distance
- Double check that the random seed for RANDOM in split halves is also random.
- DRW added set.seed(11)
- must also add right after set.seed(11):
- seed.save = .Random.seed
- print seed.save in output. That will allow results to be checked for replication.
- The distance/language matrices that affect the errors are set in Run_model.R
- in split halves print the first results then print the second.
- Might be that deleting
These two lines in original: did not help (the earlier 3-variable model was better!) but set.seed did help qxx[,"fydd"]<-cyd qxx[,"fyll"]<-cyl map to initial for loop in compute_predictions. Those are not the fitted values but the actual interpolated values from Wy. these two lines in compute_predictions are used in two-stage-ols and in split halves so lets see if COMPUTING ORIGINAL EFF-DOW the results are changed............
seeEff and Dow #--Stage 2 OLS estimate of restricted model-- xR<-lm(depvar ~ fyll + cultints + roots + fish + exogamy + settype + femsubs, data = m9) #--corrected sigma2 and R2 for 2SLS-- qxx<-m9 qxx[,"fydd"]<-cyd qxx[,"fyll"]<-cyl
Download aux files
restrvar.for restrvar.exe Has to be run on a pc. Can get a series of my_sccs columns ready for our model. WITH THIS PROGRAM I CAN WORK THRU EACH OLDER MODEL WANTED FOR A NEW SERIES WHERE THE Create Model Rfile is mostly automated. This file takes a column of my_sccs and converts to standard my_sccs, indep_vars and restrict_vars formats.
- 2OR3WAY.FOR Fisher exact test 2 or 3 way N=400 max
- 2or3way.for Fisher exact test 2 or 3 way N=999 max
Most recent changes
Items before Leipzig
ITEMS WOULD BE NICE
- 1. Train N: 124 depending on the train percentage, depending on depvar: N*ratio and Test N:the remainder
- 2. An *.Rdata with the stage2 depvar and original indepvars for the model
- 3. Can you pass out lm.restricted and the equivalent of row.names(sccs)?
lm.restricted <- lm(restrict_eq,data=dataset) reason: I want to do a qq.plot (car): According to John Fox p193 qqplot(lm.restricted, simulate.T, labels=row.names(sccs))
Train N test N
Train..R2.final.model Train.R2.IV_distance Train N: 0.457551025 0.980458455 124 CELL 1 CELL 2 SKIP CELL 3 CELL 4 (NOTE THAT TRAIN N, TEST N will depend not only on % but on NUMBER CODED FOR DEP VAR so dont just put the percentage.
Column labels in *.csv
Train..R2.final.model Train.R2.IV_distance 0.457551025 0.980458455
Train N test N
("variables","coef","range","effect","ratio","Fstat","ddf","p-value","VIF","abs.ratio","%coded","tot.cor")#,"part.cor")
fixes the misalignment, the user must add part.cor is all. Or maybe that last missing name can be appended
Major decisions and simple priorities
option to include/exclude language and/or distance
- This will improve the split-halves test.
two_stage_ols_split_halves.R DECISION TO USED set.seed not DROP a peice of code
DRW
- added a NEW program two_stage_ols_split_halves.R that comments out the old two_stage_ols.R routine and comments in the split_halves version
- modified NEW program two_stage_ols_split_halves_seed.R ditto above and added set.seed
DRW Fortran program to convert old EduMod/EduMac programs to EduR-x
Just use the existing code, apply to my_sccs, insert in standard wiki page.
Tolga will do p12 test of Commentator code
Commentator code graph with seven variables
If train_predict works, alter the draft of the article
- The idea is that significance tests from many independent variables are not good criteria for model fitting.
- What the split halves test shows is that there is often a cutoff of the top predictor variables, between those that replicate and those that do not.
- If the model eliminates the non-replicating variables and split halves is rerun, the train-predictor test is often much more significant.
- This can be triangulated with the residual correlation test.
- This also allows us to experiment with sorting potential indep_vars by *significance *effect ratios *low VIF
- TRIANGULATION is therefore essential. MUDSTICKING is as much a problem with 2SLS as elsewhere.
- What is going on is that 95% of the variance in the dependent variables is accounted for in a rather deterministic way (no missing data) by autocorrelation.
- What is left over, however, is a mixture of replicable structure and random noise.
- When you apply the SPLIT HALVES test there is now NOISE in the sampling of the cases and their W MATRIX subsets.
- So applied once there is a test of replication. The replicable variables are more likely to have TRAIN-TEST correlations.
- Applied several times ... when there are replicable TRAIN-TEST correlations the MUDSTICKING variables will be eliminated.
- The magnitude of the problem is seen in the Brown-Eff paper, where the residual correlations are computed, and are very low.
- Computing the outer product of a vector by itself for a train-test (split-halves).
Start page
List of five items for new package 4-24-2011
Best to always make options activated by parameters at the top of the program.
- Important to find a way to identify aliased variables in indep_vars, restrict_vars, drop_vars
- The commands I am using to append to a file do not have any parameters
relating to closing the file. You may want to try adding the param eol="\r\n" to each of the write.table commands in run_model.R.
Debug() instruction
You do not debug files but functions defined within files, type ?debug at the command line.
To put a breakpoint at the beginning of the function two_stage_ols_mi you can type: debug(two_stage_ols_mi) and the run the program. The program will halt when the function is called and you can inspect all the variables defined at that point by typing ls(). You can type in all of the parameters in that function and any local variables as well as any functions defined.
To remove the breakpoint, type undebug(breakpoint).
Once you are at a breakpoint you can go to the next line by the typing the letter 'n.' You can type the letter 'c' to continue to the next breakpoint if there is one or until the end of the program. For example, after getting to the beginning of the two_stage_ols_mi function you can hit 'n' several times until impute is called and then can inspect everything right after the imputation is done.
1a. Output a file with finished variables, averaged
Key insight: combine the inputed data into averages. Then we can not only correlate but use plots because the averages will no longer be integer (and and option for some .5% f random variation added the the variables) and can give better plots. And probit can be done on the output variables not the integer input. This also makes possible running interaction among variables with imputed values
Option activated by parameters at the top of the program.
1b. For this file and the original variables, options to SAVE a frequency table
Functions in R - can we put my_sccs into a table, compute row sums with margin.table(my_sccs,1)?
- Option 1: Output table saved right after original variables defined.
- Has 186 entries in rows, with (a) var name, (b) number of cases codes (c) the first 15 ordered variables frequencies, e.g.,
v3 186: 14 26 17 ... v7 43: 7 3 1 ... ...
- Option 2: Same, now after data are imputed and imputed variables are averaged (the cases coded will be a constant)
v3 35: 4 2 7 ... v7 35: 1 3 1 2 ... ...
- No column values. This is like the function table(v1) with or preferably without the header variables
1 2 3 4 v3 4 2 7 ... v7 1 3 1 2 ... ...
- Explanation. This is needed as a SAVED file not a screen printout to help DEBUG the program when it doesn't work.
In Eff and Dow we could debug with variables with a command
Using debug()
with a variable name. The program often crashes when a variable is empty or not defined.
2. printing output with coefficients: Include the new parameters in the new package plus two others:
The 4 new inside \\ \\: \\coef\\ range effect ratio corr \\Fstat
You did: range effect ratio is effect divided by the range of the depvar. The range fo distance is 0-2. -1 to +1 for language.
Corr is corr(x,y) function where x=depvar y=indepvar (new: if this can be done AFTER imputation; maybe it needs to be done AFTER imputations are averaged)
- Pearl 2009: Regression method for a causal graph
- Between the coef and fstat insert range, effect, and ratio ...
coef range effect ratio corr Fstat
- ALSO ADD 2 COLS: correlation with RESidual and RAW dependent variable
- Then for all rows except (Intercept) -- including distance and language with ranges of 2:
Get the distribution of each variable and its min and max, Pearl's indep var x and x', say the variable runs from 3 to 16.
Range is the difference between max and min, e.g., 1-14, in this case 13.
Effect is the UNSIGNED product of max (in this case 14) TIMES the regression coefficient.
Ratio is the effect divided by range of the dependent variable, all exemplified below:
coef range effect ratio corr Fstat ddf pvalue VIF Dependent variable: Parental nurturant guidance
(Intercept) 23.345 7.979 31196 0.005 SCCS v492 plus (9-v504) range 1-14 (range 13)
language -1.084 0-2 -2.17 .504 0.971 2.818 21143 0.093 1.558
distance 0.352 0-2 0.70 .071 -0.322 1.352 60204 0.245 1.766
Positive:
valchild 0.108 1-36 3.89 .299 0.321 6.436 1299 0.011 1.128 (range=35)
Negative:
superjh -0.472 1-5 2.36 .182 -0.166 6.554 28408 0.010 1.151 (range=4)
pigs -1.632 1-2 1.63 .125 -0.231 5.621 11009 0.018 1.418 (range=1)
agrlateboy -0.309 1-10 2.91 .224 -0.112 6.662 361 0.010 1.101 (range=9)
Sum .938 is circa 1, i.e., roughly 100% of the range of the dependent variable.
- Explanation: The effect and ratio coefficients allows the use of the regression coefficients for causal modeling. They give the relative contributions of each independent variable. The intercept term (23.345) is a baseline for measuring the depvar (here: nurturant parenting, higher than its empirical maximum of 14), raised by about 4 by how parents value children, and brought down 2.36 points by political complexity, 1.63 by keeping pigs, and by 2.91 points when teenage boys are aggressive slightly less in size to the value of children (.108*36 =3.89) effect.
Related change?
- would it make sense to give two rather than one set of outputs for the coefficients? I.e.:
- one as above, in the order requested in restrict_vars,
- then a blank an a line saying that the same results are given next sorted by:
- significance (low first ranking), eff_ratio (high first), and Vifs (low: first ranking).
- two weighting parameters can be set against a default x
- x= -log10(signif) - low signif is good, high x
- y= -log10(VIF) - low VIF is good, high y
- z=eff_ratio - high z is good
- x ranges from high (9 being p=.000000001) to below 1; 1 being p=.10
- y ranges from 0 (VIF=1=good) to negative (very high VIF)
- z ranges from 1+ (exceptional eff_ratio) to 0 (no effect)
- x is calculated. Higher significance ranked first
- y default = weighted from 0 (no influence on rank) to above or below 1 (same as x)
- z default = weighted from 0 (no influence on rank) to above or below 1 (same as x)
- If the user specifies 0, 0 significant results are ranked low to high.
- If the user specifies 1 1 VIF and eff_ratio are weighted equally with significance
- Weighting .4 .5 weights VIF a bit, eff_ratio half of significance.
- VIF, eff_ratio can be weighted >1 to override significance
Options activated by parameters at the top of the program.
- Question to statisticians. We have a 2SLS regression where autocorrelation in stage 1 is added as an estimated variable in stage 2, mostly eliminating autocorrelated errors according to our normalized W matrices.
- The significance tests are fairly standard for OLS then, with no reduction in effective sample size because the error term is random.
- At present we do unrestricted models (all possible indep vars) and then select the variables that are significant, sometimes in successive steps, to get a restricted model and its diagnostics.
- The original regression coefficients ignore Pearl's idea of an effect measure relative to the scale of the indep and dep variables, but these are easily calculated so they measure how much a change in one unit of the indep variable affects, by say 1.5 units changes say in a dep var that runs from 1-10 units thus a 25% change expected from the model.
- So I am wanting to figure out how to do a weighted sort of these two variables, and starting with the variables at the top of this sort, if they have VIFs fairly close to 1, fairly certain to be significant in a final mocel, until new items that could be added are not significant in competition with those already present.
- So far, significance alone does a good job -- picking those that are significant in the unrestricted model does a pretty good job of getting a final restricted model where they are significant. But I would like to improve on this a bit. Obviously those with higher eff_ratio will predict "more" so why not use that in the sort?
3. Make Probit an option
Option activated by parameters at the top of the program.
4. EduMac-1 test for split-halves
Option activated by parameters at the top of the program.
Please run the model at http://intersci.ss.uci.edu/wiki/index.php/EduMac-1
- Run the EduMac-1 model with split halves. Question is whether there is
still a negative correlation. That will determine whether we put split haves in the new package.
working with Split halves ("train")
Here are the two lines to search for:
- results = two_stage_ols_train_test(dataset_i,y,prox_list,indep_vars,restrict_vars)
results = two_stage_ols(dataset_i,y,prox_list,indep_vars,restrict_vars)
was: If you uncomment the first one, you get the split halves, if you leave as is, you get the normal version. BUT ISNT IT BE BETTER TO MAKE ALL SUCH OPTIONS CONTROLLED BY PARAMETERS SET AT THE TOP OF THE PROGRAM?
5. (check that these work) Output of Results (Priority rankings !1 !2 !3)
- !1 Include a column of correlation coefficients in addition to regression coefficients.
- !2 We want distance and language to be a default BUT ALSO TO BE ABLE TO ELIMINATE THEM IN THE MODEL, i.e., test language without distance, distance without language, both, or neither. The question here is whether one is significant without the other.
- !3 Code to take output of several models, use sna or Igraph to map the variables 1-3000 being the variable numbers accd to codebook, i --> j, value k reflecting categories of significance from 1=<.11, 2=<.03, 3=<.01, 4=<.001, 5=<.0001, 5=<.00001; sizes of nodes in a *.cls file to reflect significance of distance autocorrelation.
- Add simple mapping option that uses Extra_code#General_map
- Add country mapping option that uses Extra_code#Brown-Eff_map and source("http://intersci.ss.uci.edu/~drwhite/0/world_countries_24.rda")
Old ranking of priorities
We need to prioritize !1 !2 !3 etc the order in which to do those we choose to do. We can use a letter code A B,,, Z a b ...z just after the #.
- Note - Cosma blog post
DONE new parameters for reporting *neff
4 new inside \\ \\ then *neff: \\coef\\ range effect ratio corr \\Fstat ... pval \\ norig *nefforig nimp *neffimp
- these will come from *neff algorithm
- http://www.cscs.umich.edu/~crshalizi/weblog/656.html'
Earlier priority for making giant or large EduMod project graph with a Python script
Excel file for making giant EduMod project graph
- Python for the Pajek file
*Vertices max (e.g., 3067) of col V (projects) that number of numbered rows, rownum=1-max col 1<3000: rownums plus "label 7 or 12" col 1>2999: rownums plus "label 3" *Arcs col 1<3000: cols19, 5, 15 "label 7" col 1>2999: cols 22, 5 "label 5 & 7"
- The Pajek file made by Python should use only those indepvars that line to 2+ depvars
- This can also provide an excel format for recording causal link analysis for the 339 foragers database
Using the new output coefficients with a large EduMod project graph from Python script
Turns out they work perfectly. Language can be given a range -1 to 1, effect 2, and ratio of the coef/2
- and distance can be given a range 0-1, effect 1, and ratio equal to coef.
For a multivariate causal graph, do adjustments
Split halves correlation and regression (!4 up through !6)
Sun, January 23, 2011 2:28 pm
Ok I made all the changes we talked about and fixed the train/test splitting so it's now randomized. See the attached code. Essentially I did what we discussed otp where I just pass everything through for the train except the R^2 for the final model is now coming from the test. I am not passing anything else through for the test set. The r^2 looks strongly negative which may not be correct. If there is a problem I think it's in the method called "correct_r2s". What I recommend is you source this file and then type "debug(two_stage_ols_train_test)" and then source run_model.R and you can walk through each line. The correct_r2s happens right at the end of the function. The test errors looks about 2x those of the training errors which seems plausible.
Scott
- Let the program run through Vaux for missing data. Then pick half the sample of N cases randomly.
- !4 Break the W matrices (zeros in the diagonal of course, and row normalized to sum to 1) into two rectangular parts, one for the 1-n societies another for the remaining 1 to m. (n+m=N)
- !4 Now for 1-n do the first of the two 2sls routines, but use rectangular Ws with 1-N cols and 1-n rows.
- !4 Then do the second of the 2sls routines, and obtain the betas Bj for each j, Xj,n of the variables for the 1 to n sample.
- !4 Now with the second set of rectangular W matrices, for the 1-m sample of cases.
- !4 Using the betas already computed, Bj,n, and the Xj,m of the variables for the 1-m sample, compute the predicted
- !4 Correlate Y for the second sample with
and report the
.
- !5 Report the intercept, regression coefficient, and significance for the
, with regression line and scatterplot graph
- !6 Option to repeat multiple times and plot error bars on the regression line and scatterplot graph
- !6 Option to do intercept, regression coefficient, and significance for the
, with regression line and scatterplot graph, for each independent variable.
- !6 Together, repeated tests should be able to sort out which are the reliable predictors and which are not replicable.
Names and options
- Names need to be generalized:
- not my_sccs but my_data instead, encouraging use of ANY *.Rdata with W matrices and .Vaux.
- not Rsccs or the current as the name of the program but R2stage or something more generic
- Specify in the user program (that has my_data etc) options such as
- (x) correlation coefficients
- (x) distance and/or language
- (y) (n) distance in output
- (y) (n) language in output
- (x) maps (mutually exclusive choice)
- (o) simple map
- (o) country outline map
- whichever is chosen should crosscut these non-mutually exclusive options
- (o) dependent variable map
- (o) independent variable map (give variable $v##num or my_date:varname
Error messages
- We need an error message when the program does not run. Would also help if the program STOPPED when there is an error. And an error message is needed when depvar is not defined, a restrict_var was not in indep_var or in my_sccs or an indep_var is not in my_sccs. The lack of an error statement is confusing to students, e.g., when an restrict_ or indep_ var change gives an error the program just reprints the previous results and students think it is a new correct result.
- We need something like the coef(xR) and coef (xUr) Error messages for Eff and Dow 2009 that will work for my_data: Error messages for Rccs 2010.
- After the my_data program runs with some part of the variables in restrict_vars, if the WALD Test diagnostic is significant, have the program cycle through the remaining indep_vars and find the ones that are significant if added (singly) to the current model, the print those results (one or more models).
- If the RESET diagnostic is significant restart run through and print each regresstion with all but one restrict_vars but the final variables squared, then square rooted. Stop one one is found. User will have a choice what to do next.
Automation
- Would it be hard to have an option that allows an exchange of the depvar with an indepvar, gives the new depvarname the name of the old indepvar? ---- One reason why we find little reciprocal causation is that the indepvars are not investigated as depvars.
- With the new restrVar.exe program we need some way (like my FORTRAN program) to autocopy all the fx_ or my_sccs into indep_var then all but the last into restrict_var. ---- If the vNums and the vNames in Fx could be saved, it would be good to put the vNums at the right in the line for each causal variable (restrict_var results).
- Can we 'automate so it can collect new bunches of variables that work together. Or add one variable at a time.
- Ditto, if the RESET Test diagnostic is significant, have the program cycle through the significant indep_vars and find the ones that predict better with log(1+old var) or oldvar2 (squared).
- When there are more then 30 indep_vars, and when many high VIFs, automatically fill the restrict_vars with several overlapping contrast sets, run each separately, and print as many results as there are sets. This will reduce the VIFs and by having fewer "congested variables" increase the likelihood that a significant variable WILL be found to be significant in one or more contrast sets.
The package
We need to settle on a version 1.0 package among these choices, make it a standard installation from the [R] site (find out how to do this), and thus make the installation easy and not so complex, so students or researchers can easily install at home. And we need to give Mike Migalski an installable version by mid-summer for the fall classes. He does one big installation, not dribs and drabs, so is important to get him our installable code by the end of August.
Creation of a program to convert my_sccs to indep_var and restrict_var lists
- DRW did one in FORTRAN. Uses SCCS 2SLS Variable Names - SCCS R package Install
- Program is in c:\Core and called restrVar.for /.exe. You give it the filename restr1 to read and it returns the results to restr2. Created Oct 4 2010.
Debug
The problem I mentioned happens here: o <- linear.hypothesis(lm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2] I'm not quite sure what's wrong though. Still debugging...
On Mon, Jan 17, 2011 at 2:16 PM, Scott White <scottblanc@gmail.com> wrote: > I don't know how to get it to stop on a line but you can get it to stop at a function by saying debug(function). I'm working on this today.
Examples of new code
Making and Expanding databases
R programs for converting databases to *.Rdata *.rar file, right click and save
License
- License: GNU General Public License, version GPL-2.0
- Authors: Anthon Eff, Malcolm Dow, Scott White, and Douglas White
- Title of work: SCCS R package, Authors: Scott White and Douglas White (in process).
- Containing: Eff, E. Anthon, and Malcolm Dow. 2009. How to Deal with Missing Data and Galton's Problem in Cross-Cultural Survey Research: A Primer for R. Structure and Dynamics: eJournal of Anthropological and Related Sciences 3#3 art 1.
Guide to fixing user errors
Changes to the R code (Brown & Eff 2010 --> Scott White)
Chris Brown and Tony Eff have a new article with some additions to their [R] code and a real example with R2=.50 or so. New items:
- Ways of referencing variable names, e.g. "dummy", PC... for principal component, ... squared.
- Calculating PCAs which requires psychometric package seems not to be available or a *.gz *.zip
- The code is attached as "supplementary"
- In the code the better way of drawing maps
- R2p is the R2 partitioned to each independent variable (Chevan and Sutherland 1991; Grömping 2006). R code for relaimpo
- Chevan, A., and M. Sutherland. 1991. Hierarchical partitioning. The American Statistician 45: 90- 96.
- Grömping, U. 2006. Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software, 17(1), 1-27.
- Hausman test code Jeffrey Wooldridge 2006 (3rd edition) Introductory Econometrics: A Modern Approach. Florence, KY: Wadsworth Publishing. pp532-534 Hausman test
- Moran I test code
- A reference to a way of determining the optimal combination of the language and distance matrices: Dow and Eff 2009a. Cultural trait transmission and missing data as sources of bias in cross-cultural research: Explanations of polygyny re-examined. Cross-Cultural Research 43: 134-151.
- "Employing the composite weight matrix method presented in Dow and Eff (2009a:142), we combine the two weight matrices in order to find the linear combination of the two which best explains the transmission process (i.e., results in higher model R2); the resulting weights provide a way of assessing the relative importance of the two cultural transmission channels."
- Fuller explanation of the diagnostics, output labels for each, and a new Hausman test for each variable
- TWO MORE COEFFICIENTS: IN ADDITION TO OLS coef, a stdcoef in terms of #std deviations, and R^2p
How to add a variables like *.csv to the sccs.Rdata file in the program
Add a variable in .csv format to the sccs.Rdata file
Notes for Users
- In Eff and Dow if there is no error in part 1, you can change part 2 and run independently
- Debug with coef(xUR) and coef(xR)
- In White and White if there is no error in my_data, you can change indep_vars or restrict_vars and run again from there
Getting started
Team Viewer - Free for non-commercial use
http://teamviewer.com lets you share your screen with your team, anyone can enter keystrokes. DRW's usernumber is 793 304 281 password is the earliest z
Download source code
- Click "Download source" (upper right) and download the file archived source file to your computer, e.g., to "C:/My Documents/SCCS"
- Uncompress the file, e.g. "tar -zxvf drwhite-Statistical-Inference-In-SCCS-888d07e.tar.gz" or the zip file (make sure to extract and choose on your PC "C:/My Documents/SCCS" as the directory for the extraction.
- A new directory should appear in "C:/My Documents/SCCS", e.g., C:\My Documents\SCCS\drwhite-Statistical-Inference-In-SCCS-db3c8fd -- rename it "source" - it will have two subdirectories, sccs and data.
Windows Users: Setting up R
- Make sure R is in your path. You can verify this by going to "Control Panel -> System -> Advanced -> Environment Variables -> Path -- e.g. add to the end of your path ;C:\Program Files\R\R-2.6.2\bi
- [Optional] Download RTools from http://www.murdoch-sutherland.com/Rtools/ (make sure it's compatible w/ your version of R and Windows)
Build and Install R Package
Note: make sure you have R installed on your machine before proceeding.
From the parent directory (one directory above 'sccs') (in class and a home we use c:\My Documents\SCCS -- and type the following commands at the command line
- Scott White package installation, in Unix (Macbook)
Then
- [Optional] 'R CMD CHECK sccs' to verify the package is set up correctly
- 'R CMD BUILD sccs' to build the R package
- 'R CMD INSTALL sccs_[VERSION].tar.gz' to install the R package
- PC DOS On a Macbook
an x86 emulator with DOS An open source DOS emulator for BeOS, Linux, Mac OS X, OS/2, and Windows. Primarily focuses on running DOS Games. www.dosbox.com
DOSEMU Main Page DOSEMU Main Page. DOSEMU stands for DOS Emulation, and allows you to run DOS and many DOS programs, including many DPMI applications such as DOOM and Windows 3 ... dosemu.org
- On a PC (Windows7 etc): Open a DOS window.
- cd \
- cd My Documents
- cd SCCS
- cd source
Extra Installs if needed
- You may need to install Microsoft's HTML Help Workshop using iExplorer (iE). Make sure R is in your path. You can verify this by going to "Control Panel -> System -> Advanced -> Environment Variables -> Path. -- add to your path ;C:\Program Files\HTML Help Workshop
- Verify there is now a file called sccs_[VERSION].tar.gz in the parent directory.
Run Examples in SCCS R Code
Initialize
In R run, the following commands:
library(sccs)
setwd("C:/My Documents/SCCS") #class and home-PC use
setwd("[mypath]") #where mypath is the path to the parent directory where sccs was installed
rm(list=ls(all=TRUE)) # removes all the current variables in the workspace
options(echo=TRUE) # turn on verbose output
Create your own SCCS and auxiliary data set
- Run the following command to create the dataset:
source("examples/src/create_dataset.R")
- Verify the following datasets were created: my_sccs and my_aux
- Verify rows are properly aligned in two datasets by checking the output of the following command is 186:
sum((my_sccs$socname==my_aux$socname)*1) #Dont worry if this doesnt work
- DRW note: create_dataset.R
- DRW note: create_model_value_children.R
Run multiple imputation
- Within R, run the following command to run multiple imputation:
source("examples/src/run_impute.R")
- Verify the dataset "sccs_imputed" was created and saved (in working directory, e.g., "source")
Run two stage ols
- Within R, run the following command to run two stage ols:
source("examples/src/run_two_stage_ols.R")
- Check output of OLSresults.csv #in "source" directory: highlight the numeric fields in excel, right click highlighted area, click "format cells" / "numeric" / 3 decimal places (or higher for pval)
Hosting
- GitHub - Secure source code hosting and collaborative development - Online project hosting using Git. Includes source-code browser, in-line editing, wikis, and ticketing. Free for public open-source code.
Agenda: Scott
Fix multiple matrices Document wiki EasyChange dep_var EasyChange add/delete indep_var Normalization of values option
Normalized re-expression of ordinal variables
Basic ideas
- Normal re-expression (probit analysis) simply reassigns the o(i) values of ordinal variables to ones that are normally distributed in the range 1 ≤ x=integer[1 + k(probit(p(o(i)))] ≤ k, where p(o(i)) is the cumulative percentage of cases in category i, k is a multiplier, and x is an integer truncation of the re-expressed value.
- This option can transform each ordinal variable in sccs.Rdata to a common scale 1 to k to create normalized variables, save an sccsNorm.Rdata file, and use this file for the linear regression analysis.
- A second option can raise each such variable to a power t to minimize least-square fit to the dependent variable. This option is specified in the program and does not require a new dataset.
Formuli
Let each of i=1,...,k categories of an ordinal variable have frq(i), sample size n, and let
. In the standard normal distribution,
and
. The cumulative distribution function and its inverse, the probit function, is
To compute in [R] see pnorm and erf. I propose THAT
- OUR TRANSFORMATION HAVE A LOWER DEFAULT Zmin=1 and an UPPER Zmax=7 that can be reset as parameters.
- we replace SCCS.Rdata with SCCSN.Rdata. If this were done for integers but now between Zmin-Zmax, normally distributed, I cant see why it wouldnt (1) work and (2) give better Rsq (3) provide for more useful diagnostics esp. regarding normal distribution of residual errors if the regression model is correctly specified. Doug 12:18, 8 June 2010 (PDT)
Pseudo-code
Compute 0 ≤ Z(i) = probit(i) ≤ 1.
Note that the Z(i) are breakpoints between intervals. What we want are the centroids of these intervalues
. Compute Z-intervals for i=2 to n-1,
.
Let the Z-centroids be
.
for i=2,...,n-1
.
and
.
last term in excel=G3+(Zmax-G4)/2
Consider the normal curve as having unit volume, bounded left and right at +/-s std.devs rescaled in the range 1-7. Now for each Z-centroid i=1,...,k, Zcen(i) re-expresses a normalized value s.t. cen(i)= f(probit(i)). To compute first get the probit for all values for area of the curve under the unit normal distribution, from .00,.01,.02,...,.99,1.00 (j=0 to 100) for +/-3 std.devs, then iterate the probit values j=1,...,1.00 and for each i set
Then for each i the value Z-centroid Zcen(i) has a value between Zmin=1 and Zmax=7.
Create a new .Rdata file. If decimals dont work with Eff and Dow, then truncate Zcen to integer.
Commentary
This is done for ordinal categories, if k>2, and all others as well (e.g., percentages). TEST IF THIS TRANSFORMATION CAN BE DONE FOR THE ENTIRE DATABASE AND SAVED AS A NEW DATABASE. Values with low frequency are be moved closer to adjacent values in the normalized scale. The normal curve is considered as having unit volume but the Z scale in bounded. Consider the histogram rectangle for an ordinal variable as having unit volume. Consider the rectangle for the "middle" category j of i=1,...,k categories that contains the median of the distribution and prop(j) of the cases in category j. Adjust that rectangle to have a width and height that minimize the difference in shapes with its position superimposed over the normal curve of unit 1. Consider how to do the same for each j rectangle to the right and to the left. The left/right spacing of each rectangle is not yet fixed however, so consider how to do that as follows.
- The spacing adjustments have to recognize that the right boundary on the -tail / Z score / + tail has to correspond to the negative z score so that the area proportion to the left of the unit area for the normal distribution equals the proportion of cases for that variable. Do each category j of i=1,...,k according will 100% of the unit area is covered. You can solve for the height of each rectangle and that show satisfy our criterion. You dont need calculus for this, just a function that converts z-scores to areas under the normal curve.
- So all you need is the pdf or cdf for the normal curve, as shown at Wikipedia:Normal Curve
pdf =![]()
The general ideas for re-expression come from Tukey (e.g., 1977: 103), but his methods are all pencil and ruler, and dont require a program.
- Tukey, John 1977 Exploratory Data Analysis. Reading, Mass.: Addison-Wesley. This takes each of the variables, after imputation, and including the dep_var, and normalizes them. The second page is from a study that uses his ideas:
- Mallows, C. L., & Tukey, J. W. (1982). An overview of techniques of data analysis, emphasizing its exploratory aspects. In J. T. de Oliveira & B. Epstein (Eds.). Some recent advances in statistics (pp. 111-172). London: Academic Press.
A second idea: transforming the variables prior to regression
Tukey's contribution was to rexpress (each of our independent) variables so that their plot is linear. This is done by minimizing least squares between val(i) and
by varying the exponent t. (The t for each independent variable would be given with the results.
Links
References
- Grant V. Farnsworth. 2008. Econometrics in R pp 21-22 have
- for dichotomies probit [R] glm(c^y, family=binomial(link="probit")) THIS WE CAN USE FOR DICHOTOMIES. IN GENERAL, HOWEVER, WE CAN DO OUR OWN PROBIT ANALYSIS BY RENORMALIZATION, HAVE OUR [R] PROGRAM RECOGNIZE BINARY VS ORDINAL AND TREAT ACCORDINGLY
- ordered logit/probit [R] p21 MASS library polr(Sat ~ Infl + Type + Cont, method="probit"), where Sat is an ordered factor vector
- for multinomials nnet [R]
- for MCMC p21 MNP [R] library mnp() see 4.2.2
- UCLA probit example
- An algorithm for computing the inverse normal cumulative distribution function
- Applied multiple regression/correlation analysis for the ..., Volume 1 By Jacob Cohen p 247 Normalization of Ranks A normalization strategy based on percentiles of the ... Other methods of addressing ordinal data are presented by Cliff (1996)
- p242: "Following the elementary statistics textbooks for finding centiles (percentiles), express the ranks as cumulative proportions, and refer these to a unit normal curve ... to read off fz, or use the PR column of Table 6.4.2, where 5 has been added to zp to yeild probits."
- Mosteller, Frederick & John W Tukey (1977). Data analysis and regression: a second course in statistics. Addison-Wesley.
- AN OVERVIEW OF REMEDIAL TOOLS FOR THE VIOLATION OF PARAMETRIC TEST ASSUMPTIONS IN THE SAS SYSTEM Chong Ho Yu, Ph.D., Arizona State University, Tempe AZ
- References but not good algorithms: *Google or Google Scholar search: "normalizing ordinal variables" e.g.:
- box-cox but only good for dep/indep problem.
Agenda: Doug
NSF Proposal 2010:July see: EduMod-71 - EduMod-52:Imputation and Regression has shortened commands that will run the sample programs
- The lang and dist matrices have reversible time, similarities from proximities.
Would it make sense to also have a directed time matrix, i.e., asymmetric. That would differ from the variable "dateobs" which can have a symmetric effect, like distance DONE: have need chunks of code that will let students run code on wiki
- MAKE PARAMETERS TO change nimps --- OK NOW FOR parameters 10,10,3 (iterations, ..., pval)
- CAN NOW change the dep_var
- CAN NOW change the indep_vars
- CAN NOW change the restricted model--xUR
- CAN NOW change the restricted model--xR
- CAN NOW have all variables in the model diagnostics--dropt --> wald
- Comments on imputed vars relative to dep_var
- CAUTIONS: Cannot run from restricted down
- CAN RUN FROM dep_var down
Fall 2010 course
- Software R for Human Social Complexity - Causal Networks among World Cultures fall 2010
- Exploratory Causal Analysis within and between sociocultural groups








