SCCS R package

causal graph

EduB - 2SLS-IS - Sccs 2SLS-IS downloads - SFI2011 project -


Plots for subsets
Generalization G11.11 p 420A (+cond: Pop)
x <- cbind(LRB$pop,LRB$subdiv,LRB$packinx) #subsistence diversity
x.df <- data.frame(x)
x.sub <- subset(x.df, X3 == 2)
plot(x.sub$X1,x.sub$X2, cex=.5)
x.sub <- subset(x.df, X3 == 1)
plot(x.sub$X1,x.sub$X2, cex=.5)
z <- which(LRB$packing<2)# & LRB$subsp<2)
plot(log(LRB$pop[z]),LRB$subdiv[z], cex=0.5)
cor.test(LRB$pop[z],LRB$subdiv[z])  #cor.test has no cex argument
Generalization G12.02 p 434
conditional: exclusively terrestrial resources
conditional: population density has not yet reduced mobility options to movement within a small foraging area
(packing stage 1 not reached: packing < 1)
subdiv increases linearly with log10 of pop  #subdiv = subsistence diversity; pop = population density
x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$subsp) #subsistence diversity
x.df <- data.frame(x)
x.sub <- subset(x.df, X3 == 2)
plot(x.sub$X1,x.sub$X2, cex=.5)
This works:
x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$subsp) #subsistence diversity #subsp 1=terrestrial animals 2= terrestrial plants 3=aquatic 
x.df <- data.frame(x)
x.sub <- subset(x.df, X3 < 2) #, X4 < 2)
plot(x.sub$X1,x.sub$X2, cex=.5)
This doesn't, but it "should": how do I get an x.sub dataset with two conditions?
x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$subsp) #subsistence diversity #subsp 1=terrestrial animals 2= terrestrial plants 3=aquatic 
x.df <- data.frame(x)
x.sub <- subset(x.df, X3 < 2 & X4 < 2)
plot(x.sub$X1,x.sub$X2, cex=.5)
For two conditions, use
x.sub <- subset(x.df, X3 < 2 & X4 < 2)
where "&" means AND.
EFF: Either of these will work. The first uses different symbols for the two types; the second plots only one of the two types.
plot(LRB$pop,LRB$subdiv, cex=LRB$packing/2,pch=LRB$packing+20,col=LRB$packing)
plot(LRB$pop[z],LRB$subdiv[z], cex=0.5)
z<-which(LRB$packing<2 & LRB$subsp<2)
plot(log(LRB$pop[z]),LRB$subdiv[z], cex=0.5)
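The two-condition subset() pattern above can be checked on built-in data. A minimal sketch using R's own mtcars rather than LRB (the mtcars column names are its own, not from this page):

```r
# subset() takes a single logical expression; join conditions with & (AND) or | (OR)
sub2 <- subset(mtcars, cyl < 8 & gear > 3)     # rows meeting both conditions
# equivalent logical-index form, as with which() elsewhere on this page:
z <- which(mtcars$cyl < 8 & mtcars$gear > 3)
sub3 <- mtcars[z, ]
identical(rownames(sub2), rownames(sub3))      # TRUE: both forms select the same rows
```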

Maps in R

Where we go with SEM

  • John Fox
  • Re: Non-invertible matrices
  • Dear Doug,
  • Do you have at least as many instrumental variables as coefficients to estimate?
  • John

data() in its most common usage just tells R to load a stored data object from an attached package. If you do help("data") you will see there is another application: with package=NULL it will look in a data folder of the current working directory. That might be simplest for you.

Do not confuse it with the parameter name data= in functions -- that generally just allows you to specify a data frame that gives bindings for variables.

The solution for you is to have in your working .GlobalEnv a data frame of any name, say "foo", and then you can use data=foo as one of the parameter settings for your sem call. This worked.

neff almost working

DW: I see the problem here: it's the p. 742 (click the small blue "pdf" below) upper-right formula Y = ... e*. That's probably lm(Y~V^.5), where e* is in Y = mu + (V^.5)e*, which is slightly out of my pay grade because e* is multiplicative. There's another step where sigma^2 V^-1 is his covariance matrix, not just the matrix V=W that I was using, which ought to bring the n* result into the proper range 1 < n* < n. Could one of you make those equations for me, starting with any old Y and W? With that addition I think the little R program will work. And n* is different for each variable, so it can be listed in the output of our 2SLS scripts; if this works, feel free to use it.

Working Calculation for n*, neff, effective sample size with W matrix

p742: Suppose the n x n matrix V contains the covariation structure among n georeferenced observations (more precisely, matrix \sigma^2 V^{-1} is the covariance matrix), such that Y = \mu + e = \mu + V^{-1/2}e*, where Y denotes … attribute values, \mu denotes the population mean of variable Y…, and e and e* … denote the vectors of spatially autocorrelated and unautocorrelated errors…. If V = I, the n x n identity matrix, then the n observations are uncorrelated.
#A. Eff helped out here
ww<-ww/rowSums(ww) #making sure that ww is row normalized
ww=cov(ww) # THIS IS V <--- NEW (THE FIX)
w_inv<-solve(ww)                             #this is computationally singular unless the diag(ww)=0
chk<-ww%*%w_inv # verifying that w_inv is indeed the inverse
TRw_inv<-sum(diag(w_inv)) #trace of inverse
one<-matrix(1,n,1) #column vector of ones 
nstar<-TRw_inv/(t(one)%*%w_inv%*%one)*n #formula
n* is the same whether or not ww is row-normalized to 1 (ww<-ww/rowSums(ww)):
141.7 for Wlink
212.9 for Eff-Dow Distance matrix
#inverse of V <-- where does Y come in? Y = mu + e = mu + V^{-1/2}e*, as in the quote above.
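The computation above can be condensed into a self-contained sketch of the formula n* = n * tr(V^-1) / (1' V^-1 1), using a small hand-made V rather than cov(ww) (the AR(1)-style correlation matrix V2 is an invented example, not from this page's data):

```r
# effective sample size as coded above: n* = n * tr(V^{-1}) / (1' V^{-1} 1)
nstar <- function(V) {
  n <- nrow(V)
  V_inv <- solve(V)                  # inverse of V
  one <- matrix(1, n, 1)             # column vector of ones
  as.numeric(sum(diag(V_inv)) / (t(one) %*% V_inv %*% one) * n)
}
nstar(diag(4))                       # V = I (uncorrelated case): n* = n = 4
V2 <- 0.5^abs(outer(1:4, 1:4, "-")) # invented AR(1)-style positive correlation
nstar(V2)                            # 12: exceeds n, mirroring the 212.9 > 186 reported above
```

The V = I check recovers n* = n exactly; note that with positive autocorrelation this expression can exceed n, consistent with the 141.7 and 212.9 values reported for n = 186 and with DW's worry above about the proper range 1 < n* < n.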

Construct dummy variables for FxCmtyW, Fratgrpstr

FxCmtyWages=((sccs$v2125>=2)*1)*((sccs$v61==6)*1)  #having defined 2010; ERROR was: ((sccs$v2010>=2)*1)*((sccs$v61==6)*1)
  0   1
112  74    # so 1 means missing data

Which cases?

pastName <- sccs$socnam[which(p == 1)]

p <- which(sccs$v1=='No Trade')

which(sccsdata$v1=='No Trade')

[ 1] 7 55 93 170 180 183 186

Try using the function 'which' with your expression inside.

which(Islam==2 & sccs$v238==1)
[1] 83 Javanese
    0  1
 1 67  1 Javanese
 2 47  0
 3 13  0
 4 17 23
    0  1
 1 58 10
 2 41  6
 3 10  3
 4 31  9

Make a copy of sccs.Rdata named sccs2.Rdata or Strip some variables from comb.Rdata

delete or subset variables

  1. exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3") 
newdata <- mydata[!myvars]
myvars <- names(comb) %in% c("Missions.f", "HiGod4.f") ##THESE .f are key
sccsdata <- comb[!myvars]
[1857] v1913                      v1914                      v1915                      v1916                     
[1861] v1917                      v19_22                     v24_25                     v26_27                    
[1865] v53_54                     complex                    v159_177                   fortitud                  
[1869] aggressi                   competiv                   selfrely                   achievem                  
[1873] industry                   responsi                   obedienc                   selfrest                  
[1877] sexrestr                   id66                       v2001_1806                 v2002_1807                
[1881] v713rev                    Islam                      Christian                  Buddhist                  
[1885] Hindu                      HinduBuddhist              Shaman2                    GamesOfChance             
[1889] GamesOfSkill               GamesOfStrategy            NomEurasia                 NomEurasiaPacific         
[1893] NomAfroEurasia             NomOldWorld                NomNewWorldAfroPacific     ExternalWar13             
[1897] ExternalWar50              ExternalWar50Bands         InterrnalWar46             InterrnalWar46Bands       
[1901] ExtWarFactor               IntWarFactor               FratIntGroups60            FratIntGroups60Bands      
[1905] FemalePower107             FemalePower107Bands        MaleAggression50           MaleAggression50Bands     
[1909] Modernization102           Modernization102Bands      PoliticalIntegration82     PoliticalIntegration82Band
[1913] v1260                      Rain.f                     Terrain.f                  Water.f                   
[1917] Missions.f                 HiGod4.f
myvars <- names(comb) %in% c("HiGod4.f") ##THESE .f are key
sccs2 <- comb[!myvars]

myvars <- names(comb) %in% c("Missions.f") ##THESE .f are key
sccs1 <- comb[!myvars]
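The drop-by-name pattern above, sketched on a self-contained toy data frame (the column names here are invented for illustration):

```r
# %in% gives a logical vector marking the names to drop; [!myvars] keeps the rest
mydata <- data.frame(v1 = 1:3, v2 = 4:6, v3 = 7:9, keepme = 10:12)
myvars <- names(mydata) %in% c("v1", "v2", "v3")  # TRUE for columns to exclude
newdata <- mydata[!myvars]                        # keep only the unmarked columns
names(newdata)                                    # "keepme"
```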

dy/dx, elasticity and derivatives

> D(expression(sqrt(1 - x^2)), 'x')

-(0.5 * (2 * x * (1 - x^2)^-0.5))
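D() returns an unevaluated expression; to get numbers, evaluate it at a value of x. A small check at x = 0.6, where the derivative of sqrt(1 - x^2) is -0.6/0.8 = -0.75:

```r
dfdx <- D(expression(sqrt(1 - x^2)), "x")  # symbolic derivative
val <- eval(dfdx, list(x = 0.6))           # evaluate at x = 0.6: -0.75
# compare against a finite-difference approximation
h <- 1e-6
approx_val <- (sqrt(1 - (0.6 + h)^2) - sqrt(1 - 0.6^2)) / h
```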

Doug: For example, if one estimates:  
In other words, this is just like any other calculation of a first derivative. No need to do any special estimation.

Doug: OK, maybe I didn’t give such a great example. So imagine that one estimates y=f(x,z,v):

y=a0+a1*x+a2*x^2+a3*x*z +a4*ln(z)+a5*v

dy/dx = a1 + 2*a2*x + a3*z
dy/dz = a3*x + a4/z
dy/dv = a5
These are all partial derivatives, which give the ceteris paribus effect of a change in one variable at a time.

Note that only the last (dy/dv) is a scalar that one can introduce into a table next to a regression coefficient. And, since it is identical to the regression coefficient, it would be redundant to introduce it into the table.

The other two (dy/dx and dy/dz) are not scalars that can be introduced as a single number in a table--they are vectors, with as many values as there are observations (they are functions of x and z). That's why the best way to understand these is to plot them (against x and z, respectively)
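A sketch of that point with invented coefficients a1..a5 (assumptions for illustration, not estimates from any paper): the marginal effect dy/dx is a vector with one value per observation, while dy/dv collapses to the coefficient itself.

```r
# for y = a0 + a1*x + a2*x^2 + a3*x*z + a4*log(z) + a5*v
set.seed(1)
x <- runif(100)
z <- runif(100, 1, 2)
a1 <- 0.5; a2 <- -0.3; a3 <- 0.2; a4 <- 1.1; a5 <- 0.7   # made-up values
dydx <- a1 + 2*a2*x + a3*z   # varies with x and z: a vector, not a single number
dydz <- a3*x + a4/z          # likewise a vector
dydv <- a5                   # a scalar, identical to the regression coefficient
plot(x, dydx, cex = 0.5)     # so the marginal effect is best shown as a plot
```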

In the code for the Farming and Fighting paper, the calculation of the marginal effects is done when we create the figures. For example, the calculation of the marginal effect of SCCS$v814 is done in this line:


#now apply that to fitting a smooth lowess function to f(y,x)
plot(cars, main = "lowess(cars)")
lines(lowess(cars), col = 2)
#lines(lowess(cars, f=.2), col = 3)
legend(5, 120, c(paste("f = ", c("2/3", ".2"))), lty = 1, col = 2:3)
sccs<-data.frame(sapply(comb,function(x) as.numeric(x)))
plot(sccs$v814,sccs$v815) #YES
vars=data.frame(sccs$v814,sccs$v815) #missing data
#vars=data.frame(sccs$v5,sccs$v3) #no missing data
#plot(vars) #YES
varNoNA <- na.omit(vars)
plot(varNoNA) #YES
lines(lowess(varNoNA),lty=2,col="red")  ##lowess() cannot handle NAs; use the na.omit-ed data

Scatter Plot Smoothing

  • Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74, 829–836.
  • Cleveland, W. S. (1981) LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician, 35, 54.

Reff and Elasticity

Hi Doug --

I think I now understand that you are interested in measuring effects in percentage change terms. In economics, this is the concept of "elasticity": the percent change in one variable resulting from a given percent change in another. Here are some wikipedia articles on this notion:

As the latter item notes, the concept goes back at least to Alfred Marshall 1890, so you are in august company.

Simply put, elasticity = (dy/y) / (dx/x). Whenever dy / dx has a causal interpretation, so does the elasticity. Whether you use dy / dx or elasticity is purely a matter of convenience or interpretability in a given context.

In case it may be of interest, I attach a recent update of my paper with Karim where we define a variety of effects without recourse to the "do" operator.

I hope this helps.

Best, Hal
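Hal's formula, elasticity = (dy/y) / (dx/x) = (dy/dx) * (x/y), can be checked numerically on a toy power function where the elasticity is constant (y = 2*x^3 is an invented example; its elasticity is exactly 3 everywhere):

```r
x <- 1.7                   # any positive point works
y <- 2 * x^3
dydx <- 6 * x^2            # analytic derivative of 2*x^3
elasticity <- dydx * x / y # (dy/dx) * (x/y)
elasticity                 # 3: a 1% rise in x yields a 3% rise in y
```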

Marginal effects: marginal effects for binary or ordinal responses, computed as partial derivatives, are available from SAS/ETS PROC QLIM by specifying the MARGINAL option in the OUTPUT statement. While no procedure in SAS/STAT software directly computes marginal effects, the partial derivative can be computed using results from the procedure used to fit the model.

can we run GLM inside 2SLS-IS?

can we run the wikipedia:Generalized_linear_model inside 2SLS-IS? see Sccs 2SLS-IS downloads

Changed lm in two_stage_ols.R to compute_stats <- function(glm.unrestricted,lm.restricted,errors,prox_list) { and got this error:

 8   3  data_to_impute...vars_missing_data.i..
 8   4  data_to_impute...vars_missing_data.i..
 8   5  data_to_impute...vars_missing_data.i..
 8   6  data_to_impute...vars_missing_data.i..
 8   7  data_to_impute...vars_missing_data.i..
 8   8  data_to_impute...vars_missing_data.i..

Error in linearHypothesis(lm.unrestricted, drop_vars, test = "Chisq") :

 object 'lm.unrestricted' not found

In addition: Warning messages: 1: In min(y[test_idxs], na.rm = TRUE) :

 no non-missing arguments to min; returning Inf

2: In max(y[test_idxs], na.rm = TRUE) :

 no non-missing arguments to max; returning -Inf

Changed lm in two_stage_ols.R so that compute_stats is function(glm.unrestricted,lm.restricted,errors,prox_list), and changed lm to glm here: linearHypothesis(glm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2] -- and got a RESULT. Can this be right? YES, because I then restarted R and did not include the earlier two_stage_ols.R.

              coef range effect  ratio Fstat      ddf     pvalue   VIF abs.ratio %coded tot.cor part.cor
(Intercept) -9.512    NA     NA     NA 0.727 2849.468 0.39399615    NA        NA  0.000   0.000    0.000
language     1.346     2  2.691  0.096 7.484 2200.414 0.00627649 1.306     0.096  0.000   0.207   -0.023
cultints     0.805     5  4.027  0.144 5.691  755.965 0.01730165 1.868     0.144  1.000   0.162    0.091
roots       -1.995     1 -1.995 -0.071 2.770  233.726 0.09741558 1.196     0.071  1.000  -0.081   -0.097
fish         0.558     9  5.019  0.179 4.864 2579.736 0.02750355 1.205     0.179  1.000   0.051    0.078
exogamy     -0.860     4 -3.438 -0.123 5.160  592.550 0.02347549 1.115     0.123  0.995  -0.111   -0.167
settype     -0.450     7 -3.152 -0.113 3.858  858.149 0.04983512 1.678     0.113  1.000   0.005   -0.069
femsubs      0.458     8  3.667  0.131 2.189 1030.255 0.13933123 1.224     0.131  0.995   0.094    0.040
Train  R2:final model  Train R2:IV_language 
          0.09759433            0.97002048 
              Fstat         df pvalue
RESET          0.001     73.273  0.974
Wald.on.restrs 1.138    228.137  0.287
NCV            0.406     58.725  0.527
SW.normal      0.763   2455.566  0.382
lag..language  1.568 316525.108  0.211
               coef range effect  ratio Fstat      ddf     pvalue   VIF abs.ratio %coded tot.cor part.cor
(Intercept) -13.190    NA     NA     NA 0.940  130.471 0.33415460    NA        NA  0.000   0.000    0.000
language      1.498     2  2.996  0.107 6.245  121.638 0.01378760 1.309     0.107  0.000   0.209   -0.023
cultints      0.762     5  3.810  0.136 5.156  854.501 0.02341128 1.860     0.136  1.000   0.154    0.084
roots        -1.969     1 -1.969 -0.070 2.816  416.877 0.09406955 1.196     0.070  1.000  -0.078   -0.095
fish          0.517     9  4.654  0.166 4.143 1609.501 0.04197396 1.205     0.166  1.000   0.041    0.070
exogamy      -0.815     4 -3.259 -0.116 4.507  318.360 0.03453283 1.112     0.116  0.995  -0.102   -0.159
settype      -0.441     7 -3.090 -0.110 3.495  289.899 0.06257948 1.681     0.110  1.000   0.002   -0.072
femsubs       0.482     8  3.855  0.138 2.392  964.099 0.12227872 1.233     0.138  0.995   0.108    0.049
Train  R2:final model  Train R2:IV_language 
          0.08700948            0.96158666 
               Fstat        df pvalue
RESET          -0.023   143.632  1.000
Wald.on.restrs  1.278   185.277  0.260
NCV             0.640    81.212  0.426
SW.normal       0.545  3139.683  0.461
lag..language   1.613 84002.356  0.204

The generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
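A minimal sketch of a GLM in R on simulated data (the variable names, coefficients, and seed are invented, not from the SCCS): a binomial logit model, where the linear predictor relates to the response through the logit link.

```r
set.seed(2)
x <- rnorm(200)
prob <- plogis(0.5 + 1.2 * x)          # true probabilities via the inverse logit link
yy <- rbinom(200, 1, prob)
fit <- glm(yy ~ x, family = binomial(link = "logit"))
coef(fit)                              # slope estimate should be positive, near the true 1.2
# family = gaussian() would reduce glm() to ordinary lm()
```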

How to recover Imputed data values and dummy for missing data


Dichotomize NA versus others to make dummy coded/missing cases for networks of variables: di287 <-$v287)
table(di,di287)
Retrieve Anthon's file.
SEE RESULTS AT Draft#HiGod4 for FxCmtyWages: the full imputation test gives better results, "more positive cases"
Meaning of Positive correlation of imputed with Coded variables

SFI Support

Bug Reports

Here I attach a detailed bug description and an R source to replicate it.

Giorgio Ubuntu bug reports: W %*% X (following data imputation) produces an error in an R version for Ubuntu Linux.
But it was one compiled by Giorgio, so the solution may not be general.
New Ubuntu_run_model.R; change in two_stage_ols.R and two_stage_ols_full.R:
#before:
imputed_data <- cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2]))
#after:
imputed_data <- cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2], stringsAsFactors=FALSE))

Anthon Eff codebook NA

The dropbox codebook at correctly indicates a number of variables in sccsfac.Rdata that lack NA

  • BUT SOMETIMES THE MISSING DATA ARE NOT CORRECTLY CODED as NA. #Eff dropbox: load('sccsfac.Rdata'); sccs<-data.frame(sapply(sccsfac,function(x) as.numeric(x)))

I searched in the codebook for "don't know" etc.


Variable to delete


Spelling changes

sccs$v763 Ude -> Use
sccs$v908 evan -> even
sccs$v918 nt -> not
sccs$v1249:6 prp -> pro

These are the sample Ns (cases coded) for numerically summarized variables

> sum(table(sccs$v871))
[1] 147
> sum(table(sccs$v872))
[1] 145
> sum(table(sccs$v1753))
[1] 96
> sum(table(sccs$v1848))
[1] 135
> sum(table(sccs$v1849))
[1] 135
> sum(table(sccs$v1861))
[1] 186
> sum(table(sccs$v1862))
[1] 93

Other Ns to do:

sccs$v1865 to 1873
sccs$v1873 to 1880
sccs$v1913 to 1917
sccs$v1977 to 1978

NaNs not NA


How to do 1st PCA for variables before my_sccs

#Compute the PCAP variable for Brown and Eff (courtesy of Eff)$v237[c(115,172)])<-TRUE$v63[88])<-TRUE
length(sccs$pc1) #186
pc1<-pcdat1$scores[,1] #---first principal component
#--loop through cases for missing data and do 1st PCA --
for (i in 1:nimp){
z<-which(           # not which(twoVARS==NA): comparison with NA never returns TRUE
z<-which($v921))   # likewise; cf. newvar[which(newvar==88)] = NA for coded missing values
see source("examples/create/create_EduR_1/create_EduR_1.5DistB_EffHiddenVariablesV238AP1_2.R")
v237= sccs$v237
v63= sccs$v63

#newvar[which(newvar==0)] = NA     
#newvar[which(newvar==88)] = NA

Compute R2 between significance and relative effect ratios


x 8 34 41 43 102

 96  0  0  0  0   1
 102 0  0  1  0   0
 176 0  0  0  1   0
 208 1  0  0  0   0
 237 0  1  0  0   0
[1] -0.6998707
[1] 0.4898189
[1] 0.09697337
[1] 0.5300817
[1] 0.729
[1] 0.857375
[1] 0.9025

Dropbox

Sccs 2SLS-IS downloads‎‎


  • The Public Folder lets you easily share single files in your Dropbox. Any file you put in this folder gets its own Internet link so that you can share it with others-- even non-Dropbox users! These links work even if your computer’s turned off.
Step 1: Drop a file into the Public folder.
Step 2: Right-click/control-click this file, then choose Dropbox > Copy Public Link. This copies the Internet link to your file so that you can paste it somewhere else.
That's it! You can now share this file with others: just paste the link into e-mails, instant message conversations, blogs, etc.!

I searched in Eff’s R codebook for "don't know" and other indicators of missing data without NA.


Variable to delete

sccs$externalwar13 (too few variables)

Spelling changes in codebook needed for these words

sccs$v763 Ude -> Use (ok in main codebook)
sccs$v908 evan -> even (ditto)
sccs$v918 nt -> not (ditto)
sccs$v1249:6 prp -> pro (ditto)

NaNs not NA



Install or start R: go to R/Package & Data/Package installer, search for "igraph", install, 
then Cut and paste into R as a command
setwd('/Users/drwhite/Documents/3sls/sccs/')                           # library(foreign)
sccs<-data.frame(sapply(comb,function(x) as.numeric(x)))          #data(sccs)   

source("") #vars problem, sccs not found
source("")   #"output" is the name of the active output file 
On Mac:
#source("dropbox create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")    
source("examples/create/R_run/averageAll.R")   #"output" is the name of the active output file 

Alternate URLs for drw home pages

Crucial change for SCOTT to make: adjust relimp to R2 for 2SLS Oct 2011

Brown and Eff 2010 code - Mac See:
#--adjust relimp to R2 for 2SLS--

OLS reminder

lmout=lm(sccs$v238 ~ output$no_rain_dry +output$superjh +output$milk +output$AnimXbwealth +output$FxCmntyWage +output$logdate)
#dropped terms: output$fratgrpstr +output$pigs +output$eeextwar +output$Missions +output$socplex
lmout
summary(lmout)
vif(lmout)

Correction for missing data in v1650 and v1685

comp.Rdata and comb2.Rdata updated - always check table(comp$v1650) and table(comp$v1685) 
#convert factor -->as.numeric
newvar[which(newvar==0)] = NA     
newvar[which(newvar==88)] = NA                          
#convert numeric -->as.factor
newvar[which(newvar==0)] = NA     
newvar[which(newvar==88)] = NA                          
#convert factor -->as.numeric
newvar[which(newvar==0)] = NA     
newvar[which(newvar==8)] = NA                          
#convert numeric -->as.factor
newvar[which(newvar==0)] = NA     
newvar[which(newvar==8)] = NA                          
save(comb, file='comb.Rdata')

How to normalize and average data in a data frame


Ideally I want to scale the values for each column in the range of (-1,1)


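One way to do the scaling stated above, mapping each column to the range [-1, 1] (a sketch; the toy values echo the commented matrix below, and the column names are invented):

```r
# min -> -1, max -> 1, everything else linearly in between
rescale11 <- function(v) {
  r <- range(v, na.rm = TRUE)
  2 * (v - r[1]) / (r[2] - r[1]) - 1
}
df <- data.frame(a = c(2, 3, -2), b = c(1, 2, 2))
scaled <-, lapply(df, rescale11))  # apply column by column
range(scaled$a)   # -1 1
```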

#A <- matrix(c(2,3,-2,1,2,2),3,2)
                                          -2 -1  0  1  2
 Absent                                   17 21 13  9  9
 Present, inactive, unconcerned           11  9 12 11  8
 Present, active, unconcerned with humans  4  4  7  4  4
 Present, active, supportive of morality   2  8  8 16  9
newvar[which(newvar==88)] = NA
Replace 0 with NA
mean(popmean) #~zero
v1650[which(v1650==0)] = NA
plot(popmean,scale(v1650)) #sccs$v1650
# Scatter Plot Smoothing
#lowess(x, y = NULL, f = 2/3, iter = 3,       delta = 0.01 * diff(range(xy$x[o])))
x = popmean
y = scale(v1650)
lowess(x, y, f = 2/3, iter = 3)   # delta defaults to 0.01 * diff(range(x))
AgpotScale=matrix(c(scale(sccs$v921),scale(sccs$v928)),186,2)  #AgriPot 1&2
SizeJHscale=matrix(c(scale(sccs$v63),scale(sccs$v237)),186,2)  #CmtySize & Superjh

Replace PCA from Psychometric package

twoVARS=cbind(sccs$v63,sccs$v237)  #average CmtySize with Superjh

White-Murdock alignment

from Murdock and White 1969
Wlink=matrix(data = 0, nrow = 186, ncol = 186)
for(i in 1:185) {

New problems and possibilities e.g., Replace, Dichotomize NA / Others

data[which(data==0)] = NA     #Replace 0 with NA
sccs$v2011= replace(sccs$v2009,$v2009),0)
sccs$v1732= replace(sccs$v1732,$v1732),0)
newvar[which(newvar==0)] = NA     #Replace 0 with NA
    0  1  2  3  4  5  6  7
 0 16  7  2  0  4  5  2  0
 1 10  3  1  2  1  2  0  0
 2  0  0  0  0  1  1  5  2
 3  2  1  0  0  3  2  0  1
    1  2  3
 1 13  0  0  #No wage labor
 2  3  0  0  #No wage labor or coerced
 3  2  0  0  #No wage labor or coerced
 4  0 10  0  #Hired/Wage
 5  0  0 12  #Migrant wage
 6  0  9  0  #Hired/Wage
 7  0  3  0  #Hired/Wage
                      #analysis  v1009 is WorldSys labor
sccs$v2009=sccs$v1009                                                             #Proxy for  worldsys labor 1-7 N=52
sccs$v2732=sccs$v1732                                                             #Proxy for Wage labor 1-3  N=89
sccs$v2011=3-(sccs$v1009<=3)*1+(-1+(sccs$v1009==5)*1)  #Proxy for a labor variable 1-7 N=52, now as 1 2 3 values
sccs$v2011= replace(sccs$v2011,$v2011),0)          #Replace NA with 0, keep 1-2-3
sccs$v2732= replace(sccs$v2732,$v2732),0)          #Same for the v1732 labor variable
newvar<-apply(sccs[,c("v2011","v2732")],1,max)             #Take max (i.e., nonzero)
newvar[which(newvar==0)] = NA                              #Replace 0 with NA  N=112 WHY 1-7???
sccs$v2010=newvar                                          #N=112
FxCmtyWages  1  2  3  v1732
         0 20  8 15   #coding-0 error here, hard to fathom; FxCmtyWages sum is N=112 (v1732)
         1 13  0  0
         2  1 14  0
         3  2  0 16
table(sccs$v1732,sccs$v2010)  #boosts overlap of 112 and 89 from to 89? GOOD PROXY
    1  2  3
 1 31  2  3
 2  0 21  1
 3  0  0 31
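The NA -> 0 -> row-max -> NA construction above can be sketched on two invented vectors; pmax() is a vectorized stand-in for the apply(..., 1, max) used in the script:

```r
# where either variable is coded, take the (nonzero) code;
# where both are missing, the max is 0 and is turned back into NA
v_a <- c(1, NA, 3, NA)
v_b <- c(NA, 2, 2, NA)
a0 <- replace(v_a,, 0)   # Replace NA with 0
b0 <- replace(v_b,, 0)
newvar <- pmax(a0, b0)              # row-wise max
newvar[which(newvar == 0)] <- NA    # only cases missing on both stay NA
newvar                              # 1 2 3 NA
```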
sccs$v200==1  #Use 84% no wage labor for region 1 (Africa)      #$v287)
dich      0   1
 FALSE 177   0
 TRUE    0   9
  • Do this for FxCmtyWages
  • Use imputed data compared to real data for dep_vars in causal graph?
  • Then create dummy variables for missing v. coded 1/0 subsamples for each dep_var in influence graph
  • Laura might be able to do phylogenetic modeling on HiGod4
  • What of cases like Superjh that are not PRIMARY but come out only when other variables are present in the model

Using the 2sls-is program

Maps and Plots

Giorgio Gosti drawVar.R modified, SFI Working group in Causal Analysis
Giorgio Gosti drawPath.R by Giorgio Gosti, SFI Working group in Causal Analysis

Brown-Eff Maps source("")

Moral gods v238



Zshare Installation from zshare

Link for forums: [URL=] - 0.90MB[/URL]

Week2: Breaking out the EduR_1.5 model for a causal graph

Controls for Islam and Christianity


No run HiGods without Snarey codes

HiGod4 -> sccs$v238
wet=1+((sccs$v929>8)*1) +((sccs$v929>3)*1) ##+((sccs$v929>7)*1) #DOES NOT CORRELATE with SNAREY$Rain
sccs$v1913  # NoRain=3 - ((sccs$v1913>24)*1) + ((sccs$v1913>101.6)*1)
 [1]  11.155  35.425  86.950  69.790 128.050 165.890 123.740 106.625  85.725 100.525 107.050 146.270 193.155
[14] 180.015 263.695 175.779 240.685 104.220 142.300 370.240 100.385  93.480 115.110  22.820  55.072 132.085
[27]  35.515 140.315  31.295  84.780  73.260 123.120 145.335 120.220  74.420  20.699 151.620  37.755   0.175
[40]   4.055   4.515  56.790   0.140  58.845  12.220  12.702  35.661 127.505  80.654  34.700  99.617  52.625
[53]  39.305  52.881 172.292  59.708  36.794  33.668  69.512 159.650 159.079 153.555 104.037 167.888  57.365
[66]   8.635 118.979  40.900 240.914 253.535  84.120 183.220 185.858 161.188 131.469 152.819 263.390 296.760
[79] 296.760 149.565 271.965 185.433 178.310 178.059 330.675 206.810 265.060 214.060  92.995 129.056  25.870
[92] 448.100 447.380 244.500 246.420 330.500 237.571 212.786 366.063 398.167 310.360 271.260 199.565 139.445
[105] 113.015 273.550 303.625 578.967 363.800 303.976 373.205 328.980 164.560 111.360  49.991 107.475 177.705
[118] 103.579  59.456  29.054  34.602  51.790  83.560  25.465  83.794 105.772  49.742  31.314  40.090 259.290
[131] 131.769 121.357 321.934 103.848  90.974  14.437  29.210  34.558  51.640  33.624  43.635  57.105  83.880
[144]  78.140 113.646 131.885  54.425  39.725  36.797  60.771  21.752  83.996  68.951 266.653  82.590 319.786
[157] 372.220 245.580  46.472 131.722 198.440 120.614 246.085 224.531 303.106 271.825 344.485 202.070  48.700
[170] 148.030  62.220  58.915 142.695 121.593 182.300 107.461 118.015 159.780 163.379 192.237 164.705 121.579
[183] 129.545  56.626  12.966  55.375

over 40 in. = over 101.6 cm; 10 in. = 25.4 cm

NoRain=3 - ((sccs$v1913>24)*1) + ((sccs$v1913>101.6)*1)
NoRain= 1+((sccs$v1913>24)*1) + ((sccs$v1913>101.6)*1)
wet=1+((sccs$v929>8)*1) +((sccs$v929>3)*1) ##+((sccs$v929>7)*1) #DOES NOT CORRELATE with SNAREY$Rain
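The second cut-point formula above can be checked on a few hand-picked rainfall values (the thresholds 24 and 101.6 are the ones used above; the test values are invented):

```r
# recode a continuous rainfall value into categories 1/2/3 by the two thresholds
NoRain <- function(r) 1 + ((r > 24) * 1) + ((r > 101.6) * 1)
NoRain(c(10, 50, 200))   # 1 2 3: at most 24, between 24 and 101.6, above 101.6
```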

Source files

source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2Missions.R")
source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R")

# THE DATA in comb.Rdata are not yet released for publication
# IF YOU DON'T HAVE comb.Rdata, edit out $Missions and $lo_water from your create file and use a new which(sccs$v....)
setwd('/Users/drwhite/Documents/3sls/sccs/')                      #change to YOUR directory     # includes library(foreign)
load('comb.Rdata')  #alt: just load('sccs.Rdata') #for comb.Rdata you have added new variables such as Missions
sccs<-data.frame(sapply(comb,function(x) as.numeric(x)))          #data(sccs)    
dep_var<-sccs$HiGod4                      #sccs$v238  if you start with load('sccs.Rdata')
z<-which(sccs$Missions!=2)             #option to modify for a subsample dependent on a selection variable[z])<-TRUE                 #option to set chosen selection variable values to missing
dep_var                                #option to use this modified dep_var in 
#sccs$v238=dep_var                     #optional, but would change sccs until reloaded
#  (73 observations will be deleted due to missionization) 
#Access to cached software 
#source("")    #option for Train-Test inferential states
source("") #OR: full sample
#OR# as configured on your own directories
source("examples/create/R_run/averageAll.R")   ##averages
average(ols_stats$imputed_data)               #independent variables average for imputed missing data
output=average(ols_stats$imputed_data)  #output but not in data frame
class(output)                                 #check the class before converting to a data frame
output =    #Once it's a data frame it should [almost] behave just like my_sccs.
lmout=lm(sccs$v238 ~ output$no_rain_dry +output$superjh +output$milk +output$AnimXbwealth +output$FxCmntyWage +output$logdate) #)  #output$fratgrpstr +output$pigs +output$eeextwar +output$Missions +output$socplex +

Science article

  • Comments of Simon Dedeo
population structure needed for "group selection"
run longitudinal gini indices in Europe, USA re: rise and fall in religious memberships
  • Outline
Science style lit.
Chamberlain mult. hyp.
Multiple hypotheses (table 1)
Stat. Results (discuss)
Tolga to look at how Science articles report and format their results
Comparison of 2sls and OLS
Unexpected and Serendipitous findings
Using backloops and control variables (SOM)
Need for influence diagrams and controls
Supp. Online Materials
Giorgio and Doug to work on the use of effect ratios to compare OLS with 2sls
  • New diagram
  • Science issue
4500 words: research article (we started with 5200)
2500 words: research report
3500 survey, solicited by editor
... also perspectives

Readings for the SFI working group Sept 11, 2011

Discussion issues

  • Prosociality and Punishment the big issue in new evolution of religion articles -- basically game theoretic not biological
  • Is there any selection operating at the genetic level?
  • What exactly does David Sloan Wilson mean by group selection (review his major book), get Duran Bell reaction from email
  • Social evolution is likely for CmntyWageLabor because the cause is episodic
  • Genetic condition of lactose tolerance with AnimXbwealth; again, the cause is episodic
  • Ecological dryness is long-term however.
  • Write the abstract model as Do operators, filled in by Bayesian results
  • Idea of a simulation (following Cosma) with a time dimension
  • Can we follow the Imai, et al. Causal Mediation in R in the format of treatment and antecedent conditions?
  • We can create new variables from MI of missing data for Paige and Paige

Working R Causal mediation package: Imai, Kosuke, Luke Keele, and Dustin Tingley. 2010. "A General Approach to Causal Mediation Analysis." Psychological Methods 15(4): 309-334. (lead article)
The companion paper: Imai, Kosuke, Luke Keele and Teppei Yamamoto. "Identification, Inference, and Sensitivity Analysis for Causal Mediation Effects."
We have developed easy-to-use software and have written a paper that explains its use with some examples: Imai, Kosuke, Luke Keele, Dustin Tingley and Teppei Yamamoto. "Causal Mediation Analysis Using R."
Judea Pearl#2011_Bayesia: Pearl 2011, and citing papers by Imai et al. (Research papers)

See also

Ilya Shpitser, Tyler J Vanderweele. 2010. A Complete Graphical Criterion for the Adjustment Formula in Mediation Analysis. The international journal of biostatistics 7, 2, 1.



White, Oztan, Gosti, and Snarey, 2011 (SFI) Explaining the Evolution of Moralizing Gods (draft, not for circulation)
Brown and Eff, 2009.
White, Ren Feng, Gosti, Oztan, 2011 (Leipzig) R Software for Regression with Inferential Statistics (2sls-is)
Eff and Dow, 2010.

Sequential Bonferroni tests

Rice 1989
Holm 1979
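Holm's sequential Bonferroni procedure is built into base R as p.adjust. A quick sketch (the p-values are made up):

```r
p <- c(0.001, 0.012, 0.020, 0.030, 0.040)   # made-up raw p-values, k = 5 tests
p.adjust(p, method = "holm")                # Holm-adjusted p-values

# by hand: compare the i-th smallest p-value to alpha/(k + 1 - i),
# stopping at the first non-rejection
k <- length(p); alpha <- 0.05
sort(p) <= alpha / (k + 1 - seq_len(k))
```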

Multiple working hypotheses: Required reading

Confounding and Simulations

Shalizi, Cosma R., and Andrew C. Thomas. 2011. Homophily and Contagion Are Generically Confounded in Observational Social Network Studies. Sociological Methods & Research 40: 211-239.
Snijders "Spread of Evidence-Poor Medicine"


Fox, John. 2006. Structural Equation Modeling with the sem Package in R. Structural Equation Modeling 13(3): 465-486. See also his 2009 slides, Structural Equation Models (SEM).


Shalizi, Cosma R. 2009/2011. Graphical Models blog
Shalizi, Cosma R. 2011. Causality blog - Beware his misunderstanding of Don Rubin's propensity score methods - see Controversy under Judea Pearl#Articles.
Causal Graphs Laboratory paper:
Nihat Ay. 2008. A Refinement of the Common Cause Principle. Santa Fe Institute working paper.
Nihat Ay, Jessica C. Flack, and David C. Krakauer. 2007. Robustness and Complexity Co-constructed in Multimodal Signaling Networks. Philosophical Transactions of the Royal Society, London 362: 441-447.
Nihat Ay and David C. Krakauer. 2007. Geometric Robustness Theory and Biological Networks. Theory in Biosciences 125(2): 93-121.
Nihat Ay and Daniel Polani. 2006. Information Flows in Causal Networks. Santa Fe Institute working paper.
Wolfgang Löhr and Nihat Ay. 2008. On the Generative Nature of Prediction. Santa Fe Institute working paper.

Pearl, Greenland, Halbert White and others

Lots of papers at hand and at Judea Pearl, Sander Greenland and James M. Robins
  • H.White, Lu and Chalak
Lots of papers at hand and at Halbert White, Karim Chalak and Xun Lu
White, Halbert, and Xun Lu. 2010. "Robustness Checks and Robustness Tests in Applied Economics". Journal of Financial Econometrics, 8: 193-243. Matlab code
White and Lu 2009: non-robustness is signaled by failure of conditional exogeneity. See also Wooldridge 2006: 532-534, which might not be relevant. Let's look more into Hausman.
and more to come

Pod A: a conference room in the main building, reserved for you for September 7

It's a good space for 4-6 people and has better internet connectivity than the Gatehouse.

Restricted samples based on values of a control variable

Restricted samples based on values of a control variable can be obtained by blanking those values in a new depvar.

z <- which(!$HiGod4))  #rows coded on the control/depvar (sketch; the original line was truncated)
depvar <- input[z,]   #N=73

Doing OLS on the Imputed data

Scott's new Average.R

  • It appears that the imputed data are already being passed out of the function. All you need to say is ols_stats$imputed_data.
  • To get a specific instance you can say
imputed_data[which(imputed_data$.imp==i),] where i is the ith dataset.
  • To extract a column you just say:
ols_stats$imputed_data[,j] where j is the jth column.
  • But it's not that name I want:
out=lm(sccs$v238 ~ ols_stats$imputed_data[,"Missions"])
Error in model.frame.default(formula = sccs$v238 ~ ols_stats$imputed_data[,  : 
 variable lengths differ (found for 'ols_stats$imputed_data[, "Missions"]')
  • Make sure output is a data frame by saying
output =    #Once it's a data frame it should [almost] behave just like my_sccs.
lmout=lm(sccs$v238 ~ output$fratgrpstr +output$Missions +output$socplex+output$milk +output$AnimXbwealth +output$pigs)
  • After summary do vif(lmout)
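The "variable lengths differ" error above arises because ols_stats$imputed_data stacks all the imputations; selecting one imputation (or averaging over .imp) restores a frame whose rows line up with sccs. A sketch with a toy stacked frame standing in for ols_stats$imputed_data (column names are assumptions from the messages above):

```r
# toy stand-in for ols_stats$imputed_data: 3 imputations of 4 cases, stacked,
# with .imp indexing the imputation
imp <- data.frame(.imp = rep(1:3, each = 4),
                  Missions = c(1,2,1,2, 1,2,2,2, 1,2,1,1))

one <- imp[imp$.imp == 1, ]       # a single completed dataset; N matches the original data
# average each case across imputations (assumes cases are stacked in the same order):
avg <- sapply(split(imp$Missions, rep(1:4, times = 3)), mean)
```

With the real object, output <- ols_stats$imputed_data[ols_stats$imputed_data$.imp == 1, ] gives equal-length variables, so lm(sccs$v238 ~ output$Missions) no longer fails and vif() can be run on the fit.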

Handling files with factor categories (no longer numeric)

load('outDat.Rdata') #Snarey factors
sccsout<-data.frame(sapply(outDat,function(x) as.numeric(x))) #Convert factor to numeric
    1  2  3  4
 1 33 26  5  9
 2 36 25 18 34
table(sccsnew$Mission,sccsnew$v238) #Missions more high gods  
    1  2  3  4
 1 33 22  3  7
 2 35 25 10 33
 save(sccsnew, file = "sccsnew.RData")  #(see ?save)
  • The only required argument is a vector of values, which can be either string or numeric.
  • Optional arguments include the levels argument, which determines the categories of the factor variable; the default is the sorted list of all the distinct values of the data vector.
  • The labels argument is another optional argument: a vector of values that will be the labels of the categories in the levels argument.
  • The exclude argument is also optional; it defines which levels will be classified as NA in any output using the factor variable.
  • Graphics are another area that benefits from factor variables. As in the tables, the factor variable will indicate a better ordering of the graphs as well as add useful labels.
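A short illustration of those arguments (toy data, not SCCS variables):

```r
x <- c(1, 3, 2, 3, 9)
f <- factor(x, levels = c(1, 2, 3),
            labels = c("Low", "Medium", "High"))
f          # the 9 is not among the levels, so it becomes NA
table(f)   # counts per labelled category; NA is dropped
```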

New: You can convert every field to numeric this way:
sccs<-data.frame(sapply(sccsfac,function(x) as.numeric(x)))
But some fields will be nonsense (society name, etc.)
sccsnew<-data.frame(sapply(comb,function(x) as.numeric(x)))
> table(sccs$v1,sccs$v2)
    1  2  3  4  5
 1  0  0  0  0  0
 2  0  0  0  0  0
 3  3  0  1  0  0
 4 36  5 15 22  2
 5 10  5 11 12  0
 7  0  0  1  0  1

Yay! With this we can use sccsfac.Rdata to replace sccs.Rdata

I attached a file that creates the factors as if creating a new variable. A command for creating the Rain factor looks like this: FactorRain = factor(Snarey186$Rain, levels = c(1,2,3), labels = c(">40in/year", "Moderate", "<10in/year")). You can also factor Rain without assignment by just dropping FactorRain= from the above command; however, that does not store the values. But when I create these new values as in the first command above, they do not show up in the dataset when viewing it in RStudio.

Snarey factor code for category names

New data with factors

We have a new version of sccsfac.Rdata that has categories, so cross-tabs can be done with category labels.
> data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
> fdata = factor(data)
> fdata
[1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Levels: 1 2 3
> rdata = factor(data,labels=c("I","II","III"))
> rdata
 [1] I   II  II  III I   II  III III I   II  III III I  
Levels: I II III
Here is the code that I used, just so that you have it: Victoria

Moralizing_gods#Codebook_for_Moralizing_gods MUST ADD #1. Raininch #7. Snarey Year

# To make factors for variables: Rain, Terrain, Water, Missions, HiGod4
Rain.f <- factor(Rain, labels = c(">40in/year", "Moderate", "<10in/year"))
Terrain.f <- factor(Terrain, labels = c("Surface Water", "Moderate","Surface H20 Scarce"))
Water.f <- factor(Water, labels = c("Neither Rain nor Surface H20 high","Either Rain or Surface H20 high"))
Missions.f <- factor(Missions, labels = c("Premissionized", "Missionized"))
HiGod4.f <- factor(HiGod4, labels = c("Absent", "Present, inactive,unconcerned","Present, active, unconcerned with humans","Present, active,supportive of morality"))
Create factors for the AddSnarey.csv conversion to AddSnarey.Rdata (no factors for SCCS, Raininch, or YearSnarey):
Rain:    1= >40"/year;  2= Moderate;  3= <10"/year
Terrain: 1= Surface water;  2= Moderate;  3= Surface H2O scarce
Water (water budget): 1= Neither rain nor surface H2O high;  2= Either rain or surface H2O high
Missions: 1= Premissionized;  2= Missionized
HiGod4:  1= Absent;  2= Present, inactive, unconcerned;  3= Present, active, unconcerned with humans;  4= Present, active, supportive of morality

Do you think that you can take a look at this and see if what I did was what you wanted?

####sccs<-sapply(sccsfac,function(x) as.numeric(x))
#### Error in sccs$v1 : $ operator is invalid for atomic vectors
#### (sapply returns a matrix, not a data frame; wrap it in data.frame() as above so $ works)

Splice new variables, e.g., Premissionized

aux_data = outData
#save(newsccs2, file ="sccsmsn.Rdata")
save(sccsmsn, file ="sccsmsn.Rdata")  #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott)



Select rows in my_sccs

 z<-which(my_sccs$Missions==1) #nonMissionized

Select rows in my_sccs as above. This can be done more easily by having a data name and value, like Missions==1, in the create file. If this is not null, then instead of taking a 79% (or whatever size) subsample in TRAIN you use this variable to select the subsample, measure the % of cases, and use the remaining sample to report on in TEST. This gives four ways to select samples:

select by variable (gives imputation)
m_sccs (no imputation)


data(sccs) only lives in the source files that were under the 'examples' sub-directory. There is no reference to sccs in the actual two-stage OLS code, so the refactoring I did earlier this year was already supposed to do this. The examples code can be thought of as 'client code' and the two-stage OLS as 'library code.' How clients wish to refer to their data is up to them.

My understanding is that you started making copies of the create-model value-children file, which used data(sccs), and then you found it hard to modify all of them not to depend on data(sccs). But this does not mean other people need to do it that way.

Does that help?

To extract a column you just say

ols_stats$imputed_data[,j] #where j is the jth column or you can say
> ols_stats$imputed_data[,"anim"]
[1369]  4  4  2  2  3  3  3  2  4  2  1  2  1  1  3  2  3  4  3  1  3  1 2  1  1  2  1  3  2  2  2  2  3  1  2  1  2  1
[1407]  2  2  2  1  1  2  2  2  2  2  3  3  2  1  1  1  6  1  1  1  1  1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[1445]  1  1  1  1  1  1  1  1  1  2  2  1  2  2  2  1  8  3  1  1  1  1 1  1  1  2  1  1  3  4  1  1  1  1  1  1  1  1
[1483]  1  1  2  3  1  1

1488/8=186 cases with imp=8, but N=168 for the nonmissing cases on dep_var which are those to impute if data were missing.


New functions


  • Scott: Install package library(sccs) to NewMac
  • Merge Snarey.csv with sccs.Rdata, requires variable labels, and add to a newsccs.Rdata.
  • Run on Missions only, using z<-which(sccs$vMissions==1) # No missions
###z<-which(sccs$Missions==1)  #No
z<-which(my_sccs$Missions==1) #nonMissionized
  • Run without autocorrelation (for the paper) - BUT WHAT ABOUT SIGNIFICANCE? Effective sample size? Bonferroni-corrected significance tests P(i) ≤ α/(1 + k − i)?
Use effective sample size for pairs of variables using covariance -- Project for Victoria Pevarnic? --needs covariance in the equation
0) rename file to have extension 'tgz' (for security reasons Google won't let me attach a file with extension ".tgz")
1) Decompress sccs.tgz
2)  cd Statistical-Inference-With-SCCS


3 ) R
4) load('examples/data/sccs.Rdata')
5) load('examples/data/vaux.Rdata')
6) source('examples/src/create_model_value_children.R')
7) source('sccs/R/two_stage_ols.R')
8) source('examples/src/run_model.R')


The new program could be generalized to include subsample selection just after my_sccs is built (and thus after whatever row selection follows), not only

           prox_list=list(distance=Wdd), #distancelang=distlang),

To do this there would be a default zz = all rows, with an alternative set by row numbers selected just after my_sccs


The zz rows would be used to similarly reset the rows in my_aux and the rows and columns in the prox_list W matrices

A 2nd ALTERNATIVE is that I could write a little program that takes whatever *.Rdata is used, along with the W matrices and Vaux.Rdata, and writes more generic matrices for these sets of inputs


That would require keeping the three reset datasets in different subdirectories.

Seems to me the first alternative is simpler and more generic.
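The first alternative might look like this. A sketch only: my_sccs, my_aux, and prox_list are the objects assumed by run_model.R, toy stand-ins are used here, and zz would default to all rows:

```r
# toy stand-ins; the real my_sccs, my_aux, prox_list come from run_model.R
my_sccs   <- data.frame(Missions = c(1, 1, 2), v238 = c(4, 2, 3))
my_aux    <- data.frame(z1 = 1:3)
prox_list <- list(distance = matrix(1, 3, 3))

zz <- which(my_sccs$Missions == 1)   # user-chosen rows (default: seq_len(nrow(my_sccs)))
my_sccs   <- my_sccs[zz, ]
my_aux    <- my_aux[zz, , drop = FALSE]
prox_list <- lapply(prox_list, function(w) {
  w <- w[zz, zz, drop = FALSE]       # reset rows and columns of each W matrix
  w / rowSums(w)                     # re-row-standardize after subsetting
})
```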

Scott Possible TODO

1. Scott: After

4) load('examples/data/sccs.Rdata')

can sccs be used to make function data(sccs)??? If so then we will have a version for installation without R CMD... in PC and a package installation version.

Victoria and I, however, need installations, she on her PC and me on my NewMac.

Hi Dr. White,

To fix the problem, you have to do the following:

  • 1. Open the following document in a text reader: two_stage_ols which is found in Statistical-Inference-With-SCCS->sccs->R
  • 2. Copy the text into R into a new script frame
  • 3. Save the new script with a .R extension named slightly differently
  • 4. Edit the following line in the script:
   #--the imputand is placed as a field in imputed_data and named--
   imputed_data <-cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)]))
   Change it to the following:
   #--need to add +2 after NCOL(combined_input) and stringsAsFactors=FALSE--
   imputed_data <-cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2],stringsAsFactors=FALSE))
  • 5. Save the new script and close R
  • 6. Re-run the programs with a change in the last line of the code as follows:
This is the original code:
source('sccs/R/two_stage_ols_vp.R') #Change this line to include the name of the new script file that you created above

The program should work now. I was able to write this up based on some corrections that Giorgio had previously made but never written up.

2&3 have high priority for the article I am writing

Select subsample

2. z<-which(sccs$V7==2) is a statement that, given z already selects the cases nonmissing on the depvar, can further restrict the data used to those cases with certain values (V7==2) on a control variable. We are able to make this selection in create.model and pass it into two_stage_ols, then run_model.

NOW run_model has 73 rows for incoming my_sccs but 186 for a version called otherwise.

Splice new variables, e.g., Premissionized

3. I am trying to splice AddSnarey.csv onto the end of your sccs.Rdata. It has those V1-V9 variables with the variable labels underneath, and the result is weird. I had to take off the variable labels to make a 186-186 congruence between the two data files. But when done I didn't know how to insert my variable names. So my problem is: HOW DO I GET THE VARIABLE LABELS INTO THE NEW MERGED FILE, either in the first row or the second?

do ?read.table

  SCCS Raininch Rain Terrain Water Missions HiGod4 Year  
###aux_data = read.table("AddSnarey.csv", header = TRUE, sep = ",", col.names)  #Snarey186.csv
aux_data = read.csv("AddSnarey.csv", header = TRUE, sep = ",")  #Snarey186.csv
#save(newsccs2, file ="sccsmsn.Rdata")
save(sccsmsn, file ="sccsmsn.Rdata")  #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott)
Splice in WldRelig=cbind(sccs$v2001,sccs$v2002) -- it's already in sccs!
Forget Var1918-2000.csv Grief and Mourning.
World Religion v2001 v2002 is in sccs.

4. This item has very high priority for SFI and the article, and it would be good to implement. I thought of a solution for the 100% option: just allow a choice of 100%, but when the program comes to the test sample, reset the 0 to 30%. The program will run as usual. No major reprogramming needed. What I really need is to have that option soon. Victoria can take care of the rest, I think. It's a suggestion that might entail minimal program change. If you see a problem let's discuss.

5. Output the full imputed dataset.

Splitting the sample various ways, i.e., USING CONTROL VARIABLES

Great!! And what if I want to split based on the value of a dichotomized variable, for example? (Doug)

Even easier: z<-which(SCCS$v238==4)

If you want to run just on the odd-numbered societies (as so many coders did), you can do the following:

#z<-which(odd(SCCS$"sccs#")==FALSE)  #gives even numbered
z<-which(odd(SCCS$"sccs#")==TRUE)    #gives odd numbered; odd() is in the gtools package
oddSCCS<-SCCS[z,]

Now oddSCCS will be a data frame with only the odd-numbered societies.

If you want to randomly split it:

z<-sample(nrow(SCCS),93)  #(reconstructed: draw 93 of the 186 rows at random)
randSCCS<-SCCS[z,]

And randSCCS will contain 93 societies.

Investigate adjust

ANSWER: "Controlling for some set of variables" would just mean including those as independent variables in the regression (so the effect of the variable of interest is not obscured by omitted variable bias). This is always done in a good regression model. (DRW: See H. White on core variables vs. controls)

"Fixing certain variables at their mean values", in their paper, simply means that when they calculated the values for their charts, they used the mean values of all the variables EXCEPT the variable they wanted to show the effect of--for that variable they used the actual values. This is an issue of making charts, not an estimation issue.

Another option for "adjust" is to have an auxiliary program that takes a set of independent variables and constructs a subsNsccs.Rdata, either selecting a sample with certain values, or setting certain variables to their means or other values. -- NOT NEEDED

I put this question to Anthon. But "adjust" is a standard option in Stata
and other regression programs
If you have a look at that page, what I want to discuss with you briefly is how to implement controls and adjustments in our program. Having the type of selection shown there is a start and will do for now, but we should think ahead.
---------------------------- Original Message ----------------------------
Subject: [Fwd: Re: splitting the SCCS]
Date:    Sun, July 24, 2011 6:59 am
To:      "E. Anthon Eff" <>

Thanks: in addition to making it possible to model for a selected control category (like modeling polygyny for the large subsample of Missionized societies), I am trying to get my head around this:

What is meant when a regression is reported (as in Henrich et al. 2010. Markets, Religion, Community Size, and the Evolution of Fairness and Punishment. Science, 19 March 2010: 1480-1484) as fixing certain variables at their mean values? Or controlling for some set of variables? How would that be done in Eff and Dow, generally speaking?

Original Message ----------------------------

Subject: Re: splitting the SCCS From: "E. Anthon Eff" <> Date: Sat, July 23, 2011 1:37 pm To:

Even easier:


Moralizing gods

Add a variable in .csv format to the sccs.Rdata file

What we need now

Brown and Eff 2010 code

Save variables after imputation
 e.g. for PCA
100% sample
Align print labels
Hausman test
Output results from stage 1
Instructions include ols_stats$
DW should code Paige and Paige
rather than *gz installation

For PCA can this be passed to two_stage_ols - so it does not have to be changed? E.g., put the variables in create and the code in 2sls?

Hi Doug. We're back. Still haven't found the time to look at the problem Scott mentioned...

The principal components are all done within the loop over the imputed data sets. It's in the script we submitted with the paper. Looks like this:

#--composite variable for size--

Not in the main program:

Preprint: "R Software for 2sls inferential statistics (2sls-is)" pdf and doc

Latest attempt at change in the code (see: John Fox) If you added the lines I showed you then lm.restricted is a part of ols_stats, so try ols_stats$lm.restricted. You can also try names(ols_stats) to see all the sub-components of ols_stats.

Hal: The challenge for me as an outsider to anthropology and as a semi-outsider to causal analysis is that I can only comment on the science and not on the crucial sociological/anthropological issues of what folks embedded in this area will find appealing or exciting. The latter issues are unfortunately the most determinative factors in the funding decision, so I fear I can't be of much help in that crucial dimension.

As far as the science goes, I can offer the following. 2sls-iv is only one of a considerable number of approaches to recovering causal/structural information. I attach a copy of a paper I think I sent you earlier that encompasses quite a few of these. ("An Extended Class ...") Proposing to extend the software to these contexts might have some appeal.

Another extension is to consider nonparametric structural systems. I attach two papers that examine the nonparametric versions of the standard instrument (2sls) and conditioning instrument cases. ("Local Indirect Least Squares ..." - LILS_SUPP.pdf and "Estimating Nonseparable Models ...".)

Your use of regression diagnostics is appealing. The attached paper "Robustness Checks and Robustness Tests ..." could be extended to the 2sls framework. This might also be appealing to reviewers.

Finally, although I have a great deal of respect for Judea and his contributions, I find the do operator awkward and unnecessary. Rather than focusing on which variables are fixed and which are subject to intervention, as the do operator requires, if one instead focuses on which variables are jointly free to vary as functions of all other system variables, as the partitioning device of settable systems allows, one gains access to causal discourse in a variety of contexts where it is not otherwise possible. This also makes distinctions between partial equilibrium effects and full equilibrium effects natural, which is especially important in economics. I attach one of my latest items ("Causal Discourse ...") that pursues this approach and applies it to an auction game.

I apologize for sending along such a wad of papers, but hopefully some of the ideas here may be useful in your NSF proposal or elsewhere.

There is a concert with the Big Band and Jazz Hall of Fame Orchestra coming up at the end of August where I am slated to play. I will let you know the details soon. Also, I sometimes sub in the big band at Clay's (the old Elario's); if so I will let you know.

We still need to get together for lunch, and now that I am finished traveling for a bit, that is feasible. Are you in town? Or are you in Santa Fe? I'd love to get together!

Warm regards,

Hal (see Halbert_White#On Causal modeling and Xun Lu for Matlab code on testing robustness that could be translated into R)

Curating the code for successful models

Tolga and Giorgio: if you have not already, you should download TextPad to edit results pages on the wiki by inserting a first column of spaces (EduR_1.2) EduR_1.2
Comments: Replicates (EduR_2)
Comments: Does not replicate (EduR_1.3) same as below: (EduR_1.1)
Comments: continued below
Comments: incomplete: one case to be deleted Society 186, Yahgan
Comments: Compared to Brown and Eff (2010), substitutes ecorich (which groups sccs$v857 categories to “harsher” codes 1&2&6, “temperate or rainforest” codes 3&4, and “savannah” code 5) for PCAP and foodscarc, substitutes animXbridewealth for anim, and has 24 of 28 significance tests at p<.10. That model, with an average R2 =.494, 6.5% higher than the seven-variable model, posits a poor environment, small communities, a pastoral economy with exchange of brides for animals, and caste stratification (the weakest variable), such as occupational castes in pastoral economies, as predictors of high gods that also tend to be more concerned with human morality.
Comments: animXbridewealth predicts evil eye
Comments: not a good model
Comments: needs to be run and modified
Comments: (problem with prestate sample)
Comments: Good model, replicates at 79%

April 2011 instructions from Github (temporarily inactive)


This short guide will walk you through the steps of installing the package, building a model, and fitting it.

  • Install sccs_1.0.tar.gz to your working directory
  • On Linux type ./; on Windows type ./install.bat
  • That requires the "R CMD INSTALL sccs_1.0.tar.gz" command
  • which requires that R commander be installed
  • Load R, type: library(sccs)
  • In R, set the working directory by typing: setwd("<working dir>") as in the example below. You create your model in the file that you will name in the first source line below. Once it is finished you copy and paste the nine lines below into R, either from R in the unix terminal or the R Gui.
setwd('/Users/drwhite/Documents/3sls/sccs/') #Macbook
setwd('C:/My Documents/2sls-is/') #PC
  • Create-EduR-S.R - start modifying a sample program (change "S" to an unused number in this wiki)
  • EduR-0 outlines how the file system is organized at installation and WHERE TO PUT EduR-n.R programs where n is a number so as to keep track of models in this installation.

Creating a system

  • DRW: In Create-EduR-S.R I am trying to give the new user a starting point for the simplest possible working model. IT DOESN'T WORK AS YET. NOT CLEAR HOW TO DEBUG IT. E.g., debug(run_model.R).
  • EduR-0 outlines how the file system is organized at installation and WHERE TO PUT EduR-n.R programs where n is a number so as to keep track of models in this installation.
  • BELOW there are some working models.
#You do not debug files but functions defined within files, type ?debug at the command line.
 Spatial weights matrix not row standardized
 6: In lm.LMtests(lm.restricted, W_mat, test = c("LMlag")) :

 setwd('/Users/drwhite/Documents/sccs/sccs-2011')  #/models/src

 EduR-0 - the new series 

 EduR-1 - the new series 1

 source("examples/src/run_model.R")  ##the error will occur here, put in debug()
DEBUGGED THE NEW  two_stage_ols.R WITH
o <- linear.hypothesis(lm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2]
was linear.Hypothesis or linearhypothesis
  • In R, load an example model by typing: source("examples/src/create_model_value_children.R")
  • To source the latest code, you can type: source("sccs/R/two_stage_ols.R')
  • In R, run two stage ordinary least squares by typing: source("examples/src/run_model.R")
  • Examine the summary statistics generated from the fitted model by inspecting the file: summary_model_value_children.csv

top top top priority to fix in two_stage_ols_split_halves

  1. see Debug; correct R2s errors (too big); use debug() to check that W matrices are row-normalized
  2. YES Done with option to add/ignore the distance/language matrices. See Run_model.R#User_instructions:_Language_and_Distance
  3. Double check that the random seed for RANDOM in split halves is also random.
DRW added set.seed(11)
must also add, right after set.seed(11), a line saving .Random.seed and
print it in the output. That will allow results to be checked for replication.
  1. The distance/language matrices that affect the errors are set in Run_model.R
  2. in split halves print the first results then print the second.
  • Might be that deleting these two lines in the original: did not help (the earlier 3-variable model was better!) but set.seed did help
map to initial for loop in compute_predictions.
Those are not the fitted values but the actual interpolated values from Wy.
These two lines in compute_predictions are used in two-stage-ols and in split halves,
so let's see if the results change when COMPUTING ORIGINAL EFF-DOW............
see Eff and Dow
 #--Stage 2 OLS estimate of restricted model--
xR<-lm(depvar ~ fyll + cultints + roots + fish +
exogamy + settype + femsubs, data = m9)
#--corrected sigma2 and R2 for 2SLS--

Download aux files

restrvar.for restrvar.exe. Has to be run on a PC. Can get a series of my_sccs columns ready for our model. WITH THIS PROGRAM I CAN WORK THROUGH EACH OLDER MODEL WANTED FOR A NEW SERIES WHERE THE Create Model R file is mostly automated. This file takes a column of my_sccs and converts it to the standard my_sccs, indep_vars and restrict_vars formats.

Most recent changes

Items before Leipzig


  • 1. Train N (e.g. 124) depends on the train percentage and on the depvar: N*ratio; Test N is the remainder
  • 2. An *.Rdata with the stage2 depvar and original indepvars for the model
  • 3. Can you pass out lm.restricted and the equivalent of row.names(sccs)?
 lm.restricted <- lm(restrict_eq,data=dataset)
reason: I want to do a qq.plot (car). According to John Fox p. 193:
qq.plot(lm.restricted, simulate=TRUE, labels=row.names(sccs))

Column labels in the *.csv output are misaligned, e.g.:

Train N test N        Train.R2.IV_distance        Train N:
0.457551025           0.980458455                 124
CELL 1                CELL 2         SKIP CELL 3  CELL 4

(NOTE THAT TRAIN N, TEST N will depend not only on % but on NUMBER CODED FOR DEP VAR, so don't just put the percentage.)

This fixes the misalignment; the user must add part.cor, is all. Or maybe that last missing name can be appended automatically.

Major decisions and simple priorities

option to include/exclude language and/or distance

  • This will improve the split-halves test.

two_stage_ols_split_halves.R DECISION TO USE set.seed, not to DROP a piece of code


  1. added a NEW program two_stage_ols_split_halves.R that comments out the old two_stage_ols.R routine and comments in the split_halves version
  2. modified NEW program two_stage_ols_split_halves_seed.R ditto above and added set.seed

DRW Fortran program to convert old EduMod/EduMac programs to EduR-x

Just use the existing code, apply to my_sccs, insert in standard wiki page.

Tolga will do p12 test of Commentator code

Commentator code graph with seven variables

If train_predict works, alter the draft of the article

  • The idea is that significance tests from many independent variables are not good criteria for model fitting.
  • What the split halves test shows is that there is often a cutoff of the top predictor variables, between those that replicate and those that do not.
  • If the model eliminates the non-replicating variables and split halves is rerun, the train-predictor test is often much more significant.
  • This can be triangulated with the residual correlation test.
  • This also allows us to experiment with sorting potential indep_vars by *significance *effect ratios *low VIF
  • TRIANGULATION is therefore essential. MUDSTICKING is as much a problem with 2SLS as elsewhere.
  • What is going on is that 95% of the variance in the dependent variables is accounted for in a rather deterministic way (no missing data) by autocorrelation.
  • What is left over, however, is a mixture of replicable structure and random noise.
  • When you apply the SPLIT HALVES test there is now NOISE in the sampling of the cases and their W MATRIX subsets.
  • So applied once there is a test of replication. The replicable variables are more likely to have TRAIN-TEST correlations.
  • Applied several times ... when there are replicable TRAIN-TEST correlations the MUDSTICKING variables will be eliminated.
  • The magnitude of the problem is seen in the Brown-Eff paper, where the residual correlations are computed, and are very low.
  • Computing the outer product of a vector by itself for a train-test (split-halves).
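The outer-product computation in the last bullet is one line in R:

```r
v <- c(1, 2, 3)
outer(v, v)   # 3x3 matrix with entries v[i] * v[j]; same as v %o% v
```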

Start page


List of five items for new package 4-24-2011

Best to always make options activated by parameters at the top of the program.

  • Important to find a way to identify aliased variables in indep_vars, restrict_vars, drop_vars
  • The commands I am using to append to a file do not have any parameters relating to closing the file. You may want to try adding the param eol="\r\n" to each of the write.table commands in run_model.R.

Debug() instruction

You do not debug files but functions defined within files, type ?debug at the command line.

To put a breakpoint at the beginning of the function two_stage_ols_mi you can type debug(two_stage_ols_mi) and then run the program. The program will halt when the function is called and you can inspect all the variables defined at that point by typing ls(). You can inspect all of the parameters of that function and any local variables, as well as any functions defined.

To remove the breakpoint, type undebug(two_stage_ols_mi); undebug() takes the function name, not the word breakpoint.

Once you are at a breakpoint you can go to the next line by typing the letter 'n'. You can type the letter 'c' to continue to the next breakpoint if there is one, or until the end of the program. For example, after getting to the beginning of the two_stage_ols_mi function you can hit 'n' several times until impute is called and then inspect everything right after the imputation is done.
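The workflow above can be sketched with a stand-in function (the real target would be two_stage_ols_mi):

```r
# Sketch of the debug workflow with a stand-in function.
two_stage_demo <- function(x) x + 1    # stand-in for two_stage_ols_mi
debug(two_stage_demo)                  # breakpoint at function entry
isdebugged(two_stage_demo)             # TRUE; at the browser prompt use n, c, ls()
undebug(two_stage_demo)                # remove the breakpoint
isdebugged(two_stage_demo)             # FALSE
```
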

1a. Output a file with finished variables, averaged

Key insight: combine the imputed data into averages. Then we can not only correlate but also plot, because the averages will no longer be integers (with an option to add some 0.5% of random variation to the variables), giving better plots. And probit can be done on the output variables rather than the integer input. This also makes it possible to run interactions among variables with imputed values.
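A minimal sketch of the averaging step, assuming `imps` is a list of equally-shaped, all-numeric data frames, one per imputation (names are illustrative):

```r
# Cell-by-cell average across imputed datasets; results are no longer integers.
imps <- list(data.frame(v1 = c(1, 3), v2 = c(2, 2)),
             data.frame(v1 = c(2, 5), v2 = c(4, 2)))
imp_avg <- Reduce(`+`, imps) / length(imps)
imp_avg$v1   # 1.5 4.0 -- non-integer, so plots and probit become usable
```
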

Option activated by parameters at the top of the program.

1b. For this file and the original variables, options to SAVE a frequency table

Functions in R - can we put my_sccs into a table, compute row sums with margin.table(my_sccs,1)?

  • Option 1: Output table saved right after original variables defined.
Has 186 entries in rows, with (a) var name, (b) number of cases coded, (c) the first 15 ordered variable frequencies, e.g.,
v3 186: 14 26 17 ...
v7  43: 7   3   1  ...
  • Option 2: Same, now after data are imputed and imputed variables are averaged (the cases coded will be a constant)
v3 35: 4 2 7 ...
v7 35: 1 3 1 2 ...
  • No column values. This is like the function table(v1), with or preferably without the header row of values
     1 2 3 4     
v3 4 2 7 ...
v7 1 3 1 2 ...
  • Explanation. This is needed as a SAVED file not a screen printout to help DEBUG the program when it doesn't work.
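A hypothetical helper in the spirit of the request above: one line per variable with its name, number of cases coded, and up to the first 15 category frequencies, written to a file (the function name and format are assumptions, not the package's):

```r
# Write a per-variable frequency summary to a file for debugging.
freq_lines <- function(df, file, nfreq = 15) {
  lines <- sapply(names(df), function(nm) {
    v <- df[[nm]]
    cnt <- head(as.vector(table(v)), nfreq)       # counts without header values
    paste0(nm, " ", sum(!is.na(v)), ": ", paste(cnt, collapse = " "))
  })
  writeLines(lines, file)
}

f <- tempfile()
freq_lines(data.frame(v3 = c(1, 1, 2, NA), v7 = c(1, 2, 2, 2)), f)
readLines(f)   # "v3 3: 2 1"  "v7 4: 1 3"
```
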

In Eff and Dow we could debug variables with a command

Using debug()

with a variable name. The program often crashes when a variable is empty or not defined.

2. Printing output with coefficients: include the new parameters in the new package plus two others:

CSV Results.png

The 4 new inside \\ \\: \\coef\\ range effect ratio corr \\Fstat

You did: range and effect.
Ratio is effect divided by the range of the depvar.
The range of distance is 0-2; for language it is -1 to +1.

Corr is cor(x,y) where x=depvar, y=indepvar (new: if this can be done AFTER imputation; maybe it needs to be done AFTER imputations are averaged)

coef range effect ratio corr Fstat 
  • ALSO ADD 2 COLS: correlation with RESidual and RAW dependent variable
  • Then for all rows except (Intercept) -- including distance and language with ranges of 2:
Get the distribution of each variable and its min and max, Pearl's indep var x and x', say the variable runs from 3 to 16.
Range is the difference between max and min, in this case 16 - 3 = 13.
Effect is the UNSIGNED product of max (in this case 16) TIMES the regression coefficient. 
Ratio is the effect divided by range of the dependent variable, all exemplified below:
            coef range effect ratio corr  Fstat  ddf pvalue   VIF    Dependent variable: Parental nurturant guidance 
(Intercept) 23.345                        7.979 31196  0.005           SCCS v492 plus (9-v504) range 1-14 (range 13)
language    -1.084  0-2 -2.17 .504  0.971 2.818 21143  0.093 1.558 
distance     0.352  0-2  0.70 .071 -0.322 1.352 60204  0.245 1.766 
valchild     0.108 1-36  3.89 .299  0.321 6.436  1299  0.011 1.128 (range=35)
superjh     -0.472  1-5  2.36 .182 -0.166 6.554 28408  0.010 1.151 (range=4) 
pigs        -1.632  1-2  1.63 .125 -0.231 5.621 11009  0.018 1.418 (range=1)
agrlateboy  -0.309 1-10  2.91 .224 -0.112 6.662   361  0.010 1.101 (range=9)
                          Sum .938 is circa 1, i.e., roughly 100% of the range of the dependent variable.
  • Explanation: The effect and ratio coefficients allow the use of the regression coefficients for causal modeling; they give the relative contributions of each independent variable. The intercept term (23.345) is a baseline for measuring the depvar (here: nurturant parenting, higher than its empirical maximum of 14). It is raised by about 4 by how parents value children (.108*36 = 3.89), and brought down 2.36 points by political complexity, 1.63 by keeping pigs, and 2.91 points when teenage boys are aggressive, an effect slightly smaller than that of the value of children.
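The effect/ratio computation described above can be sketched as a small R function (the function name is illustrative; this assumes all-numeric predictors whose names match the coefficient names):

```r
# Effect = unsigned coef x max of the indep var; ratio = effect / range of depvar.
effect_table <- function(fit, data, depvar) {
  b <- coef(fit)[-1]                                  # drop the intercept
  maxs <- sapply(data[names(b)], max, na.rm = TRUE)
  effect <- abs(b * maxs)                             # unsigned coef x max, as in the text
  ratio <- effect / diff(range(data[[depvar]], na.rm = TRUE))
  data.frame(coef = b, effect = effect, ratio = ratio)
}

set.seed(1)
d <- data.frame(y = runif(30, 1, 14), valchild = sample(1:36, 30, TRUE))
effect_table(lm(y ~ valchild, d), d, "y")
```
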

Related change?

Would it make sense to give two rather than one set of outputs for the coefficients? I.e.:
one as above, in the order requested in restrict_vars,
then a blank line and a line saying that the same results are given next, sorted by:
significance (low first ranking), eff_ratio (high first), and VIFs (low: first ranking).
two weighting parameters can be set against a default x
x= -log10(signif) - low signif is good, high x
y= -log10(VIF) - low VIF is good, high y
z=eff_ratio - high z is good
x ranges from high (9 being p=.000000001) to below 1; 1 being p=.10
y ranges from 0 (VIF=1=good) to negative (very high VIF)
z ranges from 1+ (exceptional eff_ratio) to 0 (no effect)
x is calculated. Higher significance ranked first
y default = weighted from 0 (no influence on rank) to above or below 1 (same as x)
z default = weighted from 0 (no influence on rank) to above or below 1 (same as x)
  • If the user specifies 0 0, results are ranked by significance alone, low p first.
If the user specifies 1 1, VIF and eff_ratio are weighted equally with significance.
Weighting .4 .5 weights VIF a bit (0.4) and eff_ratio at half the weight of significance (0.5).
VIF, eff_ratio can be weighted >1 to override significance
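The x/y/z weighting above can be sketched directly (names follow the text; wy and wz are the two user-set weights, and the function name is an assumption):

```r
# Composite ranking score: x + wy*y + wz*z, sorted decreasing.
rank_score <- function(signif, vif, eff_ratio, wy = 0, wz = 0) {
  x <- -log10(signif)       # high when p is small
  y <- -log10(vif)          # 0 when VIF = 1, negative when VIF is high
  z <- eff_ratio            # high for strong effects
  x + wy * y + wz * z
}

p <- c(0.001, 0.05); v <- c(1.2, 3.5); e <- c(0.4, 0.9)
order(rank_score(p, v, e), decreasing = TRUE)                 # significance alone
order(rank_score(p, v, e, wy = 1, wz = 1), decreasing = TRUE) # all three weighted equally
```
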

Options activated by parameters at the top of the program.

Question to statisticians. We have a 2SLS regression where autocorrelation in stage 1 is added as an estimated variable in stage 2, mostly eliminating autocorrelated errors according to our normalized W matrices.
The significance tests are fairly standard for OLS then, with no reduction in effective sample size because the error term is random.
At present we do unrestricted models (all possible indep vars) and then select the variables that are significant, sometimes in successive steps, to get a restricted model and its diagnostics.
The original regression coefficients ignore Pearl's idea of an effect measure relative to the scale of the indep and dep variables, but these are easily calculated: they measure how much a change of one unit in the indep variable affects the dep var, say by 1.5 units in a dep var that runs from 1-10 units, thus a 25% change expected from the model.
So I want to figure out how to do a weighted sort on these criteria and, starting with the variables at the top of this sort (if they have VIFs fairly close to 1, they are fairly certain to be significant in a final model), continue until new items that could be added are not significant in competition with those already present.
So far, significance alone does a good job -- picking those that are significant in the unrestricted model does a pretty good job of getting a final restricted model where they are significant. But I would like to improve on this a bit. Obviously those with higher eff_ratio will predict "more", so why not use that in the sort?

3. Make Probit an option

Option activated by parameters at the top of the program.

4. EduMac-1 test for split-halves

Option activated by parameters at the top of the program.

Please run the model at

  • Run the EduMac-1 model with split halves. The question is whether there is still a negative correlation. That will determine whether we put split halves in the new package.

working with Split halves ("train")

Here are the two lines to search for:

  1. results = two_stage_ols_train_test(dataset_i,y,prox_list,indep_vars,restrict_vars)
 results = two_stage_ols(dataset_i,y,prox_list,indep_vars,restrict_vars)

was: If you uncomment the first one, you get the split halves; if you leave it as is, you get the normal version. BUT WOULDN'T IT BE BETTER TO MAKE ALL SUCH OPTIONS CONTROLLED BY PARAMETERS SET AT THE TOP OF THE PROGRAM?

5. (check that these work) Output of Results (Priority rankings !1 !2 !3)

  1. !1 Include a column of correlation coefficients in addition to regression coefficients.
  2. !2 We want distance and language to be a default BUT ALSO TO BE ABLE TO ELIMINATE THEM IN THE MODEL, i.e., test language without distance, distance without language, both, or neither. The question here is whether one is significant without the other.
  3. !3 Code to take the output of several models and use sna or igraph to map the variables (1-3000 being the variable numbers according to the codebook), i --> j, value k reflecting categories of significance: 1=<.11, 2=<.03, 3=<.01, 4=<.001, 5=<.0001, 6=<.00001; sizes of nodes in a *.cls file to reflect significance of distance autocorrelation.
  4. Add simple mapping option that uses Extra_code#General_map
  5. Add country mapping option that uses Extra_code#Brown-Eff_map and source("")

Old ranking of priorities

We need to prioritize (!1 !2 !3 etc.) the order in which to do those we choose to do. We can use a letter code A B ... Z a b ... z just after the #.

DONE new parameters for reporting *neff

4 new inside \\ \\ then *neff: \\coef\\ range effect ratio corr \\Fstat ... pval \\ norig *nefforig nimp *neffimp

Earlier priority for making giant or large EduMod project graph with a Python script

Excel file for making giant EduMod project graph

  • Python for the Pajek file
*Vertices   max (e.g., 3067) of col V (projects)
that number of numbered rows, rownum=1-max
 col 1<3000: rownums plus "label 7 or 12"
 col 1>2999: rownums plus "label 3"
 col 1<3000: cols19, 5, 15 "label 7"
 col 1>2999: cols 22, 5 "label 5 & 7"
  • The Pajek file made by Python should use only those indepvars that link to 2+ depvars
  • This can also provide an excel format for recording causal link analysis for the 339 foragers database

Using the new output coefficients with a large EduMod project graph from Python script

Turns out they work perfectly. Language can be given a range -1 to 1, effect 2, and ratio of the coef/2

and distance can be given a range 0-1, effect 1, and ratio equal to coef.

For a multivariate causal graph, do adjustments

Split halves correlation and regression (!4 up through !6)

Sun, January 23, 2011 2:28 pm

Ok I made all the changes we talked about and fixed the train/test splitting so it's now randomized. See the attached code. Essentially I did what we discussed otp where I just pass everything through for the train, except the R^2 for the final model is now coming from the test. I am not passing anything else through for the test set. The R^2 looks strongly negative, which may not be correct. If there is a problem I think it's in the method called "correct_r2s". What I recommend is you source this file, then type "debug(two_stage_ols_train_test)", then source run_model.R, and you can walk through each line. The correct_r2s happens right at the end of the function. The test errors look about 2x those of the training errors, which seems plausible.


  1. Let the program run through Vaux for missing data. Then pick half the sample of N cases randomly.
  2. !4 Break the W matrices (zeros in the diagonal, of course, and row-normalized to sum to 1) into two rectangular parts, one for the 1-n societies and another for the remaining 1 to m (n+m=N).
  3. !4 Now for 1-n do the first of the two 2sls routines, but use rectangular Ws with 1-N cols and 1-n rows.
  4. !4 Then do the second of the 2sls routines, and obtain the betas Bj for each j, Xj,n of the variables for the 1 to n sample.
  5. !4 Now with the second set of rectangular W matrices, for the 1-m sample of cases.
  6. !4 Using the betas already computed, Bj,n, and the Xj,m of the variables for the 1-m sample, compute the predicted \hat Y = \sum_{w=1,2} \sum_{j=1,k} \rho_w W_w X_j +  \sum_{j=1,k} B_j X_j
  7. !4 Correlate Y for the second sample with \hat Y and report the R^2 .
  8. !5 Report the intercept, regression coefficient, and significance for the \hat Y, Y, with regression line and scatterplot graph
  9. !6 Option to repeat multiple times and plot error bars on the regression line and scatterplot graph
  10. !6 Option to do intercept, regression coefficient, and significance for the \hat Y, x_i, with regression line and scatterplot graph, for each independent variable.
  11. !6 Together, repeated tests should be able to sort out which are the reliable predictors and which are not replicable.
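The split-and-normalize steps at the start of the list can be sketched with toy data (the real W matrices come from the package; names here are illustrative):

```r
# Random half split with row-normalized rectangular W blocks (steps 1-2, 5).
set.seed(2)
N <- 10
W <- matrix(runif(N * N), N, N); diag(W) <- 0
train <- sort(sample(N, N %/% 2))
test  <- setdiff(seq_len(N), train)
# Each rectangular block keeps all N columns; rows re-normalized to sum to 1.
W_train <- W[train, , drop = FALSE]; W_train <- W_train / rowSums(W_train)
W_test  <- W[test,  , drop = FALSE]; W_test  <- W_test  / rowSums(W_test)
```
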

Names and options

  • Names need to be generalized:
not my_sccs but my_data instead, encouraging use of ANY *.Rdata with W matrices and .Vaux.
not Rsccs or the current name of the program but R2stage or something more generic
  • Specify in the user program (that has my_data etc) options such as
(x) correlation coefficients
(x) distance and/or language
(y) (n) distance in output
(y) (n) language in output
(x) maps (mutually exclusive choice)
(o) simple map
(o) country outline map
whichever is chosen should crosscut these non-mutually exclusive options
(o) dependent variable map
(o) independent variable map (give variable $v##num or my_data:varname)

Error messages

  1. We need an error message when the program does not run. It would also help if the program STOPPED when there is an error. An error message is needed when depvar is not defined, when a restrict_var is not in indep_vars or in my_sccs, or when an indep_var is not in my_sccs. The lack of an error statement is confusing to students: when a restrict_ or indep_ var change gives an error, the program just reprints the previous results and students think it is a new, correct result.
  2. We need something like the coef(xR) and coef (xUr) Error messages for Eff and Dow 2009 that will work for my_data: Error messages for Rccs 2010.
  3. After the my_data program runs with some part of the variables in restrict_vars, if the WALD test diagnostic is significant, have the program cycle through the remaining indep_vars, find the ones that are significant if added (singly) to the current model, then print those results (one or more models).
  4. If the RESET diagnostic is significant, restart the run and print each regression with all but one of the restrict_vars, with the final variable squared, then square-rooted. Stop when one is found. The user will then have a choice of what to do next.
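Hypothetical fail-fast checks in the spirit of item 1 (the function name is an assumption; variable names follow the text):

```r
# Stop with an informative message instead of silently reprinting old results.
check_model_spec <- function(depvar, indep_vars, restrict_vars, my_sccs) {
  if (is.null(depvar) || !nzchar(depvar))
    stop("depvar is not defined")
  miss <- setdiff(indep_vars, names(my_sccs))
  if (length(miss))
    stop("indep_vars not in my_sccs: ", paste(miss, collapse = ", "))
  bad <- setdiff(restrict_vars, indep_vars)
  if (length(bad))
    stop("restrict_vars not in indep_vars: ", paste(bad, collapse = ", "))
  invisible(TRUE)
}

d <- data.frame(v1 = 1, v2 = 2)
check_model_spec("v1", c("v1", "v2"), "v2", d)       # passes silently
try(check_model_spec("v1", "v9", "v9", d))           # reports the missing variable
```
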


  1. Would it be hard to have an option that exchanges the depvar with an indepvar, giving the new depvar the name of the old indepvar? ---- One reason why we find little reciprocal causation is that the indepvars are not investigated as depvars.
  2. With the new restrVar.exe program we need some way (like my FORTRAN program) to autocopy all the fx_ or my_sccs variables into indep_vars, then all but the last into restrict_vars. ---- If the vNums and the vNames in Fx could be saved, it would be good to put the vNums at the right in the line for each causal variable (restrict_var results).
  3. Can we automate it so it can collect new bunches of variables that work together, or add one variable at a time?
  4. Ditto: if the RESET Test diagnostic is significant, have the program cycle through the significant indep_vars and find the ones that predict better with log(1+oldvar) or oldvar^2 (squared).
    • When there are more than 30 indep_vars, and many high VIFs, automatically fill the restrict_vars with several overlapping contrast sets, run each separately, and print as many results as there are sets. This will reduce the VIFs and, by having fewer "congested" variables, increase the likelihood that a significant variable WILL be found to be significant in one or more contrast sets.

The package

We need to settle on a version 1.0 package among these choices, make it a standard installation from the [R] site (find out how to do this), and thus make the installation easy rather than complex, so students or researchers can easily install at home. And we need to give Mike Migalski an installable version by mid-summer for the fall classes. He does one big installation, not dribs and drabs, so it is important to get him our installable code by the end of August.

Creation of a program to convert my_sccs to indep_var and restrict_var lists


The problem I mentioned happens here: 
o <- linear.hypothesis(lm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2]
I'm not quite sure what's wrong though. Still debugging...

On Mon, Jan 17, 2011 at 2:16 PM, Scott White <> wrote: > I don't know how to get it to stop on a line but you can get it to stop at a function by saying debug(function). I'm working on this today.

Examples of new code

Making and Expanding databases

R programs for converting databases to *.Rdata *.rar file, right click and save


Guide to fixing user errors

Changes to the R code (Brown & Eff 2010 --> Scott White)

Chris Brown and Tony Eff have a new article with some additions to their [R] code and a real example with R2=.50 or so. New items:

  1. Ways of referencing variable names, e.g. "dummy", PC... for principal component, ... squared.
  2. Calculating PCAs requires the psychometric package, which seems not to be available except as a *.gz or *.zip
  3. The code is attached as "supplementary"
  4. In the code the better way of drawing maps
  • R2p is the R2 partitioned to each independent variable (Chevan and Sutherland 1991; Grömping 2006). R code for relaimpo
Chevan, A., and M. Sutherland. 1991. Hierarchical partitioning. The American Statistician 45: 90- 96.
Grömping, U. 2006. Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software, 17(1), 1-27.
"Employing the composite weight matrix method presented in Dow and Eff (2009a:142), we combine the two weight matrices in order to find the linear combination of the two which best explains the transmission process (i.e., results in higher model R2); the resulting weights provide a way of assessing the relative importance of the two cultural transmission channels."
  1. Fuller explanation of the diagnostics, output labels for each, and a new Hausman test for each variable
  2. TWO MORE COEFFICIENTS: IN ADDITION TO OLS coef, a stdcoef in terms of #std deviations, and R^2p

How to add variables (e.g., from a *.csv) to the sccs.Rdata file in the program

Add a variable in .csv format to the sccs.Rdata file
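A minimal sketch of that step; the file names below are stand-ins for sccs.Rdata and your .csv (row order must match the 186 societies):

```r
# Append a .csv variable to a saved .Rdata data frame (toy files for illustration).
sccs <- data.frame(socname = c("A", "B"), v1 = c(1, 2))
rdata <- tempfile(fileext = ".Rdata")
csvf  <- tempfile(fileext = ".csv")
save(sccs, file = rdata)
write.csv(data.frame(vnew = c(10, 20)), csvf, row.names = FALSE)

load(rdata)                             # restores `sccs`
sccs <- cbind(sccs, read.csv(csvf))     # append the new variable(s)
save(sccs, file = rdata)
```
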

Notes for Users

  • In Eff and Dow if there is no error in part 1, you can change part 2 and run independently
Debug with coef(xUR) and coef(xR)
  • In White and White if there is no error in my_data, you can change indep_vars or restrict_vars and run again from there

Getting started

Team Viewer - Free for non-commercial use lets you share your screen with your team, anyone can enter keystrokes. DRW's usernumber is 793 304 281 password is the earliest z

Download source code
  • Click "Download source" (upper right) and download the archived source file to your computer, e.g., to "C:/My Documents/SCCS"
  • Uncompress the file, e.g. "tar -zxvf drwhite-Statistical-Inference-In-SCCS-888d07e.tar.gz" or the zip file (make sure to extract and choose "C:/My Documents/SCCS" on your PC as the directory for the extraction).
  • A new directory should appear in "C:/My Documents/SCCS", e.g., C:\My Documents\SCCS\drwhite-Statistical-Inference-In-SCCS-db3c8fd -- rename it "source" - it will have two subdirectories, sccs and data.

Windows Users: Setting up R

  • Make sure R is in your path. You can verify this by going to "Control Panel -> System -> Advanced -> Environment Variables -> Path -- e.g. add to the end of your path ;C:\Program Files\R\R-2.6.2\bi
  • [Optional] Download RTools from (make sure it's compatible w/ your version of R and Windows)

Build and Install R Package

Note: make sure you have R installed on your machine before proceeding.

From the parent directory (one directory above 'sccs'; in class and at home we use c:\My Documents\SCCS), type the following commands at the command line:

  • Scott White package installation, in Unix (Macbook)


  • [Optional] 'R CMD check sccs' to verify the package is set up correctly
  • 'R CMD build sccs' to build the R package
  • 'R CMD INSTALL sccs_[VERSION].tar.gz' to install the R package

An x86 emulator with DOS: an open source DOS emulator for BeOS, Linux, Mac OS X, OS/2, and Windows; primarily focuses on running DOS games.

DOSEMU Main Page. DOSEMU stands for DOS Emulation, and allows you to run DOS and many DOS programs, including many DPMI applications such as DOOM and Windows 3.x.

Yahoo: PC DOS on a Macbook

  • On a PC (Windows7 etc): Open a DOS window.
cd \
cd My Documents
cd source

Extra Installs if needed

You may need to install Microsoft's HTML Help Workshop using iExplorer (iE). Make sure R is in your path. You can verify this by going to "Control Panel -> System -> Advanced -> Environment Variables -> Path. -- add to your path ;C:\Program Files\HTML Help Workshop
Verify there is now a file called sccs_[VERSION].tar.gz in the parent directory.

Run Examples in SCCS R Code


In R, run the following commands:

setwd("C:/My Documents/SCCS") #class and home-PC use
setwd("[mypath]") #where mypath is the path to the parent directory where sccs was installed
rm(list=ls(all=TRUE))  # removes all the current variables in the workspace
options(echo=TRUE)  # turn on verbose output


Create your own SCCS and auxiliary data set

  • Run the following command to create the dataset:
  • Verify the following datasets were created: my_sccs and my_aux
  • Verify rows are properly aligned in the two datasets by checking that the output of the following command is 186:
sum((my_sccs$socname==my_aux$socname)*1) # Don't worry if this doesn't work

Run multiple imputation

  • Within R, run the following command to run multiple imputation:
  • Verify the dataset "sccs_imputed" was created and saved (in working directory, e.g., "source")

Run two stage ols

  • Within R, run the following command to run two stage ols:
  • Check output of OLSresults.csv #in "source" directory: highlight the numeric fields in excel, right click highlighted area, click "format cells" / "numeric" / 3 decimal places (or higher for pval)


  • GitHub - Secure source code hosting and collaborative development - Online project hosting using Git. Includes source-code browser, in-line editing, wikis, and ticketing. Free for public open-source code.

Agenda: Scott

Fix multiple matrices
Document wiki
 EasyChange dep_var
 EasyChange add/delete indep_var
 Normalization of values option

Normalized re-expression of ordinal variables


Basic ideas

  1. Normal re-expression (probit analysis) simply reassigns the o(i) values of ordinal variables to ones that are normally distributed in the range 1 ≤ x = integer[1 + k·probit(p(o(i)))] ≤ k, where p(o(i)) is the cumulative percentage of cases in category i, k is a multiplier, and x is an integer truncation of the re-expressed value.
  2. This option can transform each ordinal variable in sccs.Rdata to a common scale 1 to k to create normalized variables, save an sccsNorm.Rdata file, and use this file for the linear regression analysis.
  3. A second option can raise each such variable to a power t to minimize least-square fit to the dependent variable. This option is specified in the program and does not require a new dataset.


Probit function: 0 ≤ x ≤ 1 (figure and table)

Let each of i=1,...,k categories of an ordinal variable have frq(i), sample size n, and let p(i)=\frac{frq(i)}{n}. In the standard normal distribution, \mu=0 and \sigma^2=1. The cumulative distribution function and its inverse, the probit function, is

  cdf        = \frac12\Big[1 + \operatorname{erf}\Big( \frac{x-\mu}{\sqrt{2\sigma^2}}\Big)\Big] 
  {probit}(p) = cdf^{-1} = \sqrt{2}\,\operatorname{erf}^{-1}(2p-1) 

To compute in [R] see pnorm (the cdf) and qnorm (its inverse, the probit); base R has no erf, but erf(x) = 2*pnorm(x*sqrt(2)) - 1. I propose THAT

  1. OUR TRANSFORMATION HAVE A LOWER DEFAULT Zmin=1 and an UPPER Zmax=7 that can be reset as parameters.
  2. we replace SCCS.Rdata with SCCSN.Rdata. If this were done for integers, but now between Zmin-Zmax and normally distributed, I can't see why it wouldn't (1) work, (2) give better Rsq, and (3) provide more useful diagnostics, esp. regarding normal distribution of residual errors if the regression model is correctly specified. Doug 12:18, 8 June 2010 (PDT)
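One plausible reading of the proposal as R code (a sketch; the function name and the category-midpoint adjustment that avoids probit(1) = Inf are my assumptions, not the package's):

```r
# Re-express an ordinal variable on a normal scale bounded by Zmin..Zmax.
normal_rescale <- function(v, Zmin = 1, Zmax = 7) {
  p <- cumsum(table(v)) / length(v)         # cumulative proportion per category
  p_mid <- p - diff(c(0, p)) / 2            # category midpoints, keeping probit finite
  z <- qnorm(p_mid)                         # probit (inverse normal cdf)
  z01 <- (z - min(z)) / (max(z) - min(z))   # rescale to 0-1
  zcat <- Zmin + z01 * (Zmax - Zmin)        # then to Zmin..Zmax
  zcat[as.character(v)]                     # re-express each case
}

normal_rescale(rep(1:3, c(10, 80, 10)))[c(1, 11, 95)]  # category values mapped into 1..7
```

Truncating the result to integer, as item 1 suggests, would be a one-line follow-up with floor().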


Compute Z(i) = probit(p(i)) for the cumulative proportions 0 ≤ p(i) ≤ 1.

Note that the Z(i) are breakpoints between intervals. What we want are the centroids of these intervals.

Zinterv(1)=Zmin+\frac{Z(1)-1}{2}. Compute Z-intervals for i=2 to n-1, Zinterv(i)=Z(i+1)-Z(i).

Let the Z-centroids be Zcent(1)=Zmin+\frac{Z(1)-1}{2}.

for i=2,...,n-1 Zcent(i)=(Zmin+\frac{Z(1)-1}{2})+(\sum_{j=2,...i}Zinterv(j)) .

and Zcent(n) = (Z(n-1)+\frac{Z(n)-Z(n-1)}{2}) .

last term in excel=G3+(Zmax-G4)/2

Consider the normal curve as having unit volume, bounded left and right at ±s std.devs, rescaled in the range 1-7. Now for each Z-centroid i=1,...,k, Zcen(i) re-expresses a normalized value s.t. Zcen(i) = f(probit(i)). To compute, first get the probit for all values of the area under the unit normal curve, from .00,.01,.02,...,.99,1.00 (j=0 to 100) for ±3 std.devs, then iterate over the probit values j=.01,...,1.00 and for each i set

Zcen(i)=f(probit(j)) \equiv  f(probit(j)) \le Zcen(i) < f(probit(j+1))

Then for each i the value Z-centroid Zcen(i) has a value between Zmin=1 and Zmax=7.

Create a new .Rdata file. If decimals don't work with Eff and Dow, then truncate Zcen to integer.


This is done for ordinal categories, if k>2, and for all others as well (e.g., percentages). TEST IF THIS TRANSFORMATION CAN BE DONE FOR THE ENTIRE DATABASE AND SAVED AS A NEW DATABASE. Values with low frequency are moved closer to adjacent values in the normalized scale. The normal curve is considered as having unit volume, but the Z scale is bounded. Consider the histogram rectangle for an ordinal variable as having unit volume. Consider the rectangle for the "middle" category j of i=1,...,k categories that contains the median of the distribution and prop(j) of the cases in category j. Adjust that rectangle to have a width and height that minimize the difference in shape when superimposed over the normal curve of unit area. Consider how to do the same for each rectangle j to the right and to the left. The left/right spacing of each rectangle is not yet fixed, however, so consider how to do that as follows.

  1. The spacing adjustments have to recognize that the right boundary on the -tail / Z score / +tail has to correspond to the z score at which the area proportion to the left under the unit-area normal distribution equals the proportion of cases for that variable. Do this for each category j of i=1,...,k until 100% of the unit area is covered. You can solve for the height of each rectangle, and that should satisfy our criterion. You don't need calculus for this, just a function that converts z-scores to areas under the normal curve.
  2. So all you need is the pdf or cdf for the normal curve, as shown at Wikipedia:Normal Curve
  pdf        = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{ -\frac{(x-\mu)^2}{2\sigma^2} } 

The general ideas for re-expression come from Tukey (e.g., 1977: 103), but his methods are all pencil and ruler and don't require a program.

Tukey, John. 1977. Exploratory Data Analysis. Reading, Mass.: Addison-Wesley. This takes each of the variables, after imputation, including the dep_var, and normalizes them. The second reference is from a study that uses his ideas:
Mallows, C. L., & Tukey, J. W. (1982). An overview of techniques of data analysis, emphasizing its exploratory aspects. In J. T. de Oliveira & B. Epstein (Eds.). Some recent advances in statistics (pp. 111-172). London: Academic Press.

A second idea: transforming the variables prior to regression

Tukey's contribution was to re-express (each of our independent) variables so that their plot is linear. This is done by minimizing least squares between val(i) and tval(i)=val(i)^t by varying the exponent t. (The t for each independent variable would be given with the results.)
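A sketch of that exponent search (a simple grid search is my assumption; the text does not specify the minimization method, and the function name is illustrative):

```r
# Pick the exponent t minimizing the least-squares fit of y on x^t.
best_power <- function(x, y, grid = seq(0.2, 3, by = 0.1)) {
  sse <- sapply(grid, function(t) sum(resid(lm(y ~ I(x^t)))^2))
  grid[which.min(sse)]
}

x <- seq(1, 10, length.out = 50)
best_power(x, x^2)    # recovers the grid value nearest the true exponent t = 2
```
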


Ordinal probit and o^t


  2. ordered logit/probit [R] p21 MASS library polr(Sat ~ Infl + Type + Cont, method="probit"), where Sat is an ordered factor vector
  3. for multinomials nnet [R]
  4. for MCMC p21 MNP [R] library mnp() see 4.2.2
p242: "Following the elementary statistics textbooks for finding centiles (percentiles), express the ranks as cumulative proportions, and refer these to a unit normal curve ... to read off fz, or use the PR column of Table 6.4.2, where 5 has been added to zp to yield probits."
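The polr() call in item 2 runs as-is on the housing data shipped with MASS (an ordered-probit fit; Sat is an ordered factor and Freq gives the cell counts):

```r
# Ordered probit with MASS::polr on the built-in housing data.
library(MASS)
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing,
            method = "probit")
round(coef(fit), 3)   # slope estimates for the Infl, Type, and Cont dummies
```
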

Agenda: Doug

NSF Proposal 2010:July see: EduMod-71 - EduMod-52:Imputation and Regression has shortened commands that will run the sample programs

  1. The lang and dist matrices have reversible time, similarities from proximities.

Would it make sense to also have a directed time matrix, i.e., asymmetric? That would differ from the variable "dateobs", which can have a symmetric effect, like distance. DONE: we now have chunks of code that will let students run code on the wiki.

  1. MAKE PARAMETERS TO change nimps --- OK NOW FOR parameters 10,10,3 (iterations, ..., pval)
  2. CAN NOW change the dep_var
  3. CAN NOW change the indep_vars
  4. CAN NOW change the restricted model--xUR
  5. CAN NOW change the restricted model--xR
  6. CAN NOW have all variables in the model diagnostics--dropt --> wald
  7. Comments on imputed vars relative to dep_var
  8. CAUTIONS: Cannot run from restricted down
  9. CAN RUN FROM dep_var down

Fall 2010 course