# SCCS R package

causal graph

## Contents

### Plots for subsets

http://www.gardenersown.co.uk/Education/Lectures/R/correl.htm#correlation
Generalization G11.11 p 420A	+cond	Pop
x <- cbind(LRB$pop,LRB$subdiv,LRB$packinx) #subsistence diversity x.df <- data.frame(x) x.sub <- subset(x.df, X3 = 2) plot(x.sub$X1,x.sub$X2, cex=.5) x.sub <- subset(x.df, X3 = 1) plot(x.sub$X1,x.sub$X2, cex=.5) OR: z <- which(LRB$packing<2)# & LRB$subsp<2) plot(log(LRB$pop[z]),LRB$subdiv[z], cex=0.5) cor.test(LRB$pop[z],LRB$subdiv[z], cex=0.5)  Generalization G12.02 p 434 conditional: exclusively terrestrial resources conditional: population density has not reached reduced mobility options to movement within a small foraging area  (not reached packing 1) packing < 1 subdiv increases linearly with log10 of pop #subsistence diversity #population density thus: x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$) #subsistence diversity x.df <- data.frame(x) x.sub <- subset(x.df, X3 = 2) plot(x.sub$X1,x.sub$X2, cex=.5)  this works x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$subsp) #subsistence diversity #subsp 1=terrestrial animals 2= terrestrial plants 3=aquatic x.df <- data.frame(x) x.sub <- subset(x.df, X3 < 2) #, X4 < 2) plot(x.sub$X1,x.sub$X2, cex=.5)  this doesnt but "should" how do I get an x.sub dataset with two conditions x <- cbind(LRB$pop,LRB$subdiv,LRB$packing,LRB$subsp) #subsistence diversity #subsp 1=terrestrial animals 2= terrestrial plants 3=aquatic x.df <- data.frame(x) x.sub <- subset(x.df, X3 < 2 & X4 < 2) plot(x.sub$X1,x.sub$X2, cex=.5) For two conditions, use x.sub <- subset(x.df, X3 < 2 & X4 < 2) "&" ==and "|"==or  EFF: Either of these will work. The first uses different symbols for the two types; the second plots only one of the two types.  plot(LRB$pop,LRB$subdiv, cex=LRB$packing/2,pch=LRB$packing+20,col=LRB$packing)

z<-which(LRB$packing<2) plot(LRB$pop[z],LRB$subdiv[z], cex=0.5)  z<-which(LRB$packing<2 & LRB$subsp<2) plot(log(LRB$pop[z]),LRB$subdiv[z], cex=0.5)  ### Maps in R ### Where we go with SEM • John Fox • Re: Non-intertible matrices • Dear Doug, • Do you have at least as many instrumental variables as coefficients to estimate? • John data() in its most common usage just tells R to load a stored data object from an attached package. if you do help("data") you will see there is another application if package=NULL, so that it will look in a folder data of the current working directory. that might be simplest for you. do not confuse it with the parameter name data= in functions -- that generally just allows you to specify a data frame that gives bindings for variables. the solution for you is to have in your working .GlobalEnv a data frame of any name, say "foo", and then you can use data=foo as one of the parameter settings for your sem call This worked ### neff almost working DW: I see the problem here, its the p742 (click the small blue "pdf" below) upper right formula Y = ... e*. That's probably lm(Y~V^.5), where e* is Y=mu + (V^.5)e* which is slightly out of my pay grade because e* is multiplicative. There's another step where sigma^2V^-1 is his covariance V not just the matrix V=W that I was using, which ought to bring the n* result into the proper range 1 < n* < n. Could one of you make those equations for me, starting with any old Y and W? With that addition I think the little R program will work. And n* is different for each variable so it can be listed in the output of our 2SLS scripts, so if this works feel free to use it. Working Calculation for n*, neff, effective sample size with W matrix p742 Suppose the n x n matrix V contains the covariation structure among n georeferenced observations (more precisely, matrix $\sigma^2 V^{-1/2}$ is the covariance matrix), such that $Y = \mu + e = \mu + V^{-1/2}e*$, where Y denotes … attribute values, and mu denotes the population mean of variable Y…, and e and e* … denote the vectors of spatially autocorrelated and unautocorrelated errors…. If V = I, the n x n identity matrix, then the n observations are uncorrelated. #A. Eff helped out here ww<-as.matrix(read.dta("http://dl.dropbox.com/u/37813961/dist25wm.dta")[,c(-1,-2,-1,-189)]) #source(url('http://intersci.ss.uci.edu/wiki/R/source/Wlink.R')) #ww<-Wlink dim(ww) round(ww[1:3,1:3],3) round(rowSums(ww)[1:3]) ww<-ww/rowSums(ww) #making sure that ww is row normalized round(ww[1:3,1:3],3) round(rowSums(ww)[1:18]) n<-NROW(ww) ww=cov(ww) # THIS IS V <--- NEW (THE FIX) diag(ww)=0 round(ww[1:7,1:7],9) round(ww[185:186,185:186],9) w_inv<-solve(ww) #this is computationally singular unless the diag(ww)=0  round(w_inv[1:3,1:3],3) chk<-ww%*%w_inv # verifying that w_inv is indeed the inverse table(diag(chk));table(colSums(chk)) TRw_inv<-sum(diag(w_inv)) #trace of inverse one<-matrix(1,n,1) #column vector of ones nstar<-TRw_inv/(t(one)%*%w_inv%*%one)*n #formula nstar  n* whether or not ww is row normalized to 1, ww<-ww/rowSums(ww) 141.7 for Wlink 212.9 for Eff-Dow Distance matrix #inverse of V <--- where does Y come in? Y = mu + e = mu + V-1/2e*, where Y denotes … attribute values, and mu denotes the population mean of variable Y…, and e and e* … denote the vectors of spatially autocorrelated and unautocorrelated errors….  ### Construct dummy variables for FxCmtyW, Fratgrpstr FxCmtyWages=((sccs$v2010>=2)*1)*((sccs$v61==6)*1) FxCmtyW=as.numeric(is.na(FxCmtyWages))  Fratgrpstr=sccs$v570
misFratgrpstr=as.numeric(is.na(Fratgrpstr))
table(FxCmtyW)
FxCmtyW
0   1
112  74  so 1 means missing data.


### Which cases?

pastName <- sccs$socnam[which(p == 1)] p <- which(sccs$v1=='No Trade')

which(sccsdata$v1=='No Trade')  [ 1] 7 55 93 170 180 183 186 Try using the function 'which' with your expression inside. which(Islam==2 & sccs$v238==1)
[1] 83  http://eclectic.ss.uci.edu/~drwhite/worldcul/Sccs34.htm Javanese
IslamXian=((sccs$v2002==2)*1)+((sccs$v2002==3)*1)
table(sccs$v238,IslamXian) IslamXian 0 1 1 67 1 Javanese 2 47 0 3 13 0 4 17 23 IslamXian=((sccs$v2002==4)*1)+((sccs$v2002==5)*1) table(sccs$v238,IslamXian)
IslamXian
0  1
1 58 10
2 41  6
3 10  3
4 31  9


### Make a copy of sccs.Rdata named sccs2.Rdata or Strip some variables from comb.Rdata

1. exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]

setwd('/Users/drwhite/Documents/3sls/sccs/')
myvars <- names(comb) %in% c("Missions.f", "HiGod4.f") ##THESE .f are key
sccsdata <- comb[!myvars]
save(sccsdata,file='sccsdata.Rdata')

load('sccsdata.Rdata')

[1857] v1913                      v1914                      v1915                      v1916
[1861] v1917                      v19_22                     v24_25                     v26_27
[1865] v53_54                     complex                    v159_177                   fortitud
[1869] aggressi                   competiv                   selfrely                   achievem
[1873] industry                   responsi                   obedienc                   selfrest
[1877] sexrestr                   id66                       v2001_1806                 v2002_1807
[1881] v713rev                    Islam                      Christian                  Buddhist
[1885] Hindu                      HinduBuddhist              Shaman2                    GamesOfChance
[1889] GamesOfSkill               GamesOfStrategy            NomEurasia                 NomEurasiaPacific
[1893] NomAfroEurasia             NomOldWorld                NomNewWorldAfroPacific     ExternalWar13
[1897] ExternalWar50              ExternalWar50Bands         InterrnalWar46             InterrnalWar46Bands
[1901] ExtWarFactor               IntWarFactor               FratIntGroups60            FratIntGroups60Bands
[1905] FemalePower107             FemalePower107Bands        MaleAggression50           MaleAggression50Bands
[1909] Modernization102           Modernization102Bands      PoliticalIntegration82     PoliticalIntegration82Band
[1913] v1260                      Rain.f                     Terrain.f                  Water.f
[1917] Missions.f                 HiGod4.f

setwd('/Users/drwhite/Documents/3sls/sccs/')
myvars <- names(comb) %in% c("HiGod4.f") ##THESE .f are key
sccs2 <- comb[!myvars]
save(sccs2,file='sccs2.Rdata')

load('sccs2.Rdata')


setwd('/Users/drwhite/Documents/3sls/sccs/')
myvars <- names(comb) %in% c("Missions.f") ##THESE .f are key
sccs1 <- comb[!myvars]
save(sccs1,file='sccs1.Rdata')



### dy/dx, elasticity and derivatives

> D(expression(sqrt(1 - x^2)), 'x')

-(0.5 * (2 * x * (1 - x^2)^-0.5))

Doug: For example, if one estimates:
y=a0+a1*x+a2*x^2+a3*ln(x)
then
dy/dx=a1+2*a2*x+a3/x
In other words, this is just like any other calculation of a first derivative. No need to do any special estimation.

xxxx


Doug: OK, maybe I didn’t give such a great example. So imagine that one estimates y=f(x,z,v):

y=a0+a1*x+a2*x^2+a3*x*z +a4*ln(z)+a5*v

then

dy/dx=a1+2*a2*x+a3*z

dy/dz=a3*x+a4/z

dy/dv=a5

These are all partial derivatives, which give the ceteris paribus effect of a change in one variable at a time.

Note that only the last (dy/dv) is a scalar that one can introduce into a table next to a regression coefficient. And, since it is identical to the regression coefficient, it would be redundant to introduce it into the table.

The other two (dy/dx and dy/dz) are not scalars that can be introduced as a single number in a table--they are vectors, with as many values as there are observations (they are functions of x and z). That's why the best way to understand these is to plot them (against x and z, respectively)

In the code for the Farming and Fighting paper, the calculation of the marginal effects is done when we create the figures. For example, the calculation of the marginal effect of SCCS$v814 is done in this line: dyd814<-axp*g[,"pdens"]+ax1+2*ax2*g[,"v814"] #now apply that to fitting a smooth lowess function to f(y,x) plot(cars, main = "lowess(cars)") lines(lowess(cars), col = 2) #lines(lowess(cars, f=.2), col = 3) legend(5, 120, c(paste("f = ", c("2/3", ".2"))), lty = 1, col = 2:3)  lines(lowess(g[,"pdens"],jj*g[,"pdens"]),col="red") lines(lowess(sccsA[zdv,"v814"],y814),lty=2,col="red")  setwd('/Users/drwhite/Documents/3sls/sccs/') load('comb.Rdata') sccs<-data.frame(sapply(comb,function(x) as.numeric(x))) plot(sccs$v814,sccs$v815) #YES vars=data.frame(sccs$v814,sccs$v815) #missing data #vars=data.frame(sccs$v5,sccs$v3) #no missing data #plot(vars) #YES varNoNA <- na.omit(vars) sum(table(varNoNA)) plot(varNoNA) #YES lines(lowess(varNoNA),lty=2,col="red") ##YES, PLOTS OK  xy= lines(lowess(vars),lty=2,col="red") ##NOT WORK  • Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74, 829–836. • Cleveland, W. S. (1981) LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician, 35, 54. ### Reff and Elasticity Hi Doug -- I think I now understand that you are interested in measuring effects in percentage change terms. In economics, this is the concept of "elasticity": the percent change in one variable resulting from a given percent change in another. Here are some wikipedia articles on this notion: As the latter item notes, the concept goes back at least to Alfred Marshall 1890, so you are in august company. Simply put, elasticity = (dy/y) / (dx/x). Whenever dy / dx has a causal interpretation, so does the elasticity. Whether you use dy / dx or elasticity is purely a matter of convenience or interpretability in a given context. In case it may be of interest, I attach a recent update of my paper with Karim where we define a variety of effects without recourse to the "do" operator. I hope this helps. Best, Hal marginal effect marginal effects for binary or ordinal responses, computed as partial derivatives, are available from SAS/ETS PROC QLIM by specifying the MARGINAL option in the OUTPUT statement. While no procedure in SAS/STAT software directly computes marginal effects, the partial derivative can be computed using results from the procedure used to fit the model. ### can we run GLM inside 2SLS-IS? can we run the wikipedia:Generalized_linear_model inside 2SLS-IS? see Sccs 2SLS-IS downloads Changed (lm in two_stage_ols.R to compute_stats <- function(glm.unrestricted,lm.restricted,errors,prox_list) { and got this error  8 3 data_to_impute...vars_missing_data.i.. 8 4 data_to_impute...vars_missing_data.i.. 8 5 data_to_impute...vars_missing_data.i.. 8 6 data_to_impute...vars_missing_data.i.. 8 7 data_to_impute...vars_missing_data.i.. 8 8 data_to_impute...vars_missing_data.i..  Error in linearHypothesis(lm.unrestricted, drop_vars, test = "Chisq") :  object 'lm.unrestricted' not found  In addition: Warning messages: 1: In min(y[test_idxs], na.rm = TRUE) :  no non-missing arguments to min; returning Inf  2: In max(y[test_idxs], na.rm = TRUE) :  no non-missing arguments to max; returning -Inf  Changed (lm in two_stage_ols.R to compute_stats TO function(glm.unrestricted,lm.restricted,errors,prox_list) Changed (lm to glm here: linearHypothesis(glm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2] and got a RESULT - can this be right? YES, because I then restarted R, and did not include the earlier two_stage_old.R

              coef range effect  ratio Fstat      ddf     pvalue   VIF abs.ratio %coded tot.cor part.cor
(Intercept) -9.512    NA     NA     NA 0.727 2849.468 0.39399615    NA        NA  0.000   0.000    0.000
language     1.346     2  2.691  0.096 7.484 2200.414 0.00627649 1.306     0.096  0.000   0.207   -0.023
cultints     0.805     5  4.027  0.144 5.691  755.965 0.01730165 1.868     0.144  1.000   0.162    0.091
roots       -1.995     1 -1.995 -0.071 2.770  233.726 0.09741558 1.196     0.071  1.000  -0.081   -0.097
fish         0.558     9  5.019  0.179 4.864 2579.736 0.02750355 1.205     0.179  1.000   0.051    0.078
exogamy     -0.860     4 -3.438 -0.123 5.160  592.550 0.02347549 1.115     0.123  0.995  -0.111   -0.167
settype     -0.450     7 -3.152 -0.113 3.858  858.149 0.04983512 1.678     0.113  1.000   0.005   -0.069
femsubs      0.458     8  3.667  0.131 2.189 1030.255 0.13933123 1.224     0.131  0.995   0.094    0.040
Train  R2:final model  Train R2:IV_language
0.09759433            0.97002048
Fstat         df pvalue
RESET          0.001     73.273  0.974
Wald.on.restrs 1.138    228.137  0.287
NCV            0.406     58.725  0.527
SW.normal      0.763   2455.566  0.382
lag..language  1.568 316525.108  0.211

               coef range effect  ratio Fstat      ddf     pvalue   VIF abs.ratio %coded tot.cor part.cor
(Intercept) -13.190    NA     NA     NA 0.940  130.471 0.33415460    NA        NA  0.000   0.000    0.000
language      1.498     2  2.996  0.107 6.245  121.638 0.01378760 1.309     0.107  0.000   0.209   -0.023
cultints      0.762     5  3.810  0.136 5.156  854.501 0.02341128 1.860     0.136  1.000   0.154    0.084
roots        -1.969     1 -1.969 -0.070 2.816  416.877 0.09406955 1.196     0.070  1.000  -0.078   -0.095
fish          0.517     9  4.654  0.166 4.143 1609.501 0.04197396 1.205     0.166  1.000   0.041    0.070
exogamy      -0.815     4 -3.259 -0.116 4.507  318.360 0.03453283 1.112     0.116  0.995  -0.102   -0.159
settype      -0.441     7 -3.090 -0.110 3.495  289.899 0.06257948 1.681     0.110  1.000   0.002   -0.072
femsubs       0.482     8  3.855  0.138 2.392  964.099 0.12227872 1.233     0.138  0.995   0.108    0.049
Train  R2:final model  Train R2:IV_language
0.08700948            0.96158666
Fstat        df pvalue
RESET          -0.023   143.632  1.000
Wald.on.restrs  1.278   185.277  0.260
NCV             0.640    81.212  0.426
SW.normal       0.545  3139.683  0.461
lag..language   1.613 84002.356  0.204


The generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

### Dichotomize NA versus Others to make dummy coded/missing cases for networks of variables

dich=is.na(sccs$v287) di287=as.numeric(is.na(sccs$v287))
table(dich,di287)

di=is.na(sccs$v287) di287=as.numeric(is.na(sccs$v287))
table(di,di287)
http://escholarship.org/uc/item/7cm1f10b retrieve Anthon's file

SEE RESULTS AT Draft#HiGod4 for FxCmtyWages, Full imputation test and gives better results, "more positive cases"
Meaning of Positive correlation of imputed with Coded variables


### Bug Reports

here I attach a detailed bug description, and a R source to replicate it.

Giorgio Ubuntu bug reports W %*% X (following data imputation) produces an error in an R version for Ubuntu linux
But it was one compiled by Giorgio so the solution may not be general
new Ubuntu_run_model.R
change in two_stage_ols.R and  two_stage_ols_full.R
imputed_data <- cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2]))
to
imputed_data <- cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2], stringsAsFactors=FALSE))


### Anthon Eff codebook NA

The dropbox codebook at http://dl.dropbox.com/u/9256203/ccc.txt correctly indicates a number of variables in sccsfac.Rdata that lack NA

• BUT SOMETIMES THE MISSING DATA ARE NOT CORRECTLY CODED as NA

http://dl.dropbox.com/u/9256203/sccsfac.Rdata #Eff dropbox load('sccsfac.Rdata') sccs<-data.frame(sapply(sccsfac,function(x) as.numeric(x)))

I searched in the codebook for "don't know" etc.

sccs$v271:5 sccs$v273:5
sccs$v917:4 sccs$v917:7
sccs$v1648 sccs$v1649
sccs$v1650 sccs$v1651
sccs$v1652 sccs$v1653
sccs$v1654 sccs$v1659
sccs$v1676 sccs$v1677
sccs$v1678 sccs$v1683
sccs$v1684 sccs$v1685
sccs$v1694  Variable to delete sccs$externalwar13


Spelling changes

sccs$v763 Ude -> Use sccs$v908 evan -> even
sccs$v918 nt -> not sccs$v1249:6 prp -> pro


These are sample N=coded for numerically summarized variables

> sum(table(sccs$v871)) [1] 147 > sum(table(sccs$v872))
[1] 145
> sum(table(sccs$v1753)) [1] 96 > sum(table(sccs$v1848))
[1] 135
> sum(table(sccs$v1849)) [1] 135 > sum(table(sccs$v1861))
[1] 186
> sum(table(sccs$v1862)) [1] 93  Other Ns to do: sccs$v1865 to 1873
sccs$v1873 to 1880 sccs$v1911
sccs$v1913 to 1917 sccs$v1975
sccs$v1977 to 1978 sccs$v2000


NaNs not NA

sccs$v1805.1 sccs$v1805.2
sccs$v1805.3 sccs$v1805.4
sccs$v1805.5 sccs$v1805.6
sccs$v1805.7 sccs$v1805.8


### How to do 1st PCA for variables before my_sccs

#Compute the PCAP variable for Brown and Eff (courtesy of Eff)
is.na(sccs$v237[c(115,172)])<-TRUE is.na(sccs$v63[88])<-TRUE
twoVARS<-sccs[,c("v63","v237")]
z<-which(!is.na(rowSums(twoVARS)))
pcdat2<-princomp(twoVARS[z,])
pc1<-pcdat2$scores[,1] sccs$pc1[z]<-pc1
length(sccs$pc1) #186 #pc1[172]=NA #pc1[115]=NA #pc1[88]=NA  $FORGET THE REST BELOW
#Compute the PCAP variable for Brown and Eff (courtesy of Eff)
is.na(sccs$v237[c(115,172)])<-TRUE is.na(sccs$v63[88])<-TRUE
twoVARS<-sccs[,c("v63","v237")]
z<-which(!is.na(rowSums(twoVARS)))
pcdat1<-princomp(twoVARS[z,])
pc1<-pcdat1$scores[,1] #---first principal component length(pc1) pc1[172]=NA pc1[115]=NA pc1[88]=NA  #FORGET THE REST BELOW http://stat.ethz.ch/R-manual/R-devel/library/stats/html/princomp.html #--loop through cases for missing data and do 1st PCA -- twoVARS=cbind(sccs$v921,sccs$v928) #--loop through cases for missing data -- for (i in 1:nimp){ #z<-which(my_sccs$.imp==i)
z<-which(twoVARS==NA) #???? newvar[which(newvar==88)] = NA
z<-which(my_sccs$v921==NA) #???? newvar[which(newvar==88)] = NA #princomp(twoVARS) pcdat=princomp(twoVARS) #pcdat$scores
PCcap1=round(pcdat$scores[1:186],4) sccs$v1918=PCcap1

see source("examples/create/create_EduR_1/create_EduR_1.5DistB_EffHiddenVariablesV238AP1_2.R")

sccs$v237[172]=NA sccs$v237[115]=NA
sccs$v63[88]=NA v237= sccs$v237
z<-which(v237==NA)
v63= sccs$v63 z<-which(v63==NA) twoVARS=cbind(v63,v237) z=which(twoVars=NA) #princomp(twoVARS) pcdat2=princomp(z,twoVARS)  #newvar[which(newvar==0)] = NA #newvar[which(newvar==88)] = NA #newvar<-sccs$v237
#sccs$v237=newvar sccs$v237[172]=NA
sccs$v237[115]=NA v237= sccs$v237
z<-which(v237==NA)
v237= sccs$v237 sccs$v63[88]=NA
z<-which(twoVARS==NA)
twoVARS=cbind(sccs$v63,sccs$v237)
z<-which(twoVARS==NA)
pcdat2=princomp(z,twoVARS)
PCsize1=round(pcdat2$scores[1:186],4) sccs$v1919=PCsize1


### Compute R2 between significance and relative effect ratios

 x=c(176,237,96,208,102)
y=c(43,34,102,8,41)
table(x,y)
y


x 8 34 41 43 102

 96  0  0  0  0   1
102 0  0  1  0   0
176 0  0  0  1   0
208 1  0  0  0   0
237 0  1  0  0   0
cor(x,y)
1] -0.6998707
cor(x,y)^2
1] 0.4898189
xL=c(237,293,364,187,252,212,143)
yL=c(91,15,21,2,45,39,65)
cor(xL,yL)^2
1] 0.09697337
xM=c(156,5,216,72,5,99,36,279,186,106,174,85)
yM=c(303,975,261,244,975,384,664,92,250,80,53,97)
cor(xM,yM)^2
1] 0.5300817
.9^3
1] 0.729
.95^3
1] 0.857375
.95^2
1] 0.9025


### Drop box

#### Dropbox

https://www.dropbox.com/help/140
• The Public Folder lets you easily share single files in your Dropbox. Any file you put in this folder gets its own Internet link so that you can share it with others-- even non-Dropbox users! These links work even if your computer’s turned off.
Step 1: Drop a file into the Public folder.
Step 2: Right-click/control-click this file, then choose Dropbox > Copy Public Link. This copies the Internet link to your file so that you can paste it somewhere else.
That's it! You can now share this file with others: just paste the link into e-mails, instant message conversations, blogs, etc.!

I searched in Eff’s R codebook for "don't know" and other indicators of missing data without NA.

sccs$v271:5 sccs$v273:5
sccs$v917:4 sccs$v917:7
sccs$v1648 sccs$v1649
sccs$v1650 sccs$v1651
sccs$v1652 sccs$v1653
sccs$v1654 sccs$v1659
sccs$v1676 sccs$v1677
sccs$v1678 sccs$v1683
sccs$v1684 sccs$v1685
sccs$v1694  Variable to delete sccs$externalwar13 (too few variables)


Spelling changes in codebook needed for these words

sccs$v763 Ude -> Use (ok in main codebook) sccs$v908 evan -> even (ditto)
sccs$v918 nt -> not (ditto) sccs$v1249:6 prp -> pro (ditto)


NaNs not NA

sccs$v1805.1 sccs$v1805.2
sccs$v1805.3 sccs$v1805.4
sccs$v1805.5 sccs$v1805.6
sccs$v1805.7 sccs$v1805.8


#### Source

Install or start R: go to R/Package & Data/Package installer, search for "igraph", install,
then Cut and paste into R as a command
setwd('/Users/drwhite/Documents/3sls/sccs/')                           # library(foreign)
sccs<-data.frame(sapply(comb,function(x) as.numeric(x)))          #data(sccs)


#source("http://intersci.ss.uci.edu/wiki/R/source/model/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
source("http://intersci.ss.uci.edu/wiki/R/source/model/
source("http://intersci.ss.uci.edu/wiki/R/source/model/
source("http://intersci.ss.uci.edu/wiki/R/source/model/create_EduR_1.5DistBrownEffSign.R")
source("http://intersci.ss.uci.edu/wiki/R/source/model/create_EduR_1.5DistB_EffHiddenVariables.R")
source("http://intersci.ss.uci.edu/wiki/R/source/two_stage_ols.R")

source("http://intersci.ss.uci.edu/wiki/R/source/run_model_100.R")
source("http://intersci.ss.uci.edu/wiki/R/source/averageAll.R")   #"output" is the name of the active output file

On Mac:
source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
#source("dropbox create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R")
source("R_3_s_ols/two_stage_ols.R")
source("examples/create/R_run/run_model_100.R")
source("examples/create/R_run/averageAll.R")   #"output" is the name of the active output file
source("examples/create/R_run/averageAll.R")


## Crucial change for SCOTT to make: adjust relimp to R2 for 2SLS Oct 2011

Brown and Eff 2010 code - Mac See: http://intersci.ss.uci.edu/wiki/index.php/Brown_and_Eff_2010_code_-_Mac
#--adjust relimp to R2 for 2SLS--
xz<-bbb$relimp[2:NROW(bbb)] wt<-r2[1]/sum(xz) bbb$relimp[2:NROW(bbb)]<-xz*wt


## OLS reminder

lmout=lm(sccs$v238 ~ output$no_rain_dry +output$superjh +output$milk +output$AnimXbwealth +output$FxCmntyWage +output$logdate) #) #output$fratgrpstr +output$pigs +output$eeextwar +output$Missions +output$socplex + lmout summary(lmout) vif(lmout)

## Correction for missing data in v1650 and v1685

comp.Rdata and comb2.Rdata updated - always check table(comp$v1650) and table(comp$v1685)

#convert factor -->as.numeric
newvar<-sccs$v1650 newvar[which(newvar==0)] = NA newvar[which(newvar==88)] = NA sccs$v1650=newvar
#convert numeric -->as.factor
newvar<-comb$v1650 newvar[which(newvar==0)] = NA newvar[which(newvar==88)] = NA comb$v1650=newvar

#convert factor -->as.numeric
newvar<-sccs$v1685 newvar[which(newvar==0)] = NA newvar[which(newvar==8)] = NA sccs$v1685=newvar
#convert numeric -->as.factor
newvar<-comb$v1685 newvar[which(newvar==0)] = NA newvar[which(newvar==8)] = NA comb$v1685=newvar

save(comb, file='comb.Rdata')

lmout=lm(dep_var~PopdenEffect)
lmout
summary(lmout)
vif(lmout)


## How to normalize and average data in a data frame

PLOTTING
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/scatter.smooth.html
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/plot.html
http://www.astrostatistics.psu.edu/su07/R/html/graphics/html/plot.html



Ideally I want to scale the values for each column in the range of (-1,1)

?scale


Ideally I want to scale the values for each column in the range of (-1,1)

?scale
#A <- matrix(c(2,3,-2,1,2,2),3,2)

popscale153=scale(sccs$v153) popscale156=scale(sccs$156)
popscale1130=scale(sccs$1130) popscale=matrix(c(popscale153,popscale156,popscale1130),186,3)  popscale=matrix(c(scale(sccs$v153),scale(sccs$v156),scale(sccs$v1130)),186,3)
popdenmean=round(1.5*rowMeans(popscale),0)
popdenmean
-2 -1  0  1  2
Absent                                   17 21 13  9  9
Present, inactive, unconcerned           11  9 12 11  8
Present, active, unconcerned with humans  4  4  7  4  4
Present, active, supportive of morality   2  8  8 16  9

newvar=sccs$v1650 newvar[which(newvar==88)] = NA Replace 0 with NA plot(popdenmean,scale(sccs$v1650))
plot(popdenmean,scale(newvar))

popscale=matrix(c(scale(sccs$v153),scale(sccs$v156),scale(sccs$v1130)),186,3) popmean=rowMeans(popscale) max(popmean) min(popmean) mean(popmean) #~zero v1650=sccs$v1650
v1650[which(v1650==0)] = NA
plot(popmean,scale(v1650)) #sccs$v1650 #http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html Scatter Plot Smoothing #lowess(x, y = NULL, f = 2/3, iter = 3, delta = 0.01 * diff(range(xy$x[o])))
x = popmean
y = scale(v1650)
lowess(x, y = NULL, f = 2/3, iter = 3,       delta = 0.01 * diff(range(xy$x[o])))  AgpotScale=matrix(c(scale(sccs$v921),scale(sccs$v928)),186,2) #AgriPot 1&2 Agpotmean=round(rowMeans(AgpotScale),1)  SizeJHscale=matrix(c(scale(sccs$v63),scale(sccs$v237)),186,2) #CmtySize & Superjh SizeJHmean=round(rowMeans(SizeJHscale),1)  ### Replace PCA from Psychometric package twoVARS=cbind(sccs$v921,sccs$v928) #princomp(twoVARS) pcdat=princomp(twoVARS) #pcdat$scores
PCcap1=round(pcdat$scores[1:186],3) sccs$v1918=PCcap1

sccs$v237[172]=5 sccs$v237[115]=3
sccs$v63[88]=3 twoVARS=cbind(sccs$v63,sccs$v237) #average CmtySize with Superjh #princomp(twoVARS) pcdat2=princomp(twoVARS) #pcdat$scores
PCsize1=round(pcdat2$scores[1:186],3) sccs$v1919=PCsize1


## White-Murdock alignment

from Murdock and White 1969
EduMod84#Coords
Wlink=matrix(data = 0, nrow = 186, ncol = 186)
for(i in 1:185) {
for(i in 1:184) {
for(i in 1:183) {
for(i in 1:182) {


## New problems and possibilities e.g., Replace, Dichotomize NA / Others

data[which(data==0)] = NAReplace]
sccs$v2009=sccs$v1009
sccs$v2011=3-(sccs$v1009<=3)*1+(-1+(sccs$v1009==5)*1) sccs$v2011= replace(sccs$v2009,is.na(sccs$v2009),0)
sccs$v1732= replace(sccs$v1732,is.na(sccs$v1732),0) newvar<-apply(sccs[,c("v2011","v1732")],1,max) newvar[which(newvar==0)] = NA sccs$v2010=newvar
Replace 0 with NA
table(sccs$v1732,sccs$v2011)
0  1  2  3  4  5  6  7
0 16  7  2  0  4  5  2  0
1 10  3  1  2  1  2  0  0
2  0  0  0  0  1  1  5  2
3  2  1  0  0  3  2  0  1
table(sccs$v2009,sccs$v2011)
1  2  3
1 13  0  0  #No wage labor
2  3  0  0  #No wage labor or coerced
3  2  0  0  #No wage labor or coerced
4  0 10  0  #Hired/Wage
5  0  0 12  #Migrant wage
6  0  9  0  #Hired/Wage
7  0  3  0  #Hired/Wage
#analysis  v1009 is WorldSys labor
sccs$v2009=sccs$v1009                                                             #Proxy for  worldsys labor 1-7 N=52
sccs$v2732=sccs$v1732                                                             #Proxy for Wage labor 1-3  N=89
sccs$v2011=3-(sccs$v1009<=3)*1+(-1+(sccs$v1009==5)*1) #Proxy for a labor variable 1-7 N=52 NOW AS 1 2 3 values sccs$v2011= replace(sccs$v2011,is.na(sccs$v2011),0)                #Replace NA with 0 keep 1-2-3
sccs$v2732= replace(sccs$v2732,is.na(sccs$v2732),0) #Same for v1732 labor variable newvar<-apply(sccs[,c("v2011","v2732")],1,max) #Take max (i.e., nonzero) newvar[which(newvar==0)] = NA #Replace 0 with NA N=112 WHY 1-7??? sccs$v2010=newvar                                                                    #N=112
FxCmtyWages  1  2  3  v1732
0 20  8 15 <-coding 0 error here hard to fathom  FxCmtyWages SUM IS N=112 v1732
1 13  0  0
2  1 14  0
3  2  0 16
table(sccs$v1732,sccs$v2010)  #boosts overlap of 112 and 89 from to 89? GOOD PROXY
1  2  3
1 31  2  3
2  0 21  1
3  0  0 31
GOOD TO HERE
#analysis  v1009 is WorldSys labor
sccs$v2009=sccs$v1009                                                             #Proxy for  worldsys labor 1-7 N=52
sccs$v2732=sccs$v1732
sccs$v2011=3-(sccs$v1009<=3)*1+(-1+(sccs$v1009==5)*1) #Proxy for a labor variable 1-7 N=52 NOW AS 1 2 3 values sccs$v2011= replace(sccs$v2011,is.na(sccs$v2011),0)                #Replace NA with 0 keep 1-2-3
sccs$v2732= replace(sccs$v2732,is.na(sccs$v2732),0) #Same for v1732 labor variable newvar<-apply(sccs[,c("v2011","v2732")],1,max) #Take max (i.e., nonzero) newvar[which(newvar==0)] = NA #Replace 0 with NA N=112 WHY 1-7??? sccs$v2010=newvar                                                                    #N=112

sccs$v200==1 #Use 84% no wage labor for region 1 (Africa) #  dich=is.na(sccs$v287)
ndich=as.numeric(dich)
ndich=as.numeric(is.na(sccs$v287)) table(dich,ndich) ndich dich 0 1 FALSE 177 0 TRUE 0 9  • Do this for FxCmtyWages di287=as.numeric(is.na(sccs$v287))
table(dich,di287)

• Use imputed data compared to real data for dep_vars in causal graph?
• Then create dummy variables for missing v. coded 1/0 subsamples for each dep_var in influence graph
• Laura might be able to do phylogentic modeling on HiGod4
• What of cases like Superjh that are not PRIMARY but come out only when other variables are present in th model

## Using the 2sls-is program

### Maps and Plots

Giorgio Gosti drawVar.R modified, SFI Working group in Causal Analysis
Giorgio Gosti drawPath.R by Giorgio Gosti, SFI Working group in Causal Analysis
Giorgio Gosti drawPath.R by Giorgio Gosti, SFI Working group in Causal Analysis
Giorgio Gosti drawPath.R by Giorgio Gosti, SFI Working group in Causal Analysis

#### Brown-Eff Maps source("http://intersci.ss.uci.edu/~drwhite/0/world_countries_24.rda")

Error creating thumbnail: Invalid thumbnail parameters
Moral gods v238

### No run HiGods without Snarey codes

HiGod4 - > sccs$v238 no_rain_dry=sccs$Rain+(sccs$v855>4)*1, wet=1+((sccs$v929>8)*1) +((sccs$v929>3)*1) ##+((sccs$v929>7)*1) #DOES NOT CORRELATE with SNAREY$Rain sccs$v1913 # NoRain=3 - (sccs$v1913>24)*1) + (sccs$v1913>101.6)*1)
[1]  11.155  35.425  86.950  69.790 128.050 165.890 123.740 106.625  85.725 100.525 107.050 146.270 193.155
[14] 180.015 263.695 175.779 240.685 104.220 142.300 370.240 100.385  93.480 115.110  22.820  55.072 132.085
[27]  35.515 140.315  31.295  84.780  73.260 123.120 145.335 120.220  74.420  20.699 151.620  37.755   0.175
[40]   4.055   4.515  56.790   0.140  58.845  12.220  12.702  35.661 127.505  80.654  34.700  99.617  52.625
[53]  39.305  52.881 172.292  59.708  36.794  33.668  69.512 159.650 159.079 153.555 104.037 167.888  57.365
[66]   8.635 118.979  40.900 240.914 253.535  84.120 183.220 185.858 161.188 131.469 152.819 263.390 296.760
[79] 296.760 149.565 271.965 185.433 178.310 178.059 330.675 206.810 265.060 214.060  92.995 129.056  25.870
[92] 448.100 447.380 244.500 246.420 330.500 237.571 212.786 366.063 398.167 310.360 271.260 199.565 139.445
[105] 113.015 273.550 303.625 578.967 363.800 303.976 373.205 328.980 164.560 111.360  49.991 107.475 177.705
[118] 103.579  59.456  29.054  34.602  51.790  83.560  25.465  83.794 105.772  49.742  31.314  40.090 259.290
[131] 131.769 121.357 321.934 103.848  90.974  14.437  29.210  34.558  51.640  33.624  43.635  57.105  83.880
[144]  78.140 113.646 131.885  54.425  39.725  36.797  60.771  21.752  83.996  68.951 266.653  82.590 319.786
[157] 372.220 245.580  46.472 131.722 198.440 120.614 246.085 224.531 303.106 271.825 344.485 202.070  48.700
[170] 148.030  62.220  58.915 142.695 121.593 182.300 107.461 118.015 159.780 163.379 192.237 164.705 121.579
[183] 129.545  56.626  12.966  55.375


over 40" = over 101.6 cm 10" = 25.4 cm

NoRain=3 - ((sccs$v1913>24)*1) + ((sccs$v1913>101.6)*1)
NoRain= 1+((sccs$v1913>24)*1) + ((sccs$v1913>101.6)*1)
wet=1+((sccs$v929>8)*1) +((sccs$v929>3)*1) ##+((sccs$v929>7)*1) #DOES NOT CORRELATE with SNAREY$Rain


### Source files

source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageSuperjh.R") source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2Missions.R") source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R")

# THE DATA in comb.Rdata are not yet released for publication
# IF YOU DONT HAVE comb.Rdata edit out $Missions and$lo_water from your create file and use a new Which(sccs$v....) setwd('/Users/drwhite/Documents/3sls/sccs/') #change to YOUR directory # includes library(foreign) load('comb.Rdata') #alt: just load('sccs.Rdata') #for comb.Rdata you have added new variables such as Missions sccs<-data.frame(sapply(comb,function(x) as.numeric(x))) #data(sccs) dep_var<-sccs$HiGod4                      #sccs$v238 if you start with load('sccs.Rdata') z<-which(sccs$Missions!=2)             #option to modify for a subsample dependent on selection variable
is.na(dep_var[z])<-TRUE                     #option to set chosen selection variable values to missing
dep_var                                              #option to use this modified dep_var in
#sccs$v238=dep_var #optional but woulf change sccs until reloaded # (73 observations will be deleted due to missionization) #Access to cached software source("http://intersci.ss.uci.edu/wiki/R/create_EduR_1.5DistBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R") source("http://intersci.ss.uci.edu/wiki/R/two_stage_ols.R") #source("http://intersci.ss.uci.edu/wiki/R/run_model_79.R") #option for Train-Test inferential states  source("http://intersci.ss.uci.edu/wiki/R/run_model_100.R") #OR: full sample source("http://intersci.ss.uci.edu/wiki/R/averageAll.R") #OR# as configured on your own directories source("examples/create/create_EduR_1/create_EduR_1.5DistLangBrownEff100slim711HiGod4OLSWageRestrict2NoMissions.R") source("R_3_s_ols/two_stage_ols.R") source("examples/create/R_run/run_model_100.R") source("examples/create/R_run/averageAll.R") ##averages  average(ols_stats$imputed_data)               #independent variables average for imputed missing data
z=which(output$Missions==2)#1) input[z,] depvar= input[z,] #N=73  ## Doing OLS on the Imputed data • It appears that the imputed data is already being passed out of the function. All you need to say is ols_stats$imputed_data and

• To get a specific instance you can say
'imputed_data[which(imputed_data$.imp==i),]' where is the ith dataset.  • To extract a column you just say: ols_stats$imputed_data[,j] where j is the jth column or you can say
ols_stats$imputed_data[,"column_name"]  • But its not that name I want out=lm(sccs$v238 ~ ols_stats$imputed_data[,"Missions"]) Error in model.frame.default(formula = sccs$v238 ~ ols_stats$imputed_data[, : variable lengths differ (found for 'ols_stats$imputed_data[, "Missions"]')

• Make sure output is a data frame by saying
'class(output)'
output = as.data.frame(output)    #Once it's a data frame it should [almost] behave just like my_sccs.

lmout=lm(sccs$v238 ~ output$fratgrpstr +output$Missions +output$socplex+output$milk +output$AnimXbwealth +output$pigs) lmout summary(lmout)  • After summary do vif(lmout) vif(lmout)  ## Handling files with factor categories (no longer numeric) comb=cbind(sccsfac,outDat) comb$Missions
comb$v1 table(comb$Missions,comb$v1)  setwd('/Users/drwhite/Documents/3sls/sccs') load('outDat.Rdata') #Snarey factors table(outDat$Missions,outDat$HiGod4) sccsout<-data.frame(sapply(outDat,function(x) as.numeric(x))) #Convert factor to numeric table(sccsout$Missions,sccsout$HiGod4) 1 2 3 4 1 33 26 5 9 2 36 25 18 34 table(sccsnew$Mission,sccsnew$v238) #Missions more high gods 1 2 3 4 1 33 22 3 7 2 35 25 10 33 save(sccsnew, file = "sccsnew.RData")  http://stat.ethz.ch/R-manual/R-patched/library/base/html/factor.html (try ?base) http://www.ats.ucla.edu/stat/R/modules/factor_variables.htm  • The only required argument is a vector of values which can be either string or numeric. • Optional arguments include the levels argument, which determines the categories of the factor variable, and the default is the sorted list of all the distinct values of the data vector. • The labels argument is another optional argument which is a vector of values that will be the labels of the categories in the levels argument. • The exclude argument is also optional; it defines which levels will be classified as NA in any output using the factor variable. • Graphics are another area that benefits from the use of factor variables. As in the tables the factor variable will indicate a better ordering of the graphs as well as add useful labels. comb=cbind(sccsfac,outDat) comb$Mission
comb$v1 table(comb$Mission,comb$v1) New: You can convert every field to numeric this way: sccs<-data.frame(sapply(sccsfac,function(x) as.numeric(x))) But some fields will be nonsense (society name, etc.) result: sccsnew<-data.frame(sapply(comb,function(x) as.numeric(x))) > table(sccs$v1,sccs$v2) 1 2 3 4 5 1 0 0 0 0 0 2 0 0 0 0 0 3 3 0 1 0 0 4 36 5 15 22 2 5 10 5 11 12 0 7 0 0 1 0 1  Yay! With this we can use sccsfac.Rdata to replace sccs.Rdata I attached a file that created the factors like I was creating a new variable. A command for creating the Rain factor would look like such: FactorRain= factor(Snarey186$Rain,levels = c(1,2,3),labels = c(">40in/year", "Moderate", "<10in/year")) You can also factor out Rain by just dropping FactorRain= from the above command, however this does not create values. However, when I create these new values as mentioned in the first command above, they do not show up in the dataset when viewing it in RStudio.

### New data with factors

we have a new version of sccsfac.Rdata that has categories at http://eclectic.ss.uci.edu/~drwhite/sccs. Cross-tabs can be done with category labels.
example
> data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
> fdata = factor(data)
> fdata
[1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Levels: 1 2 3
> rdata = factor(data,labels=c("I","II","III"))
> rdata
[1] I   II  II  III I   II  III III I   II  III III I
Levels: I II III

Here is the code that I used, just so that you have it: Victoria


Moralizing_gods#Codebook_for_Moralizing_gods MUST ADD #1. Raininch #7. Snarey Year

# To make factors for variables: Rain, Terrain, Water, Missions, HiGod4

Rain.f <- factor(Rain, labels = c(">40in/year", "Moderate", "<10in/year"))

Terrain.f <- factor(Terrain, labels = c("Surface Water", "Moderate","Surface H20 Scarce"))

Water.f <- factor(Water, labels = c("Neither Rain nor Surface H20 high","Either Rain nor Surface H20 high"))

Missions.f <- factor(Missions, labels = c("Premissionized", "Missionized"))

HiGod4.f <- factor(HiGod4, labels = c("Absent", "Present, inactive,unconcerned","Present, active, unconcerned with humans","Present, active,supportive of morality"))

#TO OUTPUT THE FILE
outDat<-cbind(Rain.f,Terrain.f,Water.f,Missions.f,data.frame(HiGod4.f))
write.csv(outDat,"SnareyFactors.csv")
save(outDat,file="outDat.Rdata")

-----------------------------

Create factors for AddSnarey,csv conversion to AddSnarey.Rdata (no factors for SCCS	Raininch or YearSnarey
Rain	Terrain	Water	Missions	HiGod4
Rain
1= >40"/year
2= Moderate
3= <10"/year
Terrain
1= Surface water
2= Moderate
3= Surface H2O scarce
Water budget
1= Neither rain nor surface H2O high
2= Either rain or surface H2O high
Missions
1=Premissionized
2=Missionized
HiGod4
1= Absent
2= Present, inactive, unconcerned
3= Present, active, unconcerned with humans
4= Present, active, supportive of morality


Do you think that you can take a look at this and see if what I did was what you wanted?

####sccs<-sapply(sccsfac,function(x) as.numeric(x))
#### Error in sccs$v1 :$ operator is invalid for atomic vectors


### Splice new variables, e.g., Premissionized

setwd('/Users/drwhite/Documents/3sls')
aux_data = outData
sccsmsn<-data.frame(cbind(aux_data,sccs))
#save(newsccs2, file ="sccsmsn.Rdata")
save(sccsmsn, file ="sccsmsn.Rdata")  #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott)
sccsmsn$Missions  NOT WORKING 3. I am trying to splice the AddSnarey.csv to the end of your sccs.Rdata It has those V1-V9 variables and those are the variable labels underneath and the result is weird. I had to take off the variable labels to make a 186-186 congruence between the two data files. But when done I didnt know the way to insert my variable names. So my problem is HOW DO I GET THE VARIABLE LABELS INTO THE NEW MERGED FILE, either the first row or the second....? do ?read.table  V1____V2______V3______V4____V5______V6_______V7___V8 SCCS Raininch Rain Terrain Water Missions HiGod4 Year  setwd('/Users/drwhite/Documents/3sls') ###aux_data = read.table("AddSnarey.csv", header = TRUE, sep = ",", col.names) #Snarey186.csv aux_data = read.csv("AddSnarey.csv", header = TRUE, sep = ",") #Snarey186.csv sccs<-load("SCCS.Rdata",.GlobalEnv) #newsccs2<-data.frame(cbind(aux_data,sccs)) sccsmsn<-data.frame(cbind(aux_data,sccs)) #save(newsccs2, file ="sccsmsn.Rdata") save(sccsmsn, file ="sccsmsn.Rdata") #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott) sccsmsn$Missions

Splice in WldRelig=cbind(sccs$v2001,sccs$v2002) -- Its already in sccs!!!!
Forget Var1918-2000.cvs Grief and Mourning
World Religion v2001 v2002 is in sccs


 z<-which(my_sccs$Missions==1) #nonMissionized my_sccs[z,] zz<-my_sccs[z,] my_sccs<-zz  Select rows in my_sccs as above -Can be done more easily by having a data name and value like Missions==1 in the create file. If this is not null, then instead of taking a 79% or whatever size subsample in TRAIN you use this variable to select the subsample, measure the %cases and use the remaining sample to report on in TEST. This gives forr ways to select samples: total % select by variable (gives imputation) m_sccs (no imputation)  ## data(sccs) data(sccs) only lives in the source files that was under the 'examples" sub-directory. There is no reference to sccs in the actual two stage ols code. So the refactoring I did earlier this year was already supposed to do this. So the examples code can be thought of as 'client code' and the two stage ols as 'library code.' How clients wish to refer to their data is up to them. My understanding is that you started making copies of the create model value children file which used data(sccs) and then you found it hard to modify all of them not to depend on data(sccs). But this does not mean other people need to do it that way. Does that help? ## To extract a column you just say ols_stats$imputed_data[,j] #where j is the jth column or you can say
ols_stats$imputed_data[,"column_name"] > ols_stats$imputed_data[,"anim"]
[1369]  4  4  2  2  3  3  3  2  4  2  1  2  1  1  3  2  3  4  3  1  3  1 2  1  1  2  1  3  2  2  2  2  3  1  2  1  2  1
[1407]  2  2  2  1  1  2  2  2  2  2  3  3  2  1  1  1  6  1  1  1  1  1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[1445]  1  1  1  1  1  1  1  1  1  2  2  1  2  2  2  1  8  3  1  1  1  1 1  1  1  2  1  1  3  4  1  1  1  1  1  1  1  1
[1483]  1  1  2  3  1  1


1488/8=186 cases with imp=8, but N=168 for the nonmissing cases on dep_var which are those to impute if data were missing.

 source("examples/create/create_EduR_1/create_EduR_1.5DistB_Eff.DryMoneyLE4LowPopden&ANIMxBWEALh2oMSNS_.R")
source("R_3_s_ols/two_stage_ols.R")
source("examples/create/R_run/run_model_100.R")


## Doug TODO

• Scott: Install package library(sccs) to NewMac
• Merge Snarey.cvs with sccs.Rdata, requires variable labels and add to a newsccs.Rdata.
• Run on Missions only, using z<-which(sccs$vMissions==1) # No missions THESE LINES DO FIND my_sccs ###z<-which(sccs$Missions==1)  #No
z<-which(my_sccs$Missions==1) #nonMissionized my_sccs[z,] zz<-my_sccs[z,] my_sccs<-zz  • Run without autocorrelation (for paper) - BUT WHAT ABOUT SIGNIFICANCE? Effective Sample size? Bonferroni corrected significance tests $P_i \le \alpha / (1 + k - i)$? Use effective sample size for pairs of variables using covariance -- Project for Victoria Pevarnic? --needs covariance in the equation http://www.r-tutor.com/elementary-statistics/numerical-measures/covariance 0) rename file to have extension 'tgz' (for security reasons Google won't let me attach a file with extension ".tgz") 1) Decompress sccs.tgz 2) cd Statistical-Inference-With-SCCS  Documents/sccs-1.tgz 3 ) R 4) load('examples/data/sccs.Rdata') 5) load('examples/data/vaux.Rdata') 6) source('examples/src/create_model_value_children.R') 7) source('sccs/R/two_stage_ols.R') 8) source('examples/src/run_model.R')  ## Scott TO CONSIDER the new program could be generalized to include subsample selection just after my_sccs if you set, just after my_sccs (and thus whatever row selection follows), not only  data=my_sccs, but aux_data=my_aux, prox_list=list(distance=Wdd), #distancelang=distlang), #,language=Wll,distance=Wdd,language=Wll  To do this there would be a default zz = all rows, with an alternative set by rows numbers selected just after my_sccs z<-which(my_sccs$Missions==1)
my_sccs[z,]
zz<-my_sccs[z,]
my_sccs<-zz


The zz rows would be used to similarly reset the rows in my_aux and the rows and columns in the prox_list w matrices

A 2nd ALTERNATIVE is that I could write a little program that could take whatever *.Rdata is used, along with the W matrices and Vaux.Rdata and write more generic matrices for these sets of inputs

z<-which(my_sccs$Missions==1) my_sccs[z,] zz<-my_sccs[z,] my_sccs<-zz  That would require keeping different subdirectories of the three reset datasets in different subdirectories. Seems to me the first alternative is simpler and more generic. ## Scott Possible TODO 1. Scott: After 4) load('examples/data/sccs.Rdata')  can sccs be used to make function data(sccs)??? If so then we will have a version for installation without R CMD... in PC and a package installation version. Victoria and I, however, need installations, she on her PC and me on my NewMac. Hi Dr. White, To fix the problem, you have to do the following: • 1. Open the following document in a text reader: two_stage_ols which is found in Statistical-Inference-With-SCCS->sccs->R • 2. Copy the text into R into a new script frame • 3. Save the new script with a .R extension named slightly differently • 4. Edit the following line in the script:  #--the imputand is placed as a field in imputed_data and named-- imputed_data <-cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)])) Change it to the following: #--need to add +2 after NCOL(combine_input) and stringsAsFactors=FALSE)) imputed_data <-cbind(imputed_data,data.frame(imputation[,NCOL(combined_input)+2],stringsAsFactors=FALSE))  • 5. Save the new script and close R • 6. Re run the programs with a change in the last line of the code as follows: This is the original code: load('examples/data/sccs.Rdata') load('examples/data/vaux.Rdata') source('examples/src/create_model_value_children.R') source('sccs/R/two_stage_ols_vp.R') #Change this line to include the name of the new script file that you created above source('examples/src/run_model.R')  The program should work now. I was able to write this up based on some corrections that Giorgio had previously made but never written up. 2&3 have high priority for the article I am writing ### Select subsample 2. z<-which(sccs$V7==2) is a statement that, if z is the data selected because the depvar is not-missing, can further restrict the data to be used to those cases with certain values (V7=2) on a control variable. We are able to make this selection in create.model and pass this into two_stage_ols then run_model.

NOW run_model has 73 rows for incoming my_sccs but 186 for a version called otherwise.

### Splice new variables, e.g., Premissionized

3. I am trying to splice the AddSnarey.csv to the end of your sccs.Rdata It has those V1-V9 variables and those are the variable labels underneath and the result is weird. I had to take off the variable labels to make a 186-186 congruence between the two data files. But when done I didnt know the way to insert my variable names. So my problem is HOW DO I GET THE VARIABLE LABELS INTO THE NEW MERGED FILE, either the first row or the second....?

  V1____V2______V3______V4____V5______V6_______V7___V8
SCCS Raininch Rain Terrain Water Missions HiGod4 Year

setwd('/Users/drwhite/Documents/3sls')
#newsccs2<-data.frame(cbind(aux_data,sccs))
sccsmsn<-data.frame(cbind(aux_data,sccs))
#save(newsccs2, file ="sccsmsn.Rdata")
save(sccsmsn, file ="sccsmsn.Rdata")  #Ideally THESE NAMES SHOULD BE THE SAME (msg from Scott)
sccsmsn$Missions  Splice in WldRelig=cbind(sccs$v2001,sccs$v2002) -- Its already in sccs!!!! Forget Var1918-2000.cvs Grief and Mourning World Religion v2001 v2002 is in sccs  table(sccsmsn$Missions,sccsmsn$v2001)  ?read.table  newsccs2  4. This item has very high priority for SFI and the article and it would be good to implement. I thought of a solution for the 100% option. Just allow a choice of 100%. But when the program comes to the test sample you just reset the 0 to 30%. The program will run as usual. No major reprogramming needed. What I really need is to that that option soon. Victoria can take care of the rest I think. It's a suggestion that might entail minimum program change. If you see a problem lets discuss. 5. Output the full inputed dataset. ## Splitting the sample various ways, i.e., USING CONTROL VARIABLES Great!! And what if I want to split based on the value of a dichotomized variable z<-which(((SCCS$v238==4)*1)==TRUE)
for example?  (Doug)


Even easier: z<-which(SCCS$v238==4) If you want to run just on the odd-numbered (as so many coders did), then you can do the following: #z<-which(odd(SCCS$"sccs#")==FALSE)  #gives even numbered
z<-which(odd(SCCS$"sccs#")==TRUE) #gives odd numbered oddSCCS<-SCCS[z,]  Now oddSCCS will be a data frame with only the odd-numbered societies. If you want to randomly split it: z<-which(rank(rnorm(186))<=93) randSCCS<-SCCS[z,]  And randSCCS will contain 93 societies. ### Investigate adjust ANSWER: "Controlling for some set of variables" would just mean including those as independent variables in the regression (so the effect of the variable of interest is not obscured by omitted variable bias). This is always done in a good regression model. (DRW: See H. White on core variables vs. controls) "Fixing certain variables at their mean values", in their paper, simply means that when they calculated the values for their charts, they used the mean values of all the variables EXCEPT the variable they wanted to show the effect of--for that variable they used the actual values. This is an issue of making charts, not an estimation issue. Another option for "adjust" is to have an auxiliary program that takes a set of independent variables and constructs a subsNsccs.Rdata, either selecting a sample with certain values, or setting certain variables to their means or other values. -- NOT NEEDED ---------------------------- I put this question to Anthon. But "adjust" is a standard option in Stata and other regression programs http://www.stata.com/support/faqs/stat/adjust.html If you have a look at that page what I want to discuss with you briefly is how to implement controls and adjustments in our program. Having the type of selection where z<-which(SCCS$v238==4)
is a start and will do for now but we should think ahead
---------------------------- Original Message ----------------------------
Subject: [Fwd: Re: splitting the SCCS]
From:    drwhite@uci.edu
Date:    Sun, July 24, 2011 6:59 am
To:      "E. Anthon Eff" <eaeff@mtsu.edu>
--------------------------------------------------------------------------



Thanks : In addition to making it possible to model for a selected control category (like modeling polygyny for the large subsample of Missionized societies) I am trying to get me head around this:

what is meant when a regression is reported (as in Henrich, et al. 2010. Markets, Religion, Community Size, and the Evolution of Fairness and Punishment Science 19 March 2010: 1480-1484.) as fixing certain variables at their mean values? Or controlling for some set of variables? How would that be done in Eff and Dow, generally speaking?

Original Message ----------------------------

Subject: Re: splitting the SCCS From: "E. Anthon Eff" <eaeff@mtsu.edu> Date: Sat, July 23, 2011 1:37 pm To: drwhite@uci.edu

Even easier:

z<-which(SCCS$v238==4) ## Moralizing gods ## What we need now Save variables after imputation e.g. for PCA 100% sample' Align print labels Hausman test Output results from stage 1  Instructions include ols_stats$
DW should code Paige and Paige
Load(sccs.Rdata)
rather than *gz installation


For PCA can this be passed to two_stage_ols - so it does not have to be changed? E.g., put the variables in create and the code in 2sls?

Hi Doug. We're back. Still haven't found the time to look at the problem Scott mentioned...


The principal components are all done within the loop over the imputed data sets. It's in the script we submitted with the paper. Looks like this:

#--composite variable for size--
o<-princomp(m1[z,c("superjh","commsize")],cor=TRUE)
o1<-data.frame(psychometric::alpha(m1[z,names(o$center)])) names(o1)<-"sizAlpha" oj<-cor(m1[z,names(o$center)])
o1<-cbind(o1,oj[1,2])
names(o1)[NCOL(o1)]<-paste(rownames(oj)[1],colnames(oj)[2],sep="|")
siz<-(o$sdev/sum(o$sdev))[1]
o4<-cbind(o4,siz,o1)
m1[z,"PCsize"]<-sign(o$loadings[1,1])*o$scores[,1]


Not in the main program:

Latest attempt at change in the code (see: John Fox) If you added the lines I showed you then lm.restricted is a part of ols_stats, so try ols_stats$lm.restricted. You can also try names(ols_stats) to see all the sub-components of ols_stats. Hal: The challenge for me as an outsider to anthropolgy and as a semi-outsider to causal analysis is that I can only comment on the science and not on the crucial sociological/anthropological issues of what folks embedded in this area will find appealing or exciting. The latter issues are unfortunately the most determinative factors in the funding decision, so I fear I can't be of much help in that crucial dimension. As far as the science goes, I can offer the following. 2sls-iv is only one of a considerable number of approaches to recovering causal/structural information. I attach a copy of a paper I think I sent you earlier that encompasses quite a few of these. ("An Extended Class ...") Proposing to extend the software to these contexts might have some appeal. Another extension is to consider nonparametric structural systems. I attach two papers that examine the nonparametric versions of the standard instrument (2sls) and conditioning instrument cases. ("Local Indirect Least Squares ..." - LILS_SUPP.pdf and "Estimating Nonseparable Models ...".) Your use of regression diagnostics is appealing. The attached paper "Robustness Checks and Robustness Tests ..." could be extended to the 2sls framework. This might also be appealing to reviewers. Finally, although I have a great deal of respect for Judea and his contributions, I find the do operator awkward and unnecessary. Rather than focusing on which variables are fixed and which are subject to intervention, as the do operator requires, if one instead focuses on which variables are jointly free to vary as functions of all other system variables, as the partitioning device of settable systems allows, one gains access to causal discourse in a variety of contexts where it is not otherwise possible. This also make distinctions between partial equilibrium effects and full equilibrium effects natural, which is especially important in economics. I attach one of my latest items ( "Causal Discourse ...") that pursues this approach and applies it to an auction game. I apologize for sending along such a wad of papers, but hopefully some of the ideas here may be useful in your nsf proposal or elsewhere. There is a concert with the Big Band and Jazz Hall of Fame Orchestra coming up at the end of August where I am slated to play. I will let you know the details soon. Also, I sometimes sub in the big band at Clay's (the old Elario's); if so I will let you know. We still need to get together for lunch, and now that I am finished traveling for a bit, that is feasible. Are you in town? Or are you in Santa Fe? I'd love to get together! Warm regards, Hal (see Halbert_White#On Causal modeling and Xun Lu for Matlab code on testing robustness that could be translated into R) ## Curating the code for successful models Tolga and Giorgio: if not already, you should download textpad to edit results pages on the wiki by insertion of a first column of spaces http://intersci.ss.uci.edu/wiki/0/create_EduR_1.5DistBrownEff.R (EduR_1.2) EduR_1.2 Comments: Replicates http://intersci.ss.uci.edu/wiki/pdf/eScholarshipBrown&Eff.pdf http://intersci.ss.uci.edu/wiki/0/create_model_value_childrenLangOnly.R (EduR_2) Comments: Does not replicate http://intersci.ss.uci.edu/wiki/pdf/eScholarshipEff&Dow2009.pdf http://intersci.ss.uci.edu/wiki/0/create_EduR_1.5DistLangBrownEff.R (EduR_1.3) same as below: http://intersci.ss.uci.edu/wiki/0/create_EduR_1.5DistLangBrownEffNewMoralGods.R (EduR_1.1) Comments: continued below Comments: incomplete: one case to be deleted Society 186, Yahgan Comments: Compared to Brown and Eff (2010), substitutes ecorich (which groups sccs$v857 categories to “harsher” codes 1&2&6, “temperate or rainforest” codes 3&4, and “savannah” code 5) for PCAP and foodscarc, substitutes animXbridewealth for anim, and has 24 of 28 significance tests at p<.10. That model, with an average R2 =.494, 6.5% higher than the seven-variable model, posits a poor environment, small communities, a pastoral economy with exchange of brides for animals, and caste stratification (the weakest variable), such as occupational castes in pastoral economies, as predictors of high gods that also tend to be more concerned with human morality.
Comments: needs to be run and modified
Comments: Good model, replicates at 79%
• [
• [

## April 2011 instructions from Github (temporarily inactive)

### Basics

This short guide will walk you through the steps of installing the package, building a model, and fitting it.

• Install sccs_1.0.tar.gz to your working directory
• On linux type, ./install.sh; on windows type, ./install.bat
• That requires the "R CMD INSTALL sccs_1.0.tar.gz" command
• which requires that R commander be installed
• In R, set the working directory by typing: setwd("<working dir>") as in the example below. You create your model in the file that you will name in the first source line below. Once it is finished you copy and paste the nine lines below into R, either from R in the unix terminal or the R Gui.
 NEW
setwd('/Users/drwhite/Documents/3sls/sccs/') #Macbook
setwd('C:/My Documents/2sls-is/') #PC
library(sccs)
data(sccs)
source("examples/create/create_EduR_1/create_EduR_1.5DistBrownEff.R")
source("R_3_s_ols/two_stage_ols.R")
#source("examples/create/R_run/run_model_79.R")
source("examples/create/R_run/run_model.R")

 OLD
setwd('/Users/drwhite/Documents/sccs-latest')
library(sccs)
data(sccs)
source("examples/src/Create-EduR-S.R")
source("sccs/R/two_stage_ols.R")
source("examples/src/run_model.R")
ols_stats$restrict_stats ols_stats$r2
ols_stats$restrict_diagnostics  • Create-EduR-S.R - start modifying a sample program (change "S" to an unused number in this wiki) • EduR-0 outlines how the file system is organized at installation and WHERE TO PUT EduR-n.R programs where n is a number so as to keep track of models in this installation. ### Creating a system • DRW: In Create-EduR-S.R I am trying to give the new user a starting point for the simplest possible working model. IT DOESNT WORK AS YET. NOT CLEAR HOW TO DEBUG IT. E.g., debug(run_model.R). • EduR-0 outlines how the file system is organized at installation and WHERE TO PUT EduR-n.R programs where n is a number so as to keep track of models in this installation. • BELOW there are some working models.  setwd('/Users/drwhite/Documents/sccs/sccs-2011/models/src') library(sccs) data(sccs) #You do not debug files but functions defined within files, type ?debug at the command line. setwd('/Users/drwhite/Documents/sccs/sccs-2011') library(sccs) data(sccs) source("models/src/Edu-3.R") debug(two_stage_ols_mi) source("sccs/R/two_stage_ols.R") source("examples/src/run_model.R")   warnings Spatial weights matrix not row standardized 6: In lm.LMtests(lm.restricted, W_mat, test = c("LMlag")) :   setwd('/Users/drwhite/Documents/sccs/sccs-2011) #/models/src') library(sccs) data(sccs) #You do not debug files but functions defined within files, type ?debug at the command line. setwd('/Users/drwhite/Documents/sccs/sccs-2011') library(sccs) data(sccs) source("models/src/create-edumac-1.R") debug(two_stage_ols_mi) source("sccs/R/two_stage_ols.R") isdebugged(two_stage_ols_mi) source("examples/src/run_model.R")   EduR-0 - the new series setwd('/Users/drwhite/Documents/sccs-latest') library(sccs) data(sccs) source("examples/src/create_model_value_children.R") source("sccs/R/two_stage_ols.R") source("examples/src/run_model.R") ols_stats$restrict_stats
ols_stats$r2 ols_stats$restrict_diagnostics


 EduR-0 - the new series
setwd('/Users/drwhite/Documents/sccs-latest')
library(sccs)
data(sccs)
source("examples/src/create_model_value_children.R")
source("sccs/R/two_stage_ols_split_halves.R")
source("examples/src/run_model.R")
ols_stats$restrict_stats ols_stats$r2
ols_stats$restrict_diagnostics   EduR-1 - the new series 1 setwd('/Users/drwhite/Documents/sccs/sccs-2011') library(sccs) data(sccs) source("models/src/create-edumac-1.R") source("sccs/R/two_stage_ols.R") source("examples/src/run_model.R") ols_stats$restrict_stats
ols_stats$r2 ols_stats$restrict_diagnostics

DEBUGGED THE NEW  two_stage_ols.R WITH
o <- linear.hypothesis(lm.unrestricted,drop_vars,test="Chisq")$"Pr(>Chisq)"[2] was linear.Hypothesis or linearhypothesis   setwd('/Users/drwhite/Documents/sccs/sccs-2011) #/models/src') library(sccs) data(sccs) debug(create-edumac-1.R) #You do not debug files but functions defined within files, type ?debug at the command line. setwd('/Users/drwhite/Documents/sccs/sccs-2011') library(sccs) data(sccs) source("models/src/create-edumac-1.R") debug(two_stage_ols_mi) source("sccs/R/two_stage_ols.R") source("examples/src/run_model.R")   ols_stats$restrict_stats
ols_stats$r2 ols_stats$restrict_diagnostics


 setwd('/Users/drwhite/Documents/sccs/sccs-2011')
library(sccs)
data(sccs)
source("models/src/create-edumac-1dichAliased.R")
source("sccs/R/two_stage_ols.R")
source("examples/src/run_model.R")  ##the error will occur here, put in debug()

DEBUGGED THE NEW  two_stage_ols.R WITH

### Error messages

1. We need an error message when the program does not run. Would also help if the program STOPPED when there is an error. And an error message is needed when depvar is not defined, a restrict_var was not in indep_var or in my_sccs or an indep_var is not in my_sccs. The lack of an error statement is confusing to students, e.g., when an restrict_ or indep_ var change gives an error the program just reprints the previous results and students think it is a new correct result.
2. We need something like the coef(xR) and coef (xUr) Error messages for Eff and Dow 2009 that will work for my_data: Error messages for Rccs 2010.
3. After the my_data program runs with some part of the variables in restrict_vars, if the WALD Test diagnostic is significant, have the program cycle through the remaining indep_vars and find the ones that are significant if added (singly) to the current model, the print those results (one or more models).
4. If the RESET diagnostic is significant restart run through and print each regresstion with all but one restrict_vars but the final variables squared, then square rooted. Stop one one is found. User will have a choice what to do next.

### Automation

1. Would it be hard to have an option that allows an exchange of the depvar with an indepvar, gives the new depvarname the name of the old indepvar? ---- One reason why we find little reciprocal causation is that the indepvars are not investigated as depvars.
2. With the new restrVar.exe program we need some way (like my FORTRAN program) to autocopy all the fx_ or my_sccs into indep_var then all but the last into restrict_var. ---- If the vNums and the vNames in Fx could be saved, it would be good to put the vNums at the right in the line for each causal variable (restrict_var results).
3. Can we 'automate so it can collect new bunches of variables that work together. Or add one variable at a time.
4. Ditto, if the RESET Test diagnostic is significant, have the program cycle through the significant indep_vars and find the ones that predict better with log(1+old var) or oldvar2 (squared).
• When there are more then 30 indep_vars, and when many high VIFs, automatically fill the restrict_vars with several overlapping contrast sets, run each separately, and print as many results as there are sets. This will reduce the VIFs and by having fewer "congested variables" increase the likelihood that a significant variable WILL be found to be significant in one or more contrast sets.

### The package

We need to settle on a version 1.0 package among these choices, make it a standard installation from the [R] site (find out how to do this), and thus make the installation easy and not so complex, so students or researchers can easily install at home. And we need to give Mike Migalski an installable version by mid-summer for the fall classes. He does one big installation, not dribs and drabs, so is important to get him our installable code by the end of August.

### Creation of a program to convert my_sccs to indep_var and restrict_var lists

• DRW did one in FORTRAN. Uses SCCS 2SLS Variable Names - SCCS R package Install
• Program is in c:\Core and called restrVar.for /.exe. You give it the filename restr1 to read and it returns the results to restr2. Created Oct 4 2010.

## Debug

The problem I mentioned happens here:


#### Run multiple imputation

• Within R, run the following command to run multiple imputation:
source("examples/src/run_impute.R")

• Verify the dataset "sccs_imputed" was created and saved (in working directory, e.g., "source")

#### Run two stage ols

• Within R, run the following command to run two stage ols:
source("examples/src/run_two_stage_ols.R")

• Check output of OLSresults.csv #in "source" directory: highlight the numeric fields in excel, right click highlighted area, click "format cells" / "numeric" / 3 decimal places (or higher for pval)

## Hosting

• GitHub - Secure source code hosting and collaborative development - Online project hosting using Git. Includes source-code browser, in-line editing, wikis, and ticketing. Free for public open-source code.

## Agenda: Scott

Fix multiple matrices
Document wiki
EasyChange dep_var
Normalization of values option


## Normalized re-expression of ordinal variables

### Basic ideas

1. Normal re-expression (probit analysis) simply reassigns the o(i) values of ordinal variables‎ to ones that are normally distributed in the range 1 ≤ x=integer[1 + k(probit(p(o(i)))] ≤ k, where p(o(i)) is the cumulative percentage of cases in category i, k is a multiplier, and x is an integer truncation of the re-expressed value.
2. This option can transform each ordinal variable in sccs.Rdata to a common scale 1 to k to create normalized variables, save an sccsNorm.Rdata file, and use this file for the linear regression analysis.
3. A second option can raise each such variable to a power t to minimize least-square fit to the dependent variable. This option is specified in the program and does not require a new dataset.

### Formuli

Probit function: 0 ≤ x ≤ 1 click to enlarge
click to enlarge the table

Let each of i=1,...,k categories of an ordinal variable have frq(i), sample size n, and let $p(i)=\frac{frq(i)}{n}$. In the standard normal distribution, $\mu=0$ and $\sigma^2=1$. The cumulative distribution function and its inverse, the probit function, is

  $cdf = \frac12\Big[1 + \operatorname{erf}\Big( \frac{x-\mu}{\sqrt{2\sigma^2}}\Big)\Big]$

  ${probit}(p) = cdf^{-1} = \sqrt{2}\,\operatorname{erf}^{-1}(2p-1)$


To compute in [R] see pnorm and erf. I propose THAT

1. OUR TRANSFORMATION HAVE A LOWER DEFAULT Zmin=1 and an UPPER Zmax=7 that can be reset as parameters.
2. we replace SCCS.Rdata with SCCSN.Rdata. If this were done for integers but now between Zmin-Zmax, normally distributed, I cant see why it wouldnt (1) work and (2) give better Rsq (3) provide for more useful diagnostics esp. regarding normal distribution of residual errors if the regression model is correctly specified. Doug 12:18, 8 June 2010 (PDT)

### Pseudo-code

Compute 0 ≤ Z(i) = probit(i) ≤ 1.

Note that the Z(i) are breakpoints between intervals. What we want are the centroids of these intervalues

$Zinterv(1)=Zmin+\frac{Z(1)-1}{2}$. Compute Z-intervals for i=2 to n-1, $Zinterv(i)=Z(i+1)-Z(i)$.

Let the Z-centroids be $Zcent(1)=Zmin+\frac{Z(1)-1}{2}$.

for i=2,...,n-1 $Zcent(i)=(Zmin+\frac{Z(1)-1}{2})+(\sum_{j=2,...i}Zinterv(j))$.

and $Zcent(n) = (Z(n-1)+\frac{Z(n)-Z(n-1)}{2})$.

last term in excel=G3+(Zmax-G4)/2

Consider the normal curve as having unit volume, bounded left and right at +/-s std.devs rescaled in the range 1-7. Now for each Z-centroid i=1,...,k, Zcen(i) re-expresses a normalized value s.t. cen(i)= f(probit(i)). To compute first get the probit for all values for area of the curve under the unit normal distribution, from .00,.01,.02,...,.99,1.00 (j=0 to 100) for +/-3 std.devs, then iterate the probit values j=1,...,1.00 and for each i set

$Zcen(i)=f(probit(j)) \equiv f(probit(j))\le Zcen(i) < f(probit(j))$

Then for each i the value Z-centroid Zcen(i) has a value between Zmin=1 and Zmax=7.

Create a new .Rdata file. If decimals dont work with Eff and Dow, then truncate Zcen to integer.

### Commentary

This is done for ordinal categories, if k>2, and all others as well (e.g., percentages). TEST IF THIS TRANSFORMATION CAN BE DONE FOR THE ENTIRE DATABASE AND SAVED AS A NEW DATABASE. Values with low frequency are be moved closer to adjacent values in the normalized scale. The normal curve is considered as having unit volume but the Z scale in bounded. Consider the histogram rectangle for an ordinal variable as having unit volume. Consider the rectangle for the "middle" category j of i=1,...,k categories that contains the median of the distribution and prop(j) of the cases in category j. Adjust that rectangle to have a width and height that minimize the difference in shapes with its position superimposed over the normal curve of unit 1. Consider how to do the same for each j rectangle to the right and to the left. The left/right spacing of each rectangle is not yet fixed however, so consider how to do that as follows.

1. The spacing adjustments have to recognize that the right boundary on the -tail / Z score / + tail has to correspond to the negative z score so that the area proportion to the left of the unit area for the normal distribution equals the proportion of cases for that variable. Do each category j of i=1,...,k according will 100% of the unit area is covered. You can solve for the height of each rectangle and that show satisfy our criterion. You dont need calculus for this, just a function that converts z-scores to areas under the normal curve.
2. So all you need is the pdf or cdf for the normal curve, as shown at Wikipedia:Normal Curve
  pdf        = $\frac{1}{\sqrt{2\pi\sigma^2}}\; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }$


The general ideas for re-expression come from Tukey (e.g., 1977: 103), but his methods are all pencil and ruler, and dont require a program.

Tukey, John 1977 Exploratory Data Analysis. Reading, Mass.: Addison-Wesley. This takes each of the variables, after imputation, and including the dep_var, and normalizes them. The second page is from a study that uses his ideas:
Mallows, C. L., & Tukey, J. W. (1982). An overview of techniques of data analysis, emphasizing its exploratory aspects. In J. T. de Oliveira & B. Epstein (Eds.). Some recent advances in statistics (pp. 111-172). London: Academic Press.

### A second idea: transforming the variables prior to regression

Tukey's contribution was to rexpress (each of our independent) variables so that their plot is linear. This is done by minimizing least squares between val(i) and $tval(i)=val(i)^t$ by varying the exponent t. (The t for each independent variable would be given with the results.

### References

1. for dichotomies probit [R] glm(c^y, family=binomial(link="probit")) THIS WE CAN USE FOR DICHOTOMIES. IN GENERAL, HOWEVER, WE CAN DO OUR OWN PROBIT ANALYSIS BY RENORMALIZATION, HAVE OUR [R] PROGRAM RECOGNIZE BINARY VS ORDINAL AND TREAT ACCORDINGLY
2. ordered logit/probit [R] p21 MASS library polr(Sat ~ Infl + Type + Cont, method="probit"), where Sat is an ordered factor vector
3. for multinomials nnet [R]
4. for MCMC p21 MNP [R] library mnp() see 4.2.2
p242: "Following the elementary statistics textbooks for finding centiles (percentiles), express the ranks as cumulative proportions, and refer these to a unit normal curve ... to read off fz, or use the PR column of Table 6.4.2, where 5 has been added to zp to yeild probits."

## Agenda: Doug

NSF Proposal 2010:July see: EduMod-71 - EduMod-52:Imputation and Regression has shortened commands that will run the sample programs

1. The lang and dist matrices have reversible time, similarities from proximities.

Would it make sense to also have a directed time matrix, i.e., asymmetric. That would differ from the variable "dateobs" which can have a symmetric effect, like distance DONE: have need chunks of code that will let students run code on wiki

1. MAKE PARAMETERS TO change nimps --- OK NOW FOR parameters 10,10,3 (iterations, ..., pval)
2. CAN NOW change the dep_var
3. CAN NOW change the indep_vars
4. CAN NOW change the restricted model--xUR
5. CAN NOW change the restricted model--xR
6. CAN NOW have all variables in the model diagnostics--dropt --> wald
7. Comments on imputed vars relative to dep_var
8. CAUTIONS: Cannot run from restricted down
9. CAN RUN FROM dep_var down