SFI2011 project

Manual for Dataset Construction and Causal Analysis Doug 07:03, 20 March 2012 (PDT)

Our causality working group submitted a full proposal on March 1 for a Templeton basic research grant on “Cross-cultural adaptive dynamics for theories of evolution of moral gods and ethical principles.” The group will meet at SFI in the last week of March to discuss subprojects on “Organizing databases on causes of evolution in the moral aspects of religion in chiefdoms, early states, secular cycles of empires and the Standard Cross-Cultural Sample,” “Effects of structural cohesion in forager networks on the evolution of cooperation,” and to review advances in 2SLS, SEM, and causal graph software. See also: SFI2012 project


Wish list

  • To do: relaimpo,
  • Hal White Hausman test,
  • option for indep_vars not subject to robustness tests,
  • dot graphs that run sem models from EQS-type commands,
  • Kyono Commander in R for EQS output (replaced by DagR by Sven, DAGitty),
  • results of first-stage OLS (autocorrelation coeffs per WX variable),
  • ?Griffith test of effective sample size using scale() to normalize variables.
  • Option to restrict the X in WX to the restrict_vars leaving the default the full set of independent variables.
In my opinion (and others might disagree) it seems arbitrary to restrict the X in WX. So I like the idea of giving the user an option. But I think the default should be the full set of independent variables in the unrestricted models (again, just my opinion). - Anthon

look inside an object like WYWX using the str() command

Doug: You can look inside an object like WYWX using the str() command: str(WYWX). R is case-sensitive, so Str(WYWX) will fail.

If WYWX is an lm object (i.e., WYWX <- lm(WY ~ WX)), then you can access your original variable WX.milk using:


str(WYWX)
WYWX$model$WX.milk
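
A minimal sketch of the same idea on made-up data. The data frame d and its values are hypothetical; only str() and the $model component of an lm object are standard R. It assumes WX.milk enters the formula as a named column.

```r
# Hypothetical data standing in for WY and the WX.milk column
set.seed(1)
d <- data.frame(WY = rnorm(20), WX.milk = rnorm(20))

# Fit the model, then look inside the fitted object
WYWX <- lm(WY ~ WX.milk, data = d)
str(WYWX, max.level = 1)   # top-level components only (coefficients, residuals, model, ...)

# The model frame keeps the variables that went into the fit
names(WYWX$model)
milk <- WYWX$model$WX.milk
```

If WX is passed to lm() as a whole matrix rather than as named columns, the model frame stores it as a single matrix column, so the column would be reached with WYWX$model$WX[, "milk"] instead.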


Forager networks from Kinsources

For further testing in year 2 of kinship network hypotheses, Boehm’s new codes for a set of 53 LPA societies will cover 11 forager cases in our Kinsources database that have genealogical networks for the cohesion analysis and GSR Oztan's thesis: eight Inuit (No. Alaska, NW Alaska, Baffinland, Copper, Iglulik, Labrador, Netsilik, Nunamiut); three others (Kung, Yolngu Murngin, Vedda); three more that could be coded from community-wide genealogies (Agta, Tiwi), and possibly others from the eHRAF files; thus cases for seven or more of the eleven forager regions. The evolutionary hypotheses we want to test are that: 1) those subnetworks with more structurally cohesive ancestral groupings have significantly more offspring on average than others, and 2) those networks with the more cohesive cores have more of Boehm’s indicators of prosociality. We should be able to start on these questions earlier rather than later in the project by placing these 11 societies at the top of the queue for new coding by Boehm’s coder.

There are 29 forager societies with no animal or cultigen dependence in the SCCS, including five in Kinsources (Agta, Apache, Copper Eskimo, Kung, Netsilik), and seven in Boehm’s forager sample (Copper Eskimo, Eyak, Kung, Mbuti, Semang, Vedda, and Yurok). Type of supernatural sanctions for these societies and for the seven SCCS fully forager societies will also be coded early by Boehm’s research coder. She will be doing further coding of supernatural sanctioning but not of social sanctioning.

Bug Reports

Here I attach a detailed bug description and an R source file to replicate it.

Giorgio, Ubuntu bug report: W %*% X (following data imputation) produces an error in an R version for Ubuntu Linux.
But that version was compiled by Giorgio, so the solution may not be general.
new Ubuntu_run_model.R

Plan for the R scripts to link to SEM and include timeseries with autocorrelation

if you look at http://eclectic.ss.uci.edu/~drwhite/ you'll see that the last of the "book covers" etc has the title of our Leipzig article. I am almost done editing the revision, so that we can cite this article if we are approved by Templeton to submit a grant proposal. Click to see the pdf.

The advantage of a url version of the article is that links to the data and code in the source("....R") etc are live so the reader will be able to copy and paste code into R.

Once all that works (only pp 0-5 are copy edited) I am thinking of inviting John Fox to share our code and add scripts that will convert one of our models into an sem. Since he is author of sem for R and a book on R he might be interested, and then we can all do a book together; this would fit with Giorgio's seminar experience in SEM.

Henry Wright, one of the four faculty on the Templeton project, has been staying with us for 5 days. He'll be at our SFI working group meetings. He has a dataset published in Str&Dyn with coded data on clusters of chiefdoms and states, and is expanding his sample to run with our 2SLS-IS scripts. He is also adding some time-series data on how these chiefdoms and states have evolved over time, based on archaeological data.

So my idea is also to add an R script for Turchin's methods of time-series analysis with W matrices for temporal autocorrelation.

Plot with regression line

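reg1 <- lm(sccs$v1650 ~ PopdenEff)   # v1650 regressed on PopdenEff
# these differ: the two regressions give different lines
reg2 <- lm(PopdenEff ~ sccs$v1650)   # PopdenEff regressed on v1650
plot(PopdenEff, sccs$v1650)
abline(reg1)                         # the line that matches this plot (y = v1650, x = PopdenEff)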
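
A small self-contained sketch of why the two regressions differ, using simulated x and y in place of PopdenEff and sccs$v1650 (the simulated values are assumptions, not SCCS data). The two slopes are not reciprocals of each other; their product equals the squared correlation.

```r
# Simulated stand-ins for the two variables
set.seed(2)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

fit_yx <- lm(y ~ x)   # predicts y from x: this is the line to draw on plot(x, y)
fit_xy <- lm(x ~ y)   # predicts x from y: a different line

slope_product <- unname(coef(fit_yx)["x"] * coef(fit_xy)["y"])
r_squared <- cor(x, y)^2   # equals slope_product

# Draw the scatterplot with its regression line when run interactively
if (interactive()) {
  plot(x, y)
  abline(fit_yx)
}
```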


Correction for missing data in v1650

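newvar <- sccs$v1650                # presumably: work on a copy of v1650
newvar[which(newvar == 0)]  <- NA   # recode 0 as missing
newvar[which(newvar == 88)] <- NA   # recode 88 as missing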
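
The same recoding can be wrapped in a small helper so the list of missing-data codes is stated once. This is a sketch: recode_missing is a hypothetical name, and the example vector is made up.

```r
# Hypothetical helper: set every value listed in `codes` to NA
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA   # %in% never returns NA, so existing NAs are left alone
  x
}

v <- c(3, 0, 5, 88, 2, NA)                    # made-up values, 0 and 88 = missing
newvar <- recode_missing(v, codes = c(0, 88))
```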

How to normalize and average data in a data frame

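Ideally I want to scale the values for each column to the range (-1,1). Note that scale(), used in the code further down, standardizes each column to mean 0 and standard deviation 1 rather than to a fixed range.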

#A <- matrix(c(2,3,-2,1,2,2),3,2)
                                          -2 -1  0  1  2
 Absent                                   17 21 13  9  9
 Present, inactive, unconcerned           11  9 12 11  8
 Present, active, unconcerned with humans  4  4  7  4  4
 Present, active, supportive of morality   2  8  8 16  9
AgpotScale=matrix(c(scale(sccs$v921),scale(sccs$v928)),186,2)  #AgriPot 1&2
SizeJHscale=matrix(c(scale(sccs$v63),scale(sccs$v237)),186,2)  #CmtySize & Superjh
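
A sketch contrasting the two kinds of normalization and then averaging the columns row by row. The data frame df and the helper rescale_pm1 are hypothetical; scale() and rowMeans() are base R.

```r
# Hypothetical helper: linear rescaling of one column to the range [-1, 1]
rescale_pm1 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  2 * (x - rng[1]) / (rng[2] - rng[1]) - 1
}

# Made-up two-column data with one missing value
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 40, 20, NA, 30))

z <- scale(df)                            # z-scores: mean 0, SD 1 per column
composite_z <- rowMeans(z, na.rm = TRUE)  # average the standardized columns per row

pm1 <- sapply(df, rescale_pm1)            # each column now runs from -1 to 1
```

With na.rm = TRUE a row with one missing column gets the value of the remaining column, which may or may not be what is wanted for a composite.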

Replace PCA from Psychometric package

twoVARS=cbind(sccs$v63,sccs$v237)  #average CmtySize with Superjh

W alignment from Murdock and White 1969


White-Murdock alignment
Wlink=matrix(data = 0, nrow = 186, ncol = 186)
for(i in 1:185) {
for(i in 1:184) {
for(i in 1:183) {
for(i in 1:182) {
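
The loop bodies are not preserved on this page. As a sketch only, assuming the four loops link each society to its neighbours 1, 2, 3, and 4 places away in the SCCS lineup (which matches the ranges 1:185 to 1:182), a band matrix can be built as follows. The 1/k weighting by lag is an assumption made for illustration, not the original coding.

```r
n <- 186
Wlink <- matrix(data = 0, nrow = n, ncol = n)

max_lag <- 4                      # lags 1..4 correspond to loops over 1:185 ... 1:182
for (k in 1:max_lag) {
  for (i in 1:(n - k)) {
    Wlink[i, i + k] <- 1 / k      # assumed weight: closer in the lineup = stronger link
    Wlink[i + k, i] <- 1 / k      # keep the matrix symmetric
  }
}

W <- Wlink / rowSums(Wlink)       # row-normalize so each row sums to 1
```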


Kinship networks and forager fractality - Giorgio Gosti

  • To reexamine Roes 1995 create for the EA a hierarchical-boxes W matrix from the 200 province codes: 1 same province, 2 same of 400 clusters 3 adjacent in the SCCS lineup 4 second order in the SCCS lineup. These are based on similarities and close proximities. Might be better than distances.
  • Recode
see Combine variables

Key Adjustments for subsample differences

Problem: in the SCCS some sets of variables (those of Karen Paige and Jeffrey Paige, see EduMod80, and those of Peggy Sanday) are coded only for PRESTATE SOCIETIES. When they are included in an SCCS MODEL there need to be bias linkages to other variables that express differences in sample construction: which variables are distorted (see Sander Greenland).

  • take care using Paige variables - coded for less than 1/2 the sample
  • The solution is to impute the depvar with MI along with the others, and to define a binary 0/1 variable for coded/missing on the Paige&Paige subsample
Paige657=sccs$v657,  # summed in v663#  Paige657 Paige658 Paige659 Paige660 Paige661 Paige662
femproduceND=sccs$v658, #Paige658=sccs$v658,  # summed in v663 
Paige659=sccs$v659,  # summed in v663
Paige660=sccs$v660,  # summed in v663
Paige661=sccs$v661,  # summed in v663
Paige662=sccs$v662,  # summed in v663
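
A minimal sketch of the coded/missing indicator described above. The vector dep is made up and stands in for one of the Paige variables; the imputation step itself is not shown.

```r
# Made-up values for a variable coded on only part of the sample
dep <- c(2, NA, 1, NA, 3, 2)

# 1 = society was coded on this subsample, 0 = missing (to be imputed)
coded <- as.integer(!is.na(dep))

table(coded)   # how many cases fall in each group
```

The indicator can then enter the model as an ordinary independent variable alongside the imputed values.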

Adjusting for Confounders

  • Shpitser, Ilya, Tyler VanderWeele, James M. Robins. On the Validity of Covariate Adjustment for Estimating Causal Effects. 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), Catalina Island, California, July 8-11 (a UCLA event).
  • Ilya Shpitser, Tyler J. VanderWeele. 2010. A Complete Graphical Criterion for the Adjustment Formula in Mediation Analysis. The International Journal of Biostatistics 7, 2, 1.


Manual for Dataset Construction and Causal Analysis


We have 2000+ variables in the 186-society Standard Cross-Cultural Sample data that I initiated in 1969, which are very widely used (thousands of studies use these data). Studies of 100 variables have been completed (as dependent variables; over 600 independent variables enter into the causal graph). There are also 45+ variables in an enormously detailed cross-cultural study of the world's major (N=339) foraging societies, which is also useful for reconstructing early human evolutionary patterns.


Abstract: Structural equation modeling (SEM) is the leading method of causal inference in the behavioral and social sciences and, although SEM was first designed with causality in mind, the causal component has been obscured and lost over time due to a lack of adequate formalization. The objective of this thesis is to introduce and integrate recent advancements in graphical models into SEM through user-friendly software modules, in hopes of providing SEM researchers valuable information, extracted from path diagrams, to guide analysis prior to obtaining data. This thesis presents a software package called Commentator that assists users of EQS, a leading SEM tool. The primary function of Commentator is to take a path diagram as input, perform analysis on the graphical input, and provide users with relevant causal and statistical information that can subsequently be used once data is gathered. The methods used in Commentator are based on the d-separation criterion, which enables Commentator to detect and list: (i) identifiable parameters, (ii) identifiable total effects, (iii) instrumental variables, (iv) minimal sets of covariates necessary for estimating causal effects, and (v) statistical tests to ensure the compatibility of the model with the data. These lists assist SEM practitioners in deciding what test to run, what variables to measure, what claims can be derived from the data, and how to modify models that fail the tests.


Don't use this; it's expensive.

Distance matrix geosphere [R] EduMod84

replaced: see Manual


http://cran.r-project.org/web/packages/geosphere/geosphere.pdf - distm [R] requires install.packages("geosphere")
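
For readers without the geosphere package, a base-R sketch of a great-circle distance matrix and its conversion into a row-normalized W matrix. This uses the haversine formula as a stand-in and is not the distm code itself; haversine_matrix, the Earth radius of 6371 km, the three coordinates, and the inverse-distance weighting are all assumptions for illustration.

```r
# Hypothetical helper: pairwise great-circle distances in km (haversine formula)
haversine_matrix <- function(lon, lat, r = 6371) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  dlon <- outer(lon, lon, "-")
  dlat <- outer(lat, lat, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  2 * r * asin(pmin(sqrt(a), 1))   # matrix first in pmin() so dimensions are kept
}

# Three made-up points: equator/Greenwich, North Pole, equator at 90 degrees E
lon <- c(0, 0, 90)
lat <- c(0, 90, 0)
d <- haversine_matrix(lon, lat)

# Inverse-distance weights, zero diagonal, rows summing to 1
W <- 1 / d
diag(W) <- 0
W <- W / rowSums(W)
```

Societies sharing identical coordinates would give a zero distance and an infinite weight, so real data would need a rule for that case.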

Read.csv Write.csv

Instead see Rdata
setwd("C:/My Documents/Binford")
forag <- read.csv("Binford47vars339.csv", header=TRUE)
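
A sketch of the csv-to-Rdata round trip on a tiny made-up data frame, written to a temporary directory so nothing in the working directory is touched. The file names and the data are hypothetical.

```r
# Made-up stand-in for the forager data
forag <- data.frame(society = c("A", "B", "C"),
                    lat = c(10.5, -3.2, 45),
                    lon = c(20, 30.1, -100.4))

csv_path   <- file.path(tempdir(), "forag_example.csv")
rdata_path <- file.path(tempdir(), "forag_example.Rdata")

write.csv(forag, csv_path, row.names = FALSE)   # write without the row-name column
forag2 <- read.csv(csv_path, header = TRUE)     # read it back

save(forag2, file = rdata_path)                 # store as *.Rdata
rm(forag2)
load(rdata_path)                                # restores the object under its own name
```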
  • A final version of Binford_2001_data.xlsx (original: Binford 2001 data.xlsx) will have N/S and E/W lat and long
  • StatTransfer.0.rar
#install stat transfer by double click stdemo9.exe
#open stat/transfer
#click on the tab "about" on the up-right side.
#click on "install License", then there will be a pop-out window.
#open "stattransfer9lic.txt" copy all of them, by hit "Ctrl+a" then "Ctrl+c"
#Paste "Ctrl+v" on that pop-out window. then kit enter.

NOW, you are done. it will expire in year 2050... enjoy it. pywolf


New results

pastoralExch=((sccs$v208==1)*1)*(sccs$v858==6)*1 #bridewealth*pastoralism

pastoralExch is the new variable substituted as an indep_var in the EduMod78, EduMod79, and EduMod80 models. The effect is to define a "pastoralExch" variable that parallels the variable "money"; together they define the two exchange systems associated with "moral gods" (v238) and "evil eye" (v1188). Actually, pastoralExch does not knock out milking for Moral gods (less significant) or Money (nonsignificant, negative).
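
One detail worth checking when building a product of two 0/1 indicators is how missing values propagate. A sketch on made-up vectors (not SCCS data) comparing the product form above with a logical AND:

```r
# Made-up codes; NA marks a society not coded on that variable
v208 <- c(1, 1, 2, NA, 1, 2)
v858 <- c(6, 3, 6, 6, NA, NA)

# Product of indicators, as in the pastoralExch line: NA if either part is NA
prod_version <- ((v208 == 1) * 1) * ((v858 == 6) * 1)

# Logical AND: FALSE & NA is FALSE, so the last case becomes 0 instead of NA
and_version <- as.numeric(v208 == 1 & v858 == 6)

cbind(v208, v858, prod_version, and_version)
```

The two versions agree wherever both variables are coded and differ only in the last row, so the choice matters mainly for how many cases are sent to imputation.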

New stuff while at SFI

Let's use partial regression plots for confounders
Let's use pair plots to plot all pairs of dep/indep and indep/indep variables
15.17. How is the regression coefficient interpreted in multiple regression?
In this case the unstandardized multiple regression coefficient is interpreted as the predicted change in Y (i.e., the DV) given a one unit change in X (i.e., the IV) while controlling for the other independent variables included in the equation.
The regression coefficient in multiple regression is called the partial regression coefficient because the effects of the other independent variables have been statistically removed or taken out (“partialled out”) of the relationship.
If the standardized partial regression coefficient is being used, the coefficients can be compared for an indicator of the relative importance of the independent variables (i.e., the coefficient with the largest absolute value is the most important variable, the second is the second most important, and so on.)
Google: partial regression Pearl
  • for Pearl 2000: p 150 fn. 13 partial regression $r_{YX \cdot Z} = \alpha + I_{YX \cdot Z}$
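
A sketch of what "partialled out" means, on simulated data (the variables x, y, z and their coefficients are made up). Regressing the residuals of y on z against the residuals of x on z gives the same slope as the coefficient of x in the multiple regression; the scatter of those two residual series is the partial regression plot.

```r
set.seed(3)
n <- 100
z <- rnorm(n)                        # confounder
x <- 0.6 * z + rnorm(n)              # independent variable, related to z
y <- 1.5 * x - 0.8 * z + rnorm(n)    # dependent variable

full <- lm(y ~ x + z)                # multiple regression

ry <- resid(lm(y ~ z))               # y with z removed
rx <- resid(lm(x ~ z))               # x with z removed
partial <- lm(ry ~ rx)               # slope equals coef(full)["x"]

# Partial regression plot, plus all pairwise scatterplots, when run interactively
if (interactive()) {
  plot(rx, ry)
  abline(partial)
  pairs(data.frame(y, x, z))
}
```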



see Spirtes et al 1998 - This is it: Spirtes, Peter; Thomas Richardson; Christopher Meek; Richard Scheines; Clark Glymour. Using path diagrams as a structural equation modeling tool
see Spirtes et al 1998 - different but related - Scheines, Richard; Peter Spirtes; Clark Glymour; Christopher Meek; Thomas Richardson. The TETRAD Project: Constraint Based Aids to Causal Model Specification --- pdf
see Pearl 1998a
  • Pearl 2009:41 on Spirtes, Peter, Clark N. Glymour, and Richard Scheines. 1993. Causation, Prediction, and Search. New York: Springer-Verlag. Lecture Notes in Statistics 81.
  • Read: Repast kinsim - Padgett sim - Padgett 2010 - use ergm to find cycles? - email UW Mark Handcock
  • Read: Clip from IntlAssessnebt - find addl maps
  • Examine: Kinsources - Foragers - Denham archives
  • Invite Chris Boehm to give uci video conf talk Oct 22?

Install sccs

If you decompress sccs_latest.tar.gz , then you can do the following to test that you get the same results as eff and dow:

R commander

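Run this from the command line (DOS, UNIX) in the parent directory (sccs2) that contains sccs and examples. Use cd to get there; setwd() is an R function and does not work at the command line:

cd "C:/My Documents/sccs2"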
R CMD check sccs
R CMD build sccs
R CMD INSTALL sccs_1.0.tar.gz

Then in R, run:

setwd("c:/My Documents/sccs2")
#setwd("/Users/scottblanc/research/Statistical-Inference-In-SCCS")  # change your path here
source("./examples/src/create_model_value_children.R")  # this is the model Eff and Dow use in their paper
and then inspect the output of the file: children_ols_summary_results.csv

I added documentation for installing sccs here:


I also created a new project on github which is publicly downloadable here: http://github.com/drwhite/Statistical-Inference-SCCS

This supersedes the old project. The instructions above mention where to download sccs from; that should now be the github link.

At the bottom of the page: http://github.com/drwhite/Statistical-Inference-SCCS, is README.txt which gives instructions on "Getting Started" which includes how to install sccs as well as how to run the example models from the examples/ directory.

Please try running this on a fresh Windows machine and let me know if you run into any issues.

SFI Aug 23 - Sept 13th 2010


Possible sources of funding

Doug, I gave it a quick read and it looks OK to me. My role will be contributing the historical database of socio-cultural evolution - fine with me.

Doug, some guidelines on budget: http://www.templeton.org/what-we-fund/our-grantmaking-process/frequently-asked-questions.
  • Build on EvileyeMoralgodsBook3b.docx
  • The John Templeton Foundation serves as a philanthropic catalyst for research and discoveries relating to the Big Questions of human purpose and ultimate reality. Life Sciences Paul Wason, PhD, works with Chris
SFI joins John Templeton Foundation to assess complexity research
  1. 2011 Funding Cycle 2:
  2. OFI Deadline: Oct 14, 2011
  3. OFI Decisions: Nov 23, 2011
  4. Full Proposal Deadline: Mar 1, 2012
  5. Funding Decisions: Jun 22, 2012

Talk and Powerpoint

right click powerpoint: Multilevel networks and world ethnography BUT change the extension to .pptx rather than .zip (it's not a zip!)

Paper and abstract

right click Causal Inference for Multilevel Networks of Early Ethnographically Well-studied Populations BUT change the extension to .docx rather than .zip (it's not a zip!)

NEW noon 9/4/2010 corrected Table 3 and from Fig. 3 up to the conclusion.
pdf THE Pdf: authors Scott D. White, Douglas R. White, Feng Ren, and B. Tolga Oztan
  • (if you make edits pick your color - DRW is red - Tolga is Indigo/Blue - Ren Feng is dark green - and make them in your color)

right click NSF abstract BUT change the extension to .docx rather than .zip (it's not a zip!) bio



Jajmani system with two time periods for a test of causal prediction

EduMod85 - next step EduMod86 with the new program package.

    • Allows 1 or 2 W matrices
    • Run restricted models
    • Compute regression coefficients
    • Compute total effects (adding indirect path)
    • Is the reciprocal effect x-->y plus y-->x-->y as the indirect path?
    • Correlate total effects to delta x <--> delta y (time lagged correlations)
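
A sketch of "total effect = direct effect + indirect path" for the simplest one-way case, x --> m --> y plus x --> y, on simulated data (all variables and coefficients are made up). It does not address the reciprocal x <--> y case raised in the list above. With OLS, the direct coefficient plus the product of the two coefficients along the indirect path reproduces the coefficient from regressing y on x alone.

```r
set.seed(11)
n <- 200
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)              # intermediate variable on the indirect path
y <- 0.3 * x + 0.7 * m + rnorm(n)

a <- coef(lm(m ~ x))["x"]            # path x --> m
fit_y <- lm(y ~ x + m)
direct <- coef(fit_y)["x"]           # path x --> y, holding m constant
b <- coef(fit_y)["m"]                # path m --> y

indirect <- a * b                    # product of coefficients along x --> m --> y
total <- direct + indirect           # total effect of x on y

total_check <- coef(lm(y ~ x))["x"]  # same number, estimated in one step
```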

Beidelman, Thomas J. 1959. Toward a Comparative Analysis of the Hindu Jajmani System. Monographs for the Association of Asian Studies, VIII. New York: J. J. Augustin.

Kolenda, Pauline. 1963. (critique of Beidelman) Toward a Model of the Hindu Jajmani System Human Organization 22(1):11-31. - Cited by 54 - Related articles

Orans, Martin. 1968. Maximizing in Jajmaniland: A Model of Caste Relations. American Anthropologist 70(5): 875–897. pdf This article has 7 variables coded for one or both of two time periods for 39 Indian villages, located geographically, and with probably calculable linguistic phylogeny. FIFTEEN ARE CODED FOR BOTH TIME PERIODS (p 891). Can we estimate the causal relations from time period 1 and predict what will happen at time 2?

90 society North America

Central states Amerindians


Scott D. White - Tolga Oztan - Ren Feng - Douglas R. White, Organizer

Key publication and Exercise with [R] code

kmz field site mapping example

Woodrow_W._Denham#Kmz_Site Sample kmz field site mapping


The three samples for the project are Foragers, Cultivators, and Contemporary Ethnographies.

Kinship networks and simulation

Then read John_Padgett#Open_Elite.3? and review the analysis of Structural cohesion, running an example or two in R.
  • Kinsources is a part of the NSF proposal. There is a project here to work with Mark Altaweel who has implemented my JASSS random marriage simulations. We will want to compare the random simulations to simulations with Repast that vary the rule of marriage "within generations" (it is the parents' generation that is held constant, the marriage of the children of that generation are permuted).
  • We need to select out those for study. All of the sources from Denham are foragers. We need to sort out, according to Boehm, which are pre-cultivators and which are solely foragers. Three lists, then:

EthnoAtlas societies in Kinsources

Other societies in Kinsources

Other Samples and Coding, including World-system coding

Societies with Kinsources data

  • 12 foragers with Atlas data
  • 12 agriculturalists with Atlas data
  • 17 Dravidianate - it is for these that Leaf says "organizational"
  • 40 other cases - large datasets, some urban - no Atlas data

Post 1944 and post 1965 ethnographies

i.e., Selection of post-1944 Ethnographies
Laura_Fortunato#Ethnographic_Database_Project, Codes included in the EDP
if so, we also need to find a more open source survey system for questionnaires

State societies

  • Might be good to code for "Variables: 561-575" in the SCCS codebook for the state societies not coded by Paige and Paige, particularly the "Fraternal Interest" variables not otherwise coded. That's more to the issue of the domain of dependent variables compared to others. Topic for discussion.



Binford data

  • (Examine how: Kinsources - Foragers - Denham archives - relate to Binford sample)
  • Stephen Shennan. 2004. Forty Years On: Review of Binford, Constructing Frames of Reference. Journal of Human Evolution 46: 507-515.
  • Binford forager data: Tolga and Ren Feng put all the Binford xls files into an *.Rdata file, which has latitude/longitude for conversion into W matrices; language families are available from the Ethnographic Atlas, when indexed for Binford's data. Binford himself did not consider peer effects of distance or language.
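
A sketch of a language-family W matrix built from a vector of family labels. The labels FamA, FamB, FamC are placeholders, not Atlas codes, and the simple same-family 0/1 link is an assumption; a phylogenetic distance would be finer grained.

```r
# Placeholder family labels for five hypothetical societies
family <- c("FamA", "FamA", "FamB", "FamC", "FamB")

# 1 where two societies share a family, 0 otherwise
Wlang <- outer(family, family, "==") * 1
diag(Wlang) <- 0                     # a society is not its own neighbour

# Row-normalize, leaving societies with no relatives as all-zero rows
rs <- rowSums(Wlang)
Wlang[rs > 0, ] <- Wlang[rs > 0, ] / rs[rs > 0]
```

Rows of zeros (isolates such as the fourth society here) need a decision before the matrix is used, for example falling back on the distance matrix for those cases.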

Boehm's coding protocol for foragers

  • Boehm will join the group the 2nd week, Thursday, perhaps Friday.
  • Focus here is on foragers lacking neolithic domesticates. Boehm's tables need to be entered into spreadsheets and then converted to *.Rdata format. The first set of studies emphasizes sanctions on leaders.
  • "CHRIS BOEHM" <cboehm1@msn.com>
Date:   Sun, August 22, 2010 11:05 am 
To:   Douglas.White@uci.edu  

Here's the protocol, Doug. All 53 Pleistocene-appropriate hgs have been coded using it. Obviously, it applies only to social behavior exclusive of kinship. Hoping to see you in a week or two, Thurs or Fri or if I'm really lucky, both. C

Misc. Readings

Abstract. Among foragers, men's foods are often shared widely outside the household, undercutting variation in the benefit their wives and children receive. This means polygyny may not be due to variation in household provisioning. Some have even suggested that bonds in general, whether polygynous or monogamous, may have less to do with male provisioning than male-male contest competition. However, an analysis of foragers in the Standard Cross-Cultural Sample reveals that male provisioning does affect the mating system. Societies with higher male contribution to subsistence are more monogamous. The author argues that women value male provisioning less where males bring in less food, which results in greater polygyny. Where it is difficult for women to acquire food, they value male provisioning more, forcing males to compete via food acquisition. Food sharing prevents the polygyny threshold from being reached but does not completely erase the benefit of pair bonding with a good forager.

sanctions on leaders for foragers generally

sanctions for foragers lacking neolithic domesticates

Foragers with post-neolithic domesticates

Here we need to compare our list with those of Chris Boehm

Excel datasets

- Marlowe - Hill - Boehm

Food producers

Peer effects two-stage OLS

  • Run EduMod page in R
  • Run ordinal to normalized re-expression (does an entire dataset, converts the *.Rdata file)
  • Can be run at several levels:
Kinsources: Interfamily and IntraFamily
General sample
Food producers
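
A bare-bones sketch of the two-stage idea on simulated data, not the EduMod or 2SLS-IS scripts themselves. The ring-shaped W, the variables, and the coefficients are all made up. Stage 1 regresses the network-lagged dependent variable Wy on X and WX; stage 2 regresses y on the fitted values and X.

```r
set.seed(42)
n <- 60

# Hypothetical W: each society linked to its two neighbours on a ring
W <- matrix(0, n, n)
for (i in 1:n) {
  W[i, (i %% n) + 1] <- 1
  W[i, ((i - 2) %% n) + 1] <- 1
}
W <- W / rowSums(W)

# Simulate y with a peer effect rho through W
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
rho <- 0.4
beta <- c(1, -0.5)
y <- as.numeric(solve(diag(n) - rho * W, X %*% beta + rnorm(n)))

# Stage 1: instrument Wy with X and WX
Wy <- as.numeric(W %*% y)
WX <- W %*% X
colnames(WX) <- c("Wx1", "Wx2")
stage1 <- lm(Wy ~ X + WX)
Wy_hat <- fitted(stage1)

# Stage 2: y on the instrumented peer effect and X
stage2 <- lm(y ~ Wy_hat + X)
coef(stage2)
```

summary(stage1) shows the first-stage results per WX variable. The standard errors printed for stage2 by lm() are not the corrected 2SLS standard errors.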

World community baseline maps

Scott's new code

SCCS R package

Changes to the R code (Brown & Eff 2010 --> Scott White)

Chris Brown and Tony Eff have a new article with some additions to their [R] code and a real example with R2=.50 or so. New items:

  1. Ways of referencing variable names, e.g. "dummy", PC... for principal component, ... squared.
  2. The code is attached as "supplementary"
  3. In the code the better way of drawing maps
  4. A reference to a way of determining the optimal combination of the language and distance matrices: Dow and Eff 2009a. Cultural trait transmission and missing data as sources of bias in cross-cultural research: Explanations of polygyny re-examined. Cross-Cultural Research 43: 134-151.
"Employing the composite weight matrix method presented in Dow and Eff (2009a:142), we combine the two weight matrices in order to find the linear combination of the two which best explains the transmission process (i.e., results in higher model R2); the resulting weights provide a way of assessing the relative importance of the two cultural transmission channels."

  5. Fuller explanation of the diagnostics, output labels for each, and a new Hausman test for each variable
  6. TWO MORE COEFFICIENTS: IN ADDITION TO OLS, a stdcoef in terms of #std deviations, and R^2p
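
A rough sketch of the composite weight matrix idea in item 4, using a grid search over the mixing weight. This is an illustration under stated assumptions, not the procedure in Dow and Eff's code: the two W matrices are random, x and y are simulated, and the grid of 0 to 1 in steps of 0.1 is arbitrary.

```r
set.seed(7)
n <- 50

# Hypothetical helper: zero the diagonal and make rows sum to 1
row_norm <- function(M) {
  diag(M) <- 0
  M / rowSums(M)
}

# Random stand-ins for the distance and language weight matrices
Wdist <- row_norm(matrix(runif(n * n), n, n))
Wlang <- row_norm(matrix(runif(n * n), n, n))
x <- rnorm(n)
y <- rnorm(n)

# For each mixing weight, build W, run the two stages, and record R^2
lambdas <- seq(0, 1, by = 0.1)
r2 <- sapply(lambdas, function(lam) {
  W <- lam * Wdist + (1 - lam) * Wlang
  Wy <- as.numeric(W %*% y)
  Wx <- as.numeric(W %*% x)
  Wy_hat <- fitted(lm(Wy ~ x + Wx))
  summary(lm(y ~ Wy_hat + x))$r.squared
})

best <- lambdas[which.max(r2)]   # weight on distance that gives the highest R^2
```

With random matrices the chosen weight carries no meaning; with real distance and language matrices it would indicate the relative importance of the two transmission channels, as the quoted passage describes.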