Comparative research tools

From InterSciWiki

Wikipedia:Cross-Cultural Studies

Wikipedia:Galton's problem

These are instructional and resource pages so add, reorganize, explain, but keep it on topic!


Author(s) Doug White, Anthon Eff, Jörg Rössel and anonymous edits

Standard Cross-Cultural Sample (Ethnographic data) and Codebook

SCCS index of variables=CODEBOOK FOR SCCS VARIABLES (Codes)

2007 Standard Cross-Cultural Sample. International Encyclopedia of the Social Sciences, 2nd edition.

Using R with the SCCS

Using R for cross-cultural research (James Dow) and R with the SCCS dataset. Some of Dow's tutorial is out of date, but updates are in the QuikStart:

QuikStart R for comparative research
Download the data and R routines into the same R folder; the file is named sccs.RData (data and R programs).

R for statistics

Loading R packages

SPSS version of the SCCS data and tutorials: World Cultures SPSS file (right-click to download)

SPSS for Comparative research: Ethnographic Atlas SPSS file (right-click to download); see: Codebook

Correlation and Regression Using Autocorrelation in model specification

Practicum: Run Anthon Eff's SAR Procedures (on this page); you should get the same results as Eff.

The variables in the SCCS can be examined for spatial autocorrelation using statistics such as Moran's I or Geary's C. This kind of examination serves as exploratory data analysis, and has the potential of suggesting hypotheses which can then be specified in regression models.
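As a minimal sketch of these two statistics, the base R code below computes global Moran's I and Geary's C for six hypothetical societies placed on a line, with inverse-distance weights. The function names morans_i and gearys_c and all data values are illustrative assumptions, not SCCS material; for real analyses spdep provides moran.test and geary.test, which also report significance.

```r
# Global Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2)
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Geary's C: ((n - 1) / (2 * S0)) * sum_ij(w_ij * (x_i - x_j)^2) / sum_i(z_i^2)
gearys_c <- function(x, W) {
  n <- length(x)
  sq_diff <- outer(x, x, "-")^2
  ((n - 1) / (2 * sum(W))) * sum(W * sq_diff) / sum((x - mean(x))^2)
}

# Hypothetical locations and inverse-distance weights (toy values)
pos <- 1:6
W <- 1 / as.matrix(dist(pos))
diag(W) <- 0                      # a society is not its own neighbor

x_smooth <- c(1, 2, 3, 4, 5, 6)   # neighbors resemble each other
x_alt    <- c(1, 6, 1, 6, 1, 6)   # neighbors differ sharply

morans_i(x_smooth, W)   # positive: similar values cluster
morans_i(x_alt, W)      # negative: unlike values are adjacent
gearys_c(x_smooth, W)   # below 1 under positive autocorrelation
gearys_c(x_alt, W)      # above 1 under negative autocorrelation
```

Moran's I is built from cross-products of deviations from the mean, while Geary's C is built from squared differences between pairs, so the two move in opposite directions: positive autocorrelation pushes I above its null expectation and C below 1.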

A modified version of Moran's I can be used to test for the presence of autocorrelated residuals in an ordinary least squares regression model. Autocorrelated residuals indicate that the standard errors of the estimated coefficients are biased, which invalidates any inferences based on the t-statistics. A more serious implication is that independent variables may have been omitted, which biases not only the standard errors but also the estimated coefficients.
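The link between an omitted variable and autocorrelated residuals can be illustrated with simulated data. The sketch below is an assumption-laden toy, not an SCCS analysis: 40 hypothetical societies lie on a line, neighbors are adjacent positions, and a spatially smooth variable (trend) is left out of one of the two regressions. It computes only the raw Moran's I of the residuals; the modified test that accounts for the regression is available in spdep as lm.morantest.

```r
# Raw Moran's I statistic (hypothetical helper, not the spdep implementation)
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

set.seed(42)
n   <- 40
pos <- 1:n
W   <- (abs(outer(pos, pos, "-")) == 1) * 1   # adjacent positions are neighbors

x1    <- rnorm(n)                  # ordinary predictor, no spatial pattern
trend <- pos / 4                   # spatially smooth predictor
y     <- 1 + 2 * x1 + trend + rnorm(n, sd = 0.5)

fit_omit <- lm(y ~ x1)             # trend omitted
fit_full <- lm(y ~ x1 + trend)     # trend included

I_omit <- morans_i(residuals(fit_omit), W)
I_full <- morans_i(residuals(fit_full), W)
c(omitted = I_omit, included = I_full)
```

In this setup the residuals of the misspecified model inherit the spatial pattern of the omitted variable, so their Moran's I is strongly positive, while the residuals of the full model show little spatial structure.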

One approach to the presence of autocorrelated residuals is to respecify the model, so that omitted independent variables are now included, in the hope that autocorrelation will disappear. Changes in functional form might also help, such as considering polynomials in some of the independent variables. But since the relevant omitted variables are unknown and may well be unmeasured, this approach has limited applicability.

The most common approach is to add a spatial lag as an independent variable. A spatial lag is a weighted mean of the dependent variable across an observation's neighbors, where the weight is an inverse function of the distance to that neighbor. Thus, the spatial lag for an observation will be close to the values of the dependent variable among that observation's closest neighbors. Since closest neighbors are likely to have similar values of omitted variables, the spatial lag can serve as a proxy for these omitted variables, reducing the bias in the estimated coefficients.
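The construction of a spatial lag can be sketched in a few lines of base R. The example assumes five hypothetical societies in two well-separated clusters and uses 1/distance as the weight; the locations and values are invented for illustration, and any other decreasing function of distance could be substituted.

```r
# Hypothetical locations (two clusters) and values of the dependent variable
pos <- c(0, 1, 2, 100, 101)
y   <- c(10, 12, 11, 50, 52)

# Inverse-distance weights, zero on the diagonal
W <- 1 / as.matrix(dist(pos))
diag(W) <- 0

# Row-standardize so that each row sums to 1: the lag is then a weighted mean
Ws <- W / rowSums(W)

# Spatial lag: weighted mean of y over each observation's neighbors
lag_y <- drop(Ws %*% y)
round(lag_y, 2)
```

Each society's lag is dominated by its own cluster: the lag for the first society stays near the low values of its close neighbors, and the lag for the fourth stays near the high values of its single close neighbor.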

The main problem of the spatial lag model is that the spatial lag is endogenous: it is built from the dependent variable, so it is correlated with the error term. This requires that the estimation be done either with two-stage least squares (provided that one has appropriate instruments) or with maximum likelihood.
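A two-stage least squares estimate of the spatial lag model y = rho*W*y + X*b + e can be sketched in base R. Everything here is an illustrative assumption: the data are simulated on a ring of 200 hypothetical units, and the instruments for W*y are the spatially lagged regressors W*x and W*W*x, a common choice but not one the text prescribes. A packaged version of this estimator is stsls in the spatialreg package.

```r
set.seed(7)
n <- 200

# Ring lattice: each unit's two neighbors get weight 0.5, so rows sum to 1
W <- matrix(0, n, n)
for (i in 1:n) {
  left  <- if (i == 1) n else i - 1
  right <- if (i == n) 1 else i + 1
  W[i, c(left, right)] <- 0.5
}

# Simulate from the model with known parameters (rho = 0.5, slope = 2)
rho_true <- 0.5
x <- rnorm(n)
y <- solve(diag(n) - rho_true * W, 1 + 2 * x + rnorm(n, sd = 0.5))

Wy  <- as.vector(W %*% y)     # endogenous spatial lag
Wx  <- as.vector(W %*% x)     # instrument 1
WWx <- as.vector(W %*% Wx)    # instrument 2

Z <- cbind(const = 1, x = x, Wy = Wy)   # regressors
H <- cbind(1, x, Wx, WWx)               # exogenous variables plus instruments

# Stage 1: project the regressors on the instruments
Zhat <- H %*% solve(crossprod(H), crossprod(H, Z))
# Stage 2: regress y on the projected regressors
coef_2sls <- solve(crossprod(Zhat, Z), crossprod(Zhat, y))
coef_2sls
```

The row labeled Wy holds the estimate of rho. Because the instruments are functions of x only, they are uncorrelated with the error term by construction in this simulation, which is what the phrase "appropriate instruments" requires.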

Since spatial models (other than those fitted by two-stage least squares) are estimated by maximum likelihood, the usual R-squared is not available and one must calculate a pseudo-R-squared. An easy implementation used by Luc Anselin (1988: 212) is the square of the correlation between the fitted value and the actual value of the dependent variable.

  • Anselin, Luc. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.
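The pseudo-R-squared described above is a one-line computation. The sketch below uses a hypothetical helper name (pseudo_r2) and invented numbers; the OLS fit is included only to show that, for an ordinary regression with an intercept, the same quantity reproduces the usual R-squared.

```r
# Squared correlation between actual and fitted values
pseudo_r2 <- function(actual, fitted) cor(actual, fitted)^2

# Toy data (not from the SCCS)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
pseudo_r2(y, fitted(fit))     # pseudo-R-squared from fitted values
summary(fit)$r.squared        # ordinary R-squared, for comparison
```

For a spatial lag model one would pass the fitted values of that model in place of fitted(fit).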


A number of options have emerged for researchers estimating spatial models.

  • R has many excellent procedures for estimating spatial models. The package spdep, written primarily by Roger Bivand, includes SAR and has most of the features that a cross-cultural researcher would want: spatial lag and spatial two-stage least squares models, and local and global measures of autocorrelation. The open-source GIS GRASS interfaces with R. R is the first choice for many researchers with broad interests, since it contains excellent packages for a very wide range of statistical techniques (see, for example, R for networks).
  • Econometrics toolbox consists of MATLAB routines, written by James LeSage, with contributions by others. For those comfortable with MATLAB, LeSage's toolbox has become an excellent choice, particularly when working with models with very large numbers of observations (a consideration that doesn't apply to the SCCS). One caution: the toolbox works only with MATLAB, and not with open-source clones such as GNU Octave, or the MATLAB emulator on R.
  • GeoDa is a user-friendly package with an attractive GUI, including maps, developed by Luc Anselin and his collaborators. GeoDa is emerging as a popular choice for researchers who prefer a point-and-click GUI.
  • PySAL is a suite of spatial econometric methods, written in Python by Luc Anselin and Sergio Rey and their collaborators. At the time of writing, PySAL had not been disseminated beyond the small group that developed it. Python has become widely used in quantitative social science (see NetworkX for feedback networks), and especially in spatial analysis: it is now the preferred scripting language for ESRI's ArcGIS products.

Anthon Eff's SAR Procedures

Courtesy of E. Anthon Eff, emailed to Doug 12:53, 9 July 2007 (PDT). SAR is simultaneous autoregression, in which the autoregressive interaction (network effect) terms are inside the regression equation rather than in the error terms.


Dealing with Galton's problem in a spatial model requires that one have a weights matrix for physical distance and another weights matrix for language similarity.

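  1. langmat.dbf (the file name used in the script on this page) has the SCCS language similarity matrix from the Ethnographic Atlas (E. Anthon Eff), and
  2. distmat.dbf (the file name used in the script on this page) has the SCCS distance matrix (courtesy EAE)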

Background for these matrices can be found here:

Eff, E. Anthon. 2004. Does Mr. Galton Still Have a Problem? Autocorrelation in the Standard Cross-Cultural Sample. World Cultures 15(2):153-170. download

SAR code in R

Doug, If you put all of these in the same directory, and then change the working directory in the program to that directory's name, the program will carry out a ML estimation of a spatial lag model. Let me know if there are any problems... Anthon

Doug's substitution of v877 for v79

To install the package spdep, go to the menu bar and click 'Packages'; on the drop-down menu click 'Install package(s)...'. After being prompted for a mirror, you will be presented with a (very long) list of the available packages; scroll down until you find spdep, and click it. Some packages (such as foreign) are part of core R; spdep is not in core R, so you have to install it yourself. 'Load package' only calls in the packages you have already installed.
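The same installation can be done from the R console instead of the menus. This is a sketch under one assumption worth checking against your own installation: in current releases the model-fitting functions such as lagsarlm are distributed in the separate package spatialreg rather than in spdep, so both are installed here.

```r
# Install once per R installation (requires an internet connection;
# R will ask for a CRAN mirror if none has been chosen)
install.packages(c("spdep", "spatialreg"))

# Load in every session that uses the spatial routines
library(spdep)        # weights, Moran's I, Geary's C
library(spatialreg)   # lagsarlm and related model-fitting functions
```

install.packages corresponds to the 'Install package(s)...' menu entry, and library corresponds to 'Load package'.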

The model evaluated (January 2013): regression coefficients predicting female contribution to subsistence

  • femsubs~fishimp+huntimp+pathstress+rainfall+polygamy+eboysxp+fixres+landtrans+polinteg+socstrat
variables having .15 > p > .10: polygamy fixres landtrans
variables having .10 > p > .05: polinteg (+)
variables having .05 > p > .01: socstrat (-)
variables having .01 > p >.001: rainfall eboysxp (+) fishimp (-)
variables having .001>p: hunting path(ogen)stress (-)

Residual autocorrelation is nonsignificant

#Spatial Lag Model, one weight matrix, MLE
#datafile 1:
#This program:
library(foreign) #provides read.dbf for the weight matrices (and read.dta for STATA files)
library(spdep)   #spatial routines; must be installed before it can be loaded
#Change the following to your own working directory (note the UNIX-style forward slashes, even on a Windows machine)
#net=source("C:/Program Files/pajek/PAJEK/PajekR.r")
#setwd("C:/Program Files/pajek/PAJEK/")
setwd("C:/Program Files/pajek/PAJEK/Anthon_Eff-Autocorrelation")
#Read in the dbf format weight matrices; each dbf file is 186x186 (no row names)
#Comment out the matrix you do NOT want to use
lds<-read.dbf("langmat.dbf") #language phylogeny
#lds<-read.dbf("distmat.dbf") #great circle distance
#convert to matrix
ld<-as.matrix(lds) #the object name ld is an assumption; the original statement is not preserved
#take a quick look at the upper left hand corner to see that it is OK
ld[1:5,1:5]
#read in SCCS data. It is in STATA format, since this is numeric--there are problems with the SPSS version,
#since R imports the value labels from SPSS and the variables become non-numeric
length(gg[,1]) #the number of observations
length(gg[1,]) #the number of variables
#create a data frame containing our variables, also give the variables names
#since the estimation doesn't work with missing values, here we identify all observations with non-missing values
#here we restrict the weight matrix and data to include only non-missing values
length(df[,1]) #number of observations before dropping those with missing values
length(ffd[,1]) #number of observations after dropping those with missing values
#We estimate a spatial lag model
#this next displays parameter estimates and diagnostics for the spatial lag model
#if R reports 'could not find function "lagsarlm"', install spdep (Packages --> Install package(s)...) and check that library(spdep) has run;
#in current releases lagsarlm is provided by the spatialreg package
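The steps named in the comments of the program above (drop observations with missing values, restrict the weight matrix to match, estimate a spatial lag model by maximum likelihood) can be sketched end to end in base R. This is not Eff's code and uses no SCCS data: the societies, weights, and variables are simulated, the names W_full, conc_loglik, rho_hat and beta_hat are hypothetical, and the likelihood is maximized by a simple one-dimensional search over rho. With real data, the packaged route would be lagsarlm applied to a listw object built from the weight matrix with mat2listw.

```r
set.seed(1)
n <- 150

# Simulated stand-in for a weight matrix: inverse squared distance on a line
pos <- 1:n
W_full <- 1 / as.matrix(dist(pos))^2
diag(W_full) <- 0

# Simulated data from a spatial lag model with rho = 0.6 and slope = 2
x  <- rnorm(n)
Wr <- W_full / rowSums(W_full)
y  <- solve(diag(n) - 0.6 * Wr, 1 + 2 * x + rnorm(n, sd = 0.5))
df <- data.frame(y = y, x = x)
df$x[c(5, 17, 40)] <- NA            # pretend three societies are uncoded

# Identify observations with non-missing values; restrict data and weights
keep <- complete.cases(df)
ffd  <- df[keep, ]
W    <- W_full[keep, keep]
W    <- W / rowSums(W)              # row-standardize after subsetting
nrow(df)                            # observations before dropping
nrow(ffd)                           # observations after dropping

# Concentrated log-likelihood of y = rho*W*y + X*b + e as a function of rho
X  <- cbind(1, ffd$x)
Wy <- as.vector(W %*% ffd$y)
m  <- nrow(W)
conc_loglik <- function(rho) {
  e    <- lm.fit(X, ffd$y - rho * Wy)$residuals
  ldet <- as.numeric(determinant(diag(m) - rho * W, logarithm = TRUE)$modulus)
  ldet - (m / 2) * log(sum(e^2) / m)
}

# Maximize over rho, then recover the regression coefficients
rho_hat  <- optimize(conc_loglik, interval = c(-0.99, 0.99), maximum = TRUE)$maximum
beta_hat <- lm.fit(X, ffd$y - rho_hat * Wy)$coefficients
rho_hat
beta_hat
```

The weight matrix is subset with the same logical index as the data so that rows, columns, and observations stay aligned, and it is row-standardized only after subsetting so that every remaining row still sums to 1.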


Other research tools

Simultaneous autocorrelation regression

