From InterSciWiki
- .Dow-Eff_Functions_-_DEf is under Creative Commons License: Attribution-NonCommercial 2.0 Generic (CC BY-NC), Doug (talk) 2013. See also: Install R packages for DEf; Wiki4R_Codebooks.htm; Here's the key to teaching on-line classes with Dow-Eff software; How to share histories and see Dow-Eff References.


The Dow-Eff functions (DEf) resemble other R packages for regression and SEM in some respects, but few such packages or commercial programs include controls for autocorrelation using a fast, efficient, and statistically well-tuned two-stage least squares (2SLS) approach. DEf is a specific type of instrumental variable (IV) regression that can solve autocorrelation models with reasonably large samples and multiple endogenous variables (see Dow 2013 above). The IV approach addresses the correlation between the autocorrelation term and the error term, which is expected from the way the language, inverse-distance, or other IV terms are defined. DEf is not an official R package as yet because it is accessed from a dropbox and is still in development. It uses the standard Donald Rubin MI (multiple imputation) package in R to estimate missing data. No commercial package has the advantages of this approach, which offers a solution to "Galton's problem" of variables that are nonindependent due to network interactions among the observations, i.e., nonindependence among cases, not just among variables. DEf applies to a rectangular database of cases (rows) and variables (columns) but must be supplemented in two ways. First, missing data imputation (MI) requires a sizeable set of completely coded variables, which are analyzed into principal components that measure different systemic aspects of the multivariate structure of the dataset and from which missing values are imputed probabilistically. Second, network lag solutions for estimating autocorrelation effects require appropriate network data (geographic, linguistic, riverine, road, and political or religious closeness, for example).
The proximity W matrices have zeros on the diagonal and are row normalized to unity, so that the products Wy and WX (y being the dependent variable in a linear regression and X the set of columns of independent variables x_i) act as transformations of variables that measure potential effects of autocorrelation. A small number of such proximity W matrices is usually sufficient to eliminate endogeneities between error terms and independent variables that otherwise tend to foil the assumption of independent and identically distributed (iid) error terms. The iid assumption is essential to interpreting regression results. As predictors of "neighborhood" and other network effects, the candidates for influential W matrices can themselves be optimally weighted in a first stage of regression analysis that predicts the extent to which there are weighted network effects of independent-variable characteristics on the dependent variable. In this first stage the parameterized autocorrelation effects are estimated (regression equation 1), and equation 2 defines an estimate of Wy by dropping the error term from equation 1. The Xs are columns of independent variables x_i, y is the dependent variable, and the dot above Wy and the constants in equation 2, and above Wy in equation 3, marks an estimate obtained from equation 1. The products Wy and WX are defined above.

Wy = \alpha_0 + \sum_{i=1}^{n} \alpha_i (WX_i) + \epsilon (Eqn 1).
\dot{Wy} = \dot{\alpha}_0 + \sum_{i=1}^{n} \dot{\alpha}_i (WX_i) (Eqn 2).

The estimated result is added to the raw independent variables in a second-stage least squares (2sls) regression:

y = \beta_0 (\dot{Wy}) + \beta_1 + \sum_{i=2}^{n+1} \beta_i X_i + \epsilon' (Eqn 3).

The second-stage OLS effectively subtracts the weighted \dot{Wy} autocorrelation term from the dependent variable y. If autocorrelation is well specified, the error term \epsilon' in the second-stage equation (equation 3) will be exogenous in relation to the independent variables X_i. Hausman tests are used to test the null hypothesis of exogeneity (H0) for each independent variable separately.
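The two stages in equations 1-3 can be illustrated with a minimal sketch in plain Python. It is not the DEf code, which is written in R: the function names (`solve`, `ols`, `lag`, `two_stage`), the six-case path-graph W matrix, and the data values are all hypothetical, and the sketch leaves out imputation, the weighting of several W matrices, standard errors, and the Hausman tests.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients of y on the regressor columns (normal equations)."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)


def fitted(columns, coefs):
    """Fitted values: the linear combination of the columns with the given coefficients."""
    return [sum(c * col[r] for c, col in zip(coefs, columns))
            for r in range(len(columns[0]))]


def lag(w, v):
    """Network lag Wv: for each case, the W-weighted average of the other cases' values."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def two_stage(w, xs, y):
    """Sketch of equations 1-3 for one W matrix and a list of regressor columns xs."""
    ones = [1.0] * len(y)
    stage1_cols = [ones] + [lag(w, x) for x in xs]
    alpha = ols(stage1_cols, lag(w, y))      # Eqn 1: regress Wy on intercept and WX_i
    wy_hat = fitted(stage1_cols, alpha)      # Eqn 2: estimate of Wy, error term dropped
    beta = ols([wy_hat, ones] + xs, y)       # Eqn 3: regress y on estimated Wy, intercept, X_i
    return alpha, wy_hat, beta


# Hypothetical data: six cases on a line, each linked to its immediate neighbours.
w = [[0, 1, 0, 0, 0, 0],
     [0.5, 0, 0.5, 0, 0, 0],
     [0, 0.5, 0, 0.5, 0, 0],
     [0, 0, 0.5, 0, 0.5, 0],
     [0, 0, 0, 0.5, 0, 0.5],
     [0, 0, 0, 0, 1, 0]]
x = [1.0, 2.0, 4.0, 3.0, 6.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 8.0, 6.0]

alpha, wy_hat, beta = two_stage(w, [x], y)
print("stage 1 coefficients (alpha_0, alpha_1):", alpha)
print("stage 2 coefficients (beta_0 on Wy, beta_1 intercept, beta_2 on x):", beta)
```

The coefficient order in `beta` follows equation 3: the first entry multiplies the estimated Wy term, the second is the intercept, and the rest belong to the X columns.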

The W in equations 1-3 is derived from the solution of a biparametric spatial lag regression model (Brandsma and Ketellapper 1979; Dow 1983; Dow 2007) that calculates weighted additive effects of distance (D), language phylogeny (L) and potentially other W matrices on the dependent variable y (Eff 2008a):

[Figure: equation of the biparametric spatial lag model (two screenshots on the original page). Note: y in the figure should read Wy.]

A third step of analysis uses the imputed (X.h) variables from the DEf output, which may be modeled by ordinary least squares (OLS) with and without the Wy term, thus comparing OLS with and without the correction for autocorrelation:

y = \beta_0 (\dot{Wy}) + \beta_1 + \sum_{i=2}^{n+1} \beta_i X.h_i + \epsilon' (Eqn 4)
y = \beta_1 + \sum_{i=2}^{n+1} \beta_i X.h_i + \epsilon (Eqn 5)

These two final equations allow an evaluation, over many models and their dependent variables, of how much OLS without the autocorrelation correction overstates significance, and of other problems with OLS.

Rationale: Fields like cross-cultural research, and most social, behavioral and observational studies, almost invariably involve effects of autocorrelation and so require the kinds of statistical analyses provided by the DEf R software, as explained above. Like most R software, DEf is provided by its authors, in this case Eff and Dow. Without new open-access software like DEf, fields like cross-cultural and behavioral research, or the social sciences generally, will find it very difficult to upgrade the quality of research results from observational studies and samples, which are ubiquitous. Open-access data, open-access journals that publish data and software, sharable courseware, and open access to social science gateways that provide researchers, instructors and students with data and research tools make an enormous contribution at little cost. They are provided here under Creative Commons licenses that allow data, software, publications, and access to science gateways to diffuse research capabilities freely, or at fractional cost underwritten by universities, departments and academic computing. In the present case the cost of $55/month for UCI computing services is underwritten by the funds of benefactors to the MBS Program in Social Dynamics and Complexity. Access to XSEDE is provided through SDSC/UCSD's participation in National Supercomputing, underwritten by NSF. The editing of a forthcoming Wiley publication, the Companion to Cross-Cultural Research, supports the work of the four Companion editors on the online DEf project and on some other aspects of the CoSSci gateway projects. The Wiley book is slated for publication in 2015 at the earliest but is partly available online. The editors and PIs engaged in the DEf and CoSSci projects look forward to speedy completion, helped by contributions from authors whose chapters exemplify the use of the DEf R programs and extend the available databases for this project.

UNrestricted variables that include the set of restricted variables in the regression

DEf calculates a set of UNrestricted variables: some are used in the regression, others are needed as covariates in the imputation of missing data (the variables in the regression are not always sufficient), others are created by the user to give flexibility to the set of variables to be explored, and still others are there because they form parts of alternative theoretical models to be explored. In part two of the output, all the UNrestricted variables are correlated with the dependent variable and shown with null-hypothesis significance tests. Part three of the output shows regression results. Significance tests will therefore differ between the part 2 and part 3 results because part 3 controls for autocorrelation.


DEf regression, then, is not a simple approach: it cannot be done with simple regression or with other single R packages but involves the coordination of several packages and of new and old R functions. Further, while DEf could take a single set of independent variables, like ordinary regression, Anthon Eff has programmed data entry in four useful "layers" of variables: those in the actual regression, those regressors the modeler wants to compare with the independent variables, those to impute (inclusive of the first two sets), and the totally inclusive set of all variables in the dataset. Each DEf regression generates a list of variables "to try", compiled from single variables that would fit the model if added one at a time, given the usual caveats, including independence of meaning in defining the variables and avoidance of collinearity, endogeneity, and the like. Used well, such regressions allow the modeler to optimize, over many iterations, an extensive set of highly significant variables, provided they are related in a meaningful Bayesian theoretical context and are predictive regressors of the dependent variable. Because only the stage-2 OLS variables are iterated, each new test of parameters that helps to improve the regression model computes very quickly. Iterations created by the user strongly tend to converge but may require two dozen or more "to try" changes before there is no further improvement. For this reason we employ the UCI Galaxy and Virtual Machine (VM) computers, with windows for each of the three types of variables: independent, comparative, and others to be imputed. The imputed variables can be accessed with function "aa" when requested after execution.
Having an external computer that contains the DEf script, one that is also executable as an R script on a personal computer, makes it much simpler to do a series of model iterations that arrive at a relatively final DEf regression model for a given dependent variable. Iterative modeling of DEf scripts on a PC is rather complicated for students because of the extensive editing required in each iteration, but it is straightforward for those highly experienced in rewriting R scripts.

Galaxy link to VM execution

Model execution on the mainframe UCI Galaxy computer allows two options. One is to offload the computation to the UCI Virtual Machine (VM), which computes fast and returns results quickly. This is useful for further iteration because it facilitates submission of modified variables in the next iteration. Ten functions, h[1] to h[10], provide results that are useful in obtaining a convergent model. In the final stages of modeling, tests are available that calculate which of the variables with lesser significance can be eliminated according to conservative Holm-Bonferroni tests for groups of variables. Bayesian MCMCregress tests are available on the VM to calculate the improvement in goodness-of-fit from eliminating single variables. This allows modelers to arrive at robust results in successive steps of elimination. It is not unusual for over a dozen plausible variables in the Bonferroni and BayesFactor (MCMCregress) tests to converge to a final model in which all variables reach significance of p < .005 and R2 > .60. Even with cross-cultural variables that are often imprecisely measured, results of this sort reflect robust models after removing autocorrelation effects.
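As a rough illustration of the Holm-Bonferroni step mentioned above, the following Python sketch applies the standard Holm step-down rule to a set of p-values. It is only a sketch of the general procedure, not the DEf implementation: the function name `holm_retained`, the variable names, and the p-values are hypothetical, and DEf's handling of groups of variables may differ.

```python
def holm_retained(p_values, alpha=0.05):
    """Return the names whose null hypothesis is rejected by Holm's step-down rule."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    kept = []
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p > alpha / (m - rank):
            break  # first failure: this and every larger p-value is not rejected
        kept.append(name)
    return kept


# Hypothetical p-values for four candidate regressors.
p_values = {"v1": 0.001, "v2": 0.012, "v3": 0.03, "v4": 0.2}
kept = holm_retained(p_values)
dropped = sorted(name for name in p_values if name not in kept)
print("retained:", kept)
print("candidates for elimination:", dropped)
```

With these values the thresholds are 0.05/4, 0.05/3, 0.05/2 and 0.05, so the procedure stops at the third variable. Variables that fail the test are only candidates for elimination; the modeler still decides on theoretical grounds.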

VM, PC and XSEDE execution as Galaxy options

Once robust and theoretically significant models are developed, the three windows used for VM modeling (as above) can be used to submit a model to the "CoSSci" Gateway at the Extreme Science and Engineering Discovery Environment (XSEDE), on the SDSC Trestles supercomputer. Models can be submitted separately or form part of a complex analysis along with other models saved within the "CoSSci" Gateway network. The larger jobs may be designed to pose bigger questions about systems of variables, path analysis, panel analysis for temporal sequences, and the like. The VM facilitates interactive modeling to obtain final DEf models within hours, or within a few days if time is needed to become familiar with DEf, and similarly for other workflow modeling operations. Modeling at the XSEDE site may handle single or larger jobs, while raw DEf scripts on a PC are more "hands-on" but require additional time and expertise. For students and classes, VM modeling is optimal.

Alerts for Squared (sq) Nonlinear variables

Regression analysis can handle nonlinear variables by a squared term, referred to for a sample variable v51 as v51sq. The DEf function doMI automatically creates squared terms along with their descriptions. The code is currently set so that the only variables so treated are those with at least 7 (6?) unique values, and where the absolute value of the maximum value is less than 100. If this condition is met for v51, for example, v51sq may be entered into an R model.

Tom, if Anthon's criteria are met we can use v277, v277sq and such without New Variable.
My bet is that when Anthon's criteria are not met we put v77 in the restricted window and v77sq in as a new variable.
  • In any case the original variable must go in as a restricted variable if the sq version goes in. See: Comments56: the function doMI automatically creates squared terms along with their descriptions. The code is currently set so that the only variables so treated are those with at least 6 unique values, and where the absolute value of the maximum value is less than 100.
  • Anthon: Don't do this for now: I can lower the first if you want, maybe to 3 or 4. The second could perhaps be raised a bit, perhaps to 200 or 300. But changing those parameters brings a cost in that collinearity becomes more common.


-- thanks, Anthon, spot on Doug White
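The eligibility rule for squared terms described in the comments above can be sketched as follows. This is a hypothetical Python illustration, not the doMI code, which is written in R. The function names, the default thresholds (6 unique values, maximum below 100), the treatment of missing values as `None`, and the sample data are assumptions made for the example.

```python
def squared_term_eligible(values, min_unique=6, max_abs=100):
    """Check the two stated criteria on the non-missing values of one variable."""
    observed = [v for v in values if v is not None]
    if not observed:
        return False
    enough_levels = len(set(observed)) >= min_unique
    # Literal reading of the rule: the absolute value of the maximum value is below the cap.
    small_enough = abs(max(observed)) < max_abs
    return enough_levels and small_enough


def add_squared_terms(data, min_unique=6, max_abs=100):
    """Return new columns named <variable>sq for every eligible variable in data."""
    squared = {}
    for name, values in data.items():
        if squared_term_eligible(values, min_unique, max_abs):
            squared[name + "sq"] = [None if v is None else v * v for v in values]
    return squared


# Hypothetical variables: v51 qualifies, v52 has too few unique values, v53 exceeds the cap.
data = {
    "v51": [1, 2, 3, None, 5, 6, 7, 8],
    "v52": [0, 1, 0, 1, 1, 0, 1, 0],
    "v53": [10, 50, 120, 30, 60, 70, 80, 90],
}
squared = add_squared_terms(data)
print(sorted(squared))
```

Lowering `min_unique` or raising `max_abs` admits more squared terms, at the cost noted above that collinearity becomes more common.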

Usage permissions

Funding for the UCI Galaxy and VM that allows free usage is provided by the UCI Principal Investigator Douglas R. White.

Independent research and classroom projects may apply to the UCI PI for usage, whether for funded projects or for classes. PC scripts for DEf are obtained at .Dow-Eff_Functions_-_DEf

All usage is subject to Creative Commons License: Attribution-NoDerivs-NonCommercial 1.0 Generic (CC BY-ND-NC 1.0). See: Creative_Commons#NonCommercial

DEf- Fields of Application and Data sharing

See: Data sharing for a glimpse of just how important it is for scientific journals, websites, and online gateways to provide open access to data, which helps to guarantee that research results can be replicated.

DEf is not a special-purpose application for specific fields, but it does require for each dataset a small number of network W matrices (zero on their diagonals and row normalized to one) to allow for the inclusion of autocorrelation results. The product of a W matrix with each variable (thus Wy or a series of WX) in a regression model transforms each x in X into a comparable variable in which each new value is the average of the other cases' values on that variable, weighted by some measure of proximity in the given W. A relatively small number of such measures of proximity may be sufficient to capture the fact that there are clusters of similar cases in samples of observed cases and to ensure that the common effects of these clusters are taken into account. This is Galton's objection to treating any sample of nonindependent observations as if they came from completely different universes. Dealing with this problem is a sine qua non for interpreting statistical results from any observational sample whatsoever. Nonindependence of cases typically reduces the effective "n" of independent cases in observational samples. Typically the autocorrelations of variables in a sample are far stronger than the first-order correlations.
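The construction of a W matrix and of the product Wx described above can be illustrated with a small Python sketch. The function names `row_normalize` and `lag`, the three-case proximity matrix, and the values of x are hypothetical and serve only to show the arithmetic.

```python
def row_normalize(proximity):
    """Zero the diagonal of a proximity matrix and scale each row to sum to one."""
    w = []
    for i, row in enumerate(proximity):
        off_diagonal = [0.0 if j == i else float(v) for j, v in enumerate(row)]
        total = sum(off_diagonal)
        if total == 0:
            raise ValueError(f"case {i} has no proximity to any other case")
        w.append([v / total for v in off_diagonal])
    return w


def lag(w, x):
    """Product Wx: for each case, the proximity-weighted average of the other cases' values."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


# Hypothetical proximities among three cases (the diagonal entries are ignored).
proximity = [[9, 1, 3],
             [1, 9, 1],
             [3, 1, 9]]
x = [10.0, 20.0, 40.0]

w = row_normalize(proximity)
wx = lag(w, x)
print(wx)  # [35.0, 25.0, 12.5]
```

For the first case the weights on the other two cases are 1/4 and 3/4, so its lagged value is 0.25 × 20 + 0.75 × 40 = 35: the case's own value plays no part, only those of its neighbours.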

At one extreme, "Big data" can and should be studied alongside the W network matrix proximities and their products used to locate autocorrelation effects. It is said of "big data" that there are three game changers: from some data to all, from clean to messy, and from causation to correlation (Cukier and Mayer-Schoenberger 2013:31-32). But as with any other observational data, when autocorrelation is taken into account in "big data," correlations divide into two parts: what can be predicted from the contexts of observation (how different neighborhoods of independent variables, in their linkages and types, affect the neighborhoods of a dependent variable, to employ the language of regression) and what residual variation in the dependent variable can be predicted from the independent variables at the level of the unit of observation. That is, correlations change depending on whether context is taken into account. "Big data," then, does not have all the answers.

DEf is able to address these kinds of questions, and when moved to the supercomputer context needed for "big data" analysis, DEf is highly effective.

But every application has to consider context: autocorrelation results are usually much more predictive (i.e., show larger correlations at the level of the networks of context) than residual correlations once autocorrelations are calculated into a regression equation (equations 1 and 2 above). Both the local attributes and the network contexts of the units of observation that carry these attributes have correlations, some of which may reflect causality and others not. It will often be true that large amounts of messy data will trump small amounts of clean data, as in Google's massive translation database in 65 languages, which provides better new translations than IBM's reliance on fewer high-quality translation templates. But it is the distinction between contexts and isolates that is addressed by DEf and regression with autocorrelation, and it is here that context will tend to trump isolates.

Cross-cultural comparison often relied on the ethnographic case study to provide the units of observation within which variables can be measured. Cross-national or cross-polity comparison uses nations or polities to provide the units of observation within which variables can be measured. Family studies or samples of individuals as the observations provide analogous units from which research results can be extrapolated from the variables measured. Salvador Minuchin introduced the family network as the "case" in psychotherapy. Ragin and Becker introduced the question What is a Case? as a basic foundation of social inquiry. Economics has to locate a network of exchanges in which the predominant methodology has been to assume the autonomy and independence of individuals (methodological individualism). Thus abstracted individuals are the idealized interacting units of classical economics. Network economics is rapidly replacing classical economics by including network autocorrelation in the modeling.

DEf may be oriented to any or all of these kinds of observational comparisons. The problems of inferring causality are not much different even when the units of observation are extended to experimental studies. Whether the nicotine in "smoking" cigarettes causes cancer, for example, cannot be established by correlation. Stronger mechanisms must be specified and tested, like the formation of tar on the lungs and how effects differ with the new "smokeless" cigarettes. Observational studies can find appropriate tests for causation, but not without specifying context and mechanisms, and not without separating the predictions of network predictors that link observed cases through time from those of isolated-feature predictors observed in independent cases. That is what DEf is about as it applies to many different kinds of studies. DEf can be extended to include time series by taking either the imputed variables or the stage-2 variables (imputed and adjusted for autocorrelation) and applying elements of the R systemfit package along with the R plm package, attuned to the temporal sequence of cases.

Because of autocorrelation it is absurd for researchers collecting samples of observations to claim that their statistical results are more valid because the sample is "independent" of other samples. To the contrary, because each new sample requires its own specific construction of W matrices, it is far more valuable for researchers to participate using larger samples "in common" such as repeat surveys or common ethnographic cases in cross-cultural surveys in which new variables are added to the database with each new sample survey (reading the same ethnographies to code new variables and contribute them to an open-source database).

Finally, what about large scale surveys? "How will the field of stratification, or social inequality, respond, and what data they might use for W? They normally only have national survey data, which typically contain no good proximity data" asks sociologist Jeroen Bruggeman. One answer is that for large-scale datasets, large-scale W matrices can be used (distance, freeway time, contiguity in blocks of ethnicities, reported countries of origin, etc.). Far-out autocorrelation effects can be as strong as close-up effects. W matrices can be large and still easy to multiply by variables with large n. Virtual Machines and supercomputers are just as easily employed for large surveys as for small.


  • Brandsma, A. S., and R. H. Ketellapper. 1979. A biparametric approach to spatial autocorrelation. Environment and Planning A 11(1):51–58.
  • Cukier, Kenneth, and Viktor Mayer-Schoenberger. 2013. The Rise of Big Data: How it's Changing the Way We Think About the World. Foreign Affairs 92(3):28-40.