Fish1

From InterSciWiki
dx$fish1<-dx$fishing ; z<-which(dx$hg142!=1) ; is.na(dx$fish1[z])<-TRUE

Purpose of UNrestricted

Amber Johnson - One function of the UNrestricted model variables, where the restricted model variables do not already provide it, is to "--create variables to use as covariates--"; here "dspmov", "numfam", "numg3", and "hougrp2" serve that function. In an early stage of DEf01d these variables also played the role of diverse variables with missing data; four such variables seemed minimal for covariates. (talk) 14:11, 22 February 2014 (PST)

Keeping track of errors (not in dataset)

Doug - those were among the variables I just asked if Anthon could add to LRB. I haven't heard when he will have time to do that. At this point it seems best to wait for variables to be added to that file. - Amber On Feb 28, 2014 3:18 PM, "Douglas White" <douglas.white@uci.edu> wrote:

   There were no coklm, cvtemp, ptoae, ptorun, waccess, or sucstab2 variables. Lukasz has very nicely fixed the output so that input variables can be seen immediately with (i), which is "view errors." That's good!
   You don't see all the errors until you also click stderr inside (i).
   So we all have to remember:
   (1) check http://intersci.ss.uci.edu/wiki/index.php/List_of_LRB_variables before you give variable lists, checking all windows -- a variable you thought existed might not be there
   (2) click stderr inside (i) as a 2nd stage to locate CoSSci errors
   (3) once the missing-variable errors were surmounted, I got an "Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : \n  NA/NaN/Inf in 'x'\nCalls: lm -> lm.fit\n" but that was because cvtemp did not exist. In a new round I took out ptoae.
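
Step (1) above can be partly automated before a run. The following is a minimal sketch, not part of the Dow-Eff functions: it assumes the dataset is available as the data frame dx (as after setDS("LRB")), and uses a small invented stand-in for dx so that it runs on its own. The names wanted, missing_vars, and usable are hypothetical.

```r
# Toy stand-in for the LRB data frame dx; the real one is created by setDS("LRB").
dx <- data.frame(fishing = c(10, 0, 35), et = c(12.1, 15.3, 9.8), growc = c(3, 7, 5))

# Variables we intend to pass to the model (cvtemp and ptoae are absent here on purpose).
wanted <- c("fishing", "et", "cvtemp", "ptoae")

# setdiff() returns the requested names that are not columns of dx.
missing_vars <- setdiff(wanted, names(dx))
if (length(missing_vars) > 0) {
  message("Not in dx: ", paste(missing_vars, collapse = ", "))
}

# Keep only the names that exist, so later steps do not fail on absent columns.
usable <- intersect(wanted, names(dx))
```

Running such a check first would report absent variables by name, rather than leaving them to surface later as an lm.fit error.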

# --identify variables to keep for model building

Comments163#Variables

http://socscicompute.ss.uci.edu/u/drwhite/h/dougforlrbamber

evm<-c("fish1","et","growc", "dspmov","numfam","numg3","hougrp2", "bio.3","bio.4","bio.12","bio.13","bio.16","bio.18","bio.9Sq") #bio.12Sq","bio.13Sq","bio.3Sq","bio.8Sq","et:bio.8","et:Wy","growc:Wy","meanaltSq","mnnppSq","sdalt","dspmov"

 smi <- doMI(evm, nimp = 3, maxit = 4) 
#dim(smi)
No problem of decimals in

Script

library(mice)
library(foreign)
library(stringr)
library(AER)
library(spdep)
library(psych)
library(geosphere)
library(relaimpo)
library(linprog)
library(dismo)
library(forward)
library(pastecs)
library(classInt)
library(maps)
library(plyr)
library(aod)
library(reshape)
library(mapproj) 
#library(map) used by Eff
library(RColorBrewer)
 library(XML)
 library(tm)
 library(mlogit)
#The Dow-Eff functions, as well as the four ethnological datasets, are contained in an R-workspace, located in the cloud.
#load(url("http://dl.dropbox.com/u/9256203/DEf01.Rdata"), .GlobalEnv)
#1/1/2014load(url("http://dl.dropbox.com/u/9256203/DEf01c.Rdata"), .GlobalEnv) #with more libraries
load(url("http://dl.dropbox.com/u/9256203/DEf01d.Rdata"), .GlobalEnv) 
ls()  #-can see the objects contained in DEf01.Rdata
#The setDS( xx ) command sets one of the four ethnological datasets as the source for the subsequent analysis. The four valid options are: “WNAI”, “LRB”, “EA”, “SCCS”. The setDS() command creates objects:
setDS("LRB")
# ===list and modify variables for use in model===
# --make new variables--
# DEf01d LRB DEf01d
# each squared variable is built from the variable it is named after
dx$bio.3Sq<-dx$bio.3^2
addesc("bio.3Sq","bio.3Sq")
dx$bio.4Sq<-dx$bio.4^2
addesc("bio.4Sq","bio.4Sq")
dx$bio.8Sq<-dx$bio.8^2
addesc("bio.8Sq","bio.8Sq")
dx$bio.12Sq<-dx$bio.12^2
addesc("bio.12Sq","bio.12Sq")
dx$bio.13Sq<-dx$bio.13^2
addesc("bio.13Sq","bio.13Sq")
dx$bio.14Sq<-dx$bio.14^2
addesc("bio.14Sq","bio.14Sq")
dx$bio.15Sq<-dx$bio.15^2
addesc("bio.15Sq","bio.15Sq")
dx$bio.16Sq<-dx$bio.16^2
addesc("bio.16Sq","bio.16Sq")
dx$bio.17Sq<-dx$bio.17^2
addesc("bio.17Sq","bio.17Sq")
dx$bio.18Sq<-dx$bio.18^2
addesc("bio.18Sq","bio.18Sq")
#"et:bio.8","et:Wy","growc:Wy","meanaltSq","mnnppSq","sdalt","dspmov"

# --identify variables to keep for model building--
dx$fish1<-dx$fishing
z<-which(dx$hg142!=1)
is.na(dx$fish1[z])<-TRUE
#dx$fishing<-dx$drain #to try to eliminate Warning messages: 1: attempting model selection on an essentially perfect fit is nonsense 
evm<-c("fish1","et","growc","hougrp2",    "dspmov","numfam","numg3",    "bio.3","bio.4","bio.12","bio.13","bio.16","bio.18","bio.9Sq",  "lati","long","meanalt","sdalt","mnnppSq",

"defper", "gatherin", "ptorun", "rrcorr2", "sdtemp", "wret", "lcoklm", "ldefper", "lptorun", "lsnowac")

# these have the needed missing values: "fish1","dspmov","numfam","numg3",
#bio.12Sq","bio.13Sq","bio.3Sq","bio.4Sq","bio.8Sq","et:bio.8","et:Wy","growc:Wy","meanaltSq","sdalt","dspmov"
 smi <- doMI(evm, nimp = 6, maxit = 7) 
 smi <- doMI(evm, nimp = 3, maxit = 4) 
#dim(smi)
dim(smi)  # dimensions of new dataframe smi
#smi[1:2, ]  # first two rows of new dataframe smi.
#Missing values of these variables are then imputed, using the command doMI(). Here, the number of imputed datasets is 3, and 4 iterations are used to estimate each imputed value (these values are too low: nimp=10 and maxit=7 are the defaults and are reasonable for most purposes). The stacked imputed datasets are collected into a single dataframe which here is called smi.
#This new dataframe smi will contain not only the variables in evm, but also a set of normalized (mean=0, sd=1) variables related to climate, location, and ecology (these are used in the OLS analysis to address problems of endogeneity). In addition, squared values are calculated automatically for variables with at least three discrete values and maximum absolute values no more than 300. These  squared variables are given names in the format variable name+“Sq”.
#Finally, smi contains a variable called “.imp”, which identifies the imputed dataset, and a variable called “.id” which gives the society name. 
#smi <- doMI(evm, nimp = 5, maxit = 7) 
#smi <- doMI(evm, nimp = 2, maxit = 3)
#All of the variables selected to play a role in the model must be found in the new dataframe smi. Below, the variables are organized according to the role they will play.
#The command doOLS() estimates the model on each of the imputed datasets, collecting output from each estimation and processing them to obtain final results. To control for Galton's Problem, a network lag model is used, with the user able to choose a combination of geographic proximity (dw), linguistic proximity (lw), and ecological similarity (ew) weight matrices. In most cases, the user should choose the default of dw=TRUE, lw=TRUE, ew=FALSE.
# --dependent variable--
dpV <- "fish1"
oxog <- c("NULL")
#-independent variables in UNrestricted model--
UiV<-c("et","growc","hougrp2",      "dspmov","numfam","numg3"     ,"bio.3","bio.4","bio.12","bio.13","bio.16","bio.18","bio.9Sq",   "lati","long","meanalt","sdalt",  

"defper", "gatherin", "ptorun", "rrcorr2", "sdtemp", "wret", "lcoklm", "ldefper", "lptorun", "lsnowac") #, "bio.12Sq","bio.13Sq","bio.3Sq","bio.4Sq","bio.8Sq",,"bio.18")

#--independent variables in restricted model--
RiV<-c("defper", "gatherin", "ptorun", "rrcorr2", "sdtemp", "wret", "lcoklm", "ldefper", "lptorun", "lsnowac") 
RiV<-c("defper", "gatherin", "ptorun", "rrcorr2", "sdtemp", "lcoklm", "ldefper", "lptorun", "lsnowac") # "wret",
RiV<-c("defper", "gatherin", "ptorun", "rrcorr2", "sdtemp", "lcoklm", "ldefper","lsnowac") #  "lptorun", "wret",  ADD: dspmov	meanalt	numg3
RiV<-c("defper", "gatherin", "ptorun", "rrcorr2", "sdtemp", "lcoklm", "ldefper","lsnowac","numg3")  #"dspmov","meanalt",
h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = NULL, dw = TRUE, lw = TRUE, ew = TRUE, stepW = TRUE, boxcox = FALSE, getismat = FALSE, relimp = TRUE, slmtests = FALSE, haustest = NULL, mean.data = TRUE, doboot = 1000) #Works with DEf01c and DEf01d
CSVwrite(h, "/Users/drwhite/Documents/Fish1.ew", FALSE)
#STOP HERE
h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = NULL, dw = TRUE, lw = TRUE, ew = FALSE, stepW = TRUE, boxcox = FALSE, getismat = FALSE, relimp = TRUE, slmtests = FALSE, haustest = NULL, mean.data = TRUE, doboot = 1000) #Works with DEf01c 
CSVwrite(h, "/Users/drwhite/Documents/Fish1No.ew", FALSE) # DEf01b WNAI DEf01c WNAI ----------------------- ew = FALSE
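
The comments in the script say that doOLS() estimates the model on each imputed dataset and then processes the results into final estimates. A common way to combine estimates across imputations is Rubin's rules; whether doOLS() uses exactly this formula is an assumption here, and the sketch below does not call the Dow-Eff functions at all. It uses invented data, stacked with an .imp column as described for smi, and plain lm().

```r
set.seed(1)

# Toy stacked imputations: m imputed copies of n cases, identified by .imp
# (the column name follows the description of smi; the data are invented).
n <- 20
m <- 3
x <- rnorm(n)
stacked <- do.call(rbind, lapply(seq_len(m), function(i) {
  data.frame(.imp = i, x = x, y = 2 * x + rnorm(n, sd = 0.5))
}))

# Fit the same model separately on each imputed dataset.
fits <- lapply(split(stacked, stacked$.imp), function(d) lm(y ~ x, data = d))
est <- unname(sapply(fits, function(f) coef(f)["x"]))       # one slope per imputation
se2 <- unname(sapply(fits, function(f) vcov(f)["x", "x"]))  # its sampling variance

# Rubin's rules: pooled estimate, within- and between-imputation variance.
q_bar <- mean(est)                # pooled point estimate
w_bar <- mean(se2)                # average within-imputation variance
b     <- var(est)                 # between-imputation variance
t_var <- w_bar + (1 + 1 / m) * b  # total variance of the pooled estimate

pooled <- c(estimate = q_bar, se = sqrt(t_var))
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty that comes from the imputation itself.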

The script works when there are no variables for SMI -- i.e., no missing data. What if there is only one variable with missing data?

Error in doMI(evm, nimp = 3, maxit = 4) : 
 You need at least two variables with missing values to run this procedure properly
  • DRW: In this example "hougrp2" has pval=0.13 and is no longer significant although significant when 3 other variables with missing data are added
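
Before calling doMI(), one can count how many of the variables in evm actually have missing values, since the error above is raised when fewer than two do. This is a minimal sketch in base R on an invented stand-in for dx; the names na_counts and with_na are hypothetical and not part of the Dow-Eff workspace.

```r
# Toy stand-in for dx with two variables that contain missing values.
dx <- data.frame(
  fish1   = c(10, NA, 35, NA),
  hougrp2 = c(2, 3, NA, 1),
  et      = c(12.1, 15.3, 9.8, 11.0),
  growc   = c(3, 7, 5, 6)
)
evm <- c("fish1", "hougrp2", "et", "growc")

# Number of missing values per variable in evm.
na_counts <- colSums(is.na(dx[, evm, drop = FALSE]))

# Names of the variables that have at least one missing value.
with_na <- names(na_counts)[na_counts > 0]

if (length(with_na) < 2) {
  message("Fewer than two variables with missing values: ",
          paste(with_na, collapse = ", "))
}
```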
# --identify variables to keep for model building--
evm<-c("fish1","et","growc", "hougrp2",   "bio.3","bio.4","bio.12","bio.13","bio.16","bio.18","bio.9Sq",  "lati","long","meanalt","sdalt",    "mnnppSq") #bio.12Sq","bio.13Sq","bio.3Sq","bio.4Sq","bio.8Sq","et:bio.8","et:Wy","growc:Wy","meanaltSq","sdalt","dspmov"
## deleted  NO"hougrp2",    "dspmov","numfam","numg3",  
 smi <- doMI(evm, nimp = 3, maxit = 4) 
#dim(smi)
dim(smi)  # dimensions of new dataframe smi
#smi[1:2, ]  # first two rows of new dataframe smi.
#Missing values of these variables are then imputed, using the command doMI(). Here, the number of imputed datasets is 3, and 4 iterations are used to estimate each imputed value (these values are too low: nimp=10 and maxit=7 are the defaults and are reasonable for most purposes). The stacked imputed datasets are collected into a single dataframe which here is called smi.
#This new dataframe smi will contain not only the variables in evm, but also a set of normalized (mean=0, sd=1) variables related to climate, location, and ecology (these are used in the OLS analysis to address problems of endogeneity). In addition, squared values are calculated automatically for variables with at least three discrete values and maximum absolute values no more than 300. These  squared variables are given names in the format variable name+“Sq”.
#Finally, smi contains a variable called “.imp”, which identifies the imputed dataset, and a variable called “.id” which gives the society name. 
#smi <- doMI(evm, nimp = 5, maxit = 7) 
#smi <- doMI(evm, nimp = 2, maxit = 3)
#All of the variables selected to play a role in the model must be found in the new dataframe smi. Below, the variables are organized according to the role they will play.
#The command doOLS() estimates the model on each of the imputed datasets, collecting output from each estimation and processing them to obtain final results. To control for Galton's Problem, a network lag model is used, with the user able to choose a combination of geographic proximity (dw), linguistic proximity (lw), and ecological similarity (ew) weight matrices. In most cases, the user should choose the default of dw=TRUE, lw=TRUE, ew=FALSE.
# --dependent variable--
dpV <- "fish1"
oxog <- c("NULL")
#-independent variables in UNrestricted model--
UiV<-c("et","growc",   "hougrp2"    ,"bio.3","bio.4","bio.12","bio.13","bio.16","bio.18","bio.9Sq",   "lati","long","meanalt","sdalt") #,    "bio.12Sq","bio.13Sq","bio.3Sq","bio.4Sq","bio.8Sq",,"bio.18")
##"hougrp2",      "dspmov","numfam","numg3" 
#--independent variables in restricted model--
#RiV<-c("meanalt","sdalt","bio.3","bio.4","bio.12","bio.13","bio.16","bio.18","bio.9Sq")            #   ,"lati","et","long","growc","bio.12Sq","bio.13Sq","bio.3Sq","bio.4Sq","bio.9Sq") #ToTry ,"lati","long","meanalt","sdalt"
#To Try bio.12	bio.13	bio.16	bio.18	bio.3	bio.4	bio.4Sq:Wy	gatherinSq	hougrp2Sq	mnnppSq	sdalt:Wy	growc	hougrp2	lati
RiV<-c("hougrp2",  "sdalt","bio.3","bio.4","bio.18")  #"bio.18","bio.3"  #To Try bio.12	bio.13	bio.16	bio.18	bio.3	 bio.4
## "hougrp2", 
h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = NULL, dw = TRUE, lw = TRUE, ew = TRUE, stepW = TRUE, boxcox = FALSE, getismat = FALSE, relimp = TRUE, slmtests = FALSE, haustest = NULL, mean.data = TRUE, doboot = 1000) #Works with DEf01c and DEf01d
CSVwrite(h, "/Users/drwhite/Documents/Fishing.ew", FALSE)
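
The network lag term (Wy) that the script comments mention for controlling Galton's Problem can be illustrated with a row-standardised weight matrix. The sketch below is an assumption-laden toy: it builds an inverse-distance matrix from invented coordinates, which is not necessarily how the dw, lw, or ew matrices in the Dow-Eff workspace are constructed or combined.

```r
# Invented coordinates and outcome values for five societies.
coords <- cbind(x = c(0, 1, 2, 5, 6), y = c(0, 1, 0, 5, 5))
yv <- c(10, 12, 11, 30, 28)

d <- as.matrix(dist(coords))  # pairwise Euclidean distances
w <- 1 / d                    # closer societies get larger weights
diag(w) <- 0                  # a society is not its own neighbour
w <- w / rowSums(w)           # row-standardise so each row sums to 1

# Network-lagged outcome: for each society, the weighted mean of the others' values.
wy <- as.vector(w %*% yv)
```

Because each row of w sums to 1, every element of wy is a weighted average of the other societies' values, with nearby societies counting most.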

Maps of the restricted variables

Working model with CoSSci using ; ; semicolons for New Variables

depvar indepvar none of the l (log) variables run New variable

  • Lfishing bio.18,bio.3,bio.4,cvtemp,elev,perwret,snowac,temp,lbar5,lcoklm,lcvtemp,lptoae,lptorun,lsnowac,lwacess dx$fishing; dx$Lfishing <- log(dx$fishing)
  • http://intersci.ss.uci.edu/wiki/index.php/List_of_LRB_variables
  • Lfishing bio.18,bio.3,bio.4,cvtemp,elev,perwret,snowac,temp,lbar5, lcvtemp,lptoae,lptorun,lsnowac,lwacess
load(url("http://dl.dropbox.com/u/9256203/DEf01.Rdata"), .GlobalEnv)
setDS("LRB")
sort(names(dx))

Subsetted sample

Amber: In CFR regression equations were run using both the full 339 case file and the 142 case sample. Lew discusses the 142 case sample in detail in Chapter 5 of CFR. Where he introduces this sample (p. 144) he describes it as “a subset of those cases designed to correspond as closely as possible to the proportions of the earth’s surface covered by the twenty-eight different plant communities adapted in this study from Eyre’s classification (1968).” These cases were selected to be a proportional sample of hunter-gatherers from vegetation types.

  • To subset use:
dx$fish1<-dx$fishing
z<-which(dx$hg142!=1) #<- for hg142 unclustered subsample. Binford p142.
is.na(dx$fish1[z])<-TRUE
  • then run everything as usual, with dx$fish1 as the dependent variable.
  • THERE IS NO dx$coklm ; dx$lcoklm <- log(dx$coklm) VARIABLE
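
The subsetting idiom above can be tried on a small invented data frame to see what it does. This sketch assumes only base R; the toy dx with columns fishing and hg142 stands in for the LRB data.

```r
# Toy stand-in for dx: hg142 == 1 marks membership in the 142-case subsample.
dx <- data.frame(fishing = c(10, 0, 35, 50, 5), hg142 = c(1, 0, 1, 0, 1))

dx$fish1 <- dx$fishing      # copy, so the original fishing column stays intact
z <- which(dx$hg142 != 1)   # row indices of cases outside the subsample
is.na(dx$fish1[z]) <- TRUE  # set fish1 to NA for those rows only

n_kept <- sum(!is.na(dx$fish1))  # number of cases that still have a fish1 value
```

The copy in the first step matters: dx$fishing keeps all of its values, so the full-sample variable remains available for comparison.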