Paul Rodriguez



Hi Doug,

You are right, there is currently no active ECSS allocation for Doug. He had one for several years to get the gateway going and this has now been transitioned to UCI staff. Paul does have to report on his “Novel and Innovative Projects” activities. If some of that included work with Doug, it’s fine to describe that. From an ECSS project perspective, no formal reports are needed at this time.

Nancy

Doug: where in CoSSci would BNlearn be activated?

It never made it into the production instance—it’s only an option on the development instance at http://socscicompute.ss.uci.edu:8081/

If it doesn’t work for you, let me know—it’s possible that there’s an environment setting on comet for the ucissg user that needs to be tweaked to get the right R environment—I don’t remember if I left that turned on or not…

Eric

The Bootstrap

Next, a bootstrap procedure was used to explore the distribution of possible network models (Efron & Tibshirani, 1986). One thousand bootstrap resamples were taken by sampling the original dataset with replacement. For each resampled dataset, a Bayesian network was learned using the grow-shrink algorithm. The binary-valued adjacency matrix for each network was saved and then averaged across all 1000 networks, thereby producing an expectation for the presence of every edge (Figure with graph in file named 'BNwboot_nowy_05thresh'). This approach has proved very useful in biological network discovery (e.g. Marbach et al. 2012). The expectation serves as a weight on the edge, but it does not indicate which networks typically appear in the bootstrap samples. Therefore, we also sorted and counted the adjacency matrices (Figure with network counts) and printed out the most frequent networks (Figures with specific networks found, where the name gives the network counts, 154, 72, etc.). An R sketch of the procedure follows the references below.

  • Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, The DREAM5 Consortium, Kellis M, Collins JJ, Stolovitzky G. 2012. Wisdom of crowds for robust gene network inference. Nature Methods 9(8): 796-804. (58 collaborators.)

Grow-Shrink algorithm: D. Margaritis and S. Thrun. 1999. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12. Abstract: In recent years, Bayesian networks have become a highly successful tool for diagnosis, analysis, and decision making in real-world domains. We present an efficient algorithm for learning Bayes networks from data. Our approach constructs Bayesian networks by first identifying each node’s Markov blankets, then connecting nodes in a maximally consistent way. In contrast to the majority of work, which typically uses hill-climbing approaches that may produce dense and causally incorrect nets, our approach yields much more compact causal networks by heeding independencies in the data. Compact causal networks facilitate fast inference and are also easier to understand. We prove that under mild assumptions, our approach requires time polynomial in the size of the data and the number of nodes. A randomized variant, also presented here, yields comparable results at much higher speeds.
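
A minimal R sketch of the bootstrap-of-networks procedure described above, using the bnlearn package. The data frame name dat, the seed, and the other object names are illustrative assumptions, not the CoSSci production code:

library(bnlearn)
R <- 1000                                          # number of bootstrap resamples
n <- nrow(dat)                                     # dat = prepared analysis data frame (assumed)
adj.list <- vector("list", R)
set.seed(1)
for (r in 1:R) {
  boot.dat <- dat[sample(n, n, replace = TRUE), ]  # resample cases with replacement
  net <- gs(boot.dat)                              # grow-shrink structure learning
  adj.list[[r]] <- amat(net)                       # binary adjacency matrix for this resample
}
edge.expectation <- Reduce(`+`, adj.list) / R      # mean presence of each edge = edge weight
keys <- sapply(adj.list, function(a) paste(a, collapse = ""))  # encode each network as a string
sort(table(keys), decreasing = TRUE)[1:5]          # the most frequent networks across resamples

bnlearn's boot.strength(dat, R = 1000, algorithm = "gs") performs essentially the same edge-averaging step in a single call, if only the edge weights are needed.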

Paul: MyVarimportance4v626.xlsx, MyVarimportance4v621.xlsx

Paul Rodriguez -- Random Forests -- p4rodriguez@ucsd.edu, LEG-8326. Now see Gradient boosting or the [R package for randomForest], both recommended by Scott. (Screenshots: Screen Shot 2013-10-18 at 5.34.07 PM.png, Screen Shot 2013-10-18 at 5.33.43 PM.png)

References for Random Forests

Wikipedia: Leo Breiman (d. 2005). 2001. Random Forests. Machine Learning 45(1): 5-32.

Andy Liaw and Matthew Wiener. 2002. Classification and Regression by randomForest. R News 2(3): 18-22.

R programs to interface with FORTRAN

Google: Random forest exploration

Query and Response

http://intersci.ss.uci.edu/wiki/pdf/WileyCh5CCRNetsofVarsModels2blackDRW.pdf You can skip the part about Reiss's many errors, but p. 11 gives advice about iterations, and the later Tables 4-7 give results after many iterations of the independent variables.

http://intersci.ss.uci.edu/wiki/index.php/Ev676.4 shows a simple R GUI for the model. In the lines below, right after aa is computed, the variables defined, e.g., by aa$v149 are imputed (Tolga take note: this allows you to make maps with aa variables that contain the imputed values).

smi <- doMI(evm, nimp = 2, maxit = 3)  # multiple imputation of the evm dataset (2 imputations, 3 iterations)
aa <- aggregate(smi[, sapply(smi, function(x) is.numeric(x))], list(smi$.id), mean)  # average the numeric imputed variables from evm across imputations, by case id
randomForest can sample independent variables for the dependent variable from aa and run them in the rest of the code (see the sketch after this block), e.g.,
# --dependent variable--
dpV <- "originsymb"
# --independent variables in UNrestricted model--
UiV <- c("bio.5","v149",       "v203", "v204","v205","v21","v53","v665","v670", "v7")
# --additional exogenous variables (use in Hausman tests)--
oxog <- c("v1260")
# --independent variables in restricted model (all must be in UiV above)--
RiV <- c("bio.5",          "v149","v205","v21","v53","v665","v670", "v7")
h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = oxog, dw = TRUE,  lw = TRUE, ew = FALSE, stepW = TRUE, relimp = TRUE, slmtests = FALSE)
CSVwrite(h, "olsresults", FALSE)
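
A hedged sketch of the randomForest step mentioned above: ranking a pool of candidate independent variables for the dependent variable from aa before choosing UiV for doOLS. The candidate pool, seed, and settings are illustrative assumptions; only dpV and the variable names come from the example above, and this is not part of the doOLS code itself:

library(randomForest)
pool <- c("bio.5","v149","v203","v204","v205","v21","v53","v665","v670","v7")  # assumed candidate indepvars
rf.dat <- na.omit(aa[, c(dpV, pool)])        # randomForest does not accept missing values
set.seed(1)
rf <- randomForest(x = rf.dat[, pool], y = rf.dat[, dpV], ntree = 1000, importance = TRUE)
imp <- importance(rf)                        # importance measures (e.g. %IncMSE for a numeric response), not p-values
imp[order(imp[, 1], decreasing = TRUE), ]    # rank the candidates; top-ranked variables become UiV/RiV

As Paul notes in the email below, these rankings indicate relative predictive importance only; significance testing still comes from the doOLS regression step.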

where the relevant h[] components are these:

h[3]
$Rmodel
                coef  stdcoef     VIF  relimp  hcpval    pval star
(Intercept) -2.27744      NaN     NaN     NaN 0.11289 0.13189
bio.5       -0.14059 -0.18097 1.32254 0.01936 0.06356 0.06373    *
v149         0.08520  0.15346 1.15557 0.02218 0.03501 0.09053    *
v205        -0.13857 -0.30926 4.37880 0.02043 0.06274 0.07966    *
v21         -0.21204 -0.20842 1.14447 0.03256 0.01106 0.02101   **
v53         -0.18567 -0.21594 1.15876 0.06654 0.01620 0.02238   **
v665         0.28478  0.15407 1.10206 0.03014 0.12412 0.10677
v670         0.13772  0.14106 1.05644 0.03435 0.12572 0.12171
v7           0.23610  0.33753 4.22484 0.01822 0.03433 0.05166    *
Wy           1.81951  0.30057 1.13778 0.11985 0.00068 0.00128  ***
h[4]
$EndogeneityTests
       weakidF p.Sargan n.IV Fstat       df pvalue star
bio.5  39.307    0.243   10 1.789    23999  0.181
v149    5.785    0.000    5 1.308      126  0.000  ***
v205   11.000    0.202    4 0.000     1546  0.995
v21     5.045    0.147    5 5.000      724  0.027   **
v53     4.351    0.085    4 0.261      136  0.610
v665    3.024      NaN    2 0.039 23313436  1.000
v670    2.000    0.509    2 0.373        4  0.574
v7      7.339    0.251    2 1.000    24893  0.324
h[5]
$Diagnostics
                                                         Fstat       df pvalue star
RESET test. H0: model has correct functional form        0.213  148.003  0.645
Wald test. H0: appropriate variables dropped             0.000 2782.000  1.000
Breusch-Pagan test. H0: residuals homoskedastic          1.234    7.085  0.303
Shapiro-Wilkes test. H0: residuals normal                5.938   13.708  0.029   **
Hausman test. H0: Wy is exogenous                        8.000   54.000  0.000  ***
Sargan test. H0: residuals uncorrelated with instruments 0.683 2025.584  0.409
h[6]
$OtherStats
    d   l e Weak.Identification.Fstat R2.final.model R2.UR.model nimp nobs
1 0.2 0.8 0                  36.95694      0.2751023   0.2868821    2  112
h[10]
$dfbetas
     dfbeta variable    observation  dataValue depvarValue
8   -0.5711       Wy         Yapese  1.9827050           3
3   -0.4146       Wy          Huron  2.4776751           1
2   -0.4086       Wy  Copper Eskimo  2.4707232           1
6   -0.2762       Wy     Nambicuara  2.4294929           1
... etc

and h[8], h[9] give the variables "ToTry" and those that "DidWell"


On 8/9/13 11:38 PM, Paul Rodriguez wrote:


hey Doug,

If I understand the situation correctly, the R randomForest should be able to handle the size nicely on the HPC systems.

One thing to keep in mind is that randomForest will help select variables according to an importance measure, but not really give you p-values.

If you send me a paper or example of the data and methods, that would make it more clear - I'd be happy to chat some more, next week(s) are fairly good,

Paul ________________________________________ From: Doug White [douglas.white@uci.edu] Sent: Friday, August 09, 2013 5:45 PM To: Paul Rodriguez; B_Tolga Subject: randomForest

I've been looking with interest, given your presentation, at Andy Liaw and Matthew Wiener (R News, 2002) and Leo Breiman.

In our Trestles R application we have a hundred or more variables of interest, each of which could be evaluated with hundreds of potential independent variables, often six to a dozen at a time. If randomForest were to work for sets of predictive variables, would this many variables be manageable? Right now we have chapter authors for a book, other researchers, and students in online classes starting with limited hypotheses using regression (with missing data imputed and controls for autocorrelation), and then iterating toward successive model improvements, often in 10-15 steps. This many steps, with 20-minute queue waits, has led us to use a Virtual Machine or the R GUI on laptops and save the big composite jobs for Trestles. Would randomForest programmed in R on our Galaxy (and VM) be able to give us quicker results for better model improvements, or are there too many variables? Our numbers of cases run from 186 to 585 to 1270, with declining quality and completeness of data in the latter. I imagine it would not be difficult to connect R functions to the Fortran processor, or has that changed since 2002?

Would like to talk to you about these possibilities; my phone # is 858 457 0077, and as I live close we could get together at SDSC.

Great talking to you; sorry we missed your talk, but we were busy with our project on the VM and Trestles, and Tolga has limited time in his visits to La Jolla as he lives in Irvine.

-- best Doug White http://intersci.ss.uci.edu/wiki/index.php/DW_home

Plans for randomForest

Tom Uram