# Paul Rodriguez

Hi Doug,

**You are right, there is currently no active ECSS allocation for Doug. He had one for several years to get the gateway going and this has now been transitioned to UCI staff. Paul does have to report on his “Novel and Innovative Projects” activities. If some of that included work with Doug, it’s fine to describe that. From an ECSS project perspective, no formal reports are needed at this time.**

Nancy

**Doug: where in CoSSci would BNlearn be activated?**

It never made it into the production instance—it’s only an option on the development instance at http://socscicompute.ss.uci.edu:8081/

If it doesn’t work for you, let me know—**it’s possible that there’s an environment setting on comet for the ucissg user that needs to be tweaked to get the right R environment—I don’t remember if I left that turned on or not…**

Eric

- Use of library(bootstrap) Comments299
**bnlearn R program** - Paul Rodriquez LEG 8326 -- DDP gwm0
- SDSC User support

## Contents

## The Bootstrap

Next, a bootstrap procedure was used to explore the distribution of possible network models (Efron & Tishbrini, 1986). One thousand bootstrap resamples were taken by sampling the original dataset with replacement. For each new sample dataset, a bayes network was found using the grow-shrink algorithm. The binary valued adjacency matrix for each network was saved and then averaged across all 1000 networks, thereby producing an expectation for the presence of every edge (Figure with graph in file named 'BNwboot_nowy_05thresh'). This approach has proved very useful in biological network discovery (e.g. Marbach, etal. 2012). The expectation serves as a weight on the edge, but it does not indicate what typical networks appear in the bootstrap samples. Therefore, we also sorted and counted the adjacency matrices (Figure with network counts), and printed out the most frequent networks (Figures with specific networks found, where the name has the figure counts, 154, or 72, etc.)

- Efron, B.; Tibshirani, R. 1993.
**An Introduction to the Bootstrap**. Boca Raton, FL: Chapman & Hall/CRC.

- Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR; The DREAM5 Consortium (authors), Kellis M, Collins JJ, Stolovitzky G. 2012. Wisdom of crowds for robust gene network inference.
**Nature Methods**, 9(8):796-804. 58 Collaborators.

Grow-Shrink algorithm: D. Margaritis and S. Thrun. 1999. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12. Abstract: In recent years, Bayesian networks have become highly successful tool for diagnosis, analysis, and decision making in realworld domains. We present an efficient algorithm for learning Bayes networks from data. Our approach constructs Bayesian networks by first identifying each node’s Markov blankets, then connecting nodes in a maximally consistent way. In contrast to the majority of work, which typically uses hillclimbing approaches that may produce dense and causally incorrect nets, our approach yields much more compact causal networks by heeding independencies in the data. Compact causal networks facilitate fast inference and are also easier to understand. We prove that under mild assumptions, our approach requires time polynomial in the size of the data and the number of nodes. A randomized variant, also presented here, yields comparable results at much higher speeds.

## Paul MyVarimportance4v626.xlsx MyVarimportance4v621.xlsx

Paul Rodriguez -- Random Forests -- p4rodriguez@ucsd.edu LEG -8326 - now see **Gradient boosting** or
[R package for randomForest] both rec'd by Scott

## References for Random Forests

Wikipedia:Leo Breiman (d.2005). 1999. Random Forests.

Andy Liaw and Matthew Wiener. 2002. Classification by Regression and random forests

- R programs to interface with FORTRAN

Google: Random forest exploration

## Query and Response

http://intersci.ss.uci.edu/wiki/pdf/WileyCh5CCRNetsofVarsModels2blackDRW.pdf You can skip the part about Reiss's many errors but p11 gives advice about iterations the later Tables 4-7 give results after many iterations of the independent variables.

http://intersci.ss.uci.edu/wiki/index.php/Ev676.4 shows a simple R gui for the model. right after aa in the lines below the variables defined, e.g., by aa$v149 are imputed (Tolga take note: this allows you to make maps with aa variables that have the imputed variables)

smi <- doMI(evm, nimp = 2, maxit = 3) aa<-aggregate(smi[,sapply(smi,function(x) is.numeric(x))],list(smi$.id),mean) # lists imputed variables from evm

randomForest can sample indepvars for the depvar from aa and run them in the rest of the code, e.g., # --dependent variable-- dpV <- "originsymb" # --independent variables in UNrestricted model-- UiV <- c("bio.5","v149", "v203", "v204","v205","v21","v53","v665","v670", "v7") # --additional exogenous variables (use in Hausman tests)-- oxog <- c("v1260") # --independent variables in restricted model (all must be in UiV above)-- RiV <- c("bio.5", "v149","v205","v21","v53","v665","v670", "v7") h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = oxog, dw = TRUE, lw = TRUE, ew = FALSE, stepW = TRUE, relimp = TRUE, slmtests = FALSE) CSVwrite(h, "olsresults", FALSE)

where the relevant h[] functions are these:

h[3] $Rmodel coef stdcoef VIF relimp hcpval pval star (Intercept) -2.27744 NaN NaN NaN 0.11289 0.13189 bio.5 -0.14059 -0.18097 1.32254 0.01936 0.06356 0.06373 * v149 0.08520 0.15346 1.15557 0.02218 0.03501 0.09053 * v205 -0.13857 -0.30926 4.37880 0.02043 0.06274 0.07966 * v21 -0.21204 -0.20842 1.14447 0.03256 0.01106 0.02101 ** v53 -0.18567 -0.21594 1.15876 0.06654 0.01620 0.02238 ** v665 0.28478 0.15407 1.10206 0.03014 0.12412 0.10677 v670 0.13772 0.14106 1.05644 0.03435 0.12572 0.12171 v7 0.23610 0.33753 4.22484 0.01822 0.03433 0.05166 * Wy 1.81951 0.30057 1.13778 0.11985 0.00068 0.00128 *** h[4] $EndogeneityTests weakidF p.Sargan n.IV Fstat df pvalue star bio.5 39.307 0.243 10 1.789 23999 0.181 v149 5.785 0.000 5 1.308 126 0.000 *** v205 11.000 0.202 4 0.000 1546 0.995 v21 5.045 0.147 5 5.000 724 0.027 ** v53 4.351 0.085 4 0.261 136 0.610 v665 3.024 NaN 2 0.039 23313436 1.000 v670 2.000 0.509 2 0.373 4 0.574 v7 7.339 0.251 2 1.000 24893 0.324 h[5] $Diagnostics Fstat df pvalue star RESET test. H0: model has correct functional form 0.213 148.003 0.645 Wald test. H0: appropriate variables dropped 0.000 2782.000 1.000 Breusch-Pagan test. H0: residuals homoskedastic 1.234 7.085 0.303 Shapiro-Wilkes test. H0: residuals normal 5.938 13.708 0.029 ** Hausman test. H0: Wy is exogenous 8.000 54.000 0.000 *** Sargan test. H0: residuals uncorrelated with instruments 0.683 2025.584 0.409 h[6] $OtherStats d l e Weak.Identification.Fstat R2.final.model R2.UR.model nimp nobs 1 0.2 0.8 0 36.95694 0.2751023 0.2868821 2 112 h[10] $dfbetas dfbeta variable observation dataValue depvarValue 8 -0.5711 Wy Yapese 1.9827050 3 3 -0.4146 Wy Huron 2.4776751 1 2 -0.4086 Wy Copper Eskimo 2.4707232 1 6 -0.2762 Wy Nambicuara 2.4294929 1 ... etc

and h[8], h[9] give variables "ToTry" and that "DidWell"

**On 8/9/13 11:38 PM, Paul Rodriguez wrote:**

hey Doug,

If I understand the situation correctly, the R randomForest should be able to handle the size nicely on the HPC systems.

One thing to keep in mind is that randomForest will help select variables according to an importance measure, but not really give you p-values.

If you send me a paper or example of the data and methods, that would make it more clear - I'd be happy to chat some more, next week(s) are fairly good,

Paul ________________________________________ From: Doug White [douglas.white@uci.edu] Sent: Friday, August 09, 2013 5:45 PM To: Paul Rodriquez; B_Tolga Subject: randomForest

I've been looking with interest, given your presentation, at AndyLiaw and Matt Weiner R news 2002 and Leo Breitman

In our Trestles R application we have a hundred or more variables of
interest, each of which could be evaluated with hundreds of potential
independent variables, often six to a dozen, of interest. If
randomForests were to work for sets of predictive variables, would this
many variables be manageable? Right how we have chapter authors for a
book, other researchers, and students in online classes starting with
limited hypotheses using regression (with missing data imputed and
controls for autocorrelation), and then getting iterations for
successive model improvements, often in 10-15 steps. **This many steps**
with queues of 20 minute wait have led us to use a Virtual Machine or
the R gui on laptops and save the big composite jobs for Trestles.** Would**
the use of randomForests programmed in R on our Galaxy (and VM) be able
to give us quicker results for better model improvements or are there
too many variables?Our number of cases run from 186 to 585 to 1270
with declining quality and completeness of data in the latter. I imagine
it would not be difficult to connect R functions to the Fortran
processor, or has that changed since 2002?

Would like to talk to you about these possibilites, my phone # is 858 457 0077, as I live close we could get together at SDSC.

Great talking to you, sorry we missed your talk but we were busy with our project on the VM and trestles and Tolga has limited time in his visits to La Jolla as he lives at Irvine.

-- best Doug White http://intersci.ss.uci.edu/wiki/index.php/DW_home