Instructions and workflows in development

From InterSciWiki
  • Here: See CS Class instructions, which supersede these few comments
  • Solutions for Subsets samples 4/22/15 Sample subsets -> how to limit your sample to a subset of the data
  • For students and researchers at Galaxy CoSSci: http://socscicompute.ss.uci.edu
  • CS If you use Mac OS X and you want to save an image of a map, such as v215.d2 Matrilocality, please click the 'eye' icon in the history pane (on the right side of the text 'DEf01d-f Map' or so). Then, you should see images in the center pane. To save an image, press the 'control' button on the keyboard and while keeping it pressed click the image with the mouse pointer. A menu with the first item 'Save Image As...' should pop up. - Lukasz Comments110
  • If you add Sq to one of your variables, e.g., v123Sq (the squared value of a variable), it often works. If not, try defining it as a New variable: dx$v123Sq <- dx$v123*dx$v123
  • CS ? Instructor: Try using HPC then DEf01f - it is much slower but handles many students at the same time and class time can be reserved for fast use by many students.
  • Students: If you have 15 or more independent variables, use the Galaxy CoSSci: http://socscicompute.ss.uci.edu HPC (that's the supercomputer) option.
  • This applies to Paul's project on Comet: see Holm-Bonferroni for group significance tests, which can be calculated in R if you use RStudio and enter the group of significance tests for your significant variables, using Paul Rodriguez's very simple R code.
  • How long can my history be? Comments222 e.g., up to 200 or more
  • CS How to share histories Comments101
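The Holm-Bonferroni step mentioned in the list above can be sketched in base R with p.adjust(). This is only a minimal illustration, not Paul Rodriguez's code: the variable names and p-values below are hypothetical, and in practice the p-values would be copied from the model output.

```r
# Hypothetical p-values for a group of significance tests (illustrative only).
pvals <- c(v1 = 0.001, v2 = 0.012, v3 = 0.030, v4 = 0.040, v5 = 0.200)

# Holm-Bonferroni adjustment; p.adjust() is part of base R (package stats).
adj <- p.adjust(pvals, method = "holm")

# Keep the variables that remain significant at the 0.05 level as a group.
significant <- names(adj)[adj <= 0.05]
print(round(adj, 3))
print(significant)
```

Variables that pass individually at 0.05 can drop out after the adjustment, which is the point of testing the group as a whole.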
  • Notes XSEDE Aug 21 -- SIU So Illinois. Maybe a gateway in humanities. Names: took picture of poster.
Need to talk to Marlon and Nancy - what will work. Working on the Globus - doesn't align strategically. Other things align better. Amazon EC2. Significant. EC2 might do the work.
Operational costs for gateway fix then the app/execution costs (CoSSci analysis). Other costs allowing them to run on the structure. Varying costs. Might ... operational needs run reasonably. Gateway one instance. Run on separate hosts just for jobs like XSEDE ... could be when running on XSEDE. What's the complexity. Comet .... Relocate. May be in a short wait queue. One node. Multiple nodes.
  • Stu: jobs bigger not submit without their own allocation. Concurrent jobs. Those can be done at UCI or EC2 or submitted to Comet because they are small. Require a login for NOVA analysis.
  • Eric: How much is ok to just do it ... if a user is consuming the allocation... leave it open for now. Six months extension. BIG DIFFERENCE: ECSS assistance. Other Comet compute site allocation. --- is a process for that for cycles.
would advise XSEDE cycles are there -- getting collaborations from other groups - maintaining, operating -- other Universities, Humanities, Groups, Computing HRAF----- CS programs -- collaboration within the university
What do others think about a Gateway -- where they can fund this ... Suresh ... --> eHRAF ...
  • WORKFLOWS IN DEVELOPMENT
  • Aug 23, 2015 Implement e.g. v23g8 in the dummy variables as NewVar dx$v23g8<-1+(dx$v23==8)*1 +(dx$v23==9)*1 etc
  • Aug 18, 2015 Work with Paul to Implement Bayesian Networks of Variables on CoSSci PRIORITY #1
  • Aug 18, 2015 sent email: Seemed to be a problem in [ ] - send the [[[ ]]]]] locally --> will -- to Eric
  • Aug 18, 2015 possible idea to complement [X] Wy of [ ] Wy (i.e., regression with imputation only no Wy) rather tricky, perhaps circumvent h[ TRUE FALSE items], truncate to a smaller alternative hregress[ ] to run minimum of items followed by running the imputation.
  • Aug 1, 2015 setDS("SCCS") or its counterparts (EA, LRB, WNAI, XC) needs to be reset before each model is run. The problem, found with v2013, is that setDS("SCCS") does not restart, so dx$v2013 will have the original NAs. What happens is that the NAs convert to the value 6.
  • July 28, 2015 Use Holm-Bonferroni test for a workflow to recover the variables significant at p=0.05 or less from the Galaxy...csv
  • July 27, 2015 Crucial new options: clickable box [ ] scale (convert all variables except dummies to automatically approximate a normal distribution with the R function scale). Once finished, define [ ] integers to approximate the spread of each scale variable.
  • July 27, 2015 Implement the part of this that generates integer variables from Scaled variables How_to_normalize_and_integer-scale_a_variable_in_a_data_frame in CoSSci. Essential: suppresses decimals for bnlearn bio.decimals, maps; makes models much stronger in Rsquared.
Enlarge the Dummy variables window to a mutually exclusive Dummy .d or Normalized .n variables window, for example v1650.n12, that transforms such variables to integer variables > 0 that approximate a normal distribution. For better results, most researchers will take this option. This should improve color maps generated by the Rworldmap package. E.g., the user can reduce the ordinal categories to a maximum of 9 values and 9 corresponding colorings of nodes.
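The scale-then-integer idea described above might look like the following in base R. This is a rough sketch under stated assumptions, not the CoSSci implementation: the data frame, the values of v1650, and the output name v1650.n9 are hypothetical, and equal-width binning with cut() is just one possible way to get integers > 0. Note that scale() only centers and rescales a variable (mean 0, SD 1); it does not change the shape of its distribution.

```r
# Hypothetical data frame standing in for dx (values are illustrative only).
dx <- data.frame(v1650 = c(1, 2, 2, 3, 5, 8, 13, 21))

# Step 1: standardize to mean 0 and standard deviation 1.
z <- as.numeric(scale(dx$v1650))

# Step 2: map the standardized values onto k integer categories 1..k
# (assumption: equal-width bins over the observed range of z).
k <- 9
dx$v1650.n9 <- cut(z, breaks = k, labels = FALSE)

print(dx)
```

With k = 9 the result has no decimals and at most 9 categories, which matches the limit suggested above for color maps.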
  • Completed: CoSSci and :8081 --> #1 yes: now rectangles extended #2 ? runs on Trestles rather than Comet #3 ? did we fix the authorization of Eric #4 will jobs work on Comet ? any problem #5 not on my allocation #6 walk through with Stu on giving Eric access #7 manual still being developed by Eric #8 -- is direct link to Comet possible, separate from UCI VM???
  • Completed May 25 '15 leaving Full set [ ] empty or Full set [x] checked should control whether the maps are for the original variables or governed by the choice of dependent variable for both Mkmapping and for Color map
If "Full set [x]" is clicked then for both Mkmapping and for Color map what we want are the FULL variables for what is named in the database or as specified by a DUMMY variable, e.g., v34.d4, v75.
If "Full set [ ]" is left empty in this new format, the two outputs should correspond to the DEPENDENT variable in the codebook as usual.
  • New May 18 '15 Eric working on this: When making color maps, limit the variable to 12 or so categories -> with decimals, round(that variable, 1) # i.e., one-decimal color map
  • New May 18 '15 Tricky?? will need to change the R script -- lets galaxy how to tell it -- not done -- Let's put the name of the computer (Comet/UCI) and the execution time on an early line in *.csv if possible ; we discussed letting the Dummy window be enlarged;
  • Paul is doing this at SDSC (Comet and Gordon) New Mar 24 '15 Workflow for causal analysis with library bnlearn
  • Completed: Mar 10 '15 Open up the Dummy variables rectangle AND THE NEW VARIABLES rectangle to the largest width in CoSSci
  • Check map dot size and labels If you look at the map at http://intersci.ss.uci.edu/wiki/index.php/Carlos_A._Botero#EA_Variables_and_Supplementary_Materials and compare it to our maps at http://intersci.ss.uci.edu/wiki/index.php/EA34d4 we can see that our dots are too big for this sample of 784 (from EA). So we might add to our list of improvements to find an adjustment so that the size of dots decreases say logarithmically at an appropriate log scale for different sample sizes....
  • Add code for Mkcatmappng
  1. Łukasz Lacinski
  • For implementation (integers 1,2,3 new on CoSSci options)
  • New Feb 16 '15 When a CoSSci option is chosen after clicking DE01f and any of SCCS EA LRB or EA are chosen, that symbol should show up on the next page where the model begins.th node sizes proportional to 1+units of .10
      • May 24 '15 Likely not do-able to color maps permanently to match longitudes with the world maps of Mkmapping by Eff using 24.5° W to define the left edge of the color map.
  • Might this work if the *.Rdata had 186 cases for SCCS, etc.: New May 20 '15 Eric is working on making a new window, above "Variables to Plot" the option at Comments284 to insert such lines as load("WEd.GIS.Rdata") ; setDS("EA") ; dx<-data.frame(dx,WEd.GIS[rownames(dx),c("AmphibianDiv","BirdDiv","MammalDiv","VascPlant")]) at that window to add new variables. Or to add Elizabeth Cashdan codes on Pathogen prevalence.
Lukasz Post
  • DONE: Posted a link on CoSSci http://socscicompute.ss.uci.edu giving access to: http://capone.mtsu.edu/eaeff/DEf_SCCS.html as a separate link to that of CoSSci if possible, ignored if that link is down. I.e., Even better if socscicompute opened the codebook line at the same time.
  • DONE: Changed code so that variables like mkdummy("v239",6), which were 0,1, now have +1 added, creating values 1,2 and conforming to most variables having a minimal value of 1. Essential: will also make maps more uniform.
  • DONE Added map code for mkmappng(h$data, "v1649", "v1649FrequencyInternalWarfare", show = "data", numnb.lg = 3, numnb.lm = 20, numch = 5, pvlm = 0.05, dfbeta.show = TRUE) #Łukasz Lacinski needs work - Comments120 --- and Similarly for
  • DONE 2. Box-Cox transformation option on CoSSci for implementing a lambda power coefficient for the dependent variable. Comments220. h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = NULL, dw = TRUE, lw = TRUE, ew = TRUE, stepW = TRUE, boxcox = FALSE, getismat = FALSE, relimp = TRUE, slmtests = FALSE, haustest = NULL, mean.data = TRUE, doboot = 1000, full.set = FALSE) # ew = TRUE converted to ew = FALSE as an option in CoSSci. CoSSci should have an option for boxcox = TRUE.
  • Jan 15.14 Expanded values of nimp and maxit in the DEf01f code only for Trestles Comments231 smi <- doMI(dvm, nimp = 5, maxit = 7) at CoSSci local but set nimp = 8, maxit = 11 at Trestles only
  • DONE full.set=TRUE versus full.set=FALSE in function h, e.g., full.set=TRUE generates nobs=1258 while full.set=FALSE generates nobs=748
  • 3. Add Option on CoSSci for MI rather than normal MID which at the same time changes the DEf01f code thus: smi <- doMI(dvm, nimp = 5, maxit = 7) to nimp = 20, maxit = 28
  • 4. MI versus MID is discussed at THIS_IS_ALL_IN_THE_*.csv_file#Imputation, "The 14th item in list h[ ] is a data frame containing mean values of variables across imputations. This will greatly improve maps because they will expand to world scale, employing the functions mkmappng() (for ordinal data) or mkcatmappng() (for categorical data). Comments120 -- this was DEf01d not completed
  • Add scaling code to support the fv4scale and mkscale functions implemented in DEf01f R-workspace. -->
---> http://capone.mtsu.edu/eaeff/DEf_SCCS.html has the right libraries and DEf01f revised code for this
  1. Łukasz Lacinski needs work
The big innovation from DEf01d-f forward is function fv4scale -- only Lukasz has the skills for this but Anthon Eff may be revising this code for DEf01g. "The variables for a scale can be combined using the function mkscale. The function can calculate three different kinds of scales: 1) based on linear programming as described in Eff (2010); 2) the mean of the standardized values; 3) the first principal component of the standardized values. Below the variables contained in femecon are combined into a scale based on linear programming." Comments120
A question is whether, given a combined W matrix for a specific model and dataset: can we get a world map like that on p15 Figure 3, "General," to be included in the Map output. [x] Moran ::Wi
  • 4. Add Option once we work out the details for later (below) we can use the variables in the h[ ] imputed data frame for making maps.
  • New Oct 16th: re: next item. We need to know in csv output (say at the top) the date, name of user if logged in, name of History, the computer requested (Trestles, VM, aux VM), processor count and wall clock time, .... what else? Why not do our own collection of average usage by times and days of the week for Trestles usage and post that efficiently in our csvs? Then we could encourage all usage to go over to Trestles. Here's what we want: We have access to Trestles by a Galaxy site clicking HPC tools at http://socscicompute.ss.uci.edu -- our ECSS programmer is Lukasz Lacinski. We don't need more than 1 node and ppn < 32. How do we make this the option coming from the Galaxy? We might want to also have a passworded short pool reservation for instructors so they can use Trestles in their classroom if all else fails. Could you tell us how to do both? Mahidhar Tatineni <mahidhar@sdsc.edu>. It's 10:00AM here and run time at HPC is still 1.5 minutes. Looks like the big queues we used to have might be a thing of the past -- Doug Oct 16 2014
  • Extend information printed to output CSV files. #Łukasz Lacinski needs work -- Doug: what were our ideas here?
  • DID (Not do-able at present) We need to be able to define variables such as the Games of Strategy variable V239 as unions of distinct dichotomies such as mkdummy("v239",4) and mkdummy("v239",6), or even as unions of more than two discrete categories within a single variable.
DID dx$Strategy<-(dx$v239.d4 + dx$v239.d6)*1 Comments:233 V239
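The New-variable recodes noted in this list (union of dummies, the +1 shift to a 1,2 coding, and a squared variable) can be tried outside CoSSci with a few lines of R. This is a minimal sketch on a hypothetical data frame: the v239 values are made up, and the dummies are built by hand here because the internals of mkdummy() are not shown on this page.

```r
# Hypothetical stand-in for dx with a categorical variable v239 (made-up values).
dx <- data.frame(v239 = c(1, 4, 6, 2, 4, NA, 6, 3))

# Hand-built 0/1 dummies (assumption: mkdummy("v239", 4) yields the same 0/1 coding).
dx$v239.d4 <- (dx$v239 == 4) * 1
dx$v239.d6 <- (dx$v239 == 6) * 1

# Union of the two mutually exclusive dichotomies, as in the line above.
dx$Strategy <- (dx$v239.d4 + dx$v239.d6) * 1

# Shift 0/1 to 1/2 so the minimum value is 1, like most other variables.
dx$Strategy12 <- dx$Strategy + 1

# Squared value of a variable defined as a New variable.
dx$v239Sq <- dx$v239 * dx$v239

print(dx)
```

Because categories 4 and 6 cannot both hold for one case, the sum never exceeds 1; missing values in v239 stay NA in every derived column.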

Work out details for later

Standalone R packages with the imputed data

Cites: http://CRAN.R-project.org/package=semPLS Armin Monecke <armin.monecke at stat.uni-muenchen.de> Fits structural equation models using partial least squares (PLS). The PLS approach is referred to as 'soft-modeling' technique requiring no distributional assumptions on the observed data. Comments226
Our phone call Thurs 11AM Comments234 We can use the following for the call: Dial-in: 1 716 274 3400 Access code: 555175 Rachana
Decisions on options -- also send Rachana report on Science Impacts: Book in process or chapters in process.... for both projects -- not detailed
  1. Move up HPC to top of Galaxy CoSSci
  2. Look into the generic Galaxy python code: has its own Rscript -- that is for the + function for mkdummy +1 also New variable definitions and color maps
  3. Look into the h[ ] function as an option on Galaxy: 1) Box-Cox 2) MI versus MID
  4. Expand nimp and maxit values for Trestles to 50, 70

I am not familiar with how Galaxy sets the parameters so I will let Lukasz answer that one. The short pool reservation is for all jobs (can't customize that). However, we can always reserve nodes during class hours on particular days of the week (can't do a standing one since that will idle resources). Is that an o.k. option? -- Mahidhar Tatineni<mahidhar@sdsc.edu>

