Robert Sinkovits- SDSC Supercomputing

From InterSciWiki
(Redirected from Robert Sinkovits)
Jump to: navigation, search

UCI Social Science Gateway

Pubs

  • Pubs
  • [Robert S Sinkovits published this conference paper: Comet: Tales from the Long Tail: Two Years In and 10,000 Users Later
  • [Robert S Sinkovits published this conference paper: Performance Characterization and Optimization Assessment of Bioinformatics Applications

Github

k-components identification software

for NSF

  • an anecdote about how the problem of identifying the k-components in the sociology co-authorship network was considered unsolvable might be a nice touch.
  • NSF no longer want “letters of support”, which tend to be of a laudatory nature, but “letters of collaboration”.
    • The letter of collaboration must read as follows (see https://www.nsf.gov/pubs/2017/nsf17526/nsf17526.htm#prep)
    • If the proposal submitted by Dr. Robert Sinkovits entitled “SI2-SSE: Robust software for identification of cohesive subgroups in large networks” is selected for funding by NSF, it is my intent to collaborate and/or commit resources as detailed in the Project Description or the Facilities, Equipment or Other Resources section of the proposal.
  • Moody and White’s (2003) studies of Structural Cohesion provide rigorous ways of identifying hierarchically overlapping sets of social nodes of all sorts that define contexts predictive of cohesive overlapping units in social networks – defined as k-components in any given network that are defined by specific subsets of groups sharing common levels of shared and overlapping connections – and thus “socially defined” groups containing such overlaps as structural outcomes. Such structural outcomes may often entail causal consequences because of larger cohesive subgroups where structural consequences of cohesive subsets may be studied formally or mathematically through large and small network analyses used to identify many kinds of social consequences. These structures, which are often “predictive” of social outcomes in networks with differential overlaps in network subgroups often define social groups and their interactions, and may have powerful causal outcomes. Sociologists and Anthropologists often study causalities and outcomes of social structures and dynamics at small and medium size social networks, but these approaches are also scalable in networks of unlimited scale. Networks from small, medium and gargantuan scale, however, studied at any scale, may yield equally strong benefits of analysis. The concept and measurement of k-components in networks includes analysis of networks of any scale containing units with boundaries of largest k (and smaller, potentially overlapping) subsets of nodes sharing k or more independent links between them.

Very large-scale analyses, however, are computationally intensive. The recent work of Sinkovits develops an iterative exact solution to finding k-components that made it possible for the first time to analyze much larger graphs

Moody, James, and Douglas R. White. 2003. Structural Cohesion and Embeddedness: A Hierarchical Conception of Social Groups. American Sociological Review 68(1):1-25. 2004 Outstanding Article Award in Mathematical Sociology. American Sociological Association. Testing the Causal Effects of Structurally k-Cohesive Groups

Menger theorem

  • Menger theorem. The external k­cohesion of a group is its structural invulnerability to disconnection by removal of k of its members. The Menger (1927) theorem is that the external and the internal k-­cohesion of every subgraph are identical, which holds from their maximal or most extensive exemplars, that is, every vertex that has k links to a maximal k­-cohesive group is a member of that group.
  • Hi Doug,
  • Happy to hear that you’re feeling better. PDF of article is attached. I need to run, but we need to get together sometime in November to discuss the next project – parallel algorithm for determining k-connectivity of a graph
  • -- Bob
Sinkovitz -  June 1  - vertex connectivity v2.docx - 
  • Hi Doug,
  • As you know, I submitted our k-components paper last week. I got an email a few days ago that it was assigned to an editor and I expect that the review process will start very soon
  • While developing the software for the k-components paper, I realized that with some minor tweaks one of my algorithms could be adapted to be an easily parallelized**, general algorithm for determining the vertex connectivity of a graph. As far as I can tell no one has done this before – there are general/serial algorithms, special purpose algorithms and brute force/parallel algorithms, but nothing that is general, parallel and efficient.
  • The efficiency of the code is astounding. When applied to our largest 4-component (1832 vertices) using a single core, it takes about 5 seconds. For comparison, the connectivity() function in igraph didn’t complete in an hour.
  • When you have a chance, take a look at the very early draft of the vertex connectivity paper. A lot still needs to be fleshed out, but the key ideas are in place
Regarding authorship, I think that this should be limited to the two of us since it is a purely algorithmic paper.
— Bob
 Jeff Sale  Hi - 

Here are comments on the paper, I think this is great. -- we need to suggest three reviewers (see screenshot). We also need to submit brief biographies (100 word max) along with a passport-type photo for each author. I’m attaching a screenshot from a recent paper in this journal along with my bio and photo.

As for actual applications in social science, this cite might be good:

Mani, Dalhia and James Moody 2014. “Moving beyond stylized economic network models The Hybrid World of the Indian Firm Ownership Network” American Journal of sociology 119:1629-1669

Kreager, Derek A. Kelly Rulison, James Moody. 2011. “Delinquency and the Structure of Adolescent Peer Groups” Criminology 49:95-127

Moody, James & Peter J. Mucha. 2013. “Portrait of Political Party Polarization” Network Science.1:119-121

PTs Jim James Moody

Thanks Jim!

Btw, anyone who looked carefully at the pseudocode might have noticed some very weird notation. For example

@@v1, v2, v3$$

That was my hack to get around EndNote's interpretation of curly braces as manually entered references. After I’m all done with the manuscript, the above will be replaced with

{v1, v2, v3}

— Bob

End time: 16 Oct, 2012 14:00 EDT

Gordon Research Highlights June 6 2013

Modeling Human Societies http://www.sdsc.edu/News%20Items/PR060613_gordon_highlights.html

Doug White, a professor of anthropology at UC Irvine, is interested in how societies, cultures, social roles, and historical agents of change evolve dynamically out of multiple networks of social action. In the 1990s, he and his colleagues discovered that networks could be meaningfully analyzed by a new algorithm that derives from one of the basic theorems in graph theory: the Menger vertex-connectivity theorem announced in 1927.

Computer scientists had coded Menger's edge-connectivity theorem, known as the Max-Flow (Ford-Fulkerson) network algorithm, used to optimize the routing of traffic. But Menger's theorem also proved that vertex-connectivity – the internal cohesion of some minimum number of disjointed paths between every pair of members in a group – is equivalent to the group’s external cohesion, or resistance to threat of disconnection by removal of group members.

Finding the boundaries of cohesive subgroups, however, remain a challenge for computation, far more challenging than finding cliques of fully connected sets of vertices within a graph. White and his team are currently testing their recently developed software on synthetic networks constructed from well-known models, and plan to apply their methods to data sets derived from networks of co-‐authorship, benefits of marriage alliances, and splits in the scientific citation networks in periods of contention.

“The most computationally difficult step in the team's design of new high-performance software requires the construction of the pairwise cohesion matrix (PCM), which in turn requires the calculation of the number of vertex-independent paths between all pairs of vertices in the network,” said White.

The researchers ported the application to Gordon’s vSMP nodes following guidelines provided by ScaleMP. For larger graphs, such as a 2400 vertex network built using the Watts-Strogatz small-world model, near linear scalability was achieved when using all 256 cores (an estimated 243x speedup relative to one core).

“We’re coordinating teams at UC Irvine, UC San Diego, the Santa Fe Institute, and Argonne National Laboratory so that after 130 years, we finally put comparative research on a solid footing of replicable scientific findings that are comparable from small to large and super-large datasets,” said White.

More on White’s research can be found at DRW Google Scholar

Story: Jan Zverina

CoSSci4HamburgINSNA-3dw.pptx

Longitudinal Network Studies and Predictive Social Cohesion Theory

Finding k-cohesive groups in large graphs

Nov 3 2014

Hi Doug,

I've made a little more progress, but would prefer to wait until next week to meet since I still have a few things that I'm working on and some other obligations this week. In the meantime, here's a quick update.

--- 3-components --- The technique that I used (described in previous email) for finding the 3-components is "mostly" reliable. The reason I say "mostly" is that it's still a little overly generous in deciding when a group of nodes is a 3-component. That said, it's not a serious problem since I go back and explicitly confirm that we really found 3-components. The largest one is correct (10177 nodes, contains no 2-vertex separators). This just affects the small clusters that are trimmed off the fringes of the graph. None of the other 3-components contain more than about 20 nodes and we are completely dominated by the 10177-node gorilla in the room.

--- 4-components --- Starting with the 10177-node 3-component, I take a similar strategy to search for the 4-components. For now, I'm just ignoring all of the smaller potential 4-components along the way. They are all tiny (< 20 vertices, with most much smaller)

   Trim off all of the "isolated" nodes belonging to cliques with three "worldy" vertices
   Do a quick search for 3-vertex separators. Make the assumption that most 3-separators will contain a pair of neighboring nodes. So, for each edge in the graph, grab the two end-point nodes, subtract from the graph and look for articulation points. Use the identified separators to continue nibbling away at fringes of graph.
   Reduce the results to 4-cores
   Look for the largest bicomponent in the largest 4-core

At this point, the upper limit on the size of the largest 4-component is 4076 nodes. This is about half of what we thought in our original incorrect analysis. The next steps will be to find the largest 3-component within the largest bicomponent and then do an explicit search for all possible 3-vertex separators. The latter step is now doable in a reasonable amount of time since the search space has been reduced so much.

--- 5-components --- The search for 5-components will be a bit more challenging and I still need to think about strategy. One encouraging point is that the upper limit on size is fairly small (2472 nodes, found by 5-coring the largest bicomponent in the largest 4-core described above). Undoubtedly it will come down a bit more.

-- Bob

From: Doug White <douglas.white@uci.edu> Date: Wednesday, October 29, 2014 2:32 PM To: Robert Sinkovits <sinkovit@sdsc.edu>, B_Tolga <boztan@uci.edu> Subject: Re: Progress on k-components


This is great news, Bob ... am just back from my son Scott's house in Seattle (w grandkids...) let me know when I should visit again

On 10/29/14 2:12 PM, Sinkovits, Robert wrote:


Hi Doug,

Just want to let you know that I made some recent progress on the k-components problem. I need to make a few minor tweaks to my code, but I believe that I'm at the point where I can identify all of the 3-components and therefore fill in all of the 2s in the pairwise cohesion matrix. It appears that my previous approach (2 weeks ago) was missing a lot of the 2s for reasons that I now understand.

The new approach involves iteratively

  • Breaking the graph into 2-components
  • Identifying 2-separators within the 2-components
  • Breaking up the 2-components at the 2-separator boundaries
  • Looking for resulting clusters that do not contain any 2-separators (these are the 3-components)


As 3-components are identified, edges that can't contribute to the remaining potential 3-components are selectively trimmed from the graph. In this manner, we basically nibble away at fringes of the graph until we find the largest 3-components at the heart.

Assuming that I've done everything correctly, our primary 3-component has 10177 nodes.

I'll keep you posted once I know more.

-- Bob

Late 2013 Early 2014 Doug had asked if there was a two mode version of the data. There is now. I pulled the full data (so not restricted to the largest component) from the base file used to create the file you’ve been working with and generated two files, first is simple .txt the 2nd is the same but tab delimited. The files are here:

http://www.soc.duke.edu/~jmoody77/SocAbs/twomode_fordrw

http://www.soc.duke.edu/~jmoody77/SocAbs/twomode_fordrw_tabdl (tab delim)

The first column is the unique AUTHORID (no promise this matches the CA Clean file you are currently working with – it might, but I can’t recall if I renumbered that old file to be 1-n within the largest component or not … sorry!)

Hi James,

I've attached a brief summary of the work that I've been doing on induing k-cohesive groups in large graphs. This is still a work in progress, but I'm hoping that in the next few weeks I'll start the computationally intensive part of the project.

Tolga, Doug and Steve - this summarizes work that was spread across multiple emails.

-- Bob

Finding k-cohesive groups in large graphs - Large Graph Processing.doc

Symposia and talks

XSEDE13 talk submitted

The next ECSS symposium is Tuesday 10/16, 10am Pacific/noon Central/1pm Eastern. These are held the third Tuesday of every month and feature the work of XSEDE’s Extended Collaborative Support staff.
Login with the passcode at http://www.atconference.com/web-conferencing/login.php
Connection instructions and abstracts for the talks are also at http://www.xsede.org/ecss-symposium.
There are two 30-minute talks this month. The first features Yifeng Cui from SDSC discussing his work with PI Thomas Jordan at the University of Southern California. High frequency calculations requiring millions of CPU hours are necessary to generate physics-based seismic hazard models. The team has developed a highly scalable 3D finite difference GPU code and will present the performance results on Fermi M2090, the new XSEDE Keeneland system. The title of the talk is “Accelerating Cybershake Calculations on Heterogeneous Architectures”.
Our second talk features Alan Craig from NCSA who has been added to the ECSS team for his expertise in the digital humanities. Alan will provide an overview of several projects that are currently underway, address challenges that confront researchers from these disciplines, and assess lessons learned. The title of this talk is "Humanities Computing on XSEDE".

These talks will be recorded for asynchronous viewing. ECSS retrieval Passcode: 4906909# see email

Gateway process

CoSSci

Complex Social Science Gateway --> http://socscigate.oit.uci.edu <--

Help XSEDE

Posted on 11 Oct, 2012 18:45 UTC by Nancy Wilkins-Diehr How to Turn Your Project into a Science Gateway / How to Turn Your Project into a Science Gateway
https://pops-submit.xsede.org/auth/TGUP_POPS/main/cgi/proposal_top.cgi?prop_id=1153673&mtg_id=530227&mode=Edit&login=drwhite&POPSperson_id=1735201&supported_action=Edit&style=new_entry&data_reentry=0

p_p_id=adduser_WAR_adduserportlet_INSTANCE_5wP8&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_adduser_WAR_adduserportlet_INSTANCE_5wP8_ACTION_REQUEST=SUBMIT_ADD_REMOVE_REQUEST&_adduser_WAR_adduserportlet_INSTANCE_5wP8_CONFIRM_TYPE=ADD_USERS Add users]

TG-DMS120027 Project PI: White, Douglas Charge No.: TG-DMS120005

Expanding allocation requests for tasks

Project PI: White, Douglas Charge No.: TG-DBS120027

               port the codefor k-cohesion and test the Fortran clique detection algorithm on Gordon

CV

XSEDE Education Allocation

Enter my site for revisions: https://portal.xsede.org/submit-request
https://www.xsede.org/group/xup/profile/-/profile/view
Start at https:https://www.xsede.org/web/guest/new-allocation
New allocation:https://www.xsede.org/web/guest/new-allocation#opts 
POPS Welcome:- https://portal.xsede.org/submit-request
Enter Portal: choose Educational  Resources: The new SDSC Appro Cluster, Trestles, went into production on Jan, 1 2011. System Info: 100TFLOPS; 324 compute nodes, 10368 cores (4 8-core 2.4 GHz AMD Magny Cours processors/node), 20.25TB of RAM (64 GB/node, 2GB/core), and 38TB of flash disk (120GB/node); QDR IB. Research requests are limited to 1.5M SUs, except for Gateway requests, and a maximum use of 1024 cores(32 nodes).
https://portal.xsede.org/web/xup/allocation-request-steps
Login 2 Begin: https://portal.xsede.org/web/xup/allocation-request-steps#opts <--
Click: Renew/Submit request, middle of page
Click Enter at: https://portal.xsede.org/submit-request
YOU'RE NOW AT:        https://pops-submit.xsede.org/auth/TGUP_POPS/main/cgi/index.cgi
Click EDUCATIONAL at: https://pops-submit.xsede.org/auth/TGUP_POPS/main/cgi/index.cgi POPS Submit 
USE Safari to fill: https://pops-submit.xsede.org/auth/TGUP_POPS/main/cgi/proposal_top.cgi?POPSperson_id=1735201&login=drwhite&mtg_id=530217&style=new&prop_type_id=2&prop_type=New&mtg_type=Continuous&mode=Submit
https://pops-submit.xsede.org/auth/TGUP_POPS/main/cgi/proposal_top.cgi?
https://POPSperson_id=1735201&login=drwhite&mtg_id=530217&style=new&prop_type_id=2&prop_type=New&mtg_type=Continuous&mode=Submit
https://pops-submit.xsede.org/auth/TGUP_POPS/main/cgi/proposal_top.cgi?POPSperson_id=1735201&login=drwhite&mtg_id=530217&style=new&prop_type_id=2&prop_type=New&mtg_type=Continuous&mode=Submit
https://portal.xsede.org/group/xup/allocations/usage
https://portal.xsede.org/

Community Certificate Application Screenshots

Screen Shot 2013-02-22 at 4.04.49 PM.png
Screen Shot 2013-02-22 at 4.01.38 PM.png
Screen Shot 2013-02-22 at 4.22.01 PM.png

Change in Project title to "UCI Social Science Gateway"

Ticket Subject: 'TG-DBS120005 - change project title to "UCI Social Science Gateway"'

Ticket number 223554 has been opened for you Ticket Result Summary: DONE

Courses

Predictive Analytics for Big Data Dec 11 at SDSC

Posted on 05 Dec, 2012 03:37 UTC by Diane Baxter A new message has been posted to XSEDE User News.

Categories: Training, Gordon (SDSC)

Start time: 11 Dec, 2012 09:00 PST End time: 11 Dec, 2012 11:00 PST

San Diego Supercomputer Center offers an introduction for researchers seeking to extract meaningful predictive information from within massive volumes of data. During the 2-hour workshop, participants will get an introduction to the field of predictive analytics and learn to use a variety of data analysis tools to discover patterns and relationships in data that can contribute to building valid predictions.

This is the first in a series of workshops in Predictive Data Analytics presented by the SDSC Predictive Analytics Center of Excellence.

    • Agenda**
  • 9 AM – 9:40 AM: Overview
    • Data Mining Trends and Applications
    • Data Exploration / Statistics
  • 9:40-10:40 AM: Data Mining Tools
    • Weka / R Tutorial
  • 10:40 – 11:00 AM: Questions and Disc

For registration: https://www.xsede.org/web/xup/course-calendar


This message was generated by XSEDE User News v3.0. You can view this message in the XSEDE User Portal at https://portal.xsede.org/web/xup/user-news/-/news/item/6080.

You can manage your XSEDE User News subscriptions at https://portal.xsede.org/group/xup/user-news/-/news/manage-subscription.

Summer School

SDCS Summer School 2012

Our UCI-SDSC team results on k-cohesion

Irvine Social Science Gateway at XSEDE

  • Cross-Cultural Research with multiple datasets SCCS, etc.
  • Multiconnectivity in Networks
  • Large Kinship networks
  • Big History

http://intersci.ss.uci.edu/wiki/index.php/Dow_and_Eff_Simple_Functions_Moral_Gods_9A

This could be copied and pasted into R for now. It should work for any recent R installation. Later on we can re-tune what is passed to a Trestles "Irvine Social Science Gateway"

Bob, could you give me an instruction on how to paste this script into Trestles R (which may require installation of the packages that are called)? thanks

-- Doug White http://intersci.ss.uci.edu/wiki/index.php/DW_home

Steve

SDSC Summer 2012 Workshop Doubleday notes

Gordon k-cohesion

Our Key project #1 for Aug 6-10th SDSC Workshop

Bob and I discussed last week.  Steve
On Wed, Feb 20, 2013 at 11:57 AM:
   3    1    2    ok 3 nips between 1 and 2
   -99    4    3   
   4    4    5    ok 3 nips between 4 and 5
   3    5    6    ok 3 nips between 5 and 6
   3    6    7    ok 3 nips between 6 and 7
   and thats a format for input to the clique detection --> then entered into David Epstein program

Dear Bob, as you will recall, one of the things we want to do this week is to compute matrices of pairwise cohesion for large undirected uniform edge-weighted networks. Then for any pair of nodes i,j, the pairwise cohesion is the minimum number of nodes on i-j paths whose removal disconnects them (Menger Theorem). It turns out that if each of two adjacent nodes on an i-j path is replaced by two nodes with a four-cycle among them and the new pairs of nodes adjacent to i and j have an incoming and outgoing arc respectively, then pairwise cohesion is calculated by a Ford-Fulkerman maxflow algorithm.

SECOND STEP PROPOSAL (Sept 14 2012). we whollopped the pairwise cohesion computation (pcc) on Gordon and now I think the solution to part 2: from pcc to structural cohesion is this (not a clique detection algorithm but:

White, Douglas R.,and Mark E.J. Newman. 2001 Fast Approximation Algorithms for Finding (Multiple) Node-Independent Paths in Networks Santa Fe Institute Working Paper 01-07-035.

I.e., Newman's approximation algorithm (naa) executed thus:

As a test of an hypothesis about "your" type of these problem, e.g., in the 100K sociology coauthorship article that we have (I can ask for all the supplemental data): "the k-cohesive groups with the highest k are the most (intellectually and otherwise) productive.

We could also modify depth first naa and the pseudocode to specify a maximal depth.

ASK MARK: mejn@umich.edu Dear Mark: Our IMBS grad student Tolga Oztan has programmed in R the iGraph computation of k-connectivity into Ford-Fulkerson pairwise cohesion PC matrix
Would it be easy for us then - to use your pairwise cohesion code (NPCC) via Rcpp as in http://dirk.eddelbuettel.com/code/rcpp.html R code?
If so, could you send the C+ code you wrote for our article?

pseudocode:

1 take the subgraph with nodes having the highest integer h from the PC matrix
2 run clique detection on that subgraph (all such PC cohesive h cliques of size m =< h are m-cohesive).
3 reduce h by 1 to h', if h'=2 stop (bicomponents, including giant bicomponent)
if h>2 go to 1

I managed to get the code working correctly. For some of the large test problems (n=2000,3000), the program exactly reproduces the results (e.g. Largest clique, together with membership) listed in the paper. Note that this only works exactly for density=0.2. My guess is this is due to difference in the floating point implementations back when this paper was published. It wasn't until the late 90's that everyone started using the IEEE standard.

I'll keep you posted. -- Bob

REJECT: In 2011 Felix Halim, Roland H.C. Yap, and Yongzheng Wu devised a parallelized MapReduce-Based Maximum-Flow Algorithm for Large Small-World Network Graph in their ICDCS11 paper. The paper shows for a relatively small-world graph (moderate average distance) the the order of this computation is O(|f*|/A) where A is the augmenting paths per round and f* is the maxflow value which in our case is the number of node-independent paths. Parallelized, this is on the order O(|f*|), i.e., extremely efficient.

What we need is someone to examine this paper and help us parallelize this computation for one of the languages we can use in the workshop.

Our Key project #2 for Aug 6-10th SDSC Workshop

Parallelized Clique Detection: An all-black pdf so do not print.

We will apply these two algorithms successively for two large internet datasets and for a very large network of overlapping social science authorships (128151 authors of ca. 29,426 articles - White, Douglas R., Jason Owen-Smith, James Moody, and Walter W. Powell. 2004. Networks, Fields and Organizations. Computational and Mathematical Organization Theory 10:95-117). They will also apply, even more easily, to genealogical networks from smaller size to 100,000 couples.

Our Key datasets #1 for internet from Maksim Kitsig

Our Key dataset #2: Social Science articles database from James Moody

We have the database for this graph.
see: Tutzauer?2011SunbeltEmbeddednessBlind.doc (newMAC:desktop)

A dataset #3 for N-P-d-C genealogies

1

Hi Doug,

We can port the code and do a few benchmark runs using either my staff account or a small director's discretionary account that was setup for CAIDA. For anything more extensive you'll need to apply for a startup account. If you're interested, I can give point you in the right direction.

Cheers, Bob

Generation_permutation_code#Tolga_Oztan_code_Nov_11-11 - Tolga_Oztan_code_Nov_11-11 Tolga Oztan Tolga_Oztan_code_Node Independent Paths May 2011

2

But -

I didn't quite parse your initial comment to me correctly

> to Robert should I want to run R or Fortran for cohesion algorithms

Since R is an interpreted language, there are more pitfalls that can impact performance. For example, doing a loop over array elements can be a lot slower than using vector expressions. Also, I'm not sure how to parallelize R code. Fortran gets you closer to the hardware and the Intel compilers should give you great performance. It would also be easy to do a shared memory parallelization of Fortran codes using OpenMP directives. C++ Fortran R

That said, if you already have a functioning R code, I don't know if it would be worth the effort to rewrite in Fortran. If you're willing to bear with us, it can be a great opportunity for us to learn about parallel R. As we attract more users from the social sciences, this is something that we would need to master anyway.

-- Bob

3

Hi Doug,

Here's a link to the book

http://nostarch.com/artofr.htm

-- Bob

4

And here's a link to a great OpenMP tutorial

openMP 45 pages

-- Bob

5

Hi Doug,

Here's the email I had sent you a few weeks ago regarding the SDSC Summer Institute. No guarantee that we have spots left since priority is given to those who apply by June 8th, but we can probably slip in one or two more.

Cheers, Bob

From: Robert Sinkovits <rssinkovits@mail.ucsd.edu> Date: Thu, 7 Jun 2012 21:10:29 -0700 To: Doug White <drwhite@uci.edu> Subject: Gordon Summer Institute

Dear Doug,

Thank you for the excellent seminar today. It was well outside my field, but I found it very interesting and now have a lot of new reading that I want to do over the weekend.

This is late notice, but you may want to consider sending one of your students or postdocs to attend the SDSC Summer Institute being held here on August 6-10. Priority is given to those who apply by June 8th (tomorrow) and we have scholarships available to cover on campus room and board. Feel free to forward this to your colleagues and anyone else at UC Irvine who you think might be interested.

http://www.sdsc.edu/Events/summerinstitute/index.html

Over the course of the week we'll be covering a variety of topic including the SDSC compute hardware, shared memory parallel programming, performance tuning and optimization, visualization, and developing scientific gateways to make applications available through the web.

Sincerely, Bob Sinkovits

Robert Sinkovits, Ph.D.

SDSC Applications Lead & Summer Institute Technical Chair

San Diego Supercomputer Center

University of California, San Diego 9500 Gilman Drive, MC 0505 La Jolla, CA 92093-0505

sinkovit@sdsc.edu

(858) 822 0995

6

SDSC 2012 Summer Institute:
Big Data Supercomputing
Apply
Please enter your information below. If you are accepted, we will contact you with further instructions.

SDSC Summer Institute
Registration Verification
Please review the following information for accuracy and click the "Register" button to complete your registration. If you need to make corrections use your browser's back button to return to the input form.

First Name = Doug
Last Name = White
Organization/Institution = UC Irvine
Department = Math Beh Sci Mathem.Anthropology
Address = 8633 C Via Mallorca, La Jolla, CA 92037
Phone = 858-457-0077
Title = Professor emeritus
Email = drwhite@uci.edu
Graduate student? = no
Postdoc? = no
Scholarship consideration? = 
Summary of work = Am MathSocSCI Prof emeritus at UCI, live 3 blocks from UCSD campus, massive computing projects underway, 
HPC Experience = Massive: world databases did an SDSC project in 1988 http://eclectic.ss.uci.edu/~drwhite/WorldTrade/Cray.pdf have major new projects, worldwide databsses (economy, cross-national etc) reference: Robert Sinkovits FORTRAN R PYTHON need to learn MPI, OPENMP 
Research Problems = Restudy of 1988-1992 world tradenetworks O(n^4), 100s of weighted edges in large internode world trated networks, my REGGE equivalence world network structure massive FrenchNSF kinsources datasets of kinnetworks of 100,000+ individuals in French and Italian cities also use MCMC and latent structure algorithms. Need to parallelize R (ref: Sinkovits) and my Fortran codes. Need to update my 1990s SDSC capabilities

7 XSEDE

XSEDE is a single virtual system that scientists can use to interactively share computing resources, data and expertise.
People around the world use these

Gordon supercomputer

[ ] terminal -> drwhite$    ssh train73@gordon.sdsc.edu
at Gordon: train73$gordon -ln4 ~]   (needs passwork)
Get a portal account : STARTUP account 1 paragraph Allocation: 12 pp application
website tutorial for parallelizing
  incrementally
  OpenMP parallel do
   works on threads, at end join
   or distributed, across machines
R: avoid loops
 not x(i) = y(i) + z(i) 
 but x = y + z
R vectorization
R interpreter: ONCE
I.e., Distribution across machines VS. Sharing memory
Hyperbyte
 16 Gordon compatible
 1 terabyte in memory
OpenMP tutorial

VisIT Visualization workshop Oct 25 2012 Amit Chourasia

first section

[ ] terminal -> drwhite$    ssh train73@gordon.sdsc.edu
  • DO: COLOR MAP (pseudocolor plot) - good for range of data - diagnostics
  • DO: GEOMETRICS - contour maps 2D: contours 3D: Isosurface (matching cubes, marching tetra) color and contour (Tet mesh Tri Mesh) EXPLICIT plot as is, e.g., simulation, mesh adapts, is the mesh what you expect?
  • DO: VOLUMETRIC (transparency EXPENSIVE) - can see through, e.g. jellyfish: density mapped to color; opaque/transparency (transfer function=EXPENSIVE - sampling interval - output resolution) investigate INTERIOR ---
Qn: Voxel e.g. isosurfaces? Dist of temps? data as random variables - discuss later

e.g low temp transparent, hi opaque -> see whatever you are after

  • DO: STREAMLINES immediate flows - say where they start computing, how often, is guesswork, NEW TECHNIQUE STREAM SURFACES (coming) expensive
LIC Line Integral convolution for FLOW hard to quantize; color lines as well
GLYPHS: arrows for vector field (DIRECTIONS) shading on ellopsoid from multidimensional (cant see inside), works with earthquakes, vibration (doesnt work for flow)
  • TOPOLOGICAL rendering underlying data, local maxima minima saddle points. topological tree
  • SKIP: 2D Tensor 3x3 matrix, streyching scaling rotating 9 quantities maps to a hemisphere - highly MATHEMATICAL rotate stretch scale
  • OTHER: DO: PARALLEL COORDINATES e.g., network
HI DIM: 3D find trends Time verticals, voxels 3D cells 2D Piecewise connections (see trends, converge in time)
Isosurface: 3D, what trying to tease out, set something, see something
  • OTHER: Chord
  • OTHER: Tree: e.g., Dendograms, Sunbursts, Treemaps, etc
  • OTHER: Others
  • VIZ APPS
Communication
Confirmation <- key
Inspection and Exploration <- key (also do earlier, find things)
  • CONFIRMATION : e.g., Anscombe's Quartet (conjured up) stats the same KEY EXAMPLE, e.g., regression line scatter, curve, outlier, cluster ACCCR
  • INSPECTION Earthquake movie (boundary failure after some time - shows faiiure of sumulation) 2D color maps "NAN" inspection
Mesh Topology, Decomposition and Data Inspection - recomposition not correct - data dislodged from continuity
port from CPU to GPU simulations appear similar found error found division by 0, some kernals 32 bits 64 bits. Fixed relative error over time
Comparison
Collision and merger
RESOLUTION: Hi Res Monitors
Pipeline data: how to connect and plot data. Ideal processes are all independent: data, filter, etc.
Grey scale often best - DOES INTENSITY BEST. Eyes migrate to e.g., to green rather than higher importance. USE HSV scales rather than rainbow color, INTENSITY most important. http://blog.visual.ly/rainbow-color-scales.
Dont invent your own format hdf mmpf etc.
Write non-technical captions
  • EXPLORATION

2nd section

free help available on all sorts of specialties

Gordon session

Cities

 Nord Pas de Calais 1500-present
 Florence 1200-1500
 Leghorn (Livorno)

Tolga Oztan

Generation_permutation_code#Tolga_Oztan_code_Nov_11-11 - Tolga_Oztan_code_Nov_11-11
Tolga_Oztan_code_Node Independent Paths May 2011
Why not do Clique detection on the NIP matrix?

Computing large network k-components

an additional use for Gordon
since we have integer matrices for node-independent paths on big networks
to do Clique detection for each integer value and get Menger k-components altho this is O(NP hard)?
thats what Gordon can do, right? cut right thru NP hard algorithms?
Is there a Clique detection algorithm for Gordon, already?


paragraph for startup account on Gordon:

Among other things: analyses of large networks (CAIDA internet, world economy, financial industries, intercorporate linkages, interlocking directorates, etc.) using our Menger-theorem algorithm that gives the complete matrix of node-independent paths for each network, then using an NP hard algorithm for clique detection to calculate the k-connected subgraph structure of each network. Hypothesis testing will predict outcome variables from the large and giant k-components of these networks, starting with internet data from the SDSC Caida project (collaboration with Matsim Kitsak, Dmitri Krioukov, and others).

The code to be parallelized is Tolga_Oztan_code_Node_Independent_Paths_May_2011. For clique detection:

http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Steven-Kapturowski-Sp-2009.pdf[PDF]
A Parallel Algorithm For Maximal Clique Detection
www.cse.buffalo.edu/faculty/.../Steven-Kapturowski-Sp-2009.pdf

Startup account

Hi Doug,

Here are some quick instructions for applying for a startup account. Note that you'll have the opportunity to ask for time on systems around the country, but since you'll only need a modest amount of time to get started, you already have a working relationship with SDSC and you don't want to be spending a lot of time moving data, just go ahead and ask for time on our systems.

I've talked mostly about Gordon, but we have a second system, Trestles, that may also be ideal for your needs. My suggestion is that you ask for 50,000 hours on each. That will give you a chance to get some serious work done and to benchmark your applications.

(1) Go to https://www.xsede.org/ and create a portal account. You'll see a link on the main page that takes you to https://portal.xsede.org/. Just click on the "Create Account" link and follow the instructions. Tolga and your other students should create portal accounts too.

(2) After you've created a portal account and signed in, go to the Allocations tab and select "Submit/Review Request" This will take you to the page https://portal.xsede.org/group/xup/submit-request

(3) Click on the "Click to Enter or View a Request" button. This will take you to a new page with a link that is specific to your account. Click on the link labeled "startup".

(4) From here the forms should be pretty obvious. You DO NOT need to write a full proposal. Just an abstract is fine. The abstract will only need to be from a few sentences to a few paragraphs. Describe your work and give some justification for why you need to use a supercomputer. This might include the inability to do the calculation on local resources in a reasonable amount of time, large memory requirements, have (or plan to develop) parallel code. If you expect to use the flash drives for fast scratch space, be sure to mention that too.

(5) When you have a chance, visit the other links under the Allocations tab. You'll see a lot of information about policies, eligibility and writing a full research proposal. For now the key info is

(a) A grad student cannot normally qualify to be the PI, but can be a co-PI - B. Tolga Oztan added as Co-PI

(b) The startup limits for Gordon and Trestles are 100K and 50K, respectively.

Cheers, Bob

Project

https://portal.xsede.org/help-desk

Title: Computational Structurally Cohesive Subgroups in Large Networks

Description: Computational results of Menger's theorem (1928) for k-connected subgraphs equivalent to maximal no k-node subsets are now feasible on supercomputers using results from a new transformation by Gabor Csardi (iGraph R package) that permits Ford-Fulkerson computation of maxflows that result computation of pairwise k-cohesion measured in a matrix of nodes of size n x n for a network of n nodes. This matrix is analyzed for cliques of nodes with 1 to k-connectivities. This is inmensely computationally intensive. We begin with a 90K network of coauthorships in sociology publications, move to internet links, patent networks, biological networks, many kinship networks of ca. 100K nodes, and other datasets. Multiple Cohesive Subgroup measures in 20 previous studies of smaller networks have been found to be highly predictive for multiple "effects of cohesion" hypotheses in different scientific fields.

Fortran code for Erdos graphs fitted formula for giant components, bicomponents, 3-components ||, 4-components, 5-components, 6-components

REGDI.FOR C DIMENSION MTRX (6,3,10000,10000,10000) #if saved -> next line if computed interatively

     DIMENSION  MTRX (6,3,10000,10000) 
     DIMENSION  MTRC (6,3,10000) 

C 1 000 000 000 000 (i.e., 18 Trillion) can be reduced to 1 000 000 is we compute the largest 4-component in each E loop, find longest chain

     COMP=3
     SIZE   =3
     DO C=1,COMP
     SI      =10
     DO S=1,SIZE

C SI=1000, 1000, 10000

     SI=100*S1*S

C MAIN LOOP FILL S by S matrix of diagonal randomly

     DO E=1,2*SI

C IRND1, IRND2 are integers between 1 and SI, the dimension of the matrix

     MTRX(C,S,E,IRND1,IRND2)=1

C don't save each 5-dimensional matrix -- NO NEED FOR SUBROUTINE JUST DO IT HERE WITH NEW IRND1,IRND2 entry COMPUTE FOR 1000x1000 10000x10000 etc SUBSETS OF MTRX (10000,10000) the largest component/bicomponent of order(n^2)/Gabor Gsardi tricomponent of order(E*n^2) C COMPUTE AND SAVE largest new entry in MTRC (6,3,10000)

     END DO
     INITIALIZE MTRX TO ZEROS
     END DO
     END DO

Data

we have two internet datasets from SDSC

Fantastic…I may be able to share our physician network too, beside the Sociology coauthorship; will ask…PTs

  • Jim Moody


Would be cool. Would love to be able to do structural cohesion on large graphs and see if there are new kinds of findings. The new 2012 code from Gabor iGraph is in R, challenge will be to see if we can vectorize R, its a challenge for our supercomputer guys as well.... Doug

From: James Moody Subject: Re: supercomputer for structural cohesion (your facebook post)

Yeah…R is a pain. The underlying architechure just does not scale well … I think for most big things (like the million-plus step MCMCs that ERGMs run); the best way is to port the actual computation to C…. PTs Jim

Theory and Hypotheses

Heading: Increasing Transparency, Trust, and Creativity in Interactive Networks

In the references at http://intersci.ss.uci.edu/wiki/index.php/Topical_publications:_Douglas_R._White (these_numbers_will_change_as_items_are_added) see

25 Structural cohesion/connectivity w Harary
28 Structural cohesion w Moody
31 Biotech w/ Powell et al
29 Biotech w/ Powell Moody et al
45 Networks, Trade & Globalization Policies
44 Cohesion and Resistance
68-95 Kinship networks (also book 40 w/Ulla Johanson)

(for topics above see Structural cohesion, Cohesive blocking, SDSC Supercomputing

  • Each provides evidence that subgroups with structural cohesion in social and biological networks, which are precisely bounded, often overlap and in which k-cohesion for k>1 logically implies k-1 cohesion, are often predictive (if well specified) of "effects of cohesion" that derive from the the levels of integration and resistance to separation specified by their cohesive measure, k.

It follows that in very large networks, where there are typically a maximally k-cohesive group and sometimes independent or overlapping less cohesive groups, the transparency and robustness of these subgroups will covary with levels of cohesion. which may also be a source of greater creativity and productivity.

To test such hypotheses, a range of such groups are identified, and measurements of creativity, productivity and success are devised. For the creativity and productivity. For the patent network, for example, indices are create for a sample of effective usage of the patents, as well as their citations in other patents (which is part of the network). For the 90K network of sociology coauthorship networks (or (N=29,462 authors) of papers listed in Sociological Abstracts from 1963 - 1999),, potential key informants familiar with more than one group are asked to rate levels of creativity and productivity of different groups.


Heading: Economic networks, Demographics, and Dynamical instabilities

33 World Econ w/ Smith (see 21 w/Rietz)
36 World Econ w/ Reichardt
34 Flow centrality w/Borgatti, Freeman (defined in my earlier paper on world economy and shown in a ppt to entail the growth of capitalism in Europe post 1450)
39 unpub World Population
46 Problem of autocorrelation w/…
51 Weak & strong ties (contra universality of Granovetter(
55-56,63 Tokyo production hierarchies & econ theory of prices w/ Nakano
57-58 Global econ problems w/ Schweitzer et al
64 unpub (w others) Cultural Consequences of Regionally Fluctuating Inequality
68-95 Kinship networks (also book 40 w/Ulla Johanson)


Heading: collaborative projects

the core idea here that might be relevant to Steve Doubleday's problem of success of large scale systems programming problem is that structural cohesion and interlock within or between relatively tight cooperating groups have a much greater chance of coordinating their epistemic frameworks a la Collins and other work on epistemic frames and learning

A paper by [[Balazs_Vedres#Structural_Folds}Vedres and Stark]] is relevant to this discussion

Proceed

https://www.xsede.org/using-xsede

https://portal.xsede.org/

https://portal.xsede.org/group/xup/accounts

DW,

When you are in the proposal application form on the left is a link for resource request, if you select this you will be taken to a list of XSEDE resources scroll down you and you should see the resource you are looking for, select it and you will be taken to this resource where you can enter 50,000 for your resource request. Once you are satisfied you have completed all of the needed information select "save to date" on the left and then "final submission" if everything is in order, the box on the left should now saw submitted instead of incomplete.

Let me know if you still are experiencing problems about doing this.

Ken Hackworth XSEDE Allocations

FROM: Hackworth, Ken (Concerning ticket No. 217191)

SDSC Summer Institute

Preparing for the SDSC Summer Institute You will get the most out of the SDSC Summer Institute if you come well prepared. By brushing up on your knowledge of Linux and installing all necessary software on your laptop before arriving, you’ll be able to focus your attention on the skills and topics that are most relevant to high performance computing.

  • Basic Linux skills.

Please remember that basic Linux skills are necessary to complete the hands-­‐on exercises. If it’s been a while since you’ve worked in a Linux environment, be sure to reacquaint yourself with the following topics: copying, listing, deleting and renaming files; using wildcards; navigating directories; changing file permissions; setting environment variables; using common utilities (grep, cat, less, head, sort, tar, gzip). A nice tutorial can be found here http://www.ee.surrey.ac.uk/Teaching/Unix/. You should also be comfortable with one of the standard Linux editors, such as vim, emacs, or nano.

  • Connecting to SDSC’s HPC systems.

Since you will be using your laptop to access SDSC’s HPC systems, it is essential that you be able to run secure shell (ssh) or a similar connection tool with X11 forwarding enabled.

For Mac users, running ssh is trivially easy. Just launch the Terminal application and then connect with ssh from the command line

$ ssh ­‐X username@hostname
  • Windows users

will need to run an X Server and an ssh-­‐like client. Cygwin provides a comprehensive Linux-­‐like environment and an X server (Cygwin/X)

http://www.cygwin.com/
http://x.cygwin.com/
  • Another popular option (for Windows users)

is Xming, which comes with a PuTTY ssh client

http://www.straightrunning.com/XmingNotes
http://sourceforge.net/project/downloading.php?group_id=156984&filename=Xming-­‐6-­‐
9-­‐0-­‐31-­‐setup.exe
  • SDSC accounts.

If you have a pre-­‐existing SDSC account, please make sure that you can login to Gordon and Trestles. If you forgot your password, please visit

http://www.sdsc.edu/us/consulting/password.html

If you do not already have an SDSC account, you will be provided with one on the first day of the Summer Institute.

  • Laptop

software installations While many of the hands-­‐on activities will be run on Gordon and Trestles, you will also be asked to do some of the exercises on your laptop. In preparation for the Summer Institute, we ask that you install the following software before you arrive

  • R (statistical programming language)

Follow “download R” link in “Getting started” section of home page

http://www.r-‐project.org/

Download

  • WEKA

(data mining software) Weka-3-6-7.dmg 27.6 MB click bird "Environment for Knowledge Analysis" Follow “Download” link on left hand side of home page

http://www.cs.waikato.ac.nz/ml/weka/
  • VisIt

(visualization tool)

https://wci.llnl.gov/codes/visit/
https://wci.llnl.gov/codes/visit/executables.html
Mac OS X - Intel 64 bit

Darwin 10.6.3, Darwin Kernel Version 10.3.0, gcc 4.2.1, OpenMPI (Includes parallel VisIt compatible with MacOS X 10.6's default MPI) In install of VisIt they ask to check which supercomputers to use if any ... is that choice important? i.e. from Setup Host Profile list I clicked none.

TAU (profiling and tracing toolkit) v2.21.3 gave a tau_latest.tar.gz and unarchiving but left no executable

http://www.cs.uoregon.edu/Research/tau/home.php
http://www.cs.uoregon.edu/Research/tau/downloads.php
http://www.cs.uoregon.edu/Research/tau/tau-usersguide.pdf   97 pages: Printed 1-2, iii-VIII
  • Java

(required for science gateways tutorial) Most likely you already have Java installed, but you may want to confirm (go to Finder in Mac and click upload programs) - both OK

http://www.java.com

UNIX Tutorial for Beginners