Thanks, Peter, for the very helpful and "chewy" e-mail (below). I've managed to work out the SQLite problem (I think) it requires that you install the whole of R and set home and working directions within user space in Vista. I'll let the process run (it's still going) and will report back on outcomes.
It is not hitting both processors, so I'll probably switch over future work to a single processor system that's a little faster.
Oh, yeah, while I've got you. I found that the read.graph (format="pajek") command chokes on paj time code (e.g. [xxxx-xxxx]) on verts and edges. Not a big deal but perhaps worth mentioning in the documentation?
Thanks again. Jason-
Thanks for the quick response, Doug (and Hi, Peter, good to 'meet' you).
I'm ok posting the question if you snip the first and last paras from the e-mail. In the meantime, I've tried several variants (e.g. db=NULL, cutsetHeuristic=FALSE) none seemed to speed things up so I decided to see how long it would run.
Subject: Re: [Fwd: cohesive blocking in R] From: "Peter McMahan" <email@example.com> Date: Thu, April 23, 2009 1:18 pm To: "Owen-Smith, Jason" <firstname.lastname@example.org> Cc: "Douglas.White@uci.edu" <Douglas.White@uci.edu> Options: View Full Header | View Printable Version | Download this as a file | View Message details
Hi Doug, and good to 'meet' you too, Jason.
The read.graph() bug is something I hadn't come across. That function is not mine (it's from the igraph package), but should be easy enough to fix. I'll try to reproduce it on my machine and mention it on the igraph list. Do you have an example pajek file that doesn't work?
Re: cohesive.blocks performance: The short answer to your questions, Jason, is that that sounds about right. Your best bet for right now is to find a powerful machine that you can leave for several days and let it run. If you do this, I recommend getting the SQLite functionality working -- you don't want to get a memory allocation error two days into a long run.
The longer answer to you questions:
No thorough benchmarking has been done. That would indeed be a very useful thing to do, but to get an accurate picture you couldn't just use simple random graphs of various sizes/densities. The performance of the algorithm depends very much on the cohesion structure of the graph. Maybe some standardized way of iteratively constructing small- world graphs would be useful. I've put in a vague warning on the function's help file that things get very slow very quickly as size and/or density increase, but some real numbers would be excellent.
I have used the algorithm on a large subset of John Padgett's Florentiine marriage data -- over 700 nodes -- which took the better part of a week to complete. (That was the job that I originally wrote the function for). There are probably some small improvements to its efficiency now, but nothing huge.
There are basically two big choke-points for the algorithm: computation (finding the blocks) and memory (keeping track of what's been searched and what's been found). The SQLite functionality simply uses a database instead of system memory -- this takes care of memory problems but slows the overall perfomance down a bit. However if you run out of memery it will simply crash the function (or more likely crash R).
If processing speed is the main barrier there are two approaches. First, you can try the two alternate cutset algorithms (cutsetHeuristic at TRUE or FALSE). So far I don't have a way to figure out which is better for a given problem except by trying them out -- not much help here I know. The other approach is to use more processors. On your dual-core is the algorithm pegging both processors or just one? This point may be moot: I think the current version of the function doesn't actually include multi-processor functionality because it was creating more bugs than it was solving. I'm thinking about modifying the code for the next version to include some sort of simple multicore functionality (but not distributed computing like it had before). And if R ever gains some GPGPU functionality I could probably throw the graphics card at it too :).
Re: verbose: I almost always have this on. It can be hard to make sense of the output, but the important aspect is if it is changing at all. The computationally intensive part of the algorithm is identifying all of the vertex cutsets within any subgraph, and with verbose==TRUE it will print the results every time it finishes this calculation for one subgraph. If it spends more than a day without the output changing at all then there is probably a bigger problem.
Finally, the most stable version of the function is currently included with the latest release of igraph (0.5.2). To get this you'll need to install the latest R version (2.9.0) and type install.packages("igraph",dep=TRUE)
I hope this helps at least a little bit. Cohesive blocking is computationally just a tricky problem. I'll definitely post to the wiki when/if I get the time to add the multicore functionality, though that doesn't look likely until this summer some time.
On Apr 23, 2009, at 1:02 PM, Owen-Smith, Jason wrote:
> From: email@example.com [firstname.lastname@example.org] > Sent: Thursday, April 23, 2009 1:51 PM > To: Peter McMahan > Cc: Owen-Smith, Jason > Subject: [Fwd: cohesive blocking in R] > > Peter - Problem with Vista and Jason is pointing to a need for > benchmarks > so with larger datasets users dont get in over their head... RSQLite > probably isnt needed either, is it? Any tweaks? Is verbose=TRUE > helping? > Which alternative cut set option would be better for larger > datasets? We > can build some more comments into the wiki. Ok by you, Jason? We might > post your question? > > ---------------------------- Original Message > ---------------------------- > Subject: cohesive blocking in R > From: "Owen-Smith, Jason" <email@example.com> > Date: Thu, April 23, 2009 10:44 am > To: "firstname.lastname@example.org" <email@example.com> > -------------------------------------------------------------------------- > > Hey Doug, > > Woody passed on the link to the intersci wiki igraph cohesive.blocks > routine, which I've been playing with a bit in the context of a > paper he > and are working on and another project. The package seems to be > installed > correctly as I ran the examples on the wiki (though I had some trouble > with RSQLite given permission structures on michigan computers). I > haven't updated R for a while so am using 2.7.2 > > The routine works nicely for smaller <100 sparsely connected nodes but > even moderate sized networks, hundreds of nodes, take a massive > amount of > time. Performance just gets worse with more densely connected graphs. > For instance, I'm currently running cohesive.blocks on a 449 node > 10,547 > edge network. It has been running for almost 24 hours on a dual core > (2.8mhz, 8gb ram system under windows vista). The processor keeps > chunking along but the verbose=TRUE option doesn't help me track > progress, > because vista seems to default to treating the R-gui window as a > non-responsive program. All that brings me to a question I hope you > can > help me with. Has anyone benchmarked the igraph routine on medium > sized > dense graphs? If not are there tweaks that you know of that might > improve > performance here? > > Thanks, in advance, for any help you can give me. > > J.-