Padhraic Smyth


Padhraic Smyth, Department of Computer Science, UCI

Social network database project

Emma Spiro and Sean Fitzhugh network data project

Additional public datasets

Statistical topic models

"Automated Analysis of Relations between Words, Entities, Topics, and Documents using Statistical Topic Models" (see pdf)

Sample result: Demo of UCI-UCSD Faculty Browser - for alternate software implementation see Statistical topic model project

Slide for example of software disambiguation of the topics of documents from "Automated Analysis of Relations between Words, Entities, Topics, and Documents using Statistical Topic Models" by and with permission of Padhraic Smyth, UC Irvine
The availability of very large online corpora of text in digital form has led in recent years to the development of algorithms that try to automatically extract useful information and relationships from such text. In this talk I will describe a recent statistical approach that has proven to be very useful in this general context. Specifically I will discuss a representation for documents as mixtures of topics, where a topic is a probability distribution over words. The topics can be learned in a completely automated and unsupervised manner using a statistical estimation method called Gibbs sampling. I will illustrate the results of applying this approach to a diverse set of large corpora, including 250,000 emails from the Enron investigation, 300,000 news articles from the New York Times, 12,000 technical papers from UCI and UCSD faculty, and 80,000 articles from the Pennsylvania Gazette (from the 18th century). Once the statistical topic model is estimated for a specific corpus, a wide variety of interesting questions can be posed and answered: for example, how have topics changed over time in a particular corpus, which authors write on a particular topic, and so on. I will conclude with a discussion of how these statistical topic models can provide an interesting basis for automatically constructing large and complex networks from text and how these networks can support interesting inferences and insights that would be difficult (or impossible) to obtain by purely manual means.
Slide showing disambiguation of the topics of documents from "Automated Analysis of Relations between Words, Entities, Topics, and Documents using Statistical Topic Models" by and with permission of Padhraic Smyth, UC Irvine
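To make the abstract's description concrete, here is a minimal sketch of a collapsed Gibbs sampler for a basic topic model, in which each document is a mixture of topics and each topic is a probability distribution over words. It is an illustration only, not the software used in the talk or in the Faculty Browser. The function name gibbs_lda, the smoothing parameters alpha and beta, and the iteration count are all assumptions chosen for the example.

```python
import random


def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for a basic topic model (illustrative sketch).

    docs: list of documents, each a list of word strings.
    alpha, beta: assumed smoothing values, not taken from the source.
    Returns (doc_topic, topic_word, vocab): doc_topic[d][k] is the estimated
    share of topic k in document d; topic_word[k][w] is the estimated
    probability of vocab[w] under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    ndk = [[0] * K for _ in docs]      # topic counts per document
    nkw = [[0] * V for _ in range(K)]  # word counts per topic
    nk = [0] * K                       # total tokens per topic
    z = []                             # topic assignment of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][wi] -= 1
                nk[k] -= 1
                # Unnormalised probability of each topic for this token,
                # given all other assignments.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wi] + beta) / (nk[t] + V * beta)
                    for t in range(K)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wi] += 1
                nk[k] += 1

    # Smoothed estimates from the final counts.
    doc_topic = [
        [(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[k][w] + beta) / (nk[k] + V * beta) for w in range(V)]
        for k in range(K)
    ]
    return doc_topic, topic_word, vocab
```

Given tokenised documents, calling gibbs_lda(docs, 2) returns one row of topic proportions per document and one word distribution per topic. Questions such as "which authors write on a particular topic?" could then be approached by aggregating the document rows by author, though a corpus of the size mentioned in the abstract would need a far more efficient implementation than this sketch.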


Padhraic Smyth is one of the leading researchers in statistical pattern detection and data mining, machine learning, and information theory. His book Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996) was followed by the co-authored Principles of Data Mining (MIT Press, 2001) and Modeling the Internet and the Web: Probabilistic Methods and Algorithms (John Wiley and Sons, 2003). He received best paper awards at the 2002 and 1997 ACM SIGKDD Conferences, an IBM Faculty Partnership Award in 2001, an NSF Faculty CAREER award in 1997, and an Award for Excellence in Research in 1993 at the Jet Propulsion Laboratory (JPL), Pasadena, where he was a Technical Group Leader. He received a first class honors degree in Electronic Engineering from University College Galway (National University of Ireland) in 1984, and the MSEE and PhD degrees from the Electrical Engineering Department at the California Institute of Technology in 1985 and 1988 respectively. He has been on the UCI faculty since 1996.

He is currently an associate editor for the Journal of the American Statistical Association and for the IEEE Transactions on Knowledge and Data Engineering, has served as an action editor for the Machine Learning Journal, is a founding associate editor for the Journal of Data Mining and Knowledge Discovery, and a founding editorial board member of the Journal of Machine Learning Research. He served as program chair for the 33rd Symposium on Computer Science and Statistics in 2001 and served as general chair for the Sixth International Workshop on AI and Statistics in 1997.

Statistical topic model project

Smyth and Doug White are planning a Statistical topic model project for open access use of the software, expanding on existing prototype databases such as:

New York Times, 300,000 news articles
The Enron investigation, 250,000 emails
UCI and UCSD faculty specialties, 12,000 technical papers
Pennsylvania Gazette, 80,000 articles from the 18th century
CiteSeer digital collection, 750k papers, 500k authors
MEDLINE collection, 17 million abstracts
US Patent collection

Michael Fischer's implementation

a simple example of what I have in Grok

Search the Anthropological Index Online

Search Paul Stirling's Fieldnotes (or Wenonah Lyon or Mike Fischer Summaries)

These don't display any statistics, just a kind of topic cloud for searching a text. This is all done in the browser, and the data stream coming in has the statistics, so we can customise it to display them.


Yong Ming Kow

The website looks like a search engine plus a statistical topic modeling tool. First it searches the references using keywords. Then, based on the results, it generates a list of keywords in 'bubbles.' Larger bubbles are the most relevant.

Is my perception correct?

First enter keywords into the search field (see below), and click 'search the AIO.' See: Ymk01.jpg

Then wait a while, and you will see: Ymk02.jpg