Using the new discrete estimator and producing sampling distribution plots

From InterSciWiki
Jump to: navigation, search


These are new (2008-05-19) R codes programmed and submitted to this site by Cosma Shalizi to extend his preliminary package for fitting univariate distributions. The previous package applied only to continuous data. tsal-v0.2.2.R for q-exponential continuous distributions were added later and separately. The two additional codes here apply to q-exponentials using the discrete distribution MLE and the qlog method used by most physicists. The latter estimations are not MLE and have modest bias as will as much greater standard errors than the discrete MLE. No claims or guarantees are made; use as your own risk. These methods are described in q-Exponential Distributions in Empirical Data (c) 2008 Laurent Tambayong, Aaron Clauset, Cosma Shalizi, and Douglas R. White; and in a subsequent working draft on the qlog.

The earlier package includes R and Matlab software and documentation (Updated 29 June 2007) for the 2007 review article, "Power-law distributions in empirical data," by Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Under review by a statistical journal.

The first extension for q-exponentials is Shalizi, Cosma. 2007 Maximum Likelihood Estimation for q-Exponential (Tsallis) Distributions. http://www.cscs.umich.edu/~crshalizi/research/tsallis-MLE

new paper Tsallis q distribution project (Laurent Tambayong, Aaron Clauset, Cosma Shalizi, and Douglas R. White). Drafted: (c) 2008 q-Exponential Distributions in Empirical Data

### Example of using the new discrete estimator
### and of producing sampling distribution plots
#GO TO CRAN, Package, and install package gsl (Wrappers for the Gnu scientific library), masks sd from stats
utils:::menuInstallPkgs()
library(gsl)  #GNU scientific library
# First, make sure that the old code for the
# continuous q-exponential is loaded - make
# sure to change the address of the file being
# sourced to match where it is on your computer
##source("/Users/crs/html/research/tsallis-MLE/tsal-v0.2.2.R")
source("http://intersci.ss.uci.edu/wiki/pub/Shalizi/tsal-v0.2.2.R")
# Next, load the code for the discrete
# q-exponential - again, you will have to give
# it the location on your computer, not mine
##source("/Users/crs/things-to-work-on/new-q-exp/disc.q.exp.estimation.R")
source("http://intersci.ss.uci.edu/wiki/pub/Shalizi/disc.q.exp.estimation.R")
# Similarly the q-logarithmic estimator
##source("/Users/crs/things-to-work-on/new-q-exp/lnq-qexp-estimate.R")
source("http://intersci.ss.uci.edu/wiki/pub/Shalizi/lnq-qexp-estimate.R")
# And, finally (for sourcing files) the continuous
# power law, just because it contains some
# useful visualization functions
##source("/Users/crs/things-to-work-on/powerlaws/power-law-inference-R-code/pli-R-v0.0.4-2008-03-02/pareto.R")
source("http://intersci.ss.uci.edu/wiki/pub/Shalizi/pareto.R")
# Next, generate some random variates from the
# discrete q-exponential distribution.
## The function to do this is rtsal.disc.  The
## first argument to this is the number of samples;
## then you can either give q and kappa, or shape
## and scale (i.e. theta and sigma in the Pareto II
## notation).
rtsal.disc(50,q=1.7,kappa=0.5)
# This will return the output to the console,
# typically something like this.
#>  [1] 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0
#>[28] 1 0 0 0 2 0 0 0 2 2 0 0 1 0 0 0 1 0 1 0 0 0 0
# Unfortunately the method used in this random
# number generator is rather crude and slow, so
# I added an option to precompute some of the
# values it needs (rather than doing so from
# scratch with each new sample)
rtsal.disc(50,q=1.7,kappa=0.5,precompute=5000)
# To check how well the random variate generation
# is working, plot the upper CDF (or "survival
# function") of the random variates against the
# theoretical distribution
#### Note the "plus one", since the values start
#### at zero but we're plotting on a log-log scale
plot.survival.loglog(rtsal.disc(1e4,q=1.7,kappa=0.5,precompute=5000)+1)
# Overlay the true CDF as blue dots
points(1:200,ptsal.disc(0:199,q=1.7,kappa=0.5,lower.tail=FALSE),col="blue")
# Generate a random data set from the discrete
# q-exponential
rts <- rtsal.disc(50,q=1.7,kappa=0.5,precompute=500) #corrected
# Fit it using the discrete MLE
tsal.fit.disc(rts)
# Fit it using the continuous MLE
tsal.fit(rts)
# Fit it using the q-logarithmic estimator
qlog.estimate(rts)
# Extract just the estimated q value, ignoring
# everything else
tsal.fit.disc(rts)$q
# etc. for the others
# Now replicate this - generate random data, fit
# it, extract the q value from the fit - many
# times, here 100.  Store the results in a vector
# for later use.
many.ml.estimates.q.small <- replicate(1e2,tsal.fit.disc(rtsal.disc(50,q=1.7,kappa=0.5,precompute=500))$q)
# Ditto for the q-logarithmic estimates.  Note that this isn't using the exact same
# random data sets, but with 100 of them per method this should still be fair.
many.qlog.estimates.q.small <- replicate(1e2,qlog.estimate(rtsal.disc(50,q=1.7,kappa=0.5,precompute=500))$q)
many.ml.estimates.q.small.continuous <- replicate(1e2,tsal.fit(rtsal.disc(50,q=1.7,kappa=0.5,precompute=500))$q)
##DRW got: Warning messages:
##1: In log(scale) : NaNs produced
##2: In log(1 + (x/scale)) : NaNs produced
##There were 50 or more warnings (use warnings() to see the first 50) -- maybe because my q~1.00 ie. 1.007
# The density() function, applied to a vector, e.g., density(v), produces a kernel
# density estimate of the vector.  This has a number of components, including a
# set of x points at which the density is evaluated and a set of y values for the density
# there.  (Do help(density) for more.)
help(density)

Plots and insert your own Sample data for q-exponential

Other ways to plot univariate distributions

# Plot the density of the discrete MLE's values of q.  Since the true q is unchanging, all of the variation comes from sampling.

plot(density(many.ml.estimates.q.small))

# To overlay the other densities, we need to store them as separate objects       
Discrete MLE outperforms qlog

qlog.temp <- density(many.qlog.estimates.q.small) cmle.temp <- density(many.ml.estimates.q.small.continuous)

#  Now do the nice plotting - start wth the discrete MLE's sampling distribution

plot(density(many.ml.estimates.q.small),xlim=c(0,60),main="Sampling distribution of q estimates: discrete data",xlab="black discrete MLE, green continuous MLE, red q-logarithm",sub="n=50")

# add lines showing the other two - note using $x to extract the component named "x",
# and $y to extract the component named "y".  Use help(lines) to see more about ways
# of adding lines to plots.

lines(qlog.temp$x,qlog.temp$y,col="red") lines(cmle.temp$x,cmle.temp$y,col="green")

Discrete MLE and qlog vastly outperform Continuous MLE
# for generating q=1.8 or so qlog does poorly, continuous MLE terribly, discrete MLE on target
# add a vertical line at the true q value.  Use help(abline) to see more about ways of adding
# simple lines to plots.

abline(v=1.7,col="blue")

# Now re-do this with just the discrete MLE and the q-logarithm estimator

plot(density(many.ml.estimates.q.small),xlim=c(0.5,5),main="Sampling distribution of q estimates: discrete data",xlab="black discrete MLE, red q-logarithm",sub="n=50") lines(qlog.temp$x,qlog.temp$y,col="red") abline(v=1.7,col="blue")