Testing the model: q exponential

From InterSciWiki

Subfiles

  1. Proof of the pudding-Networks and q-exponentials
  2. From Pareto II to q
  3. Estimating Tsallis q
  4. Tsallis q distribution project: Tambayong, Clauset, Shalizi, White
  5. Tsallis q historical cities and city-sizes White, Tambayong, (Shalizi, Clauset)
  6. Testing the model
  7. Phenomenology of testing the model
  8. Estimating Tsallis q for degree distributions

Abstract. We extend to q-exponential models the work of Clauset et al. (2007), who provide state-of-the-art procedures and programs for testing fit to models of point-data distributions.

Introduction

Two fallacies are common in model testing. The first is to fit the parameters of a model by minimizing the sum of squared deviations from the data (measuring the r² fit between the data and the “model predictions” for each data point). The second is to treat the significance test for deviation from r² fit, according to the null hypothesis, as a good measure of fit. What this ignores, among other things, is (1) formulating the model in probabilistic terms and measuring the probability that the observed data would have been generated under the model probabilities given the fitted model parameters (hence p ≈ 1.0 is good fit, unlike the significance test), and (2) testing alternative models using log-likelihood probabilities that measure the probability that the model of interest is a better fit than an alternative (again p ≈ 1.0 is the probability that the data could have been generated from the model probabilities rather than the alternative, again unlike the significance test). Only in the social sciences do two beliefs stubbornly persist in common usage: (1) that rejection of the null hypothesis by a significance test can substitute for a direct test of fit to a model, a debate called the significance-test controversy (Meehl 1967) that has never been fully laid to rest, and (2) that correlation between observed data and model “predictions” provides a valid test of fit. The latter, however, is an error commonly made even by mathematically sophisticated physicists (e.g., Barabási 2002), in the belief that fitting a straight line to a log-log plot by r² and a significance test provides a good test of fit for a power law (or, in a semi-log plot, for an exponential distribution). Clauset et al. (2007), in their Appendix A.2, show how severely these approaches are biased, even though they are in common use in both the social and the physical sciences.
Hundreds if not thousands of peer-reviewed papers have been published testing fit, for example, between a 2-parameter power-law and the frequency distribution of nodes in a network by their number of links (aka degree distributions).

Models of point data such as power laws and other distributions do allow accurate tests of fit when correctly done, as Clauset et al. (2007) have laid out and shown empirically in their review article. We follow their procedures and extend their results by testing two maximum likelihood estimators (MLEs) for the Pareto II distribution, now more widely known as the q-exponential, against other models (exponential, stretched exponential, lognormal, and nested models such as power law plus cutoff). We do so for several continuous data sets (populations of cities, sales of books, forest fire sizes, blackouts) that Clauset et al. (2007) use in their review to exemplify how to evaluate fit to power-law distribution models. As they show, unbiased point-data estimates can be obtained and tested, and different model fits compared, in three steps:

  1. Derivation and validation of appropriate maximum likelihood estimation (MLE) formulas for different kinds of distributions;
  2. Use of the Kolmogorov-Smirnov (K-S) test to measure the extent to which two distributions differ. Here, unlike the null significance test, p-values close to 1 indicate good fits, and the null hypothesis of no difference is rejected if p is small; e.g., p < .05 means the distributions differ. K-S also returns a difference measure D between the distributions.
  3. Likelihood ratio tests of comparative fit between different models. Clauset et al. (2007) do so only for power laws compared to the fit of other distributions (exponential, stretched exponential, lognormal, power law plus cutoff).

Procedure for Deriving an MLE

The principle of maximum likelihood parameter estimation is to find the parameter values that make the observed data most likely under a given parameterized model. What is wanted is an estimator that is asymptotically unbiased, consistent and efficient, i.e., one that for large samples (or as sample size goes to infinity) gives parameter values that are almost always accurate (unbiased) within error bounds that can also be accurately calculated (more specifically, asymptotically efficient, i.e., with the lowest possible variance among all unbiased estimators, as shown by the Cramér-Rao inequality for consistent unbiased estimators). If data drawn with unknown parameters from a given distribution can be assumed to be independent and identically distributed (iid: each element belongs to the same probability distribution as the others and all are mutually independent), the problem simplifies because the likelihood \lambda(\pi), for the data and model parameter set \pi, can then be written as a product of univariate probability densities p_\pi(x) for each x, and its logarithm can be reexpressed as a sum of the logs of these densities.
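Written out for an iid sample x_1, ..., x_n, the simplification takes the form below. The symbol \ell for the log-likelihood is notation introduced here for illustration. The second line applies it to the Pareto II density in the (\theta, \sigma) form, which we take as an assumption about the parameterization used in Shalizi (2007); it should be checked against that article before use.

```latex
\lambda(\pi) = \prod_{i=1}^{n} p_\pi(x_i),
\qquad
\ell(\pi) \equiv \log \lambda(\pi) = \sum_{i=1}^{n} \log p_\pi(x_i)

% Assumed Pareto II form: p_{\theta,\sigma}(x) = (\theta/\sigma)\,(1 + x/\sigma)^{-\theta-1}, \; x \ge 0
\ell(\theta,\sigma) = n \log\theta - n \log\sigma
  - (\theta + 1) \sum_{i=1}^{n} \log\!\left(1 + \frac{x_i}{\sigma}\right)
```

Working with \ell rather than \lambda avoids numerical underflow from multiplying many small densities, and both are maximized by the same parameter values.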

Testing the MLE with simulated data

The best way to test the accuracy of an MLE is to generate large samples of synthetic data drawn from the model with different parameter values, and see whether the MLE recovers these values with their expected variances. For the q-exponential, Aaron Clauset's qpva.m (Matlab) code includes a way of generating synthetic data from the Pareto II which, as formulated in Eq. 2 of Cosma Shalizi's article, has parameters that are algebraic equivalents of the q-exponential parameters.
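A minimal Python sketch of such a recovery test follows. It is not a port of qpva.m. It assumes the Pareto II survival function P(X >= x) = (1 + x/sigma)^(-theta) and the conversion theta = -1/(1 - q), sigma = theta * kappa, which is our reading of Shalizi's Eq. 2 and should be verified against the article. To keep the example short, the scale sigma is treated as known, so that the MLE of theta has a closed form; all function names are hypothetical.

```python
import math
import random

def sample_pareto2(n, theta, sigma, rng):
    """Draw n values with survival function P(X >= x) = (1 + x/sigma)**(-theta)."""
    # Inverse-transform sampling: solve u = (1 + x/sigma)**(-theta) for x.
    # 1 - rng.random() lies in (0, 1], so u is never zero.
    return [sigma * ((1.0 - rng.random()) ** (-1.0 / theta) - 1.0)
            for _ in range(n)]

def pareto2_to_q(theta, sigma):
    """Map (theta, sigma) to (q, kappa), assuming theta = -1/(1 - q), sigma = theta * kappa."""
    return 1.0 + 1.0 / theta, sigma / theta

def theta_mle_known_sigma(xs, sigma):
    """Closed-form MLE of theta when sigma is treated as known (a simplification)."""
    return len(xs) / sum(math.log1p(x / sigma) for x in xs)

def recovery_check(theta, sigma, n, replications, rng):
    """Return the mean and standard deviation of the estimates over synthetic samples."""
    estimates = [theta_mle_known_sigma(sample_pareto2(n, theta, sigma, rng), sigma)
                 for _ in range(replications)]
    mean = sum(estimates) / replications
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (replications - 1))
    return mean, sd

if __name__ == "__main__":
    rng = random.Random(42)
    true_theta, true_sigma = 2.5, 3.0
    print("q, kappa =", pareto2_to_q(true_theta, true_sigma))
    for n in (100, 1000, 5000):
        mean, sd = recovery_check(true_theta, true_sigma, n, 100, rng)
        # For this one-parameter case the asymptotic standard error is theta / sqrt(n).
        print(n, round(mean, 3), round(sd, 3), round(true_theta / math.sqrt(n), 3))
```

If the estimator is behaving, the mean estimate should approach the true theta and the spread should shrink toward theta / sqrt(n) as n grows.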

Procedure for Estimation with MLE

The MLE is found by choosing the model parameter value(s) that maximize the likelihood function \lambda(\pi) for the empirical data or, equivalently, its logarithm, a sum of logs that depends on these value(s). In some cases this maximum can also be found algebraically by setting to zero the derivative of the log-likelihood with respect to the parameter(s).
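As an illustration of both routes, the sketch below fits the two Pareto II parameters by combining them: for a fixed scale sigma, setting the derivative with respect to theta to zero gives theta in closed form, and the remaining one-dimensional maximization over sigma is done numerically. The density form (theta/sigma)(1 + x/sigma)^(-theta-1), the search bracket, and the grid-refinement scheme are assumptions chosen for simplicity; this is not the estimator implemented in the Matlab or R programs cited here.

```python
import math
import random

def profile_loglik(xs, sigma):
    """Return (log-likelihood, theta) with theta maximized analytically for this sigma."""
    n = len(xs)
    s = sum(math.log1p(x / sigma) for x in xs)
    theta = n / s  # root of d(loglik)/d(theta) = n/theta - s = 0
    loglik = n * math.log(theta) - n * math.log(sigma) - (theta + 1.0) * s
    return loglik, theta

def fit_pareto2(xs, grid_points=60, refinements=5):
    """Estimate (theta, sigma) by a repeatedly refined grid search over log(sigma)."""
    center = math.log(sum(xs) / len(xs))
    lo, hi = center - 7.0, center + 7.0  # search bracket chosen for illustration
    for _ in range(refinements):
        step = (hi - lo) / (grid_points - 1)
        grid = [lo + i * step for i in range(grid_points)]
        best = max(grid, key=lambda c: profile_loglik(xs, math.exp(c))[0])
        lo, hi = best - step, best + step  # zoom in around the best grid point
    sigma = math.exp(best)
    loglik, theta = profile_loglik(xs, sigma)
    return theta, sigma, loglik

if __name__ == "__main__":
    rng = random.Random(7)
    true_theta, true_sigma = 2.5, 3.0
    # Synthetic stand-in for empirical data, drawn by inverse-transform sampling.
    data = [true_sigma * ((1.0 - rng.random()) ** (-1.0 / true_theta) - 1.0)
            for _ in range(5000)]
    theta_hat, sigma_hat, loglik = fit_pareto2(data)
    print("theta:", round(theta_hat, 3), "sigma:", round(sigma_hat, 3))
    print("q:", round(1.0 + 1.0 / theta_hat, 3),
          "kappa:", round(sigma_hat / theta_hat, 3))
```

The grid search assumes the profile log-likelihood has a single interior peak. For data that look nearly exponential the maximum can drift toward the upper edge of the bracket, so a fitted sigma at the boundary should be read as a warning rather than an estimate.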

Small sample correction for MLE

For small samples, parameter estimates can be corrected using the results of MLE accuracy tests at smaller sample sizes (e.g., Clauset et al. 2007, Fig. 6 for power laws; Shalizi 2007, Fig. 1, for q-exponentials). Error-bound estimates from asymptotic approximations should be avoided. Instead, parameter estimates, standard errors, and confidence limits can be obtained by parametric bootstrapping (Wasserman 2002: section 9.11), that is, by generating many "bootstrap" samples of random numbers from the model density function p_{\pi}(x) and refitting the model to each.
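A parametric bootstrap of this kind can be sketched in a few lines of Python. The routine below is generic in the fitting and sampling functions; for the demonstration we plug in a one-parameter Pareto II with the scale fixed at an arbitrary known value, which is a simplification made only to keep the example short. The function names and the percentile-interval construction are our own choices, not taken from Wasserman or from the cited programs.

```python
import math
import random

SIGMA = 3.0  # scale treated as known, purely to keep the sketch short

def sample_model(n, theta, rng):
    """Draw n values from the Pareto II model with shape theta and scale SIGMA."""
    return [SIGMA * ((1.0 - rng.random()) ** (-1.0 / theta) - 1.0)
            for _ in range(n)]

def fit_model(xs):
    """Closed-form MLE of theta for known scale SIGMA."""
    return len(xs) / sum(math.log1p(x / SIGMA) for x in xs)

def parametric_bootstrap(xs, fit, sample, n_boot, rng):
    """Return the estimate, its bootstrap standard error, and a 95% percentile interval."""
    estimate = fit(xs)
    # Simulate from the fitted model at the original sample size and refit each time.
    boot = sorted(fit(sample(len(xs), estimate, rng)) for _ in range(n_boot))
    mean = sum(boot) / n_boot
    se = math.sqrt(sum((b - mean) ** 2 for b in boot) / (n_boot - 1))
    lower = boot[int(0.025 * n_boot)]
    upper = boot[int(0.975 * n_boot) - 1]
    return estimate, se, (lower, upper)

if __name__ == "__main__":
    rng = random.Random(11)
    small_sample = sample_model(50, 2.5, rng)  # synthetic stand-in for a small data set
    estimate, se, (lower, upper) = parametric_bootstrap(
        small_sample, fit_model, sample_model, 1000, rng)
    print("theta:", round(estimate, 3), "SE:", round(se, 3))
    print("95% interval:", round(lower, 3), "to", round(upper, 3))
```

Because every bootstrap sample has the same size as the original data, the resulting standard error and interval reflect the small-sample behavior of the estimator rather than its asymptotic limit.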

Procedure for testing K-S fit for a model with MLE parameters

The likelihood framework allows tests of parameter values by comparing the empirical data (using K-S) with a same-sized sample of simulated data (vary x randomly and return y(x)) generated from the model with the estimated parameters, returning D and p for each comparison. Each simulated dataset will differ, so D and p vary from one comparison with the empirical data to the next. The appropriate measure of fit is therefore obtained by repeatedly generating point data from the model and its parameters, with the same number of points as the actual data, enough times for the estimated variance of D and p to converge.
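The repeated comparison can be sketched as follows. The two-sample K-S distance is computed directly from the empirical distribution functions, and the p-value uses the standard asymptotic Kolmogorov series with a common small-sample adjustment, so it is an approximation rather than an exact test. The fitted values in the demonstration are hypothetical placeholders for MLE output, and the "empirical" data are synthetic.

```python
import math
import random

def ks_two_sample(a, b):
    """Return the K-S distance D: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n, m):
    """Asymptotic two-sample p-value from the Kolmogorov series (an approximation)."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        return 1.0  # the series is numerically indistinguishable from 1 here
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def repeated_ks(data, simulate, repetitions, rng):
    """Compare the data with same-sized simulated samples; return lists of D and p."""
    ds, ps = [], []
    for _ in range(repetitions):
        d = ks_two_sample(data, simulate(len(data), rng))
        ds.append(d)
        ps.append(ks_pvalue(d, len(data), len(data)))
    return ds, ps

def mean_var(values):
    """Return the sample mean and variance of a list of numbers."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / (len(values) - 1)

if __name__ == "__main__":
    rng = random.Random(5)
    theta_hat, sigma_hat = 2.5, 3.0  # hypothetical fitted values standing in for MLE output

    def simulate(n, rng):
        return [sigma_hat * ((1.0 - rng.random()) ** (-1.0 / theta_hat) - 1.0)
                for _ in range(n)]

    data = simulate(500, rng)  # synthetic stand-in for the empirical data
    ds, ps = repeated_ks(data, simulate, 200, rng)
    print("D: mean %.4f, variance %.6f" % mean_var(ds))
    print("p: mean %.4f, variance %.6f" % mean_var(ps))
```

Increasing the number of repetitions until the reported means and variances stop changing appreciably is one practical way to judge convergence.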

Likelihood ratio procedure for comparing different models for the same empirical data

The likelihood ratio (LR) addresses the question of which of two models is the better fit to the data. The ratio R of the two models' likelihoods is normally given as its logarithm \mathfrak{R}; the model with the higher likelihood is the better fit, so the sign of \mathfrak{R} indicates which model is favored.
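A small Python sketch of the comparison is given below. It computes the log ratio as the sum of pointwise log-density differences and attaches a two-sided p-value from a Vuong-style normal approximation, which is our understanding of the approach taken in Clauset et al. and should be checked against their article. The demonstration compares a Pareto II fit (scale treated as known, a simplification) with an exponential fit on synthetic data; all names are hypothetical.

```python
import math
import random

def log_likelihood_ratio(xs, logpdf_a, logpdf_b):
    """Return (R, p): R is log-likelihood of A minus that of B; p is a normal-test p-value."""
    diffs = [logpdf_a(x) - logpdf_b(x) for x in xs]
    n = len(diffs)
    r = sum(diffs)
    mean = r / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    if sd == 0.0:
        # Constant pointwise difference: no sampling variation to test against.
        return r, (1.0 if r == 0.0 else 0.0)
    # Two-sided p-value for the hypothesis that the expected difference is zero.
    p = math.erfc(abs(r) / (math.sqrt(2.0 * n) * sd))
    return r, p

if __name__ == "__main__":
    rng = random.Random(13)
    sigma = 3.0  # scale treated as known to keep the sketch short
    # Synthetic stand-in for empirical data, drawn from a Pareto II with theta = 2.5.
    data = [sigma * ((1.0 - rng.random()) ** (-1.0 / 2.5) - 1.0)
            for _ in range(5000)]
    theta_hat = len(data) / sum(math.log1p(x / sigma) for x in data)  # Pareto II shape MLE
    rate_hat = len(data) / sum(data)                                  # exponential rate MLE

    def pareto2_logpdf(x):
        return math.log(theta_hat / sigma) - (theta_hat + 1.0) * math.log1p(x / sigma)

    def exponential_logpdf(x):
        return math.log(rate_hat) - rate_hat * x

    r, p = log_likelihood_ratio(data, pareto2_logpdf, exponential_logpdf)
    print("log likelihood ratio:", round(r, 2), "p-value:", p)
```

A positive value favors the first model and a negative value the second; a small p-value suggests the sign is unlikely to be a chance fluctuation. The normal approximation is intended for non-nested models, so nested comparisons such as power law versus power law plus cutoff need a different reference distribution.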

References

Chakravarti, Laha, and Roy. 1967. Kolmogorov-Smirnov (K-S) test. Handbook of Methods of Applied Statistics, Volume I, John Wiley and Sons, pp. 392-394.

Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2007. Power-law distributions in empirical data. See also the R and Matlab software and documentation (updated 29 June 2007) for the 2007 review article.

Meehl, Paul. 1967. Theory testing in Psychology and Physics: A Methodological Paradox. Philosophy of Science 34(2): 103-115 (reprinted).

Shalizi, Cosma. 2007. Maximum Likelihood Estimation for q-Exponential (Tsallis) Distributions. These are classically known as Pareto II distributions.

Commentary

The sum of squares is the same as maximum likelihood when you are dealing with gaussian-distributed data. So yes, if your data is not normal, you should not use R-sq. But you cannot just say "never use R-sq". --- Turchin

Then the other thing is that the significance test for Rsq or any other distribution computed by sum of squares is a test of the null hypothesis, not a Kolmogorov-Smirnov type test of model fit. It is worth Andrey's time to read the Clauset et al. article carefully on this point and explore the use of their Matlab and R programs. Doug 07:44, 26 January 2008 (PST)

Miscellaneous M (Matlab programs)

Open Aaron's powerlaws_full_v0.0.2-2007-09-28.tgz from http://www.santafe.edu/~aaronc/powerlaws/ with UltimateZip

qpva.m (Matlab): The best way to test whether your MLE is accurate is to try it out on synthetic data, for which you know the correct q-exponential parameters. The qpva.m code includes a way of generating synthetic data drawn from the Pareto II, as formulated in Cosma's article, so potentially you could use that code to generate synthetic data for your MLE -- just remember that you'll need Eq 2 in Cosma's article to convert the Pareto II parameters into q-exponential parameters.

randht.m (Matlab) is: "The simulation for Random number generators. This function generates continuous values randomly distributed according to one of the five distributions discussed in the article (power law, exponential, log-normal, stretched exponential, and power law with cutoff). Usage information is included in the file; type 'help randht' at the Matlab prompt for more information."-- from the site

plvar.m (Matlab) is for: "Estimating uncertainty in the fitted parameters. This function implements the nonparametric approach for estimating the uncertainty in the estimated parameters for the power-law fit found by the plfit function. It too implements both continuous and discrete versions. Usage information is included in the file; type 'help plvar' at the Matlab prompt for more information."-- from the site