Fitting Distributions for Chinese Migrant Networks
From InterSciWiki
Intro to Clauset et al 2007
Test Procedures
As Clauset et al. (2007)[1] show, unbiased point estimates can be obtained and tested, and the fits of different models compared, in four steps:
- 1. Derivation and validation of appropriate maximum likelihood estimation (MLE) formulas for different kinds of distributions.
- 2. Estimating appropriate parameters with the MLE, including small sample corrections.
- 3. Use of the Kolmogorov-Smirnov (KS) test (Chakravarti, Laha, and Roy 1967) to estimate goodness-of-fit for a model with MLE parameters (using synthetic data), and to estimate the lower bound xmin for power-law and q-exponential behavior, or the lower bound xmin on fit for other distributions, as the value that gives the minimum KS value D, which measures the distance between distributions (Clauset et al. 2007:7-9). (Unlike the usual null-hypothesis significance test, where a small p is the desired outcome, KS probability values close to 1 indicate good fits; the null hypothesis of no difference is rejected if p is small, e.g., p<.05 means the distributions differ.) In these cases, the KS test measures the distance between the empirical data and synthetic data generated from the model. Unlike in the commonly used r2 test, the theoretical values do not form a continuous curve but variable “jiggled” lines that only approximate the general form of the model's theoretical curve. Graphs of the synthetic data for a given model and set of parameters can be quite surprising to a novice used to fitting empirical data points to a single continuous theoretical curve. Because the x values of the synthetic data are randomly generated, no two generated “jiggled” lines are identical. They form a distribution of possible outcomes for a given theoretical distribution, a characteristic shared with “bootstrap” estimates of variance in outcomes from model probabilities. Thus, for example, “even if a data set is drawn from a perfect power-law distribution, the fit between that data set and the true distribution will on average be poorer than the fit to the best-fit [e.g., r2] distribution, because of statistical fluctuations” (p.11).
- 4. Likelihood ratio tests to compare the fit of competing models. Clauset et al. (2007:13,15,19) do so only for power laws compared with other distributions (exponential, stretched exponential, lognormal, power law with exponential cut-off).
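The estimation and KS steps above can be sketched in Python for the continuous power-law case. This is a minimal illustration, not the Clauset et al. software: the function names are hypothetical, the data are assumed to be positive and continuous, and the p-value routine is a simplified parametric bootstrap that resamples only the tail and keeps xmin fixed, whereas the published procedure also resamples the region below xmin and re-estimates xmin for each synthetic data set.

```python
import math
import random


def fit_alpha(tail, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin))."""
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need at least one observation above xmin")
    return 1.0 + len(tail) / log_sum


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    ordered = sorted(tail)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model - i / n), abs((i + 1) / n - model))
    return d


def fit_power_law(data, min_tail=50):
    """Try each observed value as xmin; keep the fit with the smallest D."""
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:
            break
        alpha = fit_alpha(tail, xmin)
        d = ks_distance(tail, xmin, alpha)
        if best is None or d < best["D"]:
            best = {"D": d, "xmin": xmin, "alpha": alpha, "n_tail": len(tail)}
    return best


def sample_power_law(n, xmin, alpha, rng):
    """Inverse-transform sampling from a continuous power law."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def bootstrap_p(tail, xmin, alpha, d_obs, reps=100, seed=0):
    """Fraction of synthetic tails whose refitted D is at least d_obs.

    Simplified: xmin is held fixed and only the tail is resampled.
    """
    rng = random.Random(seed)
    worse = 0
    for _ in range(reps):
        synth = sample_power_law(len(tail), xmin, alpha, rng)
        if ks_distance(synth, xmin, fit_alpha(synth, xmin)) >= d_obs:
            worse += 1
    return worse / reps
```

A typical call would be `fit = fit_power_law(data)` followed by `bootstrap_p(tail, fit["xmin"], fit["alpha"], fit["D"])`, where `tail` holds the observations at or above `fit["xmin"]`. A p close to 1 then indicates that synthetic data from the fitted model rarely fit better than the empirical data do.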
q exponential and q logarithm
The solution to
$\frac{dy}{dx} = y^q$, y(0)=1, is the q-exponential function (Tsallis 2004:5-6). (A constant such as λ may be inserted before x and will carry over before x in equations 1 and 2.)
$e_q^x = [1 + (1-q)x]^{1/(1-q)}$
if 1 + (1 − q)x > 0; otherwise
$e_q^x = 0$, where
$e_1^x = e^x$.
The inverse of the q-exponential is the q-logarithmic function (Tsallis 2004:6):
$\ln_q x = \frac{x^{1-q} - 1}{1-q}$
where $\ln_1 x = \ln x$.
With an optional constant κ
where
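As a concrete illustration of the two functions in this section, the following Python sketch implements the q-exponential and q-logarithm in their standard Tsallis form, using the zero cut-off described in the text when 1 + (1 − q)x is not positive. The function names `q_exp` and `q_log` are hypothetical, chosen only for this example.

```python
import math


def q_exp(x, q):
    """q-exponential: [1 + (1 - q) x]^(1 / (1 - q)), with e_1^x = e^x."""
    if q == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        return 0.0  # cut-off convention stated in the text
    return base ** (1.0 / (1.0 - q))


def q_log(x, q):
    """q-logarithm: (x^(1 - q) - 1) / (1 - q), with ln_1 x = ln x."""
    if x <= 0.0:
        raise ValueError("q_log is defined here only for x > 0")
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)
```

Wherever the q-exponential is positive the two functions are inverses, so `q_log(q_exp(0.7, 1.5), 1.5)` returns 0.7 up to rounding, and both reduce to the ordinary exponential and logarithm as q approaches 1.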
Probability distribution and Cumulative probability distribution (CDF)
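The probability distribution for
$x \ge 0$
(Tsallis 2004:6-8) normalized with
$\int_0^\infty p(x)\,dx = 1$
is
$p(x) = (2-q)\,\lambda\,[1 - (1-q)\lambda x]^{1/(1-q)} = (2-q)\,\lambda\,e_q^{-\lambda x}$
The cumulative probability distribution for the q-exponential function in (1) (Tsallis 2004:8) is also a q-exponential function
$P(X \ge x) = \int_x^\infty p(x')\,dx' = e_{q_M}^{-\lambda_M x}$, with
$q_M = \frac{1}{2-q}$ and $\lambda_M = (2-q)\lambda$, where for
$q \ge 2$ the density $p(x)$
is not normalizable. Thus, q cannot be fitted for a range at or above 2. (Note that κ does not enter into this equation).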
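The relation between the density and the cumulative distribution can be checked numerically. The sketch below assumes a normalized density of the form p(x) = (2 − q) λ e_q^(−λx) on x ≥ 0 with q < 2, and a cumulative form P(X ≥ x) that is a q-exponential with index 1/(2 − q) and rate (2 − q)λ. The names `density`, `survival`, and `sample`, and the use of λ as the rate constant, are assumptions made for illustration.

```python
import math
import random


def q_exp(x, q):
    """q-exponential with the zero cut-off; reduces to exp(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def density(x, q, lam):
    """Assumed normalized density (2 - q) * lam * e_q^(-lam * x), x >= 0, q < 2."""
    return (2.0 - q) * lam * q_exp(-lam * x, q)


def survival(x, q, lam):
    """P(X >= x) written as a q-exponential with index q_M = 1 / (2 - q)."""
    q_m = 1.0 / (2.0 - q)
    lam_m = (2.0 - q) * lam
    return q_exp(-lam_m * x, q_m)


def sample(n, q, lam, rng):
    """Inverse-transform sampling: solve survival(x) = u for x."""
    q_m = 1.0 / (2.0 - q)
    lam_m = (2.0 - q) * lam
    out = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        if abs(q_m - 1.0) < 1e-12:
            ln_q = math.log(u)
        else:
            ln_q = (u ** (1.0 - q_m) - 1.0) / (1.0 - q_m)
        out.append(-ln_q / lam_m)
    return out
```

The negative derivative of `survival` should match `density` at any x, and the fraction of sampled values at or above a threshold should approach `survival` at that threshold. The same sampler can generate the synthetic data sets used in KS comparisons.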
The Shalizi MLE of q-exponential by Pareto II
See Cosma Shalizi (2007), Maximum Likelihood Estimation for q-Exponential (Tsallis) Distributions, http://www.cscs.umich.edu/~crshalizi/research/tsallis-MLE. These are classically known as Pareto II distributions. (Examples). Rather than using Shalizi's method, however, we proceed to estimate q and κ directly, as follows (the two MLE approaches will be compared).
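As related to Tsallis's exposition, above, Shalizi took as his cumulative distribution
$P_{q,\kappa}(X \ge x) = \left(1 - \frac{(1-q)\,x}{\kappa}\right)^{1/(1-q)}$,
Substituting $\theta = -\frac{1}{1-q}$ and $\sigma = \theta\kappa$ to obtain the Pareto II parameters θ, σ, we may take his equations (8) and (10) for the MLE estimation of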
in terms of
is the q for Tsallis equation 4, while Shalizi's is the qM for Tsallis equation 5.
It should be the case that the cumulative q-exponential asymptotes in the tail to a power law with slope 1 / (1 − qM).
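A minimal sketch of a Pareto II maximum likelihood fit follows. It assumes the Pareto II form P(X ≥ x) = (1 + x/σ)^(−θ) and the reparameterization θ = −1/(1 − q), σ = θκ. For a fixed σ the likelihood is maximized in closed form by θ = n / Σ log(1 + x_i/σ), so the code maximizes the resulting profile likelihood over σ numerically with a coarse grid followed by a golden-section refinement. The search strategy, its bounds, and all function names are assumptions of this example and are not taken from Shalizi's paper or software.

```python
import math
import random


def theta_given_sigma(xs, sigma):
    """For fixed sigma the MLE of theta is n / sum(log(1 + x / sigma))."""
    return len(xs) / sum(math.log1p(x / sigma) for x in xs)


def profile_loglik(xs, sigma):
    """Pareto II log-likelihood evaluated at (theta_hat(sigma), sigma)."""
    n = len(xs)
    theta = theta_given_sigma(xs, sigma)
    # sum(log(1 + x / sigma)) equals n / theta at the profile optimum.
    return n * math.log(theta) - n * math.log(sigma) - (theta + 1.0) * n / theta


def fit_pareto2(xs, grid_points=60, refine_steps=40):
    """Maximize the profile likelihood over log(sigma)."""
    center = math.log(sum(xs) / len(xs))
    grid = [center - 7.0 + 14.0 * i / (grid_points - 1) for i in range(grid_points)]
    values = [profile_loglik(xs, math.exp(g)) for g in grid]
    k = max(range(grid_points), key=values.__getitem__)
    lo, hi = grid[max(k - 1, 0)], grid[min(k + 1, grid_points - 1)]
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(refine_steps):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if profile_loglik(xs, math.exp(a)) < profile_loglik(xs, math.exp(b)):
            lo = a
        else:
            hi = b
    sigma = math.exp(0.5 * (lo + hi))
    return theta_given_sigma(xs, sigma), sigma


def to_q_kappa(theta, sigma):
    """Invert theta = -1 / (1 - q) and sigma = theta * kappa."""
    return 1.0 + 1.0 / theta, sigma / theta
```

Calling `to_q_kappa(*fit_pareto2(xs))` returns estimates of q and κ in the cumulative parameterization. The data are assumed to be non-negative with at least one positive value, and the estimate of q can be quite variable for small samples.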
Problem of estimating q-exponential for discrete distributions: the MLE is not known
The approximation here is for continuous data. The solution of the discrete case (density function) is still unknown.
Proposed Solution: Use the one parameter solution in KS
Proposal: Use a KS Matlab program like that of Aaron Clauset:
- 1. Start with an r2 optimization of linear fit between x and e2x, then test p KS fit
- 2. alter q and retest p KS fit until an optimal solution is found
- 3. alter xmin and retest p KS fit until an optimal solution is found
- 4. back to 2 if changed from last iteration
This should give the equivalent of an MLE solution, by bootstrap.
We then have a direct comparison of p KS fit for q-exponential, as for power law.
Perform a likelihood ratio test between these two p KS fits.
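The proposal above can be approximated by a direct search that minimizes the KS distance. The sketch below is not the alternating procedure described in the steps: it swaps in an exhaustive grid search over q, κ, and xmin, which is simpler to state and reaches the same kind of minimum-distance solution on the grid. It assumes the cumulative form P(Y ≥ y) = [1 − (1 − q) y/κ]^(1/(1−q)) with 1 < q < 2, applied to the excesses y = x − xmin. The grids and function names are hypothetical choices for the example.

```python
import math
import random


def q_exp_survival(y, q, kappa):
    """Assumed tail model P(Y >= y) for 1 < q < 2 and kappa > 0."""
    return (1.0 + (q - 1.0) * y / kappa) ** (-1.0 / (q - 1.0))


def ks_distance(excess_sorted, q, kappa):
    """Largest gap between the empirical and model CDFs of the excesses."""
    n = len(excess_sorted)
    d = 0.0
    for i, y in enumerate(excess_sorted):
        model = 1.0 - q_exp_survival(y, q, kappa)
        d = max(d, abs(model - i / n), abs((i + 1) / n - model))
    return d


def fit_q_exponential(data, q_grid, kappa_grid, xmin_grid, min_tail=50):
    """Return the grid point (q, kappa, xmin) with the smallest KS distance."""
    best = None
    for xmin in xmin_grid:
        excess = sorted(x - xmin for x in data if x >= xmin)
        if len(excess) < min_tail:
            continue
        for q in q_grid:
            for kappa in kappa_grid:
                d = ks_distance(excess, q, kappa)
                if best is None or d < best["D"]:
                    best = {"D": d, "q": q, "kappa": kappa, "xmin": xmin}
    return best
```

Because the KS distance tends to grow as the tail shrinks, comparing D across different xmin values favors larger tails. A p-value for the chosen fit would still require generating synthetic data from the fitted model and repeating the search, as in the power-law case.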
(Estimating q for the q-exponential: Small Samples) - might not be needed
Shalizi's Figure 1 shows how the MLE for the q-exponential asymptotes to the correct value with larger samples. For samples of fewer than 150 observations a small sample correction is needed. But as the curve fitting method overestimates by approximately the same amount as the MLE underestimates, we present both in our paper and take the average.
Estimating β for power law: MLE
The approximation here is for discrete data, for which the density function is known.
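For discrete data the power-law likelihood can be maximized numerically. The following sketch assumes the standard discrete form p(x) = x^(−β) / ζ(β, xmin) for integer x ≥ xmin, where ζ is the Hurwitz zeta function. To stay within the Python standard library it approximates ζ by a direct sum with an Euler-Maclaurin tail correction, and it finds β by golden-section search, which is valid because the log-likelihood is concave in β. The search bounds and function names are assumptions of the example.

```python
import math


def hurwitz_zeta(s, a, terms=1000):
    """sum over k >= 0 of (a + k)^(-s): direct sum plus a tail correction."""
    head = sum((a + k) ** (-s) for k in range(terms))
    b = a + terms
    # Euler-Maclaurin: integral term, half the first omitted term, derivative term.
    tail = b ** (1.0 - s) / (s - 1.0) + 0.5 * b ** (-s) + s * b ** (-s - 1.0) / 12.0
    return head + tail


def log_likelihood(beta, n, log_sum, xmin):
    """Log-likelihood of n tail observations whose logs sum to log_sum."""
    return -n * math.log(hurwitz_zeta(beta, xmin)) - beta * log_sum


def fit_beta(data, xmin=1, lo=1.05, hi=6.0, steps=60):
    """Golden-section search for the beta that maximizes the log-likelihood."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x) for x in tail)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(steps):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if log_likelihood(a, n, log_sum, xmin) < log_likelihood(b, n, log_sum, xmin):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)
```

The estimate is restricted to the interval between `lo` and `hi`, so a result at either boundary suggests the interval should be widened or that a power law is a poor description of the data above the chosen xmin.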
Estimating β for power law: Small Samples
Estimating μ for log-normal: MLE
- versus power-law
- versus q-exponential
Estimating μ for log-normal: Small Samples
- versus power-law
- versus q-exponential
Estimating λ for exponential: MLE
- versus power-law
- versus q-exponential
Estimating λ for exponential: Small Samples
- versus power-law
- versus q-exponential
Results
2006 SPSS fitting for q. 2008 RS MLE fitting for power law, with likelihood p, optimal xmin, Clauset. 2008 RS bootstrap fitting for power law, with likelihood p, optimal xqmin, proposed.
References
Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2007. Power-law distributions in empirical data. http://arxiv.org/abs/0706.1062. R and Matlab software and documentation for the 2007 review article (29 June 2007). http://www.santafe.edu/~aaronc/powerlaws/
Shalizi, Cosma. 2007. Maximum Likelihood Estimation for q-Exponential (Tsallis) Distributions. http://www.cscs.umich.edu/~crshalizi/research/tsallis-MLE
Thurner, Stefan, Fragiskos Kyriakopoulos, and Constantino Tsallis. 2007. Unified model for network dynamics exhibiting nonextensive statistics. Phys. Rev. E 76, 036111 (8 pages). http://scitation.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=PLEEE8000076000003036111000001&idtype=cvips&gifs=yes
Tsallis, Constantino. 1988. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52:479-487.
Tsallis, Constantino. 2004. Nonextensive Statistical Mechanics: Construction and Physical Interpretation. Chapter 1 in Murray Gell-Mann and Constantino Tsallis, eds., Nonextensive Entropy: Interdisciplinary Applications. Santa Fe Institute Studies in the Sciences of Complexity. Oxford: Oxford University Press. pp. 1-53.
