# Chi-squared

This page reviews the current approaches and new developments in contingency table analysis.

## Categorical Variables: association and significance tests under the Null hypothesis

Categorical variables are not assumed to have any particular ordering of the categories distinguished in each variable. The statistical tests of association and significance described here for cross-tabulation are all related: each compares the actual value in each cell of the table with the value expected under the laws of probability for independent variables. If the categories within each variable are mutually exclusive and the assumption of independence ("no correlation") between the variables is valid, then the laws of probability tell us that the expected value in each cell can be computed as a product of the independent probabilities for each category. For a cell in a two-way cross-tab, as in the table below, if the total N is known, and f(1,X) and f(2,Y) are the marginal frequencies (the row or column totals) of category X of variable 1 and category Y of variable 2, then we can estimate P(cell X,Y) = f(1,X)/N * f(2,Y)/N. The expected frequency in cell X,Y is then $\mu_i \equiv N*P(cell X,Y) = 60*100/200 = 30$ in this case, i.e., row total * column total / N. This generalizes to n-way cross-tabs.

| | X=feature present | feature absent |
|---|---|---|
| heavy | 70 | 5 |
| Y=light | 25 (=X,Y) | 35 |
| absent | 5 | 60 |
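As a minimal sketch of the "row total * col total / N" rule, the following Python computes the expected frequency of every cell in the table above. The function name `expected_counts` and the list-of-rows layout are illustrative choices, not part of the original page.

```python
def expected_counts(table):
    """Expected cell frequencies under independence: row total * col total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: heavy, light, absent; columns: feature present, feature absent.
observed = [[70, 5], [25, 35], [5, 60]]
expected = expected_counts(observed)

# Cell (light, feature present): 60 * 100 / 200
light_present = expected[1][0]
```

Here `light_present` corresponds to the cell marked X,Y, whose actual value is 25.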

A cell whose expected value is 5 or more can be compared to that expected value using the Chi-squared distribution. The statistic $\sum_{i=1}^k \frac{\left(X_i-\mu_i\right)^2}{\sigma_i^2} = \sum_{i=1}^k \frac{\left(Actual_i-Expected_i\right)^2}{Expected_i}$ measures the squared difference between the actual and expected value (the expected value being treated as a mean, the second term in the numerator) divided by the variance in the denominator, where it can be shown that under independence the variance of a cell count is approximately equal to its expected value, $\sigma_i^2 \approx \mu_i = Expected_i$ (see Wikipedia:Chi-squared).

The summation from 1 to k in these equations runs over the k cells being considered, so that the Chi-squared statistic serves as a basis for testing departure from independence between the variables (the null hypothesis) for all the cells considered together.
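The summation over cells can be sketched in a few lines of Python. This is an illustration only: the function name `chi_squared` is hypothetical, and the data are the heavy/light/absent by present/absent counts from the table in this section.

```python
def chi_squared(table):
    """Pearson Chi-squared: sum over cells of (actual - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, actual in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (actual - expected) ** 2 / expected
    return total


observed = [[70, 5], [25, 35], [5, 60]]
chi2_all_cells = chi_squared(observed)

# Contribution of the single cell (light, feature present): actual 25, expected 30.
chi2_one_cell = (25 - 30) ** 2 / 30
```

Summing over all six cells tests the whole table, while `chi2_one_cell` is the contribution of one cell taken alone.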

## Retrieving the result

The Chi-squared statistic, as a sum over multiple cells, is interpreted using degrees of freedom (df). For one cell, df=1. For a sum over all cells in an R-row by C-column table, df = (R-1)*(C-1), because once this number of cell values is known, the remainder can be computed from the row and column totals. Using the calculator at http://faculty.vassar.edu/lowry/tab2x2.html, the four numbers to enter in the X by Y boxes, which retrieve the probability result p=.11 in this example, are:

25 35
75 65


The problem in all cross-tabulation tests is that in observational studies the cases are usually not independent, which reduces the observed N to a much smaller effective sample size (see Wikipedia:Effect size). If we cut the effective sample size in this example in half and adjust the actual cell values accordingly, the result is p=.27:

12 17
37 32


What does not change with alteration of the effective sample size is the Phi coefficient of correlation, which is constant at Phi = -.11, meaning a small negative correlation. What does change are the probabilities under the null hypothesis, all of which are more than doubled.
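The contrast between a stable Phi and a changing probability can be checked with a short sketch. It assumes the uncorrected Pearson Chi-squared with df=1 (no continuity correction), so the probabilities may differ slightly from those reported by the online calculator; the function name `phi_and_p` is hypothetical.

```python
import math


def phi_and_p(a, b, c, d):
    """Phi and uncorrected Chi-squared p-value (df=1) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    chi2 = n * phi ** 2  # for a 2 x 2 table, Chi-squared = N * Phi^2
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of Chi-squared with df=1
    return phi, p


phi_full, p_full = phi_and_p(25, 35, 75, 65)  # full sample
phi_half, p_half = phi_and_p(12, 17, 37, 32)  # effective sample size halved
print(round(phi_full, 2), round(phi_half, 2))
print(p_half / p_full)  # ratio of the two probabilities
```

Because Chi-squared is N times Phi-squared in the 2 x 2 case, halving N with Phi held fixed halves Chi-squared, which is why the probability rises while the correlation does not move.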

## Correlations

A correlation coefficient is usually normalized between -1 (negative correlation) and +1 (positive correlation), with zero for no correlation. Phi is a correlation within these limits for 2 x 2 cross-tabs, where Phi-squared is $\phi^2 \equiv \chi^2/N$. For larger cross-tabs, where m is the smaller of the number of rows and the number of columns, the adjusted "Cramer's phi" or Cramer's V (squared) is $V^2 \equiv \chi^2/(N(m-1))$, hence $\mathrm{Cramer's\ V} \equiv \sqrt{\chi^2/(N(m-1))}$. Because it is computed from Chi-squared, which carries no sign, this coefficient is bounded between 0 and +1.
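A hedged sketch of Cramer's V follows, using the standard textbook definition $V = \sqrt{\chi^2/(N(m-1))}$ with m the smaller of the number of rows and columns. The function name `cramers_v` is an illustrative assumption, and the first table is the 3 x 2 example from earlier on this page.

```python
import math


def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (N * (m - 1))), with m = min(rows, columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )
    m = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (m - 1)))


v_3x2 = cramers_v([[70, 5], [25, 35], [5, 60]])
v_2x2 = cramers_v([[25, 35], [75, 65]])  # equals the absolute value of Phi
```

For a 2 x 2 table m-1 = 1, so V reduces to the absolute value of Phi.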

Cramer's V does not assume any particular ordering of the categories of the two variables, but if there is such an ordering, then it gives a similar result to Wikipedia:Tau-b, which is an ordinal or rank order correlation. If Cramer's V is much (say 20%) greater than tau-b, then there is a non-linear or non-ordinal component to the correlation.

The square of the tau-b correlation approximates the standard measures of percentage variance accounted for in predicting one variable from the other.

A good rule of thumb to guard against the effective sample size problem is to divide N and all your cell values by 2 and re-evaluate significance (correlations will not be changed).

## Fisher (exact test)

Instead of Chi-squared, the Fisher exact test must or may be used for cross-tabulations:

1. Must use: When Chi-squared cannot be used because some cell expected values are less than 5
2. May use: For greater statistical power, or ability to discriminate associations that deviate from independence (the "null hypothesis")

The problem of effective sample size or Wikipedia:Effect size remains the same as with Chi-squared.

Using http://faculty.vassar.edu/lowry/tab2x2.html as above, the Fisher test is also calculated (and, as expected, gives lower probabilities for the two-tailed test). The Fisher test is explained and illustrated in full at this site.
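For readers who want to compute the Fisher exact test directly, here is a minimal sketch for a 2 x 2 table. It assumes one common two-sided convention (summing the probabilities of all tables that are no more probable than the observed one, with margins fixed); other conventions exist, and the calculator above may use a different one. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is as probable as, or less probable than, the observed one.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )


p_small = fisher_exact_two_sided(3, 1, 1, 3)      # a small table with expected values below 5
p_example = fisher_exact_two_sided(25, 35, 75, 65)  # the example table from this page
```

The small relative tolerance in the comparison guards against floating-point ties between equally probable tables.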

## Calculation

You must adjust for the "effective" (independent) sample size N as opposed to the "noneffective" (nonindependent) N.

## The modeling problem for pairwise associations

Models of how variables are related in the case of pairwise measures of correlation (association) take many forms. At one extreme are measures of how much information in one set of discrete categories is predicted by another without any ordering (lambda, the uncertainty coefficient, the contingency coefficient); at the other extreme are measures of how ordered variation in one set of ranked categories relates to another (Kendall's tau, Somers' d); and there are intermediate possibilities (interval by nominal: eta). For two binary variables: if they attempt to measure the same thing, Kappa measures the percentage of cases of agreement adjusted for differences in the proportions of the two categories that make raw agreement more likely; Relative Risk (RR) treats the column variable as a control and the row variable as an outcome or effect; and the McNemar test treats pairs of observations as concordant or discordant, with discordant pairs occurring in one of two ways (r and s), and evaluates the odds-ratio Chi-squared $\chi^2 \equiv \frac{(|r-s|-1)^2}{r+s}$. We have discussed one of the rank correlations (Kendall's tau-b), which converges to the linear association measure for the 2 x 2 table, for comparison with Cramer's V.
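Two of the binary-variable measures named above are simple enough to sketch directly. The McNemar function follows the formula given in the text; the Kappa function assumes the usual Cohen's kappa definition (observed agreement corrected for chance agreement). All counts below are made-up illustrative values, not data from any study.

```python
def mcnemar_chi2(r, s):
    """McNemar Chi-squared from the two kinds of discordant pairs, r and s."""
    return (abs(r - s) - 1) ** 2 / (r + s)


def cohen_kappa(a, b, c, d):
    """Kappa for a 2 x 2 agreement table [[a, b], [c, d]], agreements on the diagonal."""
    n = a + b + c + d
    observed_agreement = (a + d) / n
    # Agreement expected by chance from the marginal proportions alone.
    chance_agreement = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


# Hypothetical counts, for illustration only.
chi2_mcnemar = mcnemar_chi2(10, 5)
kappa = cohen_kappa(20, 5, 10, 15)
```

With these hypothetical counts, raw agreement is 70 percent but half of that would be expected by chance, which is what Kappa adjusts for.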

All of these are probability measures. There are three modeling questions:

1. Which of these probability measures are models that fit the research question, if any?
2. Are there alternatives that are not bivariate but involve more complex relationships?
3. Among these alternatives, which of these probability measures has the best fit?

To exemplify question 2, let's take the medical Risk measure, in which one variable is an experimental control. If a given treatment produces an association (potential risk or treatment factor), however, the researcher must consider whether there are other variables that the subjects carry into the experiment, third factors that interact with the treatment whose presence may:

• enhance the observed treatment effect
• nullify the treatment effect
• make no difference as to which subjects are affected by the treatment

Without running post-hoc 3-way tests for column by row by third factor for all measurable third factors on which there is systematic subject data, the experimental result cannot be confirmed. These third factors provide replicability if none of the third factors make a difference. The researcher must guard against spurious non-replicability by the use of multiple-test or group statistics. These guard against the fact that if tests against third factors are repeated N times, the chances are that with random variation, one time in N, a result will have a statistical significance of 1 in N, e.g., if N=100, one result is likely to be "strong enough" to have a significance of p ≤ .01.
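The multiple-test problem can be made concrete with a small calculation. The sketch below assumes the repeated tests are independent, and it shows the Bonferroni adjustment as one common multiple-test guard; the original text does not name a specific correction, so that choice is an assumption.

```python
def prob_at_least_one(alpha, n_tests):
    """Chance that at least one of n independent tests reaches level alpha by chance."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall error rate near alpha."""
    return alpha / n_tests


# 100 third-factor tests, each evaluated at p <= .01
familywise = prob_at_least_one(0.01, 100)
# Per-test threshold for an overall level of .05 across 100 tests
per_test = bonferroni_threshold(0.05, 100)
```

With 100 tests at the .01 level, a chance "significant" result is more likely than not, which is why a single strong third-factor result should not be read as non-replication.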

The models, then, that are finally derived, whether from experiments or observational data, should incorporate the higher order interaction alternatives as well as the initial, say, pairwise alternatives.

And until third-factor testing is done rigorously to evaluate whether it is only bivariate relationships that occur among a set of variables, one should beware the use of multivariate analyses -- factor analysis, multidimensional scaling -- that assume that the only interactions are bivariate. Look under the hood for the assumptions underlying such models.

What I have called "group tests" also need to be done to determine whether many variables in a set of variables have constraint relationships of the form X ≤ Y, which cannot be evaluated by the general linear model or by the Chi-squared family of statistics. The Gamma coefficient for paired variables can take such inequalities into account, but cannot distinguish between strong correlations produced by asymmetries of the X ≤ Y form and symmetries of the X ≈ Y form.

Finally, if we have several reasonable alternative or competing models, how do we evaluate fit (item #3 above)?

## The modeling problem is not solved by "significance" under the Null hypothesis

Statistical significance does not provide an evaluation of the goodness-of-fit (item #3 above) of a model or the relative goodness-of-fit of alternate models. Significance is a goodness-of-fit test only of the null hypothesis of no association.

## Goodness-of-fit evaluation of models must be probabilistic (bootstrap)

Probabilities are often used to design and interpret measures of association as discussed above, but the goodness-of-fit of models or of one model in comparison to another requires the solution of a new set of problems.

Given a sample of size N, a goodness-of-fit evaluation takes the form of what is called bootstrap modeling: in a univariate or bivariate model, let each of the N cases take a random value on one variable, and assign the other variable so as to be consistent with the model. In the univariate case the other variable will describe the overall distribution of the variables, e.g., normal, lognormal, power-law, etc. If the model is X ≤ Y, then X is drawn randomly and Y ≥ X is drawn randomly. The values of X may also be drawn randomly from the actual values of X. When this is done repeatedly (the bootstrap), the resultant X,Y distributions will vary because of the random draw of X. If some set of parameters (or a single parameter) is attached to these X,Y distributions, they will form a new parameter distribution, and we can see where the actual parameter for the empirical X,Y distribution falls: near the center, in which case the probability P(model) approaches 1, or in the tails, in which case it approaches 0. P(model) measures the probability that these data could result from a process governed by the given model.

In comparing two models, there will be two parameter distributions for a single comparative parameter, and a probability P(model 1 > model 2) is generated as to whether the actual data are drawn from (consistent with) the first model as opposed to the second (in this case P(model 1 > model 2) = 1 - P(model 2 > model 1), but the probabilities of the individual models are not given).
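The procedure for the X ≤ Y model can be sketched as follows. Several choices here are assumptions made only for illustration: the variables are coded as integer levels 0 to `n_levels - 1`, Y is drawn uniformly among the levels at or above X, the parameter is the mean gap Y - X, and the position of the observed parameter is summarized as a two-sided percentile score (near 1 at the center, near 0 in the tails). The data are hypothetical.

```python
import random


def simulate_parameter(xs, n_levels, rng):
    """One replicate under the model X <= Y; returns the mean gap Y - X."""
    gaps = []
    for _ in xs:
        x = rng.choice(xs)                # X drawn from the actual values of X
        y = rng.randint(x, n_levels - 1)  # Y drawn uniformly with Y >= X (assumption)
        gaps.append(y - x)
    return sum(gaps) / len(gaps)


def model_position(pairs, n_levels, n_boot=2000, seed=0):
    """Two-sided percentile position of the observed parameter among the replicates."""
    rng = random.Random(seed)
    xs = [x for x, _ in pairs]
    observed = sum(y - x for x, y in pairs) / len(pairs)
    sims = [simulate_parameter(xs, n_levels, rng) for _ in range(n_boot)]
    below = sum(s <= observed for s in sims) / n_boot
    above = sum(s >= observed for s in sims) / n_boot
    return min(1.0, 2 * min(below, above))


# Hypothetical (X, Y) pairs on a 0..4 scale, all satisfying X <= Y.
data = [(0, 2), (1, 3), (0, 4), (2, 3), (1, 1), (3, 4), (0, 1), (2, 4), (1, 2), (0, 3)]
p_model = model_position(data, n_levels=5)

# Data in which Y always equals X fall far in the tail of this particular model.
p_poor_fit = model_position([(0, 0)] * 30, n_levels=5)
```

Comparing two models would repeat the simulation under each model for the same comparative parameter and then ask which simulated distribution the observed value sits closer to.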

There is now a developed field of causal modeling that uses bootstrap methods (see Halbert White). For the simple bivariate models discussed above, bootstrap models are easily programmed by computer.

Douglas R. White, Michael L. Burton, and Lilyan A. Brudner 1977 Entailment Theory and Method: A Cross-Cultural Analysis of the Sexual Division of Labor. Cross Cultural Research 12:1-24. These data are from the Standard Cross-Cultural Sample. These gender orderings have been replicated in archaeological studies.

Douglas R. White, Robert Pesner, Karl Reitz. 1983. An Exact Significance Test for Three-Way Interaction. Cross Cultural Research 18:103-122. http://ccr.sagepub.com/cgi/content/abstract/18/2/103

Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. 2007. Power-law distributions in empirical data. http://arxiv.org/abs/0706.1062.

Laurent Tambayong, Aaron Clauset, Cosma Shalizi, and Douglas R. White. 2008. Estimating Tsallis q and Power-Law Tails in Empirical Data. (In draft)

## Bootstraps for contingency tables

A bootstrap method for contingency tables:

Hyeong Chul Jeong, Myoungshic Jhun, and Daehak Kim. 2005. Bootstrap tests for independence in two-way ordinal contingency tables. Computational Statistics & Data Analysis 48(3):623-631

To open the article, go to the DOI site http://dx.doi.org and paste the DOI: 10.1016/j.csda.2004.03.009

They use the bootstrap method not to test the contingency model itself, $p(Y_j) \equiv p(Y_j|X_i)*p(X_i)$, but to test independence, and find in their Table 1 that their estimates of significance are larger than those of Chi-squared (and Wishart) tests, but also:

"As ρ tends to 1, we have strong evidence for the alternative hypothesis. Therefore we note that the estimated powers of the three methods increase as ρ increases. When the sample size is small and the contingency table size is large, the bootstrap method is more powerful than the other methods. However, the differences between the methods decreases as the sample size increases."

Thus, for smaller sample sizes, this corrects in part for the bias in ordinary significance tests (but not for an additional aspect, that of network autocorrelation nonindependence among the cases).

It remains to work out (or find) bootstrap software for $p(Y_j) \equiv \Big[p(Y_j|X_i)*p(X_i)|X_i \in X_{i=1,N}\Big]$, which does not seem difficult.
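One way such software might look is sketched below. This is not the method of Jeong, Jhun, and Kim, and it is only one possible reading of the contingency model: for each of the N cases, X is drawn from the observed distribution of X and Y is then drawn from the observed conditional distribution of Y given that X. The statistic (Phi for a 2 x 2 table), the percentile interval, and all function names are illustrative assumptions.

```python
import math
import random


def phi_coefficient(counts):
    """Phi for a 2 x 2 table of counts; 0.0 if any margin is empty."""
    (a, b), (c, d) = counts
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def bootstrap_phi_interval(table, n_boot=1000, seed=0):
    """Percentile interval for Phi, resampling X from p(X) and then Y from p(Y | X)."""
    rng = random.Random(seed)
    n = sum(sum(row) for row in table)
    x_weights = [sum(row) for row in table]  # observed frequencies of each X category
    stats = []
    for _ in range(n_boot):
        counts = [[0, 0], [0, 0]]
        for x in rng.choices([0, 1], weights=x_weights, k=n):
            # Observed row of counts for this X serves as weights for p(Y | X).
            y = rng.choices([0, 1], weights=table[x])[0]
            counts[x][y] += 1
        stats.append(phi_coefficient(counts))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


observed = [[25, 35], [75, 65]]
low, high = bootstrap_phi_interval(observed)
```

Note that this resampling treats the cases as independent, so it does not address the nonindependence among cases discussed above.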

An example for n-way cross-tabs:

Bernd Streitberg. 1999. Exploring interactions in high-dimensional tables: a bootstrap alternative to log-linear models. Annals of Statistics 27(1): 405-413. http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1018031118

## Case-by-case analysis

When the researcher is fully knowledgeable about the details of a series of cases, if all the plausible alternative hypotheses or models are specified in advance, without bias, a procedure comparable to evaluating goodness-of-fit is to go through each case and score how well it stands up against each model, hypothesis, or explanation. Individual models, hypotheses, or explanations can then be evaluated in absolute terms, and in relative terms as between competing models. Evaluation can also be done of which subsumes others, and which have greater generality or explanatory power. Qualitative analysts take note.

Both approaches exemplify the method of multiple working hypotheses that is necessary for any scientific study. Interestingly, the probabilistic approach does not depend on having true probability samples of observations but can be used with opportunistic or observational samples, so long as they are not selectively chosen for cases that fit the researcher's hypotheses.

Some examples of this approach:

Guillermo Algaze. 2005. The Sumerian Takeoff. Structure and Dynamics 1#1 http://repositories.cdlib.org/imbs/socdyn/sdeas/vol1/iss1

Robert McC. Adams. 2008:1. An Interdisciplinary Overview of a Mesopotamian City and its Hinterlands. Cuneiform Digital Library Journal http://cdli.ucla.edu/pubs/cdlj/2008/cdlj2008_001.html © Cuneiform Digital Library Initiative

Douglas R. White, G. P. Murdock, R. Scaglion. 1971 Natchez Class and Rank Reconsidered. Ethnology 10:369- 388. http://eclectic.ss.uci.edu/~drwhite/pub/NatchezPeople.pdf

Douglas R. White and Woodrow W. Denham. 2008 The Indigenous Australian Marriage Paradox: Small-World Dynamics on a Continental Scale, Structure and Dynamics 3(2).5. http://intersci.ss.uci.edu/wiki/pub/Paradox07b.pdf

## Reference

(Multiple working hypotheses approach: quantitative)

... more examples needed