## Thursday, July 28, 2011

### Some Confused Hypothesis Testing...

Returning to a previous post on fitting a power-law to Metropolitan Statistical Area (MSA) population data.  First, I reported results the following inferences based on Vuong LR tests (see previous post for Vuong stats and p-values):
• Two Sided Tests (Null hypothesis is that both distributions are equally far from the "truth"):
• Fail to reject null hypothesis for power-law and log-normal.
• Reject the null hypothesis for the Weibull.
• Reject the null hypothesis for the exponential.
• One-sided Tests (Null hypothesis is a power-law):
• Fail to reject the null hypothesis of power-law compared with Weibull.
• Fail to reject the null hypothesis of a power-law compared with an exponential.
• Vuong LR test for nested-models (used only for comparing a power-law and a power-law w/ exponential cut-off):
• Reject null hypothesis of a power-law in favor of a power-law with exponential cut-off.
Additionally, if you compare the following log-likelihoods for the various models, you would also select the power-law with exponential cut-off as being the preferred model:
• Power-law with exponential cut-off: -2143.809
• Log-normal: -2144.513
• Power-law: -2146.368
• Weibull: -2173.089
• Exponential: -2182.766
Later (after starring at the above plot for quite awhile and convincing myself that the log-normal "looks" better than the power-law with cut-off) I came back to my work, and decided to try and write some code to conduct a non-nested Vuong LR test to compare a power-law with exponential cut-off and a log-normal distribution.  I then reported that I was able to rejected the null hypothesis for the two-sided test (that both the power-law with exponential cut-off and log-normal distributions are equally far from the "truth"); and reject the null-hypothesis of a power-law w/ exponential cut-off for the one-sided test in favor of the log-normal.

I may have been a bit hasty in that conclusion.  Upon further review, I think that there is either a bug in my code, or that I have made some conceptual error in implementing the test.  For one, I certainly should have noticed that the p-values for both tests were suspiciously small (they were both 0.00!) given the difference in their respective LR of less than 1!

As always comments are very much encouraged...

## Thursday, July 21, 2011

### Zipf's Law for cities holds...w/ a cutoff! Check that...Nope! Log-Normal!

Update!  I have posted the code for this post online.  My code repository is very much a work in progress...so I hope you don't mind the clutter!
Continuing my quest to master the fitting of power laws to empirical data, today I am going to look at some results from Gabaix (1999) on the distribution of "city" sizes in the U.S. using a data set on the population of Metropolitan Statistical Areas.

When fitting any distribution to a data set it is important to consider whether or not the variable in question is discrete, or continuous.  This is especially true when fitting a Power Law as Clauset et al (2009) demonstrate that assuming a discrete data set can be "well approximated" by a continuous distribution power-law is questionable (at best), and that parameter estimates obtained from fitting a continuous power-law to a discrete data set can be wildly inaccurate.  However, a continuous power law will be a good approximation to the data, if the data satisfy certain (rather general) conditions:
1. The minimum value of the data set is greater than 1000, and
2. There are more than 100 observations in the data set
The above MSA  data set satisfies both criteria, so I will proceed using the continuous ML estimator (even though technically population is a discrete variable).

Population of Metropolitan Statistical Areas (MSAs):
I begin with a survival plot for the population of U.S. Metropolitan Statistical Areas (MSAs) using the data set from the Bureau of Economic Analysis (by way of Cosma Shalizi).
I present estimation results using pareto.fit() (which is available for download) to implement both the continous MLE and the Old School regression method (used by Gabaix (1999) and many, many, other papers for comparison purposes).  Note that 1) n is the number of observations in the data set above the threshold, xmin; 2) for both the continuous MLE and the regression CDF methods, I used the KS procedure to find the threshold parameter, xmin.

α xmin n
Continuous MLE 1.99 321,400 146
Regression CDF 2.07 344,600 143

95% Confidence intervals for the continuous MLE are calculated using the same non-parametric procedure described in my previous post (recall the importance of using the non-parametric bootstrap to replicate the fitting of both parameters simultaneously in order to get consistent estimates of standard errors and confidence intervals).  Results...
• α: (1.74, 2.22)
• xmin: (137,600, 552,030)
Note, that in keeping with Zipf's law (a surprisingly poor explanation of which can be found on Wikipedia), the point estimate for α is almost exactly 2.

Goodness-of-fit tests (again using the Clauset et al (2009) KS procedure) yield a p-value of 0.20 > 0.10 which means that I fail to reject the null hypothesis of a power-law for the population of MSA!
However, before I dance a little jig and claim evidence of Zipf's law I need to see if some other distribution fits the data as well or better than the power law.

Power Law Log-Normal Exponential Weibull Power Law w/ Exponential Cut-Off
Vuong Statistic NA -1.27 2.47 2.93 LR: -2.56
p-value (two-sided) 0.20 0.20 0.01 0.00 p-value: 0.02

Table turned out to be a bit muddled: I report two-sided p-values for the test of alternative hypotheses of log-normal, exponential, and stretched exponential (Weibull).  Conclusion: Fail to reject null hypothesis that the power-law and the log-normal are equally "bad" (in the sense of being same "distance" from the truth); Reject Exponential in favor of the Power-Law; Reject Weibull in favor of the Power-Law (Note that Vuong statistics are positive!).

The kicker: I reject the Power-Law model in favor of a Power-Law with exponential cut-off!  Note that since these two distributions are nested, the power-law with exponential cut-off will ALWAYS fit better than a simple power-law.  The question is whether or not the improved fit is significantly better...in this case it looks like it does!
Not sure where this blog post fits into the literature in economics on Zipf's law.  For example, I don't have any intuition about the type of theoretical mechanism that might generate a power-law with exponential cut-off.  Also, suppose, as above, I find that the data reject a pure power-law in favor of a power-law with exponential cut-off.  How do I know that the rejection isn't due to the exponential cut-off capturing "finite size" effects that must show up in the data at some point?

To conclude this very long post, I would like to point out an interesting forthcoming paper in the American Economic Review (co-authored by Xavier Gabaix).  The authors use a clustering algorithm to construct economic "cities" (which they argue are different from "political" cities) from very high resolution population data for the U.S. and UK.  They fit a power law to this data and find strong evidence in favor of Zipf's Law...I will let you take a guess as to which method they use to fit the power law (HINT: The answer makes baby Gauss cry!).

UPDATE: So I felt like I should also test the null hypothesis of a power-law with an exponential cut-off against a log-normal alternative.  Apparently, Gibrat's Law applied to cities predicts a log-normal distribution.  I wrote some very quick and dirty code to try and conduct the test based on the same Vuong LR procedure for non-nested models that I used above.  Hopefully I didn't make any mistakes!  The results: Vuong Statistic: -3.21; p-value (two-sided): 0.00; p-value (one-sided): 0.00!

It's pretty late, and been a long day, but I think that this implies that I reject the power-law with exponential cut-off in favor of the log-normal distribution. Which is good news for proponents of Gibrat's Law! Weird that testing procedure implies that log-normal is preferred to power-law with exponential cut-off; power-law with exponential cut-off is preferred to power-law; and yet some how test is indifferent between log-normal and power-law!
Finally, I am not confident that I plotted the fitted log-normal properly.  I had to multiply the density function by (3 + n/N) where n is the number of observations above xmin, and N is the total number of observations to get the plot to look correct.  I have not idea where the 3 comes from!  The (n/N) bit I understand...and now I sleep...

Update to the Update: Plotting issue fixed!  Was stupidly using the wrong function to evaluate probability density function for plotting purposes!  Plot still looks the same as above.  Thankfully, using the correct version of the density function removes the need to include the mysterious 3!  Only need include the usual factor of (n/N).

## Wednesday, July 20, 2011

### Power Law in Wealth Distribution? NO...

Recently I have been teaching myself to estimate power laws from empirical data using the improved techniques developed in Clauset et al (2009) with the hopes of applying them in my own research.  What follows is an example where I apply these techniques to some data on the wealth of the 400 Richest Americans from 2003.  This example is partially covered in one of the co-author's of Clauset et al (2009) (i.e., Cosma Shalizi) lecture notes.

Below, I expand on the example given in those lecture notes by
1. Estimating the power-law threshold parameter, xmin, from the data
2. Constructing standard errors and confidence intervals for the scaling parameter, α, and for the threshold parameter, xmin (that take into account the fact that I have estimated both α and xmin).
3. Testing an alternative hypothesis for the distribution of wealth (i.e, log-normal)
Always plot your data! I think that it is a good idea to start any data analysis with a plot of some description.  I will start this one with a plot of the probability of an individual having wealth greater than wealth (W), against wealth (W)...or more succinctly: a Survival Plot.
If the data follow a power law, then this relationship should be roughly linear.  The converse, however, is not true!  Linearity is not a sufficient condition for a power law.  Also, lots of things might look linear on a log-log plot even when they aren't.

Recipe for Fitting Power Laws:

Parameter Estimation: Following Clauset et al (2009), I estimated the scaling parameter, α, using the method of maximum likelihood whilst using the suggested KS minimization procedure for estimating the threshold parameter, xmin.  Results:
• α: 2.34
• xmin: 9e8 ($900 million) • obs.above.threshold: 302 (out of 400) The code for estimating power laws can be found on Aaron Clauset's website. There are two different implementations of the estimation procedure in R: pareto.fit() and plfit(). I have experimented with both, and although they give identical results for the above estimation, this is not true in general. From what I can tell, the (slight) differences in estimation results stem from differing implementations of the KS test. The pareto.fit() method uses R's ks.test() method, while plfit() hand-codes the KS test. I have not (yet) conducted a Monte-Carlo study to see if the differences between the implementations are worth worrying about. Standard Errors and Confidence Intervals: To calculate the standard errors and confidence intervals for the parameters I used a non-parametric procedure outlined in Cosma's lecture notes. Cosma also graciously provides the code for his lecture notes as well! To calculate the standard errors and 95% confidence intervals for α and xmin, I had to alter (trivially) the provided code to take into account the fact that I was not given the threshold parameter and instead estimated it from the data. Results: • α.se: 0.10 • xmin.se: 364e6 ($364 million)
• α.ci: (2.13, 2.53)
• xmin.ci: (-2e8, 1.05e9)
Obvious one cannot really have negative wealth (at least not at the level of \$-200 million), so just note that the 95% confidence interval for xmin includes zero.

Survival Plot (fitted power-law included):
Looks pretty good!  But while you might be able to rule out a particular model just by looking at it, I think you would be hard pressed to rule a model "in" by "eye-balling it" (unfortunately, I am sure that people do).

Goodness-of-fit Test: Continuing to follow Clauset et al (2009), I again used a Monte Carlo KS test procedure to conduct a goodness-of-fit test for the power law model.  The procedure basically generates synthetic data by re-sampling the data itself below xmin, and then generating simulated data from a power law with the estimated parameters above xmin.   A power-law is then re-estimated to each synthetic data set and the KS statistic is stored.  The result is a p-value that represents the fraction of simulations whose KS statistic was at least as large as the one obtained from fitting the power law model to the wealth data.  A small p-value (i.e., a p-value < 0.10) indicates that only a very small fraction of the simulations had a KS stat as large as the wealth data and is evidence against the plausibility of the power-law model.  Results: p-value of 0.0012!  It would seem that the data reject the power-law as a plausible model.  This is perhaps not surprising...

Testing Alternative Hypotheses:  So the wealth distribution does not follow a power law!  What about log-normal? I test this using the likelihood ratio procedure recommended in Clauset et al (2009)

The null hypothesis for comparing two non-nested distributions is that both distributions are equally far (in a Kullback-Leibler relative entropy sense...note to self: understand what the hell Kullback-Leibler relative entropy means!) from one another.  I report the Vuong test statistic and the two-sided p-value:
• Vuong Statistic: 0.21
• p-value: 0.83
Conclusion: fail to reject the null hypothesis that the power-law and the log-normal distributions are equally bad (again in the Kullback-Leibler relative entropy sense).

Other alternatives (for which  Clauset et al (2009) publish results) are the exponential, stretched exponential (Weibull), and the power-law with exponential cut-off.  I continue to have trouble getting the code for fitting the exponential distribution to work (keep getting an error in the nlm() method for minimizing the negative log-likelihood function...it keeps returning non-finite values...which makes me think that it is probably user error on my part).  I have also not yet figured out how to get the code for estimating the power-law with exponential cutoff to work.  This is at the top of my to-do list as I feel it is a very plausible alternative distribution (perhaps not necessarily for wealth...but definitely for some other data sets that I am working with at the moment).

I can however report results for the stretched exponential (Weibull).  I was able to get the code to work after I learned how to suppress warnings in R (i.e., using the suppressWarnings() method).  Again I report the Vuong test statistic and the two-sided p-value:
• Vuong Statistic: 8.05
• p-value: 0.00
The fact that the sign on the Vuong Statistic is positive indicates that the log-likelihood for the power-law model was greater than the log-likelihood for the stretched exponential (Weibull).   The test statistic of 8.05 is so large that the p-value is actually 8.88e-16!  Thus we can reject the stretched exponential in favor of the power-law for this data set (but given that we have already rejected the power law...this test is somewhat superfluous).  I hope I have interpreted the above test correctly.  If anyone feels otherwise definitely speak up!

The real objective of this post was to replicate results from fitting a power-law to empirical data (in a rigorous way that does not abuse linear regression), so that I feel confident in applying these tools in my own research.  Thus I purposefully did not stray too far from Clauset et al (2009).  I think I succeeded (at least in replicating the work of Clauset et al (2009) for this data set).  Hopefully, I have not made any glaring errors...but if I did, they are my own!

## Tuesday, July 12, 2011

### Heteroskedasticity simulations...

Finally getting back into my routine after a very successful summer school in Jerusalem.   This blog post is about part of a graduate econometrics lab on violations of the classical linear regression model that I have been working on lately.  Basic idea of the lab, which I borrowed from Prof. Mark Schaffer, is to expose students to Monte Carlo simulation and violations of the classical linear model simultaneously by having them construct simulations that demonstrate what happens to standard hypothesis tests when the DGP exhibits one of the following issues:
1. Heteroskedasticity
2. Autocorrelation
3. Endogenous regressor
4. Non-stationary regressor
To keep this post reasonably short, I am only going to focus on the section of the lab that compares various techniques for dealing with heteroskedasticity via Monte Carlo simulations.

The DGPs:
I specify the two data generating processes as follows.  For DGP #1 x can only take positive values:
• x = abs(rnorm(n, mean=0, sd=1))
• e = rnorm(n, mean=0, sd=1)
• u = e*x (implies that the variance of u increases with the square of x!)
• y = x + u
The first pane in the plot below is y versus x for n=1000 with the OLS regression line in red and the true regression function in grey.  If the heteroskedasticity isn't obvious enough, the second pane is a plot of the squared OLS residuals against the regressor x.  The true variance of the disturbance increase with the square of x (i.e., the grey curve).  Note that the true variance process looks roughly linear (even though it actually is not!).  The fact that a linear approximation to the true variance process looks like it would work pretty well for DGP #1 plays a role later.
For DGP #2 the regressor, x, is allowed to take both positive and negative values (i.e., x=rnorm(n, mean=0, sd=1)).  Everything else is the same as in DGP #1.  Here is the analogous plot for DGP #2:
Note that the even though the true variance process is non-linear in both DGP #1 and DGP #2, it can be (reasonably) well approximated by a linear function in DGP #1.  However, a linear approximation to the true variance function is an EPIC FAIL for DGP #2!  This plays a huge role in the breakdown of FGLS documented below.

Comparison #1 (OLS, OLS w/ HAC Std. Errors, and WLS): Assume that you know the true form of the variance function (i.e., you have ritually slaughtered the requisite small woodland creature, dried its bones, and left them for 3.5 days to be scattered by the wind at which point you were able to use the pattern created to divine the true variance function...alternatively, if you live in Greece, you could simply consult the Oracle).

After having done all of this, the first simulation sets the sample size to n=100 and regresses y on x (where x is given by DGP # 1) 5000 times while capturing the parameter estimate for the coefficient on x using both OLS and WLS (which incorporates the optimal weights given the true variance function) and the various flavours of the standard errors.  Note the HAC stands for Heteroskedastic and Autocorrelation Consistent standard errors.  These standard errors are used frequently in econometrics to adjust the standard errors in the presence of heteroskedasticity and autocorrelation (HAC standard errors are rumored to be consistent in the presence of arbitrary forms of heteroskedasticity and autocorrelation...although I confess to not having read the original journal articles).

Plotting the kernel densities of the OLS estimator and the WLS estimator shows that while both estimators appear to be unbiased (consistent with theory), only the WLS estimator takes advantage of the known variance function (OLS ignores this information) and thus the variance of the WLS estimator is much smaller.
Next, let's consider the size properties of the standard hypothesis test (i.e., type I error).  How often would you expect to reject the null hypothesis of β=1 if your use the standard test statistic and critical values of -1.96 and +1.96 were valid? (Answer: 5% of 5,000 repetitions = 250 times). How many rejects do we actually observe in this simulation?
• OLS: ~ 1200!
• OLS w/ HAC standard errors: ~ 450
• WLS: ~ 250
It would seem that only the WLS has the expected amount of type I error.  Not surprisingly, OLS is in bad shape.  To get even more information about the size properties, let's look at histograms of the p-values associated with the above tests.  If the test statistics were unbiased, then we should expect to find uniformly distributed p-values...

...and with OLS w/ HAC standard errors, and WLS we do find more or less uniformly distributed p-values.  OLS w/ HAC exhibits some very slight distortions...but these are nothing compared with OLS!

Finally, let's look at the power of the standard hypothesis test (i.e., type II error). Below I plot the power curves using the various flavors of standard errors.  What are we looking for in this plot? Two things.  First, a hypothesis test with good size properties should obtain a minimum value of 0.05 (for a two-tailed test) at the true value of β=1.  Second, the power curve should be relatively narrow.  The wider the power curve, the lower is the probability of correctly rejecting a false null hypothesis (i.e., the higher is the type II error).

Unsurprising conclusion of this comparison: if you are going to invest the 3.5 days necessary to obtain the true variance function, then you should use it when estimating your regression.

Comparison #2 (OLS, OLS w/ HAC Std. Errors, FGLS, and WLS (npreg): Now suppose that you do not know the true form of the variance function but wish to estimate it from the data.  This simulation sets the sample size to n=100 and regresses y on x (where x is given by DGP # 1) 5000 times while capturing the parameter estimate for the coefficient on x using OLS, FGLS, and WLS (npreg) and the various flavours of the standard errors.  Feasible Generalised Least Squares (FGLS) estimates the true variance function from the data by linearly regressing the log of the squared OLS residuals on x and using the fitted values as the estimate of the variance function. WLS (npreg) estimates the variance function using the fitted values from a non-parametric regression of the log of the squared OLS residuals on x (see Cosma Shalizi's lecture 5 notes for details).
While all three estimators appear to be unbiased, the FGLS and WLS (npreg) estimators estimate the unknown variance function (OLS does not) and thus the variance of these estimators is much smaller.

Again, consider the size and power of the standard hypothesis. How many rejections of the null hypothesis of β=1 do we actually observe?
• OLS: ~1200!
• OLS w/ HAC standard errors: ~ 400
• WLS (npreg): ~ 600
• FGLS: ~ 300
FGLS would seem to perform the best (in a relative sense), however all four procedures have excessive type I error.  They all reject a true null hypothesis too frequently.  I was expecting the size properties of the WLS (npreg) estimator to be superior to those of the FGLS estimator.  Could ye old trade-off between bias and variance.  Given that the true variance process is roughly linear for DGP #1 the bias introduced by using FGLS should be pretty small.  I also just used the default bandwidth for npreg() in my R code.  Perhaps I should have spent more time trying to optimize the bandwidth (if default bandwidth too high, then npreg() is overfitting, which should lead to higher variance in the WLS (npreg) estimator).  Also, perhaps if I had implemented in iterative procedure to estimate the variance function using the non-parametric regression technique, WLS (npreg) would outperform FGLS.  Any thoughts/ideas from readers would be welcome.

Histograms of the p-values:
In this case only with FGLS we do find more or less uniformly distributed p-values.  OLS w/ HAC se exhibits some very slight distortions...but these are nothing compared with OLS!  As touched on above, WLS (npreg) also exhibits surprisingly large size distortions.

Plot the power curve:

Again, hypothesis tests using OLS w/ HAC and FGLS have pretty good size properties, but the hypothesis tests using FGLS have superior power properties.  If you have reason to suspect that the true variance function is roughly linear, then it looks like FGLS is the way to go (although the difference between the two procedures might become negligible as sample size increases).

Comparison #3 (OLS, OLS w/ HAC Std. Errors, FGLS, and WLS (npreg) with DGP #2: Now suppose that you do not know the true form of the variance function but wish to estimate it from the data.  This simulation sets the sample size to n=100 and regresses y on x (where x is given by DGP # 2) 5000 times while capturing the parameter estimate for the coefficient on x using OLS, FGLS, and WLS (npreg) and the various flavours of the standard errors.   Recall that the true variance process for DGP #2 is poorly approximated by a linear specification.  Thus my prior was that FGLS should be an EPIC FAIL, but the WLS (npreg) would save the day! I turned out to be half right...
While all three estimators appear to be unbiased, the WLS (npreg) estimator is considerably more precise than either the OLS or FGLS estimators. Why is FGLS doing so poorly? The true variance function for DGP #2 is sufficiently non-linear that the bias introduced by using FGLS to estimate the variance function produces and EPIC FAIL!  Note that the kernel density of FGLS tracks more or less with the density of the OLS estimator!

Consider the size and power of the standard hypothesis. How many rejections of the null hypothesis of β=1 do we actually observe?
• OLS: ~1200!
• OLS w/ HAC standard errors: ~ 350
• WLS (npreg): ~ 900 ?!
• FGLS: ~ 1400!!
FGLS is an epic fail.  OLS with HAC standard error doesn't do too poorly.  What is going on with WLS (npreg)!  I was hoping that it would shine through in this most realistic of cases!  I must be over-fitting and the sample size is pretty small (only n=100).  Extending the sample size might help.  I am really perplexed by this...

Histograms of the p-values:
Only OLS with HAC standard errors has roughly uniform p-values.  The rest exhibit significant distortions.

Finally, the power curve:
OLS with HAC would seem to win again.  Its power properties are not great, but should improve with sample size.

As always R-code is available upon request.  If anyone has thoughts on what is going on with my implementation of WLS (using non-parametric regression to get the optimal weights), definitely feel free to comment!