## Thursday, July 21, 2011

### Zipf's Law for cities holds...w/ a cutoff! Check that...Nope! Log-Normal!

Update!  I have posted the code for this post online.  My code repository is very much a work in progress...so I hope you don't mind the clutter!
Continuing my quest to master the fitting of power laws to empirical data, today I am going to look at some results from Gabaix (1999) on the distribution of "city" sizes in the U.S. using a data set on the population of Metropolitan Statistical Areas.

When fitting any distribution to a data set, it is important to consider whether the variable in question is discrete or continuous.  This is especially true when fitting a power law: Clauset et al (2009) demonstrate that assuming a discrete data set is "well approximated" by a continuous power-law distribution is questionable (at best), and that parameter estimates obtained from fitting a continuous power law to a discrete data set can be wildly inaccurate.  However, a continuous power law will be a good approximation to discrete data if the data satisfy certain (rather general) conditions:
1. the minimum value of the data set is greater than 1,000, and
2. there are more than 100 observations in the data set.

The MSA data set satisfies both criteria, so I will proceed using the continuous ML estimator (even though population is technically a discrete variable).
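Given a threshold xmin, the continuous MLE for the exponent has a closed form: α̂ = 1 + n[Σᵢ ln(xᵢ/xmin)]⁻¹ (Clauset et al 2009).  The post's actual estimation is done in R with pareto.fit(); here is a minimal Python sketch of the same estimator (function name is mine):

```python
import numpy as np

def alpha_mle(data, xmin):
    """Continuous ML estimate of the power-law exponent for the tail
    x >= xmin: alpha_hat = 1 + n / sum(log(x_i / xmin))."""
    tail = data[data >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# sanity check on synthetic Pareto data with a known exponent of 2
rng = np.random.default_rng(0)
x = 321_400 * rng.uniform(size=100_000) ** (-1.0 / (2.0 - 1.0))  # inverse-CDF draw
print(alpha_mle(x, 321_400))  # should be very close to 2
```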

**Population of Metropolitan Statistical Areas (MSAs):**
I begin with a survival plot for the population of U.S. Metropolitan Statistical Areas (MSAs) using the data set from the Bureau of Economic Analysis (by way of Cosma Shalizi).
I present estimation results using pareto.fit() (which is available for download) to implement both the continuous MLE and the old-school regression method (used by Gabaix (1999) and many, many other papers) for comparison purposes.  Note that 1) n is the number of observations in the data set above the threshold, xmin; and 2) for both the continuous MLE and the regression CDF methods, I used the KS procedure to find the threshold parameter, xmin.

|                | α    | xmin    | n   |
|----------------|------|---------|-----|
| Continuous MLE | 1.99 | 321,400 | 146 |
| Regression CDF | 2.07 | 344,600 | 143 |
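The KS procedure used to pick xmin in both rows works by trying each observed value as a candidate threshold, fitting α to the resulting tail, and keeping the candidate that minimizes the Kolmogorov–Smirnov distance between the fitted and empirical tail CDFs.  A self-contained Python sketch (the post itself uses Shalizi's R code; the function name is mine):

```python
import numpy as np

def fit_power_law(data):
    """Clauset et al. KS procedure: for each candidate xmin, fit alpha by
    MLE on the tail, compute the KS distance between fitted and empirical
    tail CDFs, and return the fit minimizing that distance."""
    best = (np.inf, np.nan, np.nan)        # (KS distance, xmin, alpha)
    for xmin in np.unique(data)[:-1]:      # keep at least two tail points
        tail = np.sort(data[data >= xmin])
        n = len(tail)
        alpha = 1.0 + n / np.sum(np.log(tail / xmin))
        fitted_cdf = 1.0 - (tail / xmin) ** (1.0 - alpha)
        ks = np.max(np.abs(np.arange(1, n + 1) / n - fitted_cdf))
        if ks < best[0]:
            best = (ks, xmin, alpha)
    return best

# quick check on synthetic Pareto data with alpha = 2.5, xmin = 1
rng = np.random.default_rng(1)
x = rng.uniform(size=2000) ** (-1.0 / 1.5)
ks, xmin_hat, alpha_hat = fit_power_law(x)
```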

95% Confidence intervals for the continuous MLE are calculated using the same non-parametric procedure described in my previous post (recall the importance of using the non-parametric bootstrap to replicate the fitting of both parameters simultaneously in order to get consistent estimates of standard errors and confidence intervals).  Results...
- α: (1.74, 2.22)
- xmin: (137,600, 552,030)
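The bootstrap procedure above can be sketched as follows — resample the full data set with replacement, refit both parameters jointly on each resample, and take quantiles of the resulting estimates.  This is an illustrative Python version under those assumptions, not the code behind the reported intervals:

```python
import numpy as np

def fit_power_law(data):
    """KS fit of (xmin, alpha), as in the Clauset et al. procedure."""
    best = (np.inf, np.nan, np.nan)
    for xmin in np.unique(data)[:-1]:
        tail = np.sort(data[data >= xmin])
        n = len(tail)
        alpha = 1.0 + n / np.sum(np.log(tail / xmin))
        ks = np.max(np.abs(np.arange(1, n + 1) / n
                           - (1.0 - (tail / xmin) ** (1.0 - alpha))))
        if ks < best[0]:
            best = (ks, xmin, alpha)
    return best

def bootstrap_ci(data, n_boot=200, seed=0):
    """Non-parametric bootstrap CI: resample the FULL data set and refit
    both xmin and alpha each time, so their joint sampling variability
    is captured (refitting only alpha at a fixed xmin would understate it)."""
    rng = np.random.default_rng(seed)
    fits = [fit_power_law(rng.choice(data, size=len(data), replace=True))
            for _ in range(n_boot)]
    alphas = [f[2] for f in fits]
    xmins = [f[1] for f in fits]
    return np.quantile(alphas, [0.025, 0.975]), np.quantile(xmins, [0.025, 0.975])

# small synthetic example: Pareto with alpha = 2, xmin = 1
rng = np.random.default_rng(2)
x = rng.uniform(size=300) ** (-1.0 / 1.0)
alpha_ci, xmin_ci = bootstrap_ci(x, n_boot=50)
```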
Note that, in keeping with Zipf's law (a surprisingly poor explanation of which can be found on Wikipedia), the point estimate for α is almost exactly 2.

Goodness-of-fit tests (again using the Clauset et al (2009) KS procedure) yield a p-value of 0.20 > 0.10, which means that I fail to reject the null hypothesis of a power law for the population of MSAs!
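The KS goodness-of-fit p-value is itself a simulation: generate synthetic data sets from the fitted model (semiparametrically — empirical values below xmin, power-law draws above), refit each one, and count how often the synthetic KS distance exceeds the observed one.  A Python sketch, simplified by holding xmin fixed rather than re-estimating it on each simulation (the full procedure re-fits xmin too); no edge-case handling:

```python
import numpy as np

def gof_pvalue(data, xmin, n_sims=100, seed=0):
    """KS goodness-of-fit p-value in the spirit of Clauset et al.: share of
    semiparametric simulations from the fitted model whose KS distance
    exceeds the observed one. Simplification: xmin is held fixed."""
    rng = np.random.default_rng(seed)

    def fit_ks(x):
        tail = np.sort(x[x >= xmin])
        n = len(tail)
        alpha = 1.0 + n / np.sum(np.log(tail / xmin))
        ks = np.max(np.abs(np.arange(1, n + 1) / n
                           - (1.0 - (tail / xmin) ** (1.0 - alpha))))
        return alpha, ks

    alpha, ks_obs = fit_ks(data)
    body = data[data < xmin]
    p_tail = np.mean(data >= xmin)
    exceed = 0
    for _ in range(n_sims):
        k = rng.binomial(len(data), p_tail)          # how many tail draws
        tail = xmin * rng.uniform(size=k) ** (-1.0 / (alpha - 1.0))
        rest = rng.choice(body, size=len(data) - k) if len(body) else np.empty(0)
        _, ks_sim = fit_ks(np.concatenate([rest, tail]))
        exceed += ks_sim >= ks_obs
    return exceed / n_sims

# synthetic Pareto data: the true model, so p should not be tiny on average
rng = np.random.default_rng(3)
x = rng.uniform(size=500) ** (-1.0 / 1.0)
p = gof_pvalue(x, 1.0)
```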
However, before I dance a little jig and claim evidence of Zipf's law I need to see if some other distribution fits the data as well or better than the power law.

|                     | Power Law | Log-Normal | Exponential | Weibull | Power Law w/ Exponential Cut-Off |
|---------------------|-----------|------------|-------------|---------|----------------------------------|
| Vuong statistic     | NA        | -1.27      | 2.47        | 2.93    | LR: -2.56                        |
| p-value (two-sided) | 0.20      | 0.20       | 0.01        | 0.00    | 0.02                             |

I report two-sided p-values for tests against the alternative hypotheses of log-normal, exponential, and stretched exponential (Weibull).  Conclusions: fail to reject the null hypothesis that the power law and the log-normal are equally "bad" (in the sense of being the same "distance" from the truth); reject the exponential in favor of the power law; and reject the Weibull in favor of the power law (note that those Vuong statistics are positive!).
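The Vuong statistic behind the table is just the normalized sum of pointwise log-likelihood ratios between the two fitted models, compared to a standard normal.  Here is a self-contained Python sketch, illustrated with a power law versus a shifted exponential on the tail (both have closed-form MLEs conditional on x ≥ xmin; the function name and sign convention, positive favoring the power law, are mine):

```python
import numpy as np
from math import erfc, sqrt

def vuong_pl_vs_exp(tail, xmin):
    """Vuong test for non-nested models: power law vs shifted exponential,
    both fit by MLE to the tail x >= xmin. Returns (statistic, two-sided
    p-value); a positive statistic favours the power law."""
    tail = np.asarray(tail, dtype=float)
    n = len(tail)
    alpha = 1.0 + n / np.sum(np.log(tail / xmin))        # power-law MLE
    lam = 1.0 / (np.mean(tail) - xmin)                   # exponential MLE
    ll_pl = np.log((alpha - 1.0) / xmin) - alpha * np.log(tail / xmin)
    ll_exp = np.log(lam) - lam * (tail - xmin)
    r = ll_pl - ll_exp                                   # pointwise log-LRs
    v = sqrt(n) * np.mean(r) / np.std(r, ddof=1)
    p = erfc(abs(v) / sqrt(2.0))                         # two-sided normal p
    return v, p

# on data that really are a power law, the statistic should come out positive
rng = np.random.default_rng(4)
tail = rng.uniform(size=1000) ** (-1.0 / 1.0)            # Pareto, alpha = 2
v, p = vuong_pl_vs_exp(tail, 1.0)
```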

The kicker: I reject the power-law model in favor of a power law with exponential cut-off!  Note that since these two distributions are nested, the power law with exponential cut-off will ALWAYS fit at least as well as a simple power law.  The question is whether or not the improved fit is significantly better...and in this case it looks like it is!
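For this nested comparison the Vuong normal approximation does not apply; Clauset et al instead use Wilks' theorem, under which twice the log-likelihood gain of the larger model is asymptotically χ² with one degree of freedom (one extra parameter, the cut-off rate).  Assuming the table's p-value was computed that way, the reported LR of -2.56 checks out:

```python
from math import erfc, sqrt

# Wilks: 2 * (loglik_cutoff - loglik_powerlaw) ~ chi-squared(1) under the null.
# For 1 degree of freedom, P(chi2_1 >= t) = erfc(sqrt(t / 2)).
LR = -2.56               # log-likelihood ratio from the table (power law minus cut-off)
t = -2.0 * LR            # likelihood-ratio test statistic, 5.12
p = erfc(sqrt(t / 2.0))
print(round(p, 2))       # 0.02, matching the table
```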
Not sure where this blog post fits into the literature in economics on Zipf's law.  For example, I don't have any intuition about the type of theoretical mechanism that might generate a power-law with exponential cut-off.  Also, suppose, as above, I find that the data reject a pure power-law in favor of a power-law with exponential cut-off.  How do I know that the rejection isn't due to the exponential cut-off capturing "finite size" effects that must show up in the data at some point?

To conclude this very long post, I would like to point out an interesting forthcoming paper in the American Economic Review (co-authored by Xavier Gabaix).  The authors use a clustering algorithm to construct economic "cities" (which they argue are different from "political" cities) from very high resolution population data for the U.S. and UK.  They fit a power law to this data and find strong evidence in favor of Zipf's Law...I will let you take a guess as to which method they use to fit the power law (HINT: The answer makes baby Gauss cry!).

UPDATE: So I felt like I should also test the null hypothesis of a power-law with an exponential cut-off against a log-normal alternative.  Apparently, Gibrat's Law applied to cities predicts a log-normal distribution.  I wrote some very quick and dirty code to try and conduct the test based on the same Vuong LR procedure for non-nested models that I used above.  Hopefully I didn't make any mistakes!  The results: Vuong Statistic: -3.21; p-value (two-sided): 0.00; p-value (one-sided): 0.00!

It's pretty late, and it's been a long day, but I think this implies that I reject the power law with exponential cut-off in favor of the log-normal distribution, which is good news for proponents of Gibrat's Law!  It's weird that the testing procedure implies that the log-normal is preferred to the power law with exponential cut-off, and the power law with exponential cut-off is preferred to the pure power law, and yet somehow the test is indifferent between the log-normal and the pure power law!
Finally, I am not confident that I plotted the fitted log-normal properly.  I had to multiply the density function by (3 + n/N), where n is the number of observations above xmin and N is the total number of observations, to get the plot to look correct.  I have no idea where the 3 comes from!  The (n/N) bit I understand...and now I sleep...

Update to the update: Plotting issue fixed!  I was stupidly using the wrong function to evaluate the probability density function for plotting purposes!  The plot still looks the same as above.  Thankfully, using the correct version of the density function removes the need to include the mysterious 3!  I only need to include the usual factor of (n/N).
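For the record, the (n/N) factor arises because the fitted model only describes the tail: on a survival plot of the full sample, P(X > x) = (n/N) · P(X > x | X ≥ xmin) for x ≥ xmin.  A Python sketch with a log-normal tail (the values of mu, sigma, and N below are purely illustrative, not the fitted ones):

```python
from math import erfc, log, sqrt

def lognormal_ccdf(x, mu, sigma):
    """P(X > x) for a log-normal: 0.5 * erfc((ln x - mu) / (sigma * sqrt(2)))."""
    return 0.5 * erfc((log(x) - mu) / (sigma * sqrt(2.0)))

def scaled_tail_ccdf(x, mu, sigma, xmin, n, N):
    """Fitted tail CCDF, conditioned on x >= xmin and rescaled by n/N so
    it overlays the empirical survival plot of the full sample."""
    return (n / N) * lognormal_ccdf(x, mu, sigma) / lognormal_ccdf(xmin, mu, sigma)

# at x = xmin the scaled curve equals n/N exactly, anchoring it to the
# empirical plot (146 tail observations; N = 366 is an illustrative total)
print(scaled_tail_ccdf(321_400, 12.0, 1.5, 321_400, 146, 366))  # = 146/366
```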