Beyond Microfoundations:: 2011

Wednesday, November 30, 2011

Power-laws in Gold? A journey into the world of dependent data...

Attention Conservation Notice:

I have a son!
Though plausible, the power-law is probably not the best model for the positive tails of gold returns; power-law model is firmly rejected for the negative tail.
Rest of post is the long, tired ramblings about my travails in finding confidence intervals for my power-law parameter estimates with dependent data.

Apologies for the long lag between posts, but my wife gave birth to our son Callan Harry on Sept. 19th and the joys of being a father took precedence over my blog posts! Now that things have settled down a bit, I thought I would take the time to write up some thoughts on the price of gold...

The above top plot in the above chart displays the daily nominal price of gold from 2 January 1973 through 28 November 2011using historical data from usagold.com. Note the sharp increase in the nominal price of gold following the creation of gold backed ETFs in March 2003. The bottom plot displays normalized standard logarithmic returns for gold over the same time period. Note the ever-present volatility clustering.

Normally when looking at an asset price over such a long time period one would prefer to look at the real price of the asset rather than the nominal price. As I am interested in exploring the effects of credit and liquidity constraints on asset prices, I want to focus on the dynamics of the nominal price of gold. Credit constraints typically enter the picture in the form of debt contracts which I assume are written in nominal rather than real terms. There are two ways to think about liquidity/re-saleability: quantity liquidity/re-saleability and value liquidity/re-saleability. A measure of quantity liquidity might be the average daily trading volume of an asset over some time period. A measure of value liquidity, might be the average $-value of an asset traded over some time period. I feel like the two concepts are not identical, and I prefer to think of liquidity in terms of value re-saleability rather than quantity re-saleability.

Show me the tails!
In thinking about how the dynamics of asset prices might be impacted by credit and liquidity constraints, it seems logical (to me at least) that one should focus on the behavior of the extreme tails of the return distribution. Below is a survival plot of both the positive and negative tails of normalized gold returns along with the best-fit power-law model found using the Clauset et al. (2009) method.

This approach to parameter estimation assumes that the underlying data (at least above the threshold parameter) are independent. How realistic is this assumption for gold returns (and for asset returns in general)? The random-walk model for asset prices, which is consistent with the strong form of EMH, would suggest that asset returns should be independent (otherwise historical asset price data could be profitably used to predict future asset prices). Given that gold returns, and asset returns generally, display clustered volatility one would expect that the absolute returns would display significant long-range correlations. Below are plots of various autocorrelation functions for gold...

Both the positive and negative tails of normalized gold returns display significant autocorrelations. MLE of scaling exponents assumes that gold returns, above the threshold parameter, are independent. My main concern is that the consistency of the MLE will be impacted by the dependency in the data. Perhaps the autocorrelations will go away if I restrict the sample to only observations above a given threshold?

Dependency in the positive tail of gold returns does appear to go away as I raise the threshold (it almost disappears completely above the optimal threshold). ACF plot for the negative tail is similar, although given that the optimal threshold is lower for the negative tail, the dependency is a bit more significant. Perhaps this argues for using an alternative to criterion to the KS distance in selecting the optimal threshold parameter. Perhaps Anderson-Darling? Anderson-Darling tends to be conservative in that it chooses a higher threshold parameter which might help deal with data dependency issues. But higher threshold means less data in the tail, which will make it more difficult to both reject a spurious power law and rule out alternative hypotheses. Life sure is full of trade-offs!

Given that the behavior of ACF is likely to vary considerably across assets, I would like to be able to make more general statements about the consistency of the MLE estimator in the presence of dependent data. Any thoughts?

Parameter Uncertainty...
Following the Clauset et al. (2009) methodology, I derive standard errors and confidence intervals for my parameter estimates using a non-parametric bootstrap. My first implementation of the non-parametric bootstrap simply generates synthetic data by re-sampling, with replacement, the positive and negative tails of gold returns, and then estimates the sampling distributions for the power-law model's parameters by fitting a power-law model to each synthetic data set.

To give you an idea of what the re-sampled data look like, here is a survival plot of the negative returns for gold along with all 1000 bootstrap re-samples used to derive the confidence intervals...

...and here is a plot of the bootstrap re-samples over-laid with the best-fit power law models:

The bootstrap replicates can be used to derive standard-errors for the scaling exponent and threshold parameters for the positive and negative tails, as well as various confidence intervals (percentile, basic, normal, student, etc) for the parameter estimates. Which of these confidence intervals is most appropriate depends, in part, on the shape of the sampling distribution. The simplest way to get a look at the shape of the sampling distributions is to plot kernel density estimates of the bootstrap replicates for the various parameters. Dotted black lines indicate the MLE parameter estimate.

WTF! Note that all of the sampling distributions have multiple local peaks! I suspect that the multi-peaked densities are due to sensitivity of the optimal threshold to the resampling procedure. Is this an example of non-parametric bootstrapping gone awry? The fact the the bootstrap estimates of the sampling distributions have multiple peaks suggests that standard confidence intervals (i.e., percentile, basic, normal, student, etc) are unlikely to provide accurate inferences.

The best alternative method (that I have come across so far) for calculating confidence intervals when the sampling distribution has multiple peaks is the Highest Density Regions (HDR) from Hyndman (1996). Again, the dotted black lines indicate the MLE parameter estimates.

The (significant!) differences between the estimated densities in the simple kernel density plots and the HDR plots is due to the use of different estimation methods. HDR uses a more sophisticated (more accurate?) density estimation algorithm based on a quantile algorithm from Hyndman (1996). Bandwidth for density estimation is selected using the algorithm from Samworth and Wand (2010). The green, red, and blue box-plots denote the 50, 95, and 99% confidence region(s) respectively. Note that sometimes the confidence region consists of disjoint intervals. But I still don't like that the densities are so messy. It makes me think that there might be something wrong in my resampling scheme. Can I come up with a better way to do the non-parametric bootstrap sampling for asset returns? I think so...

My second implementation of the non-parametric bootstrap attempts to deal with the possible dependency in the gold returns data by using a maximum entropy bootstrap to simulate the logarithm of the gold price series directly. I generate 1000 synthetic price series, and for each series I calculate the normalized returns, split the normalized returns into positive and negative tails, and then generate the sampling distributions for the power law parameters by fitting a power-law model to each tail separately.

Here are some plots of synthetic gold price data generated using the meboot() package:

I did not expect that the distribution of synthetic data would collapse as the number of bootstrap replicates increased. Was I wrong to have this expectation? Probably! Is this a law-of-large numbers/CLT type result? Seems like variance at time t of the bootstrap replicates of the gold price is collapsing towards zero while the mean at time t is converging to the value of the observed gold price. Clearly I need to better understand how the maximum entropy bootstrap works! Update: this behavior is the result of my using the default setting of force.clt=TRUE within the meboot() function. One can eliminate this behavior by setting force.clt=FALSE. Unfortunately, I have not been able to find a whole lot of information about the costs/benefits of setting force.clt=TRUE or force.clt=FALSE! The documentation for the meboot package is a bit sparse...

However, a plot of the negative tails of the normalized returns calculated from the simulated price series does look like I expected (which is good because this is the synthetic data to which I actually fit the power-law model!)...

The HDR confidence intervals based on the maximum entropy non-parametric bootstrap are much nicer! Mostly, I think, because the densities of the bootstrap replicates are much more well-behaved (particularly for the scaling exponents!).

Clearly I have more work to do here. I really don't feel like I have a deep understanding of either the maximum entropy bootstrap or the HDR confidence intervals. But I feel like both are an improvement over my initial implementation...

Goodness-of-fit for the Power-Law?
Using the KS goodness-of-fit test advocated in Clauset et al. (2009), I find that while the power-law model is plausible for the positive tail (p-value: 0.81 > 0.10), the power-law model can be rejected for the negative tail of normalized gold returns (p-value: 0.00 < 0.10).

The KS goodness-of-fit test generates synthetic data similar to the empirical data below the estimated power-law threshold, but that follow a true power-law (with MLE for scaling exponent) above the estimated threshold. This implementation destroys any underlying time dependencies in the data. Are these underlying time dependencies important for understanding the heavy-tailed behavior of asset returns? Previous research suggests yes. Suppose that time dependencies, in particular the slow-decay of volatility correlations, are key to understanding the heavy-tailed behavior of asset returns. Does the current implementation of the goodness-of-fit test under-estimate the goodness-of-fit for the power-law model?

Testing Alternative Hypotheses:
I end this ludicrously long blog post with a quick discussion of the results of the likelihood ratio tests used to determine if some alternative distribution fits the data better than a power-law.

Results are typical. Even though the power-law model is plausible for the positive tail of gold returns, the power-law can be rejected at the 10% level in favor of the power-law with an exponential cut-off. Also, other heavy-tailed alternatives, namely the log-normal and the stretched-exponential, can not be ruled out (even the thin-tailed exponential distribution can not be ruled out!). For the negative tail, the power-law is simply not plausible, and the power-law with cut-off is preferred based on log-likelihood criterion. All of these alternative distributions were fit to the data using maximum likelihood estimators derived under the assumption of independence of observations in the tail.

Code and data to reproduce the above results can be found here.

Many thanks to Aaron Clauset for his recent post on power-laws and terrorism, which was a major inspiration/motivation for my writing this post. Also, many thanks to Cosma Shalizi for continuing to provide excellent lecture notes on all things R.

Wednesday, September 7, 2011

Logistic Regression and Stock Returns...

Attention conservation notice: Stock returns are non-stationary! Logistic regression, smoothing splines, generalized additive models are all interesting techniques and fun to play with...but stock returns are still non-stationary!

Suppose instead of trying to predict tomorrow's stock return based on today's return, I just try to predict whether or not tomorrow's return will be positive. One way to do this would be using logistic regression. The response variable will be an indicator variable that takes a value of 1 if tomorrow's return is positive, and 0 otherwise. The name of the game is to model the probability of tomorrow's return being positive, conditional on the value of today's return.

Here is the R summary output for a logistic model of the S&P 500 data from Yahoo!Finance:

Call:
glm(formula = (tomorrow > 0) ~ today, family = binomial, data = SP500.data.v2)

Deviance Residuals:
   Min      1Q Median      3Q     Max
-1.647 -1.223   1.080   1.129   2.007

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.1106     0.0161   6.866 6.60e-12 ***
today         8.6524     1.6788   5.154 2.55e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 21458 on 15513 degrees of freedom Residual deviance: 21431 on 15512 degrees of freedom AIC: 21435

Number of Fisher Scoring iterations: 3

Both the intercept and the coefficient on today's return are highly significant. How do we interpret the coefficients? If today's S&P 500 return was 0.0, then the intercept represents the logistic model's prediction for the likelihood that tomorrow's S&P 500 return is positive. However to get the predicted probability I need to transform the coefficient from the logistic scale back to the probability scale:

exp( 0.1106) / (1 + exp( 0.1106)) = 0.527611 or roughly 53%

The logistic model predicts there is a 53% chance of tomorrow's return being positive given that today's return was zero (slightly better than a coin flip!). Suppose that the S&P 500 was down 10% today (i.e., today's return is -0.10)? The predicted probability of tomorrow's return being positive is:

exp( 0.1106 + 8.6524(-0.10)) / (exp( 0.1106 + 8.6524(-0.10))) =

0.3198024 or roughly 32%

I plot the predicted probabilities (and their confidence bands) against today's return using the logistic model for the daily returns of the S&P 500 from January 1950 through this past Friday. The predicted probabilities increase monotonically as today's return goes from negative to positive (i.e., the lowest probability of a positive return tomorrow follows large negative returns today, and the highest probability of a positive return tomorrow follows a large positive return today. The results of the logistic model are, I think, inconsistent with mean-reverting returns. I would think mean reverting returns would require a higher likelihood of a positive return tomorrow if today's return where large and negative. Note that the confidence bands are mildly asymmetric, and that they narrow considerably where the bulk of the observed returns lie.

Similar plots for the FTSE-100 from 1984-2011, and the Straits Times Index (STI) from 1987-2011. The probability of tomorrow's FTSE-100 being positive given today's return is basically a coin-flip much of the time!

The predicted probabilities from the logistic model using data from the Straits Times Index (STI) in Singapore are more similar to the predicted probabilities for the S&P 500.

I would like to have some sense of whether the above logistic models are well-specified. How might I go about validating the above logistic regressions? One way would be to estimate and compare the fits of some more flexible models (i.e., smoothing splines, generalized additive models, etc). If the logistic regression is well specified, then I would expect that the more flexible models should not give significantly different predictions than the above logistic model. Here is the plot of the predicted probabilities and the 95% confidence bands of the logistic model for the S&P 500, a smoothing spline (blue), and a generalized additive model (red).

WTF! The smoothing spline and the GAM have very similar predicted probabilities, particularly when today's return falls within [-0.05, 0.05] (i.e., over the bulk of the observed returns). However, visually, neither the smoothing spline nor the GAM appear to match the logistic regression well at all!

The plot is actually a bit misleading, because your eye is immediately drawn to the huge differences in predicted probabilities for the extreme tails of today's return. However, there are only a handful of observations in the extreme positive and negative tails of today's return (check the data rug) and thus the predicted probabilities for the smoothing spline and GAM models are unlikely to be very precise in these regions. Better to focus on the differences in the predicted probabilities between the spline/GAM models and the logistic model in the region around today's return equals zero (where the bulk of the data is located). In this neighborhood, the curve of predicted probabilities for all three models is increasing (consistent, I think, with some type of trend-following dynamic). Note, however, that for the smoothing spline and GAM models, the slope of the predicted probability curve is much steeper (suggesting a more aggressive trend-following dynamic?) compared to the logistic model.

Here are the plots for the FTSE-100 and the Straits Times Index (STI)...

The whole point of fitting the smoothing spline and the GAM was to determine whether or not the logistic model is well specified. As mentioned above, if the logistic model is well specified, then spline/GAM models should not be significantly better fits to the data. We can measure goodness of fits by comparing the deviance of the logistic model with that of the GAM (I am going to ignore the smoothing spline because it doesn't technically respect the probability scale). Thus the observed difference in deviance between the logistic model and the GAM for the S&P 500 is the deviance for the null model (i.e., Logistic model) less the deviance of the alternative (GAM model).

21430.65 - 21348.16 = 82.49624

The observed difference in deviance for the FTSE-100 and the STI are 16.53285 and 12.30237, respectively. A smaller deviance indicates a better fit, thus, as expected, the more flexible GAM is a better fit for the S&P 500, the FTSE-100, and the STI. Are these observed differences in deviance significant? Let's go to the bootstrap! Basic idea is to use the bootstrap to generate the sampling distribution for the difference in deviance between the null and alternative models, and then see how often these bootstrap replicates of the difference in deviance exceed the observed difference in deviance. If the fraction of times that the replicated difference in deviance exceeds the observed difference is high, then it is likely that the observed improvement in fit using the GAM is the result of statistical fluctuations and is therefore not significant.

Results? Running the above bootstrap using 1000 replications yields p-values of 0.00, 0.02, and 0.06 for the S&P 500, FTSE-100, and STI, respectively. The p-values indicate that the improvement in fit using the GAM model on data from the S&P 500 and FTSE-100 is significant at the 5% level, and that the improvement in fit using the GAM model is significant for the STI data at the 10% level.

What does all of this mean? Well, as far as this little modeling exercise is concerning, the results of the bootstrap specification test suggest that we should use a GAM instead of a logistic model. Here are the plots of the GAM predicted probabilities with confidence bands for the S&P 500, the FTSE-100 and the STI. Note how wide the 95% confidence bands for the predicted probability of tomorrow's return being positive are when today's return is either really negative or really positive. This is exactly as it should be! There just aren't enough observed extreme returns (positive or negative) to support precise predictions.

Could this be used to construct a useful investment stratagem? I think doubtful. Compare the GAM model's predictions for the S&P 500 above, which make use of historical data from 1950-2011, to the GAM model's predictions using data from 1993-2011 for SPY, the widely traded ETF which tracks the S&P 500 (I assume that implementing a stratagem would involve trading some ETF like SPY). I have also included the logistic regression and its 95% confidence bands because the bootstrap specification test fails to reject the logistic model in favor of the GAM (p-value of 0.50).

The two are substantially different. The interesting little window of trend-following behavior is now gone. Perhaps it was a historical artifact from pre-computer trading days of yore? The negative slope of the predicted probabilities is consistent with mean-reversion in returns.

The underlying problem with trying to predict the behavior of stock returns, is that they are non-stationary. It's not just that the parameters of the "data generating process" for stock returns are changing over time, the entire data generating process itself is evolving over time. When the underlying process itself is changing, getting more historical data is unlikely to help (and in fact is likely to make predictions substantially worse!)...

Update: Code is now available!

Thursday, September 1, 2011

So you want to predict a stock market...but, which one?

Attention conservation notice: Stock returns are non-stationary time-series and as such all of the analysis below is wasted in a feeble attempt to squeeze water from the proverbial rock.

Which of the following equity markets would you expect to be more "predictable" using only information on past returns: the S&P 500, the FTSE-100, or the Straits Times index? Personally, I had no really strong prior regarding which of these markets would be the most "predictable." I suppose that a EMH purist (extremist?) would claim that they should all be equally unpredictable.

Some simple plots of S&P 500 logarithmic returns from 3 January 1950 - 24 August 2011 will help set the mood (all data is from Yahoo!Finance)...

Time series plot of the logarithmic returns exhibits clustered volatility (i.e., heteroskedasticity and serial correlation). Kernel density plot is leptokurtic (i.e., excess kurtosis, which leads to heavier tails, than the best-fit Gaussian). More on the scatter plot below...I will not bore you with the other plots (suffice it to say that they all exhibit the same properties).

A good place to start is a simple test of the random walk hypothesis. If share prices follow a random walk with a drift, then tomorrow's share price should be the sum of today's share price and the drift term (and some statistical white noise). Thus if we regress tomorrow's return on today's return, the coefficient on today's return should be statistically insignificant.

R_t+1 = μ + β R_t + ε _t+1

H₀: β = 0

Here is a scatter plot of FTSE-100 returns from 2 April 1984 - 24 August 2011. The above regression specification is plotted in red, and the random walk null hypothesis is plotted in grey. The two are virtually identical for the FTSE-100.

Although OLS parameter estimates are consistent in the presence of arbitrary heteroskedasticity and serial correlation, OLS estimates of the standard errors are not. To make sure my inferences are valid, I use HAC standard errors. For the S&P 500 and the FTSE-100, using HAC standard errors, the coefficient on today's return is statistically insignificant. However, even with HAC standard errors, the coefficient on today's return remains positive and significant for the Straits Times Index. What this says to me is that predicting tomorrow's return using a linear model is pretty damned useless (unless you invest in Singaporean equities!). But why should we limit ourselves to a linear model? We shouldn't.

Above, I made the implicit assumption that the relationship between today's return and tomorrow's return is linear. Is this assumption supported by the data? One way we can test this assumption is to calculate the observed difference in the mean-squared error of the linear specification and that of a general non-linear specification. One would expect that a general non-linear model would have a lower mean-squared error because it is more flexible, so we need to check whether this difference is statistically significant. One way to do this is to generate a sampling distribution of this difference using the bootstrap, and then see how often we would observe a difference between the linear and non-linear specification as large as what we actually did observe. If such a large difference is very rare, then this is evidence that the linear model is mis-specified.

Now, some care needs to be taken in generating the bootstrap replicates given that the time-series exhibits such complicated dependencies. For my implementation, I generate the simulated response variable by resampling the the residuals from the linear null model.

Linear or Non-Linear? That is the question...

For computational reasons, I use a smoothing spline to estimate my non-linear specification, using generalized cross-validation to choose the curvature penalty (you could easily use some other estimation method such as kernel regression in place of the smoothing spline). Even though the smoothing spline has the lower MSE in all three cases, it is only significantly lower for the S&P 500. The p-value for the bootstrap specification test (sketched above) is roughly 0.03, indicating that the linear specification can be rejected (p-values are 0.18 and 0.13 for the FTSE and the STI, respectively).

I plot the smoothing spline and 95% confidence bands (these are basic confidence intervals generated via a non-parametric bootstrap that re-samples the residuals of the smoothing spline) for both the S&P 500 and the Strait Times Index. Since we are interested in prediction I plot the splines over a uniform grid of m=500 points (instead of simply evaluating and plotting the splines at the observed values of today's return). Given that the smoothing spline is not significantly superior to the linear model for STI returns, I include the both the smoothing spline (blue) and the OLS regression (red) specifications in the scatter plot of STI returns. Do these confidence bands seem too "tight" to anyone else? Sample sizes are pretty large (over 15,000 obs. for the S&P 500)...

Could we use the above models to craft a profitable investment strategy? Note that the OLS regressions for both the S&P 500 and the Straits Times Index have positive slope. This is because, the bulk of the data (say for returns in [-0.02, 0.02]), exhibits mild "trend following" behavior. Using the estimated smoothing splines, if today's S&P 500 return was -0.01 (or down 1%), 95% of the time tomorrow's return should be between -0.00156 (or down 0.156%) and -0.000657 (or down 0.0657%). Similarly, if today's S&P 500 return was 0.01 (or up 1%), tomorrow's return should be between 0.000978 (or up 0.0978%) and 0.00188 (or up 0.188%). Taking advantage of such minute expected returns would require, I think, either a large bankroll, or lots of leverage (or both!).

Also, what is the best way to trade the S&P 500 "market portfolio" in a cost efficient manner on a daily basis? Exchange Traded Funds like the SPY might be an option. Here is a plot of the smoothing spline fit to the data for SPY over the entire history of the fund (trading began on 29 January 1993). I include the OLS regression fit because the slope is negative and significant (even using HAC standard errors) and because the spline fit is not a significant improvement over the linear model (p-value of 0.53 using my bootstrap specification test).

This plot of the SPY data from 1993 onwards looks quite a bit different, and the behavior of returns seems quite a bit different than the corresponding plot of S&P 500 data from 1950 onwards (despite the fact that SPY tracks S&P 500 almost identically from 1993 forward). If the parameters of the underlying data generating process (DGP) driving stock returns remained constant over time, then one would expect to find that subsets of the data behaved similarly to the entire sample. Maybe there are simply structural breaks in the data? Or, remember what I said about stock returns being non-stationary!

What about the FTSE-100? Is all hope lost in trying to predict FTSE returns using past data? I may try predicting whether or not tomorrow's return will be positive given today's observed return using logistic regression in a future post (I am currently having difficulties getting R to give me the correct predicted values for the logistic models! It worked fine yesterday!)

Update: Code is now available!

Wednesday, August 10, 2011

Scary Thought for the Day...

So I have been watching CNBC the past few days (and have been surprised to find much of the commentary reasonably informative!). Over the past week, investors have seen the U.S. Government's credit rating cut from AAA to AA+ by S&P. Following the downgrade these investors have seen wild swings in global equities markets, with the result that large numbers of investors have been piling in to U.S. Treasuries pushing interest rates down even further! So far the "market" seems to think the credit of the U.S. Government is still pretty solid. Listening to the coverage of media speculation that France was going to have its sovereign rating slashed from AAA to some other combination of letters and +/- signs, the following scary thought occurred to me...

Suppose that investors start start thinking the following:

The U.S. Government bond is still the safest, most risk-free asset around.
S&P is correct in their assessment that the U.S. Government no longer deserves a AAA credit rating.

If you believe 1) and 2), then do you conclude that no other government bonds, say for France, Germany, Netherlands, Luxembourg, Austria, Finland etc can be AAA either? If investors start to believe this in mass, then interest rates for one or more of the above countries may rise, which will increase debt burdens, which could cause one or more of the above countries to receive a downgrade, and the cycle would continue...

This would be a very bad self-fulfilling prophecy for the Euro zone.

Wednesday, August 3, 2011

All that one could ever want to know about Keynes...

I have just acquired first editions of the three volume biography of J.M. Keynes by Robert Skidelsky. I plan to start Volume I: Hopes Betrayed, which covers Keynes early years through the end of the Great War, today...

Thursday, July 28, 2011

Some Confused Hypothesis Testing...

Returning to a previous post on fitting a power-law to Metropolitan Statistical Area (MSA) population data. First, I reported results the following inferences based on Vuong LR tests (see previous post for Vuong stats and p-values):

Two Sided Tests (Null hypothesis is that both distributions are equally far from the "truth"):

Fail to reject null hypothesis for power-law and log-normal.
Reject the null hypothesis for the Weibull.
Reject the null hypothesis for the exponential.

One-sided Tests (Null hypothesis is a power-law):

Fail to reject the null hypothesis of power-law compared with Weibull.
Fail to reject the null hypothesis of a power-law compared with an exponential.

Vuong LR test for nested-models (used only for comparing a power-law and a power-law w/ exponential cut-off):

Reject null hypothesis of a power-law in favor of a power-law with exponential cut-off.

Additionally, if you compare the following log-likelihoods for the various models, you would also select the power-law with exponential cut-off as being the preferred model:

Power-law with exponential cut-off: -2143.809
Log-normal: -2144.513
Power-law: -2146.368
Weibull: -2173.089
Exponential: -2182.766

Later (after starring at the above plot for quite awhile and convincing myself that the log-normal "looks" better than the power-law with cut-off) I came back to my work, and decided to try and write some code to conduct a non-nested Vuong LR test to compare a power-law with exponential cut-off and a log-normal distribution. I then reported that I was able to rejected the null hypothesis for the two-sided test (that both the power-law with exponential cut-off and log-normal distributions are equally far from the "truth"); and reject the null-hypothesis of a power-law w/ exponential cut-off for the one-sided test in favor of the log-normal.

I may have been a bit hasty in that conclusion. Upon further review, I think that there is either a bug in my code, or that I have made some conceptual error in implementing the test. For one, I certainly should have noticed that the p-values for both tests were suspiciously small (they were both 0.00!) given the difference in their respective LR of less than 1!

As always comments are very much encouraged...

Thursday, July 21, 2011

Zipf's Law for cities holds...w/ a cutoff! Check that...Nope! Log-Normal!

Update! I have posted the code for this post online. My code repository is very much a work in progress...so I hope you don't mind the clutter!

Continuing my quest to master the fitting of power laws to empirical data, today I am going to look at some results from Gabaix (1999) on the distribution of "city" sizes in the U.S. using a data set on the population of Metropolitan Statistical Areas.

When fitting any distribution to a data set it is important to consider whether or not the variable in question is discrete, or continuous. This is especially true when fitting a Power Law as Clauset et al (2009) demonstrate that assuming a discrete data set can be "well approximated" by a continuous distribution power-law is questionable (at best), and that parameter estimates obtained from fitting a continuous power-law to a discrete data set can be wildly inaccurate. However, a continuous power law will be a good approximation to the data, if the data satisfy certain (rather general) conditions:

The minimum value of the data set is greater than 1000, and
There are more than 100 observations in the data set

The above MSA data set satisfies both criteria, so I will proceed using the continuous ML estimator (even though technically population is a discrete variable).

Population of Metropolitan Statistical Areas (MSAs):
I begin with a survival plot for the population of U.S. Metropolitan Statistical Areas (MSAs) using the data set from the Bureau of Economic Analysis (by way of Cosma Shalizi).

I present estimation results using pareto.fit() (which is available for download) to implement both the continous MLE and the Old School regression method (used by Gabaix (1999) and many, many, other papers for comparison purposes). Note that 1) n is the number of observations in the data set above the threshold, xmin; 2) for both the continuous MLE and the regression CDF methods, I used the KS procedure to find the threshold parameter, xmin.

	α	xmin	n
Continuous MLE	1.99	321,400	146
Regression CDF	2.07	344,600	143

95% Confidence intervals for the continuous MLE are calculated using the same non-parametric procedure described in my previous post (recall the importance of using the non-parametric bootstrap to replicate the fitting of both parameters simultaneously in order to get consistent estimates of standard errors and confidence intervals). Results...

α: (1.74, 2.22)
xmin: (137,600, 552,030)

Note, that in keeping with Zipf's law (a surprisingly poor explanation of which can be found on Wikipedia), the point estimate for α is almost exactly 2.

Goodness-of-fit tests (again using the Clauset et al (2009) KS procedure) yield a p-value of 0.20 > 0.10 which means that I fail to reject the null hypothesis of a power-law for the population of MSA!

However, before I dance a little jig and claim evidence of Zipf's law I need to see if some other distribution fits the data as well or better than the power law.

	Power Law	Log-Normal	Exponential	Weibull	Power Law w/ Exponential Cut-Off
Vuong Statistic	NA	-1.27	2.47	2.93	LR: -2.56
p-value (two-sided)	0.20	0.20	0.01	0.00	p-value: 0.02

Table turned out to be a bit muddled: I report two-sided p-values for the test of alternative hypotheses of log-normal, exponential, and stretched exponential (Weibull). Conclusion: Fail to reject null hypothesis that the power-law and the log-normal are equally "bad" (in the sense of being same "distance" from the truth); Reject Exponential in favor of the Power-Law; Reject Weibull in favor of the Power-Law (Note that Vuong statistics are positive!).

The kicker: I reject the Power-Law model in favor of a Power-Law with exponential cut-off! Note that since these two distributions are nested, the power-law with exponential cut-off will ALWAYS fit better than a simple power-law. The question is whether or not the improved fit is significantly better...in this case it looks like it does!

Not sure where this blog post fits into the literature in economics on Zipf's law. For example, I don't have any intuition about the type of theoretical mechanism that might generate a power-law with exponential cut-off. Also, suppose, as above, I find that the data reject a pure power-law in favor of a power-law with exponential cut-off. How do I know that the rejection isn't due to the exponential cut-off capturing "finite size" effects that must show up in the data at some point?

To conclude this very long post, I would like to point out an interesting forthcoming paper in the American Economic Review (co-authored by Xavier Gabaix). The authors use a clustering algorithm to construct economic "cities" (which they argue are different from "political" cities) from very high resolution population data for the U.S. and UK. They fit a power law to this data and find strong evidence in favor of Zipf's Law...I will let you take a guess as to which method they use to fit the power law (HINT: The answer makes baby Gauss cry!).

UPDATE: So I felt like I should also test the null hypothesis of a power-law with an exponential cut-off against a log-normal alternative. Apparently, Gibrat's Law applied to cities predicts a log-normal distribution. I wrote some very quick and dirty code to try and conduct the test based on the same Vuong LR procedure for non-nested models that I used above. Hopefully I didn't make any mistakes! The results: Vuong Statistic: -3.21; p-value (two-sided): 0.00; p-value (one-sided): 0.00!

It's pretty late, and been a long day, but I think that this implies that I reject the power-law with exponential cut-off in favor of the log-normal distribution. Which is good news for proponents of Gibrat's Law! Weird that testing procedure implies that log-normal is preferred to power-law with exponential cut-off; power-law with exponential cut-off is preferred to power-law; and yet some how test is indifferent between log-normal and power-law!

Finally, I am not confident that I plotted the fitted log-normal properly. I had to multiply the density function by (3 + n/N) where n is the number of observations above xmin, and N is the total number of observations to get the plot to look correct. I have not idea where the 3 comes from! The (n/N) bit I understand...and now I sleep~~...~~

Update to the Update: Plotting issue fixed! Was stupidly using the wrong function to evaluate probability density function for plotting purposes! Plot still looks the same as above. Thankfully, using the correct version of the density function removes the need to include the mysterious 3! Only need include the usual factor of (n/N).

Wednesday, July 20, 2011

Power Law in Wealth Distribution? NO...

Recently I have been teaching myself to estimate power laws from empirical data using the improved techniques developed in Clauset et al (2009) with the hopes of applying them in my own research. What follows is an example where I apply these techniques to some data on the wealth of the 400 Richest Americans from 2003. This example is partially covered in one of the co-author's of Clauset et al (2009) (i.e., Cosma Shalizi) lecture notes.

Below, I expand on the example given in those lecture notes by

Estimating the power-law threshold parameter, xmin, from the data
Constructing standard errors and confidence intervals for the scaling parameter, α, and for the threshold parameter, xmin (that take into account the fact that I have estimated both α and xmin).
Testing an alternative hypothesis for the distribution of wealth (i.e, log-normal)

Always plot your data! I think that it is a good idea to start any data analysis with a plot of some description. I will start this one with a plot of the probability of an individual having wealth greater than wealth (W), against wealth (W)...or more succinctly: a Survival Plot.

If the data follow a power law, then this relationship should be roughly linear. The converse, however, is not true! Linearity is not a sufficient condition for a power law. Also, lots of things might look linear on a log-log plot even when they aren't.

Recipe for Fitting Power Laws:

Parameter Estimation: Following Clauset et al (2009), I estimated the scaling parameter, α, using the method of maximum likelihood whilst using the suggested KS minimization procedure for estimating the threshold parameter, xmin. Results:

α: 2.34
xmin: 9e8 ($900 million)
obs.above.threshold: 302 (out of 400)

The code for estimating power laws can be found on Aaron Clauset's website. There are two different implementations of the estimation procedure in R: pareto.fit() and plfit(). I have experimented with both, and although they give identical results for the above estimation, this is not true in general. From what I can tell, the (slight) differences in estimation results stem from differing implementations of the KS test. The pareto.fit() method uses R's ks.test() method, while plfit() hand-codes the KS test. I have not (yet) conducted a Monte-Carlo study to see if the differences between the implementations are worth worrying about.

Standard Errors and Confidence Intervals: To calculate the standard errors and confidence intervals for the parameters I used a non-parametric procedure outlined in Cosma's lecture notes. Cosma also graciously provides the code for his lecture notes as well! To calculate the standard errors and 95% confidence intervals for α and xmin, I had to alter (trivially) the provided code to take into account the fact that I was not given the threshold parameter and instead estimated it from the data. Results:

α.se: 0.10
xmin.se: 364e6 ($364 million)
α.ci: (2.13, 2.53)
xmin.ci: (-2e8, 1.05e9)

Obvious one cannot really have negative wealth (at least not at the level of $-200 million), so just note that the 95% confidence interval for xmin includes zero.

Survival Plot (fitted power-law included):

Looks pretty good! But while you might be able to rule out a particular model just by looking at it, I think you would be hard pressed to rule a model "in" by "eye-balling it" (unfortunately, I am sure that people do).

Goodness-of-fit Test: Continuing to follow Clauset et al (2009), I again used a Monte Carlo KS test procedure to conduct a goodness-of-fit test for the power law model. The procedure basically generates synthetic data by re-sampling the data itself below xmin, and then generating simulated data from a power law with the estimated parameters above xmin. A power-law is then re-estimated to each synthetic data set and the KS statistic is stored. The result is a p-value that represents the fraction of simulations whose KS statistic was at least as large as the one obtained from fitting the power law model to the wealth data. A small p-value (i.e., a p-value < 0.10) indicates that only a very small fraction of the simulations had a KS stat as large as the wealth data and is evidence against the plausibility of the power-law model. Results: p-value of 0.0012! It would seem that the data reject the power-law as a plausible model. This is perhaps not surprising...

Testing Alternative Hypotheses: So the wealth distribution does not follow a power law! What about log-normal? I test this using the likelihood ratio procedure recommended in Clauset et al (2009).

The null hypothesis for comparing two non-nested distributions is that both distributions are equally far (in a Kullback-Leibler relative entropy sense...note to self: understand what the hell Kullback-Leibler relative entropy means!) from one another. I report the Vuong test statistic and the two-sided p-value:

Vuong Statistic: 0.21
p-value: 0.83

Conclusion: fail to reject the null hypothesis that the power-law and the log-normal distributions are equally bad (again in the Kullback-Leibler relative entropy sense).

Other alternatives (for which Clauset et al (2009) publish results) are the exponential, stretched exponential (Weibull), and the power-law with exponential cut-off. I continue to have trouble getting the code for fitting the exponential distribution to work (keep getting an error in the nlm() method for minimizing the negative log-likelihood function...it keeps returning non-finite values...which makes me think that it is probably user error on my part). I have also not yet figured out how to get the code for estimating the power-law with exponential cutoff to work. This is at the top of my to-do list as I feel it is a very plausible alternative distribution (perhaps not necessarily for wealth...but definitely for some other data sets that I am working with at the moment).

I can however report results for the stretched exponential (Weibull). I was able to get the code to work after I learned how to suppress warnings in R (i.e., using the suppressWarnings() method). Again I report the Vuong test statistic and the two-sided p-value:

Vuong Statistic: 8.05
p-value: 0.00

The fact that the sign on the Vuong Statistic is positive indicates that the log-likelihood for the power-law model was greater than the log-likelihood for the stretched exponential (Weibull). The test statistic of 8.05 is so large that the p-value is actually 8.88e-16! Thus we can reject the stretched exponential in favor of the power-law for this data set (but given that we have already rejected the power law...this test is somewhat superfluous). I hope I have interpreted the above test correctly. If anyone feels otherwise definitely speak up!

The real objective of this post was to replicate results from fitting a power-law to empirical data (in a rigorous way that does not abuse linear regression), so that I feel confident in applying these tools in my own research. Thus I purposefully did not stray too far from Clauset et al (2009). I think I succeeded (at least in replicating the work of Clauset et al (2009) for this data set). Hopefully, I have not made any glaring errors...but if I did, they are my own!

Tuesday, July 12, 2011

Heteroskedasticity simulations...

Finally getting back into my routine after a very successful summer school in Jerusalem. This blog post is about part of a graduate econometrics lab on violations of the classical linear regression model that I have been working on lately. Basic idea of the lab, which I borrowed from Prof. Mark Schaffer, is to expose students to Monte Carlo simulation and violations of the classical linear model simultaneously by having them construct simulations that demonstrate what happens to standard hypothesis tests when the DGP exhibits one of the following issues:

To keep this post reasonably short, I am only going to focus on the section of the lab that compares various techniques for dealing with heteroskedasticity via Monte Carlo simulations.

The DGPs:
I specify the two data generating processes as follows. For DGP #1 x can only take positive values:

x = abs(rnorm(n, mean=0, sd=1))
e = rnorm(n, mean=0, sd=1)
u = e*x (implies that the variance of u increases with the square of x!)
y = x + u

The first pane in the plot below is y versus x for n=1000 with the OLS regression line in red and the true regression function in grey. If the heteroskedasticity isn't obvious enough, the second pane is a plot of the squared OLS residuals against the regressor x. The true variance of the disturbance increase with the square of x (i.e., the grey curve). Note that the true variance process looks roughly linear (even though it actually is not!). The fact that a linear approximation to the true variance process looks like it would work pretty well for DGP #1 plays a role later.

For DGP #2 the regressor, x, is allowed to take both positive and negative values (i.e., x=rnorm(n, mean=0, sd=1)). Everything else is the same as in DGP #1. Here is the analogous plot for DGP #2:

Note that the even though the true variance process is non-linear in both DGP #1 and DGP #2, it can be (reasonably) well approximated by a linear function in DGP #1. However, a linear approximation to the true variance function is an EPIC FAIL for DGP #2! This plays a huge role in the breakdown of FGLS documented below.

Comparison #1 (OLS, OLS w/ HAC Std. Errors, and WLS): Assume that you know the true form of the variance function (i.e., you have ritually slaughtered the requisite small woodland creature, dried its bones, and left them for 3.5 days to be scattered by the wind at which point you were able to use the pattern created to divine the true variance function...alternatively, if you live in Greece, you could simply consult the Oracle).

After having done all of this, the first simulation sets the sample size to n=100 and regresses y on x (where x is given by DGP # 1) 5000 times while capturing the parameter estimate for the coefficient on x using both OLS and WLS (which incorporates the optimal weights given the true variance function) and the various flavours of the standard errors. Note the HAC stands for Heteroskedastic and Autocorrelation Consistent standard errors. These standard errors are used frequently in econometrics to adjust the standard errors in the presence of heteroskedasticity and autocorrelation (HAC standard errors are rumored to be consistent in the presence of arbitrary forms of heteroskedasticity and autocorrelation...although I confess to not having read the original journal articles).

Plotting the kernel densities of the OLS estimator and the WLS estimator shows that while both estimators appear to be unbiased (consistent with theory), only the WLS estimator takes advantage of the known variance function (OLS ignores this information) and thus the variance of the WLS estimator is much smaller.

Next, let's consider the size properties of the standard hypothesis test (i.e., type I error). How often would you expect to reject the null hypothesis of β=1 if your use the standard test statistic and critical values of -1.96 and +1.96 were valid? (Answer: 5% of 5,000 repetitions = 250 times). How many rejects do we actually observe in this simulation?

OLS: ~ 1200!
OLS w/ HAC standard errors: ~ 450
WLS: ~ 250

It would seem that only the WLS has the expected amount of type I error. Not surprisingly, OLS is in bad shape. To get even more information about the size properties, let's look at histograms of the p-values associated with the above tests. If the test statistics were unbiased, then we should expect to find uniformly distributed p-values...

...and with OLS w/ HAC standard errors, and WLS we do find more or less uniformly distributed p-values. OLS w/ HAC exhibits some very slight distortions...but these are nothing compared with OLS!

Finally, let's look at the power of the standard hypothesis test (i.e., type II error). Below I plot the power curves using the various flavors of standard errors. What are we looking for in this plot? Two things. First, a hypothesis test with good size properties should obtain a minimum value of 0.05 (for a two-tailed test) at the true value of β=1. Second, the power curve should be relatively narrow. The wider the power curve, the lower is the probability of correctly rejecting a false null hypothesis (i.e., the higher is the type II error).

Unsurprising conclusion of this comparison: if you are going to invest the 3.5 days necessary to obtain the true variance function, then you should use it when estimating your regression.

Comparison #2 (OLS, OLS w/ HAC Std. Errors, FGLS, and WLS (npreg): Now suppose that you do not know the true form of the variance function but wish to estimate it from the data. This simulation sets the sample size to n=100 and regresses y on x (where x is given by DGP # 1) 5000 times while capturing the parameter estimate for the coefficient on x using OLS, FGLS, and WLS (npreg) and the various flavours of the standard errors. Feasible Generalised Least Squares (FGLS) estimates the true variance function from the data by linearly regressing the log of the squared OLS residuals on x and using the fitted values as the estimate of the variance function. WLS (npreg) estimates the variance function using the fitted values from a non-parametric regression of the log of the squared OLS residuals on x (see Cosma Shalizi's lecture 5 notes for details).

While all three estimators appear to be unbiased, the FGLS and WLS (npreg) estimators estimate the unknown variance function (OLS does not) and thus the variance of these estimators is much smaller.

Again, consider the size and power of the standard hypothesis. How many rejections of the null hypothesis of β=1 do we actually observe?

OLS: ~1200!
OLS w/ HAC standard errors: ~ 400
WLS (npreg): ~ 600
FGLS: ~ 300

FGLS would seem to perform the best (in a relative sense), however all four procedures have excessive type I error. They all reject a true null hypothesis too frequently. I was expecting the size properties of the WLS (npreg) estimator to be superior to those of the FGLS estimator. Could ye old trade-off between bias and variance. Given that the true variance process is roughly linear for DGP #1 the bias introduced by using FGLS should be pretty small. I also just used the default bandwidth for npreg() in my R code. Perhaps I should have spent more time trying to optimize the bandwidth (if default bandwidth too high, then npreg() is overfitting, which should lead to higher variance in the WLS (npreg) estimator). Also, perhaps if I had implemented in iterative procedure to estimate the variance function using the non-parametric regression technique, WLS (npreg) would outperform FGLS. Any thoughts/ideas from readers would be welcome.

Histograms of the p-values:

In this case only with FGLS we do find more or less uniformly distributed p-values. OLS w/ HAC se exhibits some very slight distortions...but these are nothing compared with OLS! As touched on above, WLS (npreg) also exhibits surprisingly large size distortions.

Plot the power curve:

Again, hypothesis tests using OLS w/ HAC and FGLS have pretty good size properties, but the hypothesis tests using FGLS have superior power properties. If you have reason to suspect that the true variance function is roughly linear, then it looks like FGLS is the way to go (although the difference between the two procedures might become negligible as sample size increases).

Comparison #3 (OLS, OLS w/ HAC Std. Errors, FGLS, and WLS (npreg) with DGP #2: Now suppose that you do not know the true form of the variance function but wish to estimate it from the data. This simulation sets the sample size to n=100 and regresses y on x (where x is given by DGP # 2) 5000 times while capturing the parameter estimate for the coefficient on x using OLS, FGLS, and WLS (npreg) and the various flavours of the standard errors. Recall that the true variance process for DGP #2 is poorly approximated by a linear specification. Thus my prior was that FGLS should be an EPIC FAIL, but the WLS (npreg) would save the day! I turned out to be half right...

While all three estimators appear to be unbiased, the WLS (npreg) estimator is considerably more precise than either the OLS or FGLS estimators. Why is FGLS doing so poorly? The true variance function for DGP #2 is sufficiently non-linear that the bias introduced by using FGLS to estimate the variance function produces and EPIC FAIL! Note that the kernel density of FGLS tracks more or less with the density of the OLS estimator!

Consider the size and power of the standard hypothesis. How many rejections of the null hypothesis of β=1 do we actually observe?

OLS: ~1200!
OLS w/ HAC standard errors: ~ 350
WLS (npreg): ~ 900 ?!
FGLS: ~ 1400!!

FGLS is an epic fail. OLS with HAC standard error doesn't do too poorly. What is going on with WLS (npreg)! I was hoping that it would shine through in this most realistic of cases! I must be over-fitting and the sample size is pretty small (only n=100). Extending the sample size might help. I am really perplexed by this...

Histograms of the p-values:

Only OLS with HAC standard errors has roughly uniform p-values. The rest exhibit significant distortions.

Finally, the power curve:

OLS with HAC would seem to win again. Its power properties are not great, but should improve with sample size.

As always R-code is available upon request. If anyone has thoughts on what is going on with my implementation of WLS (using non-parametric regression to get the optimal weights), definitely feel free to comment!

Blog Topics...