
Thursday, September 1, 2011

So you want to predict a stock market...but, which one?

Attention conservation notice: Stock returns are non-stationary time-series and as such all of the analysis below is wasted in a feeble attempt to squeeze water from the proverbial rock.
Which of the following equity markets would you expect to be more "predictable" using only information on past returns: the S&P 500, the FTSE-100, or the Straits Times Index?  Personally, I had no really strong prior regarding which of these markets would be the most "predictable."  I suppose that an EMH purist (extremist?) would claim that they should all be equally unpredictable.

Some simple plots of S&P 500 logarithmic returns from 3 January 1950 - 24 August 2011 will help set the mood (all data is from Yahoo!Finance)...
The time series plot of the logarithmic returns exhibits clustered volatility (i.e., heteroskedasticity and serial correlation).  The kernel density plot is leptokurtic (i.e., it exhibits excess kurtosis, which produces heavier tails than the best-fit Gaussian).  More on the scatter plot below... I will not bore you with the plots for the other indices (suffice it to say that they all exhibit the same properties).
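For readers who want to play along, the returns calculation itself is trivial; here is a minimal Python sketch, with simulated prices standing in for the Yahoo! Finance data (the heavy-tailed Student-t shocks are purely illustrative):

```python
import numpy as np
from scipy import stats

# Simulated daily closing prices in place of the actual Yahoo! Finance data;
# the heavy-tailed Student-t shocks are an assumption for illustration only.
rng = np.random.default_rng(42)
prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_t(df=3, size=1000)))

# Logarithmic returns: r_t = log(P_t) - log(P_{t-1})
log_returns = np.diff(np.log(prices))

# Positive excess kurtosis indicates heavier tails than the best-fit Gaussian
excess_kurtosis = stats.kurtosis(log_returns)  # Fisher definition: Gaussian -> 0
```

On real index data you would, of course, replace the simulated prices with the downloaded closing-price series.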

A good place to start is a simple test of the random walk hypothesis.  If share prices follow a random walk with a drift, then tomorrow's share price should be the sum of today's share price and the drift term (and some statistical white noise).   Thus if we regress tomorrow's return on today's return, the coefficient on today's return should be statistically insignificant.

R_{t+1} = μ + β R_t + ε_{t+1}
H_0: β = 0
Here is a scatter plot of FTSE-100 returns from 2 April 1984 - 24 August 2011.  The above regression specification is plotted in red, and the random walk null hypothesis is plotted in grey.  The two are virtually identical for the FTSE-100.
Although OLS parameter estimates are consistent in the presence of arbitrary heteroskedasticity and serial correlation,  OLS estimates of the standard errors are not.  To make sure my inferences are valid, I use HAC standard errors.  For the S&P 500 and the FTSE-100, using HAC standard errors, the coefficient on today's return is statistically insignificant.  However, even with HAC standard errors, the coefficient on today's return remains positive and significant for the Straits Times Index.  What this says to me is that predicting tomorrow's return using a linear model is pretty damned useless (unless you invest in Singaporean equities!).  But why should we limit ourselves to a linear model?  We shouldn't.

Above, I made the implicit assumption that the relationship between today's return and tomorrow's return is linear.  Is this assumption supported by the data?  One way we can test this assumption is to calculate the observed difference in the mean-squared error of the linear specification and that of a general non-linear specification.  One would expect that a general non-linear model would have a lower mean-squared error because it is more flexible, so we need to check whether this difference is statistically significant.  One way to do this is to generate a sampling distribution of this difference using the bootstrap, and then see how often we would observe a difference between the linear and non-linear specification as large as what we actually did observe.  If such a large difference is very rare, then this is evidence that the linear model is mis-specified.

Now, some care needs to be taken in generating the bootstrap replicates, given that the time series exhibits such complicated dependencies.  For my implementation, I generate the simulated response variable by resampling the residuals from the linear null model.
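The procedure can be sketched as follows. I use a quintic polynomial as a cheap stand-in for the smoothing spline, and toy data in place of the actual returns; both are assumptions for illustration, not the post's actual implementation (which was done in R):

```python
import numpy as np

# Toy data: today's return x, tomorrow's return y (stand-ins for index data)
rng = np.random.default_rng(2)
x = rng.normal(scale=0.01, size=500)
y = 0.1 * x + rng.normal(scale=0.01, size=500)
z = (x - x.mean()) / x.std()  # standardize the regressor for numerical stability

def mse_difference(z, y):
    """MSE(linear) minus MSE(flexible); the flexible fit can only do weakly better."""
    lin = np.polyval(np.polyfit(z, y, 1), z)
    flex = np.polyval(np.polyfit(z, y, 5), z)  # polynomial stand-in for the spline
    return np.mean((y - lin) ** 2) - np.mean((y - flex) ** 2)

observed = mse_difference(z, y)

# Bootstrap under the linear null: resample the residuals of the linear fit
lin_fitted = np.polyval(np.polyfit(z, y, 1), z)
resid = y - lin_fitted
diffs = np.empty(500)
for b in range(500):
    y_star = lin_fitted + rng.choice(resid, size=resid.size, replace=True)
    diffs[b] = mse_difference(z, y_star)

# p-value: how often the null produces a difference at least as large as observed
p_value = np.mean(diffs >= observed)
```

A small p-value says the observed MSE gap is too large to attribute to the extra flexibility alone, i.e. evidence against the linear specification.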

Linear or Non-Linear? That is the question...

For computational reasons, I use a smoothing spline to estimate my non-linear specification, using generalized cross-validation to choose the curvature penalty (you could easily use some other estimation method, such as kernel regression, in place of the smoothing spline).  Although the smoothing spline has the lower MSE in all three cases, the difference is only significant for the S&P 500: the p-value for the bootstrap specification test (sketched above) is roughly 0.03, indicating that the linear specification can be rejected (the corresponding p-values are 0.18 and 0.13 for the FTSE-100 and the STI, respectively).
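For illustration, here is a rough Python analogue using SciPy's `UnivariateSpline`. Note the caveat: SciPy exposes a smoothing factor `s` rather than choosing the curvature penalty by generalized cross-validation, so the crude hold-out grid search below is only a stand-in for GCV, and the data are simulated:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Toy data standing in for (today's return, tomorrow's return) pairs
rng = np.random.default_rng(3)
x = np.sort(rng.normal(scale=0.01, size=400))   # sorted, as SciPy requires
y = 0.1 * x + rng.normal(scale=0.01, size=400)

def holdout_error(s):
    """Hold-out MSE for smoothing factor s (a crude stand-in for GCV)."""
    train, test = np.arange(0, 400, 2), np.arange(1, 400, 2)
    spline = UnivariateSpline(x[train], y[train], s=s)
    return np.mean((y[test] - spline(x[test])) ** 2)

# Grid of candidate smoothing factors (larger s => smoother fit)
s_grid = [1e-3, 1e-2, 1e-1, 1.0]
best_s = min(s_grid, key=holdout_error)

spline = UnivariateSpline(x, y, s=best_s)
spline_mse = np.mean((y - spline(x)) ** 2)
```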

I plot the smoothing spline and 95% confidence bands (these are basic confidence intervals generated via a non-parametric bootstrap that re-samples the residuals of the smoothing spline) for both the S&P 500 and the Straits Times Index.  Since we are interested in prediction, I plot the splines over a uniform grid of m=500 points (instead of simply evaluating and plotting the splines at the observed values of today's return).  Given that the smoothing spline is not significantly superior to the linear model for STI returns, I include both the smoothing spline (blue) and the OLS regression (red) specifications in the scatter plot of STI returns.  Do these confidence bands seem too "tight" to anyone else?  Sample sizes are pretty large (over 15,000 obs. for the S&P 500)...
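The band construction can be sketched like so, again with a cubic polynomial standing in for the smoothing spline and toy data in place of the real returns. The "basic" bootstrap interval reflects the percentile endpoints around the point estimate:

```python
import numpy as np

# Toy (today's return, tomorrow's return) data, stand-ins for the real series
rng = np.random.default_rng(4)
x = rng.normal(scale=0.01, size=500)
y = 0.1 * x + rng.normal(scale=0.01, size=500)

# Standardize the regressor so the polynomial fit is well conditioned
z = (x - x.mean()) / x.std()
coeffs = np.polyfit(z, y, 3)      # cubic stand-in for the smoothing spline
fitted = np.polyval(coeffs, z)
resid = y - fitted

# Evaluate the fit on a uniform grid of m = 500 points, as in the post
grid = np.linspace(z.min(), z.max(), 500)
point_est = np.polyval(coeffs, grid)

# Resample residuals, refit, and record predictions over the grid
B = 200
boot_preds = np.empty((B, grid.size))
for b in range(B):
    y_star = fitted + rng.choice(resid, size=resid.size, replace=True)
    boot_preds[b] = np.polyval(np.polyfit(z, y_star, 3), grid)

# "Basic" bootstrap 95% band: reflect the percentile endpoints around the estimate
lower = 2 * point_est - np.percentile(boot_preds, 97.5, axis=0)
upper = 2 * point_est - np.percentile(boot_preds, 2.5, axis=0)
```

Note these are bands for the conditional mean, not prediction intervals for individual returns, which is one reason they can look "tight."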
Could we use the above models to craft a profitable investment strategy?  Note that the OLS regressions for both the S&P 500 and the Straits Times Index have positive slope.  This is because the bulk of the data (say, returns in [-0.02, 0.02]) exhibits mild "trend following" behavior.  Using the estimated smoothing splines, if today's S&P 500 return were -0.01 (down 1%), then 95% of the time tomorrow's expected return should be between -0.00156 (down 0.156%) and -0.000657 (down 0.0657%).  Similarly, if today's S&P 500 return were 0.01 (up 1%), tomorrow's expected return should be between 0.000978 (up 0.0978%) and 0.00188 (up 0.188%).  Taking advantage of such minute expected returns would require, I think, either a large bankroll or lots of leverage (or both!).

Also, what is the best way to trade the S&P 500 "market portfolio" in a cost efficient manner on a daily basis?  Exchange Traded Funds like the SPY might be an option.  Here is a plot of the smoothing spline fit to the data for SPY over the entire history of the fund (trading began on 29 January 1993).  I include the OLS regression fit because the slope is negative and significant (even using HAC standard errors) and because the spline fit is not a significant improvement over the linear model (p-value of 0.53 using my bootstrap specification test).
This plot of the SPY data from 1993 onwards looks quite different from the corresponding plot of S&P 500 data from 1950 onwards: the behavior of returns appears to have changed (despite the fact that SPY tracks the S&P 500 almost identically from 1993 forward).  If the parameters of the underlying data generating process (DGP) driving stock returns remained constant over time, then one would expect subsets of the data to behave similarly to the entire sample.  Maybe there are simply structural breaks in the data?  Or, remember what I said about stock returns being non-stationary!

What about the FTSE-100?  Is all hope lost in trying to predict FTSE returns using past data?  In a future post, I may try predicting whether or not tomorrow's return will be positive, given today's observed return, using logistic regression (I am currently having difficulties getting R to give me the correct predicted values for the logistic models!  It worked fine yesterday!)

Update: Code is now available! 

1 comment:

  1. Hi David

    Presumably you saw the Stiglitz lecture at Lindau? Quite networks focussed, thought you'd be interested...

    All the best
    David
