## Wednesday, September 7, 2011

### Logistic Regression and Stock Returns...

Attention conservation notice: Stock returns are non-stationary! Logistic regression, smoothing splines, generalized additive models are all interesting techniques and fun to play with...but stock returns are still non-stationary!
Suppose instead of trying to predict tomorrow's stock return based on today's return, I just try to predict whether or not tomorrow's return will be positive.  One way to do this would be using logistic regression. The response variable will be an indicator variable that takes a value of 1 if tomorrow's return is positive, and 0 otherwise.  The name of the game is to model the probability of tomorrow's return being positive, conditional on the value of today's return.
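The data setup is simple enough to sketch in a few lines. Here is a minimal illustration in Python (the post itself works in R; the prices below are made-up numbers, not actual S&P 500 data):

```python
import math

# Hypothetical daily closing prices (in practice these come from Yahoo! Finance)
prices = [100.0, 101.5, 100.9, 102.3, 101.8, 103.0]

# Daily logarithmic returns: r_t = log(P_t / P_{t-1})
returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

# Predictor: today's return. Response: indicator that tomorrow's return is positive.
today = returns[:-1]
up_tomorrow = [1 if r > 0 else 0 for r in returns[1:]]

print(list(zip([round(r, 4) for r in today], up_tomorrow)))
```

The logistic regression then models P(up_tomorrow = 1 | today) through the log-odds.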

Here is the R summary output for a logistic model of the S&P 500 data from Yahoo!Finance:

Call:
glm(formula = (tomorrow > 0) ~ today, family = binomial, data = SP500.data.v2)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.647  -1.223   1.080   1.129   2.007

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.1106     0.0161   6.866 6.60e-12 ***
today         8.6524     1.6788   5.154 2.55e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 21458  on 15513  degrees of freedom
Residual deviance: 21431  on 15512  degrees of freedom
AIC: 21435

Number of Fisher Scoring iterations: 3

Both the intercept and the coefficient on today's return are highly significant.  How do we interpret the coefficients?  If today's S&P 500 return was 0.0, then the intercept determines the logistic model's predicted probability that tomorrow's S&P 500 return is positive.  However, to get the predicted probability I need to transform the coefficient from the log-odds scale back to the probability scale:

exp(0.1106) / (1 + exp(0.1106)) = 0.527611 or roughly 53%

The logistic model predicts there is a 53% chance of tomorrow's return being positive given that today's return was zero (slightly better than a coin flip!). What if the S&P 500 was down 10% today (i.e., today's return is -0.10)? The predicted probability of tomorrow's return being positive is:

exp(0.1106 + 8.6524(-0.10)) / (1 + exp(0.1106 + 8.6524(-0.10))) =
0.3198024 or roughly 32%
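These back-transformations are just the inverse-logit function applied to the fitted log-odds; a quick check (Python for illustration, coefficients taken from the glm() summary above):

```python
import math

def inv_logit(eta):
    """Map log-odds back to the probability scale."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Coefficients from the glm() summary above
intercept, slope = 0.1106, 8.6524

p_flat = inv_logit(intercept)                    # today's return = 0.0
p_down10 = inv_logit(intercept + slope * -0.10)  # today's return = -0.10

print(round(p_flat, 4), round(p_down10, 4))  # 0.5276 0.3198
```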

I plot the predicted probabilities (and their confidence bands) against today's return using the logistic model for the daily returns of the S&P 500 from January 1950 through this past Friday.  The predicted probabilities increase monotonically as today's return goes from negative to positive (i.e., the lowest probability of a positive return tomorrow follows large negative returns today, and the highest probability of a positive return tomorrow follows a large positive return today).  The results of the logistic model are, I think, inconsistent with mean-reverting returns.  I would think mean-reverting returns would require a higher likelihood of a positive return tomorrow if today's return were large and negative.  Note that the confidence bands are mildly asymmetric, and that they narrow considerably where the bulk of the observed returns lie.
Similar plots for the FTSE-100 from 1984-2011, and the Straits Times Index (STI) from 1987-2011. The probability of tomorrow's FTSE-100 being positive given today's return is basically a coin-flip much of the time!
The predicted probabilities from the logistic model using data from the Straits Times Index (STI) in Singapore are more similar to the predicted probabilities for the S&P 500.
I would like to have some sense of whether the above logistic models are well-specified.  How might I go about validating the above logistic regressions?  One way would be to estimate and compare the fits of some more flexible models (i.e., smoothing splines, generalized additive models, etc).  If the logistic regression is well specified, then I would expect that the more flexible models should not give significantly different predictions than the above logistic model.  Here is the plot of the predicted probabilities and the 95% confidence bands of the logistic model for the S&P 500, a smoothing spline (blue), and a generalized additive model (red).
WTF!  The smoothing spline and the GAM have very similar predicted probabilities, particularly when today's return falls within [-0.05, 0.05] (i.e., over the bulk of the observed returns).  However, visually, neither the smoothing spline nor the GAM appear to match the logistic regression well at all!

The plot is actually a bit misleading, because your eye is immediately drawn to the huge differences in predicted probabilities for the extreme tails of today's return.  However, there are only a handful of observations in the extreme positive and negative tails of today's return (check the data rug) and thus the predicted probabilities for the smoothing spline and GAM models are unlikely to be very precise in these regions.  Better to focus on the differences in the predicted probabilities between the spline/GAM models and the logistic model in the region around today's return equals zero (where the bulk of the data is located).  In this neighborhood, the curve of predicted probabilities for all three models is increasing (consistent, I think, with some type of trend-following dynamic).  Note, however, that for the smoothing spline and GAM models, the slope of the predicted probability curve is much steeper (suggesting a more aggressive trend-following dynamic?) compared to the logistic model.

Here are the plots for the FTSE-100 and the Straits Times Index (STI)...
The whole point of fitting the smoothing spline and the GAM was to determine whether or not the logistic model is well specified.  As mentioned above, if the logistic model is well specified, then the spline/GAM models should not be significantly better fits to the data.  We can measure goodness of fit by comparing the deviance of the logistic model with that of the GAM (I am going to ignore the smoothing spline because it doesn't technically respect the probability scale).  Thus the observed difference in deviance for the S&P 500 is the deviance of the null model (i.e., the logistic model) less the deviance of the alternative (the GAM):
21430.65 - 21348.16 ≈ 82.50 (82.49624 using the unrounded deviances)

The observed differences in deviance for the FTSE-100 and the STI are 16.53285 and 12.30237, respectively.  A smaller deviance indicates a better fit; thus, as expected, the more flexible GAM is a better fit for the S&P 500, the FTSE-100, and the STI.  Are these observed differences in deviance significant?  Let's go to the bootstrap!  The basic idea is to use the bootstrap to generate the sampling distribution of the difference in deviance between the null and alternative models, and then see how often the bootstrap replicates of the difference in deviance exceed the observed difference.  If the fraction of times that a replicated difference exceeds the observed difference is high, then the observed improvement in fit from the GAM is likely the result of statistical fluctuations and is therefore not significant.
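The p-value at the end of that recipe is just a counting exercise over the replicates. A minimal sketch (Python for illustration; the replicated differences below are made-up numbers, not the post's actual bootstrap output):

```python
# Hypothetical bootstrap replicates of the deviance difference, generated
# under the null (logistic) model; the real ones come from refitting both
# models to each resampled series.
boot_diffs = [1.2, 0.4, 3.1, 0.9, 5.6, 2.2, 0.1, 4.4, 1.8, 0.7]
observed_diff = 4.0

# One-sided bootstrap p-value: how often does a null replicate match or
# beat the observed improvement in fit?
p_value = sum(d >= observed_diff for d in boot_diffs) / len(boot_diffs)
print(p_value)  # 0.2 -> the improvement would not be significant here
```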

Results? Running the above bootstrap using 1000 replications yields p-values of 0.00, 0.02, and 0.06 for the S&P 500, FTSE-100, and STI, respectively.  The p-values indicate that the improvement in fit using the GAM model on data from the S&P 500 and FTSE-100 is significant at the 5% level, and that the improvement in fit using the GAM model is significant for the STI data at the 10% level.

What does all of this mean? Well, as far as this little modeling exercise is concerned, the results of the bootstrap specification test suggest that we should use a GAM instead of a logistic model.  Here are the plots of the GAM predicted probabilities with confidence bands for the S&P 500, the FTSE-100, and the STI.  Note how wide the 95% confidence bands for the predicted probability of tomorrow's return being positive are when today's return is either really negative or really positive.  This is exactly as it should be!  There just aren't enough observed extreme returns (positive or negative) to support precise predictions.
Could this be used to construct a useful investment stratagem? I doubt it.  Compare the GAM model's predictions for the S&P 500 above, which make use of historical data from 1950-2011, to the GAM model's predictions using data from 1993-2011 for SPY, the widely traded ETF that tracks the S&P 500 (I assume that implementing a stratagem would involve trading some ETF like SPY).  I have also included the logistic regression and its 95% confidence bands because the bootstrap specification test fails to reject the logistic model in favor of the GAM (p-value of 0.50).
The two are substantially different.  The interesting little window of trend-following behavior is now gone.  Perhaps it was a historical artifact from pre-computer trading days of yore?  The negative slope of the predicted probabilities is consistent with mean-reversion in returns.

The underlying problem with trying to predict the behavior of stock returns is that they are non-stationary.  It's not just that the parameters of the "data generating process" for stock returns are changing over time; the entire data generating process itself is evolving.  When the underlying process is changing, getting more historical data is unlikely to help (and in fact is likely to make predictions substantially worse!)...

Update: Code is now available!

## Thursday, September 1, 2011

### So you want to predict a stock market...but, which one?

Attention conservation notice: Stock returns are non-stationary time-series and as such all of the analysis below is wasted in a feeble attempt to squeeze water from the proverbial rock.
Which of the following equity markets would you expect to be more "predictable" using only information on past returns: the S&P 500, the FTSE-100, or the Straits Times Index?  Personally, I had no really strong prior regarding which of these markets would be the most "predictable."  I suppose that an EMH purist (extremist?) would claim that they should all be equally unpredictable.

Some simple plots of S&P 500 logarithmic returns from 3 January 1950 - 24 August 2011 will help set the mood (all data is from Yahoo!Finance)...
The time series plot of the logarithmic returns exhibits clustered volatility (i.e., heteroskedasticity and serial correlation).  The kernel density plot is leptokurtic (i.e., excess kurtosis, which produces heavier tails than the best-fit Gaussian).  More on the scatter plot below... I will not bore you with the plots for the other indices (suffice it to say that they all exhibit the same properties).
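The leptokurtosis claim is easy to verify with a quick moment calculation. A sketch in Python on simulated data (Laplace draws standing in for heavy-tailed returns; this is illustrative, not the actual S&P 500 series):

```python
import random

def excess_kurtosis(xs):
    """Sample excess kurtosis: m4 / m2^2 - 3, which is 0 for a Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

random.seed(42)
# Laplace(0, 1) = exponential magnitude with a random sign; excess kurtosis = 3
laplace = [random.expovariate(1.0) * random.choice([-1, 1]) for _ in range(20000)]
gaussian = [random.gauss(0, 1) for _ in range(20000)]

print(round(excess_kurtosis(laplace), 2), round(excess_kurtosis(gaussian), 2))
```

Positive excess kurtosis for the heavy-tailed sample and roughly zero for the Gaussian is exactly the pattern the kernel density plot shows for daily returns.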

A good place to start is a simple test of the random walk hypothesis.  If share prices follow a random walk with a drift, then tomorrow's share price should be the sum of today's share price and the drift term (and some statistical white noise).   Thus if we regress tomorrow's return on today's return, the coefficient on today's return should be statistically insignificant.

R_{t+1} = μ + β R_t + ε_{t+1}
H_0: β = 0
Here is a scatter plot of FTSE-100 returns from 2 April 1984 - 24 August 2011.  The above regression specification is plotted in red, and the random walk null hypothesis is plotted in grey.  The two are virtually identical for the FTSE-100.
Although OLS parameter estimates are consistent in the presence of arbitrary heteroskedasticity and serial correlation,  OLS estimates of the standard errors are not.  To make sure my inferences are valid, I use HAC standard errors.  For the S&P 500 and the FTSE-100, using HAC standard errors, the coefficient on today's return is statistically insignificant.  However, even with HAC standard errors, the coefficient on today's return remains positive and significant for the Straits Times Index.  What this says to me is that predicting tomorrow's return using a linear model is pretty damned useless (unless you invest in Singaporean equities!).  But why should we limit ourselves to a linear model?  We shouldn't.
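The HAC (Newey-West) correction can be sketched compactly. Here is a bare-bones Python version for the simple regression above, run on simulated null data (numpy assumed; the post itself presumably uses R, where the sandwich package's NeweyWest estimator does this properly):

```python
import numpy as np

def hac_slope_test(x, y, lags=5):
    """OLS slope with a Newey-West (Bartlett kernel) standard error.
    A bare-bones sketch, not a production HAC implementation."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    # "Meat" of the sandwich: weighted sum of autocovariances of X_t * e_t
    g = X * e[:, None]
    S = g.T @ g
    for L in range(1, lags + 1):
        w = 1.0 - L / (lags + 1.0)  # Bartlett weights
        gamma = g[L:].T @ g[:-L]
        S += w * (gamma + gamma.T)
    V = XtX_inv @ S @ XtX_inv
    se_slope = np.sqrt(V[1, 1])
    return beta[1], beta[1] / se_slope  # slope and its HAC t-statistic

rng = np.random.default_rng(0)
x = rng.normal(0, 0.01, 2000)  # stand-in for today's returns
y = rng.normal(0, 0.01, 2000)  # tomorrow's returns, unrelated by construction
slope, t_stat = hac_slope_test(x, y)
print(round(slope, 4), round(t_stat, 2))
```

On this simulated random-walk-style data the slope is near zero and the HAC t-statistic is insignificant, which is the S&P 500/FTSE-100 pattern; the STI is the odd one out.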

Above, I made the implicit assumption that the relationship between today's return and tomorrow's return is linear.  Is this assumption supported by the data?  One way we can test this assumption is to calculate the observed difference in the mean-squared error of the linear specification and that of a general non-linear specification.  One would expect that a general non-linear model would have a lower mean-squared error because it is more flexible, so we need to check whether this difference is statistically significant.  One way to do this is to generate a sampling distribution of this difference using the bootstrap, and then see how often we would observe a difference between the linear and non-linear specification as large as what we actually did observe.  If such a large difference is very rare, then this is evidence that the linear model is mis-specified.

Now, some care needs to be taken in generating the bootstrap replicates, given that the time series exhibits such complicated dependencies.  For my implementation, I generate the simulated response variable by resampling the residuals from the linear null model.
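That residual-resampling scheme can be sketched end-to-end. Python for illustration, with a cubic polynomial standing in for the smoothing spline so the sketch stays self-contained (the function names and simulated data are mine, not the post's; numpy assumed):

```python
import numpy as np

def mse_gap(x, y, deg=3):
    """MSE of the linear fit minus MSE of a more flexible fit (here a cubic
    polynomial standing in for the post's smoothing spline); always >= 0
    in-sample since the flexible model nests the linear one."""
    lin = np.polyval(np.polyfit(x, y, 1), x)
    flex = np.polyval(np.polyfit(x, y, deg), x)
    return np.mean((y - lin) ** 2) - np.mean((y - flex) ** 2)

def boot_pvalue(x, y, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    # Null model: linear. Resample its residuals to build fake responses
    # in which the linear model is true by construction.
    coef = np.polyfit(x, y, 1)
    fitted = np.polyval(coef, x)
    resid = y - fitted
    observed = mse_gap(x, y)
    exceed = 0
    for _ in range(reps):
        y_star = fitted + rng.choice(resid, size=len(resid), replace=True)
        if mse_gap(x, y_star) >= observed:
            exceed += 1
    return exceed / reps  # small p-value => reject the linear specification

rng = np.random.default_rng(1)
x = rng.normal(0, 0.01, 500)
y = 0.1 * x + rng.normal(0, 0.01, 500)  # truly linear relationship
print(boot_pvalue(x, y))
```

Because the fake data here really are linear, the observed MSE gap is just another draw from the null distribution and the p-value should be unremarkable; on the real S&P 500 data the spline's improvement turns out to be too large to explain away.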

Linear or Non-Linear? That is the question...

For computational reasons, I use a smoothing spline to estimate my non-linear specification, using generalized cross-validation to choose the curvature penalty (you could easily use some other estimation method, such as kernel regression, in place of the smoothing spline).  Even though the smoothing spline has the lower MSE in all three cases, it is only significantly lower for the S&P 500.  The p-value for the bootstrap specification test (sketched above) is roughly 0.03 for the S&P 500, indicating that the linear specification can be rejected (the p-values are 0.18 and 0.13 for the FTSE-100 and the STI, respectively).

I plot the smoothing spline and 95% confidence bands (these are basic confidence intervals generated via a non-parametric bootstrap that re-samples the residuals of the smoothing spline) for both the S&P 500 and the Straits Times Index.  Since we are interested in prediction, I plot the splines over a uniform grid of m=500 points (instead of simply evaluating and plotting the splines at the observed values of today's return).  Given that the smoothing spline is not significantly superior to the linear model for STI returns, I include both the smoothing spline (blue) and the OLS regression (red) specifications in the scatter plot of STI returns.  Do these confidence bands seem too "tight" to anyone else?  Sample sizes are pretty large (over 15,000 obs. for the S&P 500)...
Could we use the above models to craft a profitable investment strategy?  Note that the OLS regressions for both the S&P 500 and the Straits Times Index have positive slope.  This is because the bulk of the data (say, for returns in [-0.02, 0.02]) exhibits mild "trend following" behavior.  Using the estimated smoothing splines, if today's S&P 500 return was -0.01 (or down 1%), the expected value of tomorrow's return lies between -0.00156 (down 0.156%) and -0.000657 (down 0.0657%) with 95% confidence (these are bands on the conditional mean, not on tomorrow's return itself).  Similarly, if today's S&P 500 return was 0.01 (or up 1%), the expected value of tomorrow's return lies between 0.000978 (up 0.0978%) and 0.00188 (up 0.188%).  Taking advantage of such minute expected returns would require, I think, either a large bankroll, or lots of leverage (or both!).

Also, what is the best way to trade the S&P 500 "market portfolio" in a cost efficient manner on a daily basis?  Exchange Traded Funds like the SPY might be an option.  Here is a plot of the smoothing spline fit to the data for SPY over the entire history of the fund (trading began on 29 January 1993).  I include the OLS regression fit because the slope is negative and significant (even using HAC standard errors) and because the spline fit is not a significant improvement over the linear model (p-value of 0.53 using my bootstrap specification test).
This plot of the SPY data from 1993 onwards looks quite a bit different, and the behavior of returns seems quite a bit different than the corresponding plot of S&P 500 data from 1950 onwards (despite the fact that SPY tracks S&P 500 almost identically from 1993 forward).  If the parameters of the underlying data generating process (DGP) driving stock returns remained constant over time, then one would expect to find that subsets of the data behaved similarly to the entire sample.  Maybe there are simply structural breaks in the data? Or, remember what I said about stock returns being non-stationary!

What about the FTSE-100?  Is all hope lost in trying to predict FTSE returns using past data? I may try predicting whether or not tomorrow's return will be positive given today's observed return using logistic regression in a future post (I am currently having difficulties getting R to give me the correct predicted values for the logistic models!  It worked fine yesterday!)

Update: Code is now available!