## Tuesday, June 7, 2011

### Today's Distraction...

For the past month or so I have been working my way through Cosma Shalizi's excellent course on data analysis.  Today I distracted myself from my own research by playing with Smoothing Splines (Cosma's Lecture 11).  Everything below was done in R.

First, I grabbed some historical data for the S&P 500 from Yahoo Finance using the get.hist.quote() function from the tseries library.  I pulled down daily, weekly, and monthly data starting on 3 January 1950 and ending 31 December 2010 (start and end dates are for the daily data).  I then constructed S&P 500 returns by taking the first difference of the logarithm of the S&P 500 Adjusted Closing Price.  Here are (probably familiar) time series plots of the daily returns and a density plot...
Note that stock returns exhibit clustered volatility and are negatively skewed with significantly heavier tails than one would expect if returns were Gaussian.  On a side note (related to my current research), a couple of years ago some researchers at the Santa Fe Institute (specifically Stefan Thurner, J. Farmer, and John Geanakoplos) published a paper titled "Leverage Causes Fat Tails and Clustered Volatility."  Their model also predicts that returns should be negatively skewed (a point I think they should have included in the title).
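As a rough illustration of the return construction and the skewness and tail claims above, here is a minimal sketch in base R. It is not the code behind the plots: the price series is simulated (so it is Gaussian by construction), and the helper names `sample_skewness` and `sample_excess_kurtosis` are made up for this example.

```r
# Synthetic price series standing in for the S&P 500 adjusted close
# (hypothetical data; the real series came from get.hist.quote()).
set.seed(42)
n <- 1000
prices <- 100 * exp(cumsum(rnorm(n, mean = 0.0003, sd = 0.01)))

# Log returns: first difference of the log price.
returns <- diff(log(prices))

# Sample skewness: third central moment scaled by variance^(3/2).
sample_skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Sample excess kurtosis: fourth central moment over variance^2, minus 3
# (so Gaussian data should give a value near zero).
sample_excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

skew <- sample_skewness(returns)
kurt <- sample_excess_kurtosis(returns)
```

On real daily returns one would expect `skew` to come out negative and `kurt` well above zero, in line with the description above; on this simulated series both should sit near zero.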

Back to learning about splines! Is today's S&P 500 return useful in predicting tomorrow's S&P 500 return?  For the null hypothesis, I take a strongish form of the Efficient Market Hypothesis (EMH):
• H0: Stock prices follow a random walk with a drift (i.e., returns should be mean zero white noise)
For alternative hypotheses, I use a brutally simple parametric model, and then whatever functional form the smoothing spline finds:
• HA,1: r_{t+1} = β0 + β1 r_t + ε_t
• HA,2: Whatever the smoothing spline kicks out
To test the null against the parametric alternative, we simply need to test the joint restriction that β0=β1=0.  Presumably to test the smoothing spline, we need to calculate some 95% confidence bands for the fitted spline and then look to see if the confidence bands contain the curve (really a line) predicted by the null hypothesis.  This is my first time using splines, so if anyone out there knows of a better way (or a good book on how) to do hypothesis testing with smoothing splines, I would be interested in hearing from you.
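The joint restriction β0=β1=0 can be checked with a classical F test that compares the residual sum of squares of the fitted regression with that of the restricted model (where the prediction is identically zero). The sketch below uses simulated white-noise returns and assumes homoskedastic errors, which real returns clearly violate, so treat it as an illustration of the mechanics only.

```r
# Hypothetical white-noise returns (not the S&P 500 data).
set.seed(1)
r <- rnorm(500, mean = 0, sd = 0.01)
today    <- r[-length(r)]   # r_t
tomorrow <- r[-1]           # r_{t+1}

# Unrestricted model: intercept and slope both free.
fit <- lm(tomorrow ~ today)

n    <- length(tomorrow)
rss1 <- sum(residuals(fit)^2)   # unrestricted residual sum of squares
rss0 <- sum(tomorrow^2)         # restricted: beta0 = beta1 = 0
q    <- 2                       # number of restrictions being tested

# F statistic and p-value for the joint restriction.
f_stat  <- ((rss0 - rss1) / q) / (rss1 / (n - 2))
p_value <- pf(f_stat, df1 = q, df2 = n - 2, lower.tail = FALSE)
```

A small `p_value` would count as evidence against the joint restriction, subject to the caveat about the error assumptions.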

Here is a scatter plot of tomorrow's return against today's return.  I fit a simple linear regression to the data and plotted the fitted line in gray.  While both the slope and intercept terms are very significant (p-values essentially zero for both), the standard confidence intervals are not valid (they are much too narrow) given the blatant violation of the Gauss-Markov assumptions.  More work needs to be done before we can take this as evidence against the EMH null.  Since this post is about smoothing splines, I will simply say that I would be surprised if the parameter estimates were still significant after calculating appropriate standard errors (via the bootstrap, heteroskedasticity-robust standard errors, etc.)...but maybe!
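For a sense of what heteroskedasticity-robust standard errors involve, here is a minimal base-R sketch of the simplest (HC0, White-style) sandwich estimator. The data are simulated with an error variance that grows with |x|, which is an assumption made purely for illustration; this was not run on the S&P 500 data.

```r
set.seed(2)
n <- 2000
x <- rnorm(n)
# Hypothetical heteroskedastic errors: the error sd grows with |x|.
y <- 0.1 * x + rnorm(n, sd = 0.5 + abs(x))

fit <- lm(y ~ x)
X <- model.matrix(fit)   # n x 2 design matrix (intercept, x)
e <- residuals(fit)

# Sandwich estimator: (X'X)^{-1} [X' diag(e^2) X] (X'X)^{-1}
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)    # row i of X is scaled by e[i]
vcov_hc0 <- bread %*% meat %*% bread

robust_se  <- sqrt(diag(vcov_hc0))
classic_se <- sqrt(diag(vcov(fit)))
```

In this simulated setup the robust standard error for the slope comes out larger than the classical one, which is the sense in which the usual intervals are "too narrow".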

The smoothing spline is in orange.  I used the smooth.spline() function in R to fit the spline (using leave-one-out cross-validation to pick the optimal penalty for the curvature).
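A minimal sketch of that kind of fit, using simulated returns rather than the actual data: passing `cv = TRUE` to smooth.spline() requests ordinary leave-one-out cross-validation, whereas the default (`cv = FALSE`) uses generalized cross-validation.

```r
# Hypothetical returns (simulated white noise, not the S&P 500 series).
set.seed(3)
r <- rnorm(400, sd = 0.01)
today    <- r[-length(r)]
tomorrow <- r[-1]

# Fit the smoothing spline, choosing the curvature penalty by
# leave-one-out cross-validation.
fit <- smooth.spline(x = today, y = tomorrow, cv = TRUE)

# Evaluate the fitted curve on an evenly spaced grid for plotting.
grid <- seq(min(today), max(today), length.out = 100)
fitted_curve <- predict(fit, x = grid)$y

# The selected penalty and the implied effective degrees of freedom.
chosen_lambda <- fit$lambda
effective_df  <- fit$df
```

Calling `lines(grid, fitted_curve)` on top of a scatter plot of `tomorrow` against `today` would overlay the spline.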
If stock prices reflect all relevant information about the value of the stock, then one would expect that today's return should be pretty useless in predicting tomorrow's return (thus under the null the true regression line should be the dotted red line in the above scatter).

But what about the smoothing spline?  A few things:
1. While the regression line is positively sloped, the smoothing spline is negatively sloped for large negative and large positive values of today's return.
2. The asymmetry.  The slope of the smoothing spline is more negative for large negative values of today's return (compared with the slope of the smoothing spline for large positive values of today's return).
3. Outliers:  The October 1987 stock market crash looms large in the data. How sensitive is the estimated smoothing spline to these 1-2 observations?
4. Is the asymmetry of the smoothing spline a statistical artifact?
5. Most importantly, despite the dramatic appearance, is the smoothing spline significantly different from the dotted red line?
I am going to focus on point 5 for the rest of the post.  I calculated 95% confidence bands for the smoothing spline using a bootstrap re-sampling of the data points.  Basically, I re-sampled (with replacement) the stock return data creating a new synthetic data set, fit a smoothing spline to this new data set, and then repeated the process a bunch of times to build up a distribution that I could use to create the confidence bands.
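The resampling procedure described above can be sketched roughly as follows. This is an illustration on simulated data, not the code behind the plots. It makes a few assumptions: the bands are pointwise percentile bands, each replicate uses the default generalized cross-validation (leave-one-out CV is awkward once resampling creates duplicate x values), and the null line is taken to be flat at zero.

```r
# Hypothetical returns (simulated white noise).
set.seed(4)
r <- rnorm(300, sd = 0.01)
dat <- data.frame(today = r[-length(r)], tomorrow = r[-1])

# Evaluate every bootstrap curve on a common grid (central 90% of the data).
qs   <- quantile(dat$today, c(0.05, 0.95), names = FALSE)
grid <- seq(qs[1], qs[2], length.out = 50)

n_boot <- 200
boot_curves <- replicate(n_boot, {
  idx <- sample(nrow(dat), replace = TRUE)          # resample (x, y) pairs
  fit <- smooth.spline(dat$today[idx], dat$tomorrow[idx])
  predict(fit, x = grid)$y                          # refit, then evaluate
})  # one column per bootstrap replicate

# Pointwise 95% percentile bands across the bootstrap curves.
lower <- apply(boot_curves, 1, quantile, probs = 0.025)
upper <- apply(boot_curves, 1, quantile, probs = 0.975)

# Share of grid points where the null line (zero here) lies inside the band.
null_line    <- rep(0, length(grid))
share_inside <- mean(null_line >= lower & null_line <= upper)
```

Because the bands are pointwise, a line that leaves them over a small region is weaker evidence against the null than it would be with simultaneous bands.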
The dotted-red line lies almost entirely within the 95% confidence band for the smoothing spline (if you squint you can kind of see a small portion of the dotted-red line that lies outside the bands).  So despite the dramatic appearance of the smoothing spline, I would say that we cannot statistically distinguish it from the dotted-red line.

I was curious to see what the above plot might look like if I used weekly and monthly S&P 500 returns instead of daily returns...

Weekly Returns:

Monthly Returns:

I was surprised at how different the daily, weekly, and monthly smoothing splines turned out to be...still in all three cases the 95% confidence bands for the smoothing spline contain (almost completely) the red-dotted line.  I will have a think as to why they are so different, and perhaps follow up with another post.  My R code will be posted as soon as I have time to get my Google Code page up and running...until then feel free to email me (or leave email in a comment) and I will send it to you.

Update: As pointed out in a comment below, EMH predicts stock returns should follow a random walk with a drift...which implies that the dotted-red line doesn't necessarily need to have a zero intercept.  One would hope that the drift is slightly positive!