This is the first of a series of posts summarizing the work I’ve done on Stock Market Prediction as part of my portfolio project at Data Science Retreat.

The scope of this post is to get an overview of the whole work, specifically walking through the foundations and core ideas.

First of all I provide the list of modules needed to have the Python code running correctly in all the following posts. I import them only once at the beginning and that’s it.

import cPickle  # Python 2 only; on Python 3 use pickle
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from datetime import datetime  # rebinds the name datetime to the class
from sklearn.ensemble import RandomForestClassifier
from sklearn import neighbors
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
import operator
import pandas.io.data  # removed from later pandas releases (now the pandas-datareader package)
from sklearn.qda import QDA  # later moved to sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis
import re
from dateutil import parser
from backtest import Strategy, Portfolio

Nice! Now we can start!

Introduction

The idea at the base of this project is to build a model to predict financial market movements. The forecasting algorithm aims to foresee whether tomorrow’s closing price is going to be lower or higher than today’s. The next step will be to develop a trading strategy on top of these predictions and backtest it against a benchmark.

Specifically, I’ll go through the pipeline, decision process and results I obtained trying to model S&P 500 daily returns.

My whole work will be structured as follows:

Problem Definition

The aim of the project is to predict whether future daily returns of the S&P 500 are going to be positive or negative.

Thus the problem I’m facing is a binary classification.

The metric I deal with is daily return which is computed as follows:

$$\mathrm{Return}_i = \frac{\mathrm{AdjClose}_i - \mathrm{AdjClose}_{i-1}}{\mathrm{AdjClose}_{i-1}}$$

The Return on the i-th day is equal to the Adjusted Close Price on the i-th day minus the Adjusted Close Price on the (i-1)-th day, divided by the Adjusted Close Price on the (i-1)-th day. The Adjusted Close Price of a stock is its close price modified to take dividends into account. It is common practice to use this metric in return computations.
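As a quick sanity check of the formula above, here is a minimal sketch on made-up prices (the numbers and the series name `AdjClose` are illustrative assumptions, not real S&P 500 data). It compares the explicit formula with the pandas `pct_change` helper, which is what the download functions later in this post rely on.

```python
import pandas as pd

# Hypothetical adjusted close prices, for illustration only
prices = pd.Series([100.0, 102.0, 99.96], name="AdjClose")

# Explicit formula: (AdjClose_i - AdjClose_{i-1}) / AdjClose_{i-1}
manual = (prices - prices.shift(1)) / prices.shift(1)

# Same computation via the pandas helper
returns = prices.pct_change()

# The first entry is NaN in both cases: day 0 has no previous day
print(pd.DataFrame({"manual": manual, "pct_change": returns}))
```

The first row carries no return, which is why one observation is always lost when moving from prices to returns.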

From the beginning I decided to focus only on the S&P 500, a stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE (New York Stock Exchange) or NASDAQ. Being such a diversified portfolio, the S&P 500 index is typically used as a market benchmark, for example to compute betas of companies listed on the exchange.

Feature Analysis

The main idea is to use the world’s major stock indices as input features for the machine learning based predictor. The intuition behind this approach is that globalization has deepened the interaction between financial markets around the world. The shock wave of the US financial crisis (triggered by the Lehman Brothers collapse) hit the economy of almost every country, and the debt crisis that originated in Greece brought down all major stock indices. Nowadays, no financial market is isolated. Economic data, political perturbations and any other overseas events can cause dramatic fluctuations in domestic markets. A “bad day” on the Australian or Japanese exchange is likely to affect the Wall Street opening and trend. In light of these considerations the following predictors have been selected:

  • NASDAQ (^IXIC)
  • Dow Jones (Quandl code YAHOO/INDEX_DJI)
  • Frankfurt (^GDAXI)
  • London (^FTSE)
  • Paris (^FCHI)
  • Tokyo (^N225)
  • Hong Kong (^HSI)
  • Australia (^AXJO)

It is very easy to get historical daily prices for these indices. Python provides libraries that handle the download. The data can be pulled down from Yahoo Finance or Quandl and cleanly formatted into a dataframe with the following columns:

  • Date : in days
  • Open : price of the stock at the opening of the trading (in US dollars)
  • High : highest price of the stock during the trading day (in US dollars)
  • Low : lowest price of the stock during the trading day (in US dollars)
  • Close : price of the stock at the closing of the trading (in US dollars)
  • Volume : number of shares traded during the trading day
  • Adj Close : price of the stock at the closing of the trading adjusted with dividends (in US dollars)

The following is a screenshot of the Yahoo Finance website showing a subset of NASDAQ Composite historical prices. This is exactly what a Pandas DataFrame looks like after the data has been downloaded.


Output of Prediction

How do I plug the desired output of my prediction into my dataframe? The answer is pretty straightforward: repeat the exact same steps followed for the predictors. Thus, together with the 8 selected major stock indices, we’ll end up downloading a 9th dataset for the S&P 500. Notice that the output of our prediction is a binary class; we want to be able to answer the following question: is tomorrow going to be an Up or a Down day? To do that, the S&P data must undergo a simple two-step manipulation:

  1. Compute S&P 500 daily returns (we’ll do this for predictors as well, as discussed in the next post).
  2. Generate an additional column in the DataFrame with ‘Up’ whenever the return on that specific day was positive and ‘Down’ whenever it was negative.
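The second step can be sketched as follows. This is only an illustration on made-up returns: the column name `UpDown` is hypothetical, `Return_Out` is borrowed from the download code at the end of this post, and labelling a zero return as ‘Down’ is an assumption of the sketch, since the text only covers strictly positive and negative returns.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns; the first one is NaN because there is no previous day
df = pd.DataFrame({"Return_Out": [np.nan, 0.012, -0.004, 0.0, 0.007]})

# 'Up' for strictly positive returns, 'Down' otherwise.
# Treating a zero return as 'Down' is an assumption of this sketch.
df["UpDown"] = np.where(df["Return_Out"] > 0, "Up", "Down")

# The first row has no return, so drop it rather than keep a meaningless label
df = df.dropna(subset=["Return_Out"])

print(df)
```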

This step of the pipeline is very important and it must be absolutely clear, so I’ll add a few words to what I’ve already written. As I stressed, the output of my prediction is whether S&P 500 daily returns are positive or not. To carry out this kind of prediction I use the following indices: NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia and the S&P 500 itself. Obviously I won’t use the S&P 500 return of a given day to forecast that same return! This would not make sense. What I mean by the S&P 500 itself is that I’ll work with S&P 500 historical close prices, lagging them in time accordingly. The intuition is that I do not want to lose any potential information contained in the output data.
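To make the lagging idea concrete, here is a minimal sketch under the assumption that “lagging” means shifting past values forward, so that the row for a given day only contains prices known at earlier closes. The prices, the dates and the `_lag1` / `_lag2` column names are hypothetical; the actual feature generation is the subject of the next post.

```python
import pandas as pd

# Hypothetical S&P 500 adjusted closes indexed by business day
dates = pd.date_range("2014-01-06", periods=5, freq="B")
out = pd.DataFrame({"AdjClose_Out": [100.0, 101.0, 99.0, 102.0, 103.0]}, index=dates)

# Target side: the return realised on each day
out["Return_Out"] = out["AdjClose_Out"].pct_change()

# Feature side: closes of the previous 1 and 2 trading days
for lag in (1, 2):
    out["AdjClose_Out_lag%d" % lag] = out["AdjClose_Out"].shift(lag)

print(out)
```

With this layout, each row pairs the day’s return with close prices from earlier days only, so the same-day close never leaks into the features.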

So to recap the logic is the following:

  1. Download 9 dataframes (NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia, S&P 500).
  2. Compute S&P 500 daily returns and turn them into a binary variable (Up, Down). This is my output and it won’t be touched anymore.
  3. Play with all the other columns of the 9 available dataframes (S&P 500 included) as explained in the following post.
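Since the nine markets do not share the same trading calendar, the dataframes eventually have to be aligned on their dates. The sketch below shows one possible way to do that with a left join on the date index; it is an assumption about how the pieces could be combined, not necessarily the approach used in the following posts, and all values are made up.

```python
import pandas as pd

# Two hypothetical frames with partly different trading days
out = pd.DataFrame(
    {"Return_Out": [0.01, -0.02, 0.005]},
    index=pd.to_datetime(["2014-01-06", "2014-01-07", "2014-01-08"]),
)
nikkei = pd.DataFrame(
    {"Return_^N225": [0.003, 0.004]},
    index=pd.to_datetime(["2014-01-06", "2014-01-08"]),
)

# Left-join on the date index so that the S&P 500 calendar drives the rows;
# days on which the other market was closed show up as NaN.
merged = out.join(nikkei, how="left")

print(merged)
```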

For the sake of completeness I attach the Python code in charge of data gathering and very first preparation:

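def getStock(symbol, start, end):
    """
    Downloads Stock from Yahoo Finance.
    Computes daily Returns based on Adj Close.
    Returns pandas dataframe.
    """
    # pd.io.data comes from the old pandas.io.data module imported above
    df = pd.io.data.get_data_yahoo(symbol, start, end)
    # Rename the last column ('Adj Close') without mutating the Index in place
    df.columns = list(df.columns[:-1]) + ['AdjClose']
    # Suffix every column with the ticker, e.g. 'Close_^IXIC'
    df.columns = df.columns + '_' + symbol
    df['Return_%s' % symbol] = df['AdjClose_%s' % symbol].pct_change()
    return df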

def getStockFromQuandl(symbol, name, start, end):
    """
    Downloads Stock from Quandl.
    Computes daily Returns based on Adj Close.
    Returns pandas dataframe.
    """
    import Quandl  # legacy Quandl client
    df = Quandl.get(symbol, trim_start=start, trim_end=end, authtoken="your token")
    # Rename the last column to 'AdjClose' without mutating the Index in place
    df.columns = list(df.columns[:-1]) + ['AdjClose']
    # Suffix every column with the given name, e.g. 'Close_Djia'
    df.columns = df.columns + '_' + name
    df['Return_%s' % name] = df['AdjClose_%s' % name].pct_change()
    return df

def getStockDataFromWeb(fout, start_string, end_string):
    """
    Collects predictors data from Yahoo Finance and Quandl.
    Returns a list of dataframes.
    """
    start = parser.parse(start_string)
    end = parser.parse(end_string)

    # Predictors: world major stock indices
    nasdaq = getStock('^IXIC', start, end)
    frankfurt = getStock('^GDAXI', start, end)
    london = getStock('^FTSE', start, end)
    paris = getStock('^FCHI', start, end)
    hkong = getStock('^HSI', start, end)
    nikkei = getStock('^N225', start, end)
    australia = getStock('^AXJO', start, end)
    djia = getStockFromQuandl("YAHOO/INDEX_DJI", 'Djia', start_string, end_string)

    # Output: the index to predict (fout), with columns suffixed by '_Out'
    out = pd.io.data.get_data_yahoo(fout, start, end)
    out.columns = list(out.columns[:-1]) + ['AdjClose']
    out.columns = out.columns + '_Out'
    out['Return_Out'] = out['AdjClose_Out'].pct_change()

    return [out, nasdaq, djia, frankfurt, london, paris, hkong, nikkei, australia]
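The download calls above depend on services and pandas modules that have changed since this code was written, so here is a sketch of the same post-processing applied to an in-memory frame instead. The helper name `prepare` and the sample prices are hypothetical; only the renaming, suffixing and return logic mirror the functions above.

```python
import pandas as pd


def prepare(df, name):
    """Rename the last column to 'AdjClose', suffix every column with
    the given name and add a daily return column (hypothetical helper)."""
    df = df.copy()  # leave the caller's frame untouched
    df.columns = list(df.columns[:-1]) + ["AdjClose"]
    df.columns = [str(col) + "_" + name for col in df.columns]
    df["Return_%s" % name] = df["AdjClose_%s" % name].pct_change()
    return df


# Made-up rows in the column layout described earlier in this post
raw = pd.DataFrame(
    {
        "Open": [10.0, 10.5],
        "High": [10.6, 11.0],
        "Low": [9.9, 10.4],
        "Close": [10.5, 10.8],
        "Volume": [1000, 1200],
        "Adj Close": [10.5, 10.8],
    },
    index=pd.to_datetime(["2014-01-06", "2014-01-07"]),
)

nasdaq = prepare(raw, "^IXIC")
print(list(nasdaq.columns))
```

Renaming by position, as the original functions do, assumes that the adjusted close is always the last column of the downloaded data; if the source changes its column order, renaming by column name would be the safer choice.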

Let’s move on to the details of feature generation.

by Francesco Pochetti