NIFTY 50 Index Prediction: A Multi-factor Based Approach
DISCLAIMER: This article is for educational and research purposes. Do not use these models in real-time trading or investing.
Contents
Abstract
1. Introduction
2. Related Work
3. Data Collection
4. Exploratory Data Analysis
5. Data Processing and Featurization
6. Experimentation and Results
7. Error Analysis
8. Data Pipeline
9. Deployment
10. Conclusion
11. Future Work
Code Reference
Contact Links
Reference
Abstract
The stock market is highly volatile and non-linear, and predicting stock prices accurately is very hard. The price of a stock goes up or down based on demand and supply, but it is influenced by various factors like corporate actions, economic policies, inflation, interest rates, global economic conditions, crude oil prices, wars, natural calamities, etc. As India is an emerging market, Indian stock markets are also influenced by global events. The NIFTY 50 price-to-earnings ratio (PE ratio), features from global indices like the S&P 500, NASDAQ Composite, and Euronext 100, and other factors like US 10-year Treasury yields, gold prices, crude oil prices, and the USD-INR exchange rate are used to predict the NIFTY 50 Index value. Different machine learning and deep learning algorithms are trained for the prediction and evaluated using the Root Mean Squared Error (RMSE) metric.
1. Introduction
Stock prices are extremely volatile and highly non-linear. Predicting them is an extremely difficult task, as they depend on various factors like market sentiment, political actions, global policies, the fundamentals of the stock, etc. A rise or fall in the stock price plays an important role in determining an investor's profits, so accurate prediction of stock prices can maximize those profits.
As India is an emerging market, the price of a stock depends not only on the fundamentals of the company and economic policies but also on global events.
NIFTY 50 PE ratio: The PE ratio tells whether the market is undervalued, overvalued, or fairly valued.
Crude oil prices:
How movement in crude oil price impacts economy and stock market (moneycontrol.com)
India imports 86 percent of its annual crude oil requirement. Since the payments are made in the US dollars, India’s deficit will depend on crude price as well as on the USD/INR exchange rates.
Crude price impact on the Indian economy: A higher crude price has a negative impact on the fiscal and current account deficits of the economy. An increase in these deficits leads to higher inflation and also affects monetary policy, consumption, and investment behaviour in the economy. A 10 percent increase in the oil price will increase the trade deficit by $7 billion, that is, the trade deficit will widen by 560 bps.
Impact on Indian financial markets: Energy stocks have 12.5 percent weightage in the Nifty50 and 15.2 percent in the Sensex. Hence, the Nifty and the Sensex are sensitive to oil price movements. Higher crude prices adversely affect tyre manufacturers, footwear, lubricants, paints, and airline companies.
Gold prices
Gold has traditionally been considered a safe haven and a hedge to equity markets (moneycontrol.com).
In times of a market crash, gold generally holds or increases its value. In asset allocation, it is advised to keep around 5–10% of holdings in gold as a hedge.
US 10-year Treasury yields:
Why the 10-Year U.S. Treasury Yield Matters
Foreign Institutional Investors decrease their exposure to Indian markets when US Treasury yields increase, thereby negatively impacting our indices. The reverse happens when Treasury yields decrease.
USD-INR exchange rate:
Impact of Exchange Rate Fluctuation (equityfriend)
Indian companies can be divided into two groups based on the impact of currency fluctuation on their stock price and profitability:
Net Exporters — These companies sell products abroad and receive payment in foreign currency (be it dollar, pound, euro, etc.). Whenever the rupee appreciates against these currencies, the companies incur a translation loss, as the same amount of foreign currency buys fewer rupees. This translation loss hurts their profitability, since their raw-material costs are in rupees. Conversely, profitability increases when the rupee depreciates.
Net Importers — These companies buy products from abroad and make payments in foreign currency. Whenever the rupee appreciates, they can buy more foreign currency for the same rupees, resulting in an overall translation gain and higher profitability. Similarly, profitability decreases when the rupee depreciates.
NIFTY 50 is not overly dependent on the USD-INR currency over the longer term (motilaloswal).
Other market indices: Indian indices generally open in the red or green depending on how the US markets moved, though this is not always true. Similarly, around 2 pm IST, the NIFTY 50 index often rises or falls depending on whether the European markets, which open around that time, open in the red or green.
Features are extracted from market indices and other market influencers and machine learning and deep learning models are trained to predict the NIFTY 50 Index value. Since this is a regression problem, the Root Mean Squared Error (RMSE) metric is used to evaluate the performance of different models.
2. Related Work
In [2], the authors collected data for Nike, Goldman Sachs, J&J, Pfizer, and JPMC from Yahoo Finance, spanning 04 May 2009 to 04 May 2019. Six features are extracted, namely High-Low (H-L), Close-Open (C-O), 7-day moving average, 14-day moving average, 21-day moving average, and 7-day standard deviation. The authors trained an Artificial Neural Network (ANN) and a Random Forest model to predict closing prices, using RMSE, Mean Absolute Percentage Error (MAPE), and Mean Bias Error (MBE) as metrics, and found that the ANN-based model gives better prediction results.
In [3], the authors used two different datasets and built two different ARIMA models for stock price prediction. The two stocks considered are Nokia (from 25th April 1995 to 25th February 2011) and Zenith Bank (from 3rd January 2006 to 25th February 2011). The author experimented with different values of p, d, q in the ARIMA model, and for final model selection, the author used three metrics, namely BIC (Bayesian Information Criterion), Adjusted R square, and Standard Error of Regression.
In [4], the authors used RNN, LSTM, and CNN models for stock price prediction. The dataset consists of minute-wise prices of 3 stocks, namely TCS, INFOSYS, and CIPLA, from July 2014 to June 2015. As the price ranges of the three stocks differed, the authors normalized the data to the range 0 to 1. They used a sliding-window approach for short-term prediction: the window size was fixed at 100 minutes with an overlap of 90 minutes, and a prediction was made 10 minutes into the future. All three models were trained for 1000 epochs, RMSE was used as the metric, and the model with the lowest RMSE was taken as the final model. CNN gave the lowest RMSE for the prediction of stock prices.
[5] is a two-part blog on Medium. The author used NIFTY 50 prices from 01 January 2000 to 31 December 2019, along with Twitter news data from NDTVProfit for news sentiment in the prediction. The author used ARIMA and a variation of ARIMA for prediction and got RMSEs of 1707 and 965, respectively. The author then used LSTM without and with news polarity, getting RMSEs of 285 and 170, respectively. For the LSTM model with news polarity, however, only the last 5 years of data were used, since news information was available only for that period.
In [6], the author used a linear relationship between NIFTY 50 and other factors (market cues) like crude oil, currency, the bond market, and the movement of Japanese and US stock indices to predict the NIFTY 50 index value. The image below shows the linear relationship the author assumed.
The author considered data from 1999 to 2007 for the analysis. For every year, the author computed the weights (a, b1 to b5) using regression analysis, used R-squared as the metric, and achieved more than 80% R² except in 2 years.
3. Data Collection
All data except the NIFTY 50 PE ratio is downloaded from Yahoo Finance; the NIFTY 50 PE ratio is downloaded from trendlyne.com. Data covers 17 September 2007 to 11 August 2021. The data downloaded from Yahoo Finance has some missing values, which are imputed manually using values from investing.com.
4. Exploratory Data Analysis
4.1. NIFTY 50:
# loading NIFTY 50 data
nifty = pd.read_csv("NIFTY 50.csv", parse_dates = True)
nifty["Date"] = pd.to_datetime(nifty["Date"], format = "%Y-%m-%d")
nifty.head()
nifty.info()
The dataset consists of 7 columns
- Date
- Open: The opening value of the index
- High: Highest index value on that Date
- Low: Lowest index value on that Date
- Close: The closing value of the index adjusted for splits
- Adj Close: The adjusted closing value of the index for both dividends and splits
- Volume: Number of shares traded on that Date
# displaying NaN rows of NIFTY 50
na_rows(nifty).head()
The above missing values are imputed from investing.com. Volume information is available only from 18 January 2013, so the Volume variable is not considered in our analysis. It is observed that Close and Adj Close have the same values, so the Adj Close column is removed.
A new column with the percentage change between High and Low for each day is added.
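As a minimal sketch (on a toy frame standing in for the NIFTY data, which has the same High/Low columns), the new column can be computed like this:

```python
import pandas as pd

# Toy frame with the same High/Low columns as the NIFTY data
nifty = pd.DataFrame({"High": [17500.0, 17650.0], "Low": [17300.0, 17400.0]})

# intraday range as a percentage of the day's Low
nifty["(H-L)*100/L"] = (nifty["High"] - nifty["Low"]) * 100 / nifty["Low"]
```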
Whenever a time dimension is present, it is important to split the data into train, test, and cross-validation sets based on time, since a random split would leak future information into the training set. Time-based splitting is done in the ratio 80:20.
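The article's `time_based_train_test_split` helper is not shown; a minimal sketch of what such a split might look like (the function body here is an assumption) is:

```python
import pandas as pd

def time_based_train_test_split(df, train_frac=0.8):
    """Split a date-sorted frame: first 80% of rows -> train, last 20% -> test.
    No shuffling, so the test set never leaks into the (earlier) train set."""
    df = df.sort_values("Date").reset_index(drop=True)
    cut = int(len(df) * train_frac)
    return df.iloc[:cut].copy(), df.iloc[cut:].copy()

# toy usage
dates = pd.date_range("2020-01-01", periods=10, freq="D")
toy = pd.DataFrame({"Date": dates, "Close": range(10)})
train, test = time_based_train_test_split(toy)
```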
nifty_train, nifty_test = time_based_train_test_split(nifty_imputed)

# visualize data
plt.figure()
plt.plot(nifty_train["Date"], nifty_train["Close"], "g", label = "train data")
plt.plot(nifty_test["Date"], nifty_test["Close"], "b", label = "test data")
plt.legend()
plt.xlabel("Year")
plt.ylabel("Index value")
plt.title("NIFTY 50 Index")
plt.grid()
plt.show()
4.2. NIFTY 50 PE ratio:
# loading NIFTY 50 PE data
nifty_pe = pd.read_csv("NIFTY 50 PE.csv")
nifty_pe.head()
# splitting NIFTY 50 PE based on time
nifty_pe_train, nifty_pe_test = time_based_train_test_split(nifty_pe)

# plotting NIFTY 50 wrt NIFTY 50 PE
plot_nifty_wrt_others(nifty_pe_train, nifty_pe_test, "NIFTY 50 PE", col = "PE_NIFTY50")
Observation: From the above plot, we can see that whenever the PE went above 25, the market corrected; the exception was the pandemic period, when the market fell sharply in March 2020. From 01 April 2021, the consolidated earnings of companies were taken into account, hence the drastic fall in PE.
4.3. Crude oil prices:
# splitting crude oil prices dataframe based on time
crude_train, crude_test = time_based_train_test_split(crude_imputed)# plotting NIFTY 50 wrt crude oil closing prices
plot_nifty_wrt_others(crude_train, crude_test, "Brent Crude Oil Prices", col = "crude close")
Observation: Moderate or fair oil prices either impact the index positively or have no impact, whereas very high prices impact the index negatively.
4.4. Gold prices:
# splitting gold prices dataframe based on time
gold_train, gold_test = time_based_train_test_split(gold_imputed)

# plotting NIFTY 50 wrt gold closing prices
plot_nifty_wrt_others(gold_train, gold_test, "Gold Prices", col = "gold close")
Observation: In 2008, when the market crashed, gold held its value and remained almost constant around the 900 level. When the market crashed in 2020, gold prices increased from around the 1600 level to the 2000 level.
4.5. Euronext 100:
The Euronext 100 Index is the blue chip index of the pan-European exchange, Euronext NV [Source]
# splitting euronext 100 index dataframe based on time
euronext_train, euronext_test = time_based_train_test_split(euronext_imputed)

# plotting NIFTY 50 wrt euronext 100 index
plot_nifty_wrt_others(euronext_train, euronext_test, "Euronext 100 Index value", col = "euronext close")
Observation:
- When the time range is divided into intervals, it can be observed that both indices move in the same direction.
- Example: Till around 2009, both indices corrected.
- Between 2016 and 2018, both indices were broadly in an uptrend.
- In March 2020, all the stocks around the Globe crashed.
4.6. NASDAQ Composite:
The Nasdaq Composite is a stock market index that includes almost all stocks listed on the Nasdaq stock exchange. Along with the Dow Jones Industrial Average and S&P 500, it is one of the three most-followed stock market indices in the United States. The composition of the NASDAQ Composite is heavily weighted towards companies in the information technology sector. [Source]
# splitting nasdaq composite data based on time
nasdaq_train, nasdaq_test = time_based_train_test_split(nasdaq)

# plotting NIFTY 50 wrt nasdaq composite index closing values
plot_nifty_wrt_others(nasdaq_train, nasdaq_test, "NASDAQ Composite Index value", col = "nasdaq close")
Observation:
- When the time range is divided into intervals, it can be observed that both indices move in the same direction.
- Example: Till around 2009, both indices corrected.
- Between 2016 and 2018, both indices were broadly in an uptrend.
4.7. S&P 500
The Standard and Poor’s 500 or simply the S&P 500, is a stock market index tracking the performance of 500 large companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices. [Source]
# splitting S&P 500 based on time
sp500_train, sp500_test = time_based_train_test_split(sp500)

# plotting NIFTY 50 wrt S&P 500 index closing values
plot_nifty_wrt_others(sp500_train, sp500_test, "S&P 500 Index value", col = "sp500 close")
Observation: Both indices moved together, largely overlapping or remaining in a range for most of the period.
4.8. US 10 year treasury yields:
The 10-year Treasury note is a debt obligation issued by the United States government with a maturity of 10 years upon initial issuance. A 10-year Treasury note pays interest at a fixed rate once every six months and pays the face value to the holder at maturity. The U.S. government partially funds itself by issuing 10-year Treasury notes. [Source]
Treasury yield is the return on investment, expressed as a percentage, on the U.S. government’s debt obligations. Looked at another way, the Treasury yield is the effective interest rate that the U.S. government pays to borrow money for different lengths of time. [Source]
# splitting treasury yields data based on time
treasury_train, treasury_test = time_based_train_test_split(treasury_imputed)

# plotting NIFTY 50 wrt treasury yields
plot_nifty_wrt_others(treasury_train, treasury_test, "US 10 year treasury yield", col = "treasury close")
Observation:
- An abnormal change in Treasury yields impacts the index.
- Treasury yields increased from 2.5% to 3.75% around 2011; at the same time, the index went down from 6000 levels to 5000 levels.
- Treasury yields increased from 1.5% to 2.5% around 2017; at the same time, the index went down from 9000 levels to 8000 levels.
4.9. USD-INR exchange rate:
# splitting USD-INR exchange data based on time
usd_inr_train, usd_inr_test = time_based_train_test_split(usd_inr_imputed)

# plotting NIFTY 50 wrt USD-INR exchange rate
plot_nifty_wrt_others(usd_inr_train, usd_inr_test, "USD INR Conversion", col = "usd_inr close")
Observation:
- An abnormal change in the USD-INR rate impacts the index value.
- Example: Around 2009, the USD-INR rate rose steeply from around 45 to around 50, and the index value fell steeply from 4500 levels to 2500 levels.
5. Data Processing and Featurization
Let us merge all the CSV files based on the Date column.
df = pd.merge(nifty_imputed, nifty_pe, on = "Date", how = "left")
df = pd.merge(df, crude_imputed, on = "Date", how = "outer")
df = pd.merge(df, gold_imputed, on = "Date", how = "outer")
df = pd.merge(df, euronext_imputed, on = "Date", how = "outer")
df = pd.merge(df, nasdaq, on = "Date", how = "outer")
df = pd.merge(df, sp500, on = "Date", how = "outer")
df = pd.merge(df, treasury_imputed, on = "Date", how = "outer")
df = pd.merge(df, usd_inr_imputed, on = "Date", how = "outer")
df.sort_values(by = ["Date"], inplace = True)

# adding indicator variable
df["is_nifty_pe_imputed"] = 0
df["crude is_holiday"] = 0
df["gold is_holiday"] = 0
df["euronext is_holiday"] = 0
df["nasdaq is_holiday"] = 0
df["sp500 is_holiday"] = 0
df["treasury is_holiday"] = 0
df["usd_inr is_holiday"] = 0df.head()
50 columns are present in the data frame, but only some of them are shown in the image above.
The data type of every column
Int64Index: 3651 entries, 0 to 3420
Data columns (total 50 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3651 non-null datetime64[ns]
1 Open 3421 non-null float64
2 High 3421 non-null float64
3 Low 3421 non-null float64
4 Close 3421 non-null float64
5 (H-L)*100/L 3421 non-null float64
6 PE_NIFTY50 3414 non-null float64
7 crude open 3584 non-null float64
8 crude high 3584 non-null float64
9 crude low 3584 non-null float64
10 crude close 3584 non-null float64
11 crude (H-L)*100/L 3584 non-null float64
12 gold open 3556 non-null float64
13 gold high 3556 non-null float64
14 gold low 3556 non-null float64
15 gold close 3556 non-null float64
16 gold (H-L)*100/L 3556 non-null float64
17 euronext open 3555 non-null float64
18 euronext high 3555 non-null float64
19 euronext low 3555 non-null float64
20 euronext close 3555 non-null float64
21 euronext (H-L)*100/L 3555 non-null float64
22 nasdaq open 3503 non-null float64
23 nasdaq high 3503 non-null float64
24 nasdaq low 3503 non-null float64
25 nasdaq close 3503 non-null float64
26 nasdaq (H-L)*100/L 3503 non-null float64
27 sp500 open 3503 non-null float64
28 sp500 high 3503 non-null float64
29 sp500 low 3503 non-null float64
30 sp500 close 3503 non-null float64
31 sp500 (H-L)*100/L 3503 non-null float64
32 treasury open 3516 non-null float64
33 treasury high 3516 non-null float64
34 treasury low 3516 non-null float64
35 treasury close 3516 non-null float64
36 treasury (H-L)*100/L 3516 non-null float64
37 usd_inr open 3629 non-null float64
38 usd_inr high 3629 non-null float64
39 usd_inr low 3629 non-null float64
40 usd_inr close 3629 non-null float64
41 usd_inr (H-L)*100/L 3629 non-null float64
42 is_nifty_pe_imputed 3651 non-null int64
43 crude is_holiday 3651 non-null int64
44 gold is_holiday 3651 non-null int64
45 euronext is_holiday 3651 non-null int64
46 nasdaq is_holiday 3651 non-null int64
47 sp500 is_holiday 3651 non-null int64
48 treasury is_holiday 3651 non-null int64
49 usd_inr is_holiday 3651 non-null int64
dtypes: datetime64[ns](1), float64(41), int64(8)
memory usage: 1.4 MB
Note: There will be some missing values after merging all the CSV files because the different markets observe different holidays; when one market is closed on a day others are open, the merge adds missing entries to our data frame.
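A toy example (with assumed dates and values) shows how the outer merge introduces these NaNs:

```python
import pandas as pd

# NIFTY traded on 09-10 Aug; the S&P 500 frame covers 10-11 Aug
nifty = pd.DataFrame({"Date": pd.to_datetime(["2021-08-09", "2021-08-10"]),
                      "Close": [16258.0, 16280.0]})
sp500 = pd.DataFrame({"Date": pd.to_datetime(["2021-08-10", "2021-08-11"]),
                      "sp500 close": [4436.0, 4447.0]})

# outer merge keeps every date from both frames, leaving NaN where one side is missing
merged = pd.merge(nifty, sp500, on = "Date", how = "outer").sort_values("Date")
```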
# displaying NaN rows
na_rows(df).head()
Let us split the data into train, cv, and test sets.
Output:
((2398, 50), (525, 50), (727, 50))
Conditions for filling missing values:
- On some days, some stock markets (other than NIFTY 50) have a holiday. Generally, the previous day's S&P 500 session affects today's NIFTY 50 prices; so if today is a holiday for the S&P 500, it has no effect on tomorrow's NIFTY 50 prices. In that case, we replace the Open and Close values of the S&P 500 with yesterday's values, set (H-L)*100/L to 0, and update the is_holiday indicator to 1.
- On some days, the NIFTY 50 market is closed but other markets are open. For example, 01 May 2020 is a holiday for NIFTY 50 but not for the S&P 500. So, we replace the S&P 500 Close value of 30 Apr 2020 with that of 01 May 2020, and replace the (H-L)*100/L value of 30 Apr 2020 with the mean of the 30 Apr 2020 and 01 May 2020 values. The reason is that the S&P 500 session of 30 Apr 2020 opens after NIFTY 50 has closed, and NIFTY 50 next opens on 02 May 2020, after the S&P 500 closes on 01 May 2020.
- After handling missing entries, the strict condition is that whenever NIFTY 50 data is available on a particular day, no other value should be NaN.
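The first rule can be sketched in a simplified form (the real handle_nan function applies the full set of rules above; the function and column names here are assumptions):

```python
import pandas as pd

def handle_nan_sketch(df, prefix):
    """Simplified stand-in for handle_nan: when a foreign market was closed,
    flag the row, zero out its intraday-range feature, and carry the previous
    day's Open/Close forward."""
    out = df.copy()
    closed = out[f"{prefix} close"].isna()
    out.loc[closed, f"{prefix} is_holiday"] = 1
    out.loc[closed, f"{prefix} (H-L)*100/L"] = 0.0
    for col in (f"{prefix} open", f"{prefix} close"):
        out[col] = out[col].ffill()
    return out

# toy frame: the middle day is an S&P 500 holiday
toy = pd.DataFrame({"sp500 open": [4400.0, None, 4430.0],
                    "sp500 close": [4410.0, None, 4436.0],
                    "sp500 (H-L)*100/L": [0.5, None, 0.4],
                    "sp500 is_holiday": [0, 0, 0]})
filled = handle_nan_sketch(toy, "sp500")
```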
We drop columns like High and Low from the data frames, as we won't use them, and pass the frames through the handle_nan function to handle missing values. Even after this, some values may still be missing because NIFTY 50 itself has missing values on those days; we drop those rows.
# We will not use raw "High", "Low" columns, so let's remove them.
train_cv.drop(columns = ["High", "Low",
"crude high", "crude low",
"gold high", "gold low",
"euronext high", "euronext low",
"nasdaq high", "nasdaq low",
"sp500 high", "sp500 low",
"treasury high", "treasury low",
"usd_inr high", "usd_inr low"],
inplace = True)train.drop(columns = ["High", "Low",
"crude high", "crude low",
"gold high", "gold low",
"euronext high", "euronext low",
"nasdaq high", "nasdaq low",
"sp500 high", "sp500 low",
"treasury high", "treasury low",
"usd_inr high", "usd_inr low"],
inplace = True)cv.drop(columns = ["High", "Low",
"crude high", "crude low",
"gold high", "gold low",
"euronext high", "euronext low",
"nasdaq high", "nasdaq low",
"sp500 high", "sp500 low",
"treasury high", "treasury low",
"usd_inr high", "usd_inr low"],
inplace = True)test.drop(columns = ["High", "Low",
"crude high", "crude low",
"gold high", "gold low",
"euronext high", "euronext low",
"nasdaq high", "nasdaq low",
"sp500 high", "sp500 low",
"treasury high", "treasury low",
"usd_inr high", "usd_inr low"],
inplace = True)# handling NaN values in train, cv and test sets
imputed_train_cv = handle_nan(train_cv)
imputed_train = handle_nan(train)
imputed_cv = handle_nan(cv)
imputed_test = handle_nan(test)# Let us drop the rows where we don't have information of NIFTY 50
imputed_train.dropna(inplace = True)
imputed_test.dropna(inplace = True)
imputed_train_cv.dropna(inplace = True)
imputed_cv.dropna(inplace = True)
One important sanity check to perform is to verify that every High value is greater than or equal to the corresponding Low value. If not, we impute those values with the mean of the (H-L)*100/L column.
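The core of such a check (the article's high_low_test_util also prints a verdict per frame; this helper name and body are an assumption) reduces to a single vectorised comparison:

```python
import pandas as pd

def high_low_ok(df, high_col, low_col):
    """Return True when every High value is >= its corresponding Low value."""
    return bool((df[high_col] >= df[low_col]).all())

# toy frames: one clean, one with an inverted High/Low pair
good = pd.DataFrame({"High": [10.0, 12.0], "Low": [9.0, 11.0]})
bad = pd.DataFrame({"High": [10.0, 10.5], "Low": [9.0, 11.0]})
```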
# checking whether high value is greater than or equal to low value
high_low_test_util(train_cv, "train_cv")
high_low_test_util(train, "train")
high_low_test_util(cv, "cv")
high_low_test_util(test, "test")
Output:
High value is greater than low value. CORRECT DATA for nifty train_cv
High Value is less than Low Value. INCORRECT DATA for crude train_cv
High Value is less than Low Value. INCORRECT DATA for gold train_cv
High value is greater than low value. CORRECT DATA for euronext train_cv
High value is greater than low value. CORRECT DATA for nasdaq train_cv
High value is greater than low value. CORRECT DATA for sp500 train_cv
High value is greater than low value. CORRECT DATA for treasury train_cv
High value is greater than low value. CORRECT DATA for usd_inr train_cv
High value is greater than low value. CORRECT DATA for nifty train
High Value is less than Low Value. INCORRECT DATA for crude train
High Value is less than Low Value. INCORRECT DATA for gold train
High value is greater than low value. CORRECT DATA for euronext train
High value is greater than low value. CORRECT DATA for nasdaq train
High value is greater than low value. CORRECT DATA for sp500 train
High value is greater than low value. CORRECT DATA for treasury train
High value is greater than low value. CORRECT DATA for usd_inr train
High value is greater than low value. CORRECT DATA for nifty cv
High value is greater than low value. CORRECT DATA for crude cv
High value is greater than low value. CORRECT DATA for gold cv
High value is greater than low value. CORRECT DATA for euronext cv
High value is greater than low value. CORRECT DATA for nasdaq cv
High value is greater than low value. CORRECT DATA for sp500 cv
High value is greater than low value. CORRECT DATA for treasury cv
High value is greater than low value. CORRECT DATA for usd_inr cv
High value is greater than low value. CORRECT DATA for nifty test
High value is greater than low value. CORRECT DATA for crude test
High value is greater than low value. CORRECT DATA for gold test
High value is greater than low value. CORRECT DATA for euronext test
High value is greater than low value. CORRECT DATA for nasdaq test
High value is greater than low value. CORRECT DATA for sp500 test
High value is greater than low value. CORRECT DATA for treasury test
High value is greater than low value. CORRECT DATA for usd_inr test
- From the above, it is clear that all High-Low data is correct except crude train_cv, gold train_cv, crude train, and gold train. So, we replace the negative values with the mean of the (H-L)*100/L column.
Output:
High value is greater than low value. CORRECT DATA for crude train_cv
High value is greater than low value. CORRECT DATA for gold train_cv
High value is greater than low value. CORRECT DATA for crude train
High value is greater than low value. CORRECT DATA for gold train
Let us add other features, namely (C-O)*100/O and (𝑋₂-𝑋₁)*100/𝑋₁,
where 𝑋₂ = today's value,
𝑋₁ = yesterday's value,
and 𝑋ᵢ can be the Open or Close value.
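A minimal sketch of these two features (the add_features function used below also handles the per-index columns; the helper name here is an assumption):

```python
import pandas as pd

def add_pct_features(df):
    """Add the intraday move (C-O)*100/O and the day-over-day change
    (X2-X1)*100/X1 for the Open and Close columns."""
    out = df.copy()
    out["(C-O)*100/O"] = (out["Close"] - out["Open"]) * 100 / out["Open"]
    for col in ("Open", "Close"):
        out[f"{col} (X2-X1)*100/X1"] = out[col].pct_change() * 100
    return out

toy = pd.DataFrame({"Open": [100.0, 110.0], "Close": [110.0, 121.0]})
feats = add_pct_features(toy)
```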
# adding new features
imputed_train_cv = add_features(imputed_train_cv)
imputed_train = add_features(imputed_train)
imputed_cv = add_features(imputed_cv)
imputed_test = add_features(imputed_test)

imputed_train_cv.shape, imputed_train.shape, imputed_cv.shape, imputed_test.shape
Output:
((2738, 66), (2239, 66), (498, 66), (683, 66))
It is observed that USD-INR never has a holiday in any data frame, so the indicator variable usd_inr is_holiday is removed and the CSV files are saved.
# usd_inr never had a holiday. So, removing it
imputed_train_cv.drop(columns = ["usd_inr is_holiday"], inplace = True)
imputed_train.drop(columns = ["usd_inr is_holiday"], inplace = True)
imputed_cv.drop(columns = ["usd_inr is_holiday"], inplace = True)
imputed_test.drop(columns = ["usd_inr is_holiday"], inplace = True)

# saving the data
imputed_train_cv.to_csv("final_train_cv_features.csv", index = False)
imputed_cv.to_csv("final_cv_features.csv", index = False)
imputed_train.to_csv("final_train_features.csv", index = False)
imputed_test.to_csv("final_test_features.csv", index = False)
6. Experimentation and Results
6.1. Baseline model: The range of RMSE is [0, ∞). Let us build a baseline model and compare the RMSE of the ML and DL models against it. The mean model (predicting the mean of the train closing values for every day) is taken as our baseline.
Test RMSE of the baseline model is 1863.719
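The mean-model baseline can be sketched in a few lines (the helper name is an assumption):

```python
import numpy as np

def baseline_rmse(train_close, test_close):
    """Mean model: predict the train-set mean for every test day, report RMSE."""
    test_close = np.asarray(test_close, dtype=float)
    pred = np.full(len(test_close), np.mean(train_close))
    return float(np.sqrt(np.mean((test_close - pred) ** 2)))
```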
6.2. Simple Moving Average Model (SMA): In SMA, the mean of the last i data points is predicted as the index value on the (i+1)th day.
Let us do hyperparameter tuning to find the best value for i.
SMA(nifty_train["Close"], range(1, 50))
Observation: From the above plot, we can see that train RMSE increases as the window size increases. So, let us take window size = 1, compute the test RMSE, and visualize the results.
Test RMSE of SMA is 150.375
NOTE: By taking window size = 1, we are simply predicting yesterday's closing value as today's closing value.
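The SMA prediction described above can be sketched as follows (the function name is an assumption; the article's SMA helper also plots RMSE per window):

```python
import numpy as np

def sma_predictions(close, window):
    """Predict day i's close as the mean of the previous `window` closes."""
    close = np.asarray(close, dtype=float)
    return np.array([close[i - window:i].mean()
                     for i in range(window, len(close))])
```

With window = 1 this reduces to predicting yesterday's close as today's close.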
6.3. Weighted Moving Average Model (WMA): In WMA, the weighted mean of the last n days is predicted as the index value on the (n+1)th day.
Note: Price₁ is the most recent value.
Let us do hyperparameter tuning to find the best value for n.
WMA(nifty_train["Close"], range(1, 50))
Observation: From the above plot, we can see that train RMSE is lowest when window size = 1, i.e., when we predict yesterday's closing price as today's closing price.
Note: WMA with window size = 1 is the same as SMA with window size = 1.
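A sketch of the WMA prediction, assuming the common linear weighting where the most recent close gets the largest weight (the exact weights used by the article's WMA helper are not shown):

```python
import numpy as np

def wma_predictions(close, window):
    """Weighted moving average: weights 1, 2, ..., window from oldest to
    newest close, normalised to sum to 1."""
    close = np.asarray(close, dtype=float)
    w = np.arange(1, window + 1, dtype=float)
    w /= w.sum()
    return np.array([close[i - window:i] @ w
                     for i in range(window, len(close))])
```

With window = 1 this is identical to SMA with window = 1, matching the note above.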
6.4. Exponential Moving Average Model (EMA): In EMA, the exponentially weighted mean of the last t data points is predicted as the index value on the (t+1)th day.
Let us do hyperparameter tuning to find the best value for t (window size) and α (alpha).
alpha = [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1]
EMA(nifty_train["Close"], range(1, 1000), alpha)
Observations:
- alpha = 1 is the same as SMA, so we ignore alpha = 1.
- Keeping alpha constant, the RMSE first decreases, reaches a minimum, and then increases.
- As alpha increases, the minimum RMSE of a particular plot decreases.
Choosing alpha = 0.9 and window size = 473 and computing test RMSE.
Test RMSE for Exponential Moving Average Model is 149.114
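A sketch of the exponentially weighted prediction, assuming the common convention that lag k (k = 0 being the most recent day) gets weight proportional to alpha·(1−alpha)^k over the window (the article's EMA helper does not show its exact weighting):

```python
import numpy as np

def ema_predictions(close, window, alpha):
    """Exponentially weighted mean of the last `window` closes."""
    close = np.asarray(close, dtype=float)
    w = alpha * (1 - alpha) ** np.arange(window)   # newest -> oldest
    w = w[::-1] / w.sum()                          # reorder oldest -> newest
    return np.array([close[i - window:i] @ w
                     for i in range(window, len(close))])
```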
6.5. ARIMA Model: ARIMA stands for AutoRegressive Integrated Moving Average.
AR (p) process: In the AR process, the output variable is a linear function of its previous p values.
MA (q) process: In the MA process, the output variable is a linear function of previous q error terms.
I (d): Differencing of original observations is done to make time-series stationary. d is the number of times differencing of original observations is done to make time-series stationary.
Conditions for a time series to be stationary:
- constant mean for all t
- constant standard deviation for all t
- autocovariance c(k, l) depends only on the difference (k - l)
Dickey-Fuller test for time series stationarity:
Null Hypothesis: time series is NOT stationary
Alternate Hypothesis: time series is stationary
# check stationarity: mean, variance and adfuller test
date = nifty_train["Date"]
check_mean_std(nifty_train["Close"], date)
check_adfuller(nifty_train["Close"])
NULL HYPOTHESIS: time series is NOT stationary
ALTERNATE HYPOTHESIS: time series is stationary
Test statistic: -0.30171708973006334
p-value: 0.9252607886097605
Critical Values: {'1%': -3.4327493635995006, '5%': -2.8626000650414487, '10%': -2.5673343070718775}
Test statistic is greater than 5% critical value, ACCEPT NULL HYPOTHESIS
p-value is greater than 0.05, ACCEPT NULL HYPOTHESIS
As the p-value is greater than 0.05, we accept the null hypothesis, i.e., the time series is not stationary. Just by looking at the plot, we can see that the NIFTY 50 closing values have an upward trend.
Let us take the difference of the time series and test for stationarity
periods = 1
ts = nifty_train["Close"]
date = nifty_train["Date"]

# differencing method
ts_diff = ts - ts.shift(periods = periods)
ts_diff.dropna(inplace = True)

# check stationarity
check_mean_std(ts_diff, date.values[periods:])
check_adfuller(ts_diff)
NULL HYPOTHESIS: time series is NOT stationary
ALTERNATE HYPOTHESIS: time series is stationary
Test statistic: -20.231475637451954
p-value: 0.0
Critical Values: {'1%': -3.4327493635995006, '5%': -2.8626000650414487, '10%': -2.5673343070718775}
Test statistic is less than 5% critical value, REJECT NULL HYPOTHESIS
p-value is less than 0.05, REJECT NULL HYPOTHESIS
As the p-value is less than 0.05, we reject the null hypothesis, i.e., the differenced series is stationary. So, d = 1 in our ARIMA model.
Let us plot the Autocorrelation function plot and the Partial Autocorrelation function plot of the differenced time series to get the values for p and q.
From the plots, the PACF crosses the upper confidence interval after the 1st lag, so p = 2; similarly, the ACF crosses the upper confidence interval after the 1st lag, so q = 2.
Let us build the ARIMA model with p = 2, d = 1, q = 2 and visualize the results.
Test RMSE of the ARIMA model is 1875.698
Till now, we have used only the NIFTY 50 time series for prediction. Now let us use the other computed features to predict the NIFTY 50 index closing values.
6.6. Linear Regression: The output of the linear regression model is a linear function of the input features. The features considered are the Close values of the various indices, the NIFTY 50 PE, and the various percentage-difference features of the different indices and variables like the USD-INR exchange rate, Treasury yields, etc. Let us fit the model and visualize the results.
Test RMSE of linear regression is 149.809
Note: Before passing the data to the linear regression model, the data is scaled.
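A minimal sketch of this scale-then-fit setup (on synthetic data; the article's exact feature matrix is not reproduced here):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# synthetic regression data standing in for the engineered features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 1.0, 0.0]) + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

# scaling is folded into the pipeline so the test split is transformed consistently
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
```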
6.7. Linear Regression with L1 regularization: In the above model, we did not add any regularization term. So, let us add L1 regularization, do hyperparameter tuning and visualize the results.
For different values of the regularization parameter alpha, the train and cross-validation RMSEs are calculated and plotted as shown above. For alpha = 0.001, we got a low test RMSE.
Test RMSE of linear regression with L1 regularization is 144.816
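The alpha sweep for L1 regularization can be sketched as below, again on synthetic data, so the best alpha need not match the article's 0.001:

```python
# Sketch of the alpha sweep for Lasso (L1) regression on synthetic data;
# data is scaled first, as noted in the article.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=300)
X_tr, X_cv, y_tr, y_cv = X[:225], X[225:], y[:225], y[225:]  # no shuffling

results = {}
for alpha in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha)).fit(X_tr, y_tr)
    results[alpha] = mean_squared_error(y_cv, model.predict(X_cv)) ** 0.5
    print(f"alpha={alpha}: cv RMSE={results[alpha]:.4f}")
```

Large alpha shrinks all coefficients toward zero and underfits, which is why the train/CV curves diverge at the right of the plot.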
6.8. Linear Regression with L2 regularization: Instead of L1 regularization, let us add L2 regularization and see the results.
For different values of the regularization parameter alpha, the train and cross-validation RMSEs are calculated and plotted as shown above. For alpha = 0.033, we got the lowest test RMSE.
Test RMSE of linear regression with L2 regularization is 149.603
6.9. Linear Regression with L1 and L2 regularization: Let us include both L1 and L2 regularization. The hyperparameters are alpha and l1_ratio.
To balance both train and cross-validation RMSEs, we choose the hyperparameters alpha = 0.001 and l1_ratio = 0.9. Let us visualize the results.
Test RMSE of linear regression with L1 and L2 regularization is 144.851
6.10. Support Vector Regressor: Let us try SVR and see the results. The hyperparameters considered for this model are regularization parameter C and kernel.
From the above two heatmaps, we can see that the linear kernel performs far better than the other kernels. To balance both train and cross-validation RMSEs, we choose C = 3.3 and kernel = linear. Let us visualize the results.
Test RMSE of the Support Vector Regressor is 186.138
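The grid over C and kernel can be sketched with scikit-learn's GridSearchCV. The data here is a synthetic linear target, so the chosen values need not match the article's C = 3.3:

```python
# Sketch of the SVR grid search over C and kernel on synthetic data.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(StandardScaler(), SVR())
grid = GridSearchCV(
    pipe,
    {"svr__C": [0.1, 1.0, 3.3, 10.0], "svr__kernel": ["linear", "rbf"]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)
```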
6.11. LSTM: Let us now try deep learning models and see the results. As the data is a time series, i.e., sequential information, LSTM models are a natural choice. Let us first consider only the NIFTY 50 closing values and try to do the prediction.
Let us prepare the data needed for LSTM.
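The windowing step can be sketched as below. The function name prepare_data comes from the article; its body here is an assumption, turning the closing-price series into input windows of length steps with the next value as the target:

```python
# Sketch of a prepare_data windowing helper for the LSTM (body is assumed).
import numpy as np

def prepare_data(values, steps):
    """Build sliding windows of length `steps` and next-step targets."""
    X, y = [], []
    for i in range(len(values) - steps):
        X.append(values[i:i + steps])
        y.append(values[i + steps])
    X = np.array(X).reshape(-1, steps, 1)  # (samples, timesteps, features)
    return X, np.array(y)

X, y = prepare_data(np.arange(10, dtype=float), steps=3)
print(X.shape, y.shape)  # (7, 3, 1) (7,)
```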
The hyperparameters tuned are steps (in the prepare_data function), LSTM units, and dense layer units. Each architecture is trained for 3 trials. For steps = 180, LSTM units = 1024, and dense units = 1024, the lowest test RMSE is achieved.
Test RMSE of LSTM model is 168.314
6.12. LSTM with other features: Let us add the computed features of the other indices and visualize the results.
Different hyperparameters are tuned to achieve low test RMSE.
Test RMSE of the LSTM model with other features is 166.954
Results
+-----------------------------------------------------+-----------+
| Model | test RMSE |
+-----------------------------------------------------+-----------+
| Baseline Model | 1863.719 |
| Simple Moving Average | 150.375 |
| Exponential Moving Average Model | 149.114 |
| ARIMA(2, 1, 2) Model | 1875.698 |
| Linear Regression Model | 149.809 |
| Linear Regression Model with L1 Regularization | 144.816 |
| Linear Regression Model with L2 Regularization | 149.603 |
| Linear Regression Model with L1 & L2 Regularization | 144.851 |
| Support Vector Regressor (linear) | 186.138 |
| LSTM | 168.314 |
| LSTM with other features | 166.954 |
+-----------------------------------------------------+-----------+
From the results, it is clear that linear regression with L1 regularization gave the lowest RMSE of 144.816.
7. Error Analysis
Let us take our best model, i.e., linear regression with L1 regularization, and see where our model is performing badly. We will consider the absolute percentage error between true and predicted values.
Observations:
- From the above plot, we can see that whenever there is a steep rise or fall in the index value, the absolute percentage error is higher.
- Example: The absolute percentage error shot up to 14% around March 2020 due to the Covid-19 uncertainty.
Let us see how many predictions fall above certain absolute percentage error thresholds.
Percentage of predictions with absolute percentage error greater than 2 is 7.93
Percentage of predictions with absolute percentage error greater than 4 is 2.2
Percentage of predictions with absolute percentage error greater than 6 is 0.73
Percentage of predictions with absolute percentage error greater than 8 is 0.29
Percentage of predictions with absolute percentage error greater than 10 is 0.15
Percentage of predictions with absolute percentage error greater than 12 is 0.15
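The threshold counts above can be sketched as below. The data is synthetic, so the numbers differ from the article's:

```python
# Sketch of computing the percentage of predictions whose absolute
# percentage error (APE) exceeds a given threshold.
import numpy as np

def pct_above_threshold(y_true, y_pred, threshold):
    """Percentage of predictions with APE greater than `threshold` percent."""
    ape = np.abs((y_true - y_pred) / y_true) * 100
    return np.mean(ape > threshold) * 100

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([101.0, 210.0, 303.0, 398.0])
print(pct_above_threshold(y_true, y_pred, 2))  # 25.0: only the 5% error case
```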
8. Data Pipeline
Single-day prediction on a market holiday
%%time
single_day_prediction("2018-11-04")
Output:
Please enter a non-stock market holiday date between '2018-11-02' and '2021-08-10' in 'YYYY-MM-DD' format
CPU times: user 2.3 ms, sys: 0 ns, total: 2.3 ms
Wall time: 2.32 ms
Single-day prediction on a market working day
%%time
single_day_prediction("2018-11-05")
Output:
Actual closing index value for next working day is: 10530.0
Predicted closing index value for next working day is: 10520.252086858969
CPU times: user 3.26 ms, sys: 0 ns, total: 3.26 ms
Wall time: 3.86 ms
Multiple-day prediction
%%time
multiple_days_prediction()
RMSE: 118.618
CPU times: user 222 ms, sys: 9.54 ms, total: 232 ms
Wall time: 232 ms
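The date check inside single_day_prediction can be sketched as below, assuming the pipeline keeps a set of trading days. The helper names and the tiny calendar are illustrative, not the article's actual code:

```python
# Sketch of the date validation behind single_day_prediction; the trading
# calendar here is a tiny illustrative stand-in.
from datetime import date

TRADING_DAYS = {date(2018, 11, 2), date(2018, 11, 5), date(2018, 11, 6)}
START, END = date(2018, 11, 2), date(2021, 8, 10)

def validate_date(d: date) -> bool:
    """Return True only for trading days inside the supported date range."""
    if not (START <= d <= END):
        print(f"Please enter a date between {START} and {END}")
        return False
    if d not in TRADING_DAYS:
        print("Please enter a non-stock market holiday date")
        return False
    return True

print(validate_date(date(2018, 11, 4)))  # False: market holiday (a Sunday)
print(validate_date(date(2018, 11, 5)))  # True: working day
```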
9. Deployment
The best model, i.e., linear regression with L1 regularization, is deployed on AWS. Link to web-app. The video below shows some sample predictions.
10. Conclusion
Stock prices are extremely volatile and highly non-linear, and accurate prediction of stock prices is an extremely difficult task. As India is an emerging economy, global events impact the Indian stock market. In this case study, we tried to incorporate global factors to predict the NIFTY 50 closing value. Experiments were done with different algorithms, and linear regression with L1 regularization gave the lowest RMSE. One important thing to note: in the Simple Moving Average model, we simply predicted yesterday's NIFTY 50 closing value as today's closing value and got an RMSE of 150.375, while linear regression with L1 regularization got an RMSE of 144.816. This shows just how difficult stock price prediction is: we considered only a few of the many variables that impact NIFTY 50, yet we could only modestly improve on the SMA model. With the multi-factor approach, we also slightly decreased the RMSE of the LSTM model from 168.314 to 166.954, which suggests that other indices do affect Indian indices.
11. Future Work
- Other indices like the Nikkei 225 of the Tokyo Stock Exchange, the Straits Times Index of the Singapore Exchange, etc., can be added as features to decrease RMSE.
- In our analysis, we removed the Volume column. Abnormal changes in volume also impact the index, so volume could be reintroduced as a feature.
- In the short term, the market is influenced by news. The polarity of news sentiment can be incorporated as a feature to predict the index value and decrease RMSE.
- Deep learning models are data-hungry. More data should be collected to train complex DL models.
Code Reference
https://github.com/nagi1995/nifty-50-prediction-a-multi-factor-approach
Contact Links
Email: binginagesh@gmail.com
LinkedIn: https://www.linkedin.com/in/bingi-nagesh-5b0412b7/
References
[1] appliedaicourse.com
[2] Stock Closing Price Prediction using Machine Learning Techniques.
[3] Stock Price Prediction Using the ARIMA Model.
[4] Stock price prediction using LSTM, RNN and CNN-sliding window model.
[5] Stock market prediction using new sentiments.
[6] Multi-factor estimation of stock index movement: A case analysis of NIFTY 50, National Stock Exchange of India.