Description

The project aims to predict future stock prices by analyzing historical values. The dataset contains more than 3,700 ticker symbols, with some CSV files dating back to 1970. However, to keep the models accurate and the scope manageable, the team selected only the 410 tickers that belong to the S&P 500.

To improve accuracy, the project uses multiple models: ARIMA, LSTM, and RNN. While RNN and LSTM are powerful models, they require substantial computational resources to run; ARIMA models are better suited to accounting for seasonality in the data.

The recommendation system combines the results from all three models to identify stocks that have a higher probability of being categorized as either “Buy” or “Sell.” The recommendation engine follows simple rules, depending on whether it’s for day trading or short-term investments.

For day trading, the recommendation engine identifies stocks that have the highest percentage difference between their low and high values. For short-term investments, if all models agree on the direction of the move from open to close, a “Buy” is recommended when the predicted closing price is greater than the opening price, and a “Sell” when it is less.
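As a rough illustration, the sketch below encodes these two rules in Python. The column names and the unanimous-agreement check are assumptions about how the engine is wired, not the project's actual code.

```python
import pandas as pd

def day_trading_picks(predictions: pd.DataFrame, top_n: int = 10) -> pd.Series:
    """Rank tickers by predicted intraday swing, (High - Low) / Low."""
    swing = (predictions["High"] - predictions["Low"]) / predictions["Low"]
    return swing.sort_values(ascending=False).head(top_n)

def short_term_signal(arima: dict, lstm: dict, rnn: dict) -> str:
    """Emit Buy/Sell only when all three models agree on the direction."""
    directions = [m["Close"] - m["Open"] for m in (arima, lstm, rnn)]
    if all(d > 0 for d in directions):
        return "Buy"   # every model predicts the close above the open
    if all(d < 0 for d in directions):
        return "Sell"  # every model predicts the close below the open
    return "Hold"      # the models disagree, so no recommendation
```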

Overall, the project utilizes various models and techniques to provide accurate recommendations to traders and investors.

The project utilizes a dataset from Kaggle that contains extensive data on a variety of tickers dating back to the 1970s. The dataset includes open, close, volume, high, low, and adjusted close values for each date. It covers more than 3,700 different ticker symbols, but for the scope of this project, the team limited itself to the data available on 410 stocks that form part of the S&P 500 composite.
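For illustration, loading one ticker's file might look like the following sketch; the data/ path is a hypothetical layout, and the column names follow the fields listed above.

```python
import pandas as pd

# "data/AAPL.csv" is a hypothetical path; adjust to wherever the Kaggle
# files are unpacked. Each file holds one ticker's daily history.
df = pd.read_csv("data/AAPL.csv", parse_dates=["Date"], index_col="Date")

# Expected columns: Open, High, Low, Close, Adj Close, Volume
print(df.head())
```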

Data cleaning was performed on the dataset to fill gaps for some of the tickers, where dates in the middle of the series were missing values. The team filled these gaps by carrying the previous known value forward using the ffill function, and used dropna to remove unknown values at the beginning of a given ticker's series.

For comparisons, multiple such symbols can be combined into a single data frame, where dropping NA yields only the subset of dates on which all the tickers have values.
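A minimal pandas sketch of these steps, assuming the per-ticker file layout from the loading example above (load_ticker is a hypothetical helper):

```python
import pandas as pd

def load_ticker(symbol: str) -> pd.DataFrame:
    # Hypothetical per-ticker file layout, as in the loading sketch above.
    return pd.read_csv(f"data/{symbol}.csv", parse_dates=["Date"], index_col="Date")

def clean_close(df: pd.DataFrame) -> pd.Series:
    # ffill carries the last known value across interior gaps; dropna then
    # removes the leading NaNs before the ticker's first recorded quote.
    return df["Close"].ffill().dropna()

# Align several tickers; the final dropna keeps only the dates on which
# every ticker has a value, as described above.
tickers = ["AAPL", "MSFT", "CSCO"]
combined = pd.concat({t: clean_close(load_ticker(t)) for t in tickers}, axis=1).dropna()
```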

The team used autocorrelation for a preliminary analysis of the stocks, to find what affects the stock prices and whether there is seasonality involved. This analysis shows which previous values most strongly influence the future values of a stock, and can be helpful in building better models.
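A quick way to run this check, reusing the helpers from the cleaning sketch above; the 60-lag window is an arbitrary choice for illustration:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

close = clean_close(load_ticker("CSCO"))  # helpers from the cleaning sketch

plot_acf(close, lags=60)  # significant spikes at regular lags hint at seasonality
plt.show()

print(close.autocorr(lag=1))  # single-lag autocorrelation, directly from pandas
```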

The team generated ARIMA, LSTM, and RNN models for each ticker symbol for the “Open,” “Close,” “Low,” and “High” data frames. Together that constitutes 410 tickers × 4 value types × 3 model types = 4,920 models. For ARIMA alone, the algorithm explores at least 9 candidate models to pick the best one, so model generation and selection is a computationally heavy and very time-consuming process.

Overall, the project uses a comprehensive dataset and utilizes advanced techniques to generate multiple models for each ticker symbol, providing accurate predictions for traders and investors.

Below is what a subsample of a few stocks looks like. There are occasionally gaps in the dataset, so null values had to be removed before running any models. For the correlations, we only compared values across stocks on dates where each stock had a value.

Exploring the data, we found that some seasonality trends and some autocorrelation exist. The broader market that stocks are part of can be subject to trends and events such as holidays, the start of a war, a climate catastrophe, or news of inflation. These events do not affect all stocks, nor do they affect them equally.

CORRELATION – BEFORE REMOVING TRENDS

Here we see a high correlation among all the stock values except American Airlines Group. The correlation is high because the values have all been increasing. However, once we remove the trends and seasonalities, we see a different picture.
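One way to reproduce this comparison is to difference the series before correlating, reusing the combined frame from the cleaning sketch above; the project may have detrended differently, so this is only a sketch.

```python
# Day-over-day changes remove the shared upward trend.
detrended = combined.diff().dropna()

print(combined.corr())   # raw prices: correlations inflated by the common uptrend
print(detrended.corr())  # differenced series: a more honest view of co-movement
```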

The ARIMA method is a widely used time series forecasting method that exploits autocorrelation in the data. It requires tuning three parameters, p, q, and d, which respectively represent the order of the AR term, the order of the MA term, and the order of differencing needed to make the time series stationary. The model performs better when a seasonal term is included for data with seasonality. The pmdarima implementation is used to select the best values of p, q, and d automatically, which avoids errors and saves time. The confidence interval of an ARIMA forecast tends to grow as we look further into the future, which means the method is best suited for day trading and short-term investing.
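A minimal pmdarima sketch of this selection step, applied to the cleaned close series from the earlier sketches; the search bounds and forecast horizon are illustrative assumptions:

```python
import pmdarima as pm

model = pm.auto_arima(
    close,               # cleaned "Close" series from the sketches above
    start_p=0, max_p=3,  # AR order search range (assumed bounds)
    start_q=0, max_q=3,  # MA order search range (assumed bounds)
    d=None,              # let the library choose the differencing order
    seasonal=False,      # set seasonal=True with m=<period> if seasonality is known
    stepwise=True,       # stepwise search rather than a full grid, to save time
)

# Confidence intervals widen with the horizon, as noted above.
forecast, conf_int = model.predict(n_periods=10, return_conf_int=True)
```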

LSTM models were used for predicting stock prices, with an 80/20 train/test split. Initially, a custom model was used, but it took too long to run, so the team switched to a standard Keras Sequential model with an activation function and one dense layer. This model performed well on some stocks but overfit on others; using the same model type on all 410 tickers is prone to overfitting.
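A sketch of this kind of small Sequential model; the window length, layer size, and training settings are assumptions, since the report does not specify them:

```python
import numpy as np
import tensorflow as tf

def make_windows(series: np.ndarray, window: int = 30):
    """Turn a 1-D series into (samples, window, 1) inputs and next-day targets."""
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    return X[..., np.newaxis], series[window:]

values = close.to_numpy(dtype="float32")  # cleaned "Close" series from above
split = int(len(values) * 0.8)            # 80/20 train/test split
X_train, y_train = make_windows(values[:split])

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(50, input_shape=(30, 1)),
    tf.keras.layers.Dense(1),             # one dense output layer, as in the report
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=10, verbose=0)
```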

A Recurrent Neural Network (RNN) is a type of neural network specialized for sequential and time-series data. The RNN's hidden layer carries information from earlier time steps forward, letting the network use historical inputs when processing time-series data. In this model, the data was split into an 80/20 ratio and trained for 10 epochs. The TensorFlow package was used to access Keras layers such as SimpleRNN, which allowed the layers to be set based on the shape of the input array.
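The RNN variant differs mainly in the recurrent layer; a sketch reusing the windowed inputs from the LSTM example above:

```python
import tensorflow as tf

rnn = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(50, input_shape=(30, 1)),  # hidden state carries history
    tf.keras.layers.Dense(1),
])
rnn.compile(optimizer="adam", loss="mse")
rnn.fit(X_train, y_train, epochs=10, verbose=0)  # 10 epochs, per the report
```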

The LSTM model predicts that the customer should buy AMAT, TJX, and CDNS for the highest raw profits, and buy CNWT, NTRR, and NMHLY for the highest percentage profits. The customer should avoid buying BRK-A, AZO, and IDXX as they are expected to see the largest gross losses, while INTH, CPICQ, and NOXL have the highest predicted percentage losses. On the other hand, the RNN model suggests buying NTRR, TROW, and GPN for the highest gross percentage profit, while avoiding buying INTU, MDT, and IDXX. The two models predict different stocks, and it is difficult to determine which model is more accurate.

The following graphs compare the results of the three models for the Open, Close, High, and Low series for CSCO.

They show that the models may initially produce similar predictions but can start to disagree vastly further into the future. This means that the nearer the forecast horizon, the higher the probability that the predicted values are correct.

Reflection and Learning Goals

This course’s practical application of the analytics techniques I learned in previous classes was very helpful. Building big data analytics pipelines is another important skill, as it enables me to process and analyze large volumes of data in an efficient and effective way.

Gaining actionable insights from data is ultimately what analytics is all about. By using the techniques I learned in the course, I can identify patterns, trends, and correlations in the data that can help me make better decisions and create a competitive advantage for any organization.

Translating a business challenge into an analytics challenge involves identifying a specific problem or question that a business is trying to solve or answer, and then determining the data and analytics techniques needed to address it. For example, a business challenge might be to understand why sales have been declining in a particular region or to predict which customers are most likely to churn.

Once the business challenge has been identified, different analytics techniques can be used to make predictions and gain insights. Linear and logistic regression can be used to model relationships between variables and make predictions based on those relationships. Decision trees can help identify the most important variables and their relationships to the outcome variable, while neural networks can handle complex and nonlinear relationships.

Data science can be used to gain actionable insights by identifying patterns and trends in large datasets. Python can be used to build big data analytics pipelines, which are sets of tools and techniques used to collect, store, process, and analyze large amounts of data. Classic and state-of-the-art machine learning techniques can be used to create predictive models that can help businesses make informed decisions.

Overall, this course has helped me develop a strong foundation in advanced analytics, which can be applied to a wide range of business challenges.

The full project can be found here.