My Projects
-
Analysis of Earnings Calls: Unveiling Insights and Categorizing Stocks in the ARKK ETF
(NLP, Supervised & Unsupervised Learning)
This project leverages automation and natural language processing (NLP) techniques to analyze earnings call transcripts of various companies. The analysis categorizes stocks within the ARKK ETF and uncovers meaningful topics discussed during earnings calls, providing insights for informed investment decisions. Served as the designated group project lead.
- Web Mining for Earnings Call URLs using Selenium: Utilized Selenium to programmatically navigate to the Seeking Alpha earnings call section for each stock ticker in the selected ETFs. Constructed the URL for each ticker's earnings call page, extracted the first earnings call URL listed, and stored the ticker symbol and corresponding URL in a MongoDB collection (a sketch of this step follows the list below).
- Extract, Transform, Load (ETL) and Natural Language Processing (NLP): Retrieved earnings call transcripts using BeautifulSoup and Selenium. Preprocessed the text by removing punctuation, converting it to lowercase, and tokenizing the documents; applied part-of-speech tagging with spaCy, filtered relevant words, removed stop words, and lemmatized. Converted the cleaned text into TF-IDF features using TfidfVectorizer from sklearn (preprocessing sketched after this list).
- Supervised Learning with Logistic Regression: Paired the TF-IDF matrix with a target variable indicating inclusion in the ARKK ETF. Trained a Logistic Regression model using cross-validation, analyzed the coefficients to identify significant words, and generated word clouds for positive and negative coefficients. Recommended stocks for the ARKK ETF based on predicted probabilities and built a narrative around the most influential terms (model sketched after this list).
- Unsupervised Learning with Latent Dirichlet Allocation (LDA): Created a dictionary of words using gensim, applied LDA to discover representative topics in the earnings call transcripts, and visualized these topics with word clouds. Compared topic distributions for ARKK and non-ARKK stocks, identifying distinctive themes and business models associated with each group (sketched after this list).
- Outcome Analysis: The project provided a comprehensive view of the language and themes that distinguish stocks held in the ARKK ETF and identified potential candidates for inclusion based on the supervised and topic analyses.
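A minimal sketch of the URL-harvesting step, assuming Selenium 4 with Chrome and a local MongoDB instance; the Seeking Alpha URL pattern and the CSS selector are illustrative guesses rather than the exact ones used in the project:

```python
# Hedged sketch: harvest the first earnings-call URL per ticker, store in MongoDB.
from selenium import webdriver
from selenium.webdriver.common.by import By
from pymongo import MongoClient

MONGO_URI = "mongodb://localhost:27017"  # assumption: local MongoDB instance
tickers = ["TSLA", "ROKU", "COIN"]       # assumption: sample of ETF tickers

driver = webdriver.Chrome()
collection = MongoClient(MONGO_URI)["etf_project"]["earnings_call_urls"]

for ticker in tickers:
    # Assumed URL pattern for a ticker's earnings-call listing page.
    driver.get(f"https://seekingalpha.com/symbol/{ticker}/earnings/transcripts")
    # Grab the first transcript link on the page (selector is a guess).
    links = driver.find_elements(By.CSS_SELECTOR, "a[href*='/article/']")
    if links:
        collection.insert_one({"ticker": ticker,
                               "url": links[0].get_attribute("href")})

driver.quit()
```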
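A condensed sketch of the preprocessing and vectorization flow, assuming spaCy's en_core_web_sm model is installed; the POS classes kept and the max_features cap are assumptions:

```python
# Hedged sketch: clean each transcript, then vectorize the corpus with TF-IDF.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")  # assumption: this spaCy model is installed

def preprocess(text: str) -> str:
    """Lowercase, tokenize, POS-filter, drop stop words, and lemmatize."""
    doc = nlp(text.lower())
    keep = {"NOUN", "VERB", "ADJ"}  # assumption: POS classes retained
    return " ".join(t.lemma_ for t in doc
                    if t.pos_ in keep and not t.is_stop and not t.is_punct)

transcripts = ["Revenue grew strongly this quarter.",
               "Margins declined on higher input costs."]
cleaned = [preprocess(t) for t in transcripts]

vectorizer = TfidfVectorizer(max_features=5000)  # cap is an assumption
tfidf = vectorizer.fit_transform(cleaned)        # sparse (n_docs, n_terms)
```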
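A toy sketch of the supervised step using sklearn's LogisticRegressionCV; the documents and labels below are fabricated stand-ins, purely to show the mechanics of coefficient inspection and probability-based ranking:

```python
# Hedged sketch: TF-IDF features vs. ARKK membership, then term inspection.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

docs = ["innovation genomics growth platform",   # fabricated toy transcripts
        "disruptive ai robotics software",
        "streaming fintech crypto adoption",
        "dividend utility stable yield",
        "oil pipeline refining margins",
        "bank deposits interest income"]
in_arkk = np.array([1, 1, 1, 0, 0, 0])           # fabricated labels

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

model = LogisticRegressionCV(cv=3, max_iter=1000).fit(X, in_arkk)

terms = vectorizer.get_feature_names_out()
order = np.argsort(model.coef_[0])               # ascending by coefficient
print("most negative terms:", terms[order[:3]])
print("most positive terms:", terms[order[-3:]])

# Rank candidate stocks by predicted probability of ARKK membership.
probs = model.predict_proba(X)[:, 1]
```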
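A small gensim LDA sketch on toy tokenized transcripts; the topic count and hyperparameters are assumptions:

```python
# Hedged sketch: discover topics in tokenized transcripts with gensim LDA.
from gensim import corpora
from gensim.models import LdaModel

tokenized = [["revenue", "growth", "cloud", "subscription"],
             ["battery", "vehicle", "production", "delivery"],
             ["cloud", "subscription", "margin", "growth"]]

dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# Per-document topic mixtures support the ARKK vs. non-ARKK comparison.
print(lda.get_document_topics(corpus[0]))
```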
-
ETF Recommender Project - Part 1
(Data Extraction and Processing)
This project involves creating an ETF recommender system by extracting, transforming, and loading data from various web sources for ETF holdings. The project focuses on obtaining data from iShares, Invesco, and StockAnalysis to build a comprehensive dataset for further analysis and recommendations. Served as the designated group project lead.
- Web Mining: Utilized BeautifulSoup and requests libraries to scrape ETF data from iShares, Invesco, and StockAnalysis websites. This involved parsing HTML content to extract relevant data, handling HTTP requests, and managing user-agent headers to avoid being blocked (the request pattern is sketched after this list).
- Data Collection and Integration: Extracted ETF tickers and their holdings from iShares using custom functions to map tickers to their corresponding URLs, then downloaded and processed the holdings data. Employed similar scraping techniques for Invesco and StockAnalysis to gather the remaining data.
- Data Transformation: Standardized the collected data by cleaning and reformatting it to ensure consistency. This included handling missing values, converting data types, and aligning the columns from different sources into a uniform structure.
- Database Management: Leveraged MongoDB for storing the ETF holdings data. Created a connection to the MongoDB cluster, structured the data into a hierarchical format suitable for MongoDB, and inserted the data into the database. Created indexes to optimize query performance (storage and indexing sketched after this list).
- Preparation for Analysis: Ensured the dataset was ready for analysis by performing extensive data validation and cleaning. This included verifying the completeness of the data, removing duplicates, and ensuring accurate mapping of ETF tickers to their respective holdings.
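A minimal sketch of the scraping pattern shared across the three sources, using requests and BeautifulSoup; the URL and the assumption that holdings sit in a plain HTML table are illustrative:

```python
# Hedged sketch: fetch a page with a browser-like User-Agent, parse the table.
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # reduces the chance of being blocked

resp = requests.get("https://stockanalysis.com/etf/", headers=HEADERS, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
rows = soup.find_all("tr")  # assumption: data is rendered in an HTML table
for row in rows[:5]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)
```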
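A sketch of the storage step, assuming a local MongoDB and an illustrative document shape (one document per ETF with a nested holdings list, which is not necessarily the project's exact schema), plus an index on the ETF ticker:

```python
# Hedged sketch: store one document per ETF and index the ticker field.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # assumption: local instance
etfs = client["etf_project"]["holdings"]

doc = {
    "etf_ticker": "QQQ",          # illustrative record shape
    "provider": "Invesco",
    "holdings": [
        {"ticker": "AAPL", "weight": 0.09},
        {"ticker": "MSFT", "weight": 0.08},
    ],
}
etfs.insert_one(doc)
etfs.create_index([("etf_ticker", ASCENDING)], unique=True)
```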
-
ETF Recommender Project - Part 2
(Clustering and Recommendations)
The second part of the ETF Recommender Project focuses on analyzing the ETF data to identify similar ETFs, extract additional relevant data such as expense ratios, and recommend new investment ideas using clustering and association rule mining techniques. Served as the designated group project lead.
- Similarity Analysis: Utilized Python libraries like SciPy and pandas to calculate Jaccard and cosine similarities between ETFs. For instance, the cosine similarity between QQQ and QQQM was 1.0, indicating essentially identical holdings. The analysis involved creating a normalized data matrix and computing pairwise similarity metrics (both metrics sketched after this list).
- Expense Ratio Extraction: Employed Selenium for automated web scraping to gather expense ratios from stockanalysis.com. This involved configuring a headless Chrome browser, navigating through multiple pages, and extracting tabular data. The extracted data was then cleaned and integrated into the dataset (headless setup sketched after this list).
- Apriori Algorithm for Association Rules: Leveraged the apriori algorithm from the apyori library to identify strong association rules among stocks within ETFs. This involved creating a list of transactions representing ETF holdings and running the apriori algorithm to discover frequent itemsets and generate association rules, such as the rule that ETFs holding AAPL also tend to hold MSFT, NVDA, and META (sketched after this list).
- Feature Engineering and Clustering: Retrieved stock characteristics from Finviz.com using BeautifulSoup for web scraping. The data underwent extensive feature engineering, including normalization, handling missing values, and clipping outliers. KMeans clustering from scikit-learn was then applied to categorize stocks based on financial metrics, identifying clusters of similar stocks (the clustering and the cluster-distribution comparison in the next bullet are sketched after this list).
- Cluster-Based ETF Recommendations: Analyzed the distribution of ETF holdings within clusters to recommend ETFs with similar stock compositions. This involved visualizing cluster distributions using libraries like matplotlib and seaborn, and computing cosine similarities between ETFs based on their cluster distributions. For example, ARKK was found to be most similar to XT and JGLO based on cluster analysis, facilitating better-informed investment decisions.
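A sketch of the two similarity measures on toy holdings data; with identical holdings both come out to 1.0, matching the QQQ/QQQM result noted above:

```python
# Hedged sketch: Jaccard on ticker sets, cosine on weight vectors.
import pandas as pd
from scipy.spatial.distance import cosine

qqq  = {"AAPL": 0.09, "MSFT": 0.08, "NVDA": 0.07}
qqqm = {"AAPL": 0.09, "MSFT": 0.08, "NVDA": 0.07}  # toy: identical holdings

# Jaccard: overlap of the two holding sets.
a, b = set(qqq), set(qqqm)
jaccard = len(a & b) / len(a | b)

# Cosine: align tickers into a shared index, then 1 - cosine distance.
df = pd.DataFrame({"QQQ": qqq, "QQQM": qqqm}).fillna(0.0)
cos_sim = 1.0 - cosine(df["QQQ"], df["QQQM"])

print(f"Jaccard={jaccard:.2f}, cosine={cos_sim:.2f}")  # both 1.00 here
```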
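A sketch of the headless-Chrome setup for the expense-ratio step; the URL and the table/cell positions are illustrative assumptions:

```python
# Hedged sketch: scrape a table with Chrome running headless.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get("https://stockanalysis.com/etf/")

for row in driver.find_elements(By.CSS_SELECTOR, "table tbody tr")[:5]:
    cells = row.find_elements(By.TAG_NAME, "td")
    if len(cells) >= 2:
        # assumption: first cell is the ticker, a later cell the expense ratio
        print(cells[0].text, cells[-1].text)

driver.quit()
```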
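An apyori sketch on toy transactions, where each transaction is one ETF's holdings list; the support and confidence thresholds are assumptions:

```python
# Hedged sketch: mine association rules over ETF holdings "transactions".
from apyori import apriori

transactions = [
    ["AAPL", "MSFT", "NVDA", "META"],
    ["AAPL", "MSFT", "NVDA"],
    ["AAPL", "MSFT", "META"],
    ["TSLA", "COIN", "ROKU"],
]

rules = apriori(transactions, min_support=0.5, min_confidence=0.9, min_lift=1.0)
for rule in rules:
    for stat in rule.ordered_statistics:
        if stat.items_base:  # skip rules with an empty antecedent
            print(set(stat.items_base), "->", set(stat.items_add),
                  f"conf={stat.confidence:.2f}, lift={stat.lift:.2f}")
```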
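A sketch of the clustering and cluster-distribution comparison on toy metrics; the two columns stand in for the Finviz-derived features, and the ETF holdings are fabricated for illustration:

```python
# Hedged sketch: scale metrics, cluster stocks, compare ETFs by how their
# holdings spread across the clusters.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Toy stock metrics (stand-ins for P/E, beta, growth, etc.).
stocks = pd.DataFrame(
    {"pe": [30, 28, 80, 75, 12, 10], "beta": [1.2, 1.1, 1.8, 1.7, 0.7, 0.6]},
    index=["AAPL", "MSFT", "TSLA", "COIN", "KO", "PG"],
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(stocks)
)
stocks["cluster"] = labels

# Cluster distribution per ETF (toy holdings), then pairwise cosine similarity.
etfs = {"ARKK": ["TSLA", "COIN"], "QQQ": ["AAPL", "MSFT", "TSLA"]}
dist = pd.DataFrame(
    {name: stocks.loc[held, "cluster"].value_counts(normalize=True)
     for name, held in etfs.items()}
).fillna(0.0)
print(cosine_similarity(dist.T))  # rows/cols ordered as in `etfs`
```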
-
Multi-Class Prediction of Obesity Risk
(Classification)
This project involves building a multi-class prediction model to assess obesity risk based on various health and demographic factors.
- Data Preprocessing: Applied standard scaling and median imputation for numerical columns to handle missing values and ensure balanced feature contribution. Categorical columns were processed using constant imputation and OneHotEncoding.
- Feature Engineering: Implemented feature engineering techniques to standardize numerical features and encode categorical features, facilitating their use in the machine learning models.
- Model Selection: Utilized Gradient Boosting and Decision Tree algorithms to capture non-linear relationships and compare their capabilities in predicting obesity risk levels.
- Pipeline Creation: Developed a pipeline to streamline preprocessing and model application, ensuring the test data receives exactly the same transformations as the training data (sketched after this list).
- Model Optimization: Explored hyperparameter tuning and feature selection/creation to enhance the performance of the Decision Tree algorithm, aiming to improve precision on the classes with fewer samples (a tuning sketch follows the list below).
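A sketch of the preprocessing-plus-model pipeline using sklearn's ColumnTransformer; the column names are illustrative guesses at the obesity dataset's schema, not a confirmed list:

```python
# Hedged sketch: median-impute and scale numerics, constant-impute and
# one-hot encode categoricals, then fit gradient boosting.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["Age", "Height", "Weight"]   # assumed column names
categorical = ["Gender", "FAVC"]        # assumed column names

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant",
                                               fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])
model = Pipeline([("prep", preprocess),
                  ("clf", GradientBoostingClassifier(random_state=0))])
# model.fit(X_train, y_train); model.predict(X_test)
```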
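A sketch of one plausible tuning setup with GridSearchCV; the grid values and the macro-F1 scoring choice are assumptions:

```python
# Hedged sketch: cross-validated grid search over DecisionTree hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 5, 20],
    "class_weight": [None, "balanced"],  # can help the smaller classes
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="f1_macro", n_jobs=-1)
# search.fit(X_train, y_train); search.best_params_
```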
-
Grocery Store Demand Forecasting
(Regression)
This group project involved building a machine learning model to forecast inventory levels for a grocery store.
- Data Collection and Integration: Downloaded and merged datasets for sales, stores, and products. Conducted feature engineering by creating new temporal features and converting categorical variables.
- Exploratory Data Analysis (EDA): Performed thorough EDA to identify trends, patterns, and anomalies. Generated histograms and line plots to visualize sales data across different dimensions such as product name, product category, store number, and date.
- Model Development and Evaluation: Developed and fine-tuned multiple machine learning models including Linear Regression, Random Forest, and XGBoost. Evaluated model performance using metrics like R-squared, MAE, MSE, and RMSE to select the best model (evaluation sketched after this list).
- Feature Engineering and Selection: Created new features such as angular (sine/cosine) encoding of temporal features and applied scaling and one-hot encoding through a pipeline. Dropped redundant features to improve model performance (encoding sketched after this list).
- Model Prediction and Aggregation: Used the final XGBoost model to predict demand and adjusted inventory levels. Aggregated data to predict weekly inventory for each store-product combination, ensuring the model's RMSE was within acceptable limits.
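A sketch of the angular (cyclical) encoding, which places December next to January in feature space; month and day-of-week are example features:

```python
# Hedged sketch: map cyclical time features onto sine/cosine pairs.
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=5, freq="D")})
df["month"] = df["date"].dt.month
df["dow"] = df["date"].dt.dayofweek

df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
df["dow_sin"] = np.sin(2 * np.pi * df["dow"] / 7)
df["dow_cos"] = np.cos(2 * np.pi * df["dow"] / 7)
```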
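A sketch of the fit-and-score step with XGBoost on synthetic data, reporting the four metrics named above; the hyperparameters are assumptions:

```python
# Hedged sketch: fit an XGBoost regressor and score held-out predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

mse = mean_squared_error(y_te, pred)
print(f"R2={r2_score(y_te, pred):.3f}  MAE={mean_absolute_error(y_te, pred):.2f}"
      f"  MSE={mse:.2f}  RMSE={np.sqrt(mse):.2f}")
```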