Portfolio
This page presents a selection of my portfolio projects, each highlighting different skills and accomplishments.
You can also visit my GitHub repositories for more projects and practice:
Heart Disease and Attack Risk Classification Using Machine Learning Methods
Portfolio Project of Machine Learning App Deployment (DSAN 6700)
Washington, D.C.
October 2023 – December 2023
In the ‘Machine Learning App Deployment’ course (DSAN 6700), I led a team of three on a three-month project. Our goal was to process the data and build machine learning models that analyze and predict the risk of heart disease and heart attacks.
Key Responsibilities and Accomplishments:
- Employed a range of machine learning techniques, including traditional algorithms (KNN, Decision Tree, Naive Bayes) and ensemble models (Random Forest, Gradient Boosting, XGBoost), for effective heart risk classification
- Constructed a stacked model integrating ensemble methods with Logistic Regression, achieving a prediction accuracy of 91%
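To illustrate the stacking idea in the bullets above, here is a minimal standard-library sketch. It is not the project's code: the synthetic data, the toy `OneFeatureScorer` base learners, and the hand-written `fit_logistic` are all assumptions made for illustration, standing in for the ensemble models and the Logistic Regression layer named above. The part that carries over is the structure: base learners produce out-of-fold scores, and a logistic regression is trained on those scores.

```python
import math
import random

def make_rows(n, seed):
    """Synthetic rows ((x0, x1), label); the label is 1 when x0 + x1 > 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        rows.append((x, 1 if x[0] + x[1] > 1.0 else 0))
    return rows

class OneFeatureScorer:
    """Toy base learner (hypothetical): one feature minus the midpoint of its class means."""
    def __init__(self, feature):
        self.feature = feature
        self.mid = 0.0

    def fit(self, rows):
        pos = [x[self.feature] for x, y in rows if y == 1]
        neg = [x[self.feature] for x, y in rows if y == 0]
        self.mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def score(self, x):
        return x[self.feature] - self.mid

def make_bases():
    return [OneFeatureScorer(0), OneFeatureScorer(1)]

def out_of_fold_scores(rows, folds=5):
    """Score every row with base learners fitted on the other folds only."""
    meta = [None] * len(rows)
    for k in range(folds):
        train = [r for i, r in enumerate(rows) if i % folds != k]
        bases = [base.fit(train) for base in make_bases()]
        for i, (x, _) in enumerate(rows):
            if i % folds == k:
                meta[i] = [base.score(x) for base in bases]
    return meta

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, epochs=500, lr=1.0):
    """Logistic regression by batch gradient descent; returns (weights, bias)."""
    w = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + bias) - y
            for j, fj in enumerate(f):
                grad_w[j] += err * fj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return w, bias

def accuracy(preds, rows):
    return sum(p == y for p, (_, y) in zip(preds, rows)) / len(rows)

train_rows = make_rows(300, seed=0)
test_rows = make_rows(200, seed=1)

# Level 1: out-of-fold base scores become the meta-learner's training features.
meta_features = out_of_fold_scores(train_rows)
weights, bias = fit_logistic(meta_features, [y for _, y in train_rows])

# Refit the base learners on all training rows for use at prediction time.
final_bases = [base.fit(train_rows) for base in make_bases()]

def stacked_predict(x):
    z = sum(w * base.score(x) for w, base in zip(weights, final_bases)) + bias
    return 1 if z > 0 else 0

stacked_acc = accuracy([stacked_predict(x) for x, _ in test_rows], test_rows)
base_accs = [
    accuracy([1 if base.score(x) > 0 else 0 for x, _ in test_rows], test_rows)
    for base in final_bases
]
print("base:", base_accs, "stacked:", stacked_acc)
```

Training the meta-learner on out-of-fold scores rather than in-sample scores keeps it from rewarding base learners that merely memorized their training rows.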
Analysis of Reddit Posts about Soccer
Portfolio Project of Big Data and Cloud Computing (DSAN 6000)
Washington, D.C.
September 2023 – December 2023
In the ‘Big Data and Cloud Computing’ course (DSAN 6000), I led a four-person team on a semester-long project. Following our professor’s guidelines, we applied Natural Language Processing (NLP), feature engineering, and data modeling to a Reddit dataset of over 3 billion soccer-related comments. We used PySpark on both Azure and AWS to process and analyze this dataset and to interpret the results.
Key Responsibilities and Accomplishments:
- Leveraged PySpark for NLP and feature engineering on a massive Reddit dataset comprising over 3 billion comments
- Executed comprehensive sentiment analysis to track the evolving perspectives of soccer fans regarding the 2022 World Cup
- Implemented K-means to group users into distinct clusters, using the elbow method to choose the number of clusters
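The elbow method mentioned above can be sketched in plain Python. This is an illustration only, not the PySpark code used in the project: the synthetic three-blob data, the deterministic `farthest_first` seeding, and the `min_gain` threshold in `elbow_k` are assumptions chosen to keep the example small and reproducible.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then keep adding
    the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_inertia(points, k, iters=50):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_k(inertias, min_gain=0.05):
    """inertias[i] belongs to k = i + 1. Return the smallest k whose next
    step reduces inertia by less than min_gain of the k = 1 inertia."""
    total = inertias[0]
    for i in range(len(inertias) - 1):
        if (inertias[i] - inertias[i + 1]) / total < min_gain:
            return i + 1
    return len(inertias)

# Synthetic data: three well-separated 2-D blobs of 30 points each.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centers
    for _ in range(30)
]

inertias = [kmeans_inertia(points, k) for k in range(1, 7)]
best_k = elbow_k(inertias)
print("elbow at k =", best_k)
```

On real data the elbow is usually read off a plot of inertia against k; the threshold rule here is one simple way to automate that reading.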
Sales Forecasting Using Time Series Analysis
Portfolio Project of Time Series (ANLY 560)
Washington, D.C.
February 2023 – May 2023
In the ‘Time Series’ course (ANLY 560), I independently managed a semester-long project analyzing Walmart sales data from 2010 to 2012. I fitted several time series models and compared them by RMSE, then selected and tuned the best-performing model for sales forecasting.
Key Responsibilities and Accomplishments:
- Applied SARIMA and deep learning models to capture patterns in the Walmart historical sales data and produce forecasts
- Achieved an RMSE of 482 with a SARIMA model, outperforming benchmarks and automatic models
- Used a SARIMAX model to assess the impact of external factors, including CPI, fuel price, and unemployment rate
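As a small illustration of comparing forecasts by RMSE on a holdout period, the sketch below scores two simple benchmark forecasts on a synthetic monthly series. It does not reproduce the SARIMA or SARIMAX models or the Walmart data: the series, the 12-step season, and the two benchmark functions are assumptions used only to show the comparison step.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def naive_forecast(train, horizon):
    """Repeat the last observed value."""
    return [train[-1]] * horizon

def seasonal_naive_forecast(train, horizon, period):
    """Repeat the last observed season."""
    last_season = train[-period:]
    return [last_season[i % period] for i in range(horizon)]

# Synthetic series: linear trend plus a 12-step seasonal cycle (assumed data).
period = 12
series = [
    100 + 0.5 * t + 20 * math.sin(2 * math.pi * t / period) for t in range(60)
]
train, test = series[:48], series[48:]

scores = {
    "naive": rmse(test, naive_forecast(train, len(test))),
    "seasonal_naive": rmse(
        test, seasonal_naive_forecast(train, len(test), period)
    ),
}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

The same pattern extends to any candidate model: produce forecasts for the holdout window, add the RMSE to `scores`, and keep the model with the lowest value.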
Exploring the Factors Affecting Salaries and Employment
Portfolio Project of Statistical Learning (ANLY 512)
Washington, D.C.
April 2023 – May 2023
In the ‘Statistical Learning’ course (ANLY 512), I worked in a team of three to classify salary categories in a US Census dataset using a range of machine learning algorithms. We compared the algorithms to determine which best fit this task.
Key Responsibilities and Accomplishments:
- Trained classifiers using Logistic Regression, XGBoost, SVM, and Naive Bayes to predict income category from Census data; XGBoost performed best, reaching an accuracy of 86%
- Used an artificial neural network (Multi-Layer Perceptron classifier) to classify salary categories