Machine Learning • XGBoost • Streamlit • Docker • Business Analytics
This project used predictive analytics to improve B2B sales strategy. I led the end-to-end development of a machine-learning-driven playbook that forecasts deal outcomes and segments high-value customers, turning static sales processes into dynamic, data-informed strategies.
We worked with three anonymized HubSpot datasets (Companies, Deals, Tickets), cleaning over 100 features and handling high missingness. EDA revealed that the data was US-centric and dominated by small-to-midsize companies. We engineered duration-based features and corrected severe imbalance across the industry and customer-type categories.
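The cleaning and duration steps can be sketched roughly as follows. This is a minimal illustration on toy data, not the project code: the column names (`create_date`, `close_date`, `amount`, `mostly_empty`), the 60% missingness cutoff, and the median imputation are all assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the Deals table; every column name is an assumption.
deals = pd.DataFrame({
    "deal_id": [1, 2, 3, 4],
    "create_date": ["2023-01-05", "2023-02-10", "2023-03-01", None],
    "close_date": ["2023-02-04", "2023-02-20", None, "2023-05-01"],
    "amount": [12000.0, None, 4500.0, 30000.0],
    "mostly_empty": [None, None, None, "x"],
})

# Drop columns whose share of missing values exceeds an assumed cutoff.
MAX_MISSING = 0.6
missing_share = deals.isna().mean()
deals = deals.loc[:, missing_share <= MAX_MISSING].copy()

# Parse dates; missing or unparseable values become NaT.
for col in ["create_date", "close_date"]:
    deals[col] = pd.to_datetime(deals[col], errors="coerce")

# Duration-based feature: days between deal creation and close.
deals["days_to_close"] = (deals["close_date"] - deals["create_date"]).dt.days

# Fill remaining numeric gaps with the column median (one simple choice).
deals["amount"] = deals["amount"].fillna(deals["amount"].median())

print(deals)
```

Rows with a missing date keep a missing `days_to_close`, which leaves the decision to impute or drop them to a later step.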
After merging company and deal data via mapping dictionaries, I engineered business-relevant features such as region zone, deal size category, and revenue brackets. I used Random Forest to identify top predictors, selecting 10 robust features spanning behavioral and demographic attributes.
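A rough sketch of this merge-and-rank step is below. The data is randomly generated, and the column names, state-to-zone mapping, and bin edges are hypothetical choices for illustration rather than the ones used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_deals = 200

# Synthetic stand-ins for the Companies and Deals tables (assumed schema).
companies = pd.DataFrame({
    "company_id": range(10),
    "state": rng.choice(["CA", "NY", "TX", "IL"], size=10),
    "annual_revenue": rng.integers(100_000, 50_000_000, size=10),
})
deals = pd.DataFrame({
    "company_id": rng.integers(0, 10, size=n_deals),
    "amount": rng.integers(1_000, 100_000, size=n_deals),
    "won": rng.integers(0, 2, size=n_deals),
})

# Merge company attributes into deals via mapping dictionaries.
state_by_company = companies.set_index("company_id")["state"].to_dict()
revenue_by_company = companies.set_index("company_id")["annual_revenue"].to_dict()
deals["state"] = deals["company_id"].map(state_by_company)
deals["annual_revenue"] = deals["company_id"].map(revenue_by_company)

# Business-oriented features; the zone mapping and bin edges are assumptions.
ZONE_BY_STATE = {"CA": "West", "NY": "Northeast", "TX": "South", "IL": "Midwest"}
deals["region_zone"] = deals["state"].map(ZONE_BY_STATE)
deals["deal_size"] = pd.cut(
    deals["amount"], bins=[0, 10_000, 50_000, np.inf],
    labels=["small", "medium", "large"],
)
deals["revenue_bracket"] = pd.cut(
    deals["annual_revenue"], bins=[0, 1_000_000, 10_000_000, np.inf],
    labels=["<1M", "1M-10M", ">10M"],
)

# Rank one-hot encoded features by Random Forest importance and keep the top 10.
X = pd.get_dummies(deals[["region_zone", "deal_size", "revenue_bracket"]])
y = deals["won"]
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
top_features = importances.head(10).index.tolist()

print(importances)
```

Because the labels here are random, the resulting ranking carries no meaning; the sketch only shows the mechanics of mapping, binning, and ranking.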
I trained and evaluated six models (Random Forest, XGBoost, AdaBoost, KNN, Logistic Regression, and Decision Tree). Ensemble models such as XGBoost and AdaBoost delivered high AUC-ROC scores and high accuracy. To ensure model integrity, I excluded features prone to leakage (e.g., deal probability, stage), since they reflect the deal's outcome or late-stage status and would not be reliable inputs at prediction time.
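The comparison loop might look something like the sketch below. It runs on synthetic data from `make_classification`, uses default or near-default hyperparameters, and simulates the two leaky columns purely to show them being dropped. XGBoost is left out to keep the example scikit-learn only; `xgboost.XGBClassifier` exposes the same `fit`/`predict_proba` interface and could be added to the dictionary the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the engineered deal features (assumption).
X_arr, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Simulated leaky columns: both are direct functions of the outcome.
X["deal_probability"] = y * 0.9 + 0.05
X["deal_stage"] = y

# Drop leakage-prone columns before any model sees them.
LEAKY_COLUMNS = ["deal_probability", "deal_stage"]
X = X.drop(columns=LEAKY_COLUMNS)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and score it on the held-out split.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    rows.append({
        "model": name,
        "auc_roc": roc_auc_score(y_test, proba),
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
    })

results = pd.DataFrame(rows).sort_values("auc_roc", ascending=False)
print(results)
```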
We found that smaller companies and industries like Manufacturing and Retail had the highest win rates. Customer types labeled “In Trial” or “Partner” had over 80% conversion, while “Prospect” and “Vendor” had the lowest. These insights shaped our lead-scoring and segmentation logic.
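Win rates per segment like these come down to a grouped mean of the outcome flag. A minimal sketch, using a handful of made-up rows (not the project's data) and an assumed binary `won` column:

```python
import pandas as pd

# Toy rows for illustration only; these are not the project's figures.
deals = pd.DataFrame({
    "customer_type": [
        "In Trial", "In Trial", "Partner", "Prospect",
        "Prospect", "Vendor", "Partner", "Prospect",
    ],
    "won": [1, 1, 1, 0, 1, 0, 1, 0],
})

# Win rate is the mean of the 0/1 outcome; the count shows how much data backs it.
win_rates = (
    deals.groupby("customer_type")["won"]
    .agg(win_rate="mean", n_deals="count")
    .sort_values("win_rate", ascending=False)
)

print(win_rates)
```

Keeping the deal count next to each rate makes it easier to spot segments whose win rate rests on only a few deals.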
I deployed the model using Streamlit and Docker. The dashboard includes five modules: deal prediction, filtered analytics, win-rate insights, segment discovery, and segment comparisons. This tool empowers sales teams to make fast, data-backed decisions and focus efforts on high-impact deals.
Python • pandas • scikit-learn • XGBoost • Streamlit • Docker • Git • matplotlib • seaborn