Advaith Venkatsubramanian | Data Scientist & ML Engineer

About Me

I'm a Master's student in Computer Science at Arizona State University, graduating in May 2026. Before grad school, I spent 3+ years at Wabtec Corporation as a Data Scientist, where I built ML models for locomotive pricing, automated OCR pipelines, and helped design operational chatbots.

Currently, I'm working as a Software Developer at the ASU Office of University Affairs, where I engineer RAG-based AI assistants, build NLP-driven pipelines, create Tableau visualizations, and ensure WCAG 2.x and ADA compliance. I'm passionate about the intersection of AI, data engineering, and applied machine learning.

I thrive in fast-paced environments where I can combine analytical thinking with full-stack engineering to deliver solutions that make a real difference.

🤖
AI & Machine Learning
RAG, LLMs, NLP, Deep Learning

📊
Data Science
Predictive Modeling, Analytics

🔧
Data Engineering
ETL/ELT, Pipelines, Cloud

💻
Full-Stack Dev
React, Go, Flask, Docker

Experience

Professional Journey

From engineering intern to data scientist, building impactful solutions across industries.

Software Developer

ASU Office of University Affairs

Mar 2025 – Present

Engineered a RAG-based AI assistant using large language models, vector search, and Python to automate activity creation workflows, cutting manual content entry time by 60%.
Contributed to the Collaboratory web platform by implementing bug fixes and UI enhancements using JavaScript, React, Golang, Docker, and PostgreSQL, improving system usability and overall application stability.
Engineered an LLM-powered onboarding chatbot using LangChain, GPT-4, ChromaDB, and Hugging Face Sentence Transformers, with a RAG pipeline over internal PDF/Markdown docs that answers process queries and auto-generates personalized onboarding schedules, reducing agenda creation time by 30%.
Developed 9 Tableau visualizations for St. Mary's Food Bank analyzing donation patterns, volume trends, and regional distribution, surfacing insights that informed donor outreach strategy.
Built a resume screening pipeline in Python using PyPDF2 and regex-based keyword extraction to parse candidate PDFs, score applicants against job description criteria, and auto-populate Excel trackers via openpyxl, reducing manual screening time for student worker hiring by 75%.
Brought the Collaboratory platform into WCAG 2.x and ADA compliance by implementing ARIA labeling, keyboard navigation, and semantic HTML.

Python JavaScript/React LLMs RAG PostgreSQL Tableau Docker WCAG/ADA

Data Scientist

Wabtec Corporation

Jul 2021 – Aug 2024

Designed and deployed an ML pricing model using Random Forest to forecast locomotive part costs, achieving 87% prediction accuracy and enabling data-driven procurement decisions across the supply chain.
Created a production NLP classification pipeline using TF-IDF and SVM to automatically extract root causes from unstructured engineering logs, improving diagnostic accuracy by 32% and accelerating troubleshooting workflows.
Implemented an ETL pipeline using fuzzy matching and cosine similarity to deduplicate 120K customer records in a large-scale MDM dataset, reducing data redundancy by 40% and cutting downstream processing time by 25%.
Designed an OCR document processing service using AWS Textract, S3, and EC2 to extract and structure data from unstructured documents at scale, reducing manual data entry time by 30%.
Owned Jenkins and Chef CI/CD workflows across multiple services and conducted systematic code reviews, maintaining production stability and ensuring zero-downtime releases.

Python Random Forest TF-IDF SVM AWS Textract S3 EC2 CI/CD

Projects

Things I've Built

A selection of projects combining AI, data engineering, and software development.

🏥

TrustMed AI

85% relevance · <8s response

Developed an end-to-end AI assistant for medical Q&A on diabetes and cardiovascular diseases using a retrieval-augmented generation architecture with vector search, achieving an 85% relevance score with sub-8-second response times.

RAG AWS Python Pinecone LangGraph LLaMA

🛡️

Intrusion Detection System

>0.95 ROC-AUC

Built a production-grade intrusion detection system on the NSL-KDD dataset (125K+ records), implementing automated retraining pipelines, model lifecycle management, real-time inference APIs, and drift detection.

MLflow Airflow FastAPI Docker Evidently

🦟

Dengue Outbreak Forecasting

Applied ARIMA-based time series modeling to multi-region historical epidemiological data to forecast dengue case counts in support of health resource planning.

Python ARIMA Time Series

🏠

Airbnb Listings ELT Pipeline

Built an end-to-end ELT pipeline that ingests and transforms Airbnb listings data with built-in data quality tests, and surfaced the resulting analytics in a Streamlit dashboard.

Snowflake dbt SQL Streamlit

Skills