Big Data & NLP · Data Scientist

Twitter Sentiment Analysis at Scale

PySpark · NLP · Databricks · Big Data · ML

A big data machine learning project that classifies tweets as positive or negative, using the 1.6 million labeled tweets of the Sentiment140 dataset. It is built on Apache Spark and Databricks and relies on distributed computing to scale the NLP processing.

The Problem

Understanding public sentiment at scale means processing millions of short texts efficiently, which single-machine approaches handle poorly. The challenge was to build a distributed ML pipeline that copes with this volume of text while maintaining classification accuracy.

The Approach

Developed a PySpark pipeline: text preprocessing (lowercasing, URL removal, tokenization, stopword filtering), TF-IDF vectorization, and model comparison between Logistic Regression and Naive Bayes. Deployed on Databricks for scalable cloud-based processing.
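
The preprocessing steps listed above can be sketched in plain Python to show what each one does to a single tweet. This is a minimal illustration, not the project's code: the regular expressions, the `preprocess` name, and the tiny stopword list are assumptions. In a PySpark pipeline the same steps would typically be expressed with `lower` and `regexp_replace` from `pyspark.sql.functions`, followed by `Tokenizer` and `StopWordsRemover` from `pyspark.ml.feature`.

```python
import re

# Assumption: a tiny illustrative stopword list; the project's actual list is not stated.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "it", "this"}

# Assumption: URLs are anything starting with http://, https://, or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Tokens are runs of lowercase letters, digits, and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def preprocess(tweet: str) -> list[str]:
    """Lowercase, strip URLs, tokenize, and drop stopwords."""
    text = tweet.lower()                 # 1. lowercasing
    text = URL_PATTERN.sub(" ", text)    # 2. URL removal
    tokens = TOKEN_PATTERN.findall(text)  # 3. tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 4. stopword filtering


if __name__ == "__main__":
    print(preprocess("This is the BEST day! http://example.com/x"))
```

URL removal runs before tokenization so that fragments of a link (such as `http` or `com`) never reach the vocabulary.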

Technical Details

  • Apache Spark & PySpark for distributed computing
  • TF-IDF vectorization for text feature extraction
  • Logistic Regression vs Naive Bayes comparison
  • Databricks cloud platform for scalable execution
  • 1.6M tweet dataset (balanced: 800K positive, 800K negative)
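
To make the TF-IDF step concrete, here is a small pure-Python sketch of the weighting on three toy documents. It assumes raw term counts for TF and the smoothed IDF form that Spark MLlib documents, log((N + 1) / (df + 1)); whether the project built its term counts with `HashingTF` or `CountVectorizer` is not stated, and the `tf_idf` helper and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, log((N + 1) / (df + 1)), so no term divides by zero.
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    # TF is the raw count of the term inside the document.
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Toy tokenized documents, invented for illustration only.
docs = [["good", "day"], ["bad", "day"], ["good", "good", "movie"]]
weights = tf_idf(docs)

if __name__ == "__main__":
    for doc, w in zip(docs, weights):
        print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that appear in fewer documents ("bad", "movie") receive a higher IDF than terms spread across several ("good", "day"), which is what lets the classifier focus on distinctive words.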

Outcomes

  • Logistic Regression achieved 77.96% accuracy, outperforming Naive Bayes (76.37%)
  • Built production-ready pipeline on scalable cloud infrastructure
  • Demonstrated distributed ML capabilities for real-world NLP tasks
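
For context on the accuracy figures: accuracy is the fraction of tweets whose predicted label matches the true label, and on a balanced dataset a classifier that always predicts one class would score 50%. The sketch below shows the computation on made-up toy labels (the `accuracy` helper and the toy lists are illustrative assumptions, not the project's evaluation code) and the gap between the two reported scores in percentage points.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Toy labels (1 = positive, 0 = negative), invented for illustration only.
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_preds = [1, 0, 0, 1, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_labels, toy_preds)  # 6 of 8 predictions match

# Gap between the two reported accuracies, in percentage points.
gap_points = round(77.96 - 76.37, 2)

if __name__ == "__main__":
    print(f"toy accuracy: {toy_accuracy:.2%}")
    print(f"Logistic Regression leads Naive Bayes by {gap_points} points")
```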
