Final Year Project: Advancing Email Productivity with Hybrid AI and Data Analytics
This project delivers a modern, intelligent email management platform built with Django, combining machine learning, rule-based logic, and generative AI to help users focus on what matters. The system automatically prioritizes, categorizes, and summarizes emails, supports hands-free voice commands, and provides actionable analytics—all while ensuring privacy and scalability.
Links: Read the project announcement on LinkedIn and watch a project demo on LinkedIn
- Motivation
- Objectives
- Project Structure
- System Overview
- Key Features
- Data & Modeling
- API Integrations
- Setup
- How to Use
- Database Schema
- Roadmap
- System Screenshots
- Contributing
Email overload is a persistent challenge for professionals. This project aims to transform the inbox experience by leveraging a hybrid AI approach combining robust machine learning models, domain-specific rules, and Generative AI to deliver reliable, context-aware email organization and insights.
- To conduct a comprehensive analysis of existing email management systems and user challenges through literature review and user surveys, and to prepare a structured email dataset for model development.
- To analyze and design an intelligent unified data architecture incorporating natural language processing, machine learning algorithms, and security protocols.
- To develop an intelligent multi-modal data assistant that integrates external APIs, automated visualization generation, and voice-enabled analytics.
- To evaluate the effectiveness and performance of the developed system through controlled testing and user acceptance testing in enterprise environments.
```
email_app/                                # Main Django application
├── ai_services/                          # AI/ML services (categorization, prioritization, LLM)
├── Gmail_API/                            # Gmail API integration
├── voice_services/                       # Voice command functionality
├── views/                                # Django views (auth, email, dashboard, AI)
├── models/                               # Database models
├── templates/                            # HTML templates
├── static/                               # CSS, JavaScript, assets
├── utils/                                # Utility functions
└── management/                           # Django management commands
email_project/                            # Django project settings
Data Preprocessing and EDA/               # Jupyter notebooks for data analysis
├── FYPEnronEmail_NLP1.ipynb              # Initial data loading and cleaning
├── FYP_Trans_Feature_NLP2.ipynb          # Feature transformation
├── FYP_Message_Text_Cleaning_NLP3.ipynb  # Text preprocessing
└── FYP_Visualization4.ipynb              # Data visualization and EDA
Data Labeling Solutions/                  # Automated labeling approaches
├── Large Language Model/                 # LLM-based labeling (Qwen2.5)
└── Unsupervised Learning/                # Clustering-based labeling
Model Building/                           # ML model development
├── Prioritization Model/                 # Random Forest priority classification
└── Categorization Model/                 # XGBoost category classification
Experiments/                              # Model experiments and iterations
├── Category Modeling/                    # Category model experiments
├── Priority Modeling/                    # Priority model experiments
├── Data Labeling GC/                     # Google Cloud labeling experiments
└── Spam Data Preparation/                # Spam detection data prep
requirements.txt                          # Python dependencies
manage.py                                 # Django management script
.env                                      # Environment variables
```
The platform is architected for modularity and extensibility, with clear separation between data ingestion, processing, AI services, and user interaction.
```
┌─────────────────────────────────────────────────────────────┐
│                    User Interface Layer                     │
├─────────────────────────────────────────────────────────────┤
│ Web UI (Django Templates)  │  Voice Commands (Web Speech)   │
│ • Inbox Management         │  • Voice Navigation            │
│ • Analytics Dashboard      │  • Hands-free Email Actions    │
│ • Email Composition        │  • Voice-to-Text Input         │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                   Django REST API Layer                     │
├─────────────────────────────────────────────────────────────┤
│ Authentication (OAuth2)    │  Email Management Views        │
│ • Gmail OAuth Integration  │  • CRUD Operations             │
│ • Session Management       │  • Bulk Actions                │
│ • User Profile Management  │  • Search & Filtering          │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                     AI Services Layer                       │
├─────────────────────────────────────────────────────────────┤
│ ML Models                  │  Generative AI Services        │
│ • Random Forest (Priority) │  • Google Gemini Integration   │
│ • XGBoost (Categorization) │  • Email Summarization         │
│ • Rule-based Enhancement   │  • Smart Reply Generation      │
│ • Vector Embeddings        │  • Context-aware Responses     │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                 Data & Integration Layer                    │
├─────────────────────────────────────────────────────────────┤
│ Gmail API Integration      │  Database (PostgreSQL)         │
│ • Email Fetching           │  • Email Storage               │
│ • Send/Reply Operations    │  • User Profiles               │
│ • Real-time Sync           │  • Vector Search (pgvector)    │
│ • Attachment Handling      │  • Analytics Data              │
└─────────────────────────────────────────────────────────────┘
```
- Backend: Django 5.2, Python 3.9+
- Database: PostgreSQL with pgvector extension
- ML/AI: scikit-learn, XGBoost, Google Gemini API
- Frontend: HTML5, CSS3, JavaScript (Web Speech API)
- APIs: Gmail API, Google Cloud AI Platform
- Authentication: OAuth2 (Google)
- Data Processing: pandas, numpy, BeautifulSoup4
- Hybrid Priority Scoring: Merges Random Forest predictions with a transparent rule-based engine for robust, real-world prioritization.
- Automated Categorization: XGBoost model classifies emails into actionable workplace categories (e.g., Work/Business, Meeting, Promotions).
- Generative AI Summaries & Replies: Uses Google Gemini for summarizing threads and suggesting context-aware replies.
- Voice-Driven Inbox: Browser-based voice commands (Web Speech API) for hands-free navigation and actions.
- Analytics Dashboard: Visualizes trends, response times, and sender/recipient activity.
- Semantic Search: Finds similar emails using vector embeddings and pgvector.
- Secure Gmail Integration: OAuth2-based access for reading, sending, and managing emails.
- Primary Dataset: Enron Email Dataset (517,401 emails from 150+ users)
- Data Processing: Multi-stage pipeline implemented in Jupyter notebooks:
  - FYPEnronEmail_NLP1.ipynb: Data loading, initial cleaning, and exploration
  - FYP_Trans_Feature_NLP2.ipynb: Feature engineering and transformation
  - FYP_Message_Text_Cleaning_NLP3.ipynb: Advanced text preprocessing (HTML cleaning, tokenization, lemmatization)
  - FYP_Visualization4.ipynb: Comprehensive EDA and data visualization
- LLM-Based Labeling: Used Qwen2.5 large language model for consistent, context-aware labeling
- Category Labels: 11 workplace-relevant categories (Work/Business, Finance, Personal, Meeting, Spam, IT Alerts, HR Updates, Social Media, Utilities, Promotions, Legal)
- Priority Labels: 3-tier priority system (High, Medium, Low) based on urgency and importance
- Quality Assurance: Structured prompts ensure consistent labeling across 500K+ emails
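The text-preprocessing stage (the FYP_Message_Text_Cleaning_NLP3.ipynb notebook) can be sketched with the standard library alone. This is a minimal illustration only: the actual pipeline uses BeautifulSoup4 for HTML parsing plus tokenization and lemmatization, which this sketch omits.

```python
import html
import re

def clean_email_text(raw: str) -> str:
    """Minimal cleaning sketch: unescape HTML entities, strip tags,
    lowercase, and collapse whitespace. The project's notebooks go
    further (BeautifulSoup parsing, tokenization, lemmatization)."""
    text = html.unescape(raw)             # &amp; -> &, &nbsp; -> NBSP
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

print(clean_email_text("<p>Please&nbsp;review the <b>Q3 report</b></p>"))
# -> please review the q3 report
```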
- Priority Classification:
  - Model: Random Forest with 100 estimators
  - Features: TF-IDF vectors, sender patterns, temporal features
  - Performance: 99% accuracy, 1.00 F1-score
  - Hybrid Enhancement: Rule-based post-processing for edge cases
- Category Classification:
  - Model: XGBoost with optimized hyperparameters
  - Features: TF-IDF vectors, email metadata, content patterns
  - Performance: 84% accuracy, 0.87 F1-score
  - Class Balancing: Weighted training to handle imbalanced categories
- AI Services Architecture: Modular design in `email_app/ai_services/`:
  - `categorization/`: Email category prediction service
  - `prioritazation/`: Email priority scoring service
  - `llm/`: Generative AI integration (Google Gemini)
  - `embeddings/`: Vector embeddings for semantic search
  - `summarization/`: Email thread summarization
  - `recommendations/`: Smart reply suggestions
The Enron dataset used in this project did not include category or priority labels, which posed a challenge for supervised learning. To address this, we used the Qwen2.5 large language model (LLM) to automatically generate consistent, context-aware labels for both email category and priority. Each email was passed through a structured prompt, and the model assigned one of several workplace-relevant categories, such as:
- Work or Business Email
- Finance & Transactions Email
- Personal Email
- Meeting & Schedule Email
- Spam Email
- IT Alerts & System Notifications Email
- Internal Policies & HR Updates Email
- Social Media Email
- Utilities Bill Email
- Promotions or Marketing Email
- Legal & Contractual Email
This LLM-based approach ensured high-quality, context-sensitive labeling, reducing human error and making the dataset suitable for training robust machine learning models.
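The structured-prompt idea can be sketched as below. The exact prompt wording sent to Qwen2.5 in the project is not reproduced here; this is an illustrative construction that constrains the model to one category and one priority per email so labels stay consistent across the corpus.

```python
# The 11 category labels used for the Enron dataset, as listed above.
CATEGORIES = [
    "Work or Business Email", "Finance & Transactions Email",
    "Personal Email", "Meeting & Schedule Email", "Spam Email",
    "IT Alerts & System Notifications Email",
    "Internal Policies & HR Updates Email", "Social Media Email",
    "Utilities Bill Email", "Promotions or Marketing Email",
    "Legal & Contractual Email",
]
PRIORITIES = ["High", "Medium", "Low"]

def build_labeling_prompt(subject: str, body: str) -> str:
    """Force a single, machine-parseable answer per email; restricting
    the output format is what keeps labels consistent at 500K+ scale."""
    return (
        "You label workplace emails.\n"
        f"Choose exactly one category from: {'; '.join(CATEGORIES)}.\n"
        f"Choose exactly one priority from: {', '.join(PRIORITIES)}.\n"
        "Reply with two lines only:\n"
        "Category: <category>\nPriority: <priority>\n\n"
        f"Subject: {subject}\nBody: {body[:2000]}"  # truncate long bodies
    )
```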
Balancing the Dataset:
- Categories with more than 15,000 samples were capped at 15,000 (randomly selected)
- Smaller categories were kept as-is
- No synthetic samples were added, preserving the quality of text embeddings
Remaining imbalance was handled by using class weights during model training, ensuring the model learned to recognize minority categories effectively.
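The capping and class-weighting rules above can be sketched in pure Python (the project likely used pandas and scikit-learn equivalents; `class_weights` follows scikit-learn's "balanced" heuristic):

```python
import random
from collections import Counter

def cap_per_category(rows, cap=15_000, seed=42):
    """rows: (text, category) pairs. Randomly downsample any category
    above `cap`; keep smaller categories as-is (no synthetic samples)."""
    rng = random.Random(seed)
    by_cat = {}
    for row in rows:
        by_cat.setdefault(row[1], []).append(row)
    capped = []
    for items in by_cat.values():
        capped.extend(rng.sample(items, cap) if len(items) > cap else items)
    return capped

def class_weights(rows):
    """Inverse-frequency weights (scikit-learn 'balanced' heuristic):
    w_c = n_samples / (n_classes * count_c)."""
    counts = Counter(cat for _, cat in rows)
    n, k = len(rows), len(counts)
    return {cat: n / (k * c) for cat, c in counts.items()}
```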
Despite high accuracy and F1-scores, detailed analysis revealed a consistent bias toward the Medium priority class—especially for borderline or ambiguous emails. To address this, a post-training probability adjustment layer was developed, refining model predictions using domain-specific signals such as:
- Urgency or risk keywords (e.g., “urgent”, “critical”, “failure”)
- Action verbs and deadline phrases (“due today”, “immediately”)
- Uppercase emphasis in subject lines
- Informational/courtesy phrases for Low priority (“FYI”, “reminder”)
This adjustment applies:
- Weighted scoring based on detected context
- Nonlinear scaling to boost or reduce probabilities (e.g., cubic boost for strong urgency)
- Class threshold logic to avoid over-assigning Medium unless other classes lack strong signals
This hybrid approach ensures the system delivers more reliable, human-aligned prioritization—especially in edge cases where pure statistical models may fail. The result is a practical, interpretable solution for real-world email management.
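A simplified sketch of such an adjustment layer follows; the keyword lists, weights, and exponents below are illustrative stand-ins, not the project's tuned values:

```python
# Illustrative signal lists; the project's actual keyword sets are larger.
URGENT_SIGNALS = {"urgent", "critical", "failure", "immediately", "due today"}
LOW_SIGNALS = {"fyi", "reminder"}

def adjust_priorities(text, probs):
    """Post-training adjustment sketch: nonlinearly boost High on
    urgency/deadline signals (cubic here), boost Low on courtesy
    phrases, and add a mild boost for uppercase emphasis. Medium stays
    dominant only when no other class picks up a strong signal."""
    t = text.lower()
    p = dict(probs)
    hits = sum(kw in t for kw in URGENT_SIGNALS)
    if hits:
        p["High"] *= (1 + 0.5 * hits) ** 3        # cubic boost for strong urgency
    if any(kw in t for kw in LOW_SIGNALS):
        p["Low"] *= 2.0                           # "FYI"/"reminder" -> Low
    if any(w.isupper() and len(w) > 3 for w in text.split()):
        p["High"] *= 1.5                          # uppercase emphasis in subject
    total = sum(p.values())                       # renormalize to probabilities
    return {k: v / total for k, v in p.items()}
```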
- Gmail API: Secure email access and management
- Google Gemini (Generative AI): Summarization, smart replies
- Google Cloud AI Platform: Model deployment and scaling
- pgvector: Semantic search in PostgreSQL
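For reference, a pgvector similarity lookup uses the `<=>` cosine-distance operator. The helper below only builds the SQL; the table and column names are illustrative, not the project's exact schema:

```python
def similar_emails_sql(limit=5):
    """Sketch of a pgvector semantic-search query: '<=>' is pgvector's
    cosine-distance operator, and %(query_vec)s is bound to the query
    embedding by the DB driver (e.g. psycopg2)."""
    return (
        "SELECT id, subject, embedding <=> %(query_vec)s AS distance "
        "FROM emails "
        "ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {int(limit)}"
    )
```

With psycopg2 this would be executed as, e.g., `cur.execute(similar_emails_sql(10), {'query_vec': query_embedding})`, where `query_embedding` is the vector for the search text.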
- Python 3.9 or higher
- PostgreSQL 12+ with pgvector extension
- Gmail API credentials
- Google Gemini API key
1. Clone the repository:

   ```bash
   git clone <your-repo-url>
   cd fyp-email-analyzer
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   # Windows:
   venv\Scripts\activate
   # macOS/Linux:
   source venv/bin/activate
   ```

3. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Database setup:
   - Install PostgreSQL and create a database named `email_db`
   - Install the pgvector extension:

     ```sql
     CREATE EXTENSION IF NOT EXISTS vector;
     ```

   - Update database credentials in `email_project/settings.py` or use environment variables

5. Configure environment variables: create a `.env` file in the root directory with:

   ```bash
   # Gmail API
   GMAIL_CLIENT_ID=your_gmail_client_id
   GMAIL_CLIENT_SECRET=your_gmail_client_secret

   # Google Gemini API
   GEMINI_API_KEY=your_gemini_api_key

   # Database (if using environment variables)
   DB_NAME=email_db
   DB_USER=postgres
   DB_PASSWORD=your_password
   DB_HOST=localhost
   DB_PORT=5432
   ```

6. Run database migrations:

   ```bash
   python manage.py migrate
   ```

7. Create a superuser (optional):

   ```bash
   python manage.py createsuperuser
   ```

8. Launch the development server:

   ```bash
   python manage.py runserver
   ```

9. Access the application:
   - Open your browser and navigate to `http://localhost:8000`
   - Use "Login with Google" to authenticate and connect your Gmail account
- Login: Use "Login with Google" (OAuth2) to connect your inbox.
- Inbox: View emails with AI-prioritized and categorized labels.
- Voice Commands: Click the mic icon and say commands like "Read the first email", "Reply", or "Compose".
- Analytics: Explore the dashboard for trends and insights.
- Smart Replies: Use AI-generated suggestions for quick responses.
The main email table includes:
- `id`, `user_email`, `subject`, `sender`, `recipients`, `date`, `snippet`, `has_attachments`, `attachments`, `star`, `label`, `folder`, `last_modified`, `priority`, `priority_score`, `priority_explanation`, `priority_last_updated`, `category`, `category_confidence`, `category_last_updated`
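As illustrative DDL only (column types are assumptions inferred from the field names, not the project's exact migration output):

```sql
-- Sketch of the main email table; types are illustrative guesses.
CREATE TABLE email (
    id                    SERIAL PRIMARY KEY,
    user_email            TEXT NOT NULL,
    subject               TEXT,
    sender                TEXT,
    recipients            TEXT,
    date                  TIMESTAMPTZ,
    snippet               TEXT,
    has_attachments       BOOLEAN DEFAULT FALSE,
    attachments           JSONB,
    star                  BOOLEAN DEFAULT FALSE,
    label                 TEXT,
    folder                TEXT,
    last_modified         TIMESTAMPTZ,
    priority              TEXT,
    priority_score        REAL,
    priority_explanation  TEXT,
    priority_last_updated TIMESTAMPTZ,
    category              TEXT,
    category_confidence   REAL,
    category_last_updated TIMESTAMPTZ
);
```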
- On-demand summarization for long threads
- Sentiment analysis for emotional context
- One-click deployment scripts
- Enhanced analytics (e.g., cohort analysis)
- Real-time notifications for high-priority emails
Based on the findings and limitations of this project, several recommendations are proposed for future improvements:
1. Improve OAuth Token Handling for Continuous Data Fetching: Enhance OAuth token management to reduce the need for frequent user reauthentication. Implement more efficient token refresh mechanisms or extend session handling for uninterrupted, real-time email fetching.
2. Enhance AI-Generated Suggested Reply Quality: Improve the contextual relevance and accuracy of AI-generated replies by fine-tuning Gemini API models with domain-specific data or integrating multi-turn dialogue capabilities for more sophisticated responses.
3. Strengthen Model Robustness for Ambiguous Emails: Expand the training dataset with more complex, real-world emails and collect user feedback to improve model learning. Explore advanced feature engineering (e.g., context-aware embeddings, hierarchical classification) to better handle challenging or borderline emails.
4. Expand Voice Command Functionalities: Enable users to edit email content verbally and extend voice commands to other pages (e.g., the Analytics Dashboard) for tasks like reading aloud key insights. Incorporate advanced speech-to-text features for improved accessibility and user experience.
5. Optimize Data Preprocessing Using More Advanced LLMs: Use more powerful LLMs for dataset labeling, given improved computational resources, to produce higher-quality labeled data and improve model performance.
6. Upgrade Deployment Infrastructure for Larger Models: Upgrade server hardware, expand storage, and ensure environment compatibility to support the deployment of more advanced machine learning models and realize their full potential.
- Login Page

- Dashboard Page

- Inbox Page

- Inbox Detail Page

- Sent Page

- Sent Detail Page

- Compose Window

- Spam Page

- Spam Detail Page

- Settings Page

Contributions are welcome! Please open an issue or submit a pull request. All features are tested on a mix of real and synthetic data for reliability and privacy.
- Lee Wen Kang
- Connect on LinkedIn