WeWrite Search System Documentation

Core Principles

1. Precision Over Recall

Better to show 5 relevant results than 50 irrelevant ones
Users prefer accurate, targeted results over comprehensive but noisy results
Quality trumps quantity in search experience

2. Zero Tolerance for Irrelevant Results

If a result doesn't match the query, it must score 0
No fallback scoring or padding with random content
Strict relevance filtering prevents "hallucinations"

3. Transparent Scoring

Scoring must be explainable and consistent
Higher scores for better matches (exact > partial > substring)
Clear hierarchy: title matches > content matches

Search Quality Standards

Relevance Requirements

✅ Must Have:

All results must contain the search term (in title or content)
Results must be ranked by relevance score
No random or unrelated content

❌ Must Not Have:

Results that don't contain any part of the search query
Fallback content when few results exist
Inconsistent scoring between similar matches

Data Quality Requirements

✅ Must Have:

Valid usernames for all page results
Complete page titles (no "Untitled" unless actually untitled)
Proper page metadata (ownership, status)

❌ Must Not Have:

"Missing username" placeholders
Broken or invalid page references
Inconsistent data between search and page views

Implementation Checklist

For Search Algorithm Changes

Test with queries that should return few/no results
Verify no fallback scoring for unmatched content
Ensure consistent scoring across title and content
Test edge cases (very short queries, special characters)
Document any scoring changes

For Search API Updates

Populate all required fields (username, title, metadata)
Implement proper error handling for missing data
Use consistent field names across all endpoints
Add logging for debugging data quality issues
Test with real production data

For Frontend Components

Handle missing data gracefully (no "Missing username")
Implement proper loading states
Show clear "no results" messages when appropriate
Don't pad empty results with placeholder content
Provide helpful search suggestions for no-result queries

Testing Guidelines

Manual Testing Scenarios

Specific Term Search: Search for "protests" - all results should relate to protests
Rare Term Search: Search for very specific terms - should return few but relevant results
No Results Search: Search for nonsense terms - should return empty results, not random content
Username Verification: Check that all results show proper usernames, not "Missing username"
Mixed Content: Verify both title and content matches appear appropriately ranked

Automated Testing

// Example test for relevance
test('search results must be relevant', async () => {
  const results = await searchAPI('protests');
  
  results.forEach(result => {
    const hasTermInTitle = result.title.toLowerCase().includes('protests');
    const hasTermInContent = result.content?.toLowerCase().includes('protests');
    
    expect(hasTermInTitle || hasTermInContent).toBe(true);
    expect(result.matchScore).toBeGreaterThan(0);
  });
});

// Example test for data quality
test('search results must have valid usernames', async () => {
  const results = await searchAPI('test');
  
  results.forEach(result => {
    expect(result.username).toBeDefined();
    expect(result.username).not.toBe('Missing username');
    expect(result.username).not.toBe('Anonymous');
  });
});

Common Issues and Solutions

Issue: Irrelevant Results

Symptoms: Results that don't contain the search term Solution: Check scoring function for fallback logic, ensure score = 0 for no match

Issue: Missing Usernames

Symptoms: "Missing username" appearing in search results Solution: Populate usernames in search API, don't rely on frontend fallback

Issue: Poor Ranking

Symptoms: Less relevant results appearing before more relevant ones Solution: Review scoring weights, ensure title matches score higher than content

Issue: Too Few Results

Symptoms: Users complain about not finding content they know exists Solution: Check for artificial limits, ensure comprehensive search coverage

Issue: Too Many Irrelevant Results

Symptoms: Users complain about noise in search results Solution: Tighten relevance criteria, increase minimum score threshold

Monitoring and Metrics

Key Metrics to Track

Result Relevance Rate: % of results that actually match the query
Username Completion Rate: % of results with valid usernames
Search Success Rate: % of searches that return at least one result
User Satisfaction: Click-through rates on search results

Quality Alerts

Set up monitoring for:

High percentage of "Missing username" in results
Searches returning 0 results for common terms
Unusual spikes in irrelevant result reports
Performance degradation in search response times

Best Practices Summary

Do ✅

Return 0 score for irrelevant results
Populate all required data fields in search API
Test with edge cases and real user queries
Document scoring changes and rationale
Monitor search quality metrics continuously

Don't ❌

Use fallback scoring for unmatched content
Pad results with random content when few matches exist
Rely on frontend to fix missing backend data
Make scoring changes without comprehensive testing
Ignore data quality issues in search results

Search Algorithm Architecture

Two-Phase Search Approach

WeWrite uses a sophisticated two-phase search algorithm to overcome Firestore's limitations while maintaining excellent performance:

Phase 1: Fast Firestore Queries (PREFIX Matching)

Uses Firestore range queries (where('title', '>=', searchTerm))
LIMITATION: Only matches titles that START with the search term
Example: Searching "masses" will NOT find "Who are the American masses?"
Purpose: Fast results for prefix matches with minimal database reads

Phase 2: Comprehensive Client-Side Filtering (SUBSTRING Matching)

Fetches up to 2000 pages and filters using JavaScript .includes()
SOLUTION: Finds "masses" anywhere in "Who are the American masses?" ✅
Example: Searching "masses" successfully finds all pages containing the word
Purpose: Catch all substring matches that Firestore queries miss

Why Two Phases?

Firestore has critical search limitations:

No native full-text search
Range queries only support PREFIX matching (not CONTAINS/LIKE)
Cannot search for words in the middle of strings server-side

Solution: Combine fast Firestore prefix queries with comprehensive client-side substring matching to provide complete, intuitive search results.

Search Scoring System Details

Scoring Scale (0-100) - UPDATED December 2025

Title Matches (Higher Priority)

100: Exact match (title exactly equals search term)
95: Starts with search term
80: Contains search term as substring ⭐ IMPROVED - Critical for finding "masses" in "Who are the American masses?"
75: All search words found as complete words
70: Contains all search words (non-sequential)
65: Sequential word matches in order
50: Partial word matches

Content Matches (Lower Priority)

80: Exact match in content
75: Starts with search term in content
60: Contains search term as substring in content
55: All search words found as complete words in content
50: Contains all search words in content (non-sequential)
45: Sequential word matches in content
35: Partial word matches in content

No Match

0: No relevance to search query (CRITICAL: prevents irrelevant results)

Key Algorithm Improvements (December 2025)

Problem Fixed: Users reported that searching "masses" didn't find "Who are the American masses?"

Root Cause: Firestore prefix queries only match from the start of the string.

Solution Implemented:

Increased client-side search from 500 to 2000 pages
Prioritized substring matching (moved from 75 to 80 points)
Simplified scoring hierarchy for more intuitive results
Added comprehensive documentation explaining the two-phase approach

Implementation Details

Primary Scoring Function

Located in: app/api/search-unified/route.ts (lines 184-256)

/**
 * SIMPLIFIED and IMPROVED search scoring
 *
 * Prioritizes intuitive matching:
 * 1. Exact matches (100 points)
 * 2. Starts with search term (95 points)
 * 3. Contains search term as substring (80 points) ⭐ KEY FIX
 * 4. All words found (70 points)
 * 5. Partial word matches (50 points)
 */
function calculateSearchScore(text, searchTerm, isTitle = false, isContentMatch = false) {
  if (!text || !searchTerm) return 0;

  const normalizedText = text.toLowerCase();
  const normalizedSearch = searchTerm.toLowerCase();

  // Exact match (highest score)
  if (normalizedText === normalizedSearch) {
    return isTitle ? 100 : 80;
  }

  // Starts with search term (very high score)
  if (normalizedText.startsWith(normalizedSearch)) {
    return isTitle ? 95 : 75;
  }

  // IMPROVED: Contains search term as substring (high score)
  // This is CRITICAL for finding "masses" in "Who are the American masses?"
  if (normalizedText.includes(normalizedSearch)) {
    return isTitle ? 80 : 60;
  }

  // Additional word matching logic...
}

Comprehensive Fallback Search

Located in: app/api/search-unified/route.ts (lines 420-493)

// CRITICAL FIX: Firestore range queries only support PREFIX matching
// Solution: Always perform comprehensive client-side search for substring matches
const broadQuery = query(
  collection(db, getCollectionName('pages')),
  limit(2000) // Increased from 500 to catch more matches
);

// Client-side filtering using .includes() for true substring matching
const titleLower = pageTitle.toLowerCase();
if (titleLower.includes(searchTermLower)) {
  hasMatch = true; // ✅ Finds "masses" in "Who are the American masses?"
}

Uh oh!

FilesExpand file tree

SEARCH_SYSTEM.md

Latest commit

History