| 1 | +You are writing production-grade code as a senior engineer. |
| 2 | + |
| 3 | +The code must be secure, readable, and boring in the good way. |
| 4 | + |
| 5 | +General rules: |
| 6 | + |
| 7 | +1. No noob patterns |
| 8 | + - No global mutable state |
| 9 | + - No God objects or God files |
| 10 | + - No magic numbers or strings |
| 11 | + - No copy-paste logic |
| 12 | + - No commented-out code |
| 13 | + |
| 14 | +2. Naming discipline |
| 15 | + - Variable names must encode intent, not type |
| 16 | + - Function names must describe behavior, not implementation |
| 17 | + - Avoid abbreviations unless universally obvious |
| 18 | + - Prefer explicit over clever |
| 19 | + |
| 20 | +3. Structure and boundaries |
| 21 | + - Keep files small and focused |
| 22 | + - One responsibility per module |
| 23 | + - Dependencies flow inward, never outward |
| 24 | + - Core logic must be framework-agnostic |
| 25 | + |
| 26 | +4. Error handling |
| 27 | + - Never swallow errors |
| 28 | + - Errors must be actionable and contextual |
| 29 | + - Distinguish between user errors and system errors |
| 30 | + - Fail fast for programmer errors |
| 31 | + - Validate inputs at boundaries only |
| 32 | + |
| 33 | +5. Security by default |
| 34 | + - Treat all input as untrusted |
| 35 | + - Validate and sanitize inputs explicitly |
| 36 | + - Use allowlists, not blocklists |
| 37 | + - Avoid reflection and dynamic execution |
| 38 | + - Never trust client-side validation |
| 39 | + - Do not log secrets or tokens |
| 40 | + - Secrets must come from environment or config files |
| 41 | + - Use constant-time comparisons where applicable |
| 42 | + |
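As a minimal sketch of the constant-time comparison rule in section 5, Python's standard library offers `hmac.compare_digest`. The helper name `tokens_match` is hypothetical and only for illustration.

```python
import hmac


def tokens_match(presented_token: str, expected_token: str) -> bool:
    """Compare two secrets without revealing where they first differ."""
    # compare_digest is designed so its running time does not depend on
    # the position of the first mismatching byte, unlike ==.
    return hmac.compare_digest(
        presented_token.encode("utf-8"),
        expected_token.encode("utf-8"),
    )
```

A plain `==` on strings can return early at the first differing character, which is the timing signal this rule is meant to avoid.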
| 43 | +6. Authentication and authorization |
| 44 | + - Explicit auth checks, never implicit |
| 45 | + - Authorization must be centralized |
| 46 | + - No role checks scattered across code |
| 47 | + - Deny by default |
| 48 | + |
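One possible shape for the centralized, deny-by-default authorization in section 6 is sketched below. The roles, actions, and the `require_permission` helper are hypothetical names chosen for illustration, not part of any framework.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Role(Enum):
    GUEST = "guest"
    VIEWER = "viewer"
    EDITOR = "editor"


class Action(Enum):
    READ = "read"
    WRITE = "write"


class PermissionDenied(Exception):
    """Raised when a role is not explicitly allowed to perform an action."""


# The single place where permissions are defined (hypothetical policy).
# GUEST is deliberately absent to show the deny-by-default path.
ALLOWED_ACTIONS: Dict[Role, FrozenSet[Action]] = {
    Role.VIEWER: frozenset({Action.READ}),
    Role.EDITOR: frozenset({Action.READ, Action.WRITE}),
}


def require_permission(role: Role, action: Action) -> None:
    """Raise PermissionDenied unless the policy table allows the action."""
    # A role missing from the table gets an empty allowlist.
    allowed = ALLOWED_ACTIONS.get(role, frozenset())
    if action not in allowed:
        raise PermissionDenied(f"role {role.value!r} may not {action.value!r}")
```

Call sites then make one explicit call such as `require_permission(current_role, Action.WRITE)` instead of comparing role names inline.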
| 49 | +7. Data handling |
| 50 | + - Parameterized queries only |
| 51 | + - No raw SQL string concatenation |
| 52 | + - Enforce data constraints at both app and DB level |
| 53 | + - Use transactions for multi-step writes |
| 54 | + - Never assume DB writes succeed |
| 55 | + |
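The data-handling rules in section 7 could look like this with the standard-library `sqlite3` module. This is a sketch under assumptions: the `accounts` table, the `TransferRejected` exception, and the `transfer_credits` function are hypothetical.

```python
import sqlite3


class TransferRejected(Exception):
    """Raised when a transfer cannot be applied; nothing is written."""


def transfer_credits(
    connection: sqlite3.Connection,
    sender_id: int,
    receiver_id: int,
    amount: int,
) -> None:
    if amount <= 0:
        raise ValueError("amount must be a positive integer")
    # Used as a context manager, the connection commits when the block
    # succeeds and rolls back when it raises, so both updates land or neither.
    with connection:
        debit = connection.execute(
            # Values are bound with ? placeholders, never concatenated.
            "UPDATE accounts SET credits = credits - ? "
            "WHERE id = ? AND credits >= ?",
            (amount, sender_id, amount),
        )
        # Do not assume the write succeeded: check how many rows changed.
        if debit.rowcount != 1:
            raise TransferRejected("sender is missing or has too few credits")
        credit = connection.execute(
            "UPDATE accounts SET credits = credits + ? WHERE id = ?",
            (amount, receiver_id),
        )
        if credit.rowcount != 1:
            raise TransferRejected("receiver is missing")
```

If the second update fails, the rollback also undoes the debit, which is the point of wrapping multi-step writes in one transaction.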
| 56 | +8. Concurrency and state |
| 57 | + - Assume code may run concurrently |
| 58 | + - Avoid shared mutable state |
| 59 | + - Use locks only when unavoidable |
| 60 | + - Prefer immutability and message passing |
| 61 | + - Document concurrency assumptions |
| 62 | + |
| 63 | +9. Logging and observability |
| 64 | + - Logs must be structured |
| 65 | + - Log intent, not noise |
| 66 | + - No debug logs in hot paths |
| 67 | + - Do not log PII by default |
| 68 | + |
| 69 | +10. Configuration |
| 70 | + - Config must be explicit and typed |
| 71 | + - Fail on missing config |
| 72 | + - No hidden defaults for security-sensitive values |
| 73 | + |
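A small sketch of explicit, typed configuration that fails on missing values, as section 10 asks. The setting names and the `ServiceConfig` type are hypothetical; the mapping is passed in so the loader can be tested without touching the real environment.

```python
from dataclasses import dataclass
from typing import Mapping


class ConfigError(RuntimeError):
    """Raised at startup when a setting is missing or malformed."""


@dataclass(frozen=True)
class ServiceConfig:
    database_url: str
    session_signing_key: str
    request_timeout_seconds: float


def load_config(environment: Mapping[str, str]) -> ServiceConfig:
    def require(name: str) -> str:
        value = environment.get(name)
        if not value:
            # The message names the setting but never echoes a value.
            raise ConfigError(f"required setting {name} is not set")
        return value

    raw_timeout = require_timeout = None  # placeholders for readability
    database_url = require("DATABASE_URL")
    # No fallback for the signing key: a silent default would be a
    # hidden security-sensitive value.
    session_signing_key = require("SESSION_SIGNING_KEY")
    raw_timeout = require("REQUEST_TIMEOUT_SECONDS")
    try:
        timeout = float(raw_timeout)
    except ValueError:
        raise ConfigError("REQUEST_TIMEOUT_SECONDS must be a number") from None
    return ServiceConfig(database_url, session_signing_key, timeout)
```

In an application this would be called once at startup, for example `load_config(os.environ)`, so a misconfigured process stops before serving anything.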
| 74 | +11. Testing discipline |
| 75 | + - Core logic must be testable without frameworks |
| 76 | + - Write tests for behavior, not implementation |
| 77 | + - Test failure paths |
| 78 | + - Avoid mocks unless necessary |
| 79 | + |
| 80 | +12. Comments and documentation |
| 81 | + - No AI-like comments |
| 82 | + - No restating the obvious |
| 83 | + - Comment on "why", not "what" |
| 84 | + - Use README and ARCHITECTURE docs for big-picture decisions |
| 85 | + |
| 86 | +13. Dependency hygiene |
| 87 | + - Minimize dependencies |
| 88 | + - Avoid abandoned libraries |
| 89 | + - Pin versions |
| 90 | + - Prefer standard library when possible |
| 91 | + |
| 92 | +14. Performance mindset |
| 93 | + - Do not prematurely optimize |
| 94 | + - Avoid obvious inefficiencies |
| 95 | + - Measure before optimizing |
| 96 | + - Prefer simple algorithms unless proven insufficient |
| 97 | + |
| 98 | +15. Code review mindset |
| 99 | + - Assume this code will be reviewed by someone smarter |
| 100 | + - Make intent obvious |
| 101 | + - Optimize for future maintainers |
| 102 | + |
| 103 | +Now, the features. I've categorized them into must-have (core to functionality, derived directly from the README) and good-to-have (enhancements inspired by gaps in comparable libraries such as fuzzywuzzy, rapidfuzz, difflib, or Levenshtein crates). The good-to-haves aim to improve on those libraries by emphasizing performance, flexibility, and usability without unnecessary complexity. |
| 104 | + |
| 105 | +Must-Have Features (Core Essentials) |
| 106 | + |
| 107 | +These are non-negotiable for a viable fuzzy search library, ensuring it meets the README's promises while being faster/more efficient than pure Python alternatives. |
| 108 | + |
| 109 | +Fuzzy String Matching: Compute similarity scores between two strings using predefined algorithms. |
| 110 | +Ranking Support: Given a query and a list of candidates, return sorted results by similarity score. |
| 111 | +Multiple Scorers: Built-in support for Levenshtein (edit distance), token sort (sorted tokens comparison), and Jaccard (set similarity)—as explicitly mentioned. |
| 112 | +Partial and Full Matches: Options for substring matching (partial) vs. whole-string (full), with configurable modes. |
| 113 | +Adjustable Thresholds: User-defined score cutoffs to filter results (e.g., return only matches > 0.8 similarity). |
| 114 | +Pip-Installable Python Interface: Easy installation via pip install fuzzybunny, with Pybind11 handling C++ bindings for seamless Python usage. |
| 115 | +Lightweight Design: Minimal dependencies; no external libs beyond Pybind11 and standard C++/Python stdlibs. |
| 116 | +Unicode Support: Handle international characters properly, as fuzzy matching often deals with real-world text. |
| 117 | + |
| 118 | +Good-to-Have Features (Differentiators for Superiority) |
| 119 | +These aim to set it apart from competitors: rapidfuzz is fast and already accepts custom scorer callables, so hybrid scoring has to earn its place through convenience or speed; fuzzywuzzy (since renamed thefuzz) is simple but slow without its optional C extension. Focus on maintainability: add a feature only if it doesn't require extra deps or complex builds. |
| 120 | + |
| 121 | +Hybrid/Custom Scorers: Allow combining scorers (e.g., weighted average of Levenshtein + Jaccard) or user-defined functions—better than rigid options in fuzzywuzzy. |
| 122 | +Batch Processing: Efficient matching for large lists (e.g., vectorized operations in C++), outperforming sequential Python loops in competitors. |
| 123 | +Normalization Processors: Built-in pre-processors like lowercase, remove punctuation, or tokenization—similar to fuzzywuzzy's processors but optimized in C++ for speed. |
| 124 | +Thread Safety and Parallelism: Safe for multi-threaded use; optional parallel matching via OpenMP in C++—addresses scalability gaps in single-threaded libs. |
| 125 | +Integration with Data Structures: Seamless with Python lists, NumPy arrays, or Pandas Series/DataFrames—makes it more practical than standalone funcs in rapidfuzz. |
| 126 | +Performance Metrics: Built-in benchmarking utils to compare scorers or against baselines—helps users justify using it over others. |
| 127 | +Error Handling and Validation: Robust input checks (e.g., handle empty strings gracefully) with clear exceptions—improves usability over minimalistic libs. |
| 128 | +Extensible Bindings: Easy to add new C++ scorers without recompiling Python side—future-proofs it better than monolithic designs. |
| 129 | +Documentation and Examples: In-code examples, not just "coming soon"—includes Jupyter notebooks for quickstarts, surpassing sparse docs in some competitors. |
| 130 | +Zero-Network Privacy: No telemetry or outbound calls—aligns with privacy-first, unlike some libs with optional analytics. |
| 131 | + |
| 132 | +Not chosen: Advanced features like NLP integration (e.g., with spaCy) or database indexing (e.g., SQLite fuzzy search)—these add deps/complexity, violating lightweight ethos. Instead, keep it a pure matching lib; users can layer on top. |
| 133 | +Comprehensive Requirements and Features Document |
| 134 | +1. Introduction |
| 135 | +Project Name: FuzzyBunny |
| 136 | +Version: 0.1.0 (Development) |
| 137 | +Description: A high-performance, lightweight Python library for fuzzy string matching and ranking, implemented in C++ with Pybind11 bindings. It provides flexible similarity computations with multiple scorers, targeting use cases like search autocompletion, data deduplication, and recommendation systems. |
| 138 | +Goals: |
| 139 | + |
| 140 | +Deliver 2-5x faster matching than pure Python libs (e.g., fuzzywuzzy) via C++. |
| 141 | +Maintain simplicity: Single pip install, no runtime deps. |
| 142 | +Prioritize maintainability: Modular C++ core, framework-agnostic API. |
| 143 | + |
| 144 | +Scope Boundaries: |
| 145 | + |
| 146 | +In: String-based fuzzy matching/ranking. |
| 147 | +Out: Non-string data (e.g., image similarity), full-text search engines (use Elasticsearch instead), or web services (library only). |
| 148 | + |
| 149 | +Target Users: Developers building search features in Python apps, data scientists for cleaning datasets, or CLI tools for text processing. |
| 150 | +Assumptions Challenged: The README describes a Python library, but the core is C++; this document assumes C++ is needed for performance. If pure Python suffices, forking fuzzywuzzy would be simpler. The C++ core adds build hurdles, so it has to be justified with benchmarks. |
| 151 | + |
| 152 | +2. Functional Requirements |
| 153 | +Core API: |
| 154 | + |
| 155 | +match(query: str, target: str, scorer: str = 'levenshtein', threshold: float = 0.0) -> float: Compute similarity score (0-1 normalized). |
| 156 | +Trade-off: Normalized scores for consistency vs. raw distances (not chosen, as users prefer a bounded score that reads like a percentage). |
| 157 | + |
| 158 | +rank(query: str, candidates: List[str], scorer: str = 'levenshtein', threshold: float = 0.0, top_n: Optional[int] = None) -> List[Tuple[str, float]]: Return matches sorted by descending score. |
| 159 | +Optimization: C++-side sorting for efficiency; avoids per-comparison Python overhead. |
| 160 | + |
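The Core API above can be prototyped in pure Python before any C++ is written. The sketch below follows the `match` and `rank` signatures as listed; everything else is an assumption for illustration: the helper names, the rule that a `match` score below `threshold` is reported as 0.0, and the choice to treat two empty strings as identical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def levenshtein_distance(left: str, right: str) -> int:
    """Dynamic-programming edit distance, keeping one row at a time."""
    previous_row = list(range(len(right) + 1))
    for row_index, left_char in enumerate(left, start=1):
        current_row = [row_index]
        for column_index, right_char in enumerate(right, start=1):
            substitution_cost = 0 if left_char == right_char else 1
            current_row.append(
                min(
                    previous_row[column_index] + 1,  # delete left_char
                    current_row[column_index - 1] + 1,  # insert right_char
                    previous_row[column_index - 1] + substitution_cost,
                )
            )
        previous_row = current_row
    return previous_row[-1]


def levenshtein_similarity(left: str, right: str) -> float:
    longest = max(len(left), len(right))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(left, right) / longest


SCORERS: Dict[str, Callable[[str, str], float]] = {
    "levenshtein": levenshtein_similarity,
}


def _resolve_scorer(name: str) -> Callable[[str, str], float]:
    if name not in SCORERS:
        raise ValueError(f"unknown scorer {name!r}; expected one of {sorted(SCORERS)}")
    return SCORERS[name]


def match(query: str, target: str, scorer: str = "levenshtein",
          threshold: float = 0.0) -> float:
    score = _resolve_scorer(scorer)(query, target)
    # Assumption: a score under the threshold is reported as 0.0.
    return score if score >= threshold else 0.0


def rank(query: str, candidates: List[str], scorer: str = "levenshtein",
         threshold: float = 0.0,
         top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    if top_n is not None and top_n < 0:
        raise ValueError("top_n must be non-negative")
    score_with = _resolve_scorer(scorer)
    scored = ((candidate, score_with(query, candidate)) for candidate in candidates)
    kept = [pair for pair in scored if pair[1] >= threshold]
    # The sort is stable, so equal scores keep their input order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_n is None else kept[:top_n]
```

For example, `rank("apple", ["banana", "maple", "apply"], threshold=0.5)` keeps "apply" and "maple" in that order and drops "banana". A reference like this also gives the C++ core something to be tested against.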
| 161 | + |
| 162 | +Scorers (Must-Have): |
| 163 | + |
| 164 | +Levenshtein: Edit distance-based. |
| 165 | +Token Sort: Sort tokens then compare. |
| 166 | +Jaccard: Intersection over union for token sets. |
| 167 | +Good-to-Have: Hybrid (e.g., hybrid(weights: Dict[str, float])), Custom (user callback, but warn on perf hit). |
| 168 | + |
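To make the scorer definitions above concrete, here is a hedged sketch of token sort, Jaccard, and a weighted hybrid. It assumes whitespace tokenization and uses `difflib.SequenceMatcher.ratio()` as a stand-in for the string comparison step of token sort; that ratio is not Levenshtein, and the real library may compare differently. All function names other than `hybrid` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

Scorer = Callable[[str, str], float]


def token_sort_similarity(left: str, right: str) -> float:
    """Sort whitespace-separated tokens, then compare the rejoined strings."""
    sorted_left = " ".join(sorted(left.split()))
    sorted_right = " ".join(sorted(right.split()))
    # Stand-in comparison (assumption): difflib's ratio in [0, 1].
    return SequenceMatcher(None, sorted_left, sorted_right).ratio()


def jaccard_similarity(left: str, right: str) -> float:
    """Intersection over union of the two token sets."""
    left_tokens, right_tokens = set(left.split()), set(right.split())
    union = left_tokens | right_tokens
    if not union:
        return 1.0  # assumption: two token-free strings count as identical
    return len(left_tokens & right_tokens) / len(union)


SCORERS: Dict[str, Scorer] = {
    "token_sort": token_sort_similarity,
    "jaccard": jaccard_similarity,
}


def hybrid(weights: Dict[str, float]) -> Scorer:
    """Build a scorer that returns the weighted average of named scorers."""
    unknown = set(weights) - set(SCORERS)
    if unknown:
        raise ValueError(f"unknown scorers: {sorted(unknown)}")
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")

    def combined(left: str, right: str) -> float:
        weighted_sum = sum(
            weight * SCORERS[name](left, right) for name, weight in weights.items()
        )
        return weighted_sum / total_weight

    return combined
```

With these definitions, "new york mets" and "mets new york" score 1.0 under token sort because both sort to the same string, and "the quick fox" against "the lazy fox" scores 0.5 under Jaccard (two shared tokens out of four distinct ones).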
| 169 | +Match Modes (Must-Have): |
| 170 | + |
| 171 | +Full: Whole string comparison. |
| 172 | +Partial: Substring search with best-match. |
| 173 | +Adjustable via param: mode: str = 'full'. |
| 174 | + |
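One way to read "substring search with best-match" is to slide a window the length of the shorter string across the longer one and keep the best full-string score. The sketch below does that with a normalized Levenshtein similarity. The function names and the empty-string handling are assumptions; only the `mode` parameter and its `'full'` default come from the list above.

```python
def levenshtein_similarity(left: str, right: str) -> float:
    """1 - edit_distance / longest_length, in [0, 1]."""
    longest = max(len(left), len(right))
    if longest == 0:
        return 1.0
    previous_row = list(range(len(right) + 1))
    for row_index, left_char in enumerate(left, start=1):
        current_row = [row_index]
        for column_index, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current_row.append(
                min(
                    previous_row[column_index] + 1,
                    current_row[column_index - 1] + 1,
                    previous_row[column_index - 1] + cost,
                )
            )
        previous_row = current_row
    return 1.0 - previous_row[-1] / longest


def partial_similarity(query: str, target: str) -> float:
    """Best score of the shorter string against any same-length window."""
    shorter, longer = sorted((query, target), key=len)
    if not shorter:
        # Assumption: an empty string only matches another empty string.
        return 1.0 if not longer else 0.0
    window = len(shorter)
    return max(
        levenshtein_similarity(shorter, longer[start:start + window])
        for start in range(len(longer) - window + 1)
    )


def match_with_mode(query: str, target: str, mode: str = "full") -> float:
    if mode == "full":
        return levenshtein_similarity(query, target)
    if mode == "partial":
        return partial_similarity(query, target)
    raise ValueError(f"unknown mode {mode!r}; expected 'full' or 'partial'")
```

Matching "york" against "new york city" shows the difference: partial mode finds the exact window and returns 1.0, while full mode pays for the nine extra characters and scores well under 0.5. The window scan costs one full comparison per offset, which is a cost the C++ core would need to budget for.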
| 175 | +Processors (Good-to-Have): |
| 176 | + |
| 177 | +Pre-match normalization: process: bool = True (lowercase, strip punctuation). |
| 178 | +Trade-off: Enabled by default for usability, but optional to avoid altering inputs (privacy concern). |
| 179 | + |
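A possible normalization step behind the `process` flag is sketched below. It goes slightly beyond "lowercase, strip punctuation" to stay Unicode-aware: it applies NFKC normalization, uses `casefold()` instead of `lower()`, and drops any character whose Unicode category is punctuation. Those choices, the whitespace collapsing, and the names `normalize` and `prepare` are assumptions.

```python
import unicodedata


def normalize(text: str) -> str:
    """Case-fold, remove punctuation, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    without_punctuation = "".join(
        character
        for character in folded
        # Unicode general categories starting with "P" are punctuation.
        # Symbols such as "$" or "+" are a different category and are kept.
        if not unicodedata.category(character).startswith("P")
    )
    return " ".join(without_punctuation.split())


def prepare(text: str, process: bool = True) -> str:
    """Return the string to score: normalized, or untouched if process=False."""
    return normalize(text) if process else text
```

Because `normalize` returns a new string, the caller's input is never modified; `process=False` simply skips the step for users who need exact comparison.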
| 180 | +Batch Support (Good-to-Have): |
| 181 | + |
| 182 | +batch_match(queries: List[str], targets: List[List[str]], ...) -> List[List[Tuple[str, float]]]. |
| 183 | +Parallel via threads if parallel: bool = True. |
| 184 | + |
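The batch signature above pairs each query with its own candidate list. A hedged pure-Python sketch follows, using `concurrent.futures.ThreadPoolExecutor` for the `parallel` flag and `difflib` as a stand-in scorer. The `scorer` and `threshold` parameters stand in for the elided `...` and are assumptions, as are the helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]
Ranked = List[Tuple[str, float]]


def stand_in_scorer(query: str, candidate: str) -> float:
    # Placeholder for the library's real scorer (assumption).
    return SequenceMatcher(None, query, candidate).ratio()


def rank_one(query: str, candidates: List[str], scorer: Scorer,
             threshold: float) -> Ranked:
    scored = [(candidate, scorer(query, candidate)) for candidate in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


def batch_match(queries: List[str], targets: List[List[str]],
                scorer: Scorer = stand_in_scorer, threshold: float = 0.0,
                parallel: bool = False) -> List[Ranked]:
    if len(queries) != len(targets):
        raise ValueError(
            f"got {len(queries)} queries but {len(targets)} candidate lists"
        )
    pairs = list(zip(queries, targets))
    if not parallel:
        return [rank_one(query, candidates, scorer, threshold)
                for query, candidates in pairs]
    with ThreadPoolExecutor() as executor:
        # executor.map yields results in input order, so result i
        # always belongs to query i.
        return list(
            executor.map(
                lambda pair: rank_one(pair[0], pair[1], scorer, threshold),
                pairs,
            )
        )
```

With a pure-Python scorer the threads mostly take turns under the interpreter lock, so this shows the interface and ordering guarantee rather than a speedup. Real gains would depend on the compiled core doing the scoring outside that lock.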
| 185 | +Integrations (Good-to-Have): |
| 186 | + |
| 187 | +Pandas: Extension methods like Series.fuzzy_match(query). |
| 188 | +NumPy: Vectorized matching for arrays. |
| 189 | + |
| 190 | +3. Non-Functional Requirements |
| 191 | +Performance: |
| 192 | + |
| 193 | +<1ms per match for 100-char strings on standard hardware. |
| 194 | +Scale to 10k candidates without slowdown (benchmark vs. rapidfuzz). |
| 195 | +Trade-off: C++ for speed, but not GPU (unnecessary complexity). |
| 196 | + |
| 197 | +Usability: |
| 198 | + |
| 199 | +Intuitive API: Mirror fuzzywuzzy but faster. |
| 200 | +Docs: Sphinx-generated, with examples. |
| 201 | +Installation: pip install builds C++ automatically via setup.py. |
| 202 | + |
| 203 | +Security/Privacy: |
| 204 | + |
| 205 | +No network calls. |
| 206 | +Secure by default: No file I/O unless explicit. |
| 207 | +Handle secrets in strings without logging. |
| 208 | + |
| 209 | +Maintainability: |
| 210 | + |
| 211 | +Core logic in C++ (framework-agnostic). |
| 212 | +Tests: 90% coverage with pytest (Python) and Catch2 (C++). |
| 213 | +CI/CD: GitHub Actions for builds/tests. |
| 214 | +One-command setup: pip install -e . for dev. |
| 215 | + |
| 216 | +Compatibility: |
| 217 | + |
| 218 | +Python 3.8+. |
| 219 | +OS: Linux/macOS/Windows (cross-compile). |
| 220 | +No deps beyond Pybind11 (bundled). |
| 221 | + |
| 222 | +4. Architecture Overview |
| 223 | + |
| 224 | +Layers: |
| 225 | +C++ Core: Scorers and matching algos (single module for minimalism, no microservices). |
| 226 | +Pybind11 Bindings: Expose as Python funcs/classes. |
| 227 | +Python API: Thin wrapper for usability. |
| 228 | + |
| 229 | +Data Flow: Input strings -> Normalize (opt) -> Compute in C++ -> Return scores. |
| 230 | +Trade-offs: Embedded DB? No, use lists for candidates (simpler than SQLite). Redis? Avoid unless user adds for caching. |
| 231 | +Alternatives Considered: |
| 232 | +Pure Python: Simpler build, but slower—ranked lower for perf goals. |
| 233 | +Rust + PyO3: Safer, but Pybind11 chosen for C++ expertise assumption. |
| 234 | +Recommended: Proceed with C++/Pybind11, but prototype pure Python first to validate. |
| 235 | + |
| 236 | + |
| 237 | +5. Risks and Mitigations |
| 238 | + |
| 239 | +Build Failures: Mitigate with pre-built wheels on PyPI. |
| 240 | +Perf Bottlenecks: Benchmark early; fallback to rapidfuzz if not superior. |
| 241 | +Scope Creep: Limit to README features; good-to-haves as v2. |
| 242 | +Bad Idea Alert: If custom scorers add >20% complexity without use cases, drop them—focus on speed. |
| 243 | + |
| 244 | +6. Roadmap |
| 245 | + |
| 246 | +MVP: Must-haves by v0.1. |
| 247 | +v0.2: Good-to-haves. |
| 248 | +Future: CLI wrapper reusing core (e.g., fuzzybunny match query target). |
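The `fuzzybunny match query target` command from the roadmap could be wired up with `argparse` roughly as follows. This is a sketch only: the scorer is a `difflib` stand-in for the library core, and the output format and exit codes are assumptions.

```python
import argparse
from difflib import SequenceMatcher
from typing import List, Optional


def stand_in_score(query: str, target: str) -> float:
    # Placeholder for a call into the library core (assumption).
    return SequenceMatcher(None, query, target).ratio()


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="fuzzybunny")
    subcommands = parser.add_subparsers(dest="command", required=True)
    match_command = subcommands.add_parser(
        "match", help="score one query against one target"
    )
    match_command.add_argument("query")
    match_command.add_argument("target")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse argv (or sys.argv when None), print the score, return an exit code."""
    arguments = build_parser().parse_args(argv)
    if arguments.command == "match":
        print(f"{stand_in_score(arguments.query, arguments.target):.3f}")
        return 0
    return 2
```

Keeping `main` a plain function that takes `argv` and returns an exit code lets tests call it directly; an entry point would then only need `sys.exit(main())`.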