Commit 8d0f75f

Add hybrid scorers, Pandas/NumPy integration, and benchmarking tools
1 parent 6860c92 commit 8d0f75f

13 files changed

Lines changed: 571 additions & 75 deletions

GEMINI.md

Lines changed: 248 additions & 0 deletions
@@ -0,0 +1,248 @@
You are writing production-grade code as a senior engineer.

The code must be secure, readable, and boring in the good way.

General rules:

1. No noob patterns
- No global mutable state
- No God objects or God files
- No magic numbers or strings
- No copy-paste logic
- No commented-out code

2. Naming discipline
- Variable names must encode intent, not type
- Function names must describe behavior, not implementation
- Avoid abbreviations unless universally obvious
- Prefer explicit over clever

3. Structure and boundaries
- Keep files small and focused
- One responsibility per module
- Dependencies flow inward, never outward
- Core logic must be framework-agnostic

4. Error handling
- Never swallow errors
- Errors must be actionable and contextual
- Distinguish between user errors and system errors
- Fail fast for programmer errors
- Validate inputs at boundaries only

5. Security by default
- Treat all input as untrusted
- Validate and sanitize inputs explicitly
- Use allowlists, not blocklists
- Avoid reflection and dynamic execution
- Never trust client-side validation
- Do not log secrets or tokens
- Secrets must come from environment or config files
- Use constant-time comparisons where applicable
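To make the last rule concrete, a minimal sketch using the standard library's `hmac.compare_digest` (the function name and values are illustrative, not part of any codebase here):

```python
import hmac

def tokens_match(supplied: str, expected: str) -> bool:
    # hmac.compare_digest runs in time independent of where the inputs
    # first differ, so an attacker cannot learn a secret byte-by-byte
    # from response timing the way a plain `==` comparison can leak it.
    return hmac.compare_digest(supplied.encode(), expected.encode())

print(tokens_match("s3cret", "s3cret"))  # True
print(tokens_match("s3cret", "s3cre7"))  # False
```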
6. Authentication and authorization
- Explicit auth checks, never implicit
- Authorization must be centralized
- No role checks scattered across code
- Deny by default

7. Data handling
- Parameterized queries only
- No raw SQL string concatenation
- Enforce data constraints at both app and DB level
- Use transactions for multi-step writes
- Never assume DB writes succeed
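A sketch of the parameterized-query and transaction rules using Python's built-in `sqlite3` (the table and function are illustrative):

```python
import sqlite3

def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    # Placeholders (?) keep data out of the SQL text, preventing injection;
    # the `with conn:` block wraps both writes in one transaction that is
    # committed on success and rolled back if either UPDATE raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
transfer(conn, 1, 2, 40)
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(60,), (40,)]
```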
8. Concurrency and state
- Assume code may run concurrently
- Avoid shared mutable state
- Use locks only when unavoidable
- Prefer immutability and message passing
- Document concurrency assumptions

9. Logging and observability
- Logs must be structured
- Log intent, not noise
- No debug logs in hot paths
- Do not log PII by default

10. Configuration
- Config must be explicit and typed
- Fail on missing config
- No hidden defaults for security-sensitive values
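A sketch of explicit, typed, fail-fast config loading (the class and key names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppConfig:
    database_url: str
    request_timeout_s: float

def load_config(env: dict) -> AppConfig:
    # Fail loudly on a missing key instead of falling back to a hidden
    # default; convert values to their declared types at the boundary.
    try:
        return AppConfig(
            database_url=env["DATABASE_URL"],
            request_timeout_s=float(env["REQUEST_TIMEOUT_S"]),
        )
    except KeyError as missing:
        raise RuntimeError(f"missing required config key: {missing}") from None

config = load_config(
    {"DATABASE_URL": "postgres://localhost/app", "REQUEST_TIMEOUT_S": "2.5"}
)
print(config.request_timeout_s)  # 2.5
```

In real use, `env` would typically be `os.environ`; passing a plain dict keeps the loader testable without global state.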
11. Testing discipline
- Core logic must be testable without frameworks
- Write tests for behavior, not implementation
- Test failure paths
- Avoid mocks unless necessary

12. Comments and documentation
- No AI-like comments
- No restating the obvious
- Comment on "why", not "what"
- Use README and ARCHITECTURE docs for big-picture decisions

13. Dependency hygiene
- Minimize dependencies
- Avoid abandoned libraries
- Pin versions
- Prefer standard library when possible

14. Performance mindset
- Do not prematurely optimize
- Avoid obvious inefficiencies
- Measure before optimizing
- Prefer simple algorithms unless proven insufficient

15. Code review mindset
- Assume this code will be reviewed by someone smarter
- Make intent obvious
- Optimize for future maintainers
Now, listing features. I've categorized them into must-have (core to functionality, derived directly from the README) and good-to-have (enhancements for superiority, inspired by gaps in competitors such as fuzzywuzzy, rapidfuzz, difflib, and python-Levenshtein). These make it better by emphasizing performance, flexibility, and usability without unnecessary complexity.

Must-Have Features (Core Essentials)

These are non-negotiable for a viable fuzzy search library, ensuring it meets the README's promises while being faster and more efficient than pure Python alternatives.

- Fuzzy String Matching: Compute similarity scores between two strings using predefined algorithms.
- Ranking Support: Given a query and a list of candidates, return results sorted by similarity score.
- Multiple Scorers: Built-in support for Levenshtein (edit distance), token sort (sorted-token comparison), and Jaccard (set similarity), as explicitly mentioned.
- Partial and Full Matches: Options for substring matching (partial) vs. whole-string (full), with configurable modes.
- Adjustable Thresholds: User-defined score cutoffs to filter results (e.g., return only matches > 0.8 similarity).
- Pip-Installable Python Interface: Easy installation via pip install fuzzybunny, with Pybind11 handling C++ bindings for seamless Python usage.
- Lightweight Design: Minimal dependencies; nothing beyond Pybind11 and the standard C++/Python libraries.
- Unicode Support: Handle international characters properly, since fuzzy matching often deals with real-world text.
Good-to-Have Features (Differentiators for Superiority)

These elevate it above competitors: rapidfuzz is fast but lacks hybrid/custom scorers; fuzzywuzzy is simple but slow; thefuzz adds processors but not ranking depth. Focus on maintainability: add a feature only if it does not require extra dependencies or a complex build.

- Hybrid/Custom Scorers: Allow combining scorers (e.g., a weighted average of Levenshtein + Jaccard) or user-defined functions; better than the rigid options in fuzzywuzzy.
- Batch Processing: Efficient matching for large lists (e.g., vectorized operations in C++), outperforming the sequential Python loops in competitors.
- Normalization Processors: Built-in pre-processors such as lowercasing, punctuation removal, and tokenization; similar to fuzzywuzzy's processors but optimized in C++ for speed.
- Thread Safety and Parallelism: Safe for multi-threaded use, with optional parallel matching via OpenMP in C++; addresses scalability gaps in single-threaded libraries.
- Integration with Data Structures: Seamless with Python lists, NumPy arrays, and Pandas Series/DataFrames; more practical than the standalone functions in rapidfuzz.
- Performance Metrics: Built-in benchmarking utilities to compare scorers or measure against baselines; helps users justify choosing it over alternatives.
- Error Handling and Validation: Robust input checks (e.g., handle empty strings gracefully) with clear exceptions; improves usability over minimalistic libraries.
- Extensible Bindings: Easy to add new C++ scorers without recompiling the Python side; future-proofs it better than monolithic designs.
- Documentation and Examples: In-code examples, not just "coming soon"; includes Jupyter notebooks for quickstarts, surpassing the sparse docs of some competitors.
- Zero-Network Privacy: No telemetry or outbound calls; aligns with a privacy-first stance, unlike some libraries with optional analytics.
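The benchmarking utility mentioned above can be sketched in pure Python with `time.perf_counter` (the `benchmark_scorers` helper is hypothetical, not part of the library):

```python
import statistics
import time
from typing import Callable, Dict, List

def benchmark_scorers(
    scorers: Dict[str, Callable[[str, str], float]],
    query: str,
    candidates: List[str],
    repeats: int = 5,
) -> Dict[str, Dict[str, float]]:
    # Time each scorer over the full candidate list several times and
    # report mean/stdev, so users can compare scorers on their own data.
    results: Dict[str, Dict[str, float]] = {}
    for name, scorer in scorers.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for candidate in candidates:
                scorer(query, candidate)
            timings.append(time.perf_counter() - start)
        results[name] = {
            "mean": statistics.mean(timings),
            "stdev": statistics.stdev(timings),
        }
    return results

stats = benchmark_scorers(
    {"exact": lambda a, b: float(a == b)}, "apple", ["apple", "apples"] * 100
)
print(f"exact mean: {stats['exact']['mean']:.6f}s")
```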
Not chosen: Advanced features like NLP integration (e.g., with spaCy) or database indexing (e.g., SQLite fuzzy search); these add dependencies and complexity, violating the lightweight ethos. Instead, keep it a pure matching library; users can layer extras on top.

Comprehensive Requirements and Features Document

1. Introduction

Project Name: FuzzyBunny
Version: 0.1.0 (Development)
Description: A high-performance, lightweight Python library for fuzzy string matching and ranking, implemented in C++ with Pybind11 bindings. It provides flexible similarity computations with multiple scorers, targeting use cases like search autocompletion, data deduplication, and recommendation systems.

Goals:
- Deliver 2-5x faster matching than pure Python libraries (e.g., fuzzywuzzy) via C++.
- Maintain simplicity: single pip install, no runtime dependencies.
- Prioritize maintainability: modular C++ core, framework-agnostic API.

Scope Boundaries:
- In: String-based fuzzy matching/ranking.
- Out: Non-string data (e.g., image similarity), full-text search engines (use Elasticsearch instead), and web services (library only).

Target Users: Developers building search features in Python apps, data scientists cleaning datasets, and authors of CLI tools for text processing.

Assumptions Challenged: The README says "python" but the core is C++; assume C++ for performance. If pure Python suffices, consider forking fuzzywuzzy instead; the C++ core adds build hurdles, so it must be justified with benchmarks.
2. Functional Requirements

Core API:

- match(query: str, target: str, scorer: str = 'levenshtein', threshold: float = 0.0) -> float: Compute a similarity score, normalized to [0, 1].
  - Trade-off: Normalized scores chosen for consistency over raw distances (not chosen, as users prefer percentage-like values).
- rank(query: str, candidates: List[str], scorer: str = 'levenshtein', threshold: float = 0.0, top_n: int = None) -> List[Tuple[str, float]]: Return sorted matches.
  - Optimization: Sorting happens on the C++ side for efficiency, with no Python overhead.
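The C++ core is not shown in this commit, but the intended semantics of `rank` can be sketched in pure Python, with `difflib.SequenceMatcher` standing in for the real scorers (the names mirror the proposed API; the behavior shown is an assumption, not the shipped implementation):

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

def match(query: str, target: str) -> float:
    # Stand-in scorer: difflib's ratio is already normalized to [0, 1].
    return SequenceMatcher(None, query, target).ratio()

def rank(
    query: str,
    candidates: List[str],
    threshold: float = 0.0,
    top_n: Optional[int] = None,
) -> List[Tuple[str, float]]:
    # Score every candidate, drop those below the threshold, then sort
    # best-first; a None slice bound means "no truncation".
    scored = [(c, match(query, c)) for c in candidates]
    scored = [pair for pair in scored if pair[1] >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

print(rank("app", ["apple", "banana"], top_n=1))
```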
Scorers (Must-Have):

- Levenshtein: Edit-distance based.
- Token Sort: Sort tokens, then compare.
- Jaccard: Intersection over union of token sets.
- Good-to-Have: Hybrid (e.g., hybrid(weights: Dict[str, float])) and custom scorers (user callback, with a documented performance warning).
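A sketch of how a hybrid scorer could combine weighted sub-scorers (the toy `exact`/`prefix` scorers stand in for the real Levenshtein and token-sort implementations; the factory shape is an assumption):

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], float]

def hybrid(weights: Dict[str, float], scorers: Dict[str, Scorer]) -> Scorer:
    # Weighted average of the named sub-scorers; weights are normalized
    # so callers need not make them sum to exactly 1.0.
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def score(query: str, target: str) -> float:
        return sum(
            w * scorers[name](query, target) for name, w in weights.items()
        ) / total

    return score

# Toy scorers standing in for the real algorithms:
exact = lambda q, t: 1.0 if q == t else 0.0
prefix = lambda q, t: 1.0 if t.startswith(q) else 0.0

combined = hybrid({"exact": 0.3, "prefix": 0.7}, {"exact": exact, "prefix": prefix})
print(combined("app", "apple"))  # 0.7: prefix hits, exact misses
```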
Match Modes (Must-Have):

- Full: Whole-string comparison.
- Partial: Substring search with best-match.
- Adjustable via parameter: mode: str = 'full'.
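Partial mode can be sketched as a best-window scan of the longer string, again with `difflib` standing in for the real scorer (a reference sketch of the intended semantics, not the C++ implementation):

```python
from difflib import SequenceMatcher

def partial_ratio(query: str, target: str) -> float:
    # Best score of the shorter string against every same-length window
    # of the longer one; a perfect substring therefore scores 1.0.
    shorter, longer = sorted((query, target), key=len)
    if not shorter:
        return 0.0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start : start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best

print(partial_ratio("apple", "apple pie"))  # 1.0
```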
Processors (Good-to-Have):

- Pre-match normalization: process: bool = True (lowercase, strip punctuation).
- Trade-off: Enabled by default for usability, but optional to avoid altering inputs (a privacy concern).
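A pure-Python sketch of what such a default processor might do (the `normalize` function is hypothetical; ASCII punctuation is mapped to spaces so hyphenated words stay separate tokens):

```python
import string
import unicodedata

def normalize(text: str) -> str:
    # Default pre-processor: Unicode-normalize, lowercase, replace
    # punctuation with spaces, then collapse runs of whitespace.
    text = unicodedata.normalize("NFC", text).lower()
    text = text.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    )
    return " ".join(text.split())

print(normalize("  Apple-Pie,  anyone? "))
```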
Batch Support (Good-to-Have):

- batch_match(queries: List[str], targets: List[List[str]], ...) -> List[List[Tuple[str, float]]].
- Parallel via threads when parallel: bool = True.
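A thread-based sketch of the proposed batch_match semantics (one candidate list per query). Note that in pure Python the GIL limits any speedup, which is exactly why the plan pushes parallelism into C++/OpenMP; this is a behavioral reference only:

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from typing import List, Tuple

def _rank_one(query: str, targets: List[str]) -> List[Tuple[str, float]]:
    scored = [(t, SequenceMatcher(None, query, t).ratio()) for t in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def batch_match(
    queries: List[str],
    targets: List[List[str]],
    parallel: bool = True,
) -> List[List[Tuple[str, float]]]:
    # Threads only pay off when the underlying scorer releases the GIL,
    # as a C++ core would; the sequential path is kept for comparison.
    if not parallel:
        return [_rank_one(q, t) for q, t in zip(queries, targets)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(_rank_one, queries, targets))
```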
Integrations (Good-to-Have):

- Pandas: Extension methods such as Series.fuzzy_match(query).
- NumPy: Vectorized matching over arrays.
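pandas supports this pattern natively via `pd.api.extensions.register_series_accessor`; a sketch of how a `Series.fuzzy` accessor could be wired up, with `difflib` standing in for the C++ scorer (the accessor name and method are assumptions):

```python
import pandas as pd
from difflib import SequenceMatcher

@pd.api.extensions.register_series_accessor("fuzzy")
class FuzzyAccessor:
    def __init__(self, series: pd.Series):
        self._series = series

    def match(self, query: str) -> pd.Series:
        # Return a same-index Series of similarity scores so results can
        # be filtered or sorted with ordinary pandas operations.
        return self._series.map(
            lambda value: SequenceMatcher(None, query, str(value)).ratio()
        )

names = pd.Series(["apple pie", "banana bread"])
print(names.fuzzy.match("apple"))
```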
3. Non-Functional Requirements

Performance:
- <1 ms per match for 100-character strings on standard hardware.
- Scale to 10k candidates without slowdown (benchmark against rapidfuzz).
- Trade-off: C++ for speed, but no GPU support (unnecessary complexity).

Usability:
- Intuitive API: mirror fuzzywuzzy, but faster.
- Docs: Sphinx-generated, with examples.
- Installation: pip install builds the C++ extension automatically via setup.py.

Security/Privacy:
- No network calls.
- Secure by default: no file I/O unless explicit.
- Handle secrets in strings without logging them.

Maintainability:
- Core logic in C++ (framework-agnostic).
- Tests: 90% coverage with pytest (Python) and Catch2 (C++).
- CI/CD: GitHub Actions for builds and tests.
- One-command setup: pip install -e . for development.

Compatibility:
- Python 3.8+.
- OS: Linux/Mac/Windows (cross-compile).
- No dependencies beyond Pybind11 (bundled).
4. Architecture Overview

Layers:
- C++ Core: Scorers and matching algorithms (a single module for minimalism; no microservices).
- Pybind11 Bindings: Expose the core as Python functions/classes.
- Python API: Thin wrapper for usability.

Data Flow: Input strings -> normalize (optional) -> compute in C++ -> return scores.

Trade-offs: Embedded DB? No; use plain lists for candidates (simpler than SQLite). Redis? Avoid unless a user adds it for caching.

Alternatives Considered:
- Pure Python: Simpler build, but slower; ranked lower given the performance goals.
- Rust + PyO3: Safer, but Pybind11 was chosen on the assumption of existing C++ expertise.
- Recommended: Proceed with C++/Pybind11, but prototype in pure Python first to validate the design.
5. Risks and Mitigations

- Build Failures: Mitigate with pre-built wheels on PyPI.
- Performance Bottlenecks: Benchmark early; fall back to rapidfuzz if not clearly superior.
- Scope Creep: Limit v1 to README features; defer good-to-haves to v2.
- Bad Idea Alert: If custom scorers add more than ~20% complexity without concrete use cases, drop them and focus on speed.

6. Roadmap

- MVP: Must-haves by v0.1.
- v0.2: Good-to-haves.
- Future: CLI wrapper reusing the core (e.g., fuzzybunny match query target).

README.md

Lines changed: 44 additions & 62 deletions
````diff
@@ -5,7 +5,7 @@
 <h1 align="center">FuzzyBunny</h1>
 
 <p align="center">
-  <b> A fuzzy search tool written in C++ with Python bindings </b>
+  <b> A high-performance, lightweight Python library for fuzzy string matching and ranking, implemented in C++ with Pybind11. </b>
 </p>
 
 <p align="center">
@@ -14,88 +14,70 @@
 <img src="https://img.shields.io/badge/Bindings-Pybind11-blue" />
 </p>
 
-## Overview
-
-FuzzyBunny is a lightweight, high-performance Python library for fuzzy string matching and ranking. It is implemented in C++ for speed and exposes a Pythonic API via Pybind11. It supports various scoring algorithms including Levenshtein, Jaccard, and Token Sort, along with partial matching capabilities.
-
 ## Features
 
-- **Fast C++ Core**: Optimized string matching algorithms.
-- **Multiple Scorers**:
-  - `levenshtein`: Standard edit distance ratio.
-  - `jaccard`: Set-based similarity.
-  - `token_sort`: Sorts tokens before comparing (good for "Apple Banana" vs "Banana Apple").
-- **Ranking**: Efficiently rank a list of candidates against a query.
-- **Partial Matching**: Support for substring matching via `mode='partial'`.
-- **Unicode Support**: Correctly handles UTF-8 input.
+- **Blazing Fast**: C++ core for 2-5x speed improvement over pure Python alternatives.
+- **Multiple Scorers**: Support for Levenshtein, Jaccard, and Token Sort ratios.
+- **Partial Matching**: Find the best substring matches.
+- **Hybrid Scoring**: Combine multiple scorers with custom weights.
+- **Pandas & NumPy Integration**: Native support for Series and Arrays.
+- **Batch Processing**: Parallelized matching for large datasets using OpenMP.
+- **Unicode Support**: Handles international characters and normalization.
+- **Benchmarking Tools**: Built-in utilities to measure performance.
 
 ## Installation
 
-### Prerequisites
-- Python 3.8+
-- C++17 compatible compiler (GCC, Clang, MSVC)
-
-### Using uv (Recommended)
-
-```bash
-uv pip install .
-```
-
-### Using pip
-
 ```bash
-pip install .
+pip install fuzzybunny
 ```
 
-## Usage
+## Quick Start
 
 ```python
 import fuzzybunny
 
-# Basic Levenshtein Ratio
+# Basic matching
 score = fuzzybunny.levenshtein("kitten", "sitting")
-print(f"Score: {score}") # ~0.57
+print(f"Similarity: {score:.2f}")
+
+# Ranking candidates
+candidates = ["apple", "apricot", "banana", "cherry"]
+results = fuzzybunny.rank("app", candidates, top_n=2)
+# [('apple', 0.6), ('apricot', 0.42)]
+```
 
-# Partial Matching
-# "apple" is a perfect substring of "apple pie"
-score = fuzzybunny.partial_ratio("apple", "apple pie")
-print(f"Partial Score: {score}") # 1.0
+## Advanced Usage
 
-# Ranking Candidates
-candidates = ["apple pie", "banana bread", "cherry tart", "apple crisp"]
+### Hybrid Scorer
+Combine different algorithms to get better results:
+
+```python
 results = fuzzybunny.rank(
-    query="apple",
-    candidates=candidates,
-    scorer="levenshtein",
-    mode="partial",
-    top_n=2
+    "apple banana",
+    ["banana apple"],
+    scorer="hybrid",
+    weights={"levenshtein": 0.3, "token_sort": 0.7}
 )
-
-for candidate, score in results:
-    print(f"{candidate}: {score}")
-# Output:
-# apple pie: 1.0
-# apple crisp: 1.0
 ```
 
-## Development
+### Pandas Integration
+Use the specialized accessor for clean code:
+
+```python
+import pandas as pd
+import fuzzybunny
 
-1. **Setup Environment**:
-   ```bash
-   uv venv
-   source .venv/bin/activate
-   ```
+df = pd.DataFrame({"names": ["apple pie", "banana bread", "cherry tart"]})
+results = df["names"].fuzzy.match("apple", mode="partial")
+```
 
-2. **Install in Editable Mode**:
-   ```bash
-   uv pip install -e .
-   ```
+### Benchmarking
+Compare performance on your specific data:
 
-3. **Run Tests**:
-   ```bash
-   pytest
-   ```
+```python
+perf = fuzzybunny.benchmark("query", candidates)
+print(f"Levenshtein mean time: {perf['levenshtein']['mean']:.6f}s")
+```
 
 ## License
-
-This project is licensed under the [MIT License](LICENSE).
+MIT
````