Commit 7c77c4d

docs: research full-text search index options for TanStack DB
Researched JavaScript full-text search libraries that could integrate with TanStack DB's indexing architecture following PR #950's opt-in pattern.

Key findings:
- Orama recommended (2KB, used by tanstack.com, full features)
- MiniSearch as alternative (7KB, class-based API)
- FlexSearch for high-performance needs
- Custom inverted index for minimal bundle

Includes implementation approach and bundle size comparison.
1 parent 5c717e6 commit 7c77c4d

1 file changed

Lines changed: 385 additions & 0 deletions

@@ -0,0 +1,385 @@
# Full-Text Search Index Research for TanStack DB

## Context

Based on [PR #950](https://github.com/TanStack/db/pull/950), TanStack DB is moving toward opt-in, tree-shakeable indexing. This document researches JavaScript full-text search options that could integrate with this architecture.

## Requirements for TanStack DB Integration

Based on the existing `IndexInterface` in `packages/db/src/indexes/base-index.ts`, a full-text search index would need to:

1. **Implement core interface methods**: `add()`, `remove()`, `update()`, `build()`, `clear()`
2. **Support dynamic updates**: Documents can be added or removed at any time (unlike Lunr.js, whose index is immutable once built)
3. **Be tree-shakeable**: Support lazy loading via an async resolver pattern
4. **Work in browser and Node.js**: No server-side dependencies
5. **Be reasonably small**: Bundle size matters for client-side use
6. **Support relevance scoring**: Full-text search implies ranked results
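
To make the "async resolver" requirement concrete, here is a minimal sketch of how a lazily loaded index could work. Every name in it (`TextIndex`, `IndexResolver`, `LazyIndex`, `NaiveTextIndex`) is hypothetical and chosen for illustration; none of them is TanStack DB's actual API. The point is only that the index constructor sits behind a promise-returning function, so a bundler can split the implementation into a separate chunk.

```typescript
// Hypothetical shapes for illustration; these are not TanStack DB's actual types.
interface TextIndex {
  add(key: string, text: string): void
  has(term: string): boolean
}

type IndexConstructor = new () => TextIndex
type IndexResolver = () => Promise<IndexConstructor>

// Stand-in for a heavier full-text implementation living in its own module.
class NaiveTextIndex implements TextIndex {
  private terms = new Set<string>()

  add(_key: string, text: string): void {
    for (const term of text.toLowerCase().split(/\W+/)) {
      if (term) this.terms.add(term)
    }
  }

  has(term: string): boolean {
    return this.terms.has(term.toLowerCase())
  }
}

// In a real package this would be a dynamic import, for example:
// () => import('./fulltext-index').then((m) => m.FullTextIndex)
const resolveFullText: IndexResolver = async () => NaiveTextIndex

// Resolves the constructor on first use and memoizes the pending promise,
// so the implementation is only loaded when an index is actually requested.
class LazyIndex {
  private resolver: IndexResolver
  private pending: Promise<TextIndex> | undefined

  constructor(resolver: IndexResolver) {
    this.resolver = resolver
  }

  load(): Promise<TextIndex> {
    if (!this.pending) {
      this.pending = this.resolver().then((Ctor) => new Ctor())
    }
    return this.pending
  }
}

const lazy = new LazyIndex(resolveFullText)
lazy.load().then((index) => {
  index.add('1', 'Full-text search example')
  console.log(index.has('search')) // true
})
```

Applications that never call `load()` never pay for the search code, which is the property the opt-in pattern is after.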

### Additional Full-Text Specific Operations

A full-text index would likely extend `IndexInterface` with:
- `search(query: string, options?: SearchOptions): SearchResult[]`
- `suggest(prefix: string): string[]` (autocomplete)
- `highlight(docId, query): HighlightedText`
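
The three operations above could be typed roughly as follows. This is a sketch under assumptions: `SearchOptions`, `SearchResult`, `HighlightedText`, and `FullTextOperations` are hypothetical shapes invented here, not existing TanStack DB types, and `findHighlightRanges` is just one simple way to produce highlight data (case-insensitive substring matching per query term).

```typescript
// Hypothetical types: none of these exist in TanStack DB today.
type Key = string | number

interface SearchOptions {
  fields?: string[] // restrict matching to these fields
  limit?: number // maximum number of hits to return
  fuzzy?: number // maximum edit distance allowed per term
}

interface SearchResult<TKey extends Key = Key> {
  key: TKey
  score: number
}

interface HighlightedText {
  text: string
  ranges: Array<[number, number]> // [start, end) offsets of matched terms
}

interface FullTextOperations<TKey extends Key = Key> {
  search(query: string, options?: SearchOptions): SearchResult<TKey>[]
  suggest(prefix: string): string[]
  highlight(docId: TKey, query: string): HighlightedText
}

// One simple way to compute highlight ranges: find every case-insensitive
// occurrence of each query term in the text.
function findHighlightRanges(text: string, query: string): HighlightedText {
  const ranges: Array<[number, number]> = []
  const haystack = text.toLowerCase()
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 0)
  for (const term of terms) {
    let from = haystack.indexOf(term)
    while (from !== -1) {
      ranges.push([from, from + term.length])
      from = haystack.indexOf(term, from + term.length)
    }
  }
  ranges.sort((a, b) => a[0] - b[0])
  return { text, ranges }
}

const highlighted = findHighlightRanges('Full-text search example', 'search')
console.log(highlighted.ranges) // [ [ 10, 16 ] ]
```

Returning offsets rather than pre-rendered markup leaves the rendering decision (for example `<mark>` tags) to the UI layer.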

---

## JavaScript Full-Text Search Libraries Comparison

### 1. Orama (Recommended)

**GitHub**: https://github.com/oramasearch/orama
**License**: Apache 2.0
**Bundle Size**: ~2KB gzipped (core)

#### Strengths
- **Already used by TanStack**: tanstack.com uses Orama for search
- **Tiny bundle**: Under 2KB for full-text search
- **Dynamic updates**: Full add/remove/update support
- **Modern TypeScript**: Written in TypeScript with excellent types
- **Feature-rich**: Typo tolerance, stemming (30 languages), facets, geo-search
- **Vector search support**: Enables hybrid full-text + semantic search
- **Microsecond latency**: Very fast search performance
- **Plugin architecture**: Extensible design matches TanStack DB's philosophy

#### API Example
```typescript
import { create, insert, remove, search } from '@orama/orama'

// Depending on the Orama version that is pinned, these calls may return
// Promises and need `await`.
const db = create({
  schema: {
    title: 'string',
    content: 'string',
    tags: 'string[]'
  }
})

insert(db, { title: 'Hello World', content: 'Full-text search example', tags: ['demo'] })

const results = search(db, {
  term: 'search',
  properties: ['title', 'content'],
  tolerance: 1 // typo tolerance
})
```
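
Typo tolerance of the kind configured by `tolerance: 1` is commonly expressed as a maximum Levenshtein edit distance between the query term and an indexed term. The sketch below illustrates that idea in isolation. It is not Orama's implementation, and `levenshtein` and `withinTolerance` are names chosen here for illustration.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = []
  for (let j = 0; j <= b.length; j++) prev.push(j)

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr.push(
        Math.min(
          prev[j] + 1, // deletion
          curr[j - 1] + 1, // insertion
          prev[j - 1] + cost // substitution (or match)
        )
      )
    }
    prev = curr
  }
  return prev[b.length]
}

// A query term matches an indexed term if it is at most `tolerance` edits away.
function withinTolerance(query: string, indexed: string, tolerance: number): boolean {
  return levenshtein(query.toLowerCase(), indexed.toLowerCase()) <= tolerance
}

console.log(levenshtein('serch', 'search')) // 1
console.log(withinTolerance('serch', 'search', 1)) // true
```

A production index would avoid comparing the query against every term, for example by walking a trie or radix tree with a bounded distance, but the matching rule is the same.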

#### Integration Considerations
- Functional API (not class-based), so a wrapper class would be needed
- Schema must be defined upfront
- Could be exposed as `@tanstack/db-search` or similar

---

### 2. MiniSearch

**GitHub**: https://github.com/lucaong/minisearch
**License**: MIT
**Bundle Size**: ~7KB gzipped

#### Strengths
- **Memory efficient**: Designed for constrained environments
- **Dynamic updates**: Full add/remove support
- **Auto-suggestions**: Built-in autocomplete
- **Fuzzy search**: Configurable typo tolerance
- **Simple API**: Easy to integrate
- **Zero dependencies**: No external deps

#### Weaknesses
- **No stemming built-in**: Must be added manually
- **Larger than Orama**: ~3.5x bigger bundle
- **No vector search**: Full-text only
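
Regarding the missing stemming: MiniSearch accepts a `processTerm` option that transforms each term before it is indexed or searched, which is where a stemmer could be plugged in. The sketch below shows the shape of such a hook. `naiveStem` is a deliberately crude suffix stripper written for this document, not a real stemming algorithm; a real integration would more likely use an established Porter or Snowball stemmer package.

```typescript
// Deliberately naive suffix stripper, for illustration only.
const SUFFIXES = ['ing', 'ed', 'es', 's']

function naiveStem(term: string): string {
  const lower = term.toLowerCase()
  for (const suffix of SUFFIXES) {
    // Keep at least three characters so short words are left alone.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, lower.length - suffix.length)
    }
  }
  return lower
}

// Shape of the options object that would be passed to `new MiniSearch(...)`.
const miniSearchOptions = {
  fields: ['title', 'content'],
  processTerm: (term: string) => naiveStem(term)
}

console.log(miniSearchOptions.processTerm('Searching')) // "search"
console.log(miniSearchOptions.processTerm('indexes')) // "index"
```

Because the same function is applied at index time and at query time by default, "searching" and "searched" end up under the same term.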
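
#### API Example
```typescript
import MiniSearch from 'minisearch'

// Sample documents; each one needs a unique `id` field by default
const documents = [
  { id: 1, title: 'Hello World', content: 'Full-text search example' }
]

const miniSearch = new MiniSearch({
  fields: ['title', 'content'], // fields to index
  storeFields: ['title'] // fields returned with search results
})

miniSearch.addAll(documents)
miniSearch.add({ id: 2, title: 'New doc', content: '...' })
// `remove()` expects the full document exactly as it was indexed;
// `discard()` removes a document by ID alone
miniSearch.discard(2)

const results = miniSearch.search('query', { fuzzy: 0.2, prefix: true })
```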

#### Integration Considerations
- Class-based API fits well with TanStack DB's index pattern
- Would need ID-based tracking like existing `BTreeIndex`

---

### 3. FlexSearch

**GitHub**: https://github.com/nextapps-de/flexsearch
**License**: Apache 2.0
**Bundle Size**: 4.5KB (light) to 16.3KB (full) gzipped

#### Strengths
- **Extremely fast**: Claims to be the fastest JavaScript search library
- **Web Workers support**: Parallel indexing/searching
- **Persistent indexes**: Can serialize to storage
- **Phonetic search**: Advanced matching algorithms
- **Multiple presets**: Configurable memory/speed tradeoffs

#### Weaknesses
- **Complex API**: Steeper learning curve
- **Inconsistent documentation**: API has changed across versions
- **Memory vs speed tradeoff**: Fast modes use more memory

#### API Example
```typescript
import { Document } from 'flexsearch'

const index = new Document({
  document: {
    id: 'id',
    index: ['title', 'content']
  }
})

index.add({ id: 1, title: 'Hello', content: 'World' })
index.remove(1)

const results = index.search('hello')
```

#### Integration Considerations
- Document search mode matches TanStack DB's collection model
- Would need careful version pinning due to API instability

---

### 4. Fuse.js

**GitHub**: https://github.com/krisk/Fuse
**License**: Apache 2.0
**Bundle Size**: ~12KB gzipped

#### Strengths
- **Fuzzy search focus**: Best-in-class approximate matching
- **Weighted fields**: Different importance per field
- **Well maintained**: Large community
- **Simple API**: Easy to use

#### Weaknesses
- **No inverted index**: Scans all documents on every query (O(n))
- **Poor performance at scale**: Not suitable for large datasets
- **No tokenization**: Matches substrings, not words

#### API Example
```typescript
import Fuse from 'fuse.js'

const fuse = new Fuse(documents, {
  keys: ['title', 'content'],
  threshold: 0.3
})

const results = fuse.search('query')
```

#### Integration Considerations
- **Not recommended** for full-text search; better suited to fuzzy matching over small datasets
- Could be useful as a lightweight "fuzzy filter" on top of other indexes

---

### 5. Lunr.js

**GitHub**: https://github.com/olivernn/lunr.js
**License**: MIT
**Bundle Size**: ~8KB gzipped

#### Strengths
- **Built-in stemming**: Multiple language support
- **TF-IDF scoring**: Good relevance ranking
- **Mature**: Battle-tested library

#### Weaknesses
- **Immutable index**: Cannot add/remove documents after build
- **No maintenance**: Last commit was years ago
- **Rebuild required**: Any change requires full reindex

#### Integration Considerations
- **Not recommended** due to the immutable index, which doesn't fit TanStack DB's real-time update model

---

### 6. Build Custom (Inverted Index)

Building a minimal custom full-text index is feasible for basic needs.

#### Core Components
```typescript
type Key = string | number

// Set intersection helper (Set.prototype.intersection is not available everywhere yet)
function intersection<T>(a: Set<T>, b: Set<T>): Set<T> {
  const result = new Set<T>()
  for (const value of a) {
    if (b.has(value)) result.add(value)
  }
  return result
}

// Sketch only: a real implementation would declare `implements IndexInterface`
// and also provide update(), build(), clear(), and the remaining members.
class FullTextIndex {
  private invertedIndex = new Map<string, Set<Key>>() // term -> doc keys
  private docTerms = new Map<Key, Set<string>>() // doc -> terms (for removal)

  add(key: Key, item: any) {
    // Re-adding an existing key must not leave stale postings behind
    if (this.docTerms.has(key)) this.remove(key, item)

    const text = this.extractText(item)
    const terms = this.tokenize(text)

    this.docTerms.set(key, new Set(terms))
    for (const term of terms) {
      if (!this.invertedIndex.has(term)) {
        this.invertedIndex.set(term, new Set())
      }
      this.invertedIndex.get(term)!.add(key)
    }
  }

  remove(key: Key, item: any) {
    const terms = this.docTerms.get(key)
    if (!terms) return

    for (const term of terms) {
      this.invertedIndex.get(term)?.delete(key)
    }
    this.docTerms.delete(key)
  }

  search(query: string): Set<Key> {
    const queryTerms = this.tokenize(query)
    // AND logic: intersection of all term matches
    let result: Set<Key> | null = null
    for (const term of queryTerms) {
      const matches = this.invertedIndex.get(term) ?? new Set<Key>()
      result = result ? intersection(result, matches) : new Set(matches)
    }
    return result ?? new Set()
  }

  // Concatenate all string-valued fields of the item
  private extractText(item: any): string {
    return Object.values(item ?? {})
      .filter((value): value is string => typeof value === 'string')
      .join(' ')
  }

  // Lowercase, split on non-word characters, drop single-character tokens
  private tokenize(text: string): string[] {
    return text.toLowerCase()
      .split(/\W+/)
      .filter(t => t.length > 1)
  }
}
```

#### Strengths
- **Zero dependencies**: Complete control
- **Minimal size**: Can be very small
- **Perfect fit**: Matches TanStack DB interface exactly

#### Weaknesses
- **No advanced features**: No fuzzy matching, stemming, or scoring
- **Maintenance burden**: Must build everything
- **No relevance ranking**: Basic boolean matching only
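
To give a sense of what closing the relevance-ranking gap would involve, here is a minimal TF-IDF sketch. It is a separate, self-contained example rather than an extension of the class above. `RankedIndex` is a hypothetical name, it uses OR semantics (a document matching any query term is returned), and the smoothed formula `log(1 + N / df)` is one common IDF variant chosen here for simplicity.

```typescript
type DocKey = string | number

function tokenizeText(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter((t) => t.length > 1)
}

class RankedIndex {
  // term -> (doc key -> number of occurrences of the term in that doc)
  private postings = new Map<string, Map<DocKey, number>>()
  private docCount = 0

  add(key: DocKey, text: string): void {
    this.docCount++
    for (const term of tokenizeText(text)) {
      const docs = this.postings.get(term) ?? new Map<DocKey, number>()
      docs.set(key, (docs.get(key) ?? 0) + 1)
      this.postings.set(term, docs)
    }
  }

  // Scores each matching document as the sum of tf * idf over the query terms,
  // then returns the hits sorted from most to least relevant.
  search(query: string): Array<{ key: DocKey; score: number }> {
    const scores = new Map<DocKey, number>()
    for (const term of tokenizeText(query)) {
      const docs = this.postings.get(term)
      if (!docs) continue
      // Rare terms (small docs.size) get a larger weight than common ones.
      const idf = Math.log(1 + this.docCount / docs.size)
      docs.forEach((tf, key) => {
        scores.set(key, (scores.get(key) ?? 0) + tf * idf)
      })
    }
    return Array.from(scores.entries())
      .map(([key, score]) => ({ key, score }))
      .sort((a, b) => b.score - a.score)
  }
}

const ranked = new RankedIndex()
ranked.add(1, 'full text search search')
ranked.add(2, 'text index')
ranked.add(3, 'btree index')
console.log(ranked.search('text search').map((hit) => hit.key)) // [ 1, 2 ]
```

Document 1 ranks first because it contains both query terms and the rarer one twice. Even this small step adds per-term frequency bookkeeping, which is part of the maintenance burden listed above.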

---

## Recommendation

### Primary: **Orama**

Orama is the strongest candidate because:

1. **Already in TanStack ecosystem** - tanstack.com uses it, so there's familiarity
2. **Smallest bundle** - Under 2KB for core functionality
3. **Most features** - Typo tolerance, stemming, facets, vector search
4. **Modern TypeScript** - Excellent type safety
5. **Apache 2.0** - Compatible license
6. **Active maintenance** - Regular updates and community

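### Implementation Approach

```typescript
// packages/db-search/src/fulltext-index.ts
// Sketch only: assumes `BaseIndex`, `IndexInterface`, and `BasicExpression` are
// exported from '@tanstack/db', and that the pinned Orama version exposes a
// synchronous API (some versions return Promises from these calls).
import {
  create,
  insert,
  remove as removeDoc,
  search as searchDocs,
  type Orama,
  type Results
} from '@orama/orama'
import { BaseIndex } from '@tanstack/db' // value import: required by `extends`
import type { BasicExpression, IndexInterface } from '@tanstack/db'

export interface FullTextIndexOptions {
  fields?: string[] // item fields to index as text
}

export interface SearchOptions {
  limit?: number
  tolerance?: number
}

export interface SearchResult<TKey> {
  key: TKey
  score: number
}

export class FullTextIndex<TKey extends string | number>
  extends BaseIndex<TKey>
  implements IndexInterface<TKey> {

  private db: Orama<any>
  private textFields: string[]
  // Orama document IDs are strings, so keep a map back to the collection keys
  private keysByDocId = new Map<string, TKey>()

  constructor(
    id: number,
    expression: BasicExpression,
    name?: string,
    options?: FullTextIndexOptions
  ) {
    super(id, expression, name)
    this.textFields = options?.fields ?? []

    this.db = create({
      schema: this.buildSchema()
    })
  }

  add(key: TKey, item: any): void {
    const docId = String(key)
    insert(this.db, { ...item, id: docId })
    this.keysByDocId.set(docId, key)
  }

  remove(key: TKey, item: any): void {
    const docId = String(key)
    removeDoc(this.db, docId)
    this.keysByDocId.delete(docId)
  }

  // Full-text specific method
  search(query: string, options?: SearchOptions): SearchResult<TKey>[] {
    const results = searchDocs(this.db, {
      term: query,
      properties: this.textFields,
      ...options
    }) as Results<any>
    // Orama returns a results object; the matches are in `hits`
    return results.hits.map(hit => ({
      key: this.keysByDocId.get(hit.id)!,
      score: hit.score
    }))
  }

  // For basic IndexInterface compatibility
  lookup(operation: 'search', value: string): Set<TKey> {
    return new Set(this.search(value).map(r => r.key))
  }

  // Index every configured text field as an Orama 'string' property
  private buildSchema(): Record<string, 'string'> {
    return Object.fromEntries(this.textFields.map(f => [f, 'string' as const]))
  }

  // update(), build(), clear(), and the remaining IndexInterface members are omitted here
}
```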

### Alternative: MiniSearch

If Orama's functional API is problematic, MiniSearch's class-based design is a good fallback:
- Larger bundle (~7KB) but still reasonable
- Simpler integration with existing index patterns
- MIT license (very permissive)

---

## Bundle Size Comparison

| Library | Size (gzip) | Dynamic Updates | Fuzzy | Stemming |
|---------|-------------|-----------------|-------|----------|
| Orama | ~2KB | ✅ | ✅ | ✅ (30 langs) |
| MiniSearch | ~7KB | ✅ | ✅ | ❌ (addon) |
| FlexSearch (light) | ~4.5KB | ✅ | | |
| Fuse.js | ~12KB | | ✅ | ❌ |
| Lunr.js | ~8KB | ❌ | | ✅ |
| Custom | ~1KB | ✅ | ❌ | ❌ |

---

## Next Steps

1. **Prototype Orama integration** as `@tanstack/db-search`
2. **Design extended IndexInterface** for full-text operations
3. **Add to tree-shakeable entry points** per PR #950 pattern
4. **Benchmark** against existing BTreeIndex for mixed workloads
5. **Document** field configuration and search options

---

## Sources

- [Orama GitHub](https://github.com/oramasearch/orama)
- [MiniSearch GitHub](https://github.com/lucaong/minisearch)
- [FlexSearch GitHub](https://github.com/nextapps-de/flexsearch)
- [Fuse.js](https://www.fusejs.io/)
- [Lunr.js](https://lunrjs.com/)
- [JS Search Library Comparisons](https://npm-compare.com/elasticlunr,flexsearch,fuse.js,minisearch)
- [Best Search Packages for JavaScript](https://mattermost.com/blog/best-search-packages-for-javascript/)
- [Top 6 JavaScript Search Libraries](https://byby.dev/js-search-libraries)
- [Inverted Index Implementation](https://www.30secondsofcode.org/js/s/tf-idf-inverted-index/)
