[ENH] Add token bitmap FTS index #7001
Draft
Sicheng-Pan wants to merge 1 commit into 04-30-_enh_add_word_based_fts_tokenizer from
Summary
Second PR in the full-text index redesign series. Adds `FullTextBitmapWriter` and `FullTextBitmapReader`, the core index that hashes word tokens to 2^20 buckets and stores one `RoaringBitmap` per bucket in a blockfile.

High-Level Plan (for context)
- Word-based tokenization (PR 1)

This PR

Writer (`FullTextBitmapWriter`)
- `DashMap<u32, BucketDelta>` for lock-free concurrent accumulation
- `add_document` / `delete_document` with mutual exclusion (add clears pending delete, delete clears pending add): the last operation wins per (bucket, doc_id)
- `write_to_blockfiles` writes buckets in sorted key order and merges with existing bitmaps when forking via `old_reader`
- `murmur3_32(token, seed=0x5f3759df) % 2^20`
- `("", bucket_id: u32) → RoaringBitmap`

Reader (`FullTextBitmapReader`)
- `search(query)` tokenizes via `WordAnalyzer::tokenize_query`, hashes each token, loads the bucket bitmaps, and ANDs them together
- `get_bucket` returns `Result`, so errors propagate properly

Fork support
- `Option<FullTextBitmapReader>` as `old_reader`
- `result = (existing | adds) - deletes`

Tests
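A minimal, self-contained sketch of the semantics described above (bucket hashing, add/delete mutual exclusion, the fork merge rule, and AND search) might look like the following. It is not the PR's code. `SketchWriter`, `SketchIndex`, `bucket_of`, and `commit` are hypothetical names. `BTreeSet<u32>` stands in for `RoaringBitmap`, a single-threaded `BTreeMap` stands in for `DashMap`, the index lives in memory rather than in a blockfile, tokens are passed pre-split instead of going through `WordAnalyzer`, and `murmur3_32` is hand-written here so the example needs no crates.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Bucket count and seed as quoted in the PR description.
const NUM_BUCKETS: u32 = 1 << 20;
const SEED: u32 = 0x5f37_59df;

/// Hand-written MurmurHash3 (x86, 32-bit variant).
fn murmur3_32(data: &[u8], seed: u32) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h = seed;
    let mut chunks = data.chunks_exact(4);
    for c in &mut chunks {
        let k = u32::from_le_bytes([c[0], c[1], c[2], c[3]]);
        h ^= k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h = h.rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }
    // Mix in the 0..=3 trailing bytes, little-endian.
    let tail = chunks.remainder();
    if !tail.is_empty() {
        let mut k: u32 = 0;
        for (i, &b) in tail.iter().enumerate() {
            k |= (b as u32) << (8 * i);
        }
        h ^= k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
    }
    // Finalization (fmix32).
    h ^= data.len() as u32;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

/// Map a token to one of the 2^20 buckets.
fn bucket_of(token: &str) -> u32 {
    murmur3_32(token.as_bytes(), SEED) % NUM_BUCKETS
}

/// Pending changes for one bucket (sets stand in for bitmaps).
#[derive(Default)]
struct BucketDelta {
    adds: BTreeSet<u32>,
    deletes: BTreeSet<u32>,
}

/// Hypothetical in-memory index: bucket id -> document ids.
struct SketchIndex {
    buckets: BTreeMap<u32, BTreeSet<u32>>,
}

/// Hypothetical single-threaded writer.
#[derive(Default)]
struct SketchWriter {
    deltas: BTreeMap<u32, BucketDelta>,
}

impl SketchWriter {
    /// Adding clears any pending delete for the same (bucket, doc_id).
    fn add_document(&mut self, doc_id: u32, tokens: &[&str]) {
        for t in tokens {
            let d = self.deltas.entry(bucket_of(t)).or_default();
            d.deletes.remove(&doc_id);
            d.adds.insert(doc_id);
        }
    }

    /// Deleting clears any pending add for the same (bucket, doc_id).
    fn delete_document(&mut self, doc_id: u32, tokens: &[&str]) {
        for t in tokens {
            let d = self.deltas.entry(bucket_of(t)).or_default();
            d.adds.remove(&doc_id);
            d.deletes.insert(doc_id);
        }
    }

    /// Apply deltas over an optional old index: (existing | adds) - deletes.
    fn commit(self, old: Option<&SketchIndex>) -> SketchIndex {
        let mut buckets = old.map(|o| o.buckets.clone()).unwrap_or_default();
        // BTreeMap iteration visits buckets in sorted key order.
        for (bucket, BucketDelta { adds, deletes }) in self.deltas {
            let docs = buckets.entry(bucket).or_default();
            docs.extend(adds);
            for id in &deletes {
                docs.remove(id);
            }
        }
        SketchIndex { buckets }
    }
}

impl SketchIndex {
    /// Intersect (AND) the bucket sets of every query token.
    fn search(&self, tokens: &[&str]) -> BTreeSet<u32> {
        let mut result: Option<BTreeSet<u32>> = None;
        for t in tokens {
            let docs = self.buckets.get(&bucket_of(t)).cloned().unwrap_or_default();
            result = Some(match result {
                None => docs,
                Some(acc) => acc.intersection(&docs).copied().collect(),
            });
        }
        result.unwrap_or_default()
    }
}
```

Because distinct tokens can hash to the same bucket, a match in this kind of index only says a document may contain every query token. The sketch makes no attempt to filter out such false positives.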