Building a Dataset Engine (Part 3): Hybrid Search & the UI

Markus Klooth
14 min read

How we combine vector and full-text search with configurable weighting strategies, track search analytics, and build a 37-component dataset management UI with upload, processing, and search.

Part 1 covered the data model and document processing pipeline. Part 2 covered embedding generation and vector search. This post covers the layer that ties everything together — the search system and the UI.

Vector search finds semantically similar content: "shipping delays" matches "delivery postponement." But it misses exact terms. Searching for order number "ORD-12345" returns noise because the embedding doesn't capture literal string matches.

Full-text search finds exact and stemmed keyword matches: the stem of "return" also matches "returns" and "returning." But it misses semantic equivalence, so "refund process" won't match "how to get your money back."

Hybrid search runs both, combines the results, and ranks by a weighted score, so the user gets semantic understanding and keyword precision. This is the approach that production RAG systems converge on, and it's what we built.

Full-text search — PostgreSQL native FTS

Full-text search uses the pre-computed searchVector column from Part 1. The tsvector is generated by PostgreSQL at insert time and indexed with GIN. At query time, we match against the index — no text processing needed.

// packages/lib/src/datasets/search/full-text-search.ts

export class FullTextSearchService {
  static async search(
    query: SearchQuery,
    datasetIds: string[],
    organizationId: string,
    _userId?: string
  ): Promise<{ results: SearchResult[]; metrics: SearchPerformanceMetrics }> {
    const searchResults = await FullTextSearchService.performFullTextSearch(
      query.query,
      datasetIds,
      organizationId,
      {
        fuzzySearch: true,
        phraseSearch: false,
        booleanMode: false,
        rankingMode: 'bm25',
        minScore: 0.1,
        filters: query.filters,
      }
    )

    // Pagination
    const limit = query.limit || 20
    const offset = query.offset || 0
    // `metrics` (timing and counts) is assembled in code elided from this excerpt
    return { results: searchResults.slice(offset, offset + limit), metrics }
  }
}

Under the hood, the actual SQL uses plainto_tsquery and ts_rank_cd:

SELECT
  "DocumentSegment".id,
  "DocumentSegment".content,
  ts_rank_cd("searchVector", plainto_tsquery('english', $query)) as rank
FROM "DocumentSegment"
WHERE "searchVector" @@ plainto_tsquery('english', $query)
  AND "indexStatus" = 'INDEXED'
  AND "documentId" IN (SELECT id FROM "Document" WHERE "datasetId" = ANY($datasetIds) AND enabled = true)
ORDER BY rank DESC
LIMIT 100

A few design decisions worth explaining.

Pre-computed searchVector, not runtime tsvector generation. Computing to_tsvector('english', content) at query time means parsing the full text of every row the query scans. A stored, GIN-indexed searchVector column turns this into an index lookup. For 100,000 segments, that is roughly 5 ms versus 500 ms.

plainto_tsquery over to_tsquery. to_tsquery requires the caller to handle boolean operators and escaping (cat & dog, not cat and dog). plainto_tsquery accepts natural language input and converts it. Users type questions, not boolean expressions.

ts_rank_cd for ranking. The _cd variant (cover density) considers proximity of matching terms, not just frequency. "return policy" ranks higher when the words appear near each other in the segment, not scattered across 500 words.

100-result cap per query. Full-text search can return thousands of matches. We cap at 100 server-side. The hybrid combiner and pagination handle the rest. This keeps memory bounded and response times consistent.

Filters applied in SQL, not in-memory. Document type, MIME type, date range, and status filters are in the WHERE clause, not post-query JavaScript filtering. PostgreSQL's query planner combines the GIN index scan with these filters efficiently.
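As a rough illustration of pushing filters into the WHERE clause, the sketch below builds the query text and positional parameters in one place. It is not the project's actual query builder: buildFtsQuery, FtsFilters, and the "mimeType" and "createdAt" column names are hypothetical, and only the overall query shape is taken from the SQL shown earlier.

```typescript
// Hypothetical filter shape; the real SearchQuery.filters type is not shown in this post.
interface FtsFilters {
  mimeTypes?: string[]
  createdAfter?: Date
}

interface BuiltQuery {
  text: string
  params: unknown[]
}

// Builds the SQL text plus positional parameters ($1, $2, ...).
// Each active filter becomes an extra condition in the Document subquery,
// so no rows are filtered in JavaScript after the query returns.
function buildFtsQuery(query: string, datasetIds: string[], filters: FtsFilters = {}): BuiltQuery {
  const params: unknown[] = [query, datasetIds]
  const docConditions = ['"datasetId" = ANY($2)', 'enabled = true']

  if (filters.mimeTypes && filters.mimeTypes.length > 0) {
    params.push(filters.mimeTypes)
    docConditions.push(`"mimeType" = ANY($${params.length})`) // assumed column name
  }
  if (filters.createdAfter) {
    params.push(filters.createdAfter)
    docConditions.push(`"createdAt" >= $${params.length}`) // assumed column name
  }

  const text = [
    'SELECT s.id, s.content,',
    `  ts_rank_cd(s."searchVector", plainto_tsquery('english', $1)) AS rank`,
    'FROM "DocumentSegment" s',
    `WHERE s."searchVector" @@ plainto_tsquery('english', $1)`,
    `  AND s."indexStatus" = 'INDEXED'`,
    `  AND s."documentId" IN (SELECT id FROM "Document" WHERE ${docConditions.join(' AND ')})`,
    'ORDER BY rank DESC',
    'LIMIT 100',
  ].join('\n')

  return { text, params }
}
```

Keeping user input in the parameter array, never in the SQL string, is what makes it safe to assemble the conditions dynamically.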

Hybrid search — combining two ranked lists

This is where it gets interesting. Hybrid search runs vector and text search in parallel, then combines the results.

// packages/lib/src/datasets/search/hybrid-search.ts

export class HybridSearchService {
  static async search(
    query: SearchQuery,
    datasetConfigs: DatasetConfig[],
    organizationId: string,
    userId?: string
  ): Promise<{ results: SearchResult[]; metrics: SearchPerformanceMetrics }> {
    const datasetIds = datasetConfigs.map(d => d.id)

    // Read weights from query or use defaults
    const hybridOptions: HybridSearchOptions = {
      vectorWeight: query.vectorWeight ?? 0.6,
      textWeight: query.textWeight ?? 0.4,
      combineMethod: query.combineMethod || 'weighted_sum',
    }

    // Execute both searches in parallel
    const [vectorResult, textResult] = await Promise.allSettled([
      VectorSearchService.search(query, datasetConfigs, organizationId, userId),
      FullTextSearchService.search(query, datasetIds, organizationId, userId),
    ])

    // Handle partial failures gracefully
    const vectorResults = vectorResult.status === 'fulfilled' ? vectorResult.value.results : []
    const textResults = textResult.status === 'fulfilled' ? textResult.value.results : []

    // If both failed, throw. A rejected search always leaves its result list empty,
    // so checking the two statuses is sufficient.
    if (vectorResult.status === 'rejected' && textResult.status === 'rejected') {
      throw new SearchError('Both vector and text searches failed')
    }

    // Combine and rank
    const combinedResults = HybridSearchService.combineSearchResults(
      vectorResults, textResults, hybridOptions
    )

    // `limit`, `offset`, and `metrics` are set up in code elided from this excerpt
    return { results: combinedResults.slice(offset, offset + limit), metrics }
  }
}

Parallel execution. Vector and text search are independent: neither needs the other's results. Running them concurrently cuts latency to max(vector_time, text_time) instead of vector_time + text_time. For typical dataset sizes, this is ~80ms instead of ~150ms.

Promise.allSettled, not Promise.all. If vector search fails (e.g., embedding API timeout), text search results still return. Promise.all would throw on the first failure and discard the successful results. A partial result set is better than an error page.
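The partial-failure behavior can be reproduced in isolation. The sketch below is a simplified stand-in, not the service code: Hit, hitsOrEmpty, and searchBoth are hypothetical names, and the two search functions are passed in so the example runs without a database or embedding API.

```typescript
interface Hit {
  id: string
  score: number
}

// Unwraps one settled search: fulfilled gives its hits, rejected gives an empty list.
function hitsOrEmpty(outcome: PromiseSettledResult<Hit[]>): Hit[] {
  return outcome.status === 'fulfilled' ? outcome.value : []
}

// Runs both searches concurrently and throws only when both of them reject.
async function searchBoth(
  vectorSearch: () => Promise<Hit[]>,
  textSearch: () => Promise<Hit[]>
): Promise<{ vector: Hit[]; text: Hit[] }> {
  const [vectorOutcome, textOutcome] = await Promise.allSettled([vectorSearch(), textSearch()])
  if (vectorOutcome.status === 'rejected' && textOutcome.status === 'rejected') {
    throw new Error('Both vector and text searches failed')
  }
  return { vector: hitsOrEmpty(vectorOutcome), text: hitsOrEmpty(textOutcome) }
}
```

If the vector search rejects with a timeout while the text search resolves with one hit, searchBoth resolves with an empty vector list and that one text hit instead of throwing.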

Three combination methods

The combiner supports three strategies for merging vector and text results. Each handles the "how do you combine scores from two different systems?" problem differently.

Weighted sum (default)

// packages/lib/src/datasets/search/hybrid-search.ts

private static combineByWeightedSum(
  vectorResult: SearchResult | undefined,
  textResult: SearchResult | undefined,
  vectorWeight: number,
  textWeight: number
): SearchResult | null {
  if (!vectorResult && !textResult) return null

  const baseResult = vectorResult || textResult!
  let combinedScore = 0

  if (vectorResult) combinedScore += (vectorResult.score || 0) * vectorWeight
  if (textResult) combinedScore += (textResult.score || 0) * textWeight

  return { ...baseResult, score: combinedScore, searchType: 'hybrid' }
}

finalScore = (vectorScore * 0.6) + (textScore * 0.4)

Simple, intuitive, configurable. The default 60/40 split favors semantic relevance while keeping keyword precision as a strong signal. Works well when both score distributions are roughly similar.

Reciprocal Rank Fusion (RRF)

// packages/lib/src/datasets/search/hybrid-search.ts

private static combineByRRF(
  vectorResult: SearchResult | undefined,
  textResult: SearchResult | undefined,
  k: number = 60
): SearchResult | null {
  if (!vectorResult && !textResult) return null

  const baseResult = vectorResult || textResult!
  let rrfScore = 0

  // RRF formula: 1 / (k + rank)
  if (vectorResult) rrfScore += 1 / (k + (vectorResult.rank || 1))
  if (textResult) rrfScore += 1 / (k + (textResult.rank || 1))

  return { ...baseResult, score: rrfScore, searchType: 'hybrid' }
}

finalScore = 1/(k + vectorRank) + 1/(k + textRank) where k = 60.

RRF ignores raw scores entirely; only rank positions matter. Vector scores (cosine similarity, 0-1) and text scores (ts_rank_cd, unbounded above) are on completely different scales, and weighted sum only works when the scores are comparable. RRF sidesteps the mismatch by using ranks instead of scores. It's robust and requires no tuning beyond the k parameter.
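To see the formula applied to whole lists rather than one segment at a time, here is a minimal sketch. rrfFuse is a hypothetical helper, not part of HybridSearchService, and it assumes each input list is already sorted best-first so that the array position gives the rank.

```typescript
// Fuses ranked lists of segment IDs with Reciprocal Rank Fusion.
// Ranks are 1-based; a segment missing from a list contributes nothing for that list.
function rrfFuse(rankedLists: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>()
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  // Highest fused score first
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}
```

With a vector ranking of a, b, c and a text ranking of b, d, a, the fused order is b, a, d, c. Segment b wins because it sits near the top of both lists (1/62 + 1/61), while c and d each appear in only one list.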

Linear combination

// packages/lib/src/datasets/search/hybrid-search.ts

private static combineByLinearCombination(
  vectorResult: SearchResult | undefined,
  textResult: SearchResult | undefined,
  vectorWeight: number,
  textWeight: number
): SearchResult | null {
  if (!vectorResult && !textResult) return null

  const baseResult = vectorResult || textResult!

  // Normalize scores to 0-1 before combining
  const vectorScore = vectorResult
    ? HybridSearchService.normalizeScore(vectorResult.score || 0, 'vector') : 0
  const textScore = textResult
    ? HybridSearchService.normalizeScore(textResult.score || 0, 'text') : 0

  const combinedScore = vectorScore * vectorWeight + textScore * textWeight

  return { ...baseResult, score: combinedScore, searchType: 'hybrid' }
}

Score-based like weighted sum, but with min-max normalization first. This handles the scale mismatch between cosine similarity and ts_rank by mapping both to 0-1 before combining. A middle ground between weighted sum (no normalization) and RRF (no scores).
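The body of normalizeScore is not shown in this post, so the following is only a textbook per-list min-max sketch under that assumption. minMaxNormalize and linearCombine are hypothetical names, and the 0.6/0.4 defaults simply mirror the hybrid defaults above.

```typescript
// Min-max normalization over one result list: the lowest score maps to 0, the highest to 1.
// If all scores are equal (including a single result), every score maps to 1.
function minMaxNormalize(scores: number[]): number[] {
  if (scores.length === 0) return []
  const min = Math.min(...scores)
  const max = Math.max(...scores)
  if (max === min) return scores.map(() => 1)
  return scores.map(s => (s - min) / (max - min))
}

// Weighted combination of two scores that are already on a 0-1 scale.
function linearCombine(
  vectorScore: number,
  textScore: number,
  vectorWeight = 0.6,
  textWeight = 0.4
): number {
  return vectorScore * vectorWeight + textScore * textWeight
}
```

One consequence of per-list min-max is that the weakest result in each list always normalizes to 0, regardless of how relevant it is in absolute terms.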

De-duplication

A segment can appear in both vector and text results. The combiner handles this by indexing all results by segment ID into two maps, then iterating over the union of all segment IDs. Each segment gets one combined score, not two entries.

// packages/lib/src/datasets/search/hybrid-search.ts

const vectorMap = new Map<string, SearchResult>()
const textMap = new Map<string, SearchResult>()
const allSegmentIds = new Set<string>()

vectorResults.forEach(result => {
  vectorMap.set(result.segment.id, result)
  allSegmentIds.add(result.segment.id)
})
textResults.forEach(result => {
  textMap.set(result.segment.id, result)
  allSegmentIds.add(result.segment.id)
})

// Combine per segment
allSegmentIds.forEach(segmentId => {
  const vectorResult = vectorMap.get(segmentId)
  const textResult = textMap.get(segmentId)
  // ... combine using selected method
})

A segment that scores 0.9 in vector and 0.85 in text gets one combined entry — not two separate results pointing to the same content.

Search analytics — first-class, not afterthought

Every search is recorded. Not as a log line — as structured data in the database.

// packages/lib/src/datasets/services/search.service.ts

export class SearchService {
  static async search(
    query: SearchQuery,
    organizationId: string,
    userId?: string
  ): Promise<SearchResponse> {
    const startTime = Date.now()

    // ... validate, fetch datasets, execute search ...

    // Fire-and-forget analytics recording
    void SearchService.recordSearchAnalytics({
      query: query.query,
      queryType: query.searchType,
      resultsCount: results.length,
      responseTime: Date.now() - startTime,
      vectorSimilarityThreshold: query.similarityThreshold,
      maxResults: query.limit,
      filters: query.filters,
      organizationId,
      userId: finalUserId,
    })

    return { results, total, query: query.query, searchType, responseTime, hasMore }
  }
}

Fire-and-forget recording. The analytics write is started but not awaited, so the response never waits on it. A DB write failure doesn't slow down or block the search. The void prefix explicitly marks the un-awaited promise: this is intentional, not a missing await.
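One detail worth making explicit: an un-awaited promise still needs a rejection handler somewhere, because Node.js 15 and later terminate the process on an unhandled rejection by default. The sketch below shows one way to guarantee that. fireAndForget is a hypothetical wrapper; whether recordSearchAnalytics catches its own errors internally is not shown in this post.

```typescript
// Starts a write on the next microtask without making the caller wait for it.
// The returned promise never rejects as long as onError itself does not throw:
// both synchronous throws and rejections from `write` are routed to onError.
function fireAndForget(
  write: () => Promise<void>,
  onError: (error: unknown) => void
): Promise<void> {
  return Promise.resolve().then(write).catch(onError)
}

// Usage: the failing write is reported, and nothing propagates to the caller.
void fireAndForget(
  async () => { throw new Error('analytics DB write failed') },
  error => console.warn('analytics dropped:', error)
)
```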

Per-result score tracking. DatasetSearchResult stores rank and score for every result in every search. This enables "what score threshold captures 95% of clicked results?" — data-driven threshold tuning instead of guesswork.
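The threshold question reduces to a percentile over the scores of clicked results. A minimal sketch, assuming click data can be joined to the stored scores (the post does not show how clicks are recorded) and using a hypothetical helper name:

```typescript
// Returns the highest threshold that still keeps `coverage` (0..1) of the clicked results,
// i.e. the score of the last clicked result that must be included.
function thresholdForCoverage(clickedScores: number[], coverage = 0.95): number | undefined {
  if (clickedScores.length === 0) return undefined
  const sorted = [...clickedScores].sort((a, b) => b - a) // highest score first
  const needed = Math.min(sorted.length, Math.max(1, Math.ceil(coverage * sorted.length)))
  return sorted[needed - 1]
}
```

For clicked scores 0.9, 0.7, 0.5, and 0.3 with a coverage of 0.5, two results must be kept, so the threshold is 0.7.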

Search history with deduplication. getSearchHistory() returns a user's recent queries, deduplicated. Searching "return policy" three times shows up once. This powers the autocomplete/suggestions feature.

Aggregate analytics. getSearchAnalytics() computes: total queries, average response time, popular queries (by frequency), and search type distribution (vector/text/hybrid). This is the dashboard data for "is search working well?"

Suggestions from history + popularity. getSuggestions() combines the user's personal history with globally popular queries. Type "ret" and see "return policy" (popular across the org) and "retrieve order status" (from your history).
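The merging logic behind such a suggestion list might look like the sketch below. It is not the actual getSuggestions() implementation: the suggest function, the prefix matching, and the choice to list personal history before popular queries are all assumptions made for illustration.

```typescript
// Merges a user's own history with org-wide popular queries for a typed prefix.
// History entries come first; duplicates are dropped case-insensitively.
function suggest(prefix: string, history: string[], popular: string[], limit = 5): string[] {
  const needle = prefix.trim().toLowerCase()
  if (needle === '') return []

  const seen = new Set<string>()
  const suggestions: string[] = []
  for (const candidate of [...history, ...popular]) {
    const key = candidate.trim().toLowerCase()
    if (!key.startsWith(needle) || seen.has(key)) continue
    seen.add(key)
    suggestions.push(candidate)
    if (suggestions.length === limit) break
  }
  return suggestions
}
```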

The search service — orchestration layer

The SearchService is the single entry point for all search. It validates, routes, paginates, and records.

// packages/lib/src/datasets/services/search.service.ts

export class SearchService {
  static async search(
    query: SearchQuery,
    organizationId: string,
    userId?: string
  ): Promise<SearchResponse> {
    // 1. Validate query (non-empty, max 1000 chars)
    SearchService.validateQuery(query)

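    // 2. Fetch accessible datasets with embedding configs
    const accessibleDatasetConfigs = await SearchService.getAccessibleDatasets(
      organizationId, finalUserId, query.datasetIds, query.includeInactive
    )
    // Full-text search only needs the IDs (same mapping as in HybridSearchService)
    const accessibleDatasetIds = accessibleDatasetConfigs.map(d => d.id)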

    // 3. Route to search implementation
    switch (query.searchType) {
      case 'vector':
        results = await VectorSearchService.search(query, accessibleDatasetConfigs, /* ... */)
        break
      case 'text':
        results = await FullTextSearchService.search(query, accessibleDatasetIds, /* ... */)
        break
      case 'hybrid':
      default:
        results = await HybridSearchService.search(query, accessibleDatasetConfigs, /* ... */)
        break
    }

    // 4. Record analytics (fire-and-forget)
    void SearchService.recordSearchAnalytics(/* ... */)

    // 5. Return response with metrics
    return { results, total, responseTime, searchType, hasMore }
  }
}

Dataset accessibility check. Not all datasets are searchable by all users. The service fetches datasets the user has access to (org-scoped, active status) and intersects with the requested datasetIds. This prevents searching archived or cross-org datasets.
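The intersection rule can be stated as a small pure function. In the real service this is presumably a database query inside getAccessibleDatasets; the in-memory version below, with its DatasetRef type and status values, is a hypothetical simplification to show the three conditions side by side.

```typescript
// Assumed minimal dataset shape; the status values are illustrative.
interface DatasetRef {
  id: string
  organizationId: string
  status: 'ACTIVE' | 'ARCHIVED'
}

// Keeps datasets that belong to the org, are active (unless includeInactive is set),
// and were explicitly requested (when a non-empty request list is given).
function filterAccessible(
  all: DatasetRef[],
  organizationId: string,
  requestedIds?: string[],
  includeInactive = false
): DatasetRef[] {
  const requested =
    requestedIds && requestedIds.length > 0 ? new Set(requestedIds) : undefined
  return all.filter(
    d =>
      d.organizationId === organizationId &&
      (includeInactive || d.status === 'ACTIVE') &&
      (requested === undefined || requested.has(d.id))
  )
}
```

Requesting a dataset ID from another organization simply yields no match, so the caller cannot distinguish "does not exist" from "not yours".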

Search type routing. The searchType parameter determines which implementation runs. The search service doesn't know how vector search works; it delegates. Adding a new search strategy (e.g., graph-based) means adding one implementation and one switch case, while the rest of the orchestrator stays unchanged.

Every response includes responseTime. Measured at the orchestration layer — includes DB queries, embedding generation, result enrichment. The number the user actually experiences, not just the vector comparison time.

The dataset management UI

The dataset UI has 37 components across 6 areas. Each component has one job.

Dataset list view

Two views: a visual grid (dataset-card.tsx) and a data-dense table (datasets-table-view.tsx). Toggle between them. Grid shows status badges, document counts, last indexed date. Table adds sorting by any column.

Stats cards at the top (datasets-stats-cards.tsx) show total datasets, total documents, total size, and average search time. A quick health check without drilling into any dataset.

The empty state (datasets-empty-state.tsx) isn't just "no datasets": it's a guided creation flow. Upload files inline or create an empty dataset to configure first.

Dataset detail view

The detail page uses a provider pattern. dataset-detail-provider.tsx wraps everything in a React context. Dataset data, documents, and actions are available to all child components without prop drilling.

The document management area (documents/document-management.tsx) is the document table with filtering by status, type, and search. Batch operations — reprocess, delete, archive — via multi-select. Real-time processing progress shows via document-processing-progress.tsx.

The document detail drawer (documents/document-detail-drawer.tsx) opens as a side drawer showing the full document info: extracted content preview, chunk settings (dataset defaults or document overrides), processing metrics (time, chunk count), and error messages for failed documents.

Upload flow

The drag-and-drop zone (document-upload-zone.tsx) accepts PDF, DOCX, TXT, and HTML. It shows file type icons, validates file size, and warns about duplicates (checksum check against existing documents). Multiple files at once.

Processing feedback is split between toasts and inline status. document-processing-toast.tsx shows toast notifications for processing start/completion/failure. processing-status.tsx renders inline status badges (UPLOADED, PROCESSING with spinner animation, INDEXED, FAILED) in the document table.

Search UI

The search method selector (search/search-method-selector.tsx) toggles between Vector, Text, and Hybrid. Each method shows its own options panel:

  • Vector options — similarity threshold slider
  • Text options — fuzzy matching toggle, phrase search toggle
  • Hybrid options — vector/text weight sliders, combination method dropdown (weighted sum, RRF, linear)

Advanced filters (search/advanced-search-options.tsx) let you filter by document type, MIME type, date range, file size, and metadata. The panel is collapsible so it doesn't clutter the default search.

Each result item (search/search-result-item.tsx) shows: matched content with highlights, similarity score, source document name, dataset name, and a search type badge.

Settings

Five settings sections, each in its own component:

  • General — name, description, status toggle
  • Embedding — model picker (shows available models from org config), dimension selector
  • Chunking — chunk size slider, overlap, delimiter input, preprocessing toggles (normalize whitespace, remove URLs/emails)
  • Vector DB — vector DB type (PostgreSQL), connection config
  • Search — default search type, default weights, similarity threshold

Each section is independently saveable. Changing the embedding model triggers a warning that existing documents will need reprocessing.

The tRPC router — feature-gated CRUD

// apps/web/src/server/api/routers/dataset.ts

create: protectedProcedure
  .input(createDatasetSchema)
  .mutation(async ({ ctx, input }) => {
    // Check feature access + limits
    await FeaturePermissionService.check(ctx.session.organizationId, 'datasets')
    // ... create dataset
  })

Feature gating, not just auth. Creating a dataset isn't just "is the user logged in?" but "does this organization's plan include datasets, and are they under their dataset limit?" FeaturePermissionService handles both checks.
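The two checks can be sketched as a single decision function. This is not FeaturePermissionService's real API: PlanFeature, GateResult, checkDatasetAccess, and the convention that a null limit means unlimited are all assumptions for the example.

```typescript
// Assumed plan entry: whether the feature is included, and an optional cap.
interface PlanFeature {
  enabled: boolean
  limit: number | null // null = unlimited (assumed convention)
}

type GateResult =
  | { allowed: true }
  | { allowed: false; reason: 'feature_not_in_plan' | 'limit_reached' }

// Check 1: does the plan include the feature? Check 2: is the org still under its limit?
function checkDatasetAccess(feature: PlanFeature | undefined, currentCount: number): GateResult {
  if (!feature || !feature.enabled) {
    return { allowed: false, reason: 'feature_not_in_plan' }
  }
  if (feature.limit !== null && currentCount >= feature.limit) {
    return { allowed: false, reason: 'limit_reached' }
  }
  return { allowed: true }
}
```

Returning a reason rather than a bare boolean lets the UI choose between an upgrade prompt and a "limit reached" message.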

Processing status as a dedicated endpoint. getProcessingStatus returns a breakdown: { uploaded: 3, processing: 2, indexed: 45, failed: 1 }. The UI polls this during bulk uploads to show real-time progress without loading full document data.
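The breakdown shape above is easy to derive from a list of statuses. The sketch below does it in memory for illustration only; the real endpoint more likely runs a grouped count in SQL. summarizeStatuses and isProcessingDone are hypothetical helpers, while the four status names come from the status badges described earlier.

```typescript
type ProcessingStatus = 'UPLOADED' | 'PROCESSING' | 'INDEXED' | 'FAILED'

interface StatusBreakdown {
  uploaded: number
  processing: number
  indexed: number
  failed: number
}

// Counts documents per status into the shape returned by the status endpoint.
function summarizeStatuses(statuses: ProcessingStatus[]): StatusBreakdown {
  const breakdown: StatusBreakdown = { uploaded: 0, processing: 0, indexed: 0, failed: 0 }
  for (const status of statuses) {
    breakdown[status.toLowerCase() as keyof StatusBreakdown] += 1
  }
  return breakdown
}

// A polling client can stop once nothing is waiting or in flight.
function isProcessingDone(breakdown: StatusBreakdown): boolean {
  return breakdown.uploaded === 0 && breakdown.processing === 0
}
```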

Stats aggregation in SQL. getStats computes document count, total size, average processing time, and total searches via SQL aggregations — not by loading all documents into memory. For a dataset with 10,000 documents, this is a 5ms query vs a 500ms data transfer.

Key trade-offs

| Decision | Trade-off | Why we chose it |
| --- | --- | --- |
| Hybrid search as default | Runs two search queries instead of one | Neither approach alone is good enough for production RAG |
| RRF as an option | Ignores raw scores (only ranks) | Handles scale mismatch between cosine similarity and ts_rank |
| Pre-computed tsvectors | Slight write overhead per segment | 100x faster full-text search at query time |
| Fire-and-forget analytics | Analytics can be lost on crash | Search latency matters more than analytics completeness |
| 37 UI components | Many files | Each component has one job; maintainable as the feature grows |
| Promise.allSettled for hybrid | More complex error handling | Partial results are better than total failure |
| SQL-side filtering for FTS | More complex SQL | Avoids loading thousands of results into memory |

The full picture

Across these three posts, the dataset engine breaks down to:

  • 5 schema tables — Dataset, Document, DocumentSegment, DatasetSearchQuery, DatasetSearchResult
  • 5 services — DatasetService, DocumentService, EmbeddingService, SearchService, VectorService
  • 3 search implementations — VectorSearchService, FullTextSearchService, HybridSearchService
  • 4 extractors — PDF, DOCX, HTML, text (with fallback chain)
  • 1 text chunker — semantic boundary detection with 6-level priority
  • 37 UI components — upload, processing, search, settings, document management

All running on PostgreSQL with pgvector. No external vector database. No separate search service. One database, one deployment, one set of credentials.

The dataset engine is open source as part of Auxx.ai. If you're building RAG for a SaaS product and debating whether to use Pinecone or build your own, PostgreSQL is probably enough.