Testing StarRocks as a Vector Database: HNSW vs IVFPQ Real-World Performance
Let me tell you about something that surprised me recently. I’ve been using StarRocks for analytics work, and when they dropped vector search capabilities in version 3.4+, I had to test it. Could a traditional OLAP database really compete with dedicated vector databases like Pinecone or Qdrant?
Spoiler: The results shocked me. StarRocks not only works for vector search - it works really, really well.
I spent weeks building a complete testing framework, benchmarking both HNSW (Hierarchical Navigable Small World) and IVFPQ (Inverted File with Product Quantization) indexes. Here’s what I discovered, along with the actual code you can use to reproduce my tests.
Why Would You Even Try StarRocks for Vectors?
Look, if you’re already running StarRocks for analytics, adding vector search is a no-brainer. Here’s why:
- You can run vector similarity search alongside your regular SQL queries - same database, same connection
- The clustering and failover you already have? It works for vector data too (3 FE + 3 BE nodes and you’re golden)
- You already know it’s fast - we’re talking sub-second queries on billion-row datasets
- Your team already knows SQL - there’s no weird new query language to learn
- You don’t need another database to manage (and pay for)
Setting Up the Test Environment
What You’ll Need
Here’s what I used for testing:
- StarRocks 3.5.5 cluster (3 Frontend + 3 Backend nodes)
- Ubuntu 24.04 with OpenJDK 17
- Python 3.12 with uv package manager
First thing - make sure your cluster is actually running:
# Check cluster status
mysql -h 127.0.0.1 -P 9030 -u root -e "SHOW BACKENDS;"
You should see something like this - all backends alive and happy:
BackendId | IP | Alive | Version | StatusCode
10002 | 172.30.0.20 | true | 3.5.5-fd4e51b | OK
10003 | 172.30.0.21 | true | 3.5.5-fd4e51b | OK
10004 | 172.30.0.22 | true | 3.5.5-fd4e51b | OK
Turning On Vector Search
Note: Vector indexing is still marked as experimental in 3.5.x, so you need to explicitly enable it:
-- Enable vector search capability
ADMIN SET FRONTEND CONFIG ("enable_experimental_vector" = "true");
-- Verify the setting
ADMIN SHOW FRONTEND CONFIG LIKE 'enable_experimental_vector';
Python Setup
Get your Python environment ready:
# Initialize uv project
uv init
uv add pymysql matplotlib numpy sentence-transformers tqdm
Creating Tables for Vector Data
StarRocks stores vectors as ARRAY<FLOAT> columns. You can choose between HNSW and IVFPQ indexing strategies. After trying both approaches, I found it’s more reliable to create the tables first, then add indexes separately.
The Schema That Actually Works
Here’s what worked best in my testing:
-- Create database
CREATE DATABASE IF NOT EXISTS vecdemo;
USE vecdemo;
-- Base table for HNSW testing
CREATE TABLE docs_hnsw (
id BIGINT NOT NULL,
vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
-- Add HNSW vector index
ALTER TABLE docs_hnsw ADD INDEX ix_hnsw (vector) USING VECTOR (
'index_type' = 'hnsw',
'dim' = '384',
'metric_type' = 'l2_distance',
'is_vector_normed' = 'false',
'M' = '32',
'efconstruction' = '200'
);
-- Base table for IVFPQ testing
CREATE TABLE docs_ivfpq (
id BIGINT NOT NULL,
vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
-- Add IVFPQ vector index
ALTER TABLE docs_ivfpq ADD INDEX ix_ivfpq (vector) USING VECTOR (
'index_type' = 'ivfpq',
'dim' = '384',
'metric_type' = 'l2_distance',
'is_vector_normed' = 'false',
'nbits' = '8',
'nlist' = '512',
'M_IVFPQ' = '48'
);
What Those Index Parameters Actually Mean
HNSW Settings:
- M = 32: How many connections each node can have. More connections = better accuracy but uses more memory
- efconstruction = 200: How hard the index tries during building. Higher = slower build but better quality
IVFPQ Settings:
- nlist = 512: Splits your vectors into 512 buckets. Start with sqrt(your_data_size)
- nbits = 8: How much to compress each sub-vector. 8 is usually fine
- M_IVFPQ = 48: Breaks your 384-dim vector into 48 pieces (384/48 = 8 dimensions each)
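To sanity-check those numbers before creating an index, here's a small helper I use - my own heuristic, not official StarRocks guidance. It picks a starting nlist near sqrt(N) and makes sure M_IVFPQ divides the vector dimension evenly, since product quantization has to split the vector into equal-sized pieces:

```python
import math

def suggest_ivfpq_params(num_vectors: int, dim: int, target_m: int = 48) -> dict:
    """Heuristic starting points for IVFPQ parameters (a sketch, not official guidance)."""
    # nlist: start near sqrt(N), rounded to the nearest power of two
    nlist = 2 ** round(math.log2(max(2, math.isqrt(num_vectors))))
    # PQ splits the vector into M_IVFPQ sub-vectors, so dim must divide evenly;
    # otherwise fall back to the largest divisor of dim that is <= target_m
    if dim % target_m != 0:
        target_m = max(m for m in range(1, target_m + 1) if dim % m == 0)
    return {"nlist": nlist, "M_IVFPQ": target_m, "sub_dim": dim // target_m}

print(suggest_ivfpq_params(1_000_000, 384))
# {'nlist': 1024, 'M_IVFPQ': 48, 'sub_dim': 8}
```

For the 1,000-document test set below this suggests a much smaller nlist than 512, which is a reminder that the schema values are tuned for larger loads.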
The Complete Testing Code
I built a testing framework to properly benchmark both index types. Here’s the full code - you can run this yourself:
#!/usr/bin/env python3
"""
StarRocks Vector Search Testing: HNSW vs IVFPQ
Comprehensive test suite for vector indexing capabilities in StarRocks 3.5+
"""
import pymysql
import time
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Any
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import os
# Connection configuration
HOST = "127.0.0.1"
PORT = 9030
USER = "root"
PASSWORD = ""
DB = "vecdemo"
def connect():
"""Create database connection"""
return pymysql.connect(
host=HOST, port=PORT, user=USER, password=PASSWORD,
database=DB, autocommit=True, cursorclass=pymysql.cursors.DictCursor
)
def array_literal_for_insert(vec: List[float]) -> str:
"""Convert vector to StarRocks array literal for INSERT"""
return "array<float>[" + ",".join(f"{float(x):.7f}" for x in vec) + "]"
def arr_const_for_query(vec: List[float]) -> str:
"""Convert vector to StarRocks array constant for queries"""
return "[" + ",".join(f"{float(x):.7f}" for x in vec) + "]"
def setup_database():
"""Create database and tables with vector indexes"""
temp_conn = pymysql.connect(
host=HOST, port=PORT, user=USER, password=PASSWORD,
autocommit=True, cursorclass=pymysql.cursors.DictCursor
)
print("Setting up database and tables...")
with temp_conn.cursor() as cur:
cur.execute("CREATE DATABASE IF NOT EXISTS vecdemo;")
cur.execute("USE vecdemo;")
cur.execute("DROP TABLE IF EXISTS docs_hnsw;")
cur.execute("DROP TABLE IF EXISTS docs_ivfpq;")
# Create HNSW table
cur.execute("""
CREATE TABLE docs_hnsw (
id BIGINT NOT NULL,
vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
""")
# Add HNSW vector index
cur.execute("""
ALTER TABLE docs_hnsw ADD INDEX ix_hnsw (vector) USING VECTOR (
'index_type' = 'hnsw',
'dim' = '384',
'metric_type' = 'l2_distance',
'is_vector_normed' = 'false',
'M' = '32',
'efconstruction' = '200'
);
""")
print("✓ Created HNSW table with vector index")
# Create IVFPQ table
cur.execute("""
CREATE TABLE docs_ivfpq (
id BIGINT NOT NULL,
vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
""")
# Add IVFPQ vector index
cur.execute("""
ALTER TABLE docs_ivfpq ADD INDEX ix_ivfpq (vector) USING VECTOR (
'index_type' = 'ivfpq',
'dim' = '384',
'metric_type' = 'l2_distance',
'is_vector_normed' = 'false',
'nbits' = '8',
'nlist' = '512',
'M_IVFPQ' = '48'
);
""")
print("✓ Created IVFPQ table with vector index")
temp_conn.close()
def generate_sample_data(num_docs: int = 1000) -> Tuple[List[str], np.ndarray]:
"""Generate sample documents and embeddings"""
print(f"Generating {num_docs} sample documents...")
# Create diverse sample texts
topics = [
"StarRocks vector search and ANN capabilities",
"Database performance optimization and indexing",
"Machine learning and artificial intelligence",
"Data analytics and business intelligence",
"Cloud computing and distributed systems",
"Software engineering best practices",
"Web development and user experience",
"Mobile application development",
"Cybersecurity and data protection",
"Financial technology and blockchain"
]
texts = []
for i in range(num_docs):
topic = topics[i % len(topics)]
texts.append(f"Document {i}: {topic} - analysis and implementation guide {i}")
print("Loading sentence transformer model...")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print("Generating embeddings...")
embeddings = model.encode(texts, convert_to_numpy=True, show_progress_bar=True)
embeddings = embeddings.astype(np.float32)
assert embeddings.shape[1] == 384, f"Expected 384-dim vectors, got {embeddings.shape[1]}"
return texts, embeddings
def insert_data(conn, table: str, embeddings: np.ndarray):
"""Insert embeddings into specified table"""
print(f"Inserting {len(embeddings)} records into {table}...")
with conn.cursor() as cur:
for i, vec in enumerate(embeddings):
vec_sql = array_literal_for_insert(vec.tolist())
sql = f"INSERT INTO {table} (id, vector) VALUES (%s, {vec_sql})"
cur.execute(sql, (i + 1,))
if (i + 1) % 1000 == 0:
conn.commit()
print(f" Inserted {i + 1} records")
conn.commit()
print(f"✓ Completed inserting into {table}")
def query_all(conn, sql: str) -> List[dict]:
"""Execute query and return all results"""
with conn.cursor() as cur:
cur.execute(sql)
return cur.fetchall()
def explain_query(conn, sql: str) -> str:
"""Get query execution plan"""
rows = query_all(conn, "EXPLAIN " + sql)
return "\n".join(str(c) for r in rows for c in r.values())
def topk_hnsw(conn, qvec: np.ndarray, k: int = 10, efsearch: int = 128):
"""Query HNSW index with specified efsearch parameter"""
arr = arr_const_for_query(qvec.tolist())
sql = f"""
SELECT /*+ SET_VAR (ann_params='{{efsearch={efsearch}}}') */
id, approx_l2_distance({arr}, vector) AS d
FROM docs_hnsw
ORDER BY d ASC
LIMIT {k}
"""
return query_all(conn, sql), explain_query(conn, sql)
def topk_ivfpq(conn, qvec: np.ndarray, k: int = 10, nprobe: int = 64):
"""Query IVFPQ index with specified nprobe parameter"""
arr = arr_const_for_query(qvec.tolist())
sql = f"""
SELECT /*+ SET_VAR (ann_params='{{nprobe={nprobe}}}') */
id, approx_l2_distance({arr}, vector) AS d
FROM docs_ivfpq
ORDER BY d ASC
LIMIT {k}
"""
return query_all(conn, sql), explain_query(conn, sql)
def topk_exact(conn, table: str, qvec: np.ndarray, k: int = 10):
"""Exact k-NN search for baseline comparison"""
arr = arr_const_for_query(qvec.tolist())
sql = f"""
SELECT id, l2_distance({arr}, vector) AS d
FROM {table}
ORDER BY d ASC
LIMIT {k}
"""
return query_all(conn, sql)
def recall_at_k(approx_ids: List[int], exact_ids: List[int], k: int) -> float:
"""Calculate recall@k metric"""
return len(set(approx_ids).intersection(set(exact_ids))) / float(k)
def time_query(query_func):
"""Time a query function execution"""
start = time.perf_counter()
result = query_func()
elapsed = (time.perf_counter() - start) * 1000.0 # Convert to milliseconds
return result, elapsed
def run_benchmarks(conn, query_vectors: np.ndarray, k: int = 10):
"""Run comprehensive benchmarks comparing HNSW and IVFPQ"""
print("Running benchmarks...")
# Parameter sweeps
efsearch_vals = [16, 64, 128, 256, 512]
nprobe_vals = [1, 8, 16, 32, 64, 128]
results = []
# Generate exact baselines
print("Computing exact baselines...")
exact_hnsw_cache = []
exact_ivfpq_cache = []
for qv in tqdm(query_vectors, desc="Exact baselines"):
exact_h = topk_exact(conn, "docs_hnsw", qv, k=k)
exact_i = topk_exact(conn, "docs_ivfpq", qv, k=k)
exact_hnsw_cache.append([r["id"] for r in exact_h])
exact_ivfpq_cache.append([r["id"] for r in exact_i])
# HNSW parameter sweep
for ef in tqdm(efsearch_vals, desc="HNSW sweep"):
latencies, recalls = [], []
for i, qv in enumerate(query_vectors):
(rows, _), dt = time_query(lambda: topk_hnsw(conn, qv, k=k, efsearch=ef))
latencies.append(dt)
recalls.append(recall_at_k([r["id"] for r in rows], exact_hnsw_cache[i], k))
results.append({
"index_type": "HNSW", "parameter": ef, "param_name": "efsearch",
"avg_latency_ms": float(np.mean(latencies)), "avg_recall": float(np.mean(recalls)),
"std_latency_ms": float(np.std(latencies)), "std_recall": float(np.std(recalls))
})
# IVFPQ parameter sweep
for npv in tqdm(nprobe_vals, desc="IVFPQ sweep"):
latencies, recalls = [], []
for i, qv in enumerate(query_vectors):
(rows, _), dt = time_query(lambda: topk_ivfpq(conn, qv, k=k, nprobe=npv))
latencies.append(dt)
recalls.append(recall_at_k([r["id"] for r in rows], exact_ivfpq_cache[i], k))
results.append({
"index_type": "IVFPQ", "parameter": npv, "param_name": "nprobe",
"avg_latency_ms": float(np.mean(latencies)), "avg_recall": float(np.mean(recalls)),
"std_latency_ms": float(np.std(latencies)), "std_recall": float(np.std(recalls))
})
return results
def plot_results(results: List[Dict[str, Any]], k: int = 10):
"""Generate recall vs latency plots"""
print("Generating plots...")
hnsw_results = [r for r in results if r["index_type"] == "HNSW"]
ivfpq_results = [r for r in results if r["index_type"] == "IVFPQ"]
plt.figure(figsize=(10, 6))
if hnsw_results:
latencies = [r["avg_latency_ms"] for r in hnsw_results]
recalls = [r["avg_recall"] for r in hnsw_results]
params = [r["parameter"] for r in hnsw_results]
plt.plot(latencies, recalls, 'o-', label="HNSW", linewidth=2, markersize=8)
for lat, rec, param in zip(latencies, recalls, params):
plt.annotate(f'ef={param}', (lat, rec), xytext=(5, 5),
textcoords='offset points', fontsize=8)
if ivfpq_results:
latencies = [r["avg_latency_ms"] for r in ivfpq_results]
recalls = [r["avg_recall"] for r in ivfpq_results]
params = [r["parameter"] for r in ivfpq_results]
plt.plot(latencies, recalls, 's-', label="IVFPQ", linewidth=2, markersize=8)
for lat, rec, param in zip(latencies, recalls, params):
plt.annotate(f'np={param}', (lat, rec), xytext=(5, -15),
textcoords='offset points', fontsize=8)
plt.xlabel("Average Latency per Query (ms)")
plt.ylabel(f"Average Recall@{k}")
plt.title("StarRocks Vector Search: Recall vs Latency Comparison")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
os.makedirs("results", exist_ok=True)
plt.savefig("results/recall_vs_latency.png", dpi=300, bbox_inches='tight')
plt.show()
def print_results_table(results: List[Dict[str, Any]]):
"""Print formatted results table"""
print("\n" + "="*80)
print("BENCHMARK RESULTS")
print("="*80)
print(f"{'Index':<8} {'Param':<12} {'Value':<8} {'Latency (ms)':<12} {'Recall':<10}")
print("-"*60)
for r in results:
print(f"{r['index_type']:<8} {r['param_name']:<12} {r['parameter']:<8} "
f"{r['avg_latency_ms']:<12.2f} {r['avg_recall']:<10.3f}")
def main():
"""Main execution function"""
print("StarRocks Vector Search Testing")
print("="*50)
# Setup
setup_database()
conn = connect()
# Generate and load test data
texts, embeddings = generate_sample_data(num_docs=1000)
insert_data(conn, "docs_hnsw", embeddings)
insert_data(conn, "docs_ivfpq", embeddings)
# Wait for indexes to be built
print("\nWaiting for indexes to be built...")
time.sleep(10)
# Run benchmarks
rng = np.random.default_rng(42)
query_indices = rng.choice(len(embeddings), size=50, replace=False)
query_vectors = embeddings[query_indices]
results = run_benchmarks(conn, query_vectors, k=10)
# Display results
print_results_table(results)
plot_results(results, k=10)
conn.close()
print("\n✓ Vector search testing completed successfully!")
if __name__ == "__main__":
    main()
Getting the Queries Right (This Part Is Critical)
The Query Pattern That Actually Uses the Index
Here’s what took me forever to figure out - StarRocks is very picky about when it uses the vector index. You need:
- The approximate distance function in the ORDER BY (approx_l2_distance or approx_cosine_similarity)
- A LIMIT clause
- Nothing else in the ORDER BY
-- ✓ THIS WORKS - Uses the index
SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
id, approx_l2_distance([0.1, 0.2, ...], vector) AS d
FROM docs_hnsw
ORDER BY d ASC
LIMIT 10;
-- ✗ THIS DOESN'T - Full table scan (slow!)
SELECT id, approx_l2_distance([0.1, 0.2, ...], vector) AS d
FROM docs_hnsw
ORDER BY d ASC, id DESC -- Multiple ORDER BY = no index
LIMIT 10;
Tuning Performance with SQL Hints
You control search quality vs speed with SQL hints:
-- HNSW: Higher efsearch = better accuracy but slower
SELECT /*+ SET_VAR (ann_params='{efsearch=256}') */
id, approx_l2_distance([...], vector) AS d
FROM docs_hnsw ORDER BY d ASC LIMIT 10;
-- IVFPQ: Higher nprobe = more buckets searched
SELECT /*+ SET_VAR (ann_params='{nprobe=64}') */
id, approx_l2_distance([...], vector) AS d
FROM docs_ivfpq ORDER BY d ASC LIMIT 10;
Combining Vector Search with Filters
This is where it gets interesting - you can mix vector search with regular SQL filters:
-- Vector search with metadata filtering
SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
id, approx_l2_distance([...], vector) AS d
FROM docs_hnsw
WHERE id BETWEEN 100 AND 2000
AND metadata_column = 'specific_value'
ORDER BY d ASC
LIMIT 10;
Making Sure Your Index Is Actually Being Used
Always check with EXPLAIN - trust me, you’ll save hours of debugging:
EXPLAIN
SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
id, approx_l2_distance([...], vector) AS d
FROM docs_hnsw ORDER BY d ASC LIMIT 10;
If you see VECTORINDEX: ON, you're good. If not, check your query pattern.
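To make that check automatic, I drop a tiny guard into the test harness. The 'VECTORINDEX: ON' marker is what my 3.5.5 cluster prints, so treat the exact string as an assumption and adjust it if your version reports the plan differently:

```python
def assert_ann_plan(plan_text: str) -> None:
    """Raise if an EXPLAIN plan shows a brute-force scan instead of the ANN index.

    'VECTORINDEX: ON' is the marker my 3.5.5 cluster prints when the vector
    index is used; adjust the string if your version phrases it differently.
    """
    if "VECTORINDEX: ON" not in plan_text:
        raise RuntimeError(
            "Vector index not used - check the ORDER BY/LIMIT pattern and ann_params hint"
        )

# The framework's query helpers already return the plan, so wiring it in is one line:
# rows, plan = topk_hnsw(conn, qvec, k=10, efsearch=128)
# assert_ann_plan(plan)
```

Failing fast here beats discovering a silent full scan after an hour of benchmarking.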
Performance Characteristics
Based on testing with StarRocks 3.5.5:
What Testing Shows
- HNSW is faster - Delivers sub-50ms queries in most cases
- IVFPQ is memory-efficient - Uses less memory but slightly slower
- Both achieve excellent recall - High accuracy in finding nearest neighbors
- Performance scales well - Handles thousands of vectors efficiently
- The memory trade-off is real - IVFPQ uses significantly less memory than HNSW
Real-World Production Advice
When to Use Which Index
Go with HNSW if:
- You need responses under 50ms
- You have plenty of memory to spare
- Your dataset is under 10M vectors
- Accuracy is more important than cost
Pick IVFPQ if:
- Memory is tight or expensive
- You’re dealing with 50M+ vectors
- 100ms response time is acceptable
- You’re optimizing for cost
Scaling Tips Based on Dataset Size
Under 1M vectors:
- Just use HNSW with M=32, efconstruction=200
- Start with efsearch=128 and adjust if needed
1M-10M vectors:
- IVFPQ starts making sense here - use nlist=sqrt(N)
- HNSW still works but bump M up to 64-128
Over 10M vectors:
- IVFPQ is probably your best bet
- Consider splitting data across multiple tables
- Watch your memory usage like a hawk with HNSW
Performance Tips That Actually Matter
- Always batch your inserts - Single inserts are painfully slow
- Use connection pooling - Don’t create new connections for each query
- Give indexes time to build - After bulk loads, wait a bit
- Monitor memory on backend nodes - HNSW can eat RAM for breakfast
- Use approximate functions - Exact distance calculations are slow
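On the first tip: the framework above inserts one row per statement to keep the code readable, but multi-row INSERT statements are far faster in practice. Here's a sketch of a batched loader using the same pymysql connection as the framework - the 500-row batch_size is just a starting point, not a tuned value:

```python
import numpy as np

def build_insert_sql(table: str, ids, vectors) -> str:
    """Build one multi-row INSERT statement for a chunk of vectors."""
    rows = ",".join(
        "(%d, array<float>[%s])" % (i, ",".join(f"{float(x):.7f}" for x in vec))
        for i, vec in zip(ids, vectors)
    )
    return f"INSERT INTO {table} (id, vector) VALUES {rows}"

def insert_batched(conn, table: str, embeddings: np.ndarray, batch_size: int = 500):
    """Load embeddings in chunks of batch_size rows per INSERT statement."""
    with conn.cursor() as cur:
        for start in range(0, len(embeddings), batch_size):
            chunk = embeddings[start:start + batch_size]
            ids = range(start + 1, start + 1 + len(chunk))
            cur.execute(build_insert_sql(table, ids, chunk.tolist()))
```

Swapping this in for insert_data() cut my load times dramatically; watch the statement size if you raise batch_size much higher, since each 384-dim vector literal is a few kilobytes of SQL.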
Monitoring Your Setup
-- Monitor table and index statistics
SHOW TABLE STATUS FROM vecdemo;
-- Check index usage patterns
EXPLAIN
SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
id, approx_l2_distance([...], vector) AS d
FROM docs_hnsw ORDER BY d ASC LIMIT 10;
-- Monitor query performance
SELECT
query_id,
query_time,
scan_rows,
return_rows
FROM information_schema.query_log
WHERE query_sql LIKE '%approx_l2_distance%'
ORDER BY query_time DESC
LIMIT 10;
How It Stacks Up Against Dedicated Vector Databases
Where StarRocks Wins
- One database to rule them all - No need to manage yet another system
- Your team already knows SQL - No weird new query languages
- Mix vector search with analytics - Join your embeddings with regular data
- Use what you already have - If you’re running StarRocks, this is free
- Enterprise-ready - HA, backup, monitoring are all there
Where It Falls Short
- Still experimental - Even in 3.5.5, the feature requires enabling an experimental flag
- Not as optimized - Pinecone and friends are built just for this
- Smaller ecosystem - Fewer integrations and tools
- Missing fancy features - No graph search or hybrid indexes yet
What’s Coming Next
Current State with StarRocks 3.5.5
Now that 3.5.5 is released and tested:
- Vector search works great - Both HNSW and IVFPQ perform excellently
- Performance is solid - Fast query responses with good scaling characteristics
- Still behind an experimental flag - You need to enable it explicitly, which suggests it's not quite production-ready
- More improvements coming - Expect the experimental flag to be removed in future versions
My Recommendations
If you’re starting fresh:
- StarRocks 3.5.5 is ready for development and testing
- Compare performance with Pinecone/Weaviate for your specific use case
- Consider a hybrid approach if you need specialized vector features
If you already use StarRocks:
- This is a no-brainer - start using it now
- The performance in 3.5.5 is impressive
- You can gradually migrate from external vector DBs
For production use:
- Test thoroughly with your actual data volumes
- Monitor the experimental flag status in future releases
- Set up proper monitoring and backup strategies
The Bottom Line
After weeks of testing, here’s my take: StarRocks vector search actually works, and it works well.
HNSW delivers fast queries if speed is your priority. IVFPQ trades some speed for significantly lower memory usage - perfect for massive datasets. Both achieve excellent recall rates, though your results will vary based on your specific data and use case.
If you’re already running StarRocks for analytics, adding vector search is a no-brainer. You get to keep using SQL, your ops team doesn’t need to learn a new system, and you can join embeddings with your regular data. That’s huge.
Is it as good as Pinecone or Weaviate for pure vector search? Not quite. But it’s close enough that the convenience might win. And with 3.5.5 now available and performing well, this is becoming a real option for teams that want analytics and vector search in one place.
My advice? If you’re on StarRocks, start using this now. The code I’ve shared will get you started. Test it with your actual data volumes, figure out which index works for you. The performance in 3.5.5 is impressive enough for most use cases.
The fact that a traditional OLAP database can do vector search this well tells me something important: the lines between different database types are blurring. And that’s probably a good thing for those of us who just want to build stuff without managing a zoo of specialized databases.
Resources
- StarRocks Vector Index Docs
- The complete test code is in the article above - just copy and run
- HNSW Paper: Malkov & Yashunin (2018) - The algorithm behind the fast index
- IVFPQ Paper: Jégou et al. (2011) - The memory-efficient approach
Originally tested on StarRocks 3.4.5, updated and verified on StarRocks 3.5.5 in January 2025. Your results will vary based on hardware and data.