When StarRocks introduced vector search, I had to test it myself. Is it a toy feature bolted onto an OLAP database, or the real thing? It turns out it's good enough that you can fold your approximate nearest-neighbor (ANN) workloads into the same cluster that's already powering your analytics dashboards.
In this article, I skip the benchmarking details and focus on what you actually need to get vector search running.
Why Bother?
If your team already speaks SQL and runs StarRocks in production, adding vectors means you get:
- One cluster to operate (frontends/backends already HA? great.)
- Familiar tooling: same MySQL wire protocol, same auth, same monitoring
- Real joins: mix embeddings with transactional/analytical tables without ETL hops
- Optionality: you can still call out to Pinecone/Qdrant later, but you might not need to
Prerequisites
- StarRocks 3.5.5 with 3 FE + 3 BE nodes (vector search is still flagged experimental)
- Ubuntu 24.04, OpenJDK 17, Python 3.12 (for ingestion scripts)
- `uv` for Python dependency management (ships its own virtualenv)
- The cluster already running and reachable at `127.0.0.1:9030`
- To make your life easier, you can use my `docker-starrocks` repo. The following steps will help you set up a StarRocks cluster quickly.
git clone https://github.com/ndemir/docker-starrocks.git
cd docker-starrocks
git checkout 3.5.5
# Start cluster
docker compose up -d
./init-cluster.sh
Step 1: Sanity-Check the Cluster
First things first: make sure every backend/frontend is alive before you start creating vector tables. Trust me on this one:
mysql -h 127.0.0.1 -P 9030 -u root -e "SHOW BACKENDS;"
mysql -h 127.0.0.1 -P 9030 -u root -e "SHOW FRONTENDS;"
I expect to see all nodes with `Alive = true` and running. If anything is missing, fix that first.
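If you script the setup, the same checks can run programmatically. Here is a minimal sketch of my own (the helper names `all_alive` and `check_cluster` are mine, not part of StarRocks), assuming `pymysql` and the `127.0.0.1:9030` endpoint from the prerequisites:

```python
def all_alive(rows, alive_col="Alive"):
    """rows: list of dicts from SHOW BACKENDS / SHOW FRONTENDS.

    Returns True only if there is at least one node and every node
    reports Alive = true (StarRocks returns it as a string).
    """
    return bool(rows) and all(str(r.get(alive_col)).lower() == "true" for r in rows)

def check_cluster(host="127.0.0.1", port=9030, user="root", password=""):
    # Import here so all_alive stays usable without a database driver.
    import pymysql

    conn = pymysql.connect(host=host, port=port, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            for stmt in ("SHOW BACKENDS;", "SHOW FRONTENDS;"):
                cur.execute(stmt)
                if not all_alive(cur.fetchall()):
                    raise RuntimeError(f"unhealthy nodes reported by {stmt}")
    finally:
        conn.close()
```

Call `check_cluster()` at the top of any loading script so it fails before you start creating vector tables.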
Step 2: Flip On the Experimental Vector Flag
Vector indexing is still gated. Enable it once per cluster (it’s dynamic, no restart required):
ADMIN SET FRONTEND CONFIG ("enable_experimental_vector" = "true");
ADMIN SHOW FRONTEND CONFIG LIKE 'enable_experimental_vector';
You want the second statement to return `Value = true` across every FE.
Step 3: Set Up the Schema and Indexes
Create a dedicated database and two tables for testing: one HNSW, one IVFPQ. HNSW builds a multi-level graph connecting vectors to their neighbors for fast traversal, while IVFPQ clusters vectors into groups and compresses them to save memory. Both are supported in StarRocks. Notice how indexes are added after the base table exists.
CREATE DATABASE IF NOT EXISTS vecdemo;
USE vecdemo;
CREATE TABLE docs_hnsw (
id BIGINT NOT NULL,
vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
ALTER TABLE docs_hnsw ADD INDEX ix_hnsw (vector) USING VECTOR (
'index_type' = 'hnsw',
'dim' = '384',
'metric_type' = 'l2_distance',
'is_vector_normed' = 'false',
'M' = '32',
'efconstruction' = '200'
);
CREATE TABLE docs_ivfpq (
id BIGINT NOT NULL,
vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
ALTER TABLE docs_ivfpq ADD INDEX ix_ivfpq (vector) USING VECTOR (
'index_type' = 'ivfpq',
'dim' = '384',
'metric_type' = 'l2_distance',
'is_vector_normed' = 'false',
'nbits' = '8',
'nlist' = '512',
'M_IVFPQ' = '48'
);
Wait for the Index Build
Vector indexes build asynchronously. Poll their status to see the progress:
SHOW ALTER TABLE COLUMN FROM vecdemo;
When `State = FINISHED` for both tables, you're clear to load and query data. You can wire this poller directly into your data-loading script so CI (or your workflow) fails fast if a build stalls.
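Such a poller is a few lines of Python. This is my own sketch: `wait_for_index_build` takes an injected `run_query` callable (e.g. a thin pymysql cursor wrapper returning dict rows), which keeps it testable without a live cluster:

```python
import time

def wait_for_index_build(run_query, db="vecdemo", timeout_s=600, poll_s=5):
    """Poll SHOW ALTER TABLE COLUMN until every job reports FINISHED.

    run_query(sql) must return rows as dicts with a 'State' key.
    Raises on a cancelled build or on timeout so CI can fail fast.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        rows = run_query(f"SHOW ALTER TABLE COLUMN FROM {db};")
        states = [r["State"] for r in rows]
        if any(s == "CANCELLED" for s in states):
            raise RuntimeError(f"index build cancelled: {states}")
        if rows and all(s == "FINISHED" for s in states):
            return states
        if time.monotonic() > deadline:
            raise TimeoutError(f"index build still running: {states}")
        time.sleep(poll_s)
```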
Step 4: Load Vectors
Two options:
- Bulk CSV/Parquet – use `LOAD DATA` once your embeddings are staged somewhere the BEs can reach.
- Python + PyMySQL – convenient for smaller batches or quick demos.
Here's the Python helper I keep around for demos. It assumes you already ran `uv add pymysql sentence-transformers numpy tqdm` and that you're cool with downloading the MiniLM encoder.
#!/usr/bin/env python3
import pymysql
from sentence_transformers import SentenceTransformer

HOST, PORT = "127.0.0.1", 9030

def arr_literal(vec):
    # StarRocks expects vector values as array<float>[...] literals.
    return "array<float>[" + ",".join(f"{x:.7f}" for x in vec) + "]"

conn = pymysql.connect(host=HOST, port=PORT, user="root", password="",
                       database="vecdemo", autocommit=True)

texts = [f"Document {i}: StarRocks vector search tip #{i}" for i in range(1000)]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(texts, convert_to_numpy=True).astype("float32")

with conn.cursor() as cur:
    cur.execute("TRUNCATE TABLE docs_hnsw;")
    cur.execute("TRUNCATE TABLE docs_ivfpq;")
    for idx, vec in enumerate(embeddings, start=1):
        lit = arr_literal(vec.tolist())
        cur.execute(f"INSERT INTO docs_hnsw (id, vector) VALUES (%s, {lit})", (idx,))
        cur.execute(f"INSERT INTO docs_ivfpq (id, vector) VALUES (%s, {lit})", (idx,))

conn.close()
Run it with:
uv run python scripts/load_vectors.py
Name the file whatever makes sense for your setup. The important bit is using `array<float>[...]` literals; the server expects them.
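For larger batches, row-by-row INSERTs get slow; one option is to pack many rows into a single multi-row INSERT built from those same literals. A hypothetical helper (`batch_insert_sql` is my name, not a StarRocks API; watch statement-size limits for very large batches):

```python
def arr_literal(vec):
    # Same literal format the loader script uses.
    return "array<float>[" + ",".join(f"{x:.7f}" for x in vec) + "]"

def batch_insert_sql(table, rows):
    """rows: iterable of (id, vector) pairs.

    Returns one multi-row INSERT statement, e.g.
    INSERT INTO t (id, vector) VALUES (1,array<float>[...]),(2,...);
    """
    values = ",".join(f"({i},{arr_literal(v)})" for i, v in rows)
    return f"INSERT INTO {table} (id, vector) VALUES {values};"
```

Execute the result with the same pymysql cursor as before, chunking the input (say, 500 rows per statement).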
Step 5: Query the Right Way (or Watch It Do a Full Scan)
Here’s the thing: StarRocks only hits the ANN index when three conditions line up perfectly:
- You call an approximate distance function (`approx_l2_distance`, `approx_cosine_similarity`, etc.)
- That function is the sole expression in `ORDER BY`
- You `LIMIT` the result set
-- ✅ Uses the HNSW index
SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
id, approx_l2_distance(array<float>[0.01, 0.02, ...], vector) AS d
FROM docs_hnsw
ORDER BY d ASC
LIMIT 10;
-- ❌ Falls back to a full scan (note the extra ORDER BY term)
SELECT id, approx_l2_distance(array<float>[0.01, 0.02, ...], vector) AS d
FROM docs_hnsw
ORDER BY d ASC, id DESC
LIMIT 10;
Always double-check with `EXPLAIN`. You want to see `VECTORINDEX: ON` under the `OlapScanNode`.
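If you generate these queries from application code, it's easy to drift out of index eligibility. One way to guard against that is a small builder that can only emit queries satisfying all three rules; `ann_query` below is my sketch, not a StarRocks API:

```python
def ann_query(table, query_vec, k=10, efsearch=128, fn="approx_l2_distance"):
    """Build a SELECT that satisfies all three index conditions:
    an approximate distance function, that function as the sole
    ORDER BY expression, and a LIMIT on the result set."""
    lit = "array<float>[" + ",".join(f"{x:.7f}" for x in query_vec) + "]"
    hint = f"/*+ SET_VAR (ann_params='{{efsearch={efsearch}}}') */"
    return (f"SELECT {hint} id, {fn}({lit}, vector) AS d "
            f"FROM {table} ORDER BY d ASC LIMIT {k};")
```

Because the builder owns the `ORDER BY` and `LIMIT` clauses, nobody can sneak in a second sort key and silently fall back to a full scan.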
What About Filters?
Here’s the catch: inline predicates force a full scan today. The workaround? Wrap the vector search in a subquery, then filter:
SELECT *
FROM (
SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
id, approx_l2_distance(array<float>[...] , vector) AS d
FROM docs_hnsw
ORDER BY d ASC
LIMIT 100
) AS hits
WHERE id BETWEEN 100 AND 2000
AND id % 2 = 0
ORDER BY d ASC
LIMIT 10;
This keeps the ANN index engaged for the inner query while still letting you filter metadata afterwards. Not perfect, but it works.
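The same pattern can be templated. A hypothetical `filtered_ann_query` helper (my naming) keeps the ANN part in the inner query and the metadata filter outside, over-fetching candidates so the filter still leaves enough hits:

```python
def filtered_ann_query(table, query_vec, where, k=10, candidates=100, efsearch=128):
    """Inner query stays index-eligible; metadata filter applies afterwards.

    `candidates` controls over-fetch: the inner ANN query returns that many
    rows so the outer WHERE can still yield k results after filtering.
    """
    lit = "array<float>[" + ",".join(f"{x:.7f}" for x in query_vec) + "]"
    inner = (f"SELECT /*+ SET_VAR (ann_params='{{efsearch={efsearch}}}') */ "
             f"id, approx_l2_distance({lit}, vector) AS d "
             f"FROM {table} ORDER BY d ASC LIMIT {candidates}")
    return (f"SELECT * FROM ({inner}) AS hits "
            f"WHERE {where} ORDER BY d ASC LIMIT {k};")
```

The right `candidates` value depends on filter selectivity: the stricter the filter, the more inner rows you need to over-fetch.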
Step 6: Keep an Eye on Things
- Index builds: `SHOW ALTER TABLE COLUMN FROM vecdemo;`
- Table stats: `SHOW TABLE STATUS FROM vecdemo;`
- Cluster health: the same `SHOW BACKENDS;` / `SHOW FRONTENDS;` we started with
- Explain plans: run them often; they're the fastest way to confirm hybrid queries behave
FAQ Grab Bag
Do I still need to normalize vectors? Only if you plan to use cosine similarity and your model doesn't already emit unit vectors (once they're normalized, you can set `'is_vector_normed' = 'true'` on the index). Otherwise StarRocks will happily store the raw floats you insert.
Can I bulk-drop/rebuild an index? Yes: `ALTER TABLE ... DROP INDEX` followed by another `ADD INDEX` works, but remember to poll the build status again.
Can I mix vector and analytical queries? Absolutely. That’s the whole point: join ANN hits with dimension tables, aggregate, window—whatever SQL normally buys you.
Where to Go Next
- Wire the Python loader into your data pipeline (Airflow, Dagster, whatever makes sense)
- Experiment with different `efsearch` / `nprobe` values directly in SQL to balance latency vs. recall
- Layer feature flags around the ANN queries so you can fall back to exact search if needed
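For that latency/recall balancing act, the metric you usually sweep against is recall@k: run the same queries both exactly and approximately, then compare the top-k ID sets. A minimal sketch (it assumes you can produce both ID lists yourself, e.g. exact via `l2_distance` and approximate via `approx_l2_distance`):

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that the approximate top-k recovered.

    approx_ids / exact_ids: result IDs in ranked order; only the first
    k of each are compared, as sets (order within the top-k is ignored).
    """
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```

Sweep `efsearch` (or `nprobe` for IVFPQ) upward until recall@k plateaus, then keep the smallest value that meets your target.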
Once you're comfortable with the operational basics, you can go measure recall/latency against your favorite dedicated vector store. But honestly? This might be all you need to treat StarRocks as a legit vector database alongside your existing analytics workload.