When StarRocks introduced vector search, I had to test it myself. Was it just a toy feature bolted onto an OLAP database, or something more? It turns out the feature is good enough that you can fold your approximate nearest-neighbor (ANN) workloads into the same cluster that's already powering your analytics dashboards.

In this article, I skip the benchmarking details and focus on what you actually need to get vector search running.

Why Bother?

If your team already speaks SQL and runs StarRocks in production, adding vectors means you get:

  • One cluster to operate (frontends/backends already HA? great.)
  • Familiar tooling: same MySQL wire protocol, same auth, same monitoring
  • Real joins: mix embeddings with transactional/analytical tables without ETL hops
  • Optionality: you can still call out to Pinecone/Qdrant later, but you might not need to

Prerequisites

  • StarRocks 3.5.5 with 3 FE + 3 BE nodes (vector search is still flagged experimental)
  • Ubuntu 24.04, OpenJDK 17, Python 3.12 (for ingestion scripts)
  • uv for Python dependency management (ships its own virtualenv)
  • The cluster already running and reachable at 127.0.0.1:9030
  • To make your life easier, you can use my docker-starrocks repo. The following steps spin up a StarRocks cluster quickly.
git clone https://github.com/ndemir/docker-starrocks.git
cd docker-starrocks
git checkout 3.5.5

# Start cluster
docker compose up -d
./init-cluster.sh

Step 1: Sanity-Check the Cluster

First things first: make sure every backend/frontend is alive before you start creating vector tables. Trust me on this one:

mysql -h 127.0.0.1 -P 9030 -u root -e "SHOW BACKENDS;"
mysql -h 127.0.0.1 -P 9030 -u root -e "SHOW FRONTENDS;"

You should see Alive = true for every node. If anything is missing, fix that first.
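
If you prefer to script this check (I run it before every demo), here's a minimal sketch with pymysql, assuming the same root/no-password setup used throughout this article:

import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    for stmt in ("SHOW BACKENDS;", "SHOW FRONTENDS;"):
        cur.execute(stmt)
        for row in cur.fetchall():
            # Both statements expose an Alive column with string "true"/"false"
            assert row["Alive"] == "true", f"dead node in {stmt}: {row}"
print("All FE/BE nodes alive.")
conn.close()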

Step 2: Flip On the Experimental Vector Flag

Vector indexing is still gated behind an experimental flag. It's a dynamic config, so you can enable it without a restart:

ADMIN SET FRONTEND CONFIG ("enable_experimental_vector" = "true");
ADMIN SHOW FRONTEND CONFIG LIKE 'enable_experimental_vector';

You want the second statement to return Value = true on every FE; if one still shows false, run the SET statement against that node directly.
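
If you script cluster bring-up, you can assert the flag from Python too. A sketch that assumes the Value column name, which may shift between versions:

import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("ADMIN SHOW FRONTEND CONFIG LIKE 'enable_experimental_vector';")
    row = cur.fetchone()
    # Fail loudly if the flag is missing or still false
    assert row and row["Value"] == "true", f"vector flag not enabled: {row}"
conn.close()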

Step 3: Set Up the Schema and Indexes

Create a dedicated database and two tables: one indexed with HNSW and one with IVFPQ, so you can compare them while testing. HNSW builds a multi-level graph that links each vector to its neighbors for fast traversal, while IVFPQ clusters vectors into groups and compresses them with product quantization to save memory. Both are supported in StarRocks. Notice that the indexes are added after the base table exists.

CREATE DATABASE IF NOT EXISTS vecdemo;
USE vecdemo;

CREATE TABLE docs_hnsw (
    id BIGINT NOT NULL,
    vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;

ALTER TABLE docs_hnsw ADD INDEX ix_hnsw (vector) USING VECTOR (
    'index_type' = 'hnsw',
    'dim' = '384',
    'metric_type' = 'l2_distance',
    'is_vector_normed' = 'false',
    'M' = '32',
    'efconstruction' = '200'
);

CREATE TABLE docs_ivfpq (
    id BIGINT NOT NULL,
    vector ARRAY<FLOAT> NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;

ALTER TABLE docs_ivfpq ADD INDEX ix_ivfpq (vector) USING VECTOR (
    'index_type' = 'ivfpq',
    'dim' = '384',
    'metric_type' = 'l2_distance',
    'is_vector_normed' = 'false',
    'nbits' = '8',
    'nlist' = '512',
    'M_IVFPQ' = '48'
);

Wait for the Index Build

Vector indexes build asynchronously. Poll their status to track progress:

SHOW ALTER TABLE COLUMN FROM vecdemo;

When State = FINISHED for both tables, you're clear to load and query data. You can wire this poller directly into your data-loading script so CI (or your workflow) can fail fast if a build stalls.
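
Here's a minimal poller along those lines, a sketch assuming the same root/no-password connection as before (the TableName/State column names come from the SHOW ALTER TABLE output; adjust if your version differs):

#!/usr/bin/env python3
import time
import pymysql

def wait_for_index_builds(conn, timeout_s=600, poll_s=5):
    # Block until every ALTER job in vecdemo reports State = FINISHED
    deadline = time.time() + timeout_s
    states = {}
    while time.time() < deadline:
        with conn.cursor() as cur:
            cur.execute("SHOW ALTER TABLE COLUMN FROM vecdemo;")
            states = {row["TableName"]: row["State"] for row in cur.fetchall()}
        if states and all(s == "FINISHED" for s in states.values()):
            return states
        if any(s == "CANCELLED" for s in states.values()):
            raise RuntimeError(f"index build failed: {states}")
        time.sleep(poll_s)
    raise TimeoutError(f"index builds still pending after {timeout_s}s: {states}")

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="",
                       cursorclass=pymysql.cursors.DictCursor)
print(wait_for_index_builds(conn))
conn.close()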

Step 4: Load Vectors

Two options:

  1. Bulk CSV/Parquet – stage your embeddings somewhere the BEs can reach, then run a Broker Load or Stream Load job.
  2. Python + PyMySQL – convenient for smaller batches or quick demos.

Here’s the Python helper I keep around for demos. It assumes you already ran uv add pymysql sentence-transformers numpy tqdm and that you’re cool with downloading the MiniLM encoder.

#!/usr/bin/env python3
import pymysql
from sentence_transformers import SentenceTransformer

HOST, PORT = "127.0.0.1", 9030

def arr_literal(vec):
    # StarRocks expects vector literals shaped like array<float>[...]
    return "array<float>[" + ",".join(f"{x:.7f}" for x in vec) + "]"

conn = pymysql.connect(host=HOST, port=PORT, user="root", password="", database="vecdemo", autocommit=True)

# Embed 1,000 demo strings with MiniLM; 384 dims, matching the index 'dim' above
texts = [f"Document {i}: StarRocks vector search tip #{i}" for i in range(1000)]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(texts, convert_to_numpy=True).astype("float32")

with conn.cursor() as cur:
    # Truncate first so reruns stay idempotent
    cur.execute("TRUNCATE TABLE docs_hnsw;")
    cur.execute("TRUNCATE TABLE docs_ivfpq;")
    for idx, vec in enumerate(embeddings, start=1):
        # The vector goes in as a SQL literal; only the id is a bound parameter
        lit = arr_literal(vec.tolist())
        cur.execute(f"INSERT INTO docs_hnsw (id, vector) VALUES (%s, {lit})", (idx,))
        cur.execute(f"INSERT INTO docs_ivfpq (id, vector) VALUES (%s, {lit})", (idx,))

conn.close()

Run it with:

uv run python scripts/load_vectors.py

Name the file whatever makes sense for your setup. The important bit is using array<float>[...] literals; the server expects them.
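
One caveat about the loader: row-at-a-time INSERTs are slow in StarRocks because every statement becomes its own small load transaction. If the demo drags, batch multiple rows per INSERT. A sketch of the change, reusing arr_literal, conn, and embeddings from the script above (the batch size is arbitrary):

BATCH = 500

def insert_batch(cur, table, rows):
    # rows is a list of (id, vector) pairs; emit one multi-row INSERT per chunk
    values = ",".join(f"({i}, {arr_literal(v)})" for i, v in rows)
    cur.execute(f"INSERT INTO {table} (id, vector) VALUES {values}")

rows = [(idx, vec.tolist()) for idx, vec in enumerate(embeddings, start=1)]
with conn.cursor() as cur:
    for start in range(0, len(rows), BATCH):
        chunk = rows[start:start + BATCH]
        insert_batch(cur, "docs_hnsw", chunk)
        insert_batch(cur, "docs_ivfpq", chunk)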

Step 5: Query the Right Way (or Watch It Do a Full Scan)

Here’s the thing: StarRocks only hits the ANN index when three conditions line up perfectly:

  1. You call an approximate distance function (approx_l2_distance, approx_cosine_similarity, etc.)
  2. That function is the sole expression in ORDER BY
  3. You LIMIT the result set

-- ✅ Uses the HNSW index
SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
       id, approx_l2_distance(array<float>[0.01, 0.02, ...], vector) AS d
FROM docs_hnsw
ORDER BY d ASC
LIMIT 10;

-- ❌ Falls back to a full scan (note the extra ORDER BY term)
SELECT id, approx_l2_distance(array<float>[0.01, 0.02, ...], vector) AS d
FROM docs_hnsw
ORDER BY d ASC, id DESC
LIMIT 10;

Always double-check with EXPLAIN. You want to see VECTORINDEX: ON under the OlapScanNode.
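
You can script that check too. Here's a sketch using pymysql's default tuple cursor; the "VECTORINDEX: ON" marker matches what I see in this version's plans, so treat the exact string as an assumption if you're on something newer:

def uses_vector_index(conn, query: str) -> bool:
    # EXPLAIN returns one plan line per row; join them and grep for the marker
    with conn.cursor() as cur:
        cur.execute("EXPLAIN " + query)
        plan = "\n".join(str(row[0]) for row in cur.fetchall())
    return "VECTORINDEX: ON" in plan

Drop it into a test so a refactored query that silently loses the index fails CI instead of limping along on full scans.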

What About Filters?

Here’s the catch: inline predicates force a full scan today. The workaround? Wrap the vector search in a subquery, then filter:

SELECT *
FROM (
    SELECT /*+ SET_VAR (ann_params='{efsearch=128}') */
           id, approx_l2_distance(array<float>[...] , vector) AS d
    FROM docs_hnsw
    ORDER BY d ASC
    LIMIT 100
) AS hits
WHERE id BETWEEN 100 AND 2000
  AND id % 2 = 0
ORDER BY d ASC
LIMIT 10;

This keeps the ANN index engaged for the inner query while still letting you filter metadata afterwards. Not perfect, but it works.
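
If you write this pattern often, it's worth wrapping. Below is a hypothetical helper (my own naming, not a StarRocks API); note the inner LIMIT over-fetches k * overfetch candidates because the outer WHERE can only throw rows away:

def filtered_ann_search(conn, query_vec, where_clause, k=10, overfetch=10):
    # Inner query: ANN search over-fetching candidates; outer query: metadata filter.
    # where_clause is interpolated as-is, so keep it trusted (this is a sketch).
    lit = "array<float>[" + ",".join(f"{x:.7f}" for x in query_vec) + "]"
    sql = f"""
        SELECT * FROM (
            SELECT /*+ SET_VAR (ann_params='{{efsearch=128}}') */
                   id, approx_l2_distance({lit}, vector) AS d
            FROM docs_hnsw
            ORDER BY d ASC
            LIMIT {k * overfetch}
        ) AS hits
        WHERE {where_clause}
        ORDER BY d ASC
        LIMIT {k}
    """
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()

Size overfetch to your filter's selectivity: a filter that drops 99% of candidates needs a much larger pool than one that drops 10%.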

Step 6: Keep an Eye on Things

  • Index builds: SHOW ALTER TABLE COLUMN FROM vecdemo;
  • Table stats: SHOW TABLE STATUS FROM vecdemo;
  • Cluster health: the same SHOW BACKENDS; / SHOW FRONTENDS; we started with
  • Explain plans: run them often; they’re the fastest way to confirm hybrid queries behave

FAQ Grab Bag

Do I still need to normalize vectors? Only if you plan to use cosine similarity with is_vector_normed = 'true'; otherwise StarRocks will happily store the raw floats you insert.

Can I bulk-drop/rebuild an index? Yes—ALTER TABLE ... DROP INDEX followed by another ADD INDEX works, but remember to poll the build status again.

Can I mix vector and analytical queries? Absolutely. That’s the whole point: join ANN hits with dimension tables, aggregate, window—whatever SQL normally buys you.

Where to Go Next

  • Wire the Python loader into your data pipeline (Airflow, Dagster, whatever makes sense)
  • Experiment with different efsearch / nprobe values directly in SQL to balance latency vs. recall (see the sweep sketch below)
  • Layer feature flags around the ANN queries so you can fall back to exact search if needed
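
For the efsearch experiment, a quick-and-dirty latency sweep might look like this (the random query vector and the parameter grid are placeholders; real recall measurement needs exact ground truth, which I'm skipping here):

import random
import time
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="", database="vecdemo")

# Placeholder query vector; swap in a real embedding for meaningful numbers
vec_literal = "array<float>[" + ",".join(f"{random.random():.7f}" for _ in range(384)) + "]"

def avg_latency(conn, efsearch, runs=5):
    # Average wall-clock time for a top-10 ANN query at the given efsearch
    sql = (f"SELECT /*+ SET_VAR (ann_params='{{efsearch={efsearch}}}') */ "
           f"id, approx_l2_distance({vec_literal}, vector) AS d "
           f"FROM docs_hnsw ORDER BY d ASC LIMIT 10")
    start = time.perf_counter()
    with conn.cursor() as cur:
        for _ in range(runs):
            cur.execute(sql)
            cur.fetchall()
    return (time.perf_counter() - start) / runs

for ef in (16, 32, 64, 128, 256):
    print(f"efsearch={ef}: {avg_latency(conn, ef) * 1000:.1f} ms")

conn.close()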

Once you’re comfortable with the operational basics, you can go measure recall/latency against your favorite dedicated vector store. But honestly? This might be all you need to treat StarRocks as a legit vector database alongside your existing analytics workload.