An Introduction to StarRocks: My Experience Using It (Plus My Docker Setup)

I’ve been working on projects involving StarRocks for the last couple of years. When I first used it, I was genuinely shocked by how lightning-fast the queries were, so I decided to write an article about StarRocks. It will serve as future notes for me too ;) Let’s dive in.

Table of Contents

  1. What is StarRocks?
  2. My Journey with StarRocks
  3. Why Choose StarRocks?
  4. Architecture Deep Dive
  5. Getting Started with StarRocks Using Docker
  6. Your First StarRocks Query
  7. Advanced Features to Explore
  8. Performance Optimization Tips
  9. Monitoring and Operations
  10. Community
  11. Conclusion

1. What is StarRocks?

We can define StarRocks as a high-performance analytical database designed for online analytical processing (OLAP) workloads. It was originally forked from Apache Doris, but it has since evolved into a powerful solution in its own right.

2. My Journey with StarRocks

Before diving into features, let me share the experience that made me a StarRocks advocate. I was called in to help a health insurance analytics team that was drowning in data chaos. They had billions of rows of claims, member, and provider data scattered across different data sources: some in legacy databases, others in CSV files and JSON exports from various systems, each with hundreds of columns of claims metrics, member demographics, and provider performance data. Their analysts were spending more time hunting for data than analyzing it.

My first step was organizing everything into S3 as Parquet files. We then tried the usual suspects: Spark with Parquet on S3, and AWS Athena for ad-hoc querying. While this was better than the previous chaos, Athena queries were still taking more than 5 minutes for the complex analytical reports the insurance team needed. Spark jobs were fast for batch processing but carried too much overhead for the interactive analytics the team desperately needed.

The Aha Moment: That’s when I decided to try ingesting the organized Parquet data into StarRocks. We loaded the same datasets and ran the exact same complex queries that were taking 5+ minutes in Athena. StarRocks executed them in 18-25 seconds. We loved what we saw: the team could finally run interactive analyses on member demographics, claims patterns, provider performance, and fraud detection.

That’s when it hit me: this wasn’t just an incremental improvement. This was a fundamental shift in what was possible with analytical workloads.

Since that successful implementation, I’ve been called in to help several other companies migrate to StarRocks. It has become my go-to recommendation for any organization that needs fast analytics on large datasets and is willing to manage its own analytics database. Unlike ready-to-use hosted solutions, StarRocks clusters are something you install and manage yourself.

2.1. Key Features That Make StarRocks StarRocks

1. Blazing Fast Performance

  • Vectorized query execution engine – I have a separate article about what vectorization is, if you want to dig deeper.
  • Cost-based optimizer (CBO) for intelligent query planning – a CBO is a query planner that selects the most efficient execution plan by analyzing statistics about the data distribution (see the EXPLAIN sketch after this list).
  • Materialized views for pre-computed results
  • Parallel processing across distributed nodes
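
If you want to see the CBO at work, you can ask StarRocks to print the plan it picked. Below is a minimal sketch using the user_events table we will create in Section 6; the COSTS variant, which adds the optimizer's cost estimates, may vary by StarRocks version:

-- Show the physical plan the optimizer chose
EXPLAIN
SELECT event_type, COUNT(*) AS event_count
FROM user_events
GROUP BY event_type;

-- Same query, annotated with the CBO's cost estimates
EXPLAIN COSTS
SELECT event_type, COUNT(*) AS event_count
FROM user_events
GROUP BY event_type;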

2. Real-Time Analytics

  • Sub-second query latency on billions of rows
  • Real-time data ingestion with configurable delivery semantics (Stream Load and Flink connector support exactly-once guarantees, while the Kafka connector provides at-least-once delivery)
  • Support for both batch and streaming data – StarRocks can process large chunks of historical data all at once (batch) as well as analyze data in real time as it arrives (streaming).

3. Flexible Data Models

  • Supports multiple table types: Duplicate, Aggregate, Unique, and Primary Key
  • Native JSON support for semi-structured data, plus ARRAY, STRUCT, and MAP types (see the sketch after this list)
  • Complex data structures for modern analytics workloads
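
Here is a minimal sketch of these types in action. The table and column names are hypothetical; PARSE_JSON and the -> JSON accessor are available in recent StarRocks releases, but double-check your version's documentation:

-- Hypothetical table mixing scalar, ARRAY, and JSON columns
CREATE TABLE event_payloads (
    id BIGINT,
    tags ARRAY<VARCHAR(20)>,
    payload JSON
)
DISTRIBUTED BY HASH(id) BUCKETS 8;

-- PARSE_JSON turns a JSON string into a JSON value
INSERT INTO event_payloads VALUES
(1, ['mobile', 'ios'], PARSE_JSON('{"plan": "premium", "score": 42}'));

-- Arrays are 1-indexed; -> extracts a field from the JSON column
SELECT id, tags[1] AS first_tag, payload->'plan' AS plan
FROM event_payloads;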

4. Easy to Use

  • MySQL protocol compatibility – you can use your existing MySQL client.
  • Standard SQL support

3. Why Choose StarRocks?

3.1. Performance That Matters

CelerData (the company behind StarRocks) has published benchmarks showing StarRocks delivering 2.2× the performance of ClickHouse, 5.5× that of Trino, and 8.9× that of Apache Druid (benchmark details). That roughly matches what I have experienced in my day-to-day work.

3.2. Cost-Effective Scaling

Because StarRocks gets more out of each node, you can run it on cheaper hardware or smaller VM configurations, which makes it a cost-effective solution. Its intelligent data distribution and compression algorithms also significantly reduce storage requirements.

3.3. Real-World Use Cases

The following are typical use cases for OLAP databases that also apply to StarRocks:

  • E-commerce Analytics: Track user behavior, analyze purchase patterns, and generate real-time recommendations.
  • Financial Services: Monitor transactions, detect fraud, and generate compliance reports in real time.
  • IoT and Telemetry: Process millions of sensor readings per second while maintaining sub-second query performance.
  • Business Intelligence: Power interactive dashboards with complex aggregations across massive datasets.

4. Architecture Deep Dive

StarRocks follows a massively parallel processing (MPP) architecture with two main components:

4.1. Frontend (FE) Nodes

  • Handle query planning and optimization
  • Manage metadata and cluster coordination
  • Provide SQL interfaces for client connections

4.2. Backend (BE) Nodes

  • Store and manage data
  • Execute distributed queries
  • Handle data ingestion and compaction

This separation of concerns allows StarRocks to scale compute and storage independently.
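
In practice, scaling means registering new nodes with the cluster over SQL. A minimal sketch with hypothetical hostnames (9050 and 9010 are the default BE heartbeat and FE edit-log ports):

-- Add a BE node to scale storage and query execution
ALTER SYSTEM ADD BACKEND "be4.internal:9050";

-- Add an FE follower to strengthen metadata high availability
ALTER SYSTEM ADD FOLLOWER "fe4.internal:9010";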

5. Getting Started with StarRocks Using Docker

Setting up a production-ready StarRocks cluster can be complex. StarRocks is not a hosted solution that you can use directly, the way you would use BigQuery or Snowflake; you need to set it up yourself.

I’ve created a Docker-based solution that makes it easy to get started. This setup provides a complete 6-node cluster (3 FE + 3 BE nodes) that mirrors a production environment.

5.1. Quick Setup Guide

Prerequisites:

  • Docker and Docker Compose installed

Clone the Repository:

git clone https://github.com/ndemir/docker-starrocks.git
cd docker-starrocks

Launch the Cluster:

docker compose down -v  # Clean any existing data
docker compose up -d    # Start the cluster
./init-cluster.sh      # Initialize cluster configuration

Within a couple of minutes, you’ll have a fully functional StarRocks cluster ready for experimentation! I am planning to write another article explaining how this setup works under the hood.

Connect to Your Cluster:

# MySQL client connection
mysql -h 127.0.0.1 -P 9030 -u root

# Web UI
open http://localhost:8030

5.2. Repository Features

The Docker setup includes:

  • High Availability: 3 FE nodes with automatic failover
  • Distributed Storage: 3 BE nodes for data redundancy
  • Persistent Volumes: Data survives container restarts
  • Health Checks: Automatic monitoring and recovery
  • Easy Management: Simple commands for common operations

Find the complete setup with detailed documentation at: https://github.com/ndemir/docker-starrocks

6. Your First StarRocks Query

Let’s create a simple analytics table and run some queries:

-- Create a database
CREATE DATABASE demo;
USE demo;

-- Create a user behavior table
CREATE TABLE user_events (
    event_time DATETIME,
    user_id BIGINT,
    event_type VARCHAR(50),
    product_id BIGINT,
    revenue DECIMAL(10, 2)
) 
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES (
    "replication_num" = "3"
);

-- Insert sample data
INSERT INTO user_events VALUES
('2025-01-22 10:00:00', 1001, 'view', 5001, 0),
('2025-01-22 10:01:00', 1001, 'add_to_cart', 5001, 0),
('2025-01-22 10:05:00', 1001, 'purchase', 5001, 29.99),
('2025-01-22 10:10:00', 1002, 'view', 5002, 0),
('2025-01-22 10:12:00', 1002, 'purchase', 5002, 49.99);

-- Real-time analytics query
SELECT 
    event_type,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users,
    SUM(revenue) as total_revenue
FROM user_events
WHERE event_time >= '2025-01-22'
GROUP BY event_type
ORDER BY total_revenue DESC;

7. Advanced Features to Explore

7.1. Materialized Views for Acceleration

CREATE MATERIALIZED VIEW daily_revenue_mv
DISTRIBUTED BY HASH(date) BUCKETS 8
REFRESH ASYNC  -- refresh automatically when the base table changes
AS
SELECT 
    DATE(event_time) as date,
    SUM(revenue) as daily_revenue,
    COUNT(DISTINCT user_id) as daily_active_users
FROM user_events
GROUP BY DATE(event_time);
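
Once the view is populated, StarRocks can transparently rewrite matching aggregation queries on the base table to read from the materialized view instead. You can verify the rewrite with EXPLAIN; if it applied, the plan scans daily_revenue_mv rather than user_events:

-- If the rewrite applies, the plan below scans daily_revenue_mv
EXPLAIN
SELECT DATE(event_time) AS date, SUM(revenue) AS daily_revenue
FROM user_events
GROUP BY DATE(event_time);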

7.2. Real-Time Data Ingestion

StarRocks supports multiple ingestion methods:

  • Stream Load: HTTP-based streaming for real-time data
  • Broker Load: Large-scale batch loading from HDFS/S3
  • Routine Load: Continuous ingestion from Kafka (sketched below)
  • Spark/Flink Connectors: Integration with big data ecosystems
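
As a taste of Routine Load, here is a minimal sketch of a continuous ingestion job that feeds the user_events table from Kafka. The broker address and topic name are placeholders, and only a couple of the many job properties are shown:

-- Continuously consume JSON messages from a Kafka topic into user_events
CREATE ROUTINE LOAD demo.user_events_kafka ON user_events
COLUMNS (event_time, user_id, event_type, product_id, revenue)
PROPERTIES (
    "format" = "json",
    "desired_concurrent_number" = "3"
)
FROM KAFKA (
    "kafka_broker_list" = "kafka1:9092",
    "kafka_topic" = "user_events"
);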

7.3. Window Functions for Advanced Analytics

SELECT 
    user_id,
    event_time,
    revenue,
    SUM(revenue) OVER (
        PARTITION BY user_id 
        ORDER BY event_time 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_spend
FROM user_events
WHERE revenue > 0
ORDER BY user_id, event_time;

8. Performance Optimization Tips

  1. Choose the Right Table Type

    • Use Duplicate tables for detailed fact data
    • Use Aggregate tables for pre-aggregated metrics
    • Use Primary Key tables for dimension data
  2. Optimize Data Distribution

    • Select high-cardinality columns for distribution
    • Aim for even data distribution across buckets
    • Monitor skew with system tables
  3. Leverage Indexes

    • Bitmap indexes for low-cardinality columns
    • Bloom filter indexes for string matching
    • Zone maps for range queries
  4. Query Optimization

    • Use partition pruning for time-series data (a combined sketch follows this list)
    • Push filters down to storage layer
    • Utilize materialized views for repeated aggregations
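
To make tips 2-4 concrete, here is a sketch of a fact table that combines range partitioning (for pruning), hash distribution on a high-cardinality column, a bloom filter index, and a bitmap index. The table and column names are hypothetical:

-- Partitioned fact table, distributed on a high-cardinality column
CREATE TABLE claims (
    claim_date DATE,
    claim_id BIGINT,
    member_id BIGINT,
    state CHAR(2),
    amount DECIMAL(12, 2)
)
DUPLICATE KEY(claim_date, claim_id)
PARTITION BY RANGE (claim_date) (
    PARTITION p202501 VALUES LESS THAN ("2025-02-01"),
    PARTITION p202502 VALUES LESS THAN ("2025-03-01")
)
DISTRIBUTED BY HASH(member_id) BUCKETS 16
PROPERTIES (
    "bloom_filter_columns" = "member_id"
);

-- Bitmap index on a low-cardinality column
CREATE INDEX idx_state ON claims (state) USING BITMAP;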

9. Monitoring and Operations

9.1. Key Metrics to Track

-- Cluster health
SHOW FRONTENDS;
SHOW BACKENDS;

-- Query performance
SHOW PROCESSLIST;

-- Storage usage
SHOW DATA;
SELECT * FROM information_schema.tables_config;
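
Tip 2 in the previous section mentioned monitoring skew through system tables; one way to check it for the demo.user_events table from Section 6 is to look at how replicas and tablets are spread across BE nodes:

-- Replica distribution (skew) across backends
ADMIN SHOW REPLICA DISTRIBUTION FROM demo.user_events;

-- Tablet-level detail for deeper digging
SHOW TABLET FROM demo.user_events;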

9.2. Backup and Recovery

StarRocks provides built-in backup capabilities:

-- First, create a repository
CREATE REPOSITORY s3_repo
WITH BROKER
ON LOCATION "s3a://your-bucket/backups/"
PROPERTIES (
    "aws.s3.endpoint" = "s3.amazonaws.com",
    "aws.s3.region" = "us-east-1",
    "aws.s3.access_key" = "your_access_key",
    "aws.s3.secret_key" = "your_secret_key"
);

-- Create a backup (v3.4.0+ syntax)
BACKUP DATABASE demo SNAPSHOT snapshot_20250122
TO s3_repo
ON (TABLE user_events);

-- View available snapshots to get timestamp
SHOW SNAPSHOT ON s3_repo;

-- Restore from backup (requires backup_timestamp)
RESTORE SNAPSHOT snapshot_20250122
FROM s3_repo
DATABASE demo
ON (TABLE user_events)
PROPERTIES (
    "backup_timestamp" = "2025-01-22-10-00-00-123"
);

10. Community

StarRocks has an active open-source community:

  • GitHub: Active development with frequent releases. This is not a project whose last commit dates from years ago (at least, that is the case as of July 2025).
  • Slack Channel: Real-time support from community members. I am there too ;)

11. Conclusion

I think it would be fair to say StarRocks is a leap forward in analytics database technology. Its combination of exceptional performance, ease of use, and cost-effectiveness makes it an ideal choice for organizations looking to modernize their analytics infrastructure.

Ready to experience StarRocks yourself? Check out my Docker-based setup on GitHub to get a production-ready cluster running in minutes: https://github.com/ndemir/docker-starrocks