An Introduction to StarRocks: My Experience Using It (Plus My Docker Setup)
I've been working on projects involving StarRocks for the last couple of years. When I first used it, I was shocked by how lightning-fast the queries were, so I decided to write an article about it. This article will also serve as future notes for me ;) Let's dive in.
Table of Contents
- 1. What is StarRocks?
- 2. My Journey with StarRocks
  - 2.1. Key Features That Make StarRocks StarRocks
- 3. Why Choose StarRocks?
  - 3.1. Performance That Matters
  - 3.2. Cost-Effective Scaling
  - 3.3. Real-World Use Cases
- 4. Architecture Deep Dive
  - 4.1. Frontend (FE) Nodes
  - 4.2. Backend (BE) Nodes
- 5. Getting Started with StarRocks Using Docker
  - 5.1. Quick Setup Guide
  - 5.2. Repository Features
- 6. Your First StarRocks Query
- 7. Advanced Features to Explore
  - 7.1. Materialized Views for Acceleration
  - 7.2. Real-Time Data Ingestion
  - 7.3. Window Functions for Advanced Analytics
- 8. Performance Optimization Tips
- 9. Monitoring and Operations
  - 9.1. Key Metrics to Track
  - 9.2. Backup and Recovery
- 10. Community
- 11. Conclusion
1. What is StarRocks?
We can define StarRocks as a high-performance analytical database designed for online analytical processing (OLAP) workloads. It started as a fork of Apache Doris, but it has since evolved into a powerful solution in its own right.
2. My Journey with StarRocks
Before diving into features, let me share the experience that made me a StarRocks advocate. I was called in to help a health insurance analytics team that was drowning in data chaos. They had billions of rows of claims, member, and provider data scattered across different data sources - some in legacy databases, others in CSV files, JSON exports from various systems, each with hundreds of columns of claims metrics, member demographics, and provider performance data. Their analysts were spending more time hunting for data than analyzing it.
My first step was organizing everything into S3 as Parquet files. We then tried the usual suspects: Spark with Parquet on S3, and AWS Athena for ad-hoc querying. While this was better than their previous chaos, Athena queries were still taking more than 5 minutes for the complex analytical reports the insurance team needed. Spark jobs were fast for batch processing but required too much overhead for the interactive analytics they desperately needed.
The Aha Moment: That’s when I decided to try ingesting the organized Parquet data into StarRocks. We loaded the same datasets and ran the exact same complex queries that were taking 5+ minutes in Athena. StarRocks executed them in 18-25 seconds. We loved what we saw: the team could finally run interactive analyses on member demographics, claims patterns, provider performance, and fraud detection.
That’s when it hit me: this wasn’t just an incremental improvement. This was a fundamental shift in what was possible with analytical workloads.
Since that successful implementation, I’ve been called in to help several other companies migrate to StarRocks. It’s become my go-to recommendation for any organization that needs fast analytics on large datasets and is willing to manage its own analytics database. Unlike ready-to-use hosted solutions, StarRocks requires you to install and manage the clusters yourself.
2.1. Key Features That Make StarRocks StarRocks
1. Blazing Fast Performance
- Vectorized query execution engine – I have a separate article explaining what vectorization is, if you want to dig deeper.
- Cost-based optimizer (CBO) for intelligent query planning – a CBO is a query planner that selects the most efficient execution plan by analyzing statistics about the data distribution.
- Materialized views for pre-computed results
- Parallel processing across distributed nodes
2. Real-Time Analytics
- Sub-second query latency on billions of rows
- Real-time data ingestion with configurable delivery semantics (Stream Load and Flink connector support exactly-once guarantees, while the Kafka connector provides at-least-once delivery)
- Support for both batch and streaming data – StarRocks can process large chunks of historical data all at once (batch) as well as analyze data in real-time as it arrives (streaming).
3. Flexible Data Models
- Supports multiple table types: Duplicate, Aggregate, Unique, and Primary Key
- Native JSON support for semi-structured data, plus ARRAY, STRUCT, and MAP types for the complex structures modern analytics workloads need (see the sketch after this feature list)
4. Easy to Use
- MySQL protocol compatibility – you can use your existing MySQL client.
- Standard SQL support
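To make the flexible data models concrete, here is a minimal sketch of a table mixing scalar, ARRAY, and JSON columns (the table name, columns, and sample values are my own invention for illustration):

-- A Duplicate Key table mixing scalar, ARRAY, and JSON columns
CREATE TABLE user_events_flex (
    event_time DATETIME,
    user_id BIGINT,
    tags ARRAY<VARCHAR(50)>,
    attributes JSON
)
DUPLICATE KEY(event_time)
DISTRIBUTED BY HASH(user_id) BUCKETS 8;

-- ARRAY literals use square brackets; JSON values are built with PARSE_JSON
INSERT INTO user_events_flex VALUES
('2025-01-22 10:00:00', 1001, ['mobile', 'ios'],
 PARSE_JSON('{"device": "iphone", "app_version": "2.1"}'));

-- Array elements are 1-indexed; json_query extracts values by JSON path
SELECT
    user_id,
    tags[1] AS first_tag,
    json_query(attributes, '$.device') AS device
FROM user_events_flex;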
3. Why Choose StarRocks?
3.1. Performance That Matters
CelerData (the company behind StarRocks) has published benchmarks showing StarRocks delivering 2.2× the performance of ClickHouse, 5.5× of Trino, and 8.9× of Apache Druid (benchmark details). That also matches what I have experienced in my day-to-day work.
3.2. Cost-Effective Scaling
Because StarRocks performs so well, you can run it on cheaper hardware/VM configurations, which makes it a cost-effective solution. Its intelligent data distribution and compression algorithms also significantly reduce storage requirements.
3.3. Real-World Use Cases
The following are typical OLAP use cases where StarRocks shines:
- E-commerce Analytics: Track user behavior, analyze purchase patterns, and generate real-time recommendations.
- Financial Services: Monitor transactions, detect fraud, and generate compliance reports in real-time.
- IoT and Telemetry: Process millions of sensor readings per second while maintaining sub-second query performance.
- Business Intelligence: Power interactive dashboards with complex aggregations across massive datasets.
4. Architecture Deep Dive
StarRocks follows a massively parallel processing (MPP) architecture with two main components:
4.1. Frontend (FE) Nodes
- Handle query planning and optimization
- Manage metadata and cluster coordination
- Provide SQL interfaces for client connections
4.2. Backend (BE) Nodes
- Store and manage data
- Execute distributed queries
- Handle data ingestion and compaction
This separation of concerns allows StarRocks to scale compute and storage independently.
5. Getting Started with StarRocks Using Docker
Setting up a production-ready StarRocks cluster can be complex. StarRocks is not a hosted solution that you can use directly, the way you would use BigQuery or Snowflake; you need to set it up yourself.
I’ve created a Docker-based solution that makes it easy to get started. This setup provides a complete 6-node cluster (3 FE + 3 BE nodes) that mirrors a production environment.
5.1. Quick Setup Guide
Prerequisites:
- Docker and Docker Compose installed
Clone the Repository:
git clone https://github.com/ndemir/docker-starrocks.git
cd docker-starrocks
Launch the Cluster:
docker compose down -v # Clean any existing data
docker compose up -d # Start the cluster
./init-cluster.sh # Initialize cluster configuration
Within a couple of minutes, you’ll have a fully functional StarRocks cluster ready for experimentation! I am planning to write a follow-up article explaining how this setup works under the hood.
Connect to Your Cluster:
# MySQL client connection
mysql -h 127.0.0.1 -P 9030 -u root
# Web UI
open http://localhost:8030
5.2. Repository Features
The Docker setup includes:
- High Availability: 3 FE nodes with automatic failover
- Distributed Storage: 3 BE nodes for data redundancy
- Persistent Volumes: Data survives container restarts
- Health Checks: Automatic monitoring and recovery
- Easy Management: Simple commands for common operations
Find the complete setup with detailed documentation at: https://github.com/ndemir/docker-starrocks
6. Your First StarRocks Query
Let’s create a simple analytics table and run some queries:
-- Create a database
CREATE DATABASE demo;
USE demo;
-- Create a user behavior table
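-- (No table type is specified below, so StarRocks creates a Duplicate Key
-- table by default, using the leading columns as the sort key)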
CREATE TABLE user_events (
event_time DATETIME,
user_id BIGINT,
event_type VARCHAR(50),
product_id BIGINT,
revenue DECIMAL(10, 2)
)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES (
"replication_num" = "3"
);
-- Insert sample data
INSERT INTO user_events VALUES
('2025-01-22 10:00:00', 1001, 'view', 5001, 0),
('2025-01-22 10:01:00', 1001, 'add_to_cart', 5001, 0),
('2025-01-22 10:05:00', 1001, 'purchase', 5001, 29.99),
('2025-01-22 10:10:00', 1002, 'view', 5002, 0),
('2025-01-22 10:12:00', 1002, 'purchase', 5002, 49.99);
-- Real-time analytics query
SELECT
event_type,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users,
SUM(revenue) as total_revenue
FROM user_events
WHERE event_time >= '2025-01-22'
GROUP BY event_type
ORDER BY total_revenue DESC;
7. Advanced Features to Explore
7.1. Materialized Views for Acceleration
-- Asynchronous materialized view, refreshed automatically when the base table changes
CREATE MATERIALIZED VIEW daily_revenue_mv
DISTRIBUTED BY HASH(event_date) BUCKETS 8
REFRESH ASYNC
AS
SELECT
    DATE(event_time) AS event_date,
    SUM(revenue) AS daily_revenue,
    COUNT(DISTINCT user_id) AS daily_active_users
FROM user_events
GROUP BY DATE(event_time);
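Once created, you can query the view like a regular table, and recent StarRocks versions can transparently rewrite matching aggregate queries on the base table to read from it:

-- Query the materialized view directly
SELECT event_date, daily_revenue, daily_active_users
FROM daily_revenue_mv
ORDER BY event_date;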
7.2. Real-Time Data Ingestion
StarRocks supports multiple ingestion methods:
- Stream Load: HTTP-based streaming for real-time data
- Broker Load: Large-scale batch loading from HDFS/S3
- Routine Load: Continuous ingestion from Kafka
- Spark/Flink Connectors: Integration with big data ecosystems
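As a sketch of what continuous ingestion looks like, here is a Routine Load job that tails a hypothetical Kafka topic into the user_events table from Section 6 (the broker address, topic name, and CSV layout are assumptions for illustration):

-- Continuously consume CSV messages from Kafka into user_events
CREATE ROUTINE LOAD demo.load_user_events ON user_events
COLUMNS TERMINATED BY ",",
COLUMNS (event_time, user_id, event_type, product_id, revenue)
PROPERTIES (
    "format" = "csv",
    "desired_concurrent_number" = "1"
)
FROM KAFKA (
    "kafka_broker_list" = "kafka-broker:9092", -- hypothetical broker
    "kafka_topic" = "user_events_topic",       -- hypothetical topic
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);

You can then watch the job’s state and lag with SHOW ROUTINE LOAD.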
7.3. Window Functions for Advanced Analytics
SELECT
user_id,
event_time,
revenue,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_spend
FROM user_events
WHERE revenue > 0
ORDER BY user_id, event_time;
8. Performance Optimization Tips
Choose the Right Table Type
- Use Duplicate tables for detailed fact data
- Use Aggregate tables for pre-aggregated metrics
- Use Primary Key tables for dimension data
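A minimal sketch of the three declarations (table and column names invented for illustration):

-- Duplicate Key table: keeps every row; good for raw fact/event data
CREATE TABLE fact_events (
    event_time DATETIME,
    user_id BIGINT,
    event_type VARCHAR(50)
)
DUPLICATE KEY(event_time)
DISTRIBUTED BY HASH(user_id) BUCKETS 8;

-- Aggregate table: rows with the same key are merged using the
-- aggregation function declared on each value column
CREATE TABLE daily_metrics (
    dt DATE,
    event_type VARCHAR(50),
    pv BIGINT SUM DEFAULT "0",
    revenue DECIMAL(10, 2) SUM DEFAULT "0"
)
AGGREGATE KEY(dt, event_type)
DISTRIBUTED BY HASH(dt) BUCKETS 8;

-- Primary Key table: enforces uniqueness and supports real-time updates;
-- a good fit for dimension data
CREATE TABLE dim_users (
    user_id BIGINT,
    name VARCHAR(100),
    segment VARCHAR(50)
)
PRIMARY KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8;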
Optimize Data Distribution
- Select high-cardinality columns for distribution
- Aim for even data distribution across buckets
- Monitor skew with system tables
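One quick way to eyeball skew (a sketch against the user_events table from Section 6) is to compare per-tablet data sizes:

-- Each output row is one tablet; strongly uneven DataSize values indicate skew
SHOW TABLET FROM demo.user_events;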
Leverage Indexes
- Bitmap indexes for low-cardinality columns
- Bloom filter indexes for string matching
- Zone maps for range queries
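As a sketch against the user_events table from Section 6: a bitmap index fits the low-cardinality event_type column, while bloom filters are enabled per column through a table property:

-- Bitmap index on a low-cardinality column
CREATE INDEX idx_event_type ON user_events (event_type) USING BITMAP;

-- Bloom filter index for high-selectivity lookups (for example, by product_id)
ALTER TABLE user_events SET ("bloom_filter_columns" = "product_id");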
Query Optimization
- Use partition pruning for time-series data
- Push filters down to storage layer
- Utilize materialized views for repeated aggregations
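For partition pruning on time-series data, partition the table by its time column. Here is a sketch using expression partitioning (available in recent StarRocks versions; the table name is invented for illustration):

-- Daily partitions derived from event_time
CREATE TABLE user_events_by_day (
    event_time DATETIME,
    user_id BIGINT,
    event_type VARCHAR(50),
    revenue DECIMAL(10, 2)
)
DUPLICATE KEY(event_time)
PARTITION BY date_trunc('day', event_time)
DISTRIBUTED BY HASH(user_id) BUCKETS 8;

-- A filter on event_time lets the optimizer skip untouched partitions;
-- EXPLAIN shows how many partitions the scan actually reads
EXPLAIN
SELECT SUM(revenue)
FROM user_events_by_day
WHERE event_time >= '2025-01-22' AND event_time < '2025-01-23';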
9. Monitoring and Operations
9.1. Key Metrics to Track
-- Cluster health
SHOW FRONTENDS;
SHOW BACKENDS;
-- Query performance
SHOW PROCESSLIST;
-- Storage usage
SHOW DATA;
SELECT * FROM information_schema.tables_config;
9.2. Backup and Recovery
StarRocks provides built-in backup capabilities:
-- First, create a repository
CREATE REPOSITORY s3_repo
WITH BROKER
ON LOCATION "s3a://your-bucket/backups/"
PROPERTIES (
"aws.s3.endpoint" = "s3.amazonaws.com",
"aws.s3.region" = "us-east-1",
"aws.s3.access_key" = "your_access_key",
"aws.s3.secret_key" = "your_secret_key"
);
-- Create a backup (v3.4.0+ syntax)
BACKUP DATABASE demo SNAPSHOT snapshot_20250122
TO s3_repo
ON (TABLE user_events);
-- View available snapshots to get timestamp
SHOW SNAPSHOT ON s3_repo;
-- Restore from backup (requires backup_timestamp)
RESTORE SNAPSHOT snapshot_20250122
FROM s3_repo
DATABASE demo
ON (TABLE user_events)
PROPERTIES (
"backup_timestamp" = "2025-01-22-10-00-00-123"
);
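Both backup and restore run as asynchronous jobs; you can check their progress with:

-- Show the status of backup/restore jobs in the current database
SHOW BACKUP;
SHOW RESTORE;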
10. Community
StarRocks has an active open-source community:
- GitHub: Active development with frequent releases. This is not a project whose last commit dates from years ago (at least, that was still the case as of July 2025).
- Slack Channel: Real-time support from community members. I am there too ;)
11. Conclusion
I think it would be fair to say StarRocks is a leap forward in analytics database technology. Its combination of exceptional performance, ease of use, and cost-effectiveness makes it an ideal choice for organizations looking to modernize their analytics infrastructure.
Ready to experience StarRocks yourself? Check out my Docker-based setup on GitHub to get a production-ready cluster running in minutes: https://github.com/ndemir/docker-starrocks