Key Concepts in Elasticsearch
Understanding these fundamental concepts is essential for working effectively with Elasticsearch. Let's explore each concept with examples to make them clear for beginners.
Cluster
A cluster is a collection of one or more nodes (servers) that work together to store your data and provide search capabilities across all nodes.
Key Points:
- Each cluster has a unique name (default: "elasticsearch")
- Nodes join a cluster by using the cluster name
- A cluster can have just one node (single-node cluster)
- Balances shards across nodes automatically and reallocates them when a node fails
Example:
// Check cluster health
GET /_cluster/health
// Response
{
"cluster_name": "my-application",
"status": "green",
"number_of_nodes": 3,
"number_of_data_nodes": 3
}
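The status field summarizes shard allocation: green means every primary and replica shard is assigned, yellow means every primary is assigned but at least one replica is not, and red means at least one primary is unassigned. The rule can be sketched in Python as follows. The function name and its two counters are hypothetical, and this is a toy model of the rule, not how Elasticsearch computes health internally.

```python
def health_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Toy model of the cluster health colors (hypothetical helper)."""
    if unassigned_primaries > 0:
        return "red"      # some data is currently unavailable
    if unassigned_replicas > 0:
        return "yellow"   # all data is available, but redundancy is reduced
    return "green"        # every primary and replica shard is assigned


print(health_status(0, 0))
print(health_status(0, 3))
print(health_status(1, 0))
```

This is also why a single-node cluster with one replica configured typically reports yellow: a replica is never placed on the same node as its primary, so it stays unassigned.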
Node
A node is a single server that is part of your cluster, stores data, and participates in the cluster's indexing and search capabilities.
Types of Nodes:
- Master Node: Manages cluster-wide operations
- Data Node: Stores data and executes data-related operations
- Ingest Node: Preprocesses documents before indexing
- Coordinating Node: Routes requests and aggregates results
Example:
// Check node information
GET /_nodes
// Each node has:
{
"name": "node-1",
"transport_address": "127.0.0.1:9300",
"host": "127.0.0.1",
"roles": ["master", "data", "ingest"]
}
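As a rough illustration of how the roles list maps to the node types above, the sketch below labels a node from its roles. The helper name `describe_node` is hypothetical. It assumes that a node with an empty roles list acts as a coordinating-only node, and that tiered data roles such as `data_hot` count as data roles.

```python
def describe_node(roles):
    """Label a node from its roles list (hypothetical helper, simplified)."""
    role_set = set(roles)
    if not role_set:
        # No roles configured: the node only routes requests and merges results.
        return ["coordinating-only"]
    labels = []
    if "master" in role_set:
        labels.append("master-eligible")
    if any(r == "data" or r.startswith("data_") for r in role_set):
        labels.append("data")
    if "ingest" in role_set:
        labels.append("ingest")
    return labels


print(describe_node(["master", "data", "ingest"]))
print(describe_node([]))
```

Every node can also coordinate requests it receives, so "coordinating-only" describes a node that does nothing else.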
Index
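An index is a collection of documents that have similar characteristics. It is loosely comparable to a table in a relational database (older material compares it to a database, from the time when one index could hold several mapping types).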
Key Points:
- Index names must be lowercase
- Older versions allowed multiple document types per index; types were deprecated in 7.x and removed in 8.0, so an index now holds a single kind of document
- Each index has its own settings and mappings
Example:
// Create an index
PUT /products
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
// Index structure example:
// products (index)
// ├── laptop (document)
// ├── phone (document)
// └── tablet (document)
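Lowercase is only one of the naming rules. The sketch below checks a candidate name against the restrictions commonly listed for index names; it is a partial, client-side approximation with a hypothetical function name, and the server remains the authority on what it accepts.

```python
# Characters that index names may not contain (approximate list).
FORBIDDEN_CHARS = set('\\/*?"<>| ,#:')


def is_valid_index_name(name: str) -> bool:
    """Approximate check of index naming rules (hypothetical helper)."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():
        return False                      # must be lowercase
    if name[0] in "-_+":
        return False                      # must not start with -, _ or +
    if any(ch in FORBIDDEN_CHARS for ch in name):
        return False
    return len(name.encode("utf-8")) <= 255   # length limit is in bytes


for candidate in ["products", "Products", "_logs", "my index", "logs-2024.01"]:
    print(candidate, is_valid_index_name(candidate))
```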
Document
A document is a basic unit of information that can be indexed. It's expressed in JSON format and stored within an index.
Key Points:
- Similar to a row in a relational database
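- Each document has an ID that is unique within its index (assigned by you or generated automatically)
- Documents are immutable: an update writes a new version of the whole document and marks the old one as deleted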
- Can contain nested objects and arrays
Example:
// A document in the products index
{
"_index": "products",
"_id": "1",
"_source": {
"name": "iPhone 13",
"category": "smartphones",
"price": 999,
"features": ["5G", "A15 Bionic", "Dual Camera"],
"manufacturer": {
"name": "Apple",
"country": "USA"
}
}
}
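To make the "updates create new versions" point concrete, here is a toy in-memory model. `ToyIndex` is a hypothetical class for illustration only; it assumes that each write to the same ID replaces the stored document and increments a version counter, in the spirit of the `_version` field.

```python
class ToyIndex:
    """In-memory stand-in for an index (illustration only)."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, source):
        """Store a whole new copy of the document and bump its version."""
        previous = self._docs.get(doc_id)
        version = 1 if previous is None else previous["_version"] + 1
        self._docs[doc_id] = {
            "_id": doc_id,
            "_version": version,
            "_source": dict(source),  # full replacement, not an in-place edit
        }
        return version

    def get(self, doc_id):
        return self._docs[doc_id]


products = ToyIndex()
products.index("1", {"name": "iPhone 13", "price": 999})
products.index("1", {"name": "iPhone 13", "price": 899})
print(products.get("1"))
```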
Shard
A shard is a single Lucene instance and a fundamental unit of storage in Elasticsearch. Each index is divided into shards for scalability.
Types of Shards:
- Primary Shards: Original shards that hold the data
- Replica Shards: Copies of primary shards for redundancy
Key Points:
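- The number of primary shards is set at index creation; changing it later means splitting, shrinking, or reindexing into a new index
- Each shard is a fully functional Lucene index
- Shards allow horizontal scaling, because they can be spread across nodes
- Default: 1 primary shard per index (since version 7.0; earlier versions defaulted to 5)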
Example:
// Creating an index with custom shard settings:
// 5 primary shards, 1 replica per primary shard
PUT /logs
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1
}
}
// Total shards = 5 primary + 5 replica = 10 shards
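The reason the primary shard count cannot simply be edited afterwards is routing: a document's shard is derived from a hash of its routing value (by default its ID) and the number of primary shards. The sketch below uses a simplified form of that rule. `pick_shard` is a hypothetical name, and CRC32 is only a stand-in; Elasticsearch uses a Murmur3 hash and a slightly more involved formula.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Simplified routing rule: hash(routing) modulo the primary shard count."""
    # Stand-in hash for illustration; not the hash Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


doc_ids = [str(i) for i in range(100)]
moved = sum(1 for d in doc_ids if pick_shard(d, 5) != pick_shard(d, 6))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because most documents map to a different shard once the count changes, existing data would no longer be found where the routing rule expects it.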
Replicas
Replicas are copies of primary shards that provide redundancy and improve search performance.
Benefits:
- High Availability: If a node fails, a replica on another node is promoted to primary, so the data stays available (replicas are not a substitute for backups)
- Increased Performance: Search queries can be executed on replicas
- Load Balancing: Distributes query load across multiple copies
Example:
// Update replica settings
PUT /products/_settings
{
"number_of_replicas": 2
}
// With 3 primary shards and 2 replicas:
// Total shards = 3 primary + (3 × 2) replicas = 9 shards
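The shard arithmetic above generalizes to a one-line formula. The helper names below are hypothetical; the second function relies on the fact that a replica is never allocated to the same node as its primary, so a full set of copies needs at least replicas + 1 nodes.

```python
def total_shards(primaries: int, replicas: int) -> int:
    """Primary shards plus `replicas` copies of each primary."""
    return primaries * (1 + replicas)


def min_nodes_for_all_copies(replicas: int) -> int:
    """Each copy of a shard must live on a different node."""
    return replicas + 1


print(total_shards(5, 1))
print(total_shards(3, 2))
print(min_nodes_for_all_copies(2))
```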
Near Real-Time (NRT)
Elasticsearch is a near real-time search platform, meaning there's a slight delay between indexing a document and when it becomes searchable.
Key Points:
- Default refresh interval: 1 second
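- With the default setting, documents become searchable within about 1 second of indexing
- The refresh interval can be configured per index
- Trade-off: a shorter interval makes new documents visible sooner, while a longer one improves indexing throughput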
Example:
// Index a document
POST /products/_doc
{
"name": "New Product",
"price": 299
}
// Document is searchable after ~1 second
// Force immediate refresh (not recommended for production)
POST /products/_refresh
// Configure refresh interval (here: refresh every 30 seconds)
PUT /products/_settings
{
"index": {
"refresh_interval": "30s"
}
}
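The delay can be pictured as a write buffer that only becomes visible to searches on refresh. The class below is a toy model under that assumption; `ToyShard` and its methods are hypothetical names and leave out everything else a real shard does (segments, translog, merging).

```python
class ToyShard:
    """Toy model of near real-time visibility (illustration only)."""

    def __init__(self):
        self._buffer = []       # indexed, not yet searchable
        self._searchable = []   # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        """Make everything indexed so far visible to searches."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, name):
        return [d for d in self._searchable if d.get("name") == name]


shard = ToyShard()
shard.index({"name": "New Product", "price": 299})
print(shard.search("New Product"))   # nothing visible before a refresh
shard.refresh()
print(shard.search("New Product"))   # visible after the refresh
```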
Additional Important Concepts
Mapping
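A mapping defines how documents and their fields are stored and indexed, much like a schema. New fields can be added to an existing mapping, but the type of an existing field generally cannot be changed without reindexing.
Example: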
PUT /products/_mapping
{
"properties": {
"name": { "type": "text" },
"price": { "type": "float" },
"in_stock": { "type": "boolean" },
"created_at": { "type": "date" }
}
}
Inverted Index
The core data structure that makes searching fast:
- Maps terms to the documents containing them
- Similar to the index at the back of a book, which lists the pages where each term appears
- Enables full-text search capabilities
Example:
Term "apple" appears in:
- Document 1
- Document 5
- Document 12
Term "phone" appears in:
- Document 2
- Document 5
- Document 8
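The term-to-documents structure above can be built in a few lines of Python. This is a minimal sketch: the sample texts are made up to reproduce the example, the function names are hypothetical, and tokenization is just lowercasing and splitting on whitespace, whereas real analyzers do far more.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, *terms):
    """Return IDs of documents that contain every given term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "red apple",
    2: "phone case",
    5: "apple phone",
    8: "phone charger",
    12: "apple pie",
}
index = build_inverted_index(docs)
print(sorted(index["apple"]))
print(sorted(index["phone"]))
print(sorted(search_all(index, "apple", "phone")))
```

A multi-term query becomes a set intersection over the per-term lists, which is why lookups stay fast even when the number of documents is large.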
How These Concepts Work Together
- Data Storage Flow: Document → Index → Shard → Node → Cluster
- Search Flow: Query → Cluster → Nodes → Shards → Documents → Results
- Redundancy: Primary Shard → Replica Shards → Different Nodes
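The search flow is a scatter-gather pattern: the node that receives the query asks each shard for its best hits and then merges them into one result list. The sketch below is a simplified model of that idea. The function name, the dictionary-based shards, and the price-based scoring are all assumptions for illustration; a real search runs separate query and fetch phases.

```python
import heapq


def search_shards(shards, score, size):
    """Collect the top `size` hits per shard, then merge them globally."""
    per_shard = [
        heapq.nlargest(size, ((score(doc), doc_id) for doc_id, doc in shard.items()))
        for shard in shards
    ]
    # The coordinating step: merge the partial results into one ranking.
    merged = heapq.nlargest(size, (hit for hits in per_shard for hit in hits))
    return [doc_id for _, doc_id in merged]


shards = [
    {"a": {"price": 10}, "b": {"price": 30}},
    {"c": {"price": 20}},
]
print(search_shards(shards, score=lambda doc: doc["price"], size=2))
```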
Best Practices
Cluster Planning:
- Use an odd number of master-eligible nodes (3, 5, 7); electing a master requires a majority, so an even count adds no extra fault tolerance
- Use dedicated master and data nodes in large clusters
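The preference for an odd number of master-eligible nodes follows from majority arithmetic. The sketch below uses hypothetical helper names and assumes a plain majority quorum; it does not model how Elasticsearch manages its voting configuration.

```python
def majority(master_eligible: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can fail while a majority remains."""
    return master_eligible - majority(master_eligible)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
```

Going from 3 to 4 master-eligible nodes raises the required majority without increasing the number of failures the cluster can survive.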
Index Design:
- Keep indices focused on specific data types
- Use meaningful, lowercase names
- Plan shard count based on data volume
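Shard Sizing:
- Aim for 20-40GB per shard as a rule of thumb (Elastic's general guidance is 10GB to 50GB); the right size depends on the workload
- Avoid many small shards, since each shard carries fixed overhead
- Consider future growth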
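A first estimate of the primary shard count can be derived from the expected data volume and a target shard size. The function below is a back-of-the-envelope sketch with a hypothetical name; the 30GB default is an assumption taken from the middle of the 20-40GB range above, and growth or query patterns may justify a different number.

```python
import math


def estimate_primary_shards(total_gb: float, target_shard_gb: float = 30.0) -> int:
    """Rough primary shard count for an expected data volume."""
    if total_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_gb / target_shard_gb))


print(estimate_primary_shards(300))       # 300GB at about 30GB per shard
print(estimate_primary_shards(10))        # small indices still need one shard
print(estimate_primary_shards(100, 40))   # custom target size
```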
Replica Strategy:
- At least 1 replica for production
- More replicas for read-heavy workloads
- Ensure enough nodes to distribute replicas
Common Beginner Mistakes
- Too Many Shards: Creating hundreds of small shards impacts performance
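- No Replicas: Running without replicas risks data loss and downtime when a node fails
- Wrong Refresh Settings: Forcing a refresh after every write, or setting a very low refresh interval, to get real-time visibility at the cost of indexing performance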
- Ignoring Cluster Health: Not monitoring yellow or red cluster states
Next Steps
Now that you understand the key concepts, let's proceed to install Elasticsearch and start working with these concepts hands-on.