Key Concepts

Key Concepts in Elasticsearch

Understanding these fundamental concepts is essential for working effectively with Elasticsearch. Let's explore each concept with examples to make them clear for beginners.

Cluster

A cluster is a collection of one or more nodes (servers) that work together to store your data and provide search capabilities across all nodes.

Key Points:

  • Each cluster has a unique name (default: "elasticsearch")
  • Nodes join a cluster by using the cluster name
  • A cluster can have just one node (single-node cluster)
  • Provides automatic load balancing and failover

Example:

// Check cluster health
GET /_cluster/health

// Response
{
  "cluster_name": "my-application",
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 3
}

Node

A node is a single server that is part of your cluster, stores data, and participates in the cluster's indexing and search capabilities.

Types of Nodes:

  1. Master Node: Manages cluster-wide operations
  2. Data Node: Stores data and executes data-related operations
  3. Ingest Node: Preprocesses documents before indexing
  4. Coordinating Node: Routes requests and aggregates results

Example:

// Check node information
GET /_nodes

// Each node has:
{
  "name": "node-1",
  "transport_address": "127.0.0.1:9300",
  "host": "127.0.0.1",
  "roles": ["master", "data", "ingest"]
}

Index

An index is a collection of documents that have similar characteristics. It's similar to a database in the relational world.

Key Points:

  • Index names must be lowercase
  • An index can contain multiple document types (deprecated in newer versions)
  • Each index has its own settings and mappings

Example:

// Create an index
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

// Index structure example:
// products (index)
//   ├── laptop (document)
//   ├── phone (document)
//   └── tablet (document)

Document

A document is a basic unit of information that can be indexed. It's expressed in JSON format and stored within an index.

Key Points:

  • Similar to a row in a relational database
  • Each document has a unique ID
  • Documents are immutable (updates create new versions)
  • Can contain nested objects and arrays

Example:

// A document in the products index
{
  "_index": "products",
  "_id": "1",
  "_source": {
    "name": "iPhone 13",
    "category": "smartphones",
    "price": 999,
    "features": ["5G", "A15 Bionic", "Dual Camera"],
    "manufacturer": {
      "name": "Apple",
      "country": "USA"
    }
  }
}

Shard

A shard is a single Lucene instance and a fundamental unit of storage in Elasticsearch. Each index is divided into shards for scalability.

Types of Shards:

  1. Primary Shards: Original shards that hold the data
  2. Replica Shards: Copies of primary shards for redundancy

Key Points:

  • Number of primary shards is fixed at index creation
  • Each shard is a fully functional index
  • Shards allow horizontal scaling
  • Default: 1 primary shard per index

Example:

// Creating an index with custom shard settings
PUT /logs
{
  "settings": {
    "number_of_shards": 5,      // 5 primary shards
    "number_of_replicas": 1      // 1 replica per primary shard
  }
}

// Total shards = 5 primary + 5 replica = 10 shards

Replicas

Replicas are copies of primary shards that provide redundancy and improve search performance.

Benefits:

  1. High Availability: If a node fails, replicas ensure no data loss
  2. Increased Performance: Search queries can be executed on replicas
  3. Load Balancing: Distributes query load across multiple copies

Example:

// Update replica settings
PUT /products/_settings
{
  "number_of_replicas": 2
}

// With 3 primary shards and 2 replicas:
// Total shards = 3 primary + (3 × 2) replicas = 9 shards

Near Real-Time (NRT)

Elasticsearch is a near real-time search platform, meaning there's a slight delay between indexing a document and when it becomes searchable.

Key Points:

  • Default refresh interval: 1 second
  • Documents are searchable within ~1 second of indexing
  • Can be configured per index
  • Trade-off between real-time and performance

Example:

// Index a document
POST /products/_doc
{
  "name": "New Product",
  "price": 299
}

// Document is searchable after ~1 second

// Force immediate refresh (not recommended for production)
POST /products/_refresh

// Configure refresh interval
PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s"  // Refresh every 30 seconds
  }
}

Additional Important Concepts

Mapping

Defines how documents and their fields are stored and indexed.

PUT /products/_mapping
{
  "properties": {
    "name": { "type": "text" },
    "price": { "type": "float" },
    "in_stock": { "type": "boolean" },
    "created_at": { "type": "date" }
  }
}

Inverted Index

The core data structure that makes searching fast:

  • Maps terms to the documents containing them
  • Similar to an index in a book
  • Enables full-text search capabilities

Example:

Term "apple" appears in:
  - Document 1
  - Document 5
  - Document 12

Term "phone" appears in:
  - Document 2
  - Document 5
  - Document 8

How These Concepts Work Together

  1. Data Storage Flow:

    Document → Index → Shard → Node → Cluster
    
  2. Search Flow:

    Query → Cluster → Nodes → Shards → Documents → Results
    
  3. Redundancy:

    Primary Shard → Replica Shards → Different Nodes
    

Best Practices

  1. Cluster Planning:

    • Use odd number of master-eligible nodes (3, 5, 7)
    • Separate master and data nodes for large clusters
  2. Index Design:

    • Keep indices focused on specific data types
    • Use meaningful, lowercase names
    • Plan shard count based on data volume
  3. Shard Sizing:

    • Aim for 20-40GB per shard
    • Avoid too many small shards
    • Consider future growth
  4. Replica Strategy:

    • At least 1 replica for production
    • More replicas for read-heavy workloads
    • Ensure enough nodes to distribute replicas

Common Beginner Mistakes

  1. Too Many Shards: Creating hundreds of small shards impacts performance
  2. No Replicas: Running without replicas risks data loss
  3. Wrong Refresh Settings: Setting refresh to 0 for real-time at the cost of performance
  4. Ignoring Cluster Health: Not monitoring yellow or red cluster states

Next Steps

Now that you understand the key concepts, let's proceed to install Elasticsearch and start working with these concepts hands-on.