Key Concepts

Key Concepts in Elasticsearch

Understanding these fundamental concepts is essential for working effectively with Elasticsearch. Let's explore each concept with examples to make them clear for beginners.

Cluster

A cluster is a collection of one or more nodes (servers) that work together to store your data and provide search capabilities across all nodes.

Key Points:

  • Each cluster has a unique name (default: "elasticsearch")
  • Nodes join a cluster by using the cluster name
  • A cluster can have just one node (single-node cluster)
  • Provides automatic load balancing and failover

Example:


Node

A node is a single server that is part of your cluster, stores data, and participates in the cluster's indexing and search capabilities.

Types of Nodes:

  1. Master Node: Manages cluster-wide operations
  2. Data Node: Stores data and executes data-related operations
  3. Ingest Node: Preprocesses documents before indexing
  4. Coordinating Node: Routes requests and aggregates results

Example:


Index

An index is a collection of documents that have similar characteristics. It's similar to a database in the relational world.

Key Points:

  • Index names must be lowercase
  • An index can contain multiple document types (deprecated in newer versions)
  • Each index has its own settings and mappings

Example:


Document

A document is a basic unit of information that can be indexed. It's expressed in JSON format and stored within an index.

Key Points:

  • Similar to a row in a relational database
  • Each document has a unique ID
  • Documents are immutable (updates create new versions)
  • Can contain nested objects and arrays

Example:


Shard

A shard is a single Lucene instance and a fundamental unit of storage in Elasticsearch. Each index is divided into shards for scalability.

Types of Shards:

  1. Primary Shards: Original shards that hold the data
  2. Replica Shards: Copies of primary shards for redundancy

Key Points:

  • Number of primary shards is fixed at index creation
  • Each shard is a fully functional index
  • Shards allow horizontal scaling
  • Default: 1 primary shard per index

Example:


Replicas

Replicas are copies of primary shards that provide redundancy and improve search performance.

Benefits:

  1. High Availability: If a node fails, replicas ensure no data loss
  2. Increased Performance: Search queries can be executed on replicas
  3. Load Balancing: Distributes query load across multiple copies

Example:


Near Real-Time (NRT)

Elasticsearch is a near real-time search platform, meaning there's a slight delay between indexing a document and when it becomes searchable.

Key Points:

  • Default refresh interval: 1 second
  • Documents are searchable within ~1 second of indexing
  • Can be configured per index
  • Trade-off between real-time and performance

Example:


Additional Important Concepts

Mapping

Defines how documents and their fields are stored and indexed.


Inverted Index

The core data structure that makes searching fast:

  • Maps terms to the documents containing them
  • Similar to an index in a book
  • Enables full-text search capabilities

Example:


How These Concepts Work Together

  1. Data Storage Flow:

    
    
  2. Search Flow:

    
    
  3. Redundancy:

    
    

Best Practices

  1. Cluster Planning:

    • Use odd number of master-eligible nodes (3, 5, 7)
    • Separate master and data nodes for large clusters
  2. Index Design:

    • Keep indices focused on specific data types
    • Use meaningful, lowercase names
    • Plan shard count based on data volume
  3. Shard Sizing:

    • Aim for 20-40GB per shard
    • Avoid too many small shards
    • Consider future growth
  4. Replica Strategy:

    • At least 1 replica for production
    • More replicas for read-heavy workloads
    • Ensure enough nodes to distribute replicas

Common Beginner Mistakes

  1. Too Many Shards: Creating hundreds of small shards impacts performance
  2. No Replicas: Running without replicas risks data loss
  3. Wrong Refresh Settings: Setting refresh to 0 for real-time at the cost of performance
  4. Ignoring Cluster Health: Not monitoring yellow or red cluster states

Next Steps

Now that you understand the key concepts, let's proceed to install Elasticsearch and start working with these concepts hands-on.