Inserting Data in Elasticsearch

Learn how to insert documents into Elasticsearch using various methods and options. This tutorial covers single document insertion, bulk operations, and important concepts for data indexing.

Prerequisites

Before inserting data, ensure you have:

Elasticsearch running (check with curl localhost:9200)
An index created (or use automatic index creation)

Basic Document Insertion

Create Document with ID

Add a document to an index with a specific ID:

PUT /products/_doc/1
{
  "name": "Laptop Pro",
  "category": "Electronics",
  "price": 1299.99,
  "in_stock": true,
  "specs": {
    "cpu": "Intel i7",
    "ram": "16GB",
    "storage": "512GB SSD"
  }
}

Response:

{
  "_index": "products",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Understanding the Response

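_index: The index where the document was stored
_id: The document's unique identifier
_version: Version number (starts at 1 and increments with every update)
result: Operation result (created or updated)
_shards: How many shard copies (primary plus replicas) the write targeted, and how many succeeded or failed
_seq_no: Sequence number assigned to this operation, used for optimistic concurrency control
_primary_term: Counter that increments each time a new primary shard is promoted; used together with _seq_no for optimistic concurrency control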
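
As a rough illustration of how these fields might be read in client code, the sketch below summarizes a response body that has already been parsed into a dictionary. The function name summarize_index_response is hypothetical, and the interpretation of the gap between total and successful (typically replicas that are not currently assigned) is an assumption about a small test cluster.

```python
def summarize_index_response(response: dict) -> str:
    """Return a one-line summary of an index API response body."""
    shards = response["_shards"]
    # "total" counts the primary plus its replicas; copies that neither
    # succeeded nor failed were simply not written (e.g. unassigned replicas).
    not_written = shards["total"] - shards["successful"] - shards["failed"]
    return (
        f"{response['result']} {response['_index']}/{response['_id']} "
        f"v{response['_version']} "
        f"(copies ok={shards['successful']}, failed={shards['failed']}, "
        f"not written={not_written})"
    )


# The sample response shown earlier in this section.
sample = {
    "_index": "products",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {"total": 2, "successful": 1, "failed": 0},
    "_seq_no": 0,
    "_primary_term": 1,
}

print(summarize_index_response(sample))
# created products/1 v1 (copies ok=1, failed=0, not written=1)
```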

Document Creation Methods

1. PUT with ID (Create or Update)

PUT /users/_doc/100
{
  "username": "john_doe",
  "email": "john_doe@example.com",
  "registered_date": "2024-01-15"
}

This creates the document if ID 100 does not exist. If it does exist, the stored document is replaced in full (not merged) and _version is incremented.

2. Create Only (Fail if Exists)

Using op_type parameter:

PUT /users/_doc/100?op_type=create
{
  "username": "jane_doe",
  "email": "jane_doe@example.com"
}

Or using _create endpoint:

PUT /users/_create/100
{
  "username": "jane_doe",
  "email": "jane_doe@example.com"
}
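
The difference between the two methods comes down to the endpoint used and the HTTP status returned. The sketch below is a minimal illustration, not a client implementation: index_path and interpret_status are hypothetical helper names, and the status codes covered are the three common outcomes (201, 200, 409).

```python
from urllib.parse import quote


def index_path(index: str, doc_id: str, create_only: bool = False) -> str:
    """Build the request path: _create fails on an existing ID, _doc overwrites."""
    endpoint = "_create" if create_only else "_doc"
    return f"/{quote(index, safe='')}/{endpoint}/{quote(doc_id, safe='')}"


def interpret_status(status: int) -> str:
    """Map the HTTP status of an index request to a human-readable outcome."""
    outcomes = {
        201: "created",
        200: "updated (existing document replaced)",
        409: "conflict (document already exists or version mismatch)",
    }
    return outcomes.get(status, f"unexpected status {status}")


print(index_path("users", "100"))                    # /users/_doc/100
print(index_path("users", "100", create_only=True))  # /users/_create/100
print(interpret_status(409))
```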

3. Automatic ID Generation

Let Elasticsearch generate a unique ID:

POST /logs/_doc
{
  "timestamp": "2024-01-15T10:30:00",
  "level": "INFO",
  "message": "Application started successfully",
  "service": "auth-service"
}

The response (abbreviated here) includes the generated ID in _id:

{
  "_index": "logs",
  "_id": "dXuSt4sBX_Z_kb8rP3qY",
  "_version": 1,
  "result": "created"
}

Bulk Operations

To insert multiple documents efficiently, send them in a single _bulk request:

POST /_bulk
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Smartphone", "price": 699.99, "category": "Electronics" }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Tablet", "price": 499.99, "category": "Electronics" }
{ "index": { "_index": "products" } }
{ "name": "Headphones", "price": 199.99, "category": "Audio" }

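Note: The body is newline-delimited JSON (NDJSON), not a JSON array. Each action line and each document must occupy exactly one line, and the last line must end with a newline character.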
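
Building this body by hand is error-prone, so it is usually generated. The following is a minimal sketch, assuming documents arrive as (id, source) pairs where the ID may be None to request an auto-generated one; build_bulk_body is a hypothetical helper, not part of any Elasticsearch client.

```python
import json


def build_bulk_body(index: str, docs) -> str:
    """Serialize (doc_id or None, source) pairs into an NDJSON bulk body."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index}}
        if doc_id is not None:
            action["index"]["_id"] = doc_id
        # json.dumps without indent never emits a raw newline, so each
        # action and each document stays on a single line.
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line, including the last one, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body(
    "products",
    [
        ("2", {"name": "Smartphone", "price": 699.99, "category": "Electronics"}),
        (None, {"name": "Headphones", "price": 199.99, "category": "Audio"}),
    ],
)
print(body, end="")
```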

Bulk Insert from File

curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/json" \
  --data-binary @products.json
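
A bulk request can return HTTP 200 even when some of its items fail, so the response body should be inspected. The sketch below assumes the response has already been parsed into a dictionary; the sample response is hand-written for illustration, and failed_bulk_items is a hypothetical helper name.

```python
def failed_bulk_items(bulk_response: dict) -> list:
    """Return (action, _id, status, error type) for each failed bulk item."""
    # The top-level "errors" flag is true if at least one item failed.
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response["items"]:
        # Each item has a single key naming the action (index, create, ...).
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append(
                (action, result.get("_id"), result["status"], result["error"]["type"])
            )
    return failures


# Illustrative response: the second item fails with a mapping error.
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "products", "_id": "2", "status": 201}},
        {
            "index": {
                "_index": "products",
                "_id": "3",
                "status": 400,
                "error": {"type": "mapper_parsing_exception", "reason": "..."},
            }
        },
    ],
}

print(failed_bulk_items(sample_response))
# [('index', '3', 400, 'mapper_parsing_exception')]
```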

Advanced Insertion Options

1. With Routing

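Supply a routing value so that all documents sharing that value are stored on the same shard. The same routing value must be passed again when you get, update, or delete the document: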

PUT /orders/_doc/1001?routing=user123
{
  "order_id": "1001",
  "user_id": "user123",
  "total": 299.99,
  "items": ["item1", "item2"]
}
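
The routing value does not name a shard directly; it is hashed and the hash selects the shard. The sketch below illustrates the idea in its simplified form (hash modulo the number of primary shards). It is only a conceptual model: CRC32 is used here as a stand-in hash, whereas Elasticsearch uses Murmur3 internally, so the shard numbers will not match a real cluster. The name pick_shard is hypothetical.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number (conceptual model only)."""
    # Stand-in hash: Elasticsearch uses Murmur3 internally, not CRC32.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Every order routed with "user123" lands on the same shard, which is
# why a search restricted to that routing value only touches one shard.
first = pick_shard("user123", num_primary_shards=5)
second = pick_shard("user123", num_primary_shards=5)
print(first == second)  # True
```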

2. With Refresh

Make document immediately searchable:

PUT /realtime/_doc/1?refresh=true
{
  "message": "This will be immediately searchable"
}

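Refresh options:

true: Refresh the affected shards immediately after the operation (costly under heavy indexing load)
wait_for: Hold the response until the next scheduled refresh has made the document searchable
false: Return as soon as the write completes; the document becomes searchable after the next scheduled refresh (default)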

3. With Pipeline

Apply ingest pipeline during indexing:

PUT /logs/_doc/1?pipeline=add-timestamp
{
  "message": "Log entry",
  "level": "INFO"
}
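
The request above only works if a pipeline named add-timestamp already exists. Its definition is not shown in this tutorial, so the following is a hypothetical one: it uses the set processor to copy the ingest timestamp into a field called ingested_at (an assumed field name). The script just prints the request that would create the pipeline.

```python
import json

PIPELINE_NAME = "add-timestamp"

# Hypothetical pipeline body: one "set" processor that stamps each
# document with the time it passed through the ingest node.
pipeline_definition = {
    "description": "Stamp each document with its ingest time",
    "processors": [
        {"set": {"field": "ingested_at", "value": "{{{_ingest.timestamp}}}"}}
    ],
}

request_line = f"PUT /_ingest/pipeline/{PIPELINE_NAME}"
print(request_line)
print(json.dumps(pipeline_definition, indent=2))
```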

4. With Timeout

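Set how long the request waits for the primary shard to become available, for automatic index creation, and for dynamic mapping updates (default: 1m):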

PUT /products/_doc/1?timeout=5m
{
  "name": "Special Product",
  "processing_required": true
}
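
All four options are query-string parameters on the same endpoint, so they can be combined in one request. The sketch below builds such a URL; index_url is a hypothetical helper, and the base address is assumed to be a local single-node cluster.

```python
from urllib.parse import quote, urlencode

ALLOWED_REFRESH = {"true", "false", "wait_for"}


def index_url(base, index, doc_id, *, routing=None, refresh=None,
              pipeline=None, timeout=None):
    """Build the URL for an index request with optional query parameters."""
    if refresh is not None and refresh not in ALLOWED_REFRESH:
        raise ValueError(f"refresh must be one of {sorted(ALLOWED_REFRESH)}")
    candidates = {
        "routing": routing,
        "refresh": refresh,
        "pipeline": pipeline,
        "timeout": timeout,
    }
    # Only parameters that were actually supplied appear in the URL.
    params = {name: value for name, value in candidates.items() if value is not None}
    url = f"{base}/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    return f"{url}?{urlencode(params)}" if params else url


print(index_url("http://localhost:9200", "orders", "1001",
                routing="user123", refresh="wait_for"))
# http://localhost:9200/orders/_doc/1001?routing=user123&refresh=wait_for
```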

Working with Different Data Types

Nested Objects

POST /employees/_doc
{
  "name": "Alice Johnson",
  "department": "Engineering",
  "contact": {
    "email": "alice.johnson@example.com",
    "phone": "+1-555-0123"
  },
  "projects": [
    {
      "name": "Project A",
      "status": "active"
    },
    {
      "name": "Project B",
      "status": "completed"
    }
  ]
}

Note: With dynamic mapping, contact and projects are mapped as plain object fields, not as the nested field type. To query each entry in projects independently of the others, map projects as nested explicitly before indexing.

Arrays

POST /articles/_doc
{
  "title": "Elasticsearch Tutorial",
  "tags": ["elasticsearch", "search", "database"],
  "authors": [
    "John Smith",
    "Jane Doe"
  ],
  "ratings": [4.5, 4.8, 4.2]
}

Date Formats

POST /events/_doc
{
  "event_name": "Conference",
  "start_date": "2024-03-15",
  "timestamp": "2024-03-15T09:00:00Z",
  "epoch_millis": 1710489600000
}
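
Under dynamic mapping only the two string values above are detected as dates; a bare number such as epoch_millis is mapped as long unless the field is explicitly mapped as date. To see which instant an epoch-milliseconds value represents, a small conversion helps. The helper name epoch_millis_to_iso is hypothetical.

```python
from datetime import datetime, timezone


def epoch_millis_to_iso(ms: int) -> str:
    """Convert milliseconds since the Unix epoch to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()


print(epoch_millis_to_iso(1710489600000))
# 2024-03-15T08:00:00+00:00
```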

Dynamic and Explicit Mapping

Create with Dynamic Fields

POST /dynamic_index/_doc
{
  "static_field": "known value",
  "dynamic_string": "Elasticsearch will detect this as text",
  "dynamic_number": 42,
  "dynamic_boolean": true,
  "dynamic_date": "2024-01-15"
}
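
The default detection rules can be approximated in a few lines. The sketch below is a simplified model, not the actual Elasticsearch logic: the regular expression only approximates the default date detection format, numeric detection in strings is ignored (it is off by default), and guess_dynamic_type is a hypothetical name.

```python
import re

# Rough approximation of the default date detection (ISO 8601 style strings).
_DATE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def guess_dynamic_type(value):
    """Guess the field type dynamic mapping would assign to a JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "date" if _DATE_RE.match(value) else "text (with keyword sub-field)"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        first = next((v for v in value if v is not None), None)
        return guess_dynamic_type(first)
    return "no field added"  # null values do not create a mapping


doc = {
    "static_field": "known value",
    "dynamic_number": 42,
    "dynamic_boolean": True,
    "dynamic_date": "2024-01-15",
}
inferred = {field: guess_dynamic_type(value) for field, value in doc.items()}
for field, field_type in inferred.items():
    print(f"{field}: {field_type}")
```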

Explicit Mapping Before Insert

PUT /products
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "price": { "type": "float" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date" }
    }
  }
}

Best Practices

1. Use Bulk API for Multiple Documents

Instead of:

PUT /index/_doc/1 {...}
PUT /index/_doc/2 {...}
PUT /index/_doc/3 {...}

Use:

POST /_bulk
{"index": {"_index": "index", "_id": "1"}}
{...}
{"index": {"_index": "index", "_id": "2"}}
{...}

2. Choose Appropriate ID Strategy

User-provided IDs: When you need predictable, meaningful IDs
Auto-generated IDs: For logs, events, or when ID doesn't matter

3. Consider Document Size

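Keep each request under 100MB, the default value of the http.max_content_length setting; larger requests are rejected
As a rule of thumb, aim for documents between 1KB and 100KB
Split large documents into smaller ones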

4. Handle Versioning

PUT /products/_doc/1?version=2&version_type=external
{
  "name": "Updated Product",
  "version": 2
}
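
With version_type=external, Elasticsearch accepts the write only if the supplied version is strictly greater than the stored one (or if no document exists yet); otherwise it responds with a version conflict. The "version" field in the request body is ordinary document data and plays no part in that check. The rule can be modelled in isolation as follows; this is an in-memory sketch with hypothetical names, not a call to Elasticsearch.

```python
class VersionConflict(Exception):
    """Raised when an external version is not newer than the stored one."""


def apply_external_write(stored_version, incoming_version):
    """Return the new stored version, or raise VersionConflict."""
    # stored_version is None when the document does not exist yet.
    if stored_version is not None and incoming_version <= stored_version:
        raise VersionConflict(
            f"incoming version {incoming_version} is not greater than "
            f"stored version {stored_version}"
        )
    return incoming_version


version = apply_external_write(None, 1)     # first write is accepted
version = apply_external_write(version, 2)  # newer version is accepted
try:
    apply_external_write(version, 2)        # same version is rejected
except VersionConflict as exc:
    print(f"rejected: {exc}")
```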

Common Errors and Solutions

1. Index Not Found

{
  "error": {
    "type": "index_not_found_exception",
    "reason": "no such index [products]"
  }
}

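Solution: Create the index first, or make sure automatic index creation (the action.auto_create_index cluster setting, enabled by default) has not been disabled: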

PUT /products

2. Document Already Exists

When using _create:

{
  "error": {
    "type": "version_conflict_engine_exception",
    "reason": "[1]: version conflict, document already exists"
  }
}

Solution: To overwrite the existing document, use PUT /<index>/_doc/<id> instead of _create; to change only some of its fields, use the _update API.

3. Mapping Conflict

{
  "error": {
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [price] of type [long]"
  }
}

Solution: Ensure data types match the mapping.
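
One way to catch such mismatches early is to check documents against the mapping before sending them. The sketch below is deliberately stricter than Elasticsearch, which by default coerces some values (for example numeric strings into numeric fields); it also skips field types it does not know, such as date. The helper name find_type_mismatches is hypothetical, and the mapping is the products mapping shown earlier.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


TYPE_CHECKS = {
    "text": lambda v: isinstance(v, str),
    "keyword": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": _is_number,
}


def find_type_mismatches(doc: dict, properties: dict) -> list:
    """Return (field, mapped type, actual Python type) for mismatched fields."""
    mismatches = []
    for field, value in doc.items():
        mapped_type = properties.get(field, {}).get("type")
        check = TYPE_CHECKS.get(mapped_type)
        # Unmapped fields and unknown types are left for Elasticsearch to judge.
        if check is not None and not check(value):
            mismatches.append((field, mapped_type, type(value).__name__))
    return mismatches


properties = {
    "name": {"type": "text"},
    "price": {"type": "float"},
    "in_stock": {"type": "boolean"},
    "created_at": {"type": "date"},
}

bad_doc = {"name": "Laptop Pro", "price": "not a number", "in_stock": True}
print(find_type_mismatches(bad_doc, properties))
# [('price', 'float', 'str')]
```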

Performance Tips

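Bulk Size: Start with bulk requests of roughly 5 to 15 MB and adjust based on measured throughput
Refresh Interval: Increase index.refresh_interval (default 1s) during heavy indexing, or set it to -1 to disable refreshes for the duration of a large load
Replicas: Set index.number_of_replicas to 0 during the initial bulk load, then increase it afterwards
Sharding: Plan the number of primary shards based on expected data volume before indexing, since changing it later requires reindexing or splitting the index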
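
The bulk size tip can be applied by splitting a stream of documents into several bulk bodies, each under a byte budget. The following is a minimal sketch under stated assumptions: IDs are auto-generated, the 5 MB default is only a starting point, and chunk_bulk_bodies is a hypothetical helper name.

```python
import json


def chunk_bulk_bodies(index, docs, max_bytes=5 * 1024 * 1024):
    """Yield NDJSON bulk bodies whose encoded size stays within max_bytes.

    A single document larger than max_bytes is still emitted, alone.
    """
    action_line = json.dumps({"index": {"_index": index}}) + "\n"
    chunk, size = [], 0
    for source in docs:
        pair = action_line + json.dumps(source) + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Start a new body when adding this pair would exceed the budget.
        if chunk and size + pair_size > max_bytes:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(pair)
        size += pair_size
    if chunk:
        yield "".join(chunk)


# Tiny budget so the split is visible with only ten small documents.
documents = [{"n": i} for i in range(10)]
bodies = list(chunk_bulk_bodies("products", documents, max_bytes=100))
print(f"{len(bodies)} bulk requests")
```

Each yielded body can then be sent as its own POST /_bulk request.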

Next Steps

After mastering data insertion:

Learn about reading and searching data
Understand update operations
Explore bulk processing patterns
Study index optimization techniques