Distributed search and analytics engine - indexing, querying, cluster management, and performance tuning

Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It provides near real-time search capabilities and is the core storage and query engine of the ELK Stack.

Core Concepts

Concept	Description
Index	A collection of documents with similar characteristics (roughly analogous to a table in a relational database)
Document	A JSON object stored in an index (analogous to a row)
Shard	A subdivision of an index; each shard is a self-contained Lucene index
Replica	A copy of a primary shard for high availability and read throughput
Node	A single Elasticsearch instance in a cluster
Cluster	A collection of nodes that holds all data

Index Operations

Create Index with Mapping

PUT /logs-2024
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "5s",
    "index.lifecycle.name": "logs-policy"
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "service": { "type": "keyword" },
      "message": { "type": "text", "analyzer": "standard" },
      "host": { "type": "keyword" },
      "duration_ms": { "type": "float" },
      "request_id": { "type": "keyword" },
      "metadata": { "type": "object", "dynamic": true }
    }
  }
}

Index a Document

POST /logs-2024/_doc
{
  "@timestamp": "2024-03-15T10:30:00Z",
  "level": "ERROR",
  "service": "api-gateway",
  "message": "Connection timeout to upstream service",
  "host": "prod-api-01",
  "duration_ms": 30000,
  "request_id": "abc-123"
}

Bulk Indexing

POST /_bulk
{"index": {"_index": "logs-2024"}}
{"@timestamp": "2024-03-15T10:30:00Z", "level": "INFO", "message": "Request processed"}
{"index": {"_index": "logs-2024"}}
{"@timestamp": "2024-03-15T10:30:01Z", "level": "ERROR", "message": "Connection failed"}
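The bulk body above is newline-delimited JSON: one action line followed by one source line per document, with a final newline at the end. As a minimal sketch of how a client might assemble such a body (the helper name `build_bulk_body` is hypothetical, and sending the request is left out):

```python
import json

def build_bulk_body(index, docs):
    """Return an NDJSON string for POST /_bulk: an action line plus a source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action/metadata line
        lines.append(json.dumps(doc))                            # document source line
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    {"@timestamp": "2024-03-15T10:30:00Z", "level": "INFO", "message": "Request processed"},
    {"@timestamp": "2024-03-15T10:30:01Z", "level": "ERROR", "message": "Connection failed"},
]
body = build_bulk_body("logs-2024", docs)
print(body, end="")
```

When sending this body over HTTP, the request should use the `Content-Type: application/x-ndjson` header.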

Query DSL

Full-Text Search

GET /logs-2024/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "connection timeout" } }
      ],
      "filter": [
        { "term": { "level": "ERROR" } },
        { "term": { "service": "api-gateway" } },
        {
          "range": {
            "@timestamp": {
              "gte": "2024-03-15T00:00:00Z",
              "lte": "2024-03-15T23:59:59Z"
            }
          }
        }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 50
}
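The query above puts the scored full-text clause in `must` and the exact-match and time-range clauses in `filter`. A small sketch of building the same request body programmatically, assuming a hypothetical helper `build_log_query` whose parameters mirror the fields in the mapping:

```python
import json

def build_log_query(text, level, service, start, end, size=50):
    """Build a bool query: `must` contributes to scoring, `filter` only narrows the result set."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                "filter": [
                    {"term": {"level": level}},
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": start, "lte": end}}},
                ],
            }
        },
        "sort": [{"@timestamp": "desc"}],
        "size": size,
    }

search_body = build_log_query(
    "connection timeout", "ERROR", "api-gateway",
    "2024-03-15T00:00:00Z", "2024-03-15T23:59:59Z",
)
print(json.dumps(search_body, indent=2))
```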

Aggregations

GET /logs-2024/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": "now-24h" }
    }
  },
  "aggs": {
    "errors_per_service": {
      "terms": { "field": "service", "size": 20 },
      "aggs": {
        "error_count": {
          "filter": { "term": { "level": "ERROR" } }
        },
        "avg_duration": {
          "avg": { "field": "duration_ms" }
        },
        "percentile_duration": {
          "percentiles": {
            "field": "duration_ms",
            "percents": [50, 90, 95, 99]
          }
        }
      }
    },
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "by_level": {
          "terms": { "field": "level" }
        }
      }
    }
  }
}
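The response to this request nests the sub-aggregations inside each `errors_per_service` bucket. The following sketch flattens those buckets into one row per service; the function name `summarize_services` and the sample response values are made up for illustration, and only the response shape is taken from the aggregation names used above:

```python
def summarize_services(response):
    """Flatten the errors_per_service buckets into one dict per service."""
    rows = []
    for bucket in response["aggregations"]["errors_per_service"]["buckets"]:
        total = bucket["doc_count"]
        errors = bucket["error_count"]["doc_count"]
        rows.append({
            "service": bucket["key"],
            "total": total,
            "errors": errors,
            "error_rate": errors / total if total else 0.0,
            "avg_ms": bucket["avg_duration"]["value"],
            # Percentile results are keyed by the percent as a string, e.g. "99.0".
            "p99_ms": bucket["percentile_duration"]["values"]["99.0"],
        })
    return rows

# Hypothetical response fragment, trimmed to the fields the function reads.
sample_response = {
    "aggregations": {
        "errors_per_service": {
            "buckets": [
                {
                    "key": "api-gateway",
                    "doc_count": 20,
                    "error_count": {"doc_count": 5},
                    "avg_duration": {"value": 120.5},
                    "percentile_duration": {
                        "values": {"50.0": 80.0, "90.0": 200.0, "95.0": 400.0, "99.0": 900.0}
                    },
                }
            ]
        }
    }
}
rows = summarize_services(sample_response)
```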

Cluster Management

Node Roles

Role	Flag	Purpose
Master	`node.roles: [master]`	Cluster state management, index creation/deletion
Data	`node.roles: [data]`	Stores data, executes search and aggregation
Data Hot	`node.roles: [data_hot]`	Stores frequently queried recent data (fast SSDs)
Data Warm	`node.roles: [data_warm]`	Stores less frequently queried data (standard disks)
Data Cold	`node.roles: [data_cold]`	Stores rarely queried data (large, slow disks)
Ingest	`node.roles: [ingest]`	Runs ingest pipelines before indexing
Coordinating	`node.roles: []`	Routes requests, merges results (no data)

Production Cluster Configuration

# elasticsearch.yml (data node)
cluster.name: production
node.name: data-node-01
node.roles: [data_hot, ingest]

path.data: /var/data/elasticsearch
path.logs: /var/log/elasticsearch

network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - master-01:9300
  - master-02:9300
  - master-03:9300

# Security
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

Index Lifecycle Management (ILM)

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 },
          "allocate": {
            "require": { "data": "cold" }
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
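To reason about which phase an index should be in at a given age, the `min_age` thresholds of the policy can be compared directly. This is a simplified sketch only: the helper names are hypothetical, only the `ms`, `s`, `m`, `h`, and `d` units are parsed, and real ILM measures age from rollover when rollover is used and moves indices on its poll interval once earlier actions have completed.

```python
import re
from datetime import timedelta

_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_min_age(value):
    """Convert a time value such as '7d' or '0ms' into a timedelta."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if not match:
        raise ValueError(f"unsupported time value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})

def phase_for_age(phases, age):
    """Return the name of the latest phase whose min_age has been reached, or None."""
    current = None
    ordered = sorted(phases.items(), key=lambda item: parse_min_age(item[1]["min_age"]))
    for name, phase in ordered:
        if age >= parse_min_age(phase["min_age"]):
            current = name
    return current

# min_age values copied from the logs-policy above.
phases = {
    "hot": {"min_age": "0ms"},
    "warm": {"min_age": "7d"},
    "cold": {"min_age": "30d"},
    "delete": {"min_age": "90d"},
}
print(phase_for_age(phases, timedelta(days=10)))
```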

Ingest Pipelines

PUT _ingest/pipeline/log-pipeline
{
  "description": "Parse and enrich log data",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": ["ISO8601"],
        "target_field": "@timestamp"
      }
    },
    {
      "geoip": {
        "field": "client_ip",
        "target_field": "geo",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "timestamp"
      }
    }
  ]
}
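A pipeline like this can be tried against sample documents with the simulate endpoint before it is attached to an index. The sketch below only builds the request body; the helper name `build_simulate_body` and the sample log line are assumptions, and the sample has no `client_ip` field, which the `geoip` processor tolerates because of `ignore_missing`.

```python
import json

def build_simulate_body(messages):
    """Wrap raw log lines as documents for POST /_ingest/pipeline/<name>/_simulate."""
    return {"docs": [{"_source": {"message": message}} for message in messages]}

simulate_body = build_simulate_body(
    ["2024-03-15T10:30:00Z ERROR Connection timeout to upstream service"]
)
print("POST /_ingest/pipeline/log-pipeline/_simulate")
print(json.dumps(simulate_body, indent=2))
```

The response contains one processed document per input, which makes it easy to confirm that the grok pattern matched and that `@timestamp` was set.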

Performance Tuning

Indexing Performance

Setting	Recommendation	Impact
`refresh_interval`	30s for bulk indexing, 1s for near real-time	Higher interval = better indexing throughput
`number_of_replicas`	0 during bulk load, then increase	No replication overhead during indexing
`index.translog.durability`	`async` for high throughput	Writes acknowledged since the last translog sync (every 5s by default) can be lost if the node crashes
Bulk size	5-15 MB per request	Balance between throughput and memory
Thread pool	Default is usually optimal	Monitor `rejected` count
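The bulk size guidance in the table can be applied by batching documents on encoded byte size instead of document count. A minimal sketch, assuming a hypothetical generator `chunk_bulk_bodies` and a 10 MB default that sits inside the 5-15 MB range:

```python
import json

def chunk_bulk_bodies(index, docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bulk bodies, starting a new one before max_bytes would be exceeded.

    A single document larger than max_bytes is still emitted, alone in its own body.
    """
    action = json.dumps({"index": {"_index": index}})
    batch, batch_bytes = [], 0
    for doc in docs:
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_bytes = len(entry.encode("utf-8"))
        if batch and batch_bytes + entry_bytes > max_bytes:
            yield "".join(batch)
            batch, batch_bytes = [], 0
        batch.append(entry)
        batch_bytes += entry_bytes
    if batch:
        yield "".join(batch)

# Tiny limit used here only to make the chunking visible.
chunks = list(chunk_bulk_bodies("logs-2024", ({"n": i} for i in range(10)), max_bytes=200))
print(len(chunks), [len(c.encode("utf-8")) for c in chunks])
```

Each yielded string is a complete `_bulk` body, so bodies can be sent one after another while watching the thread pool `rejected` count for signs of overload.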

Search Performance

Strategy	Description
Use `filter` context	Filters are cached and don't compute relevance scores
Prefer `keyword` fields	Exact match on `keyword` is faster than `text` search
Limit `_source` fields	Return only needed fields with `_source: ["field1", "field2"]`
Use `search_after`	More efficient than `from/size` for deep pagination
Avoid leading wildcards	A `wildcard` query with a leading `*` is expensive; prefer `term` or exact match
Shard routing	Route related documents to the same shard
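For the `search_after` strategy, each page request reuses the same query and sort, and passes the sort values of the last hit from the previous page. The sketch below assumes a hypothetical helper `next_page_body` and assumes `request_id` is unique enough to serve as a sort tiebreaker; the sample hit is invented.

```python
def next_page_body(base_body, last_page_hits):
    """Return a copy of base_body that continues after the last hit, or None when done."""
    if not last_page_hits:
        return None  # the previous page was empty, so there is nothing left to fetch
    body = dict(base_body)
    body.pop("from", None)  # search_after replaces offset-based paging
    body["search_after"] = last_page_hits[-1]["sort"]
    return body

base = {
    "size": 50,
    "query": {"term": {"level": "ERROR"}},
    # A tiebreaker field keeps the ordering stable when timestamps collide.
    "sort": [{"@timestamp": "desc"}, {"request_id": "asc"}],
}

# Hypothetical hits from the first page; each hit carries its sort values.
first_page_hits = [{"_id": "1", "sort": [1710498600000, "abc-123"]}]
second_request = next_page_body(base, first_page_hits)
```

Because the original body is copied and not mutated, the same `base` can be reused for every page.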

JVM Settings

# jvm.options
-Xms16g
-Xmx16g
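# Keep the heap at no more than 50% of available RAM, and below about 31 GB so the JVM can keep using compressed object pointers
# Set Xms equal to Xmx so the heap is never resized at runtime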
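The sizing rule in the comments above can be written as a small calculation. This is a sketch under the stated rule only (half of RAM, capped at 31 GB); the function names are hypothetical and whole gigabytes are assumed.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=31):
    """Half of the machine's RAM, capped at cap_gb and never less than 1 GB."""
    return max(1, min(int(total_ram_gb // 2), cap_gb))

def jvm_options(total_ram_gb):
    """Return matching -Xms/-Xmx lines so the heap has a fixed size."""
    heap = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]

for ram in (32, 64, 128):
    print(ram, jvm_options(ram))
```

A 32 GB machine yields the 16 GB heap shown in the `jvm.options` example, while anything from 64 GB upward is held at the 31 GB cap.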

Monitoring

Key Metrics

Metric	Healthy Range	Warning Sign
Cluster status	Green	Yellow or Red
JVM heap usage	< 75%	> 85% sustained
Search latency (p99)	< 500ms	> 1s
Indexing rate	Stable	Sudden drops
Disk usage	< 80%	> 85% per node
Pending tasks	0	Growing queue
Circuit breaker trips	0	Any trips
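The warning thresholds in the table can be turned into a simple check over a metrics snapshot. This sketch is illustrative only: the dictionary keys are hypothetical names for values that would be gathered from the cluster APIs, a single snapshot cannot tell whether heap usage is sustained, and any non-zero pending task count is flagged even though the table only warns about a growing queue.

```python
def check_cluster(metrics):
    """Return a list of warning strings for values past the table's warning thresholds."""
    warnings = []
    if metrics["status"] != "green":
        warnings.append(f"cluster status is {metrics['status']}")
    if metrics["heap_used_percent"] > 85:
        warnings.append("JVM heap usage above 85%")
    if metrics["search_p99_ms"] > 1000:
        warnings.append("search p99 latency above 1s")
    if metrics["disk_used_percent"] > 85:
        warnings.append("disk usage above 85%")
    if metrics["pending_tasks"] > 0:
        warnings.append("pending cluster tasks present")
    if metrics["breaker_trips"] > 0:
        warnings.append("circuit breaker has tripped")
    return warnings

# Hypothetical snapshots for illustration.
healthy = {"status": "green", "heap_used_percent": 60, "search_p99_ms": 200,
           "disk_used_percent": 70, "pending_tasks": 0, "breaker_trips": 0}
degraded = dict(healthy, status="yellow", heap_used_percent=90, breaker_trips=2)

print(check_cluster(healthy))
print(check_cluster(degraded))
```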

Useful Cluster APIs

# Cluster health
GET _cluster/health

# Node stats
GET _nodes/stats

# Index stats
GET /logs-*/_stats

# Pending tasks
GET _cluster/pending_tasks

# Hot threads (debugging)
GET _nodes/hot_threads

# Shard allocation explanation
GET _cluster/allocation/explain
