
Elasticsearch

Distributed search and analytics engine - indexing, querying, cluster management, and performance tuning


Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It provides near real-time search: newly indexed documents become searchable after the next index refresh, which runs every second by default. It is the core storage and query engine of the ELK Stack.

Core Concepts

| Concept | Description |
|---|---|
| Index | A collection of documents with similar characteristics (loosely analogous to a table in a relational database) |
| Document | A JSON object stored in an index (analogous to a row) |
| Shard | A subdivision of an index; each shard is a self-contained Lucene index |
| Replica | A copy of a primary shard that provides high availability and additional read throughput |
| Node | A single running Elasticsearch instance in a cluster |
| Cluster | A collection of nodes that together hold all of the data |
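Because every document lives in exactly one primary shard, Elasticsearch has to choose that shard deterministically. By default it hashes the document's routing value (the _id, unless custom routing is supplied) and maps the hash onto the primary shards. The sketch below illustrates only this hash-and-modulo idea. It substitutes zlib.crc32 for the Murmur3-based routing that Elasticsearch actually uses and ignores the routing_num_shards factor, so the shard numbers it produces will not match a real cluster.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Illustrative shard selection: hash the routing value, then take it
    modulo the primary shard count. crc32 is a stand-in hash, NOT the
    Murmur3 function Elasticsearch actually uses."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# The same routing value always maps to the same shard.
assert pick_shard("abc-123", 3) == pick_shard("abc-123", 3)

# Changing the primary shard count remaps many documents, which is why
# number_of_shards cannot be changed in place on an existing index.
ids = [f"doc-{i}" for i in range(1000)]
moved = sum(pick_shard(doc_id, 3) != pick_shard(doc_id, 5) for doc_id in ids)
print(f"{moved} of {len(ids)} documents would move to a different shard")
```

This is why changing the shard count of an existing index requires the shrink or split APIs, or a reindex, rather than a settings update.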

Index Operations

Create Index with Mapping

PUT /logs-2024
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "5s",
    "index.lifecycle.name": "logs-policy",
    "index.lifecycle.rollover_alias": "logs"
  },
  "aliases": {
    "logs": { "is_write_index": true }
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "service": { "type": "keyword" },
      "message": { "type": "text", "analyzer": "standard" },
      "host": { "type": "keyword" },
      "duration_ms": { "type": "float" },
      "request_id": { "type": "keyword" },
      "metadata": { "type": "object", "dynamic": true }
    }
  }
}
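Any HTTP client can send this request. The following standard-library sketch assumes a hypothetical unsecured cluster at http://localhost:9200. With the xpack.security settings from the production configuration, you would also need HTTPS and credentials, which are omitted here. The sketch only builds the request object and leaves the actual send commented out. Its body is a trimmed version of the mapping above.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical endpoint: adjust scheme, host, and credentials for a real cluster.
ES_URL = "http://localhost:9200"


def es_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


create_body = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "level": {"type": "keyword"},
            "message": {"type": "text"},
        }
    },
}

req = es_request("PUT", "/logs-2024", create_body)

# Sending requires a reachable cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```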

Index a Document

POST /logs-2024/_doc
{
  "@timestamp": "2024-03-15T10:30:00Z",
  "level": "ERROR",
  "service": "api-gateway",
  "message": "Connection timeout to upstream service",
  "host": "prod-api-01",
  "duration_ms": 30000,
  "request_id": "abc-123"
}
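With POST /<index>/_doc, Elasticsearch generates the document _id itself. When documents are built in application code, the main pitfall is the @timestamp format. The default date mapping accepts ISO 8601 strings (strict_date_optional_time) or epoch milliseconds. The sketch below assumes a Python producer. make_log_doc is a hypothetical helper whose fields mirror the mapping above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_log_doc(level: str, service: str, message: str, host: str,
                 duration_ms: float, request_id: str,
                 now: Optional[datetime] = None) -> dict:
    """Hypothetical helper: build a document matching the logs-2024 mapping.

    @timestamp is rendered in UTC as ISO 8601 with a trailing 'Z' (fractional
    seconds are dropped). Pass a timezone-aware datetime; naive values are
    interpreted as local time.
    """
    now = now or datetime.now(timezone.utc)
    timestamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "@timestamp": timestamp,
        "level": level,
        "service": service,
        "message": message,
        "host": host,
        "duration_ms": duration_ms,
        "request_id": request_id,
    }


doc = make_log_doc("ERROR", "api-gateway", "Connection timeout to upstream service",
                   "prod-api-01", 30000, "abc-123",
                   now=datetime(2024, 3, 15, 10, 30, tzinfo=timezone.utc))
print(json.dumps(doc))
```

To make retries idempotent, you can instead send PUT /logs-2024/_doc/<id> with an _id you choose yourself.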

Bulk Indexing

POST /_bulk
{"index": {"_index": "logs-2024"}}
{"@timestamp": "2024-03-15T10:30:00Z", "level": "INFO", "message": "Request processed"}
{"index": {"_index": "logs-2024"}}
{"@timestamp": "2024-03-15T10:30:01Z", "level": "ERROR", "message": "Connection failed"}
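The _bulk body is newline-delimited JSON. Each action line (here {"index": {...}}) is followed by the document source on the next line, and the whole body must end with a newline. The minimal standard-library sketch below produces the payload above. to_bulk_ndjson is a hypothetical helper name.

```python
import json
from typing import Iterable


def to_bulk_ndjson(index: str, docs: Iterable[dict]) -> str:
    """Serialize documents as a _bulk body: one action line plus one
    source line per document, terminated by a final newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


body = to_bulk_ndjson("logs-2024", [
    {"@timestamp": "2024-03-15T10:30:00Z", "level": "INFO", "message": "Request processed"},
    {"@timestamp": "2024-03-15T10:30:01Z", "level": "ERROR", "message": "Connection failed"},
])
print(body, end="")
```

Send the body with Content-Type: application/x-ndjson. The _bulk response returns HTTP 200 even when individual items fail, so check its top-level errors flag and the per-item status codes.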

Query DSL

GET /logs-2024/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "connection timeout" } }
      ],
      "filter": [
        { "term": { "level": "ERROR" } },
        { "term": { "service": "api-gateway" } },
        {
          "range": {
            "@timestamp": {
              "gte": "2024-03-15T00:00:00Z",
              "lte": "2024-03-15T23:59:59Z"
            }
          }
        }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 50
}
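The split between must and filter matters. The match clause in must contributes to the relevance score. The term and range clauses in filter only include or exclude documents, so they skip scoring and can be cached. The sketch below is a hypothetical Python helper that assembles the same shape of query and adds a filter only for each criterion you pass. Term queries on keyword fields are exact and case-sensitive, so "ERROR" must match the stored value.

```python
import json
from typing import Any, Dict, List, Optional


def build_log_search(text: str,
                     level: Optional[str] = None,
                     service: Optional[str] = None,
                     gte: Optional[str] = None,
                     lte: Optional[str] = None,
                     size: int = 50) -> Dict[str, Any]:
    """Scored full-text match in `must`; exact, unscored conditions in `filter`."""
    filters: List[Dict[str, Any]] = []
    if level is not None:
        filters.append({"term": {"level": level}})
    if service is not None:
        filters.append({"term": {"service": service}})
    time_range: Dict[str, str] = {}
    if gte is not None:
        time_range["gte"] = gte
    if lte is not None:
        time_range["lte"] = lte
    if time_range:
        filters.append({"range": {"@timestamp": time_range}})
    return {
        "query": {"bool": {"must": [{"match": {"message": text}}], "filter": filters}},
        "sort": [{"@timestamp": "desc"}],
        "size": size,
    }


body = build_log_search("connection timeout", level="ERROR", service="api-gateway",
                        gte="2024-03-15T00:00:00Z", lte="2024-03-15T23:59:59Z")
print(json.dumps(body, indent=2))
```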

Aggregations

GET /logs-2024/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": "now-24h" }
    }
  },
  "aggs": {
    "errors_per_service": {
      "terms": { "field": "service", "size": 20 },
      "aggs": {
        "error_count": {
          "filter": { "term": { "level": "ERROR" } }
        },
        "avg_duration": {
          "avg": { "field": "duration_ms" }
        },
        "percentile_duration": {
          "percentiles": {
            "field": "duration_ms",
            "percents": [50, 90, 95, 99]
          }
        }
      }
    },
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "by_level": {
          "terms": { "field": "level" }
        }
      }
    }
  }
}
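With "size": 0, the response contains only aggregation results. Each bucket of the errors_per_service terms aggregation has a doc_count, and the error_count filter sub-aggregation adds its own doc_count, so the error ratio per service follows directly. The sketch below post-processes the response in Python. The sample response is hypothetical and trimmed to the fields the function reads.

```python
from typing import Dict


def error_rates(response: dict) -> Dict[str, float]:
    """Error ratio per service: error_count.doc_count / bucket doc_count."""
    rates: Dict[str, float] = {}
    for bucket in response["aggregations"]["errors_per_service"]["buckets"]:
        total = bucket["doc_count"]
        errors = bucket["error_count"]["doc_count"]
        rates[bucket["key"]] = errors / total if total else 0.0
    return rates


# Hypothetical, trimmed response for illustration only.
sample = {
    "aggregations": {
        "errors_per_service": {
            "buckets": [
                {"key": "api-gateway", "doc_count": 200, "error_count": {"doc_count": 10}},
                {"key": "auth", "doc_count": 50, "error_count": {"doc_count": 0}},
            ]
        }
    }
}
print(error_rates(sample))  # {'api-gateway': 0.05, 'auth': 0.0}
```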

Cluster Management

Node Roles

| Role | Setting | Purpose |
|---|---|---|
| Master | node.roles: [master] | Master-eligible: can be elected to manage cluster state and index creation/deletion |
| Data | node.roles: [data] | Stores data, executes search and aggregation |
| Data Hot | node.roles: [data_hot] | Stores frequently queried recent data (fast SSDs) |
| Data Warm | node.roles: [data_warm] | Stores less frequently queried data (standard disks) |
| Data Cold | node.roles: [data_cold] | Stores rarely queried data (large, slower disks) |
| Ingest | node.roles: [ingest] | Runs ingest pipelines before indexing |
| Coordinating | node.roles: [] | Routes requests and merges results; holds no data |
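Placement of an index across the hot, warm, and cold tiers is driven by the index setting index.routing.allocation.include._tier_preference. This setting is a comma-separated list of tiers that are tried in order. For example, ILM's migrate action sets it to data_warm,data_hot in the warm phase. The sketch below models only that fallback rule. It assumes that a node with the generic data role belongs to every tier listed. It ignores disk watermarks, allocation filters, and the frozen tier, all of which the real allocator also considers. Node names are hypothetical.

```python
from typing import Dict, List, Set


def eligible_nodes(nodes: Dict[str, Set[str]], tier_preference: str) -> List[str]:
    """Return the nodes of the first tier in `tier_preference` that has any.

    Simplification: a node with the generic 'data' role is counted as part
    of every tier; watermarks and allocation filters are ignored.
    """
    for tier in (t.strip() for t in tier_preference.split(",")):
        members = sorted(name for name, roles in nodes.items()
                         if tier in roles or "data" in roles)
        if members:
            return members
    return []  # no matching tier: the shards would stay unassigned


cluster = {
    "hot-01": {"data_hot", "ingest"},
    "warm-01": {"data_warm"},
    "master-01": {"master"},
}
print(eligible_nodes(cluster, "data_warm,data_hot"))            # ['warm-01']
print(eligible_nodes(cluster, "data_cold,data_warm,data_hot"))  # ['warm-01']
```

The second call shows the fallback: this sample cluster has no cold node, so cold-phase shards land on the warm tier instead of staying unassigned.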

Production Cluster Configuration

# elasticsearch.yml (data node)
cluster.name: production
node.name: data-node-01
node.roles: [data_hot, ingest]

path.data: /var/data/elasticsearch
path.logs: /var/log/elasticsearch

network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - master-01:9300
  - master-02:9300
  - master-03:9300

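# Security (TLS needs certificates; the paths below are placeholders,
# e.g. files generated with elasticsearch-certutil)
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12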
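The three master-01 to master-03 entries in discovery.seed_hosts reflect the usual layout of three dedicated master-eligible nodes. Electing a master and committing cluster-state updates both require a majority of the voting configuration. The arithmetic sketch below shows why three is the smallest count that tolerates losing a node. It ignores one detail: Elasticsearch may leave a node out of the voting configuration when the number of master-eligible nodes is even.

```python
from typing import Tuple


def master_quorum(master_eligible: int) -> Tuple[int, int]:
    """Return (majority needed, master-eligible nodes that may fail)."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    majority = master_eligible // 2 + 1
    return majority, master_eligible - majority


for n in range(1, 6):
    needed, tolerated = master_quorum(n)
    print(f"{n} master-eligible: majority {needed}, tolerates {tolerated} failure(s)")
```

Under this model, two master-eligible nodes tolerate no more failures than one, and four tolerate no more than three. Even counts add cost without adding resilience.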

Index Lifecycle Management (ILM)

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 },
          "migrate": { "enabled": true }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
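ILM moves an index into the latest phase whose min_age the index has reached. For indices that roll over, age is measured from the rollover, not from index creation. The sketch below applies that rule to the thresholds of logs-policy above; the PHASES dictionary is copied from the policy, in days. Real transitions also wait for the previous phase's actions to finish and for ILM's periodic check (indices.lifecycle.poll_interval, 10 minutes by default), so they lag slightly behind this calculation.

```python
from typing import Dict

# min_age thresholds of logs-policy, in days since rollover.
PHASES: Dict[str, float] = {"hot": 0, "warm": 7, "cold": 30, "delete": 90}


def phase_for_age(age_days: float, phases: Dict[str, float] = PHASES) -> str:
    """Return the latest phase whose min_age has been reached."""
    current = None
    for name, min_age in sorted(phases.items(), key=lambda kv: kv[1]):
        if age_days >= min_age:
            current = name
    if current is None:
        raise ValueError("age is below the first phase's min_age")
    return current


for age in (0, 6.5, 7, 45, 120):
    print(age, phase_for_age(age))
```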

Ingest Pipelines

PUT _ingest/pipeline/log-pipeline
{
  "description": "Parse and enrich log data",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": ["ISO8601"],
        "target_field": "@timestamp"
      }
    },
    {
      "geoip": {
        "field": "client_ip",
        "target_field": "geo",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "timestamp"
      }
    }
  ]
}
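Before you attach the pipeline to an index, you can test the real processors with POST _ingest/pipeline/log-pipeline/_simulate and a docs array of sample documents. To reason about the grok pattern offline, the sketch below roughly approximates the grok, date, and remove steps in Python. Its regular expression is a simplified stand-in for TIMESTAMP_ISO8601 and LOGLEVEL; the real grok definitions accept more variants. The geoip step is skipped, as it would be for documents without client_ip.

```python
import re
from typing import Optional

# Simplified stand-ins for %{TIMESTAMP_ISO8601}, %{LOGLEVEL}, and %{GREEDYDATA}.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?)"
    r" (?P<level>TRACE|DEBUG|INFO|WARNING|WARN|ERROR|FATAL)"
    r" (?P<msg>.*)$"
)


def approximate_pipeline(doc: dict) -> Optional[dict]:
    """Approximate grok -> date -> remove for one document.

    Returns None where the real grok processor would fail the document.
    The date step is reduced to copying the string; Elasticsearch would
    parse it and store a normalized timestamp.
    """
    match = LOG_LINE.match(doc.get("message", ""))
    if match is None:
        return None
    out = dict(doc)
    out.update(match.groupdict())             # grok: timestamp, level, msg
    out["@timestamp"] = out.pop("timestamp")  # date into @timestamp, then remove
    return out


print(approximate_pipeline({"message": "2024-03-15T10:30:01Z ERROR Connection failed"}))
```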

Performance Tuning

Indexing Performance

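| Setting | Recommendation | Impact |
|---|---|---|
| refresh_interval | 30s for bulk indexing, 1s for near real-time search | A longer interval improves indexing throughput at the cost of search freshness |
| number_of_replicas | 0 during the initial bulk load, then increase | Avoids replication overhead during indexing; the data has no redundancy until replicas are restored |
| index.translog.durability | async for high throughput | Operations since the last fsync (index.translog.sync_interval, 5s by default) can be lost on a crash |
| Bulk request size | 5 to 15 MB per request | Balances throughput against memory pressure; benchmark to find the optimum |
| Thread pools | Defaults are usually optimal | Monitor the rejected count of the write thread pool |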
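Sizing bulk requests by bytes rather than by document count keeps each request inside the 5 to 15 MB range even when document sizes vary. The sketch below is a hypothetical batching helper. Its 10 MB default is only a starting point inside that range, and the right value should come from benchmarking.

```python
import json
from typing import Iterable, Iterator, List


def bulk_batches(index: str, docs: Iterable[dict],
                 max_bytes: int = 10 * 1024 * 1024) -> Iterator[bytes]:
    """Yield NDJSON _bulk bodies of at most max_bytes each.

    A single document larger than max_bytes is still sent, alone, in its
    own batch rather than being dropped.
    """
    action = (json.dumps({"index": {"_index": index}}) + "\n").encode("utf-8")
    batch: List[bytes] = []
    size = 0
    for doc in docs:
        pair = action + (json.dumps(doc) + "\n").encode("utf-8")
        if batch and size + len(pair) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(pair)
        size += len(pair)
    if batch:
        yield b"".join(batch)


docs = [{"level": "INFO", "message": f"event {i}"} for i in range(1000)]
batches = list(bulk_batches("logs-2024", docs, max_bytes=4096))
print(len(batches), "requests")
```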

Search Performance

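| Strategy | Description |
|---|---|
| Use filter context | Filter clauses skip relevance scoring, and frequently used filters can be cached |
| Prefer keyword fields | Exact matches on keyword fields are cheaper than analyzed full-text search |
| Limit _source fields | Return only the fields you need, e.g. "_source": ["field1", "field2"] |
| Use search_after | More efficient than from/size for deep pagination (from + size is capped by index.max_result_window, 10,000 by default) |
| Avoid leading wildcards | A pattern like *term is expensive; prefer term* or an exact match |
| Use shard routing | Route related documents to the same shard so that routed searches query only that shard |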
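search_after replaces an offset with the sort values of the last hit from the previous page, so each page costs about the same no matter how deep you go. The sketch below builds the follow-up request body. It assumes the sort ends in a unique tiebreaker; request_id is used here, which is an assumption about your data. Without a unique tiebreaker, ties can make pages skip or repeat documents. For results that stay consistent while the index changes, Elasticsearch recommends pairing search_after with a point in time (PIT).

```python
import copy
from typing import Any, Dict, List


def next_page(body: Dict[str, Any], hits: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the request body for the page after `hits`.

    Every hit in a sorted search carries a "sort" array; the last one
    becomes search_after for the next request.
    """
    if not hits:
        raise ValueError("empty page: there are no further results")
    following = copy.deepcopy(body)
    following["search_after"] = hits[-1]["sort"]
    return following


first_page = {
    "query": {"term": {"level": "ERROR"}},
    "size": 100,
    # request_id as tiebreaker assumes it is unique per document.
    "sort": [{"@timestamp": "desc"}, {"request_id": "asc"}],
}
# Hypothetical hits, trimmed to the "sort" values the helper reads.
hits = [{"sort": [1710498600000, "abc-122"]}, {"sort": [1710498600000, "abc-123"]}]
second_page = next_page(first_page, hits)
print(second_page["search_after"])
```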

JVM Settings

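# config/jvm.options.d/heap.options
# (custom file; avoid editing the bundled jvm.options directly)
-Xms16g
-Xmx16g
# Use no more than 50% of available RAM, and stay below ~31GB so the
# JVM can keep using compressed object pointers
# Set Xms equal to Xmx so the heap is never resized at runtime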
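The two sizing rules in the comments reduce to simple arithmetic. Take half of the machine's RAM, leaving the rest for the operating system's filesystem cache that Lucene relies on, and cap the result below the compressed-object-pointer threshold. The sketch below uses the 31 GB cap from the comments. The exact threshold depends on the JVM, and Elasticsearch reports at startup whether compressed ordinary object pointers are enabled, so confirm it in the logs.

```python
def heap_gb(ram_gb: float, cap_gb: int = 31) -> int:
    """Whole-gigabyte heap: half of RAM, never above cap_gb, at least 1."""
    return max(1, int(min(ram_gb / 2, cap_gb)))


def heap_options(ram_gb: float) -> str:
    """Render matching -Xms/-Xmx lines for a jvm.options.d file."""
    size = heap_gb(ram_gb)
    return f"-Xms{size}g\n-Xmx{size}g\n"


for ram in (8, 32, 64, 128):
    print(f"{ram} GB RAM -> {heap_gb(ram)} GB heap")
```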

Monitoring

Key Metrics

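| Metric | Healthy range | Warning sign |
|---|---|---|
| Cluster status | Green | Yellow or Red |
| JVM heap usage | < 75% | > 85% sustained |
| Search latency (p99) | < 500ms | > 1s |
| Indexing rate | Stable | Sudden drops |
| Disk usage | < 80% | > 85% per node (the default low disk watermark, which stops most new shard allocations to that node) |
| Pending tasks | 0 | Growing queue |
| Circuit breaker trips | 0 | Any trips |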
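An alerting check can encode the warning column of this table directly. The sketch below uses hypothetical metric names; in practice you would map them from _nodes/stats and _cluster/health. It evaluates a single sample, so it treats any pending task or breaker trip as a warning. It also leaves out the "sustained" condition on heap usage, which would need several consecutive samples.

```python
from typing import Dict, List

# Warning thresholds from the Key Metrics table; metric names are hypothetical.
WARN_ABOVE: Dict[str, float] = {
    "jvm_heap_pct": 85.0,
    "search_p99_ms": 1000.0,
    "disk_used_pct": 85.0,
    "pending_tasks": 0,
    "breaker_trips": 0,
}


def warning_signs(status: str, metrics: Dict[str, float]) -> List[str]:
    """Return human-readable warnings for one sample of metrics."""
    found = []
    if status != "green":
        found.append(f"cluster status is {status}")
    for name, limit in WARN_ABOVE.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            found.append(f"{name}={value} (warning above {limit})")
    return found


print(warning_signs("yellow", {"jvm_heap_pct": 91.5, "disk_used_pct": 62.0, "pending_tasks": 0}))
```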

Useful Cluster APIs

# Cluster health
GET _cluster/health

# Node stats
GET _nodes/stats

# Index stats
GET /logs-*/_stats

# Pending tasks
GET _cluster/pending_tasks

# Hot threads (debugging)
GET _nodes/hot_threads

# Shard allocation explanation (with no request body, it explains an arbitrary unassigned shard)
GET _cluster/allocation/explain
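These APIs work well as a triage sequence: start from _cluster/health and let its response decide which API to call next. The sketch below applies that idea to a parsed health response. The field names (status, unassigned_shards, number_of_pending_tasks) are ones GET _cluster/health returns, but the decision rules themselves are a simplified assumption.

```python
import json
from typing import Any, Dict, List


def triage(health: Dict[str, Any]) -> List[str]:
    """Suggest follow-up API calls from a parsed GET _cluster/health response."""
    steps: List[str] = []
    if health.get("unassigned_shards", 0) > 0:
        # Yellow means unassigned replicas; red means an unassigned primary.
        steps.append("GET _cluster/allocation/explain")
    if health.get("number_of_pending_tasks", 0) > 0:
        steps.append("GET _cluster/pending_tasks")
    return steps


sample = json.loads('{"status": "yellow", "unassigned_shards": 3, "number_of_pending_tasks": 0}')
print(triage(sample))  # ['GET _cluster/allocation/explain']
```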
