Basic Knowledge for Getting Started with Elasticsearch

Last updated: 2019/02/15 09:22:41

tybot

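Overview

Elasticsearch, like Apache Solr, is a full-text search application built internally on Apache Lucene. This article summarizes the installation procedure and basic command examples, based on the official pages.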

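Terminology

Cluster : A set of Nodes. The default name is "elasticsearch" (when you run several environments, give each a unique name, e.g. "logging-dev", "logging-stage", and "logging-prod")
Node : A server that makes up a Cluster. It holds Shards (or their Replicas)
Index : A collection of Documents
Type : The type of a Document
Document : A piece of information, expressible as JSON, that makes up an Index. Each Document belongs to one Type
Shard : A partition of an Index
Replica : A copy of a Shard. It must be held by a Node other than the one holding the original Shard (creating Replicas is optional)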

Installation

For a link to the latest version, check Downloads | Elasticsearch.

$ sudo yum install java-1.8.0-openjdk
$ sudo rpm -ivh https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.2.noarch.rpm
$ sudo chkconfig --add elasticsearch
$ sudo service elasticsearch start

API examples

Checking status

$ curl 'localhost:9200/_cat/health?v'
$ curl 'localhost:9200/_cat/nodes?v'
$ curl 'localhost:9200/_cat/indices?v'
$ curl 'localhost:9200/_cat/shards?v'

green : everything is healthy
yellow : replicas could not be allocated (most likely there is no other node to hold them)
red : something is wrong
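The table printed by a _cat API called with ?v can also be read from a script. The following is a minimal Python sketch, assuming the output is a header row followed by whitespace-separated rows; the sample text is illustrative only and the helper names (parse_cat_table, describe_status) are hypothetical, not part of Elasticsearch.

```python
def parse_cat_table(text):
    """Parse the whitespace-separated table printed by a _cat API with ?v."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Pair every cell with its column name from the header row.
    return [dict(zip(header, line.split())) for line in lines[1:]]


def describe_status(status):
    """Map a cluster health color to the meaning listed above."""
    meanings = {
        "green": "everything is healthy",
        "yellow": "replicas could not be allocated",
        "red": "something is wrong",
    }
    return meanings.get(status, "unknown status")


# Illustrative output only (assumption); real values depend on the cluster.
sample = """\
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign
1444000000 12:00:00  elasticsearch yellow          1         1      5   5    0    0        5
"""

rows = parse_cat_table(sample)
print(rows[0]["status"], "->", describe_status(rows[0]["status"]))
```

On a single-node setup, a yellow status together with a non-zero unassign column is what one would expect, since there is no second node to hold the replicas.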

Creating and deleting an index

To create or delete an index named customer, do the following. Specifying pretty makes Elasticsearch pretty-print the response JSON.

$ curl -XPUT 'localhost:9200/customer?pretty'
$ curl -XDELETE 'localhost:9200/customer?pretty'

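Creating, updating, and deleting documents

To add a document of type external to the customer index, do the following. If a document with the specified ID already exists, it is updated (the stored document is replaced).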

$ curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "John Doe"
}'
$ curl -XGET 'localhost:9200/customer/external/1?pretty'

The ID can also be generated automatically.

$ curl -XPOST 'localhost:9200/customer/external?pretty' -d '
{
  "name": "Jane Doe"
}'

To update a document explicitly, do the following.

$ curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "doc": { "name": "Jane Doe", "age": 20 }
}'
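As a rough local model of what the _update call does with the "doc" body, the sketch below merges a partial document into an existing one. It assumes that nested objects are merged recursively while scalar values and arrays are replaced; merge_partial is a hypothetical helper for illustration and does not call Elasticsearch.

```python
def merge_partial(source, doc):
    """Recursively merge a partial document into an existing document."""
    merged = dict(source)
    for key, value in doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars and arrays simply replace the old value.
            merged[key] = value
    return merged


existing = {"name": "John Doe"}
partial = {"name": "Jane Doe", "age": 20}
print(merge_partial(existing, partial))
```

With the document created earlier, the name is overwritten and the new age field is added, while any field not mentioned in "doc" would be left untouched.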

To delete a document, use DELETE. By specifying a condition with _query, you can delete multiple documents at once.

$ curl -XDELETE 'localhost:9200/customer/external/1?pretty'
$ curl -XDELETE 'localhost:9200/customer/external/_query?pretty' -d '
{
  "query": { "match": { "name": "John" } }
}'

Batch processing

Create two documents

$ curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'

Update id:1 and delete id:2

$ curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'
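The _bulk body is newline-delimited JSON: one action line, optionally followed by one source line, with a newline after the last line. A minimal Python sketch for generating such a body is shown here; build_bulk_body is a hypothetical helper, and sending the result to Elasticsearch is left out.

```python
import json


def build_bulk_body(operations):
    """Build a _bulk request body from (action, source) pairs.

    source is None for actions such as delete that take no second line.
    """
    lines = []
    for action, source in operations:
        lines.append(json.dumps(action))
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, ends with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ({"update": {"_id": "1"}}, {"doc": {"name": "John Doe becomes Jane Doe"}}),
    ({"delete": {"_id": "2"}}, None),
])
print(body, end="")
```

Generating the body this way avoids hand-editing JSON lines and makes it harder to forget the trailing newline.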

Search

Preparation

Let's add dummy data generated with JSON GENERATOR to an index as documents.

$ curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
$ curl 'localhost:9200/_cat/indices?v'

URI search

The '*' wildcard matches everything, and pretty formats the output. By default only 10 documents are returned.

$ curl 'localhost:9200/bank/_search?q=*&pretty'

Searching with a POST body

You can also sort the results and specify the equivalents of LIMIT and OFFSET. Specifying _source lets you retrieve only the fields you need.

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance", "age"],
  "sort": { "age": { "order": "desc" } },
  "from": 10,
  "size": 1
}'
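The from and size values correspond to OFFSET and LIMIT, so a page number can be turned into a request body mechanically. The following Python sketch shows that mapping under the assumption of 1-based page numbers; page_request and its parameters are hypothetical names chosen for illustration.

```python
import json


def page_request(page, per_page, sort_field="age"):
    """Build a search body for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return {
        "query": {"match_all": {}},
        "sort": {sort_field: {"order": "desc"}},
        "from": (page - 1) * per_page,  # OFFSET
        "size": per_page,               # LIMIT
    }


# Page 11 with one document per page gives from=10, size=1.
body = page_request(page=11, per_page=1)
print(json.dumps(body, indent=2))
```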

match query

account_number is 20

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": { "match": { "account_number": 20 } }
}'

address contains mill (case-insensitive)

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": { "match": { "address": "mill" } }
}'

address contains either mill or lane (case-insensitive). The two requests below are equivalent.

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": { "match": { "address": "mill lane" } }
}'

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}'

address contains the phrase "mill lane" (case-insensitive)

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": { "match_phrase": { "address": "mill lane" } }
}'

address contains both mill and lane (case-insensitive)

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}'

address contains neither mill nor lane (case-insensitive)

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}'

must, should, and must_not can be specified together in a single bool query.

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}'
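To make the difference between must, should, and must_not concrete, here is a toy Python model that evaluates the three clause types against local dictionaries. It is a simplification under stated assumptions: tokens are produced by lowercasing and splitting on whitespace (the real analyzer does more), scoring is ignored, and should clauses are treated as required only when there is no must clause. The sample addresses and all function names are made up for illustration.

```python
def tokens(text):
    """Lowercase and split on whitespace (a stand-in for the real analyzer)."""
    return set(str(text).lower().split())


def match(field, query):
    """Toy match clause: true if any query term appears in the field."""
    wanted = tokens(query)
    return lambda doc: bool(wanted & tokens(doc.get(field, "")))


def bool_query(must=(), should=(), must_not=()):
    """Toy bool query that returns a predicate over a document."""
    def matches(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        # With no must clause, at least one should clause has to match.
        if should and not must:
            return any(clause(doc) for clause in should)
        return True
    return matches


# Made-up sample documents.
docs = [
    {"address": "990 Mill Road"},
    {"address": "198 Mill Lane"},
    {"address": "284 Oxford Street"},
]
clauses = [match("address", "mill"), match("address", "lane")]

either = bool_query(should=clauses)
both = bool_query(must=clauses)
neither = bool_query(must_not=clauses)

for name, query in [("either", either), ("both", both), ("neither", neither)]:
    print(name, [d["address"] for d in docs if query(d)])
```

In this model a single match on "mill lane" behaves like the should variant, which mirrors the two equivalent requests for "either mill or lane".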

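Filters

Unlike a query, a filter does not score the results, and its results are cached in memory, so it is fast. Use a filter when you do not need the score, the value that expresses relevance. A filter can be combined with a query. The following example selects balances from 20000 to 30000, both bounds included.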

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}'
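The same request body can be generated from code when the bounds vary. The Python sketch below builds the filtered query shown above and includes a small local check that makes the inclusive meaning of gte and lte explicit; both function names are hypothetical.

```python
import json


def balance_range_body(low, high):
    """Build a filtered query that keeps balances in [low, high]."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"range": {"balance": {"gte": low, "lte": high}}},
            }
        }
    }


def in_range(balance, low, high):
    """Local equivalent of gte/lte: both bounds are inclusive."""
    return low <= balance <= high


body = balance_range_body(20000, 30000)
print(json.dumps(body, indent=2))
print(in_range(20000, 20000, 30000), in_range(30001, 20000, 30000))
```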

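Aggregations

Aggregations are a concept similar to GROUP BY in SQL. Because the ordinary query hits are returned together with the aggregation, the example below sets size to 0 so that no hits are returned. It groups by state and returns the 10 groups with the largest COUNT(*), in descending order.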

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state"
      }
    }
  }
}'
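For readers who think in terms of GROUP BY, the following Python sketch reproduces the idea of this terms aggregation on a local list: count documents per field value and keep the largest groups. The sample data and the helper name group_by_terms are assumptions for illustration; nothing here talks to Elasticsearch.

```python
from collections import Counter


def group_by_terms(docs, field, size=10):
    """Count documents per field value and return the top groups."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # most_common sorts by count in descending order.
    return counts.most_common(size)


# Made-up sample documents.
accounts = [
    {"state": "TX"}, {"state": "ID"}, {"state": "TX"},
    {"state": "MD"}, {"state": "TX"}, {"state": "ID"},
]
print(group_by_terms(accounts, "state"))
```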

Compute the average balance per state and sort the groups by that average in descending order (default: the top 10 groups are returned)

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}'
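The nested avg aggregation with an order on its name corresponds roughly to the local computation below: group by one field, average another, then sort the groups by the average. This is only a sketch on made-up data, and average_by is a hypothetical helper.

```python
from collections import defaultdict


def average_by(docs, group_field, value_field, size=10):
    """Average value_field per group, largest average first."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[value_field])
    averages = {key: sum(values) / len(values) for key, values in groups.items()}
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:size]


# Made-up sample documents.
accounts = [
    {"state": "TX", "balance": 1000},
    {"state": "TX", "balance": 3000},
    {"state": "ID", "balance": 5000},
    {"state": "MD", "balance": 1500},
]
print(average_by(accounts, "state", "balance"))
```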

Classify the documents into three age groups (20-29, 30-39, 40-49), then aggregate by gender within each group and show the average balance

$ curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}'
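In a range aggregation, from is inclusive and to is exclusive, which is why from 20 / to 30 yields the 20-29 group. The Python sketch below mimics the two-level grouping locally to make that boundary behavior visible; the sample accounts and function names are assumptions for illustration.

```python
from collections import defaultdict

# (from, to) pairs: from is inclusive, to is exclusive.
AGE_RANGES = [(20, 30), (30, 40), (40, 50)]


def bucket_for(age):
    """Return a label such as '20-29', or None if no range contains age."""
    for low, high in AGE_RANGES:
        if low <= age < high:
            return f"{low}-{high - 1}"
    return None


def average_balance_by_age_and_gender(docs):
    """Group by age range, then by gender, and average the balance."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for doc in docs:
        bucket = bucket_for(doc["age"])
        if bucket is None:
            continue  # outside every range, as with age 50 below
        entry = totals[(bucket, doc["gender"])]
        entry[0] += doc["balance"]
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up sample documents.
accounts = [
    {"age": 29, "gender": "F", "balance": 1000},
    {"age": 30, "gender": "F", "balance": 3000},
    {"age": 39, "gender": "F", "balance": 5000},
    {"age": 40, "gender": "M", "balance": 2000},
    {"age": 50, "gender": "M", "balance": 9000},
]
for key, average in sorted(average_balance_by_age_and_gender(accounts).items()):
    print(key, average)
```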

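For more detailed API information, see the official documentation (Elasticsearch 1.7).

System configuration

If you installed from the rpm by following the steps above, you can configure the system by editing the following files.

$ vi /etc/sysconfig/elasticsearch  # read by /etc/init.d/elasticsearch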
$ vi /etc/elasticsearch/elasticsearch.yml

Specifying the data directory

/etc/sysconfig/elasticsearch

# Elasticsearch data directory
DATA_DIR=/var/lib/elasticsearch

Memory settings

/etc/sysconfig/elasticsearch

# Heap size defaults to 256m min, 1g max
# Set ES_HEAP_SIZE to 50% of available RAM, but no more than 31g
ES_HEAP_SIZE=256m
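The comment in the file gives a rule of thumb: 50% of the available RAM, but no more than 31g. A small Python sketch of that arithmetic follows, assuming memory is expressed in megabytes; the function names are hypothetical.

```python
def recommended_heap_mb(total_ram_mb, cap_mb=31 * 1024):
    """Half of the RAM in megabytes, capped at 31g."""
    return min(total_ram_mb // 2, cap_mb)


def heap_setting(total_ram_mb):
    """Format the value as a line for /etc/sysconfig/elasticsearch."""
    return f"ES_HEAP_SIZE={recommended_heap_mb(total_ram_mb)}m"


print(heap_setting(512))         # small machine: half of the RAM
print(heap_setting(128 * 1024))  # large machine: capped at 31g
```

Under this rule, the ES_HEAP_SIZE=256m value shown above is what a machine with 512 MB of RAM would get.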

Restricting access by IP address

/etc/elasticsearch/elasticsearch.yml

# Set both 'bind_host' and 'publish_host':
#
network.host: 127.0.0.1