Playing around with Elasticsearch-senpai: a diary (loading data)

What is this article?

I'm going to be using Elasticsearch-senpai at work, so this is a record of me playing around with it.

What APIs are there for handling multiple documents?

Document APIs | Elasticsearch Reference [5.4] | Elastic

The Multi-document APIs look like the ones for pushing in or fetching a bunch of data in one go.

Multi Get API (fetching data)
Bulk API (inserting data)
Delete By Query API (deleting data)
Update By Query API (updating data)
Reindex API (reindexing, i.e. copying documents into another index)

About loading data

Bulk API

Bulk API | Elasticsearch Reference [5.4] | Elastic

The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.

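There are three endpoints: /_bulk, /{index}/_bulk, and /{index}/{type}/_bulk. The data to insert goes in the request body as newline-delimited JSON (NDJSON), one JSON object per line. When sending data, follow these rules:

The final line of data must end with a newline (\n)
Each newline may be preceded by a carriage return (\r), so CRLF line endings are fine
Send Content-Type: application/x-ndjson in the request headers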

$ curl -X POST -H 'Content-Type: application/x-ndjson' 'localhost:9200/test/account/_bulk?pretty&refresh' --data-binary "@accounts.json"
   ...
    {
      "index" : {
        "_index" : "test",
        "_type" : "account",
        "_id" : "990",
        "_version" : 1,
        "result" : "created",
        "forced_refresh" : true,
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "test",
        "_type" : "account",
        "_id" : "995",
        "_version" : 1,
        "result" : "created",
        "forced_refresh" : true,
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    }
  ]
}
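
To make the body format concrete, here is a minimal Python sketch of how a file like accounts.json could be put together: each document becomes an action line followed by a source line, and every line, including the last, ends with \n. The helper name build_bulk_body and the sample balance values are made up for illustration. Only the field names and the IDs 990 and 995 come from the output above.

```python
import json


def build_bulk_body(docs, id_field=None):
    """Build an NDJSON request body for the _bulk endpoint (hypothetical helper).

    Each document contributes two lines: an "index" action line and the
    document source itself.
    """
    lines = []
    for doc in docs:
        action = {"index": {}}
        if id_field is not None:
            # Use one of the document's own fields as the document ID.
            action["index"]["_id"] = str(doc[id_field])
        # json.dumps without indent never emits a raw newline,
        # so each object stays on exactly one line.
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # Terminate every line, including the final one, with "\n".
    return "".join(line + "\n" for line in lines)


# Sample documents; the balance values are invented for this sketch.
docs = [
    {"account_number": 990, "balance": 100},
    {"account_number": 995, "balance": 200},
]
body = build_bulk_body(docs, id_field="account_number")
print(body, end="")
```

Saved to a file, this output is the kind of payload that --data-binary "@accounts.json" sends in the curl command above. Note that curl's --data-binary keeps the newlines intact, whereas -d would strip them.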

Can we load directly from a CSV file?

No. By default a CSV file cannot be pushed in directly; it looks like it has to be converted to JSON first.

$ curl -s -X POST -H 'Content-Type: text/csv' 'localhost:9200/test/feed/_bulk?pretty&refresh' --data-binary "@test_feed1000.csv" | jq
{
  "error": "Content-Type header [text/csv] is not supported",
  "status": 406
}

Workarounds

Convert the CSV to JSON inside the application and push that in
- Elasticsearchを使って日経平均株価データでMachine Learningを体験する - Qiita
Push it in with Embulk
- CSVファイルをElasticsearchに取り込んでみる - Qiita

The first option looks doable, but there is a fair question of whether it is worth going that far just to hold the data inside Elasticsearch...
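
For reference, the first option (converting the CSV in the application) might look roughly like the Python sketch below. It is only a sketch under assumptions: the function name csv_to_bulk_chunks, the chunk size, and the id/title columns are all made up, and an in-memory string stands in for a file such as test_feed1000.csv. It reads rows with csv.DictReader and yields NDJSON bulk bodies a chunk at a time, so the whole file never has to sit in memory at once.

```python
import csv
import io
import json
from itertools import islice


def csv_to_bulk_chunks(csv_file, chunk_size=500):
    """Yield NDJSON bulk bodies, each covering at most chunk_size CSV rows.

    Hypothetical helper: the header row supplies the field names.
    """
    reader = csv.DictReader(csv_file)
    while True:
        rows = list(islice(reader, chunk_size))
        if not rows:
            break
        parts = []
        for row in rows:
            # Empty action: let Elasticsearch assign the document ID.
            parts.append(json.dumps({"index": {}}))
            parts.append(json.dumps(row, ensure_ascii=False))
        # Every line, including the last, ends with "\n".
        yield "".join(part + "\n" for part in parts)


# Stand-in for a real CSV file; the columns are invented for this sketch.
sample = io.StringIO("id,title\n1,first\n2,second\n3,third\n")
chunks = list(csv_to_bulk_chunks(sample, chunk_size=2))
for chunk in chunks:
    print(chunk, end="")
```

Each yielded chunk would then be POSTed to the _bulk endpoint with Content-Type: application/x-ndjson, the same way as the curl command earlier. One thing to keep in mind is that csv.DictReader returns every value as a string, so numeric columns would need converting before indexing if the mapping expects numbers.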

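Caveats

Depending on the design, OOMs could end up happening frequently, which is pretty scary. Even if no OOM is triggered, something like data corruption would hurt a lot, so this needs more investigation. In some cases, separating the search mechanism from the storage mechanism is probably one option, or at least that is how I came to feel about it.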

Aside

Fetching data

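While I was at it, I also checked how to read data back for verification: you can fetch it by sending a query to _search. from is the zero-based offset of the first hit to return, and size is the number of hits to return.

The example below skips the first 100 hits and returns the next 2, so it retrieves the hits at positions 100 and 101 (counting from 0).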

$ curl -s -X GET http://localhost:9200/test/account/_search -d '{"query":{"match_all": {}}, "from": "100", "size": "2"}' | jq .
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "account",
        "_id": "383",
        "_score": 1,
        "_source": {
          "account_number": 383,
          "balance": 48889,
          "firstname": "Knox",
          "lastname": "Larson",
          "age": 28,
          "gender": "F",
          "address": "962 Bartlett Place",
          "employer": "Bostonic",
          "email": "knoxlarson@bostonic.com",
          "city": "Smeltertown",
          "state": "TX"
        }
      },
      {
        "_index": "test",
        "_type": "account",
        "_id": "408",
        "_score": 1,
        "_source": {
          "account_number": 408,
          "balance": 34666,
          "firstname": "Lidia",
          "lastname": "Guerrero",
          "age": 30,
          "gender": "M",
          "address": "254 Stratford Road",
          "employer": "Snowpoke",
          "email": "lidiaguerrero@snowpoke.com",
          "city": "Fairlee",
          "state": "LA"
        }
      }
    ]
  }
}

So it turns out this can be used for something like paging in units of 100 rows.
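
As a small illustration of that paging idea, the Python sketch below computes the from and size values for a given page. The function name page_query and the zero-based page numbering are my own assumptions, and the body it builds is the same match_all query used in the curl command above.

```python
import json


def page_query(page, page_size=100):
    """Return a match_all search body for a zero-based page number.

    Hypothetical helper: page 0 covers hits 0-99, page 1 covers 100-199, ...
    """
    return {
        "query": {"match_all": {}},
        "from": page * page_size,  # zero-based offset of the first hit
        "size": page_size,         # number of hits per page
    }


# Request body for the second page of 100 hits.
print(json.dumps(page_query(1)))
```

As far as I know, from + size is capped by the index.max_result_window setting (10000 by default), so this style of paging only suits fairly shallow result sets.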