2017-11-06

Serverless Conf Tokyo 2017 に行ってきたぞ

先日の11月3日文化の日に Serverless Conf Tokyo 2017 に参加してきました！

tokyo.serverlessconf.io

モチベーション的には、最近業務的なところで、サーバーレスなシステムの実装をしているので、事例だったり　AWS Batch の話が気になって参加しました。

LT の中で、『アウトプットしないのは知的な便秘。』 と言われたので、積極的にアウトプットしてこうな。

今回は、その中で特に気になったものとまとめての感想を書きたいと思います。

資料

まとめていた方がいらっしゃるので、引用させていただきます。素晴らしいまとめ(-人-)

www.n-novice.com

会場の雰囲気とか

外人の方も多かった印象。

ブースにも外人の方が多くて、英語でガッツリ話してたのが印象的でした。

ブースで話をきくとバッチがもらえて、全部のブースを回ると Tシャツがもらえる仕組みは素敵だった。

以下、戦利品

f:id:pokotyamu:20171106012147j:plain

Step functionsとaws batchでオーケストレートするイベントドリブンな機械学習基盤

Step functionsとaws batchでオーケストレートするイベントドリブンな機械学習基盤 from Yu Yamada

ピタゴラスイッチ部分の構成図をしっかり見せていただけたのはすごくありがたかった。 S3 にデータが上がったのをトリガーに Lambda を使って、 Step Functions を使って AWS Batch のステートを管理。

一つ一つの状態を変化させるのにも Lambda を使って、Dynamo DB の情報を update する、と言った流れが非常に分かりやすかったです。

ほんとに最小単位で関数を用意するんだなという発見でした。

The mind of Serverless as a Software

slideship.com

サーバーレス全般の話について分かりやすかった。新しくサーバーレスでなんかするって人にとりあえず読んどけレベルな資料だと思いました。初心者僕が思ったんだから間違いないはず。

データの流れを一方通行にして、次の関数次の関数とどんどん渡していくことが大事。そうなれば、一つの関数ごとに単体テストがかけるはず。

またサーバーレスに向いているのは、横にスケール（並列性）する分野。向いていないのは、縦にスケール（処理速度）する分野。

ここらへんは今システム開発してて、納得感がありました。単体テストを信頼して、ちゃんと通しでも E2E で確認というのが個人的にはしっくりきています。

まとめ

そもそも Serverless が出てきたのは、アプリ開発者がアプリにコミットできるようにってことだろうし、その仕組を知っておく意味で行く価値のあるイベントでした。

インフラ/バックエンド/デザイン/フロントエンドここがすべてそれぞれの分野にコミットできるようなチームが組めるとホントに強いんだろうなと感じました。もちろん無関心は良くないと思うんですけどね。インフラもやってくぞ！！

また、今回のイベントで、Serveless な構成はビジネス面でスケールさせるためには、絶対に必要になってくる分野な気がしています。特に開発スピードが求められる環境ならば、なおさらインフラのことを考えずに開発できることはメリットでしかないので、ここらへんも使いこなしたいな。

最近、Rails エンジニアーだけじゃなく、AWS 周りも色々触り始めるテラフォーマーになったので、有識者から学べるところをどんどん吸収してきたいと思いますので、今後共よろしくお願いしますん。

個人的には、 AWS が用意した、 Python の Serverless フレームワークの chalice が気になってます！！！

github.com

Serverless フレームワークとどう違うんだろ？触ってみる！！！

2017-07-25

Elasticsearch 先輩で価格周りを触りたい

Elasticsearch 自分メモ

やりたいこと

1000 JPY みたいな文字列をいい感じに数値検索の対象にできないかなという実験。

今回の実験は以下のデータに対して、実験方法の curl を実行し、正しく指定した範囲のデータが返ってくることができるか？ということを調べる。

{"index":{"_index":"sample_index","_type":"sample_type","_id":"1"}}
{"PRICE1":"3000 JPY"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"2"}}
{"PRICE1":"5000 JPY"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"3"}}
{"PRICE1":"100 JPY"}

$ curl -XGET 'localhost:9200/sample_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "query" : {
        "range" : {
            "PRICE1": {
                "gte": 10,
                "lte": 3000
             }
        }
    }
}'

# id 1, 2 のみ検索結果にヒットすることを期待している。

自動マッピングに任せる

$ curl -X POST -H 'Content-Type: application/x-ndjson' 'http://localhost:9200/_bulk?pretty&refresh' --data-binary "@price.json"
$ curl -XGET 'localhost:9200/sample_index?pretty'
{
  "sample_index" : {
    "aliases" : { },
    "mappings" : {
      "sample_type" : {
        "properties" : {
          "PRICE1" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1500889167417",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "eZdsR51MRkeqfem48th0Rw",
        "version" : {
          "created" : "6000002"
        },
        "provided_name" : "sample_index"
      }
    }
  }
}

text 型で入ってるため、上手く数値の範囲検索を行うことが出来ない。

Numeric Detection

Dynamic field mapping | Elasticsearch Reference [5.5] | Elastic

text 型で入ってきた数値に関して、検索できるようにするぜという設定。これによって、以下のような text の中身が数値というデータに関しては、数値検索可能となる。

{"index":{"_index":"sample_index","_type":"sample_type","_id":"1"}}
{"PRICE1":"1000"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"2"}}
{"PRICE1":"10"}

$ curl -XPUT 'localhost:9200/sample_index?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "sample_type": {
      "numeric_detection": true
    }
  }
}'
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}
$ curl -X POST -H 'Content-Type: application/x-ndjson' 'http://localhost:9200/_bulk?pretty&refresh' --data-binary "@test-price.json"
$ curl -XGET 'localhost:9200/sample_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "query" : {
        "range" : {
            "PRICE1": {
                "gte": 100,
                "lte": 1000
             }
        }
    }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "sample_index",
        "_type" : "sample_type",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "PRICE1" : "1000"
        }
      }
    ]
  }
}

Tokenize

あとは、 100 JPY となっている部分を 100 と JPY に分けたら 100 に対する検索にヒットするようになりそうなので、やってみる。

$ curl -XPOST 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "tokenizer": "whitespace",
  "text": "100 JPY"
}
'

{
  "tokens" : [
    {
      "token" : "100",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "JPY",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    }
  ]
}

空白で、トークンを区切る whitespace が使えそうなので、これでトークンを　100 と JPY に分けることはできそう。

{"index":{"_index":"sample_index","_type":"sample_type","_id":"1"}}
{"analyzer": "whitespace","PRICE1":"3000 JPY"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"2"}}
{"analyzer": "whitespace","PRICE1":"5000 JPY"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"3"}}
{"analyzer": "whitespace","PRICE1":"100 JPY"}

というわけで、実験データに analyzer を足した物を実験データとして再度やり直した。

しかし、挙動が安定しない 意図通りの挙動になってくれない感じバグっぽい気もしている。またフォーラム行きかしら。。。

ちなみにこんな感じ↓ 100000 以上について指定している。

$ curl -XGET 'localhost:9200/sample_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "query" : {
        "range" : {
            "PRICE1": {
                "gte": 100000
             }
        }
    }
}'

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "sample_index",
        "_type" : "sample_type",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "analyzer" : "whitespace",
          "PRICE1" : "5000 JPY"
        }
      },
      {
        "_index" : "sample_index",
        "_type" : "sample_type",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "analyzer" : "whitespace",
          "PRICE1" : "3000 JPY"
        }
      },
      {
        "_index" : "sample_index",
        "_type" : "sample_type",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "analyzer" : "whitespace",
          "PRICE1" : "100 JPY"
        }
      }
    ]
  }
}

そもそもトークンが上手くいっていない感じだな。。。うーん。。。わからん。。。。

$ curl -XGET 'localhost:9200/sample_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "query" : {
         "term": { "PRICE1": "JPY"}
        }
    }
}'

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

【追記】

johtani さんからアドバイスを頂いた。

numeric_detectionは文字が数値に型変換できるなら数値にするという機能であり、トークナイズとは関係ないので望んでる動きにはならないかと。
— Jun Ohtani (@johtani) 2017年7月25日

アプローチの仕方が間違っていた。やっぱり、数値にアプリ側で変換してインデックス貼ってあげるのが楽な気がする。

2017-07-18

Elasticsearch 先輩で日付を触りたい

Elasticsearch 自分メモ

やりたかったこと

日付の取り回しについて調べてみた。具体的には、以下のようなデータを想定する。

20170713
2017/07/13
2017-07-13
2017年07月13日
2017713
2017/7/13
2017-7-13
2017年7月13日

これらが不特定なカラムに入っている時に、 Elasticsearch で、いい感じに受け取るにはどうすればいいか？を調べてみた。

自動的にやってくれるもの

Elasticsearch は自動的に型を判断していい感じにマッピングしてくれる機能がある。そこで使われる日付のフォーマットはこちらに一覧されている。

format | Elasticsearch Reference [5.5] | Elastic

というわけで、以下の実験データを何もマッピングされていない index に突っ込んでみる

{"index":{"_index":"sample_index","_type":"sample_type","_id":"1"}}
{"DATE1":"2017年07月13日"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"2"}}
{"DATE2":"2017/07/13"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"3"}}
{"DATE3":"2017-07-13"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"4"}}
{"DATE4":"2017.07.13"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"5"}}
{"DATE5":"20170713"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"6"}}
{"DATE6":"2017年7月13日"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"7"}}
{"DATE7":"2017/7/13"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"8"}}
{"DATE8":"2017-7-13"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"9"}}
{"DATE9":"2017.7.13"}
{"index":{"_index":"sample_index","_type":"sample_type","_id":"10"}}
{"DATE10":"2017713"}

$ curl -s -X GET 'http://localhost:9200/sample_index/' | jq .
{
  "sample_index": {
    "aliases": {},
    "mappings": {
      "sample_type": {
        "properties": {
          "DATE10": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "DATE2": {
            "type": "date",
            "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
          },
          "DATE3": {
            "type": "date"
          },
          "DATE4": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "DATE5": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "DATE6": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "DATE7": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "DATE8": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "DATE9": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1499935832320",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "Tf1HqJSfQzmoIPVKrMgnQQ",
        "version": {
          "created": "5040299"
        },
        "provided_name": "sample_index"
      }
    }
  }
}

結果から分かる通り、YYYY/MM/DD の形のみ受け付けている

そして、 format | Elasticsearch Reference [5.5] | Elastic　にある通り、07月のように0埋めしないと自動ではマッピングしてもらえない。

なので、 DATE5~10 のフォーマットは、事前に何かをしてやるとかしないと、難しそう。

フォーマットを自分で定義してみる

次に自分で mapping を指定してデータの登録を行ってみる。

{
  "mappings": {
    "sample_type": {
      "properties": {
        "DATE1": {
          "type":   "date",
          "format": "yyyy年MM月dd日"
        },
        "DATE3": {
          "type":   "date",
          "format": "yyyy-MM-dd"
        },
        "DATE4": {
          "type":   "date",
          "format": "yyyy.MM.dd"
        },
        "DATE5": {
          "type":   "date",
          "format": "yyyy年M月dd日"
        }
      }
    }
  }
}

登録は次の用に行う。

curl -X PUT 'http://localhost:9200/sample_index' --data-binary @mapping.json | jq .

これを行った後、 DATE1~5について再度データの挿入をしてみた。

$curl -X GET 'http://localhost:9200/sample_index/' | jq .
{
  "sample_index": {
    "aliases": {},
    "mappings": {
      "sample": {
        "properties": {
          "DATE1": {
            "type": "date",
            "format": "yyyy年MM月dd日"
          },
          "DATE3": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "DATE4": {
            "type": "date",
            "format": "yyyy.MM.dd"
          },
          "DATE5": {
            "type": "date",
            "format": "yyyy年M月dd日"
          }
        }
      },
      "sample_type": {
        "properties": {
          "DATE1": {
            "type": "date",
            "format": "yyyy年MM月dd日"
          },
          "DATE2": {
            "type": "date",
            "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
          },
          "DATE3": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "DATE4": {
            "type": "date",
            "format": "yyyy.MM.dd"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1500369904641",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "EN99Zmc7TuK-7ePO1S9T6A",
        "version": {
          "created": "5040299"
        },
        "provided_name": "sample_index"
      }
    }
  }
}

これで、データ型での登録ができるようになった。しかし、これでは、予め決められているカラムにしかデータの取込が出来ないため、デフォルトとして登録することを考えるもやり方が分からない。。。

Dynamic template は index の正規表現に則って、決まった index に対してマッピングを当てはめるみたいなことはできるが、不特定のカラム名に対して割り当てるのは難しい。

Dynamic templates | Elasticsearch Reference [5.5] | Elastic

うーん。。。

フォーラムで質問して見るかしら。。。

（追記 7月20日）フォーラムで質問したら、大谷さんから回答が返ってきた。

不特定カラムに対して日付のフォーマットを指定する - Elastic In Your Native Tongue / 日本語による質問・議論はこちら - Discuss the Elastic Stack

6.0.0-alpha2 だと試せるそうなので、試し見た。

$ curl -X PUT -H 'Content-Type: application/json' http://localhost:9200/my_index -d '
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["yyyy年MM月dd日", "yyyy/MM/dd", "yyyy.MM.dd", "yyyy-MM-dd"]
    }
  }
}' | jq .
{
  "acknowledged": true,
  "shards_acknowledged": true
}
$ curl -X POST -H 'Content-Type: application/json' http://localhost:9200/_bulk --data-binary @test.json 
$ curl -X GET http://localhost:9200/my_index | jq .
{
  "my_index": {
    "aliases": {},
    "mappings": {
      "my_type": {
        "dynamic_date_formats": [
          "yyyy年MM月dd日",
          "yyyy/MM/dd",
          "yyyy.MM.dd",
          "yyyy-MM-dd"
        ],
        "properties": {
          "DATE1": {
            "type": "date",
            "format": "yyyy年MM月dd日"
          },
          "DATE2": {
            "type": "date",
            "format": "yyyy/MM/dd"
          },
          "DATE3": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "DATE4": {
            "type": "date",
            "format": "yyyy.MM.dd"
          }
        }
      }
    },
  以下略
  }
}

意図通り、 date 型でのデータの保存ができるようになった。ポイントは、事前に dynamic_date_formats を指定することで、予め想定される date を確定することだった。これを template 化しておくことで、任意の index に対して、任意のカラムの日付を割り当てることができそう。

Dynamic field mapping | Elasticsearch Reference [master] | Elastic

日付が扱えるとどんな検索ができるか？

Range Query | Elasticsearch Reference [5.5] | Elastic

○日から○日の範囲内
今から何日以内
この日から何日以内

といった検索が行えそう

まとめ

文字列をトークン化するとか、予め決められたデータを処理することに対しては、 Elasticsearch は有効打にできそうだけど、日付周りは、フォーマットの関係に一工夫必要だった。

2017-06-30

Elasticsearch 先輩との戯れ日記(登場人物の整理)

Elasticsearch 自分メモ

この記事何？

お仕事で Elasticsearch 先輩を使うことになったので、そのお戯れの記録

登場人物の整理

事例を中心に調査していたが、色々混乱してきたので、 Elasticsearch の世界の登場人物を整理することにした。

雑な説明

hoge

オブジェクト	概要
Cluster	Node を束ねる存在
Node	Elasticsearch のインスタンスのこと
Shard	Index を分割したもの
Index	インデックス
Type	テーブル
Document	レコード

雑すぎるので１個ずつ見ていく。

Cluster

Node を束ねるもの。 Cluster Health を使って、 Cluster の状態を確認することができる。

green: すべての Shard が Cluster 上に配置されてる
yellow: PrimaryShard は Cluster 上に配置されているが、配置されていない ReplicaShard が存在している
red: Cluster 上に配置されていない Shard が存在している

Shard については、後述。

Node

Node | Elasticsearch Reference [5.4] | Elastic

Node には Master (eligible) Node と Data Node と Ingest node が主要な感じ(Tribe Node はよくわからなかった)

Master Eligible Node

クラスタ内で行う処理の分配を行う
もし、 Master Node が死んだら、別の Master Eligible Node が Master になってくれる(デフォルトでは全 Node)
Master Node は、基本的に分配だけをメインにやらせると安定運用につながる

Data Node

document の管理を行う
主に CRUD 操作を担当
Master Node とは役割分離させようね
I/O 系の処理はメモリいっぱい使うからちゃんと監視しようね

Data Node が死んじゃったらデータは欠損する？ discovery.zen.minimum_master_nodes を設定すれば、設定した値の数で最小構成として、データの書き込みとかを行うみたい(おそらく) なので、一応データの欠損は防げるらしい。

参考：Avoiding split brain with minimum_master_nodes

Ingest Node

Elasticsearch 5.0 から登場
index を貼る前にドキュメントを変換するために使う Node
パイプラインを作って複数の処理を組み合わせて実行もできる

Shard

Shards & Replicas

10億個の document を複数に分割して読み込めるように、 index を分割する仕組み。これによって、オペレーションの分散と並列化ができるため、パフォーマンスの向上につながる。

Primary shard を増やせば増やすほど並列数が上がって１ノードのパフォーマンスが上がるかも？ →ただ、その分スペックは要求されるデフォルトは5つの Primary Shard と各 Primary Shard 毎に 1つの Replica Shard

Replica Shard は Primary Shard コピーを複製してあるもの。このコピーは同一 Node には作成出来ない。

出典 : How Primary and Replica Shards Interact | Elasticsearch: The Definitive Guide [2.x] | Elastic

Index

Document を纏めるもの。インデックス付け、検索、更新、削除全てを行う。

Reindex | Curator Reference [5.1] | Elastic

インデックスを貼り直すとかって事もできる。ここらへんが検索周りのチューニングポイントらしい。

Type

Document を種類別に分別することができる。データベースで言うところの table

Document

１レコード分の情報が入ってるもの。 JSON で表現される。

Index Type Document の相関図はこんな感じ

出典：データ構造について – AWSで始めるElasticSearch(4) ｜ Developers.IO

まとめ

登場人物整理したら、なんとなく事例でチューニングした話とかも分かりそうな気がした。あくまで、気がしているだけ。図とかは出典元を見ていただけるとうれしいです。

番外編

こいつで、docker-compose up ってするだけで、 kibana と elasticsearch がどっちも立ち上がるんだって :sugoi:

version: '2'
services:
  kibana:
    image: kibana
    links:
      - elasticsearch:elasticsearch
    ports:
      - 5601:5601

  elasticsearch:
    image: elasticsearch
    ports:
      - 9200:9200
      - 9300:9300

公式もサポートしているっぽい。 library/elasticsearch - Docker Hub

2017-06-27

Elasticsearch 先輩との戯れ日記(データの投入)

Elasticsearch 自分メモ

この記事何？

お仕事で Elasticsearch 先輩を使うことになったので、そのお戯れの記録

複数ドキュメントを扱う API って何がある？

Document APIs | Elasticsearch Reference [5.4] | Elastic

Multi-document API いっぺんにデータ突っ込んだり取ってきたりするやつっぽいな。

Multi Get API(データの取得)
Bulk API(データの挿入)
Delete By Query API(データの削除)
Update By Query API(データの更新)
Reindex API(インデックスの貼り直し)

データの投入について

Bulk API

Bulk API | Elasticsearch Reference [5.4] | Elastic

The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.

endpoint は /_bulk, /{index}/_bulk, {index}/{type}/_bulk の３つ。パラメーターに JSON 形式で挿入したいデータを指定する。もし、データを送りたい場合は、

データの最終行に改行(\n)を入れる
もし改行したい場合は、 carriage return (\r) を使ってね
Content-Type: application/x-ndjson をヘッダーに含めて送ってね

を守ること。

$curl -X POST -H 'Content-Type: application/x-ndjson' 'localhost:9200/test/account/_bulk?pretty&refresh' --data-binary "@accounts.json"  
   ...
    {
      "index" : {
        "_index" : "test",
        "_type" : "account",
        "_id" : "990",
        "_version" : 1,
        "result" : "created",
        "forced_refresh" : true,
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "test",
        "_type" : "account",
        "_id" : "995",
        "_version" : 1,
        "result" : "created",
        "forced_refresh" : true,
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    }
  ]
}

CSV ファイルから直接いけるか？

:no_good: :no_good: デフォルトでは CSV ファイルから、直接ぶち込むことはできず、一旦 JSON 形式に変換してあげる必要がありそう。

$ curl -s -X POST -H 'Content-Type: text/csv' 'localhost:9200/test/feed/_bulk?pretty&refresh' --data-binary "@test_feed1000.csv" | jq
{
  "error": "Content-Type header [text/csv] is not supported",
  "status": 406
}

打開案

アプリ内で CSV を JSON になおしてぶち込む
- Elasticsearchを使って日経平均株価データでMachine Learningを体験する - Qiita
Embulk 使ってぶち込む
- CSVファイルをElasticsearchに取り込んでみる - Qiita

１手目はできそうだけど、そこまでして、 Elasticsearch の内部でデータを持つ必要があるのか説はあるな。。。

注意点

設計によっては、 OOM とかが頻発しちゃうのは結構怖さあるな。 OOM 発動しなくてもデータ破損とかはかなり痛いしなもうちょっとここは追加調査が必要。場合によっては、検索/保存で仕組みを切り離しちゃうのも１つなんだろうなというお気持ちになりました。

余談

データの取得

ついでに確認のために、調べたので、データの取得も _search に query を投げてあげれば取れる。 from で指定する値が行数。 size が何行取り出すか。

以下の例は100行目から２行取り出すとなるので、100,101行目が取り出せている。

$ curl -s -X GET http://localhost:9200/test/account/_search -d '{"query":{"match_all": {}}, "from": "100", "size": "2"}' | jq .
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "account",
        "_id": "383",
        "_score": 1,
        "_source": {
          "account_number": 383,
          "balance": 48889,
          "firstname": "Knox",
          "lastname": "Larson",
          "age": 28,
          "gender": "F",
          "address": "962 Bartlett Place",
          "employer": "Bostonic",
          "email": "knoxlarson@bostonic.com",
          "city": "Smeltertown",
          "state": "TX"
        }
      },
      {
        "_index": "test",
        "_type": "account",
        "_id": "408",
        "_score": 1,
        "_source": {
          "account_number": 408,
          "balance": 34666,
          "firstname": "Lidia",
          "lastname": "Guerrero",
          "age": 30,
          "gender": "M",
          "address": "254 Stratford Road",
          "employer": "Snowpoke",
          "email": "lidiaguerrero@snowpoke.com",
          "city": "Fairlee",
          "state": "LA"
        }
      }
    ]
  }
}

というわけで、100行単位でページングみたいな使い方はできそうということが分かった。

2017-06-22

Elasticsearch 先輩との戯れ日記(セットアップなどなど)

Elasticsearch 自分メモ

この記事何？

お仕事で Elasticsearch 先輩を使うことになったので、そのお戯れの記録

今回調べたものなど

まず Elasticsearch でどんなことができるかを調べるために、公式のチュートリアルとここ載ってる記事を読み漁った。

環境準備

公式さんから最新版(5.4)を Download してきた。 Kibana も合わせてローカルで準備しておく(戯れ用)

Download Elasticsearch Free • Get Started Now | Elastic Download Kibana Free • Get Started Now | Elastic

日本語全文検索用のプラグインである analysis-kuromoji を入れてみる。

# /Users/pokotyamu/Downloads/elasticsearch-5.4.2
$ bin/elasticsearch-plugin install analysis-kuromoji
-> Downloading analysis-kuromoji from elastic
[=================================================] 100%
-> Installed analysis-kuromoji

curl で叩いて、 plugin が入っていることを確認。

$ curl -X GET 'http://localhost:9200/_nodes/plugins?pretty'
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "tfQ1ZsOzQ9CYZtA0jXw2ww" : {
      "name" : "tfQ1ZsO",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1",
      "version" : "5.4.2",
      "build_hash" : "929b078",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "plugins" : [
        {
          "name" : "analysis-kuromoji",
          "version" : "5.4.2",
          "description" : "The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.",
          "classname" : "org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin",
          "has_native_controller" : false
        }
      ],
     ...
}

基本コマンド

起動は以下で行う。終了は　Ctrl-C 。 -d オプションでバックグラウンド起動可能。

# /Users/pokotyamu/Downloads/elasticsearch-5.4.2
$ bin/elasticsearch

ポートはデフォルトで 9200番が使われるみたい。 crul で叩いてみて、次のように出たらOK

$ curl localhost:9200
{
  "name" : "tfQ1ZsO",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "Y2-U0XlwSxWX771QXbIN0w",
  "version" : {
    "number" : "5.4.2",
    "build_hash" : "929b078",
    "build_date" : "2017-06-15T02:29:28.122Z",
    "build_snapshot" : false,
    "lucene_version" : "6.5.1"
  },
  "tagline" : "You Know, for Search"
}

データ投入してみる

マッピング定義

MySQL	Elasticsearch
database	index
schema	mapping
table	type
record	document

だいたいこんな感じの対応をしている。マッピングの定義はこんな感じ↓

{
    "mappings": {
        "sample": {
            "properties": {
                "name": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "coord": {
                    "type": "geo_point"
                },
                "description": {
                    "type": "string",
                    "analyzer": "kuromoji"
                }
            }
        }
    }
}

sample の部分が type の名前になっている。これを landmark という index に突っ込むためには PUT で書き換えれる。

# curl -X PUT http://localhost:9200/<Index Name>
$ curl -X PUT http://localhost:9200/landmark --data-binary @mapping.json
$ curl -s -X GET http://localhost:9200/landmark | jq .
{
  "landmark": {
    "aliases": {},
    "mappings": {
      "sample": {
        "properties": {
          "coord": {
            "type": "geo_point"
          },
          "description": {
            "type": "text",
            "analyzer": "kuromoji"
          },
          "name": {
            "type": "keyword"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1498037230489",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "9tFY4-F5S-KTyVEAa40DCg",
        "version": {
          "created": "5040299"
        },
        "provided_name": "landmark"
      }
    }
  }
}

ちなみに、このマッピング定義をしないでも、登録はできるらしい。その時は Elasticsearch 側がよしなに値から型定義をしてくれるとか。凄い。

データ投入

今回は Bulk API でデータ投入してみる。 Bulk だと一回のリクエストで一気にデータを突っ込めるみたい。

{ "index" : {} }
{ "name": "スカイツリー", "coord": { "lat": "35.710063", "lon": "139.8107"}, "description": "東京スカイツリー（とうきょうスカイツリー、英: TOKYO SKYTREE）は東京都墨田区押上一丁目にある電波塔（送信所）である。"}
{ "index" : {} }
{ "name": "BLUE NOTE TOKYO", "coord": { "lat": "35.661198", "lon": "139.716207"}, "description": "N.Y.の「Blue Note」を本店に持つジャズクラブとして、南青山にオープンしたブルーノート東京。ジャズをはじめとする多様な音楽ジャンルのトップアーティストたちが、連夜渾身のプレイを繰り広げている。"}
{ "index" : {} }
{ "name": "東京タワー", "coord": { "lat": "35.65858", "lon": "139.745433"}, "description": "東京タワー（とうきょうタワー、英: Tokyo Tower）は、東京都港区芝公園にある総合電波塔とその愛称である。正式名称は日本電波塔（にっぽんでんぱとう）。"}

# curl -X POST http://localhost:9200/<Index Name>/<Type Name>/_bulk --data-binary @landmark.json
$ curl -X POST http://localhost:9200/landmark/sample/_bulk --data-binary @landmark.json
{
    "errors": false, 
    "items": [
        {
            "index": {
                "_id": "AVzJ-slvXdDNsyfvvuMV", 
                "_index": "landmark", 
                "_shards": {
                    "failed": 0, 
                    "successful": 1, 
                    "total": 2
                }, 
                "_type": "sample", 
                "_version": 1, 
                "created": true, 
                "result": "created", 
                "status": 201
            }
        }, 
        {
            "index": {
                "_id": "AVzJ-slwXdDNsyfvvuMW", 
                "_index": "landmark", 
                "_shards": {
                    "failed": 0, 
                    "successful": 1, 
                    "total": 2
                }, 
                "_type": "sample", 
                "_version": 1, 
                "created": true, 
                "result": "created", 
                "status": 201
            }
        }, 
        {
            "index": {
                "_id": "AVzJ-slwXdDNsyfvvuMX", 
                "_index": "landmark", 
                "_shards": {
                    "failed": 0, 
                    "successful": 1, 
                    "total": 2
                }, 
                "_type": "sample", 
                "_version": 1, 
                "created": true, 
                "result": "created", 
                "status": 201
            }
        }
    ], 
    "took": 286
}

検索

最後に検索。検索も json で query を作って実行する。メソッドは GET を使用。検索に使えるのは、調べ中。。。 AND　検索はもちろん、数値の比較とか範囲指定とかもできるみたい。

$ curl -s -X GET http://localhost:9200/landmark/sample/_search -d '{"query":{"match":{"description":"ジャズ"}}}' | jq .
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.928525,
    "hits": [
      {
        "_index": "landmark",
        "_type": "sample",
        "_id": "AVzJ-slwXdDNsyfvvuMW",
        "_score": 0.928525,
        "_source": {
          "name": "BLUE NOTE TOKYO",
          "coord": {
            "lat": "35.661198",
            "lon": "139.716207"
          },
          "description": "N.Y.の「Blue Note」を本店に持つジャズクラブとして、南青山にオープンしたブルーノート東京。ジャズをはじめとする多様な音楽ジャンルのトップアーティストたちが、連夜渾身のプレイを繰り広げている。"
        }
      }
    ]
  }
}

まとめ

ひとまず、ローカルで立ち上げる→データを突っ込む→検索するの流れは試せてた。次はどーやったら、がつっとデータ突っ込めるかを調べてみる。

2017-04-03

社会人２年目もチェスト！！！

ついに社会人１年目終了。

東京の生活には全然慣れませんが、健康体でいれたので、病欠による休みはゼロでした！ ~~(5分ぐらいの遅刻1回)~~

ただ、オリジン弁当とズブズブの関係なので、体重は7キロぐらい太りました。適正体重まであとちょっとですw

１年目は、技術力が高い先輩や同期に圧倒されつつも、ちょっとずつ自分の出来ることが増えていくのを感じれることができました。

自分の中で、学びたかったスクラムの知識を整理できたのも非常に身になりました！

pokotyamu.hatenablog.com

ただ、プログラミングスキルに関しては、まだまだ発展途上… 結局 Rails しか満足に勉強できてないし…(その Rails もまだまだ修行中)

今の会社では、自分のエンジニアとしてのキャリアパスに対して、色々挑戦させていただいたり、サポートしてくれる環境が整っているので、それに応えれるように準備してこうな！

この一年は自分のプログラミングスキルを一から見直す。

そのための勉強をするときのコストはなるべく惜しまないようにしていきたいと考えていて、２年目の初日、 Mac Book Pro の 2016年モデル買いました。

f:id:pokotyamu:20170403003051p:plain

2012年7月1日から使い続けてたのね。初めてのプログラミング、研究、Ruby 合宿、就活、留学、社会人１年目、全てを共にしたこの子を讃えたい。 pic.twitter.com/ZWx2fv4kr0
— チキン☆えーちゃん先輩@６月病 (@pokotyamu) 2017年4月2日

２年目もチェスト！！！！！(宗りんおかえり！！！！！！)