Elasticsearch (ES) Tutorial

Environment setup

Installing ES

https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.3-windows-x86_64.zip

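Default address: http://localhost:9200 (8.x enables security on first start, so unless it has been disabled in config/elasticsearch.yml the address is https://localhost:9200)
Default username: elastic
Default password: changeme in older releases; 8.x instead generates a random password and prints it to the console on first start (it can be reset with bin/elasticsearch-reset-password -u elastic)

Garbled startup log on Windows: open config/jvm.options and append -Dfile.encoding=GBK at the end of the file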

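Installing the analyzer (IK)

https://release.infinilabs.com/analysis-ik/stable/

After downloading, unzip the package into a subdirectory of the ES plugins directory (for example plugins/ik) and restart ES. The plugin version must match the ES version.

Installing Kibana (web UI)

https://artifacts.elastic.co/downloads/kibana/kibana-8.14.3-windows-x86_64.zip

After a successful start the access URL is printed in the startup log. The requests below are run in Dev Tools > Console.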

Analyzer test:

 # standard is the default ES analyzer; it splits Chinese text into single characters
 POST _analyze
 {
   "analyzer":"standard",
   "text":"中华人民共和国"
 }
 
 # ik_smart: coarsest-grained split
 POST _analyze
 {
   "analyzer":"ik_smart",
   "text":"中华人民共和国"
 }
 
 # ik_max_word: finest-grained split
 POST _analyze
 {
   "analyzer":"ik_max_word",
   "text":"中华人民共和国"
 }
 
 
 # Create an index and set its default analyzer
 PUT /employee
 {
   "settings":{
     "index":{
       "analysis.analyzer.default.type":"ik_max_word"
     }
   }
 }
 # View the index settings
 GET /employee/_settings
 
 -------------------------------------------------
 # Index (insert) documents
 POST /index/_create/1
 {"content":"中国共产党万岁1"}
 POST /index/_create/2
 {"content":"中国共产党万岁2"}
 POST /index/_create/3
 {"content":"中国共产党万3岁"}
 
 # Query with highlighting
 POST /index/_search
 {
   "query": {
     "match": {
       "content": "中国"
     }
   },
   "highlight": {
     "pre_tags": ["<tag1>"],
     "post_tags": ["</tag1>"],
     "fields": {"content":{}}
   }
 }
 
 # Highlighted result (excerpt)
 {
     "_index": "index",
     "_id": "2",
     "_score": 0.13167393,
     "_source": {
         "content": "中国共产党万岁2"
     },
     "highlight": {
         "content": [
             "<tag1>中国</tag1>共产党万岁2"
         ]
     }
 }

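Explanation of the mapping properties returned by /index/_mapping:
"properties": a JSON object holding the field definitions. In this example it contains a single field, content.
"content": the name of the field being defined.
"type": "text": sets the data type of content to text. A text field is meant for full-text search; an analyzer splits its value into terms (tokens) that are indexed and searched.
"analyzer": "ik_max_word": the analyzer applied when content is indexed (written). ik_max_word comes from the IK plugin and splits text at the finest granularity, so as many candidate terms as possible are indexed, which raises recall.
"search_analyzer": "ik_smart": the analyzer applied to the query text when content is searched. ik_smart is IK's coarse-grained mode and produces fewer, longer terms, which raises precision. Using different analyzers at index time and search time raises recall while keeping precision.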

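Basics

Inverted index

Forward index

Index

An index is the logical container in which Elasticsearch stores and manages related data. It is comparable to a table in a relational database and holds a set of documents with a similar structure. Data is stored in the index as JSON documents. Every index has a unique name that search, update, and delete requests refer to. The name is user-defined but must be all lowercase. Splitting data across indices lets related data be managed and queried more efficiently.

Mapping

Similar to a table schema in a relational database

 # Create the index with a mapping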
 PUT /employee
 {
   "mappings": {
     "properties": {
       "name":{
         "type": "keyword"
       },
       "sex":{
         "type": "integer"
       },
       "age":{
         "type": "integer"
       },
       "address":{
         "type": "text",
         "analyzer": "ik_max_word"
       },
       "remark":{
         "type": "text",
         "analyzer": "ik_smart"
       }
     }
   }
 }
 
 # View the mapping
 GET /employee/_mapping

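Document

The document is the basic storage unit of Elasticsearch: a JSON object stored in an index. Its data consists of key-value pairs, where the key is the field name and the value is the field's content in one of several data types, including but not limited to strings, numbers, booleans, and objects.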

 GET /employee/_search
 
 {
     "_index": "employee",
     "_id": "1",
     "_score": 1,
     "_source": {
         "name": "张三",
         "sex": 1,
         "age": 10,
         "address": "上海浦东新区",
         "remark": "java is so great"
     }
 }

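Document metadata: fields that describe the document itself

_index : name of the index the document belongs to
_type : name of the document's type (mapping types were removed in 8.x, so this field no longer appears in responses)
_id : unique id of the document
_source : the original JSON of the document
_version : version number of the document; every update or delete increments it by 1
_seq_no : like _version, it grows on every change; a document written later gets a larger _seq_no than one written earlier to the same shard
_primary_term : used during recovery to resolve conflicts between operations that carry the same _seq_no, so that writes on the primary shard are not overwritten. It is incremented by 1 whenever the primary shard is reassigned, for example after a restart or a primary election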

Basic index operations

Creating an index

 put /index_name
 {
     "settings":{
         // index settings
     },
     "mappings":{
         "properties":{
             // field mappings
         }
     }
 }

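Index name (index_name): must be lowercase; digits and underscores are allowed
Index settings (settings):
- Number of shards (number_of_shards): determines how the index's data is distributed and how much work can run in parallel
```
 "number_of_shards": 1
```
- Number of replicas (number_of_replicas): replicas improve availability and fault tolerance
```
 "number_of_replicas": 1
```

Mappings (mappings)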

 "properties":{
     "field1":{
         "type":"text"
     },
     "field2":{
         "type":"keyword"
     }
 }

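With only the index name given, settings and mappings take their default values

 # Create the index
 put /index
 # View the index
 get /index
 get index

Deleting an index

delete /index

Querying an index

Querying the documents in an index:

get /index_name/_search
{
    "query":{
        // query clause
    }
}

# Example: search for documents whose name field matches zhang (name is a keyword field, so the whole value must equal zhang)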
get /employee/_search
{
    "query":{
        "match":{
            "name":"zhang"
        }
    }
}

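Modifying an index

Changing settings:

put /index_name/_settings
{
    "index":{
        "setting_name":"setting_value"
    }
}

# Example: set the number of replicas to 2
put /index_name/_settings
{
    "index":{
        "number_of_replicas": 2
    }
}

Changing the mapping (new fields can be added; the type of an existing field cannot be changed):

PUT /index_name/_mapping
{
     "properties": {
         "new_field": {
             "type": "field_type"
         }
     }
}

# Example: add a grade field of type integer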
PUT /index_name/_mapping
{
     "properties": {
         "grade": { 
            "type": "integer" 
         }
     }
}

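Index aliases

An index cannot be renamed after it has been created, and in many business scenarios a single index is not enough. An alias is an additional name that points to one or more indices, so clients can keep using one stable name.

Adding an alias at creation time:

PUT myindex
{
     "aliases": {
         "myindex_alias": {}
     },
     "settings": {
         "refresh_interval": "30s",
         "number_of_shards": 1,
         "number_of_replicas": 0
     }
}

Adding an alias to an existing index: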

POST /_aliases
{
    "actions": [
        {
            "add": {
                "index": "index_name",
                "alias": "alias_name"
            }
        }
    ] 
}

# Example: add the alias my_index_alias to the index my_index
POST /_aliases
{
    "actions": [
        {
            "add": {
                "index": "my_index",
                "alias": "my_index_alias"
            }
        }
    ] 
}

Searching multiple indices:

Without an alias
```
post logs_1,logs_2,logs_3/_search
```
Without an alias, using a wildcard
```
post logs_*/_search
```

With an alias

# Create the indices and attach them to one alias
put logs_1
put logs_2
put logs_3

post _aliases
{
    "actions": [
        {
            "add": {
                "index": "logs_1",
                "alias": "logs_2024"
            }
        },
        {
            "add": {
                "index": "logs_2",
                "alias": "logs_2024"
            }
        },
        {
            "add": {
                "index": "logs_3",
                "alias": "logs_2024"
            }
        }
    ]
}

# Search through the alias

post logs_2024/_search

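Document operations

Creating and updating documents

With an id: POST or PUT can be used
Without an id: only POST can be used

In other words, PUT always requires an id

Format: <op> /<index_name>/_doc/<id>

# Without an id
post /<index_name>/_doc
{
    "field1": "value1",
    "field2": "value2"
}
-----------------------
# With an id on the _doc endpoint: the document is created if it does not exist and overwritten as a whole if it does
post /<index_name>/_doc/1
{
    "field1": "value1",
    "field2": "value2"
}

put /<index_name>/_doc/1
{
    "field1": "value1",
    "field2": "value2"
}
------------------------------
# With _create (POST or PUT): the document is created if the id does not exist; if it exists a 409 error is returned
post /<index_name>/_create/1
{
    "field1": "value1",
    "field2": "value2"
}

----------------------
# To update only the given fields instead of replacing the whole document, use _update
post /<index_name>/_update/1
{
    "doc":{
        "field1": "value1",
        "field2": "value2"
    }
}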

Bulk operations

Basic syntax

POST /<index_name>/_bulk
{ "index" : { "_index" : "<index_name>", "_id" : "<optional_document_id>" } }
{ "field1" : "value1", "field2" : "value2" }
{ "update" : { "_index" : "<index_name>", "_id" : "<document_id>" } }
{ "doc" : { "field1" : "new_value1", "field2" : "new_value2" } }
{ "delete" : { "_index" : "<index_name>", "_id" : "<document_id>" } }
{ "index" : { "_index" : "<index_name>", "_id" : "<optional_document_id>" } }
{ "field1" : "value1", "field2" : "value2" }

Standard format:

{ action: { metadata }}
{ request body        }
{ action: { metadata }}
{ request body        }

# Each action line and each request body must be a complete JSON object on a single line; no line breaks inside an object!
# Example:
POST _bulk
{"create":{"_index":"employee","_id":2}}
{"id":2,"remark":"张三"}
{"create":{"_index":"employee","_id":3}}
{"id":3,"remark":"李四"}

# Wrong example (the action object is split across two lines):
POST _bulk
{"create":
{"_index":"employee","_id":2}}
{"id":2,"remark":"张三"}
{"create":{"_index":"employee","_id":3}}
{"id":3,"remark":"李四"}

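index: creates a new document or replaces an existing one.
create: creates the document if it does not exist; returns an error if it already exists.
update: updates an existing document.
delete: deletes the specified document.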
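
The one-object-per-line rule is easy to violate when a bulk body is assembled by hand. The following is a minimal Python sketch, not an official client: the helper name `build_bulk_body` and the sample actions are hypothetical, and only the standard library is used. It serializes each action and each source document with `json.dumps`, which never emits a raw line break, and ends the body with the trailing newline that the bulk endpoint expects.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) triples into an NDJSON bulk body.

    `source` is None for actions that have no body line, such as delete.
    """
    lines = []
    for action, metadata, source in actions:
        # json.dumps keeps each object on a single line.
        lines.append(json.dumps({action: metadata}, ensure_ascii=False))
        if source is not None:
            lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body has to end with a newline character.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("create", {"_index": "employee", "_id": 2}, {"id": 2, "remark": "张三"}),
    ("update", {"_index": "employee", "_id": 3}, {"doc": {"remark": "李四"}}),
    ("delete", {"_index": "employee", "_id": 4}, None),
])
print(body, end="")
```

The resulting string can be sent as the request body of `POST _bulk` with any HTTP client.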

# Bulk create
POST _bulk
{"create":{"_index":"employee","_id":2}}
{"id":2,"remark":"张三"}
{"create":{"_index":"employee","_id":3}}
{"id":3,"remark":"李四"}
----------------------------------
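# Bulk update: partial update, only the supplied fields are changed
# doc_as_upsert (default false): if the document exists it is partially updated; if it does not exist, the contents of doc are inserted as a new document, so only the supplied fields are stored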

POST _bulk
{"update":{"_index":"employee","_id":2}}
{"doc":{"id":2,"name":"张三2"}}
{"update":{"_index":"employee","_id":3}}
{"doc":{"id":3,"name":"李四333"},"doc_as_upsert":true}
{"update":{"_index":"employee","_id":4}}
{"script":{"lang":"painless","source":"ctx._source.age += params.age","params":{"age":1}}}

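# Painless is the scripting language built into ES; it can be used for document updates, aggregation calculations, and more
# source: the script source code
# params: parameters passed to the script
# ctx: the context of the current document (fixed name)
# """ : delimits a string that may span several lines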
POST _bulk
{"update":{"_index":"employee","_id":4}}
{"script":{"lang":"painless","source":"""
int i = params.i;
int age = params.age;
while(i-->0){
  age += 1;
}
if(ctx._source.age < 15){
  ctx._source.age = age;
}
""","params":{"age":1,"i":10}}}

Deleting documents

# Delete one document by id
DELETE employee/_doc/1

# Bulk delete
POST employee/_bulk
{"delete":{"_id":1}}
{"delete":{"_id":2}}
# When the index is given in the path, _index can be omitted from each action
POST _bulk
{"delete":{"_id":1,"_index":"employee"}}

Querying documents

Format:

get /index_name/_search
{JSON request body}

Common query syntax

// All documents
{
    "query":{
        "match_all":{}
    }
}

// Field match; the result depends on how the field was analyzed
{
  "query": {
    "match": {
      "address": "上海"
    }
  }
}

// Exact match; normally used on non-text fields, because on a text field the results are unreliable
GET employee/_search
{
  "query": {
    "term": {
      "address": "上海市"
    }
  }
}

// Range query
// gt, gte, lt, lte
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 30
      }
    }
  }
}

Updating and deleting by query

POST /employee/_delete_by_query
{
  "query": {
    // query clause
  }
}

POST /employee/_update_by_query
{
  "query": {
    // query clause
  }
}

# Set age to 30 for the document whose name is 张三
POST /employee/_update_by_query
{
  "query": {
    "term": {
      "name": "张三"
    }
  },
  "script": {
    "source": "ctx._source.age = 30"
  }
}

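Thread-safe document updates under concurrency

Since Elasticsearch 7.x, optimistic concurrency control uses _seq_no and _primary_term instead of the older version parameter. _seq_no is the sequence number of the operation within its shard, and _primary_term is the term number of the primary shard that handled it. Together they identify one specific version of a document. To update a document safely under high concurrency, send the _seq_no and _primary_term you last read; the write is rejected with a 409 conflict if the document has changed in the meantime: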
POST /employee/_doc/1?if_seq_no=10&if_primary_term=10
{
  "name": "张三xxxx",
  "sex": 1,
  "age": 25
}
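
The compare-and-set behaviour behind if_seq_no and if_primary_term can be illustrated with a small in-memory model. This is only a sketch: `ToyShard` and `VersionConflict` are hypothetical names and nothing here talks to Elasticsearch. A conditional write succeeds only when the stored (_seq_no, _primary_term) pair still equals the pair the caller read.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 returned on a stale conditional write."""


class ToyShard:
    def __init__(self):
        self.primary_term = 1
        self.next_seq_no = 0
        self.docs = {}  # doc_id -> (source, seq_no, primary_term)

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            # Conditional write: the stored pair must equal the expected pair.
            if current is None or (current[1], current[2]) != (if_seq_no, if_primary_term):
                raise VersionConflict(doc_id)
        seq_no = self.next_seq_no
        self.next_seq_no += 1  # every write on the shard gets a new sequence number
        self.docs[doc_id] = (source, seq_no, self.primary_term)
        return seq_no, self.primary_term


shard = ToyShard()
seq_no, term = shard.index("1", {"name": "张三", "age": 25})

# Two clients read the same (seq_no, term); only the first write succeeds.
shard.index("1", {"name": "张三", "age": 26}, if_seq_no=seq_no, if_primary_term=term)
try:
    shard.index("1", {"name": "张三", "age": 27}, if_seq_no=seq_no, if_primary_term=term)
except VersionConflict:
    print("conflict: re-read the document and retry")
```

The losing client is expected to fetch the document again, take the new pair from the response, and retry its change.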

Advanced query syntax

# Data preparation

DELETE /employee
PUT /employee
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "sex": { "type": "integer" },
      "age": { "type": "integer" },
      "address": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "remark": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}


POST /employee/_bulk 
{"index":{"_index":"employee","_id":"1"}}
{"name":"张三","sex":1,"age":25,"address":"广州天河公园","remark":"java developer"}
{"index":{"_index":"employee","_id":"2"}}
{"name":"李四","sex":1,"age":28,"address":"广州荔湾大厦","remark":"java assistant"}
{"index":{"_index":"employee","_id":"3"}}
{"name":"王五","sex":0,"age":26,"address":"广州白云山公园","remark":"php developer"}
{"index":{"_index":"employee","_id":"4"}}
{"name":"赵六","sex":0,"age":22,"address":"长沙橘子洲","remark":"python assistant"}
{"index":{"_index":"employee","_id":"5"}}
{"name":"张龙","sex":0,"age":19,"address":"长沙麓谷企业广场","remark":"java architect assistant"}
{"index":{"_index":"employee","_id":"6"}}
{"name":"赵虎","sex":1,"age":32,"address":"长沙麓谷兴工国际产业园","remark":"java architect"}

Paging

// Match all; size: number of hits to return; from: number of hits to skip
{
  "query": {
    "match_all": {}
  },
  "size": 2,
  "from":1
}

// _source:
// true: return all fields (default);
// false: return no fields
// "obj*": return the fields whose names start with obj
{
  "query": {
    "match_all": {}
  },
  "size": 2,
  "_source": ["a","obj*"]
}

// Sorting with sort
{
  "query": {
    "match_all": {}
  },
  "size": 2,
  "sort": [
    {
      "name": {
        "order": "desc"
      }
    },
    {
      "age": {
        "order": "desc"
      }
    }
  ]
}

term

term: exact match, normally used on keyword fields, which are not analyzed. Against a text field it may return nothing

GET /employee/_search
{
  "query": {
    "term": {
      "field_name": {
        "value": "your_value"
      }
    }
  }
}
// Query without analysis by using the keyword sub-field
GET /employee/_search
{
  "query": {
    "term": {
      "address.keyword": {
        "value": "广州白云山公园"
      }
    }
  }
}

// On a multi-value field (array), term means "contains", not "equals"
POST /people/_bulk 
{"index":{"_id":1}}
{"name":"小明","interest":["跑步","篮球"]}
{"index":{"_id":2}}
{"name":"小红","interest":["跳舞","画画"]}
{"index":{"_id":3}} 
{"name":"小丽","interest":["跳舞","唱歌","跑步"]}

POST /people/_search 
{ 
   "query": { 
     "term": { 
       "interest.keyword": { 
         "value": "跑步"  
       } 
     } 
   } 
}

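In ES, a term query does not analyze its input. The input is treated as a single term and looked up exactly in the inverted index, and every document containing that term is given a relevance score.
Wrapping the query in constant_score turns it into a filter, which skips scoring and can use the cache, so it performs better:
- Converting the query to a filter skips the relevance calculation (TF-IDF/BM25) and its cost
- Filters can make effective use of the cache
GET /employee/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "address.keyword": "广州白云山公园"
        }
      }
    }
  }
}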
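
Why a term query on a text field can come back empty becomes clearer with a toy inverted index. The sketch below is plain Python with no Elasticsearch involved; whitespace splitting stands in for a real analyzer, and the helper names are hypothetical. A text field indexes the individual terms, a keyword field indexes the whole value as one term, and the term lookup never analyzes its input.

```python
from collections import defaultdict


def build_inverted_index(docs, analyze):
    """Map each term produced by `analyze` to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index, value):
    # No analysis: the input is looked up as one exact term.
    return sorted(index.get(value, set()))


docs = {1: "java developer", 2: "java assistant", 3: "php developer"}

text_index = build_inverted_index(docs, analyze=str.split)           # behaves like a text field
keyword_index = build_inverted_index(docs, analyze=lambda s: [s])    # behaves like a keyword field

print(term_query(text_index, "java"))                # [1, 2]
print(term_query(text_index, "java developer"))      # [] because no single term equals the full string
print(term_query(keyword_index, "java developer"))   # [1]
```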

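terms

terms: exact match against several values. The terms query matches unanalyzed values exactly, so it works well on structured data such as keywords, numbers, and dates.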

POST /employee/_search
{
  "query": {
    "terms": {
      "remark.keyword": [
        "java assistant",
        "java architect"
      ]
    }
  }
}

range

range: range query
gte: greater than or equal to
gt: greater than
lte: less than or equal to
lt: less than

GET /employee/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 20
      }
    }
  }
}

Date range queries

// Date range query
PUT /notes 
{ 
 "settings": { 
 "number_of_shards": 1, 
 "number_of_replicas": 0 
 }, 
 "mappings": { 
 "properties": { 
 "title": {"type": "text"}, 
 "content": {"type": "text"}, 
 "created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"} 
 } 
 } 
} 

POST /notes/_bulk 
{"index":{"_id":"1"}}
{"title":"Note 1","content":"This is the first note.","created_at":"2023-07-01 12:00:00"}
{"index":{"_id":"2"}}
{"title":"Note 2","content":"This is the second note.","created_at":"2023-07-05 15:30:00"}
{"index":{"_id":"3"}}
{"title":"Note 3","content":"This is the third note.","created_at":"2023-07-10 08:45:00"}
{"index":{"_id":"4"}}
{"title":"Note 4","content":"This is the fourth note.","created_at":"2023-07-15 20:15:00"}


GET /notes/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "2023-07-05 00:00:00",
        "lte": "2023-07-10 23:59:59"
      }
    }
  }
}

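// Date math expressions
now: the current time
now-1d: current time minus 1 day
now-1w: current time minus 1 week
now-1M: current time minus 1 month
now-1y: current time minus 1 year
now+1d: current time plus 1 day


// Return the documents from the last two years (date >= now-2y)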
POST /product/_bulk
{"index":{"_id":1}}
{"price":100,"date":"2023-01-01","productId":"XHDK-1293"}
{"index":{"_id":2}}
{"price":200,"date":"2022-01-01","productId":"KDKE-5421"}


GET /product/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "now-2y"
      }
    }
  }
}

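exists

exists: checks whether a field exists in a document, that is, whether the field holds a non-null value. It filters out documents that lack key information, so the results contain only documents with the required data. Typical uses include data-completeness checks, finding documents that have a particular attribute, and filtering on optional fields.

// Find the documents that have a remark field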
GET /employee/_search
{
  "query": {
    "exists": {
      "field": "remark"
    }
  }
}

ids

ids: query by a set of ids

GET /employee/_search
{
  "query": {
    "ids": {
      "values": [1,2]
    }
  }
}

_mget

_mget: fetches several documents in one request

GET /_mget
{
  "docs": [
    {"_index":"employee","_id":1},
    {"_index":"product","_id":1}
  ]
}

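prefix

prefix: prefix matching. The query runs against the terms produced by analysis; the search string itself is not analyzed, so the prefix you pass in is exactly the prefix that is looked up. By default a prefix query does not compute relevance; every matching document is returned with a score of 1.
How it works: the terms in the inverted index are scanned and each one is checked for the given prefix.
This query is typically used for autocomplete or search-as-you-type, where the user's input is the beginning of a longer text. It is most predictable on keyword fields, because on a text field the prefix is compared with each analyzed term rather than with the whole original value.

// No hits: no single term of the analyzed address field starts with 广州白云山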
GET /employee/_search
{
  "query": {
    "prefix": {
      "address": {
        "value": "广州白云山"
      }
    }
  }
}
// This form finds the document
GET /employee/_search
{
  "query": {
    "prefix": {
      "address.keyword": {
        "value": "广州白云山"
      }
    }
  }
}

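wildcard

wildcard: wildcard matching
Suited to fuzzy lookups on text fields when part of the content is known. Wildcard queries can be expensive to evaluate, so use them with care
* : matches zero or more characters
? : matches any single character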

GET /employee/_search
{
  "query": {
    "wildcard": {
      "address.keyword": {
        "value": "*州*公园"
      }
    }
  }
}

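regexp

regexp: regular-expression matching. Powerful, but avoid it unless it is really needed. In the example below, .* means that java may be followed by any number of arbitrary characters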

GET /employee/_search
{
  "query": {
    "regexp": {
      "remark": "java.*"
    }
  }
}

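fuzzy

fuzzy: fuzzy query based on edit distance
A fuzzy query still returns documents similar to the search term when the user's input contains spelling mistakes or inconsistent wording. It measures the similarity between the input and the indexed terms with an edit-distance algorithm, which makes search more tolerant of errors while keeping the results relevant.
Edit distance is the number of single-character edits needed to turn one word into another. For example, 中文集团 and 中威集团 have an edit distance of 1, because only one character has to be changed. If fuzziness is set to 2, 东东集团, which is at edit distance 2, is matched as well.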

GET /employee/_search
{
  "query": {
    "fuzzy": {
      "address": {
        "value": "白运山",
        "fuzziness": 1
      }
    }
  }
}
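
The edit distances quoted above can be checked with a short dynamic-programming routine. This is a sketch of plain Levenshtein distance (insert, delete, substitute); it is an illustration only, and the function name is hypothetical. The distance Elasticsearch computes for fuzziness may additionally treat a swap of two adjacent characters as a single edit, which this sketch does not model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]


print(edit_distance("中文集团", "中威集团"))  # 1
print(edit_distance("中文集团", "东东集团"))  # 2
print(edit_distance("白运山", "白云山"))      # 1
```

The last line corresponds to the query above: 白运山 is one substitution away from the indexed term 白云山, so fuzziness 1 is enough.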

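terms_set

Note: the parameter name must be spelled exactly minimum_should_match_script (or minimum_should_match_field); a misspelled name makes the request fail.
terms_set: solves document matching on multi-value fields
The terms_set query is mainly used to match documents on multi-value fields, which is useful for complex data with several attributes, categories, or tags. It fits cases that combine multi-value fields with a specific matching condition, such as tagging systems, search engines, e-commerce systems, document management systems, and skill matching.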

PUT /movies
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text"
      },
      "tags":{
        "type": "keyword"
      },
      "tags_count":{
        "type": "integer"
      }
    }
  }
}

POST /movies/_bulk
{"index":{"_id":1}}
{"title":"电影1", "tags":["喜剧","动作","科幻"], "tags_count":3}
{"index":{"_id":2}}
{"title":"电影2", "tags":["喜剧","爱情","家庭"], "tags_count":3}
{"index":{"_id":3}}
{"title":"电影3", "tags":["动作","科幻","家庭"], "tags_count":3}

// 1. Match a fixed number of terms
GET /movies/_search
{
  "query": {
    "terms_set":{
      "tags":{
        "terms":["动作","科幻"],
        "minimum_should_match_script":{
          "source":"2"
        }
      }
    }
  }
}
// 2. Match a dynamically computed number of terms

...

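match

match: analyzed (full-text) query, mainly used on unstructured text such as articles and comments. How it works:
- Analysis: the query text is first run through an analyzer, which splits it into terms such as words, phrases, or particular characters according to its language rules and configuration.
- Matching: ES compares the query terms with the terms in the inverted index and computes a relevance score for each document, which measures how well the document matches the query.
- Results: the best-matching documents are returned, normally sorted by relevance score so that the most relevant come first.

// After analysis the terms are combined with OR (the default)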
GET /employee/_search
{
  "query": {
    "match": {
      "address": "广州白云山公园"
    }
  }
}

// After analysis the terms are combined with AND
GET /employee/_search
{
  "query": {
    "match": {
      "address": {
        "query": "广州白云山公园",
        "operator": "and"
      }
    }
  }
}


// When operator is or, minimum_should_match sets the minimum number of analyzed terms that must match.
// Here at least the two terms 广州 and 公园 must match
GET /employee/_search
{
  "query": {
    "match": {
      "address": {
        "query": "广州公园",
        "minimum_should_match": 2
      }
    }
  }
}

multi_match

multi_match: multi-field query
Runs the same search against several fields

GET /employee/_search
{
  "query": {
    "multi_match": {
      "query": "长沙java",
      "fields": ["address","remark"]
    }
  }
}

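match_phrase

match_phrase: phrase query
It matches the phrase as a whole and also takes the order and position of its terms into account.
This query type is useful for exact-phrase search, especially when the user's input must match the wording in the document strictly.

// Returns hits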
GET /employee/_search
{
  "query": {
    "match_phrase": {
      "address": "广州白云山"
    }
  }
}
// Returns no hits
GET /employee/_search
{
  "query": {
    "match_phrase": {
      "address": "广州白云"
    }
  }
}

// Reason: 广州 and 白云 are not adjacent terms, because 白云山 sits between them. match_phrase requires adjacent terms, so 广州白云山 matches but 广州白云 does not.

POST _analyze
{
  "analyzer":"ik_max_word",
  "text":"广州白云山"
}
// Result:
{
  "tokens": [
    {
      "token": "广州",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "白云山",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "白云",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "云山",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

// The slop parameter tells match_phrase how far apart the terms may be while the document still counts as a match.

GET /employee/_search
{
  "query": {
    "match_phrase": {
      "address": {
        "query": "广州云山",
        "slop": 2
      }
      
    }
  }
}
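
The role of term positions and slop can be reproduced with the positions from the _analyze output above. The sketch below is a simplified model, not Lucene's actual phrase scorer: it only handles query terms in their original order, the helper name `phrase_match` is hypothetical, and it assumes the query text is split into the terms shown in each call.

```python
from itertools import product

# Term positions for "广州白云山", taken from the _analyze output above.
doc_positions = {"广州": [0], "白云山": [1], "白云": [2], "云山": [3]}


def phrase_match(query_terms, positions, slop=0):
    """Return True when the terms occur in order within `slop` extra position steps."""
    if not query_terms:
        return False
    position_lists = [positions.get(term, []) for term in query_terms]
    for combo in product(*position_lists):
        # Shift each position back by the term's offset in the query; an exact
        # phrase makes all shifted values equal, and slop allows them to spread.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


print(phrase_match(["广州", "白云山"], doc_positions))        # True: positions 0 and 1 are adjacent
print(phrase_match(["广州", "白云"], doc_positions))          # False: 白云 is at position 2
print(phrase_match(["广州", "云山"], doc_positions, slop=2))  # True: positions 0 and 3, two steps of slack
```

Under this model 广州云山 needs a slop of at least 2, which agrees with the value used in the query above.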

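query_string

query_string: queries with AND/OR/NOT expressions
It lets you build complex searches with the Lucene query syntax. It supports the logical operators AND, OR, and NOT as well as wildcards, fuzzy search, and regular expressions. A query_string query can search one field or several and can express complex query logic

GET /<index_name>/_search
{
  "query": {
    "query_string": {
      "query": "<your_query_string>",
      "default_field": "<field_name>"
    }
  }
}
// <your_query_string> is the query logic; it may contain the operators and wildcards mentioned above
// <field_name> is the default field to search; if it is omitted, all searchable fields are searched.

// No field specified
// AND must be written in upper case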
GET /employee/_search
{
  "query": {
    "query_string": {
      "query": "赵六 AND 橘子洲"
    }
  }
}

// Query a single field
// If the field is analyzed, the query text is analyzed too; if the field is not analyzed, the query text is matched as is
GET /employee/_search
{
  "query": {
    "query_string": {
      "default_field": "address",
      "query": "白云 OR 橘子洲"
    }
  }
}

// Query several fields
// Matches documents where name or address contains 张三, or where 广州 and 王五 are both found across name and address
GET /employee/_search
{
  "query": {
    "query_string": {
      "fields": ["name","address"],
      "query": "张三 OR (广州 AND 王五)"
    }
  }
}

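simple_query_string

simple_query_string: similar to query_string, but it ignores invalid syntax and supports only part of the query syntax. AND, OR, and NOT are not operators here and are treated as ordinary text
Supported operators:
+ -> AND
| -> OR
- -> NOT
simple_query_string is recommended over query_string in production mainly because its syntax is more forgiving: it tolerates malformed input instead of failing the whole query.

GET /<index_name>/_search
{
   "query": {
     "simple_query_string": {
       "query": "<query_string>",
       "fields": ["<field1>", "<field2>", ...],
       "default_operator": "OR" // or "AND"
     }
   }
}

// <query_string> is the query expression to search for
// <field1>, <field2>, ... is the list of fields to search in
// default_operator is the operator applied between terms when none is given; it can be "OR" or "AND"
// The default operator of simple_query_string is OR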

GET /employee/_search
{
  "query": {
    "simple_query_string": {
      "query": "广州公园",
      "fields": ["name","address"],
      "default_operator": "AND"
    }
  }
}

GET /employee/_search
{
  "query": {
    "simple_query_string": {
      "query": "广州+公园",
      "fields": ["name","address"]
    }
  }
}

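Exact matching and full-text search differ in two essential ways:
- Exact matching does not analyze the search text; the whole text is matched as a single term.
- Full-text search analyzes the text first. Each resulting term is then searched separately, and the results are combined with boolean logic (AND, OR, NOT) to find the most relevant documents.

bool

bool query: boolean query
Query context: Elasticsearch computes a relevance score for every document against the search condition. The score comes from a fairly complex formula and has a performance cost. Full-text queries that involve text analysis belong in query context.
Filter context: Elasticsearch only decides whether a document matches the condition, for example a term query that checks whether a value equals the search input, or a range query that checks whether a value lies in an interval. No relevance score is computed and results can be cached to speed up responses. Many term-level queries belong in filter context.

Query context

must : may contain several clauses; a document is returned only if it satisfies all of them. Relevance is scored on every query. Query context
should : may contain several clauses. When there is no must or filter clause, at least one of them must match for a document to be returned; otherwise the number of should clauses that must match is not constrained. The more clauses match, the higher the score. Also query context

Filter context

filter : may contain several filter clauses; a document is returned only if it satisfies all of them. No score is computed, and results are cached under certain conditions. Filter context
must_not : may contain several filter clauses; a document is returned only if it satisfies none of them. No score is computed, and results are cached under certain conditions. Filter context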
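
To make the four clause types concrete, the sketch below assembles a bool request body as a Python dictionary and prints it as JSON. The particular combination of conditions is a hypothetical example over the employee sample index, not a query taken from the notes; the printed JSON is what would be sent as the body of `GET /employee/_search`.

```python
import json

# Hypothetical combination of clauses over the employee sample data.
bool_query = {
    "query": {
        "bool": {
            # Query context: contributes to the relevance score.
            "must": [{"match": {"address": "广州"}}],
            "should": [{"match": {"remark": "java"}}],
            # Filter context: yes/no decision, no scoring, cacheable.
            "filter": [{"range": {"age": {"gte": 20, "lte": 30}}}],
            "must_not": [{"term": {"sex": 0}}],
        }
    }
}

print(json.dumps(bool_query, ensure_ascii=False, indent=2))
```

Placing the age range in filter rather than must keeps it out of the score, which is usually what is wanted for a pure yes/no condition.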

Geospatial queries

Make sure the index has a field of type geo_point

PUT /my_index
{
  "mappings": {
    "properties": {
      "location":{
        "type": "geo_point"
      }
    }
  }
}

Find all documents within 10 km of the given coordinate

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_all": {}}
      ],
      "filter": [
        {"geo_distance": {
          "distance": "10km",
          "distance_type": "arc", 
          "location": {
            "lat": 39.9,
            "lon": 116.4
          }
        }}
      ]
    }
  }
}

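bool: combines several queries
match_all: a query clause that matches every document
geo_distance: takes a distance and a coordinate
distance_type:
- arc: distance along the earth's surface (great-circle arc); the default and the more accurate option
- plane: straight-line distance on a flat plane; faster, but inaccurate over long distances and near the poles
- arc is the usual choice
distance: the maximum distance, with a unit such as m or km.
location: the reference point, given as latitude (lat) and longitude (lon)

Example:

// Create the index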
PUT /tourist_spots
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      },
      "location":{
        "type": "geo_point"
      }
    }
  }
}
// Insert documents
POST /tourist_spots/_bulk
{"index":{"_id":1}}
{"id":1,"name":"故宫博物院","location":{"lat":39.9259,"lon":116.3945},"city":"北京"}
{"index":{"_id":2}}
{"id":2,"name":"西湖","location":{"lat":30.2614,"lon":120.1479},"city":"杭州"}
{"index":{"_id":3}}
{"id":3,"name":"雷峰塔","location":{"lat": 30.2511,"lon": 120.1347},"city":"杭州"}
{"index":{"_id":4}}
{"id":4,"name":"苏堤春晓","location":{"lat":30.2584,"lon": 120.1383},"city":"杭州"}

// Find the sights near Beijing
GET /tourist_spots/_search
{
  "query": {
    "bool": {
      "must": {"match_all": {}},
      "filter": [
        {"geo_distance": {
          "distance": "10km",
          "distance_type": "arc", 
          "location": {
            "lat": 39.9259,
            "lon": 116.3945
          }
        }}
      ]
    }
  }
}
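
The arc distance used by this filter can be approximated with the haversine formula. The sketch below is an illustration only: it assumes a spherical earth with a mean radius of 6371 km, and the function name is hypothetical, so the values will differ slightly from what Elasticsearch computes internally. It shows why only 故宫博物院 falls inside the 10 km radius of the query above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean earth radius, spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


center = (39.9259, 116.3945)  # reference point of the query above
spots = {
    "故宫博物院": (39.9259, 116.3945),
    "西湖": (30.2614, 120.1479),
    "雷峰塔": (30.2511, 120.1347),
}

for name, (lat, lon) in spots.items():
    distance = haversine_km(center[0], center[1], lat, lon)
    print(name, round(distance, 1), "km", "inside" if distance <= 10 else "outside")
```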

Vector search

....

Aggregations

"aggregations" : {
  "<aggregation_name>" : {
    "<aggregation_type>" : {
      <aggregation_body>
    }
    [,"meta" : { [<meta_data_body>] } ]?
    [,"aggregations" : { [<sub_aggregation>]+ } ]?
  }
  [,"<aggregation_name_2>" : { ... } ]*
}

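Bucket aggregations

GET /employee/_search
{
  "size": 0,// hits are returned by default; here only the aggregation result is wanted
  "aggs": {
    "alls": {
      "terms": {
        "script": {
          "source": "return 'all_docs';"// force every document into a single bucket
        }
      },
      "aggs": {
        "total_count": {// count the documents that have an age value
          "value_count": {
            "field": "age"
          }
        },
        "filter_count":{// count the documents with age > 20
          "value_count": {
            "script": "if (doc['age'].value > 20) return 1"
          }
        },
        "percent_agg":{// total_count / filter_count * 100; swap the operands to get the share of documents with age > 20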
          "bucket_script": {
            "buckets_path": {
               "totalCount":"total_count",
                "filterCount":"filter_count"
            },
            "script":"params.totalCount/params.filterCount*100"
          }
        }
      }
    }
  }
}

// Result
"aggregations": {
    "alls": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "all_docs",
          "doc_count": 6,
          "total_count": {
            "value": 6
          },
          "filter_count": {
            "value": 5
          },
          "percent_agg": {
            "value": 120
          }
        }
      ]
    }
  }

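terms: groups documents into buckets based on the values of a string or numeric field.
filters: groups documents into buckets, each defined by its own filter.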
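
What a terms bucket aggregation with a metric sub-aggregation computes can be mimicked in a few lines of Python. The sketch below runs in memory over the six sample employees from the data preparation step; grouping by sex and averaging age is a hypothetical choice for illustration, and the helper name is not part of any API.

```python
from collections import defaultdict

# The six sample employees from the data preparation step (name, sex, age).
employees = [
    {"name": "张三", "sex": 1, "age": 25},
    {"name": "李四", "sex": 1, "age": 28},
    {"name": "王五", "sex": 0, "age": 26},
    {"name": "赵六", "sex": 0, "age": 22},
    {"name": "张龙", "sex": 0, "age": 19},
    {"name": "赵虎", "sex": 1, "age": 32},
]


def terms_buckets(docs, field):
    """Group documents by the value of `field`, one bucket per distinct value."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return buckets


for key, members in sorted(terms_buckets(employees, "sex").items()):
    avg_age = sum(doc["age"] for doc in members) / len(members)  # metric per bucket
    print({"key": key, "doc_count": len(members), "avg_age": round(avg_age, 2)})
```

Each printed dictionary corresponds to one entry of the buckets array in an aggregation response: the bucket key, its doc_count, and the value of the nested metric.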

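Metric aggregations

Pipeline aggregations

Miscellaneous

dynamic

true : unknown fields are added to the mapping automatically (default)
false : new fields are not indexed, but they are kept in _source
strict : new fields are rejected and writing the document fails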

PUT /user
{
  "mappings": {
    "dynamic":"strict",
    "properties": {
      "name":{
        "type": "text"
      },
      "address":{
        "type": "object",
        "dynamic":"true"
      }
    }
  }
}
// Inserting this document fails: the index mapping does not define an age field
PUT /user/_doc/1
{
  "name":"fox",
  "age":32,
  "address":{
    "province":"上海",
    "city":"浦东"
  }
}

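Data types

Simple data types

text: text; the content is analyzed. It cannot be used for sorting and is rarely used for aggregations
keyword: keyword; not analyzed. Use it for exact-value filtering, sorting, and aggregations. "index": true|false controls whether the field is indexed (searchable)
Numeric types:
- byte:
- short:
- integer:
- long:
- float:
- double:
- half_float: 16-bit half-precision IEEE 754 floating point
- scaled_float: floating point stored as a scaled integer. For example, a price field only needs precision to the cent: with a scaling factor of 100, 57.34 is stored as 5734
Prefer the scaled floating-point type where it fits.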
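
The scaled_float arithmetic is simple enough to check by hand. The sketch below only illustrates the idea of multiplying by the scaling factor and rounding to an integer; the function names are hypothetical and this is not Elasticsearch's storage code.

```python
def to_scaled(value: float, scaling_factor: int) -> int:
    """Value as it would be kept internally: multiplied by the factor and rounded."""
    return round(value * scaling_factor)


def from_scaled(stored: int, scaling_factor: int) -> float:
    """Recover the floating-point value from the stored integer."""
    return stored / scaling_factor


stored = to_scaled(57.34, 100)
print(stored)                      # 5734
print(from_scaled(stored, 100))    # 57.34
print(to_scaled(57.346, 100))      # 5735: digits beyond the factor's precision are rounded away
```

The last line shows the trade-off: anything finer than 1/scaling_factor is lost, which is acceptable for values such as prices that only need two decimal places.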

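date: dates

An integer number of seconds since the epoch (requires the epoch_second format).
A long number of milliseconds since the epoch.
A string containing a formatted date, such as "2018-10-01" or "2018/10/01 12:10:30" (the string must match the field's format).
The ISO 8601 form, which is also Solr's default: "2018-10-10T12:00:00Z".

Several formats can be accepted at once:

// Add the mapping
PUT blog
{
    "mappings": {
        "properties": {
            "date": {
                "type": "date",  // accepts any of the formats below
                "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}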

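boolean: boolean values
- True values: true, "true"
- False values: false, "false", "" (empty string)
- Releases before 6.0 also accepted "on", "yes", "1", "off", "no", "0", 0, and 0.0; current releases reject them
binary: binary values (sent as a Base64-encoded string)
- Not stored by default and not searchable
- Two options:
  - doc_values: whether the field is also written to disk in column form for later sorting, aggregation, or scripting. Accepts true or false (default);
  - store: whether the value is stored and retrievable separately from _source, that is, whether an extra copy is kept besides the one in _source. Accepts true or false (default).

range: range types

integer_range:
long_range:
float_range:
double_range:
date_range:
ip_range: a range of IP values; supports IPv4, IPv6, or a mix of both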

// Mapping
"expected_number": {  // expected head count
    "type": "integer_range"
}
"time_frame": {       // development timeline
    "type": "date_range", 
    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
// Insert
"expected_number" : {
    "gte" : 10,
    "lte" : 20
}
"time_frame" : { 
    "gte" : "2018-10-01 12:00:00", 
    "lte" : "2018-11-01"
}
// Query
"query": {
    "term": {
        "expected_number": {
            "value": 12
        }
    }
}
"query": {
    "range": {
        "time_frame": {
            "gte": "2018-08-01",
            "lte": "2018-12-01",
            "relation": "within" 
        }
    }
}

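Complex data types

array: arrays
- String array: ["one", "two"];
- Integer array: [1, 2];
- Array of arrays: [1, [2, 3]], equivalent to [1, 2, 3];
- Array of objects: [{"name": "Tom", "age": 20}, {"name": "Jerry", "age": 18}].
- With dynamic mapping, the type of the first value in the array determines the type of the whole field;
- Arrays of mixed types, such as [1, "abc"], are not supported;
- An array may contain null values; an empty array [] is treated as a missing field, that is, a field with no value.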

object: objects

// Insert
PUT employee/_doc/1
{
    "name": "ma_shoufeng",
    "address": {
        "region": "China",
        "location": {"province": "GuangDong", "city": "GuangZhou"}
    }
}
// How it is stored
{
    "name":                       "ma_shoufeng",
    "address.region":             "China",
    "address.location.province":  "GuangDong", 
    "address.location.city":      "GuangZhou"
}
// Mapping (8.x has no mapping types, so properties sits directly under mappings)
PUT employee
{
    "mappings": {
        "properties": {
            "name": { "type": "text", "index": true }, 
            "address": {
                "properties": {
                    "region": { "type": "keyword", "index": true },
                    "location": {
                        "properties": {
                            "province": { "type": "keyword", "index": true },
                            "city": { "type": "keyword", "index": true }
                        }
                    }
                }
            }
        }
    }
}

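nested: nested type

The nested type is a special case of the object type that lets the objects in an array be indexed and searched independently of each other.

// Insert
PUT game_of_thrones/_doc/1
{
    "group": "stark",
    "performer": [
        {"first": "John", "last": "Snow"},
        {"first": "Sansa", "last": "Stark"}
    ]
}

// How it is stored (without a nested mapping)
{
    "group":             "stark",
    "performer.first": [ "john", "sansa" ],
    "performer.last":  [ "snow", "stark" ]
}
// performer.first and performer.last are flattened into multi-value fields, so the association between John and Snow is lost. A query may then match John Stark.

To index an array of objects while keeping each object in the array independent, use the nested data type.
Internally, each nested object is split out and indexed as a hidden document.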
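
The loss of association caused by flattening, and how per-object matching avoids it, can be shown with plain Python data structures. This is an in-memory illustration only; the helper names are hypothetical and lowercasing stands in for the analyzer.

```python
performers = [
    {"first": "John", "last": "Snow"},
    {"first": "Sansa", "last": "Stark"},
]


def flatten(objects, prefix="performer"):
    """Object-array behaviour: every sub-field becomes one multi-value field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{prefix}.{key}", []).append(value.lower())
    return flat


def match_flattened(flat, first, last):
    # Each condition is checked against the whole multi-value field.
    return first in flat["performer.first"] and last in flat["performer.last"]


def match_nested(objects, first, last):
    # Nested behaviour: both conditions must hold inside the same object.
    return any(o["first"].lower() == first and o["last"].lower() == last for o in objects)


flat = flatten(performers)
print(flat)
print(match_flattened(flat, "john", "stark"))    # True: a false positive
print(match_nested(performers, "john", "stark")) # False
print(match_nested(performers, "john", "snow"))  # True
```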

// Create the index with a nested mapping
PUT game_of_thrones
{
    "mappings": {
        "properties": {
            "performer": {"type": "nested" }
        }
    }
}
// Insert
PUT game_of_thrones/_doc/1
{
    "group" : "stark",
    "performer" : [
        {"first": "John", "last": "Snow"},
        {"first": "Sansa", "last": "Stark"}
    ]
}
// Search
GET game_of_thrones/_search
{
    "query": {
        "nested": {
            "path": "performer",
            "query": {
                "bool": {
                    "must": [
                        { "match": { "performer.first": "John" }},
                        { "match": { "performer.last":  "Snow" }} 
                    ]
                }
            }, 
            "inner_hits": {
                "highlight": {
                    "fields": {"performer.first": {}}
                }
            }
        }
    }
}

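geo_point: geographic point type
A geo_point stores the latitude/longitude pair of a location and can be used to:
- find points within a given area;
- aggregate documents by location or by distance from a centre point;
- factor distance into a document's relevance score;
- sort documents by distance.
geo_shape: geographic shape type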