diff --git a/docs/ai/text-search/search-function.md b/docs/ai/text-search/search-function.md
index 51fc680fcf544..304661025313c 100644
--- a/docs/ai/text-search/search-function.md
+++ b/docs/ai/text-search/search-function.md
@@ -35,6 +35,29 @@ Usage
 
 When `default_field` is provided, Doris expands bare terms or functions to that field. For example, `SEARCH('foo bar', 'tags', 'and')` behaves like `SEARCH('tags:ALL(foo bar)')`, while `SEARCH('foo bark', 'tags')` expands to `tags:ANY(foo bark)`. Explicit boolean operators inside the DSL always take precedence over the default operator.
 
+### Options Parameter (JSON format)
+
+The second parameter can also be a JSON string for advanced configuration:
+
+```sql
+SEARCH('', '')
+```
+
+**Supported options:**
+
+| Option | Type | Description |
+|--------|------|-------------|
+| `default_field` | string | Column name for terms without explicit field |
+| `default_operator` | string | `and` or `or` for multi-term expressions |
+| `mode` | string | `standard` (default) or `lucene` |
+| `minimum_should_match` | integer | Minimum SHOULD clauses to match (Lucene mode only) |
+
+**Example:**
+```sql
+SELECT * FROM docs WHERE search('apple banana',
+  '{"default_field":"title","default_operator":"and","mode":"lucene"}');
+```
+
 `SEARCH()` follows SQL three-valued logic. Rows where all referenced fields are NULL evaluate to UNKNOWN (filtered out in the `WHERE` clause) unless other predicates short-circuit the expression (`TRUE OR NULL = TRUE`, `FALSE OR NULL = NULL`, `NOT NULL = NULL`), matching the behavior of dedicated text search operators.
 
 ### Current Supported Queries
@@ -100,6 +123,38 @@ SELECT id, title FROM search_test_basic
 WHERE SEARCH('tags:ANY(python javascript) AND (category:Technology OR category:Programming)');
 ```
 
+#### Lucene Boolean Mode
+
+Lucene mode mimics Elasticsearch/Lucene query_string behavior where boolean operators work as left-to-right modifiers instead of traditional boolean algebra.
+
+**Key differences from standard mode:**
+- AND/OR/NOT are modifiers that affect adjacent terms
+- Operator precedence is left-to-right
+- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)
+
+**Enable Lucene mode:**
+```sql
+-- Basic Lucene mode
+SELECT * FROM docs WHERE search('apple AND banana',
+  '{"default_field":"title","mode":"lucene"}');
+
+-- With minimum_should_match
+SELECT * FROM docs WHERE search('apple AND banana OR cherry',
+  '{"default_field":"title","mode":"lucene","minimum_should_match":1}');
+```
+
+**Behavior comparison:**
+
+| Query | Standard Mode | Lucene Mode |
+|-------|--------------|-------------|
+| `a AND b` | a ∩ b | +a +b (both MUST) |
+| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
+| `NOT a` | ¬a | -a (MUST_NOT) |
+| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
+| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |
+
+**Note:** In Lucene mode, `a AND b OR c` parses left-to-right: the OR operator changes `b` from MUST to SHOULD. Use `minimum_should_match` to require SHOULD matches.
+
 #### Phrase query
 - Syntax: `column:"quoted phrase"`
 - Semantics: matches contiguous tokens in order using the column's analyzer; quotes must wrap the entire phrase.
@@ -253,6 +308,31 @@ WHERE SEARCH('properties.message:hello OR properties.category:beta')
 ORDER BY id;
 ```
 
+#### Escape Characters
+
+Use backslash (`\`) to escape special characters in DSL:
+
+| Escape | Description | Example |
+|--------|-------------|---------|
+| `\ ` | Literal space (joins terms) | `title:First\ Value` matches "First Value" |
+| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
+| `\:` | Literal colon | `title:key\:value` matches "key:value" |
+| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |
+
+**Example:**
+```sql
+-- Search for value containing space as single term
+SELECT * FROM docs WHERE search('title:First\\ Value');
+
+-- Search for value with parentheses
+SELECT * FROM docs WHERE search('title:hello\\(world\\)');
+
+-- Search for value with colon
+SELECT * FROM docs WHERE search('title:key\\:value');
+```
+
+**Note:** In SQL strings, backslashes need double escaping. Use `\\` in SQL to produce a single `\` in the DSL.
+
 ### Current Limitations
 
 - Range and list clauses (`field:[a TO b]`, `field:IN(...)`) still degrade to term lookups; rely on regular SQL predicates for numeric/date ranges or explicit `IN` filters.
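As an illustration of the two escaping layers described in the "Escape Characters" hunk above, here is a minimal Python sketch that builds a `search(...)` predicate from a raw value. The helper names (`escape_dsl_term`, `to_sql_literal`, `search_predicate`) are hypothetical and not part of Doris. The set of escaped characters is taken only from the table in the diff, and single quotes are rejected rather than handled.

```python
# Characters the escape table in the diff lists as needing a backslash
# at the DSL level: backslash, space, parentheses, colon.
DSL_SPECIAL = set('\\ ():')


def escape_dsl_term(value: str) -> str:
    """Prefix each special character with one backslash (DSL layer)."""
    return ''.join('\\' + ch if ch in DSL_SPECIAL else ch for ch in value)


def to_sql_literal(dsl: str) -> str:
    """Double every backslash so the SQL string yields the DSL text (SQL layer)."""
    if "'" in dsl:
        # Assumption: quote handling is out of scope for this sketch.
        raise ValueError("single quotes are not handled in this sketch")
    return "'" + dsl.replace('\\', '\\\\') + "'"


def search_predicate(field: str, value: str) -> str:
    """Combine both layers into a search('field:value') predicate string."""
    dsl = f"{field}:{escape_dsl_term(value)}"
    return f"search({to_sql_literal(dsl)})"


print(search_predicate("title", "First Value"))
```

Printing the result for `"First Value"` gives `search('title:First\\ Value')`, the same text as the first SQL example in the hunk. One backslash is added for the DSL, then doubled for the SQL string.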
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md
index 11a45ea7256a9..bb43a4ef25b6d 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md
@@ -35,6 +35,29 @@ SEARCH('', '', '')
 
 提供 `default_field` 后,Doris 会把裸词项或函数自动扩展到该字段。例如 `SEARCH('foo bar', 'tags', 'and')` 等价于 `SEARCH('tags:ALL(foo bar)')`,而 `SEARCH('foo bark', 'tags')` 会展开为 `tags:ANY(foo bark)`。DSL 中显式出现的布尔操作优先级最高,会覆盖默认运算符。
 
+### Options 参数(JSON 格式)
+
+第二个参数也可以是 JSON 字符串,用于高级配置:
+
+```sql
+SEARCH('', '')
+```
+
+**支持的选项:**
+
+| 选项 | 类型 | 说明 |
+|------|------|------|
+| `default_field` | string | 未指定字段的词项使用的默认列名 |
+| `default_operator` | string | 多词项表达式的默认运算符(`and` 或 `or`) |
+| `mode` | string | `standard`(默认)或 `lucene` |
+| `minimum_should_match` | integer | SHOULD 子句最小匹配数(仅 Lucene 模式) |
+
+**示例:**
+```sql
+SELECT * FROM docs WHERE search('apple banana',
+  '{"default_field":"title","default_operator":"and","mode":"lucene"}');
+```
+
 `SEARCH()` 遵循 SQL 三值逻辑。当所有参与匹配的列值均为 NULL 时结果为 UNKNOWN(在 `WHERE` 中被过滤),但若与其他子表达式组合,可按布尔短路原则返回 TRUE 或继续保留 NULL(例如 `TRUE OR NULL = TRUE`、`FALSE OR NULL = NULL`、`NOT NULL = NULL`),行为与文本检索算子保持一致。
 
 ### 当前支持语法
@@ -100,6 +123,38 @@ SELECT id, title FROM search_test_basic
 WHERE SEARCH('tags:ANY(python javascript) AND (category:Technology OR category:Programming)');
 ```
 
+#### Lucene 布尔模式
+
+Lucene 模式模拟 Elasticsearch/Lucene 的 query_string 行为,布尔操作符作为左到右的修饰符工作,而非传统的布尔代数。
+
+**与标准模式的主要区别:**
+- AND/OR/NOT 是影响相邻词项的修饰符
+- 操作符优先级从左到右
+- 内部使用 MUST/SHOULD/MUST_NOT(类似 Lucene 的 Occur 枚举)
+
+**启用 Lucene 模式:**
+```sql
+-- 基本 Lucene 模式
+SELECT * FROM docs WHERE search('apple AND banana',
+  '{"default_field":"title","mode":"lucene"}');
+
+-- 使用 minimum_should_match
+SELECT * FROM docs WHERE search('apple AND banana OR cherry',
+  '{"default_field":"title","mode":"lucene","minimum_should_match":1}');
+```
+
+**行为对比:**
+
+| 查询 | 标准模式 | Lucene 模式 |
+|------|----------|-------------|
+| `a AND b` | a ∩ b | +a +b(都是 MUST) |
+| `a OR b` | a ∪ b | a b(都是 SHOULD,min=1) |
+| `NOT a` | ¬a | -a(MUST_NOT) |
+| `a AND NOT b` | a ∩ ¬b | +a -b(MUST a,MUST_NOT b) |
+| `a AND b OR c` | (a ∩ b) ∪ c | +a b c(只有 a 是 MUST) |
+
+**注意:** 在 Lucene 模式中,`a AND b OR c` 从左到右解析:OR 操作符将 `b` 从 MUST 改为 SHOULD。使用 `minimum_should_match` 来要求 SHOULD 子句匹配。
+
 #### 词组查询
 - 语法:`column:"quoted phrase"`
 - 语义:根据列的分析器匹配连续且有序的词项,需使用双引号包裹完整短语。
@@ -253,6 +308,31 @@ WHERE SEARCH('properties.message:hello OR properties.category:beta')
 ORDER BY id;
 ```
 
+#### 转义字符
+
+使用反斜杠(`\`)转义 DSL 中的特殊字符:
+
+| 转义 | 说明 | 示例 |
+|------|------|------|
+| `\ ` | 字面空格(连接词项) | `title:First\ Value` 匹配 "First Value" |
+| `\(` `\)` | 字面括号 | `title:hello\(world\)` 匹配 "hello(world)" |
+| `\:` | 字面冒号 | `title:key\:value` 匹配 "key:value" |
+| `\\` | 字面反斜杠 | `title:path\\to\\file` 匹配 "path\to\file" |
+
+**示例:**
+```sql
+-- 搜索包含空格的值作为单个词项
+SELECT * FROM docs WHERE search('title:First\\ Value');
+
+-- 搜索包含括号的值
+SELECT * FROM docs WHERE search('title:hello\\(world\\)');
+
+-- 搜索包含冒号的值
+SELECT * FROM docs WHERE search('title:key\\:value');
+```
+
+**注意:** 在 SQL 字符串中,反斜杠需要双重转义。使用 `\\` 在 SQL 中产生 DSL 中的单个 `\`。
+
 ### 当前限制
 
 - 范围与列表子句(如 `field:[a TO b]`、`field:IN(...)`)仍会降级为普通词项匹配,建议使用常规 SQL 范围/`IN` 过滤。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/text-search/search-function.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/text-search/search-function.md
index 11a45ea7256a9..b4e312e6d04d6 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/text-search/search-function.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/text-search/search-function.md
@@ -22,11 +22,19 @@ SEARCH 是一个返回布尔值的谓词函数,可作为过滤条件出现在
 SEARCH('')
 SEARCH('', '')
 SEARCH('', '', '')
+SEARCH('', '')
 ```
 
 - ``:SEARCH DSL 查询表达式(字符串字面量)
 - ``(可选):当 DSL 中的词项未显式指定字段时自动套用的列名。
 - ``(可选):多词项表达式默认布尔运算符,仅接受 `and` 或 `or`(不区分大小写),默认为 `or`。
+- ``(可选):JSON 字符串,包含高级搜索选项(如多字段搜索)。支持的选项:
+  - `default_field`:同第二个参数
+  - `default_operator`:同第三个参数(`and` 或 `or`)
+  - `fields`:字段名数组,用于多字段搜索(与 `default_field` 互斥)
+  - `type`:多字段搜索模式,可选 `best_fields`(默认)或 `cross_fields`
+  - `mode`:解析模式,可选 `standard`(默认)或 `lucene`
+  - `minimum_should_match`:lucene 模式下的最小匹配数(默认:0)
 
 用法
 
@@ -121,6 +129,46 @@ SELECT id, title FROM search_test_basic
 WHERE SEARCH('tags:ALL(tutorial) AND category:Technology');
 ```
 
+#### 多字段搜索(Elasticsearch 风格)
+- 语法:使用带 `fields` 数组的 JSON 选项
+- 语义:在多个字段中搜索相同的词项,自动展开;支持两种模式:
+  - `best_fields`(默认):所有词项必须出现在同一字段中,各字段结果通过 OR 组合
+  - `cross_fields`:词项可以分布在不同字段中(视为一个组合字段)
+- 索引建议:为 `fields` 数组中的每个字段建立倒排索引
+
+**best_fields 模式**(默认):每个字段必须包含所有词项,然后各字段结果通过 OR 组合。
+
+```sql
+-- 在 title 和 content 字段中搜索 "machine learning"
+-- 展开为:(title:machine AND title:learning) OR (content:machine AND content:learning)
+SELECT id, title FROM articles
+WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and"}');
+
+-- 显式指定 type 参数
+SELECT id, title FROM articles
+WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"best_fields"}');
+```
+
+**cross_fields 模式**:词项可以匹配不同字段,将所有字段视为一个组合字段。
+
+```sql
+-- 跨 title 和 content 搜索 "machine learning"
+-- 展开为:(title:machine OR content:machine) AND (title:learning OR content:learning)
+SELECT id, title FROM articles
+WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"cross_fields"}');
+
+-- 适用于跨 firstname/lastname 字段搜索人名
+SELECT id, name FROM people
+WHERE SEARCH('John Smith', '{"fields":["firstname","lastname"],"default_operator":"and","type":"cross_fields"}');
+```
+
+**模式对比**:
+
+| 模式 | 行为 | 适用场景 |
+|------|------|----------|
+| `best_fields` | 所有词项必须在同一字段内匹配 | 文档搜索,相关性与字段相关 |
+| `cross_fields` | 词项可以跨任意字段匹配 | 实体搜索(如人名分布在多个字段) |
+
 #### 通配符查询
 - 语法:`column:prefix*`、`column:*mid*`、`column:?ingle`
 - 语义:使用 `*` 匹配任意长度字符串,`?` 匹配单个字符。
diff --git a/versioned_docs/version-4.x/ai/text-search/search-function.md b/versioned_docs/version-4.x/ai/text-search/search-function.md
index 51fc680fcf544..0ecdf3bdd1714 100644
--- a/versioned_docs/version-4.x/ai/text-search/search-function.md
+++ b/versioned_docs/version-4.x/ai/text-search/search-function.md
@@ -22,11 +22,19 @@ Syntax
 SEARCH('')
 SEARCH('', '')
 SEARCH('', '', '')
+SEARCH('', '')
 ```
 
 - `` — string literal containing the SEARCH DSL expression.
 - `` *(optional)* — column name automatically applied to terms that do not specify a field.
 - `` *(optional)* — default boolean operator for multi-term expressions; accepts `and` or `or` (case-insensitive). Defaults to `or`.
+- `` *(optional)* — JSON string containing search options for advanced features like multi-field search. Supported options:
+  - `default_field`: same as the second parameter
+  - `default_operator`: same as the third parameter (`and` or `or`)
+  - `fields`: array of field names for multi-field search (mutually exclusive with `default_field`)
+  - `type`: multi-field search mode, either `best_fields` (default) or `cross_fields`
+  - `mode`: parsing mode, either `standard` (default) or `lucene`
+  - `minimum_should_match`: integer for lucene mode (default: 0)
 
 Usage
 
@@ -121,6 +129,46 @@ SELECT id, title FROM search_test_basic
 WHERE SEARCH('tags:ALL(tutorial) AND category:Technology');
 ```
 
+#### Multi-field search (Elasticsearch-style)
+- Syntax: Use JSON options with `fields` array
+- Semantics: search the same terms across multiple fields with automatic expansion; supports two modes:
+  - `best_fields` (default): matches if all terms appear in the same field, fields are ORed together
+  - `cross_fields`: matches if terms appear across different fields (like a single combined field)
+- Indexing tip: add inverted indexes for each field in the `fields` array
+
+**best_fields mode** (default): Each field must contain all terms, then results from all fields are combined with OR.
+
+```sql
+-- Search "machine learning" in both title and content fields
+-- Expands to: (title:machine AND title:learning) OR (content:machine AND content:learning)
+SELECT id, title FROM articles
+WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and"}');
+
+-- With explicit type parameter
+SELECT id, title FROM articles
+WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"best_fields"}');
+```
+
+**cross_fields mode**: Terms can match across different fields, treating all fields as one combined field.
+
+```sql
+-- Search "machine learning" across title and content
+-- Expands to: (title:machine OR content:machine) AND (title:learning OR content:learning)
+SELECT id, title FROM articles
+WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"cross_fields"}');
+
+-- Useful for searching person names across firstname/lastname fields
+SELECT id, name FROM people
+WHERE SEARCH('John Smith', '{"fields":["firstname","lastname"],"default_operator":"and","type":"cross_fields"}');
+```
+
+**Comparison of modes**:
+
+| Mode | Behavior | Use Case |
+|------|----------|----------|
+| `best_fields` | All terms must match within the same field | Document search where relevance is field-specific |
+| `cross_fields` | Terms can match across any field | Entity search (e.g., person name split across fields) |
+
 #### Wildcard query
 - Syntax: `column:prefix*`, `column:*mid*`, `column:?ingle`
 - Semantics: performs pattern matching with `*` (multi-character) and `?` (single-character) wildcards.
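The two multi-field expansions documented in the hunk above can be sketched in a few lines of Python. This is only an illustration of the "Expands to" comments in the diff, not Doris's implementation. The function name `expand_multi_field` is hypothetical, and the sketch covers only the `"default_operator":"and"` case shown in the examples.

```python
from typing import List


def expand_multi_field(terms: List[str], fields: List[str],
                       mode: str = "best_fields") -> str:
    """Return the expanded DSL string for a multi-field query (AND operator only)."""
    if mode == "best_fields":
        # One group per field: every term must match inside that field.
        groups = [" AND ".join(f"{f}:{t}" for t in terms) for f in fields]
        return " OR ".join(f"({g})" for g in groups)
    if mode == "cross_fields":
        # One group per term: the term may match in any of the fields.
        groups = [" OR ".join(f"{f}:{t}" for f in fields) for t in terms]
        return " AND ".join(f"({g})" for g in groups)
    raise ValueError(f"unknown mode: {mode}")


terms = ["machine", "learning"]
fields = ["title", "content"]
print(expand_multi_field(terms, fields, "best_fields"))
print(expand_multi_field(terms, fields, "cross_fields"))
```

The two printed strings match the "Expands to" comments in the SQL examples: `best_fields` groups by field and joins the groups with OR, while `cross_fields` groups by term and joins the groups with AND.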