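MySQL - Natural Language Full-Text Search
Before diving into natural language full-text search, let us look at its background. The keywords used in a search do not always exactly match the results the user expects. Search engines are therefore designed around search relevance, narrowing the gap between the search query and the search results, and they display the results in order of their relevance to the search keyword.
Similarly, in a relational database such as MySQL, full-text search is a technique for retrieving result sets that may not exactly match the search keyword. Full-text search uses three search modes -
Natural language mode
Query expansion mode
Boolean mode
Natural Language Full-Text Search
A natural language full-text search performs a regular full-text search in natural language mode. In this mode, the results are displayed in order of their relevance to the keyword being searched for. This is the default mode of full-text search.
Because this is a full-text search, a FULLTEXT index must be applied to the text-based columns being searched (columns of the CHAR, VARCHAR, or TEXT data types). A FULLTEXT index is a special type of index used to search for keywords within text values, rather than comparing the keyword against the whole column value.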
Syntax
Following is the basic syntax to perform a natural language full-text search -
SELECT * FROM table_name WHERE MATCH(column_name(s)) AGAINST ('keyword_name' IN NATURAL LANGUAGE MODE);
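Example
The following example shows how to perform a natural language full-text search on a database table.
First, we create a table named ARTICLES that holds the title and description of each article. A FULLTEXT index is applied to the text columns ARTICLE_TITLE and DESCRIPTION, as shown below -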
CREATE TABLE ARTICLES ( ID INT AUTO_INCREMENT NOT NULL PRIMARY KEY, ARTICLE_TITLE VARCHAR(100), DESCRIPTION TEXT, FULLTEXT (ARTICLE_TITLE, DESCRIPTION) ) ENGINE = InnoDB;
Now, let us insert article details such as the title and description into this table using the following query -
INSERT INTO ARTICLES (ARTICLE_TITLE, DESCRIPTION) VALUES ('MySQL Tutorial', 'MySQL is a relational database system that uses SQL to structure data stored'), ('Java Tutorial', 'Java is an object-oriented and platform-independent programming language'), ('Hadoop Tutorial', 'Hadoop is framework that is used to process large sets of data'), ('Big Data Tutorial', 'Big Data refers to data that has wider variety of data sets in larger numbers'), ('JDBC Tutorial', 'JDBC is a Java based technology used for database connectivity');
The ARTICLES table now contains the following records -
ID | ARTICLE_TITLE | DESCRIPTION |
---|---|---|
1 | MySQL Tutorial | MySQL is a relational database system that uses SQL to structure data stored |
2 | Java Tutorial | Java is an object-oriented and platform-independent programming language |
3 | Hadoop Tutorial | Hadoop is framework that is used to process large sets of data |
4 | Big Data Tutorial | Big Data refers to data that has wider variety of data sets in larger numbers |
5 | JDBC Tutorial | JDBC is a Java based technology used for database connectivity |
Using natural language mode, search for article records related to data with the search string 'data set'.
SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE);
Output
Following is the output -
ID | ARTICLE_TITLE | DESCRIPTION |
---|---|---|
4 | Big Data Tutorial | Big Data refers to data that has wider variety of data sets in larger numbers |
1 | MySQL Tutorial | MySQL is a relational database system that uses SQL to structure data stored |
3 | Hadoop Tutorial | Hadoop is framework that is used to process large sets of data |
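As shown above, three of the articles in the table are returned for the search string 'data set', ordered by relevance. Note that natural language mode matches words, not meaning: none of the rows contains the exact word 'set' (the index treats 'sets' as a different word), so all three rows are matched through the word 'data'. The 'Big Data Tutorial' record ranks first because 'data' occurs in it most often, and the 'MySQL Tutorial' record is retrieved because its description contains 'data' once.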
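To make the word-level matching concrete, here is a minimal sketch in plain Python that counts exact word hits per row. It is only an illustration and not MySQL's parser or ranking; the names `ROWS`, `tokenize`, and `count_hits` are made up for this example.

```python
import re

# Sample rows copied from the ARTICLES table above: (ID, title, description).
ROWS = [
    (1, "MySQL Tutorial", "MySQL is a relational database system that uses SQL to structure data stored"),
    (2, "Java Tutorial", "Java is an object-oriented and platform-independent programming language"),
    (3, "Hadoop Tutorial", "Hadoop is framework that is used to process large sets of data"),
    (4, "Big Data Tutorial", "Big Data refers to data that has wider variety of data sets in larger numbers"),
    (5, "JDBC Tutorial", "JDBC is a Java based technology used for database connectivity"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_hits(search, rows):
    """Return {row_id: number of tokens equal to a search word}, skipping rows without hits."""
    wanted = set(tokenize(search))
    hits = {}
    for row_id, title, description in rows:
        n = sum(1 for tok in tokenize(title + " " + description) if tok in wanted)
        if n:
            hits[row_id] = n
    return hits


print(count_hits("data set", ROWS))  # {1: 1, 3: 1, 4: 4}
```

Rows 2 and 5 drop out because neither 'data' nor 'set' appears in them as a whole word ('database' does not count), which agrees with the three rows returned by the query.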
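Stopwords in a Search
Natural language full-text search ranks rows with a variation of the tf-idf algorithm, where "tf" stands for term frequency and "idf" stands for inverse document frequency. The relevance of a row therefore depends on how often a word occurs in that row and on how many rows contain the word at all. However, the search ignores some words. Words shorter than a minimum length are not indexed: by default, InnoDB ignores words with fewer than 3 characters and MyISAM ignores words with fewer than 4 characters. In addition, very common words known as stopwords (the, a, an, are, and so on) are ignored.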
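The idea behind tf-idf can be sketched with a textbook version of the formula. This is an assumption-laden illustration: MySQL's actual ranking differs in its details, and the helper `tf_idf` and the three-document `corpus` below are hypothetical.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: count of the term in the document times log10(N / docs containing it)."""
    tf = doc_tokens.count(term)
    containing = sum(1 for tokens in corpus if term in tokens)
    if tf == 0 or containing == 0:
        return 0.0
    return tf * math.log10(len(corpus) / containing)


# A tiny made-up corpus, already split into words.
corpus = [
    "big data refers to data sets".split(),
    "mysql stores data".split(),
    "java is a programming language".split(),
]

scores = [tf_idf("data", doc, corpus) for doc in corpus]
print(scores)
```

The first document scores highest because 'data' occurs in it twice, the second scores lower with a single occurrence, and the third scores zero because the word is absent.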
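Example
In the following example, we perform simple natural language full-text searches on the ARTICLES table created above. Let us see how stopwords affect a full-text search by searching for two search strings, 'Big Tutorial' and 'is Tutorial'.
Searching 'Big Tutorial':
The following query performs a full-text search in natural language mode for the search string 'Big Tutorial' -
SELECT ARTICLE_TITLE, DESCRIPTION FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('Big Tutorial' IN NATURAL LANGUAGE MODE);
Output:
The output is as follows -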
ARTICLE_TITLE | DESCRIPTION |
---|---|
Big Data Tutorial | Big Data refers to data that has wider variety of data sets in larger numbers |
MySQL Tutorial | MySQL is a relational database system that uses SQL to structure data stored |
Java Tutorial | Java is an object-oriented and platform-independent programming language |
Hadoop Tutorial | Hadoop is framework that is used to process large sets of data |
JDBC Tutorial | JDBC is a Java based technology used for database connectivity |
Searching 'is Tutorial':
The following query performs a full-text search in natural language mode for the search string 'is Tutorial' -
SELECT ARTICLE_TITLE, DESCRIPTION FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('is Tutorial' IN NATURAL LANGUAGE MODE);
Output:
The output is as follows -
ARTICLE_TITLE | DESCRIPTION |
---|---|
MySQL Tutorial | MySQL is a relational database system that uses SQL to structure data stored |
Java Tutorial | Java is an object-oriented and platform-independent programming language |
Hadoop Tutorial | Hadoop is framework that is used to process large sets of data |
Big Data Tutorial | Big Data refers to data that has wider variety of data sets in larger numbers |
JDBC Tutorial | JDBC is a Java based technology used for database connectivity |
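As the examples above show, the word 'Tutorial' occurs in every record of the table, so all records are retrieved in both cases. The relevance order, however, is determined by the other word in the search string.
In the first case, the 'Big Data Tutorial' record is retrieved first because it contains the word 'Big'. In the second case, 'is' is a stopword and is ignored, so the records in the result set appear in the same order as in the original table.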
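The effect on the two search strings can be sketched in plain Python. This is a simplified illustration, not MySQL's implementation: `STOPWORDS` holds only a few sample entries, and `effective_terms` is a hypothetical helper.

```python
# A few sample stopwords; a real stopword list is longer (assumption for this sketch).
STOPWORDS = {"a", "an", "are", "is", "the"}
MIN_TOKEN_SIZE = 3  # InnoDB's default minimum word length mentioned earlier


def effective_terms(search):
    """Return the search words that survive the length and stopword filters."""
    kept = []
    for word in search.lower().split():
        if len(word) < MIN_TOKEN_SIZE or word in STOPWORDS:
            continue  # too short or a stopword: ignored by the search
        kept.append(word)
    return kept


print(effective_terms("Big Tutorial"))  # ['big', 'tutorial']
print(effective_terms("is Tutorial"))   # ['tutorial']
```

Only 'tutorial' remains in the second search, and since that word occurs in every row it cannot distinguish one row from another.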
Natural Language Full-Text Search Using a Client Program
We can also perform a natural language full-text search on a MySQL database from a client program.
Syntax
To perform a natural language full-text search through a PHP program, we execute the following SELECT statement with the mysqli function query() -
$sql = "SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)";
$mysqli->query($sql);
To perform a natural language full-text search through a JavaScript program, we execute the following SELECT statement with the query() function of the mysql2 library -
sql = `SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)`;
con.query(sql);
To perform a natural language full-text search through a Java program, we execute the SELECT statement with the JDBC function executeQuery() -
String sql = "SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)";
statement.executeQuery(sql);
To perform a natural language full-text search through a Python program, we execute the SELECT statement with the execute() function of MySQL Connector/Python -
natural_language_search_query = "SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)"
cursorObj.execute(natural_language_search_query)
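The statements above embed the search string directly in the SQL text. When the search string comes from user input, it is usually safer to pass it as a parameter. The following is a minimal sketch of that idea for MySQL Connector/Python, which uses %s as its placeholder; `build_search_query` is a hypothetical helper, and the table and column names must come from trusted code because identifiers cannot be passed as parameters.

```python
def build_search_query(table, columns, search_text):
    """Return (sql, params) for a natural language full-text search.

    table and columns are inserted as-is and must be trusted identifiers;
    only search_text travels as a query parameter.
    """
    column_list = ", ".join(columns)
    sql = (
        f"SELECT * FROM {table} "
        f"WHERE MATCH({column_list}) AGAINST (%s IN NATURAL LANGUAGE MODE)"
    )
    return sql, (search_text,)


sql, params = build_search_query("ARTICLES", ["ARTICLE_TITLE", "DESCRIPTION"], "data set")
print(sql)
print(params)

# With an open MySQL Connector/Python cursor (not created in this sketch):
# cursorObj.execute(sql, params)
```

The sketch only builds the statement and its parameter tuple, so it runs without a database connection; the commented line shows where the pair would be handed to a cursor.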
Example
Following are the programs. The PHP program is shown first -
<?php
$dbhost = 'localhost';
$dbuser = 'root';
$dbpass = 'password';
$dbname = 'TUTORIALS';

// Connect to the MySQL server and select the database
$mysqli = new mysqli($dbhost, $dbuser, $dbpass, $dbname);
if ($mysqli->connect_errno) {
    printf("Connect failed: %s\n", $mysqli->connect_error);
    exit();
}
// printf("Connected successfully.\n");

// Run the natural language full-text search
$s = "SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)";
if ($r = $mysqli->query($s)) {
    printf("Table Records:\n");
    while ($row = $r->fetch_assoc()) {
        // fetch_assoc() keys follow the column names as defined, hence "ID"
        printf("ID: %d, Title: %s, Descriptions: %s\n", $row["ID"], $row["ARTICLE_TITLE"], $row["DESCRIPTION"]);
    }
} else {
    printf("Failed\n");
}
$mysqli->close();
Output
The output obtained is as shown below -
Table Records:
ID: 4, Title: Big Data Tutorial, Descriptions: Big Data refers to data that has wider variety of data sets in larger numbers
ID: 1, Title: MySQL Tutorial, Descriptions: MySQL is a relational database system that uses SQL to structure data stored
ID: 3, Title: Hadoop Tutorial, Descriptions: Hadoop is framework that is used to process large sets of data
The Node.js program -
var mysql = require("mysql2");

var con = mysql.createConnection({
  host: "localhost",
  user: "root",
  password: "password",
});

// Connect to MySQL
con.connect(function (err) {
  if (err) throw err;
  // console.log("Connected successfully...!");

  // Select the database
  var sql = "USE TUTORIALS";
  con.query(sql);

  // Run the natural language full-text search and print the matching rows
  sql = `SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)`;
  con.query(sql, function (err, result) {
    if (err) throw err;
    console.log(result);
    con.end(); // close the connection so the script can exit
  });
});
Output
The output obtained is as shown below -
[
  {
    ID: 4,
    ARTICLE_TITLE: 'Big Data Tutorial',
    DESCRIPTION: 'Big Data refers to data that has wider variety of data sets in larger numbers'
  },
  {
    ID: 1,
    ARTICLE_TITLE: 'MySQL Tutorial',
    DESCRIPTION: 'MySQL is a relational database system that uses SQL to structure data stored'
  },
  {
    ID: 3,
    ARTICLE_TITLE: 'Hadoop Tutorial',
    DESCRIPTION: 'Hadoop is framework that is used to process large sets of data'
  }
]
The Java program -
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NaturalLanguageSearch {
    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/TUTORIALS";
        String username = "root";
        String password = "password";
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            Connection connection = DriverManager.getConnection(url, username, password);
            Statement statement = connection.createStatement();
            System.out.println("Connected successfully...!");

            // Run the full-text search in natural language mode on both indexed columns
            ResultSet resultSet = statement.executeQuery(
                "SELECT * FROM ARTICLES WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)");

            // Print ID, title, and description of every matching row
            while (resultSet.next()) {
                System.out.println(resultSet.getString(1) + " " + resultSet.getString(2) + " " + resultSet.getString(3));
            }
            connection.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
Output
The output obtained is as shown below -
Connected successfully...!
4 Big Data Tutorial Big Data refers to data that has wider variety of data sets in larger numbers
1 MySQL Tutorial MySQL is a relational database system that uses SQL to structure data stored
3 Hadoop Tutorial Hadoop is framework that is used to process large sets of data
The Python program -
import mysql.connector

# Establish the connection
connection = mysql.connector.connect(
    host='localhost',
    user='root',
    password='password',
    database='TUTORIALS'
)

# Create a cursor object
cursorObj = connection.cursor()

natural_language_search_query = '''
SELECT * FROM ARTICLES
WHERE MATCH(ARTICLE_TITLE, DESCRIPTION) AGAINST ('data set' IN NATURAL LANGUAGE MODE)
'''
cursorObj.execute(natural_language_search_query)

# Fetch all results
results = cursorObj.fetchall()

# Display the results
print("NATURAL LANGUAGE search results:")
for row in results:
    print(row)

cursorObj.close()
connection.close()
Output
The output obtained is as shown below -
NATURAL LANGUAGE search results:
(4, 'Big Data Tutorial', 'Big Data refers to data that has wider variety of data sets in larger numbers')
(1, 'MySQL Tutorial', 'MySQL is a relational database system that uses SQL to structure data stored')
(3, 'Hadoop Tutorial', 'Hadoop is framework that is used to process large sets of data')