Developing a Text Search Engine with the Whoosh Library in Python


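Whoosh is a Python library of classes and functions for indexing text and then searching the index. Suppose you are building an application that has to go through many documents and find similarities or extract data according to predefined criteria, or suppose you want to count how many times a project title is mentioned in a set of research papers. The search engine built in this tutorial is a starting point for tasks like these.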

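Getting Started

To build our text search engine, we will use the whoosh library.

This library is not part of the Python standard library, so we will download and install it with the pip package manager.

To install the whoosh library, run the following command.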

pip install whoosh

Now we can import it into our script with the following lines.

from whoosh.fields import Schema, TEXT, ID
from whoosh import index

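Building a Text Search Engine in Python

First, let's create the folder in which the index files will be stored.

import os
# exist_ok=True avoids a FileExistsError when the script is run again
os.makedirs("dir", exist_ok=True)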

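Next, let's define a schema. The Schema specifies the fields of the documents in the index. We then create the index in the folder, add one document to it, and commit the change.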

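# stored=True keeps the field value so it can be shown in search results
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
# Create a new index in "dir" (this replaces any index already stored there)
ind = index.create_in("dir", schema)
writer = ind.writer()
writer.add_document(title="doc", content="Py doc hello big world", path="/a")
# The document becomes searchable only after commit()
writer.commit()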

Now that the document is indexed, we can search it.

from whoosh.qparser import QueryParser

with ind.searcher() as searcher:
    # Parse the query string against the "content" field
    query = QueryParser("content", ind.schema).parse("hello world")
    # terms=True records which query terms matched
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
        if results.has_matched_terms():
            print(results.matched_terms())

Output

It will produce the following output:

<Hit {'path': '/a', 'title': 'doc', 'content': 'Py doc hello big world'}> 1.7906976744186047
{('content', b'hello'), ('content', b'world')}

Example

Here is the complete code:

import os

from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Create the folder for the index files; exist_ok=True allows repeated runs
os.makedirs("dir", exist_ok=True)

# Define the document fields and create a new index in "dir"
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
ind = index.create_in("dir", schema)

# Add one document; it becomes searchable only after commit()
writer = ind.writer()
writer.add_document(title="doc", content="Py doc hello big world", path="/a")
writer.commit()

with ind.searcher() as searcher:
    # Parse the query string against the "content" field
    query = QueryParser("content", ind.schema).parse("hello world")
    # terms=True records which query terms matched
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
        if results.has_matched_terms():
            print(results.matched_terms())