Gensim - Creating a Dictionary

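In the previous chapter we discussed vectors and models, and you were briefly introduced to the dictionary. Here we look at the Dictionary object in more detail.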

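What is a Dictionary?

Before diving into the concept of a dictionary, let us review a few simple NLP concepts −

Token − A token is a single "word".
Document − A document is a sentence or a paragraph.
Corpus − A corpus is a collection of documents. Gensim commonly represents it as a bag of words (BoW).

For every document, a bag-of-words corpus stores the ID of each token together with its frequency count in that document.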

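Let us now turn to the concept of a dictionary in Gensim. To work with text documents, Gensim needs every word (that is, every token) to be converted to a unique ID. For this purpose it provides the Dictionary object, which maps each word to a unique integer ID. To build one, we convert the input text into lists of words and pass those lists to corpora.Dictionary().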

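Need for a Dictionary

The question now is what the Dictionary object is actually needed for and where it is used. In Gensim, the Dictionary object is used to create a bag-of-words (BoW) corpus, which in turn serves as the input to topic modeling and other models.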

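Forms of Text Input

We can provide input text to Gensim in three different forms −

As sentences stored in a native Python list object, where each sentence is a string (str in Python 3)
As one single text file (small or large)
As multiple text files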

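Creating a Dictionary Using Gensim

As discussed, a Gensim dictionary contains the mapping of all words (tokens) to their unique integer IDs. We can create a dictionary from a list of sentences or from one or more text files (each containing multiple lines of text). Let us start with a list of sentences.

From a List of Sentences

In the following example we create a dictionary from a list of sentences. When we have a list of sentences, we must first convert every sentence into a list of words, and a list comprehension is one of the most common ways to do this.

Implementation Example

First, import the required packages as follows −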

import gensim
from gensim import corpora
from pprint import pprint

Next, define the list of sentences (documents) from which the dictionary will be created −

doc = [
   "CNTK formerly known as Computational Network Toolkit",
   "is a free easy-to-use open-source commercial-grade toolkit",
   "that enable us to train deep learning algorithms to learn like the human brain."
]

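Next, we need to split the sentences into words. This is called tokenization. Here we simply split on whitespace, so punctuation stays attached to the words.

text_tokens = [sentence.split() for sentence in doc]

Now, with the help of the following script, we can create the dictionary −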

dict_LoS = corpora.Dictionary(text_tokens)

Now let us get some more information, such as the number of tokens in the dictionary −

print(dict_LoS)

Output

Dictionary(27 unique tokens: ['CNTK', 'Computational', 'Network', 'Toolkit', 'as']...)

We can also see the mapping from each word to its unique integer ID as follows −

print(dict_LoS.token2id)

Output

{
   'CNTK': 0, 'Computational': 1, 'Network': 2, 'Toolkit': 3, 'as': 4, 
   'formerly': 5, 'known': 6, 'a': 7, 'commercial-grade': 8, 'easy-to-use': 9,
   'free': 10, 'is': 11, 'open-source': 12, 'toolkit': 13, 'algorithms': 14,
   'brain.': 15, 'deep': 16, 'enable': 17, 'human': 18, 'learn': 19, 'learning': 20,
   'like': 21, 'that': 22, 'the': 23, 'to': 24, 'train': 25, 'us': 26
}

Complete Implementation Example

import gensim
from gensim import corpora
from pprint import pprint
doc = [
   "CNTK formerly known as Computational Network Toolkit",
   "is a free easy-to-use open-source commercial-grade toolkit",
   "that enable us to train deep learning algorithms to learn like the human brain."
]
text_tokens = [sentence.split() for sentence in doc]
dict_LoS = corpora.Dictionary(text_tokens)
print(dict_LoS.token2id)

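From a Single Text File

In the following example we create a dictionary from a single text file. In a similar way, we can also create a dictionary from multiple text files (that is, from a directory of files).

For this, we have saved the document used in the previous example in a text file named doc.txt. The file is read line by line, and each line is processed with simple_preprocess as it is read. This way the whole file never has to be loaded into memory at once.

Implementation Example

First, import the required packages as follows −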

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
import os

The following lines of code create a Gensim dictionary from the single text file named doc.txt −

dict_STF = corpora.Dictionary(
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
)

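Now let us get some more information, such as the number of tokens in the dictionary. Note that the outputs listed in this section repeat those of the sentence-list example. Because simple_preprocess lowercases every token, strips punctuation, splits hyphenated words, and drops one-character tokens, the dictionary actually built from doc.txt contains entries such as 'cntk' and 'brain' rather than 'CNTK' and 'brain.', and it has no entry for 'a' −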

print(dict_STF)

Output

Dictionary(27 unique tokens: ['CNTK', 'Computational', 'Network', 'Toolkit', 'as']...)

We can also see the mapping from each word to its unique integer ID as follows −

print(dict_STF.token2id)

Output

{
   'CNTK': 0, 'Computational': 1, 'Network': 2, 'Toolkit': 3, 'as': 4, 
   'formerly': 5, 'known': 6, 'a': 7, 'commercial-grade': 8, 'easy-to-use': 9, 
   'free': 10, 'is': 11, 'open-source': 12, 'toolkit': 13, 'algorithms': 14, 
   'brain.': 15, 'deep': 16, 'enable': 17, 'human': 18, 'learn': 19, 
   'learning': 20, 'like': 21, 'that': 22, 'the': 23, 'to': 24, 'train': 25, 'us': 26
}

Complete Implementation Example

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
import os
# Stream doc.txt line by line and tokenize each line as it is read.
dict_STF = corpora.Dictionary(
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
)
print(dict_STF.token2id)

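From Multiple Text Files

Now let us create a dictionary from multiple files, that is, from several text files saved in the same directory. For this example we have created three text files, first.txt, second.txt, and third.txt, each containing one of the three lines of the text file (doc.txt) used in the previous example. All three files are saved in a directory named ABC.

Implementation Example

To implement this, we need to define a class with a method that iterates over all three text files (first.txt, second.txt, and third.txt) in the directory (ABC) and yields the processed list of word tokens for each line.

Let us define a class named Read_files with an __iter__() method, as follows −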

class Read_files(object):
   def __init__(self, directoryname):
      self.directoryname = directoryname
   def __iter__(self):
      for fname in os.listdir(self.directoryname):
         for line in open(os.path.join(self.directoryname, fname), encoding='latin'):
            # Yield one list of processed tokens per line.
            yield simple_preprocess(line)

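Next, we need to provide the path of the directory, as follows −

path = "ABC"

# Provide the path to the directory as saved on your computer system.

The next steps are similar to those in the previous examples. The following line of code creates a Gensim dictionary from the directory containing the three text files −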

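dict_MUL = corpora.Dictionary(Read_files(path))
print(dict_MUL)

Output

Dictionary(27 unique tokens: ['CNTK', 'Computational', 'Network', 'Toolkit', 'as']...)

Now we can also see the mapping from each word to its unique integer ID, as follows. Note that the outputs listed in this section repeat those of the sentence-list example: the tokens produced by simple_preprocess are lowercased and stripped of punctuation, and the assigned IDs also depend on the order in which os.listdir() returns the files −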

print(dict_MUL.token2id)

Output

{
   'CNTK': 0, 'Computational': 1, 'Network': 2, 'Toolkit': 3, 'as': 4, 
   'formerly': 5, 'known': 6, 'a': 7, 'commercial-grade': 8, 'easy-to-use': 9, 
   'free': 10, 'is': 11, 'open-source': 12, 'toolkit': 13, 'algorithms': 14, 
   'brain.': 15, 'deep': 16, 'enable': 17, 'human': 18, 'learn': 19, 
   'learning': 20, 'like': 21, 'that': 22, 'the': 23, 'to': 24, 'train': 25, 'us': 26
}

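Saving and Loading a Gensim Dictionary

Gensim provides its own native save() method to save a dictionary to disk and a load() method to load it back from disk.

For example, we can save a dictionary, such as dict_LoS created above, with the help of the following script −

dict_LoS.save(filename)

# Provide the path where you want to save the dictionary.

Similarly, we can load a saved dictionary by using the load() method, which is called on the Dictionary class. The following script does this −

loaded_dict = corpora.Dictionary.load(filename)

# Provide the path where you saved the dictionary.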
