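Gensim - Transformations

This chapter introduces the various transformations available in Gensim. Let us begin with what it means to transform documents.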

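Transforming Documents

Transforming documents means representing them in a form on which mathematical operations can be performed. Besides inferring the latent structure of the corpus, transforming documents also serves the following goals: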

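It discovers relationships between words.
It brings out hidden structure in the corpus.
It describes the documents in a new, more semantic way.
It makes the representation of the documents more compact.
It improves efficiency, because the new representation consumes fewer resources.
It improves efficacy, because marginal data trends are ignored in the new representation.
It reduces noise in the new document representation.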
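As an illustration of the kind of mathematical operation a vector representation makes possible, the sketch below computes the cosine similarity of two documents stored as sparse (term id, weight) pairs. The function name cosine_similarity and the two sample vectors are assumptions made for this example; they are not part of Gensim.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only term ids present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(term_id, 0.0) for term_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two hypothetical documents in sparse (term_id, count) form.
doc_x = [(0, 1), (1, 1)]
doc_y = [(1, 1), (5, 1), (6, 1)]
similarity = cosine_similarity(doc_x, doc_y)
print(round(similarity, 4))
```

The two documents share only term 1, so the similarity is well below 1.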

Let us look at the steps needed to transform documents from one vector space representation to another.

Implementation Steps

To transform documents, we follow these steps in order:

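Step 1: Creating the Corpus

The first and most basic step is to create a corpus from the documents. We already created a corpus in earlier examples. Let us create another one with a few enhancements (removing common words and words that appear only once):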

import gensim
import pprint
from collections import defaultdict
from gensim import corpora

Now provide the documents from which the corpus will be created:

t_corpus = ["CNTK formerly known as Computational Network Toolkit", "is a free easy-to-use open-source commercial-grade toolkit", "that enable us to train deep learning algorithm to learn like the human brain.", "You can find its free tutorial on tutorialspoint.com", "Tutorialspoint.com also provide best technology tutorials on technologies like AI deep learning machine learning for free"]

Next, we tokenize the documents and remove the common words at the same time:

stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split on whitespace, and drop stop words
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]

The following script removes the words that appear only once:

frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1
# Filter only after every token in the corpus has been counted
processed_corpus = [
   [token for token in text if frequency[token] > 1]
   for text in processed_corpus
]
pprint.pprint(processed_corpus)

Output

[
   ['toolkit'],
   ['free', 'toolkit'],
   ['deep', 'learning', 'like'],
   ['free', 'on', 'tutorialspoint.com'],
   ['tutorialspoint.com', 'on', 'like', 'deep', 'learning', 'learning', 'free']
]
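The same count-then-filter idea can be sketched with collections.Counter, which makes it explicit that filtering must start only after the whole corpus has been counted. The small docs list below is a hypothetical sample, not the tutorial corpus.

```python
from collections import Counter

# Tokenized documents before frequency filtering (hypothetical sample).
docs = [
    ["free", "toolkit", "cntk"],
    ["free", "tutorial", "toolkit"],
    ["deep", "learning"],
]

# Count every token across the whole corpus in one pass.
frequency = Counter(token for doc in docs for token in doc)

# Keep only tokens seen more than once; counting is already complete here.
filtered = [[token for token in doc if frequency[token] > 1] for doc in docs]
print(filtered)
```

A document whose tokens are all unique in the corpus ends up as an empty list, as the third sample document does here.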

Now pass it to the corpora.Dictionary() object to get the unique tokens in our corpus:

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Output

Dictionary(7 unique tokens: ['toolkit', 'free', 'deep', 'learning', 'like']...)

Next, the following lines of code create the Bag of Words model for our corpus:

BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

Output

[
   [(0, 1)],
   [(0, 1), (1, 1)],
   [(2, 1), (3, 1), (4, 1)],
   [(1, 1), (5, 1), (6, 1)],
   [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)]
]
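To make the (token id, count) pairs above less opaque, here is a simplified plain-Python stand-in for the bag-of-words conversion. It is a sketch, not Gensim's implementation: the names token2id and to_bow are hypothetical, and the rule that unseen tokens receive ids in alphabetical order within each document is an assumption chosen so the ids line up with the output shown above.

```python
from collections import Counter

processed_corpus = [
    ["toolkit"],
    ["free", "toolkit"],
    ["deep", "learning", "like"],
    ["free", "on", "tutorialspoint.com"],
    ["tutorialspoint.com", "on", "like", "deep", "learning", "learning", "free"],
]

token2id = {}  # hypothetical stand-in for the dictionary's token -> id mapping

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, assigning ids to unseen tokens."""
    counts = Counter(tokens)
    # Unseen tokens get the next free id, in alphabetical order within the document.
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

bow_corpus = [to_bow(doc) for doc in processed_corpus]
for bow in bow_corpus:
    print(bow)
```

Each document becomes a sparse vector: only the tokens it contains are listed, and a repeated token such as "learning" in the last document gets a count of 2.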

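Step 2: Creating a Transformation

Transformations are standard Python objects. We initialize them with a training corpus. Here, we use the tf-idf model to create a transformation from our training corpus, BoW_corpus.

First, we need to import the models package from gensim.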

from gensim import models

Now, we initialize the model as follows:

tfidf = models.TfidfModel(BoW_corpus)

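Step 3: Transforming Vectors

In this last step, the vectors are converted from the old representation to the new one. Since the tfidf model was initialized in the previous step, it is from now on treated as a read-only object. Using this tfidf object, we convert a vector from the bag-of-words representation (old) to Tf-Idf real-valued weights (new).

doc_BoW = [(1,1),(3,1)]
print(tfidf[doc_BoW])

Output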

[(1, 0.4869354917707381), (3, 0.8734379353188121)]
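The two weights above can be reproduced by hand. The sketch below assumes the weighting is term count times log2(N / document frequency), followed by scaling the vector to unit length; under that assumption it yields the same numbers for this corpus. The helper name tfidf_transform is hypothetical and the code is an illustration, not Gensim's implementation.

```python
import math

bow_corpus = [
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(2, 1), (3, 1), (4, 1)],
    [(1, 1), (5, 1), (6, 1)],
    [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)],
]

# Document frequency: in how many documents each term id occurs.
doc_freq = {}
for doc in bow_corpus:
    for term_id, _ in doc:
        doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

num_docs = len(bow_corpus)
# Assumed weighting: idf = log2(N / df), so rarer terms get larger weights.
idf = {term_id: math.log2(num_docs / df) for term_id, df in doc_freq.items()}

def tfidf_transform(bow):
    """Weight each count by idf, then scale the vector to unit (L2) length."""
    weighted = [(term_id, count * idf[term_id]) for term_id, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(term_id, w / norm) for term_id, w in weighted]

doc_tfidf = tfidf_transform([(1, 1), (3, 1)])
print(doc_tfidf)
```

Term 1 ("free") occurs in three of the five documents and term 3 ("learning") in two, which is why term 3 ends up with the larger weight.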

We applied the transformation to a single vector with two terms, but we can also apply it to the whole corpus as follows:

corpus_tfidf = tfidf[BoW_corpus]
for doc in corpus_tfidf:
   print(doc)

Output

[(0, 1.0)]
[(0, 0.8734379353188121), (1, 0.4869354917707381)]
[(2, 0.5773502691896257), (3, 0.5773502691896257), (4, 0.5773502691896257)]
[(1, 0.3667400603126873), (5, 0.657838022678017), (6, 0.657838022678017)]
[(1, 0.19338287240886842), (2, 0.34687949360312714), (3, 0.6937589872062543), (4, 0.34687949360312714), (5, 0.34687949360312714), (6, 0.34687949360312714)]

Complete Implementation Example

import gensim
import pprint
from collections import defaultdict
from gensim import corpora
t_corpus = [
   "CNTK formerly known as Computational Network Toolkit", 
   "is a free easy-to-use open-source commercial-grade toolkit", 
   "that enable us to train deep learning algorithms to learn like the human brain.", 
   "You can find its free tutorial on tutorialspoint.com", 
   "Tutorialspoint.com also provide best technical tutorials on technologies like AI deep learning machine learning for free"
]
stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]
frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1
# Filter only after every token in the corpus has been counted
processed_corpus = [
   [token for token in text if frequency[token] > 1]
   for text in processed_corpus
]
pprint.pprint(processed_corpus)
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)
from gensim import models
tfidf = models.TfidfModel(BoW_corpus)
doc_BoW = [(1,1),(3,1)]
print(tfidf[doc_BoW])
corpus_tfidf = tfidf[BoW_corpus]
for doc in corpus_tfidf:
   print(doc)

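Various Transformations in Gensim

Gensim implements several popular transformations, that is, vector space model algorithms. Some of them are as follows:

Tf-Idf (Term Frequency-Inverse Document Frequency)

During initialization, the tf-idf model expects a training corpus with integer values (such as a Bag-of-Words model). At transformation time, it takes a vector representation and returns another vector representation.

The output vector has the same dimensionality, but the values of features that were rare in the training corpus are increased. In essence, it converts integer-valued vectors into real-valued vectors. The syntax of the Tf-Idf transformation is as follows: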

Model=models.TfidfModel(corpus, normalize=True)

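LSI (Latent Semantic Indexing)

The LSI model can transform documents from either an integer-valued vector model (such as Bag-of-Words) or a Tf-Idf weighted space into a latent space. The output vectors have lower dimensionality. The syntax of the LSI transformation is as follows: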

Model=models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

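LDA (Latent Dirichlet Allocation)

The LDA model is another algorithm that transforms documents from the Bag-of-Words space into a topic space. The output vectors have lower dimensionality. The syntax of the LDA transformation is as follows: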

Model=models.LdaModel(corpus, id2word=dictionary, num_topics=100)

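Random Projections (RP)

RP is a very efficient approach aimed at reducing the dimensionality of the vector space. It approximates the Tf-Idf distances between documents by introducing some randomness. The syntax of the RP transformation is as follows: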

Model=models.RpModel(tfidf_corpus, num_topics=500)
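The idea behind random projection can be sketched in a few lines of plain Python: multiply the sparse vector by a fixed random matrix with fewer rows than there are terms. This is an illustration only and not RpModel's actual implementation; the +1/-1 entries, the 1/sqrt(num_topics) scaling, and the names make_projection and project are assumptions made for this example.

```python
import math
import random

def make_projection(num_terms, num_topics, seed=0):
    """Build a num_topics x num_terms matrix of random +1/-1 entries."""
    rng = random.Random(seed)  # fixed seed so the projection is reproducible
    return [[rng.choice((-1.0, 1.0)) for _ in range(num_terms)]
            for _ in range(num_topics)]

def project(sparse_vec, projection):
    """Project a sparse (term_id, weight) vector into the lower-dimensional space."""
    scale = 1.0 / math.sqrt(len(projection))
    return [scale * sum(row[term_id] * weight for term_id, weight in sparse_vec)
            for row in projection]

# 7 terms reduced to 3 dimensions; the input is a Tf-Idf weighted vector.
projection = make_projection(num_terms=7, num_topics=3)
doc = [(1, 0.4869354917707381), (3, 0.8734379353188121)]
reduced = project(doc, projection)
print(reduced)
```

The same projection matrix must be reused for every document, otherwise distances between the projected vectors are meaningless.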

Hierarchical Dirichlet Process (HDP)

HDP is a non-parametric Bayesian method and a relatively new addition to Gensim, so it should be used with care.

Model=models.HdpModel(corpus, id2word=dictionary)
