Gensim - Creating a TF-IDF Matrix

In this chapter, we will learn how to create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix with the help of Gensim.

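What is TF-IDF?

It is the Term Frequency-Inverse Document Frequency model, which is also a bag-of-words model. It differs from a regular bag-of-words corpus because it down-weights tokens, i.e. words, that appear frequently across documents. During initialization, the tf-idf model expects a training corpus with integer values (such as a bag-of-words corpus).

Then, at transformation time, it takes a vector representation and returns another vector representation. The output vector has the same dimensionality, but the values of features that were rare at training time are increased. In essence, it converts an integer-valued vector into a real-valued vector.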

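How is it computed?

The TF-IDF model computes tfidf in the following two simple steps:

Step 1: Multiply the local and global components

In this first step, the model multiplies a local component, such as TF (term frequency), by a global component, such as IDF (inverse document frequency).

Step 2: Normalize the result

Once the multiplication is done, the TF-IDF model normalizes the result to unit length.

As a result of these two steps, words that occur frequently across the documents are down-weighted.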
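The two steps can be sketched in plain Python, without Gensim. This is only an illustration under stated assumptions: the helper name `tfidf_vector` is hypothetical, the local component is taken to be the raw term count, and the global component is assumed to be log2(N / df), where N is the number of documents and df is the number of documents containing the term. The toy documents are the same three sentences used later in this chapter, already tokenized.

```python
import math

# Tokenized toy documents (assumed to match the sentences used in this chapter).
docs = [
    ["hello", "how", "are", "you"],
    ["how", "do", "you", "do"],
    ["hey", "what", "are", "you", "doing", "yes", "you", "what", "are", "you", "doing"],
]

def tfidf_vector(doc, docs):
    """Return {term: weight} for one document, normalized to unit length."""
    n_docs = len(docs)
    weights = {}
    for term in set(doc):
        tf = doc.count(term)                     # local component: raw count
        df = sum(1 for d in docs if term in d)   # number of documents containing the term
        idf = math.log2(n_docs / df)             # global component (assumed form)
        weights[term] = tf * idf                 # step 1: multiply local by global
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights                           # every term occurs in every document
    return {t: w / norm for t, w in weights.items()}  # step 2: scale to unit length

for doc in docs:
    vec = tfidf_vector(doc, docs)
    print({t: round(w, 2) for t, w in sorted(vec.items())})
```

In this sketch a term that occurs in every document, such as 'you', ends up with a weight of zero, because log2(3 / 3) is 0.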

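How to get TF-IDF weights?

Here, we will implement an example to see how TF-IDF weights are obtained. Basically, to get TF-IDF weights, we first need to train a corpus and then apply that corpus within the tfidf model.

Train the corpus

As mentioned above, to get TF-IDF we first need to train our corpus. First, we import the necessary packages as follows: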

import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess

Now provide the list containing the sentences. Our list has three sentences:

doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]

Next, tokenize the sentences as follows:

doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
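To make the tokenization step more concrete, here is a rough, standard-library approximation of what it produces for these sentences. It is not Gensim's implementation: the function name `rough_tokenize` is hypothetical, it only handles ASCII letters, and the length bounds of 2 and 15 characters are assumed to match the defaults of `simple_preprocess`.

```python
import re

def rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within the length bounds."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

print(rough_tokenize("Hello, how are you?"))
print(rough_tokenize("How do you do?"))
```

Punctuation and letter case are discarded, which is why 'How' and 'how' end up as the same token.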

Create a corpora.Dictionary() object as follows:

dictionary = corpora.Dictionary()

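Now pass these tokenized sentences to the dictionary.doc2bow() method. With allow_update=True, new tokens are added to the dictionary as they are encountered: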

BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]

Next, we print the words in each document along with their frequencies:

for doc in BoW_corpus:
   print([[dictionary[id], freq] for id, freq in doc])

Output

[['are', 1], ['hello', 1], ['how', 1], ['you', 1]]
[['how', 1], ['you', 1], ['do', 2]]
[['are', 2], ['you', 3], ['doing', 2], ['hey', 1], ['what', 2], ['yes', 1]]

This gives us our trained corpus (the bag-of-words corpus).
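The bag-of-words step can be pictured with a small standard-library sketch. The helper `build_bow_corpus` is hypothetical and is not how Gensim is implemented; it simply assigns ids to unseen tokens in alphabetical order within each document, which is an assumption chosen to mirror the ordering seen in the output above.

```python
from collections import Counter

# Tokenized sentences (assumed to match the output of the tokenization step).
tokenized_docs = [
    ["hello", "how", "are", "you"],
    ["how", "do", "you", "do"],
    ["hey", "what", "are", "you", "doing", "yes", "you", "what", "are", "you", "doing"],
]

def build_bow_corpus(tokenized_docs):
    """Assign integer ids to new tokens and count occurrences per document."""
    token2id = {}
    corpus = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # Unseen tokens receive the next free ids, in alphabetical order.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        # Each document becomes a sorted list of (token_id, count) pairs.
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus

token2id, corpus = build_bow_corpus(tokenized_docs)
id2token = {i: t for t, i in token2id.items()}
for doc in corpus:
    print([[id2token[token_id], count] for token_id, count in doc])
```

Each document is stored sparsely: only the tokens that actually occur in it appear in its list of pairs.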

Next, we need to apply this trained corpus to the tfidf model, models.TfidfModel().

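First, import the numpy package, along with Gensim's models module, which provides TfidfModel:

import numpy as np
from gensim import models

Now pass our trained corpus (BoW_corpus) to models.TfidfModel():

tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')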

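Next, we print each word in the tfidf-transformed corpus together with its weight:

for doc in tfidf[BoW_corpus]:
   # Round each weight to two decimals; float() keeps the printed value plain
   print([[dictionary[token_id], float(np.around(weight, decimals=2))] for token_id, weight in doc])

Output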

[['are', 0.33], ['hello', 0.89], ['how', 0.33]]
[['how', 0.18], ['do', 0.98]]
[['are', 0.23], ['doing', 0.62], ['hey', 0.31], ['what', 0.62], ['yes', 0.31]]

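Comparing the TF-IDF output with the bag-of-words output shown earlier, we can see how the raw word counts have been replaced by weights.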
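Since the goal is a TF-IDF matrix, it may help to see how these sparse vectors map onto a dense documents-by-terms table. The following is a minimal sketch in plain Python: the helper `to_dense` is hypothetical, the weights are copied from the output above, and the token ids are an assumption based on the order in which words appear in the bag-of-words output. Gensim itself offers `gensim.matutils.corpus2dense` for this kind of conversion, which returns terms as rows and documents as columns.

```python
# Sparse TF-IDF vectors as (token_id, weight) pairs, copied from the output above.
# The id assignment below is assumed from the bag-of-words output order.
id2token = ["are", "hello", "how", "you", "do", "doing", "hey", "what", "yes"]
sparse_docs = [
    [(0, 0.33), (1, 0.89), (2, 0.33)],
    [(2, 0.18), (4, 0.98)],
    [(0, 0.23), (5, 0.62), (6, 0.31), (7, 0.62), (8, 0.31)],
]

def to_dense(sparse_docs, num_terms):
    """Return a documents-by-terms matrix as a list of zero-filled rows."""
    matrix = []
    for doc in sparse_docs:
        row = [0.0] * num_terms
        for token_id, weight in doc:
            row[token_id] = weight
        matrix.append(row)
    return matrix

matrix = to_dense(sparse_docs, len(id2token))
print(id2token)
for row in matrix:
    print(row)
```

The column for 'you' stays at zero in every row, since that word does not appear in any of the TF-IDF vectors.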

Complete implementation example

import gensim
import pprint
import numpy as np
from gensim import corpora, models
from gensim.utils import simple_preprocess

doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]
# Tokenize each sentence into lowercase words
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
# Build the dictionary and the bag-of-words corpus in one pass
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
for doc in BoW_corpus:
   print([[dictionary[token_id], freq] for token_id, freq in doc])
# Fit the tf-idf model on the bag-of-words corpus and print the weights
tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')
for doc in tfidf[BoW_corpus]:
   print([[dictionary[token_id], float(np.around(weight, decimals=2))] for token_id, weight in doc])

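Difference in word weights

As discussed above, the more documents a word appears in, the smaller the weight it receives. Comparing the two outputs above shows the difference. The word 'are' appears in two of the three documents and is down-weighted. Similarly, the word 'you' appears in all three documents, so its inverse document frequency is zero and it is dropped from the TF-IDF vectors altogether.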
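The numbers in the output can be checked by hand. The working below assumes the inverse document frequency is computed as the base-2 logarithm of N over df, with N = 3 documents; this form reproduces the output shown in this chapter. Some Gensim releases may define the 't' letter of the smartirs string differently (for example as a zero-corrected IDF), in which case 'you' would keep a small non-zero weight and the values would shift, so it is worth checking against the documentation of the installed version.

```latex
\begin{aligned}
\mathrm{idf}(t) &= \log_2 \frac{N}{\mathrm{df}(t)}, \qquad N = 3 \\
\mathrm{idf}(\text{you}) &= \log_2 \frac{3}{3} = 0 \\
\mathrm{idf}(\text{are}) = \mathrm{idf}(\text{how}) &= \log_2 \frac{3}{2} \approx 0.585 \\
\mathrm{idf}(\text{hello}) &= \log_2 \frac{3}{1} \approx 1.585 \\
\lVert d_1 \rVert &= \sqrt{0.585^2 + 1.585^2 + 0.585^2} \approx 1.788 \\
w(\text{hello}, d_1) &= \frac{1 \cdot 1.585}{1.788} \approx 0.89, \qquad
w(\text{are}, d_1) = \frac{1 \cdot 0.585}{1.788} \approx 0.33
\end{aligned}
```

Here d_1 is the first document, "Hello, how are you?", in which every word occurs exactly once.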
