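Gensim - Creating LSI and HDP Topic Models

This chapter explains how to create Latent Semantic Indexing (LSI) and Hierarchical Dirichlet Process (HDP) topic models with Gensim.

Latent Semantic Indexing (LSI) was among the first topic modeling algorithms implemented in Gensim, alongside Latent Dirichlet Allocation (LDA). It is also known as Latent Semantic Analysis (LSA). It was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter.

In this section, we set up an LSI model. This is done the same way as setting up an LDA model: we import the LSI model (LsiModel) from gensim.models.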

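Role of LSI

LSI is an NLP technique, used especially in distributional semantics. It analyzes the relationship between a set of documents and the terms those documents contain. To do so, it first builds a matrix of word counts per document from a large body of text.

In this matrix, rows represent unique words and columns represent documents. Once the matrix is built, LSI applies a mathematical technique called singular value decomposition (SVD) to reduce the number of rows while preserving the similarity structure among the columns.

LSI rests on the distributional hypothesis, which assumes that words close in meaning occur in similar pieces of text.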
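To make the SVD step more concrete, here is a minimal sketch that uses NumPy directly instead of Gensim. The tiny term-document matrix, the term names and the choice of k = 2 are all made up for illustration, and Gensim's own LSI implementation does not compute a full SVD like this; the sketch only shows the idea of shrinking the term rows to a few latent rows while keeping similar documents close together.

```python
import numpy as np

# Hypothetical term-document count matrix: rows are terms, columns are documents.
terms = ["car", "engine", "disk", "display"]
X = np.array([
    [2, 1, 0, 0],   # car
    [1, 2, 0, 0],   # engine
    [0, 0, 3, 1],   # disk
    [0, 0, 1, 2],   # display
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions to keep (an arbitrary choice for this toy case)
s_k, Vt_k = s[:k], Vt[:k, :]

# Each column is one document expressed in k latent dimensions instead of 4 term rows.
doc_vectors = np.diag(s_k) @ Vt_k

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same = cosine(doc_vectors[:, 0], doc_vectors[:, 1])  # two car-related documents
sim_diff = cosine(doc_vectors[:, 0], doc_vectors[:, 2])  # car vs. computer document
print(doc_vectors.shape, round(sim_same, 3), round(sim_diff, 3))
```

In this toy case the two car-related documents stay highly similar after the reduction, while a car document and a computer document remain dissimilar.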

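Implementation with Gensim

Here, we use LSI (Latent Semantic Indexing) to extract the naturally discussed topics from a dataset.

Loading the Dataset

The dataset we use is the '20 Newsgroups' dataset, which contains thousands of posts from 20 different newsgroups. It is available through Sklearn's datasets module. We can download it with the following Python script −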

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Let's look at some sample posts with the following script −

newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing)
Subject: 
WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: 
University of Maryland, College Park
Lines: 15

 
I was wondering if anyone out there could enlighten me on this car 
I saw
the other day. It was a 2-door sports car,
looked to be from the late 60s/
early 70s. It was called a Bricklin. 
The doors were really small. In addition,
the front bumper was separate from 
the rest of the body. This is 
all I know. If anyone can tellme a model name, 
engine specs, years
of production, where this car is made, history, or 
whatever info you
have on this funky looking car, 
please e-mail.

Thanks,
- IL
 ---- brought to you by your neighborhood 
Lerxst ----


",

"From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: 
SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: 
SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: 
University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A 
fair number of brave souls who upgraded their SI clock oscillator have
shared their 
experiences for this poll. Please send a brief message detailing
your experiences with 
the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat 
sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies 
are especially requested.

I will be summarizing in the next two days, so please add 
to the network
knowledge base if you have done the clock upgrade and haven't answered 
this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>
",

'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: 
PB questions...
Organization: Purdue University Engineering Computer 
Network
Distribution: usa
Lines: 36

well folks, my mac plus finally gave up the 
ghost this weekend after
starting life as a 512k way back in 1985. sooo, i\'m in the 
market for a
new machine a bit sooner than i intended to be...

i\'m looking into 
picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) 
somebody can answer:

* does anybody know any dirt on when the next round of 
powerbook
introductions are expected? i\'d heard the 185c was supposed to make 
an
appearence "this summer" but haven\'t heard anymore on it - and since i
don\'t 
have access to macleak, i was wondering if anybody out there had
more info...

* has 
anybody heard rumors about price drops to the powerbook line like the
ones the duo\'s 
just went through recently?

* what\'s the impression of the display on the 180? i 
could probably swing
a 180 if i got the 80Mb disk rather than the 120, but i don\'t 
really have
a feel for how much "better" the display is (yea, it looks great in 
the
store, but is that all "wow" or is it really that good?). could i solicit
some 
opinions of people who use the 160 and 180 day-to-day on if its worth
taking the disk 
size and money hit to get the active display? (i realize
this is a real subjective 
question, but i\'ve only played around with the
machines in a computer store breifly 
and figured the opinions of somebody
who actually uses the machine daily might prove 
helpful).

* how well does hellcats perform? ;)

thanks a bunch in advance for any 
info - if you could email, i\'ll post a
summary (news reading time is at a premium 
with finals just around the
corner... :( )
--
Tom Willis \ twillis@ecn.purdue.edu 
\ Purdue Electrical 
Engineering
---------------------------------------------------------------------------
"Convictions are more dangerous enemies of truth than lies." - F. W.
Nietzsche
',

'From: jgreen@amber (Joe Green)
Subject: Re: Weitek P9000 ?
Organization: Harris 
Computer Systems Division
Lines: 14
Distribution: world
NNTP-Posting-Host: 
amber.ssd.csd.harris.com
X-Newsreader: TIN [version 1.1 PL9]

Robert J.C. Kyanko 
(rob@rjck.UUCP) wrote:
 > abraxis@iastate.edu writes in article <
abraxis.734340159@class1.iastate.edu>:
> > Anyone know about the Weitek P9000 
graphics chip?
 > As far as the low-level stuff goes, it looks pretty nice. It\'s 
got this
 > quadrilateral fill command that requires just the four
points.

Do you have Weitek\'s address/phone number? I\'d like to get some 
information
about this chip.

--
Joe Green				Harris 
Corporation
jgreen@csd.harris.com			Computer Systems Division
"The only thing that 
really scares me is a person with no sense of humor."
						-- Jonathan 
Winters
']

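Prerequisites

We need the stopwords from NLTK and an English model from spaCy. Both can be set up as follows −

import nltk
import spacy
nltk.download('stopwords')
# Assumes the model was installed first: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])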

Importing Necessary Packages

To build the LSI model, we need to import the following packages −

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt

Preparing Stopwords

Now we import the stopwords and extend the list with a few corpus-specific words −

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

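Cleaning Up the Text

Next, with the help of Gensim's simple_preprocess(), we tokenize each document into a list of words, removing punctuation and unnecessary characters along the way. To do this, we create a function named sent_to_words(). Here data is the list of raw posts, cleaned with regular expressions as in the complete example further down −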

def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))

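Building Bigram and Trigram Models

Bigrams are two words that frequently occur together in a document, and trigrams are three words that frequently occur together. Gensim's Phrases model can detect both −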

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
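To give a feel for what min_count and threshold control, here is a rough, dependency-free sketch of the kind of collocation score that Gensim documents for its default phrase scorer: the pair count minus min_count, divided by the two word counts, scaled by the vocabulary size. The toy sentences and the helper name phrase_score are hypothetical, and the vocabulary size is simplified here to the number of distinct words, so the numbers will not match what Phrases computes internally.

```python
from collections import Counter

# Hypothetical tokenized sentences.
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "never", "sleeps"],
    ["a", "new", "car", "is", "nice"],
]

# Count single words and adjacent word pairs.
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))

def phrase_score(a, b, min_count, vocab_size):
    """Collocation-style score: higher means the pair behaves more like one phrase."""
    return (bigrams[(a, b)] - min_count) / unigrams[a] / unigrams[b] * vocab_size

vocab_size = len(unigrams)  # simplification: distinct words only
score_ny = phrase_score("new", "york", min_count=1, vocab_size=vocab_size)
score_nc = phrase_score("new", "car", min_count=1, vocab_size=vocab_size)
print(round(score_ny, 3), round(score_nc, 3))
```

A pair is joined into a single token only when its score exceeds the threshold, so a higher threshold yields fewer phrases.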

Filtering Out Stopwords

Next, we filter out the stopwords. We also create functions for making bigrams and trigrams and for lemmatization −

def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out
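The stopword step can be tried in isolation with a minimal sketch like the one below. It uses a tiny, hypothetical stop list and already-tokenized documents, so it needs neither NLTK nor Gensim; the function name drop_stop_words is made up for this illustration.

```python
# Hypothetical, tiny stop list; the tutorial itself uses NLTK's English list.
toy_stop_words = {"the", "is", "a", "of", "from", "subject"}

def drop_stop_words(tokenized_docs, stop_words):
    """Keep only the tokens that are not in the stop list."""
    return [[tok for tok in doc if tok not in stop_words] for doc in tokenized_docs]

toy_docs = [
    ["subject", "the", "engine", "of", "a", "car"],
    ["the", "display", "is", "great"],
]
filtered = drop_stop_words(toy_docs, toy_stop_words)
print(filtered)  # [['engine', 'car'], ['display', 'great']]
```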

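Building the Dictionary and Corpus for the Topic Model

Now we build the dictionary and the corpus, as in the earlier examples. Here data_lemmatized is the output of lemmatization(), as in the complete example further down −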

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]

Building the LSI Topic Model

We now have everything required to train the LSI model, so it is time to build it. For our example, this takes the following lines of code −

lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

Implementation Example

Let's see the complete implementation example for building the LSI topic model −

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
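data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data] # remove e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data] # collapse whitespace and newlines
data = [re.sub(r"\'", "", sent) for sent in data] # remove single quotes
def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
print(data_words[:4]) # prints the first four tokenized documents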
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(
   data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
)
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
# print the words with their frequencies for the first four documents
print([[(id2word[token_id], freq) for token_id, freq in cp] for cp in corpus[:4]])
lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

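We can now use the LSI model created above to get the topics.

Viewing Topics in the LSI Model

The LSI model (lsi_model) we created above can be used to view the topics in the documents. This can be done with the following script −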

pprint(lsi_model.print_topics())
doc_lsi = lsi_model[corpus]

Output

[
   (0,
   '1.000*"ax" + 0.001*"_" + 0.000*"tm" + 0.000*"part" + 0.000*"pne" + '
   '0.000*"biz" + 0.000*"mbs" + 0.000*"end" + 0.000*"fax" + 0.000*"mb"'),
   (1,
   '0.239*"say" + 0.222*"file" + 0.189*"go" + 0.171*"know" + 0.169*"people" + '
   '0.147*"make" + 0.140*"use" + 0.135*"also" + 0.133*"see" + 0.123*"think"')
]
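Applying the model to a bag-of-words document, as doc_lsi does for the whole corpus, yields a list of (topic_id, weight) pairs per document. As a small sketch of how such a list might be read, the hypothetical helper below picks the topic with the largest absolute weight; absolute values are used because LSI weights can be negative. The example pairs are invented, not real model output.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, weight) pair with the largest absolute weight."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: abs(pair[1]))

# Invented (topic_id, weight) pairs for one document.
example = [(0, 0.12), (1, -0.85), (2, 0.30)]
print(dominant_topic(example))  # (1, -0.85)
```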

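Hierarchical Dirichlet Process (HDP)

Topic models such as LDA and LSI help summarize and organize large archives of text that cannot be analyzed by hand. Apart from LDA and LSI, another powerful topic model in Gensim is HDP (Hierarchical Dirichlet Process). It is a mixed-membership model for the unsupervised analysis of grouped data. Unlike LDA (its finite counterpart), HDP infers the number of topics from the data.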

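Implementation with Gensim

To implement HDP in Gensim, we need a training corpus and a dictionary, built the same way as in the LDA and LSI examples above. The HDP topic model is available as gensim.models.HdpModel. Here too we use the 20 Newsgroups data, and the steps are the same.

With the corpus and dictionary created in the LSI and LDA examples above, we can build the HdpModel as follows −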

Hdp_model = gensim.models.hdpmodel.HdpModel(corpus=corpus, id2word=id2word)

Viewing Topics in the HDP Model

The HDP model (Hdp_model) can be used to view the topics in the documents. This can be done with the following script −

pprint(Hdp_model.print_topics())

Output

[
   (0,
   '0.009*line + 0.009*write + 0.006*say + 0.006*article + 0.006*know + '
   '0.006*people + 0.005*make + 0.005*go + 0.005*think + 0.005*be'),
   (1,
   '0.016*line + 0.011*write + 0.008*article + 0.008*organization + 0.006*know '
   '+ 0.006*host + 0.006*be + 0.005*get + 0.005*use + 0.005*say'),
   (2,
   '0.810*ax + 0.001*_ + 0.000*tm + 0.000*part + 0.000*mb + 0.000*pne + '
   '0.000*biz + 0.000*end + 0.000*wwiz + 0.000*fax'),
   (3,
   '0.015*line + 0.008*write + 0.007*organization + 0.006*host + 0.006*know + '
   '0.006*article + 0.005*use + 0.005*thank + 0.004*get + 0.004*problem'),
   (4,
   '0.004*line + 0.003*write + 0.002*believe + 0.002*think + 0.002*article + '
   '0.002*belief + 0.002*say + 0.002*see + 0.002*look + 0.002*organization'),
   (5,
   '0.005*line + 0.003*write + 0.003*organization + 0.002*article + 0.002*time '
   '+ 0.002*host + 0.002*get + 0.002*look + 0.002*say + 0.001*number'),
   (6,
   '0.003*line + 0.002*say + 0.002*write + 0.002*go + 0.002*gun + 0.002*get + '
   '0.002*organization + 0.002*bill + 0.002*article + 0.002*state'),
   (7,
   '0.003*line + 0.002*write + 0.002*article + 0.002*organization + 0.001*none '
   '+ 0.001*know + 0.001*say + 0.001*people + 0.001*host + 0.001*new'),
   (8,
   '0.004*line + 0.002*write + 0.002*get + 0.002*team + 0.002*organization + '
   '0.002*go + 0.002*think + 0.002*know + 0.002*article + 0.001*well'),
   (9,
   '0.004*line + 0.002*organization + 0.002*write + 0.001*be + 0.001*host + '
   '0.001*article + 0.001*thank + 0.001*use + 0.001*work + 0.001*run'),
   (10,
   '0.002*line + 0.001*game + 0.001*write + 0.001*get + 0.001*know + '
   '0.001*thing + 0.001*think + 0.001*article + 0.001*help + 0.001*turn'),
   (11,
   '0.002*line + 0.001*write + 0.001*game + 0.001*organization + 0.001*say + '
   '0.001*host + 0.001*give + 0.001*run + 0.001*article + 0.001*get'),
   (12,
   '0.002*line + 0.001*write + 0.001*know + 0.001*time + 0.001*article + '
   '0.001*get + 0.001*think + 0.001*organization + 0.001*scope + 0.001*make'),
   (13,
   '0.002*line + 0.002*write + 0.001*article + 0.001*organization + 0.001*make '
   '+ 0.001*know + 0.001*see + 0.001*get + 0.001*host + 0.001*really'),
   (14,
   '0.002*write + 0.002*line + 0.002*know + 0.001*think + 0.001*say + '
   '0.001*article + 0.001*argument + 0.001*even + 0.001*card + 0.001*be'),
   (15,
   '0.001*article + 0.001*line + 0.001*make + 0.001*write + 0.001*know + '
   '0.001*say + 0.001*exist + 0.001*get + 0.001*purpose + 0.001*organization'),
   (16,
   '0.002*line + 0.001*write + 0.001*article + 0.001*insurance + 0.001*go + '
   '0.001*be + 0.001*host + 0.001*say + 0.001*organization + 0.001*part'),
   (17,
   '0.001*line + 0.001*get + 0.001*hit + 0.001*go + 0.001*write + 0.001*say + '
   '0.001*know + 0.001*drug + 0.001*see + 0.001*need'),
   (18,
   '0.002*option + 0.001*line + 0.001*flight + 0.001*power + 0.001*software + '
   '0.001*write + 0.001*add + 0.001*people + 0.001*organization + 0.001*module'),
   (19,
   '0.001*shuttle + 0.001*line + 0.001*roll + 0.001*attitude + 0.001*maneuver + '
   '0.001*mission + 0.001*also + 0.001*orbit + 0.001*produce + 0.001*frequency')
]