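Gensim - Creating an LDA Topic Model
This chapter explains how to create a Latent Dirichlet Allocation (LDA) topic model in Gensim.
One of the primary applications of NLP (natural language processing) is automatically extracting information about the topics discussed in large volumes of text. Such text might come from hotel reviews, tweets, Facebook posts, feeds from any other social media channel, movie reviews, news stories, user feedback, emails, and so on.
In this digital era, knowing what people and customers are talking about, and understanding their opinions and problems, is highly valuable to businesses, political campaigns, and administrators. But is it possible to read through such large volumes of text manually and extract the topics from them?
No, it is not. The task requires an automatic algorithm that can read through these large volumes of text documents and extract the required information, that is, the topics being discussed.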
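Role of LDA
LDA's approach to topic modeling is to assign the text in a document to particular topics. Modeled with Dirichlet distributions, LDA builds:
- a topics-per-document model, and
- a words-per-topic model
Once the LDA topic model algorithm is applied, it rearranges the following to obtain a good composition of topic-keyword distributions:
- the topic distributions within each document, and
- the keyword distributions within each topic
While processing, LDA makes the following assumptions:
- Each document is modeled as a multinomial distribution over topics.
- Each topic is modeled as a multinomial distribution over words.
- We have to choose the right corpus of data, because LDA assumes that each chunk of text contains related words.
- LDA also assumes that documents are produced from a mixture of topics.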
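The assumptions above can be read as a recipe for generating a document. The following is a minimal sketch of that generative view using only the Python standard library; it is not how Gensim trains a model. The two toy topics, their word probabilities, and the helper names sample_dirichlet and generate_document are all made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate a toy document as a mixture of topics."""
    # Per-document topic distribution (the "topics per document" part).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        # Pick a word from that topic's word distribution (the "words per topic" part).
        vocab = list(topic_word[topic])
        weights = list(topic_word[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical topics, invented for this example.
topic_word = [
    {"car": 0.5, "engine": 0.3, "door": 0.2},
    {"mac": 0.4, "disk": 0.4, "display": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```

Training an LDA model works in the opposite direction: given only the words of many documents, it estimates topic distributions and word distributions that could plausibly have produced them.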
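Implementation with Gensim
Here, we will use LDA (Latent Dirichlet Allocation) to extract the naturally discussed topics from a dataset.
Loading the Dataset
The dataset we will use is the '20 Newsgroups' dataset, which contains thousands of posts collected from 20 different newsgroups. It is available through Sklearn's datasets module, and we can download it with the following Python script: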
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
Let us look at a few sample posts with the following script:
newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ---- ", "From: guykuo@carson.u.washington.edu (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Posting-Host: carson.u.washington.edu A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll. Please send a brief message detailing your experiences with the procedure. Top speed attained, CPU rated speed, add on cards and adapters, heat sinks, hour of usage per day, floppy disk functionality with 800 and 1.4 m floppies are especially requested. I will be summarizing in the next two days, so please add to the network knowledge base if you have done the clock upgrade and haven't answered this poll. Thanks. Guy Kuo <guykuo@u.washington.edu> ", 'From: twillis@ec.ecn.purdue.edu (Thomas E Willis) Subject: PB questions... Organization: Purdue University Engineering Computer Network Distribution: usa Lines: 36 well folks, my mac plus finally gave up the ghost this weekend after starting life as a 512k way back in 1985. sooo, i\'m in the market for a new machine a bit sooner than i intended to be... 
i\'m looking into picking up a powerbook 160 or maybe 180 and have a bunch of questions that (hopefully) somebody can answer: * does anybody know any dirt on when the next round of powerbook introductions are expected? i\'d heard the 185c was supposed to make an appearence "this summer" but haven\'t heard anymore on it - and since i don\'t have access to macleak, i was wondering if anybody out there had more info... * has anybody heard rumors about price drops to the powerbook line like the ones the duo\'s just went through recently? * what\'s the impression of the display on the 180? i could probably swing a 180 if i got the 80Mb disk rather than the 120, but i don\'t really have a feel for how much "better" the display is (yea, it looks great in the store, but is that all "wow" or is it really that good?). could i solicit some opinions of people who use the 160 and 180 day-to-day on if its worth taking the disk size and money hit to get the active display? (i realize this is a real subjective question, but i\'ve only played around with the machines in a computer store breifly and figured the opinions of somebody who actually uses the machine daily might prove helpful). * how well does hellcats perform? ;) thanks a bunch in advance for any info - if you could email, i\'ll post a summary (news reading time is at a premium with finals just around the corner... : ( ) -- Tom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering ---------------------------------------------------------------------------\ n"Convictions are more dangerous enemies of truth than lies." - F. W. Nietzsche ', 'From: jgreen@amber (Joe Green) Subject: Re: Weitek P9000 ? Organization: Harris Computer Systems Division Lines: 14 Distribution: world NNTP-Posting-Host: amber.ssd.csd.harris.com X-Newsreader: TIN [version 1.1 PL9] Robert J.C. 
Kyanko (rob@rjck.UUCP) wrote: >abraxis@iastate.edu writes in article <abraxis.734340159@class1.iastate.edu >: > > Anyone know about the Weitek P9000 graphics chip? > As far as the low-level stuff goes, it looks pretty nice. It\'s got this > quadrilateral fill command that requires just the four points. Do you have Weitek\'s address/phone number? I\'d like to get some information about this chip. -- Joe Green Harris Corporation jgreen@csd.harris.com Computer Systems Division "The only thing that really scares me is a person with no sense of humor. " -- Jonathan Winters ']
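Prerequisites
We need the stopwords from NLTK and an English language model from spaCy. The stopwords are downloaded through NLTK. The spaCy model has to be installed first (for example with python -m spacy download en_core_web_md) and can then be loaded as follows:
import nltk
import spacy

nltk.download('stopwords')
# The parser and NER components are disabled because only POS tags and lemmas are used later
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])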
Importing the Necessary Packages
To build the LDA model, we need to import the following packages:
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in pyLDAvis 3.x and later
import matplotlib.pyplot as plt
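Preparing the Stopwords
Now we need to import the stopwords and extend the list with a few words that are specific to this corpus: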
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
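Cleaning Up the Text
Now, with the help of Gensim's simple_preprocess(), we need to tokenize each sentence into a list of words. We should also remove punctuation and unnecessary characters. To do this, we will create a function named sent_to_words():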
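def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess also lowercases the text
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# data is the list of raw documents, for example newsgroups_train.data
data_words = list(sent_to_words(data))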
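To make the cleaning step more concrete, here is a rough standard-library sketch of the three regular-expression substitutions used in the complete example of this chapter, followed by a simplified stand-in for the tokenizer. The function simple_tokens is a hypothetical approximation written for this illustration (lowercasing, accent stripping, alphabetic tokens of 2 to 15 characters); it is not Gensim's implementation, and the sample string is invented.

```python
import re
import unicodedata

def clean(text):
    """Apply the same three substitutions as the chapter's complete example."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # remove e-mail addresses
    text = re.sub(r'\s+', ' ', text)        # collapse runs of whitespace (including newlines)
    text = re.sub(r"'", '', text)           # remove single quotes
    return text

def simple_tokens(text, min_len=2, max_len=15):
    """Hypothetical approximation of a simple tokenizer, for illustration only."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize('NFD', text.lower())
    text = ''.join(ch for ch in text if unicodedata.category(ch) != 'Mn')
    # Keep runs of word characters that contain no digits.
    tokens = re.findall(r'[^\W\d]+', text)
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith('_')]

sample = "From: joe@example.com (Joe)\nSubject: What's this car?"
cleaned = clean(sample)
tokens = simple_tokens(cleaned)
print(cleaned)
print(tokens)
```

On this sample the e-mail address, the newline, and the apostrophe disappear before tokenization, so headers such as "From" and "Subject" remain as ordinary tokens. That is why the stopword list in this chapter is extended with 'from' and 'subject'.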
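Building Bigram and Trigram Models
As we know, bigrams are two words that frequently occur together in a document, and trigrams are three words that frequently occur together. With the help of Gensim's Phrases model, we can detect both: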
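# A higher threshold results in fewer detected phrases
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Phraser wraps a trained Phrases model in a lighter, faster object
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)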
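The min_count and threshold parameters control which word pairs are promoted to phrases. As a rough illustration, the sketch below computes a collocation score of the form (pair count - min_count) / count(a) / count(b) * vocabulary size, which is the shape of the default scoring used by Phrases as far as I understand it. The helper bigram_scores and the toy sentences are invented, and the vocabulary size here counts single words only, which is a simplification.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score every adjacent word pair; higher scores suggest a stronger phrase."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: single words only
    return {
        (a, b): (n - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), n in bigrams.items()
    }

# Toy corpus, made up for this example.
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "city"],
    ["a", "new", "car", "is", "nice"],
]

scores = bigram_scores(sentences, min_count=1)
print(scores[("new", "york")])
print(scores[("new", "car")])
```

A pair is kept as a phrase only when its score exceeds the threshold, so "new york", which occurs three times, scores well above "new car", which occurs once. This also shows why raising threshold yields fewer phrases.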
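Filtering Out Stopwords
Next, we need to filter out the stopwords. Along with this, we will also create functions to make bigrams and trigrams and to perform lemmatization: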
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Keep only the lemmas of tokens whose part of speech is allowed
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
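The effect of the stopword filter and of the allowed_postags argument can be seen on a tiny hand-tagged example. This sketch does not call spaCy. The (word, lemma, part-of-speech) triples are labeled by hand, and the names STOP_WORDS, ALLOWED_POSTAGS, and filter_and_lemmatize are hypothetical, chosen only to mirror the logic of the functions above.

```python
ALLOWED_POSTAGS = {"NOUN", "ADJ", "VERB", "ADV"}
STOP_WORDS = {"the", "was", "from"}

# (surface form, lemma, part of speech), tagged by hand for illustration
tagged_doc = [
    ("the", "the", "DET"),
    ("doors", "door", "NOUN"),
    ("were", "be", "AUX"),
    ("really", "really", "ADV"),
    ("small", "small", "ADJ"),
]

def filter_and_lemmatize(tagged_tokens, stop_words=STOP_WORDS, allowed=ALLOWED_POSTAGS):
    """Drop stopwords, then keep the lemma of every token with an allowed POS tag."""
    kept = [t for t in tagged_tokens if t[0] not in stop_words]
    return [lemma for _, lemma, pos in kept if pos in allowed]

result = filter_and_lemmatize(tagged_doc)
print(result)  # ['door', 'really', 'small']
```

Here "the" is removed as a stopword, and "were" is removed because its tag is not in the allowed set, so only content-bearing lemmas reach the topic model.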
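Building the Dictionary and Corpus for the Topic Model
Now we need to build the dictionary and the corpus. We did this in earlier examples as well: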
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
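The corpus built here is a bag-of-words representation: each document becomes a list of (token id, count) pairs. The following standard-library sketch imitates that idea on two toy documents. The helpers build_vocab and to_bow are stand-ins written for this illustration, and the ids they assign (first-seen order) need not match the ids that Gensim's Dictionary would assign.

```python
from collections import Counter

def build_vocab(texts):
    """Map each distinct token to an integer id in first-seen order."""
    token2id = {}
    for text in texts:
        for tok in text:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Convert a token list into sorted (token id, count) pairs."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

texts = [["car", "engine", "car"], ["engine", "disk"]]
token2id = build_vocab(texts)
corpus = [to_bow(t, token2id) for t in texts]
print(token2id)  # {'car': 0, 'engine': 1, 'disk': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation. Only how often each word occurs in each document is kept, which is all that LDA uses.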
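Building the LDA Topic Model
We have now implemented everything required to train the LDA model, so it is time to build the LDA topic model. For our implementation example, this can be done with the following lines of code: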
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=20,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)
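The chunksize, passes, and update_every settings together determine how much work training does. The back-of-the-envelope sketch below assumes a single worker and assumes that the model is updated once every update_every chunks, with any leftover chunk handled at the end of a pass; treat it as my reading of these parameters, not as a statement of Gensim's internals. The helper count_model_updates is hypothetical, and 11314 is used as the document count of the training subset.

```python
import math

def count_model_updates(num_docs, chunksize, passes, update_every=1):
    """Estimate chunks per pass and total model updates (assumed behavior)."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    updates_per_pass = math.ceil(chunks_per_pass / update_every)
    return chunks_per_pass, updates_per_pass * passes

chunks, updates = count_model_updates(num_docs=11314, chunksize=100, passes=10)
print(chunks, updates)  # 114 1140
```

Under these assumptions, the settings above process the corpus in 114 chunks per pass and sweep over it 10 times. Larger chunks or fewer passes reduce training time, usually at some cost in topic quality.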
Implementation Example
Let us see the complete implementation example for building the LDA topic model:
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in pyLDAvis 3.x and later
import matplotlib.pyplot as plt

# Requires nltk.download('stopwords'), see Prerequisites
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data

# Remove e-mail addresses, collapse whitespace, and drop single quotes
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub(r'\s+', ' ', sent) for sent in data]
data = [re.sub(r"'", "", sent) for sent in data]

def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
print(data_words[:4])  # prints the tokenized data, ready for stopword removal

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(
    data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
)
print(data_lemmatized[:4])  # prints the lemmatized data

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4])  # prints the corpus created above

# Print the words with their frequencies
print([[(id2word[token_id], freq) for token_id, freq in cp] for cp in corpus[:4]])

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=20,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)
We can now use the LDA model created above to get the topics and to compute the model perplexity.