Synsets for a Word in WordNet in NLP
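Introduction
WordNet is a large lexical database of words that is available through the NLTK library and can be used for natural language use cases in several languages. NLTK provides an interface called Synset that lets us look up words in WordNet. Nouns, verbs, and other parts of speech are grouped into synsets.
WordNet and Synsets
The figure below shows the structure of WordNet.
WordNet maintains the relationships between words. For example, sad and words like it are similar in meaning and appear in similar contexts, so they can be used interchangeably. Such words are grouped into a synset. Each synset has its own meaning and is linked to other synsets through conceptual relations.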
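The idea of a synset as a group of interchangeable words can be pictured with a small toy model. This is only a sketch: the `TOY_SYNSETS` table, its identifiers, and its member words are hypothetical and are not taken from WordNet's data, and `synsets_of` and `synonyms_of` are illustrative helpers rather than NLTK functions.

```python
# Toy model of synsets: each synset id maps to a set of interchangeable words.
# The ids and members below are illustrative assumptions, not WordNet data.
TOY_SYNSETS = {
    "sad.a.01": {"sad", "unhappy", "sorrowful"},
    "glad.a.01": {"glad", "happy", "cheerful"},
    "blue.a.01": {"blue", "sad"},  # a word may belong to several synsets
}


def synsets_of(word):
    """Return the sorted ids of every toy synset that contains `word`."""
    return sorted(sid for sid, members in TOY_SYNSETS.items() if word in members)


def synonyms_of(word):
    """Return all words that share at least one toy synset with `word`."""
    related = set()
    for sid in synsets_of(word):
        related |= TOY_SYNSETS[sid]
    related.discard(word)  # a word is not its own synonym
    return sorted(related)


print(synsets_of("sad"))   # ['blue.a.01', 'sad.a.01']
print(synonyms_of("sad"))  # ['blue', 'sorrowful', 'unhappy']
```

Because one word can sit in more than one synset, a lookup returns a list of senses rather than a single entry, which is also why the NLTK call used later in this article is indexed with `[0]`.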
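Relations found in WordNet include hypernyms and hyponyms.
Hypernym - A hypernym is a more abstract term. For example, if we relate color to its types, such as blue, green, and yellow, then color is the hypernym.
Hyponym - In the color example above, the individual colors such as yellow and green are hyponyms. They are more specific terms.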
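The color example can be sketched as a tiny child-to-parent table. This is a toy structure for illustration only: `HYPERNYM_OF`, `hyponyms`, and `hypernym_path` are hypothetical names, and the extra "property" level is an assumption added so that the upward walk has more than one step. It is not WordNet's actual hierarchy.

```python
# Toy hypernym table built from the color example: specific term -> abstract term.
# This is an illustrative structure, not WordNet's actual hierarchy.
HYPERNYM_OF = {
    "blue": "color",
    "green": "color",
    "yellow": "color",
    "color": "property",  # hypothetical higher level, added for illustration
}


def hyponyms(term):
    """Return the direct, more specific terms listed under `term`."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == term)


def hypernym_path(term):
    """Walk upward from `term` to the root, returning every term on the way."""
    path = [term]
    while path[-1] in HYPERNYM_OF:
        path.append(HYPERNYM_OF[path[-1]])
    return path


print(hyponyms("color"))        # ['blue', 'green', 'yellow']
print(hypernym_path("yellow"))  # ['yellow', 'color', 'property']
```

The last element of the path plays the role of a root hypernym: the most abstract term reachable from the starting word.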
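Code implementation
import nltk
from nltk.corpus import wordnet

# Download the WordNet corpus (only needed once)
nltk.download('wordnet')

# Take the first synset returned for the word "book"
synset = wordnet.synsets('book')[0]

print("Name of the synset", synset.name())
print("Meaning of the synset :", synset.definition())
print("Example of the synset :", synset.examples())

# Hypernyms are more abstract; hyponyms of the first hypernym are its specific terms
print("Abstract terminology", synset.hypernyms())
print("Specific terminology :", synset.hypernyms()[0].hyponyms())

# The root hypernym is the most abstract ancestor of the synset
print("Hypernym (ROOT) :", synset.root_hypernyms())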
Output
Name of the synset book.n.02
Meaning of the synset : physical objects consisting of a number of pages bound together
Example of the synset : ['he used a large book as a doorstop']
Abstract terminology [Synset('publication.n.01')]
Specific terminology : [Synset('book.n.01'), Synset('collection.n.02'), Synset('impression.n.06'), Synset('magazine.n.01'), Synset('new_edition.n.01'), Synset('periodical.n.01'), Synset('read.n.01'), Synset('reference.n.08'), Synset('reissue.n.01'), Synset('republication.n.01'), Synset('tip_sheet.n.01'), Synset('volume.n.04')]
Hypernym (ROOT) : [Synset('entity.n.01')]
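Synset names such as `book.n.02` follow the pattern lemma, part-of-speech tag, sense number, separated by dots. The sketch below splits such a name into its parts using only the standard library. `SynsetName` and `parse_synset_name` are hypothetical helpers written for this illustration and are not part of NLTK.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    lemma: str  # the word form, e.g. "book"
    pos: str    # part-of-speech tag, e.g. "n" for noun or "v" for verb
    sense: int  # sense number within that lemma and part of speech


def parse_synset_name(name):
    """Split a name such as 'book.n.02' into lemma, POS tag, and sense number."""
    # Split from the right so that a lemma containing dots stays intact.
    lemma, pos, sense = name.rsplit(".", 2)
    return SynsetName(lemma, pos, int(sense))


parsed = parse_synset_name("book.n.02")
print(parsed.lemma, parsed.pos, parsed.sense)  # book n 2
```

Reading the name this way makes it easier to see at a glance that `book.n.01` and `book.n.02` are two different noun senses of the same word.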
Using the Pattern library
# Install the library first, for example in a notebook cell: !pip install pattern
from pattern.en import parse, pprint, singularize, pluralize

# Parse a sentence with chunk relations and lemmata, then print it as a table
pprint(parse("Jack and Jill went up the hill to fetch a bucket of water", relations=True, lemmata=True))

print("Plural of cat :", pluralize('cat'))
print("Singular of leaves :", singularize('leaves'))
Output
WORD    TAG  CHUNK  ROLE  ID  PNP  LEMMA
Jack    NNP  NP     SBJ   1   -    jack
and     CC   NP ^   SBJ   1   -    and
Jill    NNP  NP ^   SBJ   1   -    jill
went    VBD  VP     -     1   -    go
up      IN   PP     -     -   PNP  up
the     DT   NP     SBJ   2   PNP  the
hill    NN   NP ^   SBJ   2   PNP  hill
to      TO   VP     -     2   -    to
fetch   VB   VP ^   -     2   -    fetch
a       DT   NP     OBJ   2   -    a
bucket  NN   NP ^   OBJ   2   -    bucket
of      IN   PP     -     -   PNP  of
water   NN   NP     -     -   PNP  water

Plural of cat : cats
Singular of leaves : leaf
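Using the WordNet interface in spaCy
# Install the extension first, for example in a notebook cell: !pip install spacy-wordnet
import nltk
import spacy
# Importing the annotator makes the "spacy_wordnet" pipeline component available
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

nltk.download('wordnet')

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacy_wordnet", after='tagger')

# Look up WordNet information for the first token of the text
spacy_token = nlp('leaves')[0]
print("Synsets : ", spacy_token._.wordnet.synsets())
print("Lemmas : ", spacy_token._.wordnet.lemmas())
print("Wordnet domains:", spacy_token._.wordnet.wordnet_domains())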
Output
Synsets : [Synset('leave.v.01'), Synset('leave.v.02'), Synset('leave.v.03'), Synset('leave.v.04'), Synset('exit.v.01'), Synset('leave.v.06'), Synset('leave.v.07'), Synset('leave.v.08'), Synset('entrust.v.02'), Synset('bequeath.v.01'), Synset('leave.v.11'), Synset('leave.v.12'), Synset('impart.v.01'), Synset('forget.v.04')]
Lemmas : [Lemma('leaf.n.01.leaf'), Lemma('leaf.n.01.leafage'), Lemma('leaf.n.01.foliage'), Lemma('leaf.n.02.leaf'), Lemma('leaf.n.02.folio'), Lemma('leaf.n.03.leaf'), Lemma('leave.n.01.leave'), Lemma('leave.n.01.leave_of_absence'), Lemma('leave.n.02.leave'), Lemma('farewell.n.02.farewell'), Lemma('farewell.n.02.leave'), Lemma('farewell.n.02.leave-taking'), Lemma('farewell.n.02.parting'), Lemma('leave.v.01.leave'), Lemma('leave.v.01.go_forth'), Lemma('leave.v.01.go_away'), Lemma('leave.v.02.leave'), Lemma('leave.v.03.leave'), Lemma('leave.v.04.leave'), Lemma('leave.v.04.leave_alone'), Lemma('leave.v.04.leave_behind'),
Wordnet domains: ['diplomacy', 'book_keeping', 'administration', 'factotum', 'agriculture', 'electrotechnology', 'person', 'telephony', 'mechanics']
NLTK WordNet Lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a pos argument, words are treated as nouns
print("books :", lemmatizer.lemmatize("books"))
print("formulae :", lemmatizer.lemmatize("formulae"))

# pos="a" tells the lemmatizer to treat "worse" as an adjective
print("worse :", lemmatizer.lemmatize("worse", pos="a"))
Output
books : book
formulae : formula
worse : bad
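One way to picture how a dictionary-backed lemmatizer can arrive at results like these is a two-step lookup: first an irregular-form table, then suffix rules whose result is accepted only if it is a known word. The sketch below is a simplified illustration under that assumption and does not reproduce NLTK's implementation. `VOCAB`, `EXCEPTIONS`, `SUFFIX_RULES`, and `toy_lemmatize` are hypothetical, and the tables are tiny stand-ins for WordNet's data.

```python
# Simplified sketch of exception-then-suffix lemmatization.
# The vocabulary and exception tables are tiny hypothetical stand-ins for WordNet's.
VOCAB = {
    "n": {"book", "formula", "leaf", "cat"},
    "a": {"bad", "good"},
}
EXCEPTIONS = {
    "n": {"formulae": "formula"},
    "a": {"worse": "bad", "better": "good"},
}
SUFFIX_RULES = {
    "n": [("ves", "f"), ("ies", "y"), ("s", "")],
    "a": [("er", ""), ("est", "")],
}


def toy_lemmatize(word, pos="n"):
    """Return a lemma for `word`, or `word` itself when nothing matches."""
    # 1. Irregular forms are resolved by direct lookup.
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    # 2. Otherwise try each suffix rule and keep the first result in the vocabulary.
    for old, new in SUFFIX_RULES.get(pos, []):
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            if candidate in VOCAB.get(pos, set()):
                return candidate
    # 3. Fall back to the input unchanged.
    return word


print(toy_lemmatize("books"))           # book
print(toy_lemmatize("formulae"))        # formula
print(toy_lemmatize("worse", pos="a"))  # bad
print(toy_lemmatize("worse"))           # worse (not found among the nouns)
```

The last call shows why the part-of-speech argument matters: looked up as a noun, "worse" matches neither table and comes back unchanged.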
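Conclusion
Synsets are the interface for looking up words in WordNet. They offer a very useful way to discover new words and relations, because they group similar words and link to one another, forming a tightly connected network.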