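When reading English papers, the high-frequency words are often technical terms that you may not recognize. It helps to analyze the paper's word frequencies first and learn the meaning of the most frequent words before you start reading, so that the paper is easier to follow. This post shows how to use Python to output how often each word appears in an English paper.
First, the paper has to be converted from PDF to txt format. Converting directly usually produces garbled text, so I recommend converting to Word format first and then copying the content into a plain txt file.
Here is the code, with detailed comments: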
# Word-frequency analysis of a paper
# You should convert the file to text format first
__author__ = 'Chen Hong'

# Read the text and save all the words in a list
def readtxt(filename):
    wordsL = []  # use this list to save the words
    # the with-block closes the file even if reading fails
    with open(filename, 'r') as fr:
        for line in fr:
            # split() with no arguments splits on any whitespace and drops empty strings
            wordsL.extend(line.split())
    return wordsL

# Count the frequency of every word and store it in a dictionary,
# then sort the entries by count from large to small
def count(wordsL):
    wordsD = {}
    for x in wordsL:
        # skip the words that we don't need
        if Judge(x):
            continue
        # count: a new word starts from 0, so its first occurrence is counted as 1
        wordsD[x] = wordsD.get(x, 0) + 1
    # sort the (word, count) pairs by count from large to small
    wordsInorder = sorted(wordsD.items(), key=lambda x: x[1], reverse=True)
    return wordsInorder

# Judge whether the word is one that we want to remove, such as punctuation or a single letter
# You can modify this function to remove more words, such as numbers
def Judge(word):
    punctList = [' ', '\t', '\n', ',', '.', ':', '?']  # whitespace and punctuation tokens
    letterList = ['a', 'b', 'c', 'd', 'm', 'n', 'x', 'p', 't']  # single letters
    return word in punctList or word in letterList

# Read the input file and write the sorted word counts to the output file
filename = 'F:\\python\\Paper1.txt'
wordsL = readtxt(filename)
words = count(wordsL)
# the with-block closes the output file automatically
with open('F:\\python\\Words In Order_1.txt', 'w') as fw:
    for item in words:
        fw.write(item[0] + ' ' + str(item[1]) + '\n')
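
One limitation of the script above is that it splits only on whitespace and removes a token only when it is exactly a punctuation mark or one of the listed letters, so "kernel," and "Kernel" are counted as different words from "kernel". A possible variation, shown here only as a rough sketch, lowercases the text, pulls out words with a regular expression, and counts them with collections.Counter from the standard library. The names tokenize, count_words, STOP_WORDS and min_length are made up for this illustration and are not part of the script above, and the stop-word list is just a small example that you would adapt to your own paper.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; extend it as needed.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "for", "we"}


def tokenize(text):
    """Lowercase the text and return its words.

    A word here is a run of ASCII letters, optionally joined by single
    inner apostrophes or hyphens, so digits and punctuation are dropped.
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def count_words(text, min_length=2, stop_words=STOP_WORDS):
    """Return (word, count) pairs sorted by count, most frequent first."""
    words = (w for w in tokenize(text)
             if len(w) >= min_length and w not in stop_words)
    return Counter(words).most_common()


# Small in-memory sample so the sketch runs without any file.
sample = "The kernel trick: Kernel methods map data. The kernel is key."
for word, n in count_words(sample)[:3]:
    print(word, n)
```

To use it on a real paper, you could read the whole txt file into one string (for example with open(filename).read()) and pass that string to count_words. Whether hyphenated words should stay together and how short a word must be before it is dropped are design choices; here they are controlled by the regular expression and by min_length.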