How to count word frequencies in a text with Python
####### First, here is the complete code:
import jieba as j

# Read the whole novel and segment it into a list of words
txt = open("threekingdoms.txt", "r", encoding="utf8").read()
txts = j.lcut(txt)
# Frequent words that are not character names; removed from the tally below
keywords = ["却说", "二人", "不能", "如此", "不可", "商议", "左右", "如何"]
counts = {}
for word in txts:
    if len(word) == 1:
        # Skip single-character tokens
        continue
    elif word == "孔明" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in keywords:
    # pop() with a default avoids a KeyError if a keyword never occurred
    counts.pop(word, None)
items = list(counts.items())
# Sort by count, largest first
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{:<10}:{:>5}".format(word, count))
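This code counts how many times each character appears in Romance of the Three Kingdoms. First, the overall approach:
Open and read the file
Segment the text into words
Count each word
Sort the counts and print the top 15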
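The steps above can also be sketched as one small function. This is only an illustrative variant, not the code from this post: `ALIASES`, `STOPWORDS` and `top_words` are made-up names, the if/elif chain is swapped for a dictionary lookup, and `collections.Counter` does the counting and ranking. It takes an already segmented word list, so it runs without jieba.

```python
from collections import Counter

# Hypothetical lookup tables; the post uses an if/elif chain and a list instead.
ALIASES = {
    "孔明曰": "孔明",
    "关公": "关羽",
    "云长": "关羽",
    "玄德": "刘备",
    "玄德曰": "刘备",
    "孟德": "曹操",
    "丞相": "曹操",
}
STOPWORDS = {"却说", "二人", "不能", "如此", "不可", "商议", "左右", "如何"}


def top_words(tokens, n=15):
    """Count multi-character tokens, merging aliases and skipping stopwords."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1 or token in STOPWORDS:
            continue
        # Fall back to the token itself when it has no alias entry
        counts[ALIASES.get(token, token)] += 1
    # most_common sorts by count, descending, and returns at most n pairs
    return counts.most_common(n)


sample = ["孔明", "曰", "玄德", "却说", "孔明曰", "刘备", "云长"]
for word, count in top_words(sample):
    print("{:<10}:{:>5}".format(word, count))
```

With the real novel you would pass the list returned by `jieba.lcut` instead of `sample`.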
Now let's go through the code piece by piece.
First, open and read the file:
import jieba as j

txt = open("threekingdoms.txt", "r", encoding="utf8").read()
txts = j.lcut(txt)
# These lines open the file, read its text, and segment that text into a list of words.
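One thing the snippet above leaves open is the file handle, which is never closed explicitly. A possible variant, sketched here under my own assumptions, wraps the read in a `with` block. `load_tokens` is a hypothetical helper, and the segmenter is passed in as an argument so that the demo runs even without jieba installed; `str.split` stands in for it on a throwaway file.

```python
import os
import tempfile


def load_tokens(path, segment):
    """Read a UTF-8 text file and split it with the given segmenter."""
    # The with-block closes the file even if reading raises an error
    with open(path, "r", encoding="utf8") as f:
        text = f.read()
    return segment(text)


# Demo on a temporary file; str.split is only a stand-in for jieba.lcut
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write("孔明 玄德 孔明")
    demo_tokens = load_tokens(demo_path, str.split)

print(demo_tokens)
```

For the novel itself, the call would look like `load_tokens("threekingdoms.txt", jieba.lcut)`.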
Next, count each word:
for word in txts:
    if len(word) == 1:
        continue
    elif word == "孔明" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
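# The if/elif chain skips single-character tokens and merges the different names of one character into a single key.
# The last line is the key one: counts.get(rword, 0) looks up the value stored under the key rword in the counts dictionary and returns the default 0 if the key is missing.
# Either way, 1 is added to that value and the result is assigned back to counts[rword]. That is how each rword gets counted.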
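To see the `get` pattern in isolation, here is a tiny example on a hand-written word list (the list is made up for illustration). It also checks that `collections.Counter` from the standard library produces the same tally, in case you prefer that.

```python
from collections import Counter

words = ["孔明", "刘备", "孔明", "曹操", "孔明"]

counts = {}
for w in words:
    # get() returns 0 the first time w is seen, so no KeyError is raised
    counts[w] = counts.get(w, 0) + 1

print(counts)

# Counter builds the same tally in a single call
same = dict(Counter(words)) == counts
print(same)
```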
Then we sort:
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
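# First we turn all the key-value pairs of the counts dictionary into a list.
# (Note: listing a dictionary's items represents each key-value pair as a tuple and collects all the tuples in one list; for example, {"wo": 1, "ni": 2} becomes [("wo", 1), ("ni", 2)].)
# Then we sort that list by the count, which is element 1 of each tuple, in descending order.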
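Here is the same idea on a three-entry dictionary, so you can watch what `items()` and `sort` do. The `sorted` call at the end is an optional alternative I am adding, not part of the original program: it leaves the input untouched and breaks ties alphabetically.

```python
counts = {"wo": 1, "ni": 2, "ta": 2}

# A list of (key, value) tuples, in insertion order
items = list(counts.items())
print(items)

# Sort in place by the count (index 1 of each tuple), largest first
items.sort(key=lambda x: x[1], reverse=True)
print(items)

# sorted() builds a new list instead of sorting in place.
# Negating the count sorts descending; the word itself breaks ties.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```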
Finally, we print the first 15 entries of the sorted list:
for i in range(15):
    word, count = items[i]
    print("{:<10}:{:>5}".format(word, count))
# Two variables can be assigned at once because every element of the items list is a tuple.
# items[i] returns one such tuple, which holds the original key and its value.
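A short sketch of that unpacking, on a made-up three-item list. It also shows a slice, `items[:15]`, as one possible way to print "up to 15" entries: indexing with `range(15)` raises an IndexError on a text with fewer than 15 distinct words, while a slice simply returns what is there.

```python
items = [("孔明", 3), ("刘备", 2), ("曹操", 1)]

# Each element is a 2-tuple, so it can be unpacked into two names at once
word, count = items[0]
print(word, count)

# Slicing never raises IndexError, even when fewer than 15 items exist
lines = ["{:<10}:{:>5}".format(w, c) for w, c in items[:15]]
print("\n".join(lines))
```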
So, here is a screenshot of the output after running the program:
Writing this up took some effort, so please leave a comment if you stop by.