NLP Processing Exercise

    xiaoxiao2025-04-23  18

Basic approach to NLP processing

Input text

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today. I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today. I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . . And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"
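For a text this short, the whole pipeline can be checked in memory before dealing with chunked reads. The following is a minimal sketch, not the author's code: `count_words` is a hypothetical helper name, and it assumes the same normalization used in this exercise (non-word characters become spaces, text is lowercased, tokens are split on whitespace).

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter of lowercase word frequencies in `text`."""
    # Replace every non-word character (punctuation, newlines) with a space.
    cleaned = re.sub(r'[^\w]', ' ', text)
    # Lowercase, then split on runs of whitespace (this drops empty tokens).
    return Counter(cleaned.lower().split())


sample = "I have a dream today. I have a dream that one day..."
freq = count_words(sample)
for word, n in freq.most_common(3):
    print(word, n)
```

Note that this normalization splits "God's" into the two tokens "god" and "s", because the apostrophe is treated as a non-word character.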
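Flowchart (figure not included)

Code

    # 1. Assume the file is too large to read into memory at once.
    # 2. Handle chunk boundaries: reading a fixed number of characters
    #    may cut a word in half.
    # 3. Handle the end-of-file condition for the last chunk.
    # 4. If the file does not need to be modified, open it with 'r' only.
    import os
    import re

    def nlp(content):
        # Replace punctuation and newlines with spaces
        te = re.sub(r'[^\w]', ' ', content)
        # Lowercase
        te = te.lower()
        # Tokenize
        te = te.split(' ')
        # Drop empty tokens
        te = filter(None, te)
        # Count word frequencies in the module-level dict `dic`
        for i in te:
            dic[i] = dic.get(i, 0) + 1

    dic = {}
    path = r'c:/users/chen.huaiyu/desktop'
    os.chdir(path)
    with open('nlp.txt', 'r') as fin:
        cc1 = ''            # debug only: accumulates everything processed so far
        seek = fin.tell()   # start position of the current chunk
        num = 100           # chunk size; grows until a chunk ends on a space
        count = []          # lengths of chunks that did not end on a space
        while True:
            content = fin.read(num)
            if content == '':
                break
            elif content[-1] != ' ':
                count.append(len(content))
                # Same length twice in a row: reading one more character
                # returned nothing new, so this is the last chunk of the file.
                if len(count) > 1 and count[-1] == count[-2]:
                    print('Y')
                    nlp(content)
                    seek = fin.tell()
                    cc1 += content
                    print(fin.tell())
                    break
                # Otherwise rewind and retry with a chunk one character longer
                num += 1
                fin.seek(seek)
                continue
            else:
                # The chunk ends on a space, so no word is split
                print('Y')
                nlp(content)
                print(len(content))
                seek = fin.tell()
                print(seek)
                print(fin.tell())
                cc1 += content

    # Sort by frequency, highest first, and write one "word count" pair per line
    dic = sorted(dic.items(), key=lambda v: v[1], reverse=True)
    with open('output.txt', 'w') as fout:
        for word, freq in dic:
            fout.write('{} {}\n'.format(word, freq))

Results

    and 15
    be 13
    will 11
    to 11
    the 10
    of 10
    a 8
    we 8
    day 6
    able 6
    every 6
    together 6
    i 5
    have 5
    dream 5
    that 5
    ……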
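The boundary problem described in notes 2 and 3 can also be handled without rewinding the file. The sketch below is one possible alternative, not the author's method: it keeps the trailing partial word of each read and prepends it to the next read. The names `iter_chunks` and `count_stream` are hypothetical, and an `io.StringIO` object stands in for the real file so the example is self-contained.

```python
import io
import re
from collections import Counter


def iter_chunks(stream, size=100):
    """Yield pieces of `stream` that always end on a space or newline."""
    carry = ''  # trailing partial word left over from the previous read
    while True:
        block = stream.read(size)
        if not block:
            break
        block = carry + block
        # Everything after the last space/newline may be an unfinished word.
        cut = max(block.rfind(' '), block.rfind('\n'))
        if cut == -1:
            carry = block  # no boundary seen yet, keep accumulating
            continue
        yield block[:cut + 1]
        carry = block[cut + 1:]
    if carry:
        yield carry  # whatever remains at end of file


def count_stream(stream, size=100):
    """Count lowercase words in `stream`, reading `size` characters at a time."""
    counts = Counter()
    for piece in iter_chunks(stream, size):
        counts.update(re.sub(r'[^\w]', ' ', piece).lower().split())
    return counts


text = "Free at last! Free at last! Thank God Almighty, we are free at last!"
print(count_stream(io.StringIO(text), size=8).most_common(2))
```

With a real file, `count_stream(open('nlp.txt', 'r'))` would play the same role as the read loop above. The trade-off is that `carry` grows without bound if the input contains no spaces or newlines at all.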