Scraping Douban data with a Python crawler

    xiaoxiao  2025-06-18

    This article uses urllib, under Python 3.7, to scrape Douban pages.

    Only two modules are needed, urllib and re; the full implementation follows.

    import urllib.request
    import re
    import ssl

    url = "https://read.douban.com/provider/all"

    def doubanread(url):
        # Disable certificate verification so urlopen does not fail on
        # machines without an up-to-date CA bundle
        ssl._create_default_https_context = ssl._create_unverified_context
        data = urllib.request.urlopen(url).read()
        data = data.decode("utf-8")
        # Each provider name sits inside a <div class="name">...</div>
        pat = '<div class="name">(.*?)</div>'
        mydata = re.compile(pat).findall(data)
        return mydata

    def writetxt(mydata):
        fw = open("test.txt", "w")
        for i in range(0, len(mydata)):
            fw.write(mydata[i] + "\n")
        fw.close()

    if __name__ == '__main__':
        datatest = doubanread(url)
        writetxt(datatest)
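    A side note on the ssl line: overriding ssl._create_default_https_context turns off certificate checks for every HTTPS request the process makes. A narrower variant, sketched below under the same assumptions, hands an unverified context to this one call only:

    import urllib.request
    import ssl

    # Sketch: scope the unverified context to a single request instead of
    # replacing the process-wide default as the script above does
    ctx = ssl._create_unverified_context()

    def doubanread(url):
        data = urllib.request.urlopen(url, context=ctx).read()
        return data.decode("utf-8")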

    This script scrapes the publisher information from the Douban Read providers page and writes all the publishers, one per line, into a txt file saved locally.
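    One caveat on the txt step: the publisher names are Chinese, so on systems whose default file encoding is not UTF-8 the bare open("test.txt", "w") can raise a UnicodeEncodeError. A safer variant is sketched below (the path parameter is added here purely for illustration):

    def writetxt(mydata, path="test.txt"):
        # Explicit UTF-8 sidesteps locale-dependent encodings, and the with
        # block guarantees the file is closed even if a write fails
        with open(path, "w", encoding="utf-8") as fw:
            for name in mydata:
                fw.write(name + "\n")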

    Below is another version of the scraper. It crawls the fiction section of Douban Books, capturing the book title, author, publisher, publication date, price, rating, and similar fields.

    This version uses the requests library to fetch the page HTML and BeautifulSoup to parse it. Because it crawls multiple pages, the time module is brought in to pause between requests so the crawler is less likely to be blocked. Rows are accumulated in lists during the crawl, then converted with pandas and saved to a csv file.

    One more thing to note: what you hand to BeautifulSoup should be the response's content, otherwise it will raise an error.
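    Concretely, the Response object that requests returns is not itself markup; pass its raw bytes. A minimal illustration, using a hypothetical single-page fetch:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get('https://book.douban.com/tag/小说?start=0&type=T',
                        headers={'User-Agent': 'Mozilla/5.0'})
    # soup = BeautifulSoup(html, 'lxml')        # wrong: passes the Response object
    soup = BeautifulSoup(html.content, 'lxml')  # right: the page's raw bytes
    # html.text (the decoded str) also parses; .content lets lxml sniff the encoding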

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import time
    import random

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    urlorigin = 'https://book.douban.com/tag/小说?start='

    def doubanSpider(urlorigin):
        # Collect rows from every page first and write the csv once,
        # so later pages do not overwrite earlier ones
        rows = []
        for i in range(0, 30):
            url = urlorigin + str(i * 20) + '&type=T'
            html = requests.get(url=url, headers=headers)
            rows += doubanparse(html)
            # Randomized pause between requests to lower the risk of a ban
            time.sleep(3 + random.random())
        doubanwrite(rows)

    def doubanparse(html):
        rows = []
        bookname = []
        actors = []
        publishers = []
        dates = []
        prices = []
        scores = []
        numberofpeople = []
        # Parse the raw bytes (html.content), not the Response object itself
        soup = BeautifulSoup(html.content, 'lxml')
        for name in soup.select('h2 a'):
            bookname.append(name.get_text().strip())
        # Each .pub line packs "author / publisher / date / price"
        for pub in soup.select('div .pub'):
            info = pub.get_text().strip()
            actors.append(info.split('/')[0].strip())
            publishers.append(info.split('/')[-3].strip())
            dates.append(info.split('/')[-2].strip())
            prices.append(info.split('/')[-1].strip())
        for score in soup.select('div .rating_nums'):
            scores.append(score.get_text().strip())
        for peoples in soup.select('div .pl'):
            numberofpeople.append(peoples.get_text().strip())
        for i in range(len(bookname)):
            rows.append([bookname[i], actors[i], publishers[i], dates[i],
                         prices[i], scores[i], numberofpeople[i]])
        return rows

    def doubanwrite(dataList):
        fieldnames = ['bookname', 'author', 'publisher', 'date', 'price',
                      'score', 'numberofpeople']
        data = pd.DataFrame(columns=fieldnames, data=dataList)
        data.to_csv('douban.csv', index=False)

    if __name__ == '__main__':
        doubanSpider(urlorigin)
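    Two fragilities worth flagging in the parsing step: it assumes every .pub line splits into at least three slash-separated fields, and that the title, rating, and rating-count lists line up one-to-one, neither of which holds on every page. A hedged sketch of a more defensive split (parsepub is a hypothetical helper, not part of the original script):

    def parsepub(info):
        # The .pub line reads "author / publisher / date / price", but some
        # entries omit fields; pad with empty strings before unpacking so a
        # short line cannot shift the date into the publisher column
        parts = [p.strip() for p in info.split('/')]
        while len(parts) < 4:
            parts.insert(1, '')
        return parts[0], parts[-3], parts[-2], parts[-1]

    With this helper, the four append calls in doubanparse collapse into one tuple unpack, and an entry such as "张三 / 2019-5 / 45.00元" still yields the right date and price with an empty publisher.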
