Python Web Scraping: Scraping a Blog


This was my first attempt at a Python web scraper, and I targeted the technical blog on my internship company's official website. Looking at the page source, it is easy to see that everything we want to scrape sits inside a <ul class="blog-item-contain"> element. We only need bs4's filtering to match that tag, then read the contents of the <a>, <span>, and <p> tags underneath it. The code is as follows:

    import requests
    from bs4 import BeautifulSoup

    def get_html(url):
        # Fetch a page, identifying as a regular browser via the User-Agent header.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'}
        resp = requests.get(url, headers=headers)
        return resp.text

    def all_page():
        # Build the URLs for blog list pages 1-4, 10 posts per page.
        base_url = 'https://www.hansight.com/blog?page='
        return [base_url + str(page) + '&size=10' for page in range(1, 5)]

    def html_parse(f):
        for url in all_page():
            soup = BeautifulSoup(get_html(url), 'lxml')
            # Match the container, then collect titles, authors and summaries
            # from the <a>, <span> and <p> tags respectively.
            container = soup.find('ul', class_='blog-item-contain')
            titles = [a.string for a in container.find_all('a')]
            authors = [span.string for span in container.find_all('span')]
            summaries = [p.string for p in container.find_all('p')]
            for title, author, summary in zip(titles, authors, summaries):
                data = (str(title) + '\n\n' + str(author) + '\n\n'
                        + '内容导读:' + '\n' + str(summary) + '\n')  # "内容导读" = "summary"
                f.write(data + '--------------------------------' + '\n')

    path = 'C:/Users/jz/Desktop/Python/爬虫/瀚思博客文章.txt'
    with open(path, 'w', encoding='utf-8') as f:  # the context manager closes the file even on errors
        html_parse(f)
    print('文件已保存')  # "File saved"
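If you are new to bs4's filtering, the core matching step can be tried in isolation. The sketch below runs on a small hand-written HTML fragment; the fragment and its values are made up for illustration, and only the class name blog-item-contain comes from the real page:

    from bs4 import BeautifulSoup

    # A hand-written fragment mimicking the structure described above (illustrative only).
    html = '''
    <ul class="blog-item-contain">
      <li><a>Post title</a><span>Author</span><p>Short summary...</p></li>
    </ul>
    '''
    soup = BeautifulSoup(html, 'lxml')
    ul = soup.find('ul', class_='blog-item-contain')  # match the container by its class
    print(ul.find('a').string)     # -> Post title
    print(ul.find('span').string)  # -> Author
    print(ul.find('p').string)     # -> Short summary...

Note that find returns only the first match, while find_all (used in the full script) returns a list of every matching tag.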

The saved TXT file contains one block per post: the title, the author, then the summary under a "内容导读:" heading, with a dashed line separating the blocks.
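With placeholder values standing in for the scraped text, each block in the file looks like this:

    <Post title>

    <Author>

    内容导读:
    <Summary text>
    --------------------------------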
