Scraping funny images from Baisibudejie

    xiaoxiao  2023-10-13  180

    A couple of days ago I watched a video about scraping jokes from Baisibudejie (百思不得姐), so I went to the site to take a look and noticed it also has an audio section, which I decided to scrape with my newly learned multithreading O(∩_∩)O. I didn't expect to fall into a pit: the audio section has ten pages, but all ten pages contain exactly the same content. Watching the progress messages I had set up, I started to doubt myself — for example, five "xxxxxx downloaded" messages would pop up at once. It took me a long while to figure out that the problem was the site itself. Well, since I was already there, I decided to scrape the images instead. URL: http://www.budejie.com/pic/. Appending a number N to the URL gives page N; there are no tricks, and a plain requests GET fetches each page. The code is much like my earlier meme-scraping code, except that this time I used pyquery as the parsing library, and I also sanitized characters that are not allowed in filenames:
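    The duplicate-page problem described above can be caught early by hashing each page's HTML before parsing anything: if two pages hash the same, the site is serving identical content. A minimal sketch (page contents are stubbed as strings here; in practice they would come from `requests.get(url).text`):

```python
import hashlib

def find_duplicate_pages(pages):
    """Map content hash -> list of page numbers that share that content."""
    seen = {}
    for page_no, html in pages.items():
        digest = hashlib.md5(html.encode("utf-8")).hexdigest()
        seen.setdefault(digest, []).append(page_no)
    # Keep only hashes shared by more than one page
    return {d: nums for d, nums in seen.items() if len(nums) > 1}

# Stubbed page contents: pages 1-3 are identical, page 4 differs
pages = {1: "<html>same</html>", 2: "<html>same</html>",
         3: "<html>same</html>", 4: "<html>other</html>"}
print(find_duplicate_pages(pages))  # pages 1, 2 and 3 collide
```

    Running a check like this against the ten audio pages would have flagged the problem immediately, instead of it surfacing as repeated "downloaded" messages.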

    import os
    import re
    import threading
    import time
    from queue import Queue
    from urllib import request

    import requests
    from pyquery import PyQuery as pq


    class Productor(threading.Thread):
        """Producer thread: fetches list pages and enqueues (pic_url, filename) pairs."""

        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/74.0.3729.157 Safari/537.36"
        }

        def __init__(self, page_queue, pic_queue, *args, **kwargs):
            super(Productor, self).__init__(*args, **kwargs)
            self.page_queue = page_queue
            self.pic_queue = pic_queue

        def run(self):
            while True:
                if self.page_queue.empty():
                    break
                url = self.page_queue.get()
                self.get_page_parse(url)

        def get_page_parse(self, url):
            response = requests.get(url, headers=self.headers)
            doc = pq(response.text)
            pics = doc(".j-r-list-tool-ct-fx div")
            for pic in pics.items():
                pic_url = pic.attr("data-pic")
                pic_name = pic.attr("data-text")
                try:
                    suffix = os.path.splitext(pic_url)[1]
                    filename = pic_name + suffix
                    filename = re.sub(" ", "", filename)
                    # Strip characters that are not allowed in filenames
                    filename = re.sub("[-《》#??]", "", filename).strip()
                except Exception:
                    continue
                self.pic_queue.put((pic_url, filename))
            time.sleep(0.5)


    class Consumer(threading.Thread):
        """Consumer thread: downloads each queued image to the baisipic folder."""

        def __init__(self, page_queue, pic_queue, *args, **kwargs):
            super(Consumer, self).__init__(*args, **kwargs)
            self.page_queue = page_queue
            self.pic_queue = pic_queue

        def run(self):
            while True:
                if self.page_queue.empty() and self.pic_queue.empty():
                    break
                try:
                    pic_url, filename = self.pic_queue.get()
                    request.urlretrieve(pic_url, "baisipic/" + filename)
                    print(filename + " downloaded")
                except Exception:
                    continue
                time.sleep(0.5)


    def main():
        page_queue = Queue(50)
        pic_queue = Queue(800)
        os.makedirs("baisipic", exist_ok=True)  # make sure the target folder exists
        for i in range(1, 51):
            url_s = "http://www.budejie.com/pic/" + str(i)
            page_queue.put(url_s)
        for i in range(3):
            p = Productor(page_queue, pic_queue)
            p.start()
        for i in range(3):
            t = Consumer(page_queue, pic_queue)
            t.start()


    if __name__ == '__main__':
        main()
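    One caveat with the shutdown logic above: the `empty()` checks are racy. A consumer can observe both queues momentarily empty while a producer is still parsing a page, and exit early; a producer can likewise race another producer between `empty()` and `get()`. A common fix (a sketch of the pattern, not the original script) is to have the main thread push one sentinel per consumer after all producers have finished; the download and parsing work is stubbed out here:

```python
import threading
from queue import Queue, Empty

SENTINEL = None  # tells a consumer to stop


def producer(page_queue, pic_queue):
    while True:
        try:
            url = page_queue.get_nowait()  # raises Empty instead of racing empty()
        except Empty:
            break
        # ... in the real script: fetch + parse the page here ...
        pic_queue.put((url + "/img.jpg", "demo.jpg"))  # stubbed work item


def consumer(pic_queue, results):
    while True:
        item = pic_queue.get()  # blocks; no need to poll empty()
        if item is SENTINEL:
            break
        results.append(item)  # in the real script: urlretrieve + print


page_queue, pic_queue = Queue(), Queue()
for i in range(1, 4):
    page_queue.put("http://www.budejie.com/pic/%d" % i)

results = []
producers = [threading.Thread(target=producer, args=(page_queue, pic_queue))
             for _ in range(2)]
consumers = [threading.Thread(target=consumer, args=(pic_queue, results))
             for _ in range(2)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
for _ in consumers:
    pic_queue.put(SENTINEL)  # one sentinel per consumer, after producers are done
for t in consumers:
    t.join()
print(len(results))  # 3
```

    With sentinels, consumers never have to guess whether the producers are finished, so no image can be dropped by an early exit.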

    The results of the crawl:
