Data Analysis and Data Mining in Practice (video course): study notes on a WeChat crawler (currently failing)

xiaoxiao, 2023-11-28

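The WeChat crawler from the video course did not work for me at the time, but I have come back to keep adjusting the code in the hope of reaching the goal. The first listing is the initial code, which did not work.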

What is a WeChat crawler

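A WeChat crawler is a crawler that automatically fetches information about WeChat articles. WeChat places many restrictions on crawlers, so we need some techniques to work around them, mainly disguising the crawler as a browser and using proxy IPs.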

Writing a WeChat crawler in practice

The difference between req and url

My understanding is that url is just the address string, while req is a request object. Code can wrap a url into a request.
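To make the distinction concrete, here is a minimal sketch that runs without any network access. The short User-Agent value is only a placeholder chosen for illustration; the rest uses the standard urllib.request.Request interface.

```python
import urllib.request

# A url is only a string; nothing is sent when it is created.
url = "http://www.baidu.com/s?wd=python"

# A Request object wraps that string together with a method and headers.
req = urllib.request.Request(url)
req.add_header("User-Agent", "Mozilla/5.0")  # placeholder UA (assumption)

print(type(url).__name__)            # str
print(type(req).__name__)            # Request
print(req.full_url)                  # the url the request points at
print(req.get_method())              # GET, because no data was attached
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

The request is only sent when it is passed to urllib.request.urlopen, which is why headers can still be added to req beforehand.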

    import urllib.request

    keywd = "python"  # example keyword
    url = "http://www.baidu.com/s?wd=" + keywd  # in my test https did not work here (a protocol issue)
    req = urllib.request.Request(url)  # wrap the url into a request
    data = urllib.request.urlopen(req).read()

    # http://weixin.sogou.com/
    import re
    import time
    import urllib.error
    import urllib.parse
    import urllib.request


    # Fetch one url through a proxy server
    def use_proxy(proxy_addr, url):
        # Set up exception handling
        try:
            req = urllib.request.Request(url)
            req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
            # Map both schemes, because the search url below is https
            proxy = urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
            opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
            urllib.request.install_opener(opener)
            data = urllib.request.urlopen(req).read()
            return data
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
            # On URLError, wait 10 s before continuing
            time.sleep(10)
        except Exception as e:
            print("exception:" + str(e))
            # On any other exception, wait 1 s before continuing
            time.sleep(1)


    # Search keyword, URL-encoded once (quoting inside the loop would re-encode it on every page)
    key = urllib.parse.quote("python")
    # Proxy server; it may have expired, so readers need to substitute a working one
    proxy = "127.0.0.1:8888"
    # Number of pages to crawl
    for i in range(0, 10):
        thispageurl = "https://weixin.sogou.com/weixin?type=2&query=" + key + "&page=" + str(i)
        thispagedata = use_proxy(proxy, thispageurl)
        print(len(str(thispagedata)))
        pat1 = '<a href="(.*?)"'
        rs1 = re.compile(pat1, re.S).findall(str(thispagedata))
        if len(rs1) == 0:
            print("Page " + str(i) + " failed")
            continue
        for j in range(0, len(rs1)):
            thisurl = rs1[j]
            thisurl = thisurl.replace("amp;", "")
            file = "F:/result/page" + str(i) + "_article" + str(j)
            thisdata = use_proxy(proxy, thisurl)
            try:
                # use_proxy returns bytes here (or None on failure, which lands in except)
                with open(file, "wb") as fh:
                    fh.write(thisdata)
                print("Page " + str(i) + ", article " + str(j) + " succeeded")
            except Exception as e:
                print(e)
                print("Page " + str(i) + ", article " + str(j) + " failed")

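The proxy IP got banned, and the url cannot fetch the page directly because a CAPTCHA is required, so for now this attempt failed.

For how to find the local IP address and port, see the articles "How to check your machine's IP address and port" and "Ways to check the local IP address and port".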
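One way to notice the CAPTCHA case in code, instead of saving the verification page as if it were an article, is to inspect the response before using it. The sketch below is only an illustration: the function name looks_like_captcha and the marker strings ("antispider" and "验证码") are assumptions rather than anything taken from the course or from Sogou's documentation, and the example URLs are made up.

```python
def looks_like_captcha(final_url, html, markers=("antispider", "验证码")):
    """Return True if the response appears to be a verification page.

    The marker strings are guesses for illustration, not a documented API.
    """
    text = (final_url or "") + " " + (html or "")
    return any(marker in text for marker in markers)


# Offline examples with made-up inputs
print(looks_like_captcha("https://weixin.sogou.com/antispider/?from=x", ""))
print(looks_like_captcha("https://weixin.sogou.com/weixin?type=2", "<html>ok</html>"))
```

With urllib, the final address after any redirects is available from the response object via geturl(), so a check like this could run right after urlopen and before the page is parsed or written to disk.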

2019.5.26 Update: here is the code I have so far.

    # http://weixin.sogou.com/
    import re
    import time
    import urllib.error
    import urllib.parse
    import urllib.request


    # Fetch one url through a proxy server
    def use_proxy(proxy_addr, url):
        # Set up exception handling
        try:
            req = urllib.request.Request(url)
            req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
            # Map both schemes, because the search url below is https
            proxy = urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
            opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
            urllib.request.install_opener(opener)
            # Decode the response bytes to text
            data = urllib.request.urlopen(req).read().decode("utf-8")
            return data
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
            # On URLError, wait 10 s before continuing
            time.sleep(10)
        except Exception as e:
            print("exception:" + str(e))
            # On any other exception, wait 1 s before continuing
            time.sleep(1)


    # Search keyword, URL-encoded once (quoting inside the loop would re-encode it on every page)
    key = urllib.parse.quote("python")
    # Proxy server; it may have expired, so readers need to substitute a working one
    proxy = "112.85.150.51:9999"
    # Number of pages to crawl
    for i in range(1, 2):
        thispageurl = "https://weixin.sogou.com/weixin?type=2&query=" + key + "&page=" + str(i)
        thispagedata = use_proxy(proxy, thispageurl)
        print(len(str(thispagedata)))
        pat1 = '<a target="_blank" href="(.*?)"'
        rs1 = re.compile(pat1, re.S).findall(str(thispagedata))
        if len(rs1) == 0:
            print("Page " + str(i) + " failed")
            continue
        for j in range(0, len(rs1)):
            thisurl = rs1[j]
            thisurl = thisurl.replace("amp;", "")
            file = "F:/result/page" + str(i) + "_article" + str(j)
            thisdata = use_proxy(proxy, thisurl)
            try:
                # use_proxy now returns str, so write in text mode instead of "wb"
                with open(file, "w", encoding="utf-8") as fh:
                    fh.write(thisdata)
                print("Page " + str(i) + ", article " + str(j) + " succeeded")
            except Exception as e:
                print(e)
                print("Page " + str(i) + ", article " + str(j) + " failed")

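First, the problem. The second fetch, the one that requests each article link, requires a CAPTCHA. That reminded me of the Douban project later in the course, which deals with CAPTCHAs and could serve as a reference.

Next, the changes. I inspected the data from the first fetch and found it was garbled, so I added decode("utf-8"). I also switched to a proxy IP that works. Proxy lists I used: https://www.xicidaili.com/ https://www.kuaidaili.com/free/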
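A possible reason the data looked garbled is that read() returns bytes, and calling str() on a bytes object yields its escaped representation rather than readable text. The sketch below shows one hedged way to decode a response body; decode_body is a hypothetical helper name, and choosing the charset from the Content-Type header with a UTF-8 fallback is an assumption about what suits these pages.

```python
import re


def decode_body(raw, content_type="", default="utf-8"):
    """Decode response bytes using the charset named in Content-Type, if any."""
    match = re.search(r"charset=([\w-]+)", content_type or "", re.I)
    encoding = match.group(1) if match else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset or invalid bytes: fall back without raising
        return raw.decode(default, errors="replace")


sample = "微信".encode("utf-8")
print(decode_body(sample, "text/html; charset=UTF-8"))
# str() on bytes gives the escaped repr, which is much longer than the text
print(len(str(sample)), len(decode_body(sample)))
```

With urllib, the header value could be read from the response via response.headers.get("Content-Type", "") before decoding.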

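Now I am going to try to solve the CAPTCHA problem. It is also possible that my approach is wrong: the URL of the CAPTCHA page differs a lot from the URL I actually want, and the URL after entering the CAPTCHA is quite different too.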

The URL I crawled is: http://weixin.sogou.com/api/share?timestamp=1558869883&signature=qIbwYnI6KU9tBso4VCd8lYSesxOYgLcHX5tlbqlMR8N6flDHs4LLcFgRw7FjTAO7iy2zEfnrg2GpROYWYV1hV-HqK65ImW51EybBXBS6JKfFnGmdnbfkNmbImVYKqGV1Ew4ZNlmKXrhg2iEsku4JxAJZqo1sfqpmwKw59GzMczGjrcjkvk8trIkVJc*wLMQFgpXA3X9Jl4-MSAclPjNgjh62z817EWL8wOdRgXFOk=

But the target URL is: https://mp.weixin.qq.com/s?src=11&timestamp=1558869883&ver=1630&signature=-fQZxQcIcVvtFaKu2Jt-2iUljsxvLcWxHl7TBFoExPtkeO2tcpgjKPxq1vNriRb9vySbBcDRa7g4Zf4YiUkRjXqFNdMlUN4qPF1f43feYZl3I7DqMNKjIBgxKfUhks&new=1

For the moment I cannot see how the two are related. It probably needs deeper study later.
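As a starting point for that study, the two URLs can be split into host, path, and query parameters and compared side by side. The sketch below uses shortened stand-ins for the two URLs: the signature values are placeholders ("AAA", "BBB"), the helper name describe is hypothetical, and the target URL is assumed to carry a timestamp parameter. It only shows how to compare them, not how one is derived from the other.

```python
from urllib.parse import urlsplit, parse_qs


def describe(url):
    """Split a URL into host, path, and a dict of query parameters."""
    parts = urlsplit(url)
    return parts.netloc, parts.path, parse_qs(parts.query)


# Shortened stand-ins; the signature values are placeholders
crawled = "http://weixin.sogou.com/api/share?timestamp=1558869883&signature=AAA"
target = "https://mp.weixin.qq.com/s?src=11&timestamp=1558869883&ver=1630&signature=BBB&new=1"

host_a, path_a, query_a = describe(crawled)
host_b, path_b, query_b = describe(target)

print(host_a, path_a, sorted(query_a))
print(host_b, path_b, sorted(query_b))

# Parameters present in both URLs, and whether their values agree
for name in sorted(set(query_a) & set(query_b)):
    print(name, query_a[name] == query_b[name])
```

In this comparison the timestamp value is shared while the signature differs, which would be consistent with the signature being issued separately for each host, though that remains a guess.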

If any expert knows how to fix this, I would appreciate an answer.
