Scraping dynamic pages with Python (Part 1): JS data endpoints

xiaoxiao, 2023-11-09

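Some websites are dynamic pages: they load data without navigating to a new URL. Toutiao (今日头条), for example, loads more data as you scroll down, while the Credit Chengdu (信用成都) site loads data into a pop-up window.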

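Take scraping Credit Chengdu's list of enterprises with abnormal operations (列异名录) as an example. The page address is "http://credit.chengdu.gov.cn/www/index.html#/m///exceptionList/1/10/". Press F12 to inspect the source and you will find that the data is hidden. Let's open PyCharm and try downloading the page source.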

import requests
from bs4 import BeautifulSoup

url = "http://credit.chengdu.gov.cn/www/index.html#/m///exceptionList/1/10/"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
print(soup)

The printed page looks like this:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<style>
body { background:#fff; font-family: microsoft yahei; color:#969696; font-size:14px; }
.online-desc-con { text-align:center; }
.r-tip01 { color: #333; font-size: 18px; display: block; text-align: center; width: 500px; padding: 0 10px; overflow: hidden; text-overflow: ellipsis; margin: 0 auto 15px; }
.r-tip02 { color: #585858; font-size: 14px; display: block; margin-top: 20px; margin-bottom: 20px; }
#notice-jiasule { word-wrap: break-word; word-break: normal; color:#585858; border:1px solid #ddd; padding:0px 20px 0px 20px }
img { border: 0; }
.u-ico { vertical-align: middle; margin-right: 12px; }
.btn { padding: 8px 22px; border-radius: 3px; border: 0; display: inline-block; vertical-align: middle; text-decoration: none; }
.btn-g { background-color: #61b25e; color: #fff; }
.report { color: #858585; text-decoration: none; }
.report:hover { text-decoration: underline; color: #0088CC; }
hr { border-top: 1px dashed #ddd; }
center { line-height: 48px; color: #919191; }
</style>
<script id="content_tpl" type="text/template">
<span class="r-tip01"><%= error_403 %></span>
<div id='notice-jiasule'>
<p>当前网址：<%- url %></p>
<p>客户端特征：<%- user_agent %></p>
<p>拦截时间：<%- now %> 本次事件ID <%- rule_id %></p>
</div>
<span class='r-tip02'>
<img class='u-ico' alt='' src='/cdn-cgi/image/guest.png' />如果您是网站管理员，请登录知道创宇云安全
<a class='btn btn-g' href='http://help.yunaq.com/feedback.html?from=<%- from %>&rule_id=<%- rule_id %>&client_ip=<%- client_ip %>&referrer=<%- ref %>#pus' target='_blank'>查看详情</a> 或者
<a class='report' href='http://help.yunaq.com/feedback.html?from=<%- from %>&rule_id=<%- rule_id %>&client_ip=<%- client_ip %>&referrer=<%- ref %>#hus' target='_blank'>反馈误报</a>
</span>
</script>
<script src="/cdn-cgi/js/underscore_min_1.8.3.js" type="text/javascript"></script>
</head>
<body>
<div class="online-desc-con" style="width:640px;padding-top:15px;margin:34px auto;">
<img alt="" src="/cdn-cgi/image/protected.png" style="margin: 0 auto 17px auto;"/>
<div id="content_rendered"></div>
<hr/>
<center>client: 222.210.9.205, server: 2eb9fce, time: 2019-05-25 14:18:18</center>
</div>
<script>
void(function fuckie6(){if(location.hash && /MSIE 6/.test(navigator.userAgent) && !/jsl_sec/.test(location.href)){location.href = location.href.split('#')[0] + '&jsl_sec' + location.hash}})();
var content = _.template(document.getElementById('content_tpl').innerHTML)({
  error_403: '' || '当前访问疑似黑客攻击，已被网站管理员设置为拦截',
  url: document.URL.replace(/\</g,"&lt;").replace(/\>/g,"&gt;"),
  user_agent: navigator.userAgent,
  now: new Date(new Date() - -8 * 3600000).toISOString().substr(0, 19).replace('T', ' '),
  rule_id: parseInt('[80001]'.replace(/\[|\]/g, '')) || '',
  from: encodeURIComponent(document.referrer.substr(0, 1024)),
  client_ip: '222.210.9.205',
  ref: encodeURIComponent(document.URL.substr(0, 1024))
});
document.getElementById('content_rendered').innerHTML = content;
</script>
<div style="display:none;">
<script>
var _bdhmProtocol = (("https:" == document.location.protocol) ? " https://" : " http://");
document.write(unescape(""));
</script>
</div>
</body>
</html>
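Part of the reason the list cannot be in this response is visible in the address itself: everything after the # is a URL fragment, which the page's JavaScript handles inside the browser and which is not sent to the server. A minimal sketch using the standard library's urllib.parse.urldefrag to show which part of the address a request actually covers:

```python
from urllib.parse import urldefrag

page_url = "http://credit.chengdu.gov.cn/www/index.html#/m///exceptionList/1/10/"

# urldefrag splits a URL into the part that is requested from the server
# and the fragment that stays on the client side.
sent_url, fragment = urldefrag(page_url)

print(sent_url)   # http://credit.chengdu.gov.cn/www/index.html
print(fragment)   # /m///exceptionList/1/10/
```

So the server only ever sees index.html; the list data has to come from a separate request made by the page's scripts.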

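None of the data we want is there. What came back is an interception page (note the error_403 template and the notice-jiasule block), not the list. Open the developer tools on the real page, switch to the Network tab and filter by JS, and you will find a POST request.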

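Click the POST entry and look at the Headers pane on the right to find the request URL. The Response pane shows that the response is exactly the enterprise data we want to scrape. Opening the request URL from the headers confirms that the data lives at that address, so we scrape that URL instead.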

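Note: you must send request headers, otherwise the server will not respond. The code for this step (it is the first part of the complete code further down) requests the URL with requests; the data that comes back is JSON, which json.loads converts into a Python object; indexing into that object gives a list (mydata), and a loop over the list pulls out the company names we want.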
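The step just described can be sketched roughly as follows. The function names, the timeout and the sample data are illustrative assumptions; the endpoint, the User-Agent string and the msg / rows / ptName keys are taken from the complete code in this post. The parsing is kept separate from the download so it can be tried without network access.

```python
import json

LIST_URL = "http://credit.chengdu.gov.cn/homePage/findExceptionList.do"

# A browser-like User-Agent; the value is only an example.
# A session cookie copied from the browser may be needed as well.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
}


def extract_company_names(response_text):
    """Parse the JSON text of the list response and return the company names."""
    data = json.loads(response_text)   # JSON text -> Python dict
    rows = data["msg"]["rows"]         # the list this post calls mydata
    return [row["ptName"] for row in rows]


def fetch_company_names(url=LIST_URL, headers=HEADERS, timeout=10):
    """Download the list and return the names (needs network access)."""
    import requests  # third-party; imported here so the parser above works offline

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()        # fail loudly on 403 and similar
    return extract_company_names(response.text)


# Offline demonstration with a made-up response of the same shape.
sample = '{"msg": {"rows": [{"ptName": "Company A"}, {"ptName": "Company B"}]}}'
print(extract_company_names(sample))   # ['Company A', 'Company B']
```

Calling fetch_company_names() performs the real request; whether it succeeds depends on the site still accepting these headers.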

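Step two: scrape the reason each enterprise was listed. Click any enterprise name: a window pops up, but the URL does not change. Check the Network tab (JS filter) again and a new POST request has appeared. Click it and look at the response: it is exactly the data we want. As in the previous step, open the request URL, but this time the page comes back empty. That is because the request was made without any parameters; back in the developer tools, the Params pane shows that the original request fetches the data by passing parameters. So we test it by passing the same parameters with requests.post, and it works. The complete code is: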

import json
import requests

# Request headers: without them the server does not respond.
# The cookie is copied from a browser session; replace it with your own.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
    "cookie": "yoursessionname1=465A465F8F675B1BFC80EEA0D7828E10-n1; __jsluid=2fbf27051e01ec95326888e1f551a3fd",
}

# Step 1: fetch one page of the list.
url1 = "http://credit.chengdu.gov.cn/homePage/findExceptionList.do"
web1 = requests.get(url1, headers=headers)
d1 = json.loads(web1.text)
mydata = d1["msg"]["rows"]

# Step 2: for each company, POST its identifiers to the pop-up endpoint
# and read the reason out of the JSON response.
tanchu1 = "http://credit.chengdu.gov.cn/homePage/getGovExceptionReason.do"
for i in mydata:
    ptname = i["ptName"]
    c1 = {"id": i["nbxh"], "idno": i["idno"], "ptName": ptname, "appType": "APP001"}
    web2 = requests.post(tanchu1, data=c1, headers=headers)
    d2 = json.loads(web2.text)
    mydata2 = d2["msg"]["base"][0]["items"][0][0]["text"]
    print("{},{}".format(ptname, mydata2))
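One fragile spot in the code is the chained lookup d2["msg"]["base"][0]["items"][0][0]["text"]: if a single company's response lacks one of those levels, the exception stops the whole run. A small sketch of a more forgiving lookup, assuming the response shape used above; extract_reason is a hypothetical helper, not part of the original code, and the payloads are made up.

```python
def extract_reason(payload, default=""):
    """Return the reason text from a decoded getGovExceptionReason.do response.

    Follows the same path as the code above:
    msg -> base[0] -> items[0][0] -> text.
    Returns `default` when any level is missing or empty.
    """
    try:
        return payload["msg"]["base"][0]["items"][0][0]["text"]
    except (KeyError, IndexError, TypeError):
        return default


# Made-up payloads shaped like the response the code above expects.
ok = {"msg": {"base": [{"items": [[{"text": "example reason"}]]}]}}
empty = {"msg": {"base": []}}

print(extract_reason(ok))             # example reason
print(extract_reason(empty, "n/a"))   # n/a
```

Inside the loop this would replace the direct lookup, for instance mydata2 = extract_reason(d2, "n/a").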

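The complete code prints one page of company names together with the reason each was listed. To scrape more pages, look at the URL and at the parameters the page's JavaScript sends: either changing the URL or setting page to 2 in the request parameters works.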
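Looping over pages could look something like the sketch below. Only the parameter name page comes from this post; that it is sent as a POST form field, that no other fields (such as a page size) are required, and the function names are all assumptions to check against the Network tab. The page loop takes the fetching function as an argument, so it can be tried with fake data first.

```python
def iter_rows(fetch_page, max_pages):
    """Yield rows from pages 1..max_pages, stopping early on an empty page.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON of that page (a dict with msg -> rows).
    """
    for page in range(1, max_pages + 1):
        rows = fetch_page(page)["msg"]["rows"]
        if not rows:
            break
        yield from rows


def fetch_page_with_requests(page):
    """Hypothetical fetcher: sends the page number as a form field named 'page'."""
    import requests  # third-party; needs network access

    url = "http://credit.chengdu.gov.cn/homePage/findExceptionList.do"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
    }
    response = requests.post(url, data={"page": page}, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()


# Offline demonstration with a fake two-page data source.
fake_pages = {
    1: {"msg": {"rows": [{"ptName": "Company A"}]}},
    2: {"msg": {"rows": [{"ptName": "Company B"}]}},
}
empty_page = {"msg": {"rows": []}}
names = [row["ptName"] for row in iter_rows(lambda p: fake_pages.get(p, empty_page), 5)]
print(names)   # ['Company A', 'Company B']
```

Against the real site, iter_rows(fetch_page_with_requests, 3) would walk the first three pages, provided the assumptions above hold.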
