2 Answers
Hi, the used-car site has changed the format of the data it returns, so our code has to be updated along with it. This comes up all the time in web scraping, and mastering the debugging approach matters. You can work through the steps below:
1. After extracting city_list with the new regex, print city_list out first. You will see that the entries from the 3rd item onward are the city records, the ones containing "domain".
2. Next, iterate over city_list starting from the 3rd item and print each entry to inspect the structure of one city record. Each entry is a dict whose keys are uppercase letters and whose values are lists; the list elements carry the city id and city name.
3. Keep looping over the dict from the previous step and take the city info out of each value; print the value first to check what it holds.
4. Read the city name out of the value.
5. The complete code is below, followed by a short runnable sketch of just these inspection steps.
import requests
# execjs lets us run the site's anti-crawler JS from Python
import execjs
import re
import json
from guazi_scrapy_project.handle_mongo import mongo

# The page we request to get the city list
url = 'https://www.guazi.com/www/buy'

# Leave the Cookie out of these headers, otherwise the site can use it to
# recognize and block us. When copying headers from the browser, watch out
# for stray spaces and clean them up (e.g. with a regex).
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "www.guazi.com",
    "Referer": "https://www.guazi.com/www/buy",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3610.2 Safari/537.36",
}

response = requests.get(url=url, headers=header)
# print(response.text)
# Set the encoding of the response
response.encoding = 'utf-8'
# The site first serves an "opening, please wait" anti-crawler page
if '正在打开中,请稍后' in response.text:
    # Extract the two arguments the page passes to its anti() JS function
    value_search = re.compile(r"anti\('(.*?)','(.*?)'\);")
    string = value_search.search(response.text).group(1)
    key = value_search.search(response.text).group(2)
    # print(string, key)
    # Read the JS file we extracted from the site
    with open('guazi.js', 'r', encoding='utf-8') as f:
        f_read = f.read()
    # Compile the JS with execjs and call anti() to compute the cookie value
    js = execjs.compile(f_read)
    js_return = js.call('anti', string, key)
    # print(js_return)
    cookie_value = 'antipas=' + js_return
    # print(cookie_value)
    header['Cookie'] = cookie_value
    # print(header)
    # Request the page again, this time carrying the antipas cookie
    response_second = requests.get(url=url, headers=header)
    # print(response_second.text)
    city_search = re.compile(r'({.*?});')
    brand_search = re.compile(r'href="\/www\/(.*?)\/c-1/#bread"\s+>(.*?)</a>')
    city_list = city_search.findall(response_second.text)
    # print(city_list)
    brand_list = brand_search.findall(response_second.text)
    # print(brand_list)
    # City records start at the 3rd match (step 1)
    for city_info in city_list[2:]:
        # print(json.loads(city_info))
        for k, v in json.loads(city_info).items():
            # print(v)
            for city in v:
                # print(city['name'])
                if city['name'] == '北京':  # Beijing
                    for brand in brand_list:
                        info = {}
                        # URL patterns on the site, for reference:
                        # https://www.guazi.com/anqing/buy
                        # https://www.guazi.com/anqing/audi/#bread
                        # https://www.guazi.com/anqing/audi/o1i7/#bread
                        info['task_url'] = 'https://www.guazi.com/' + city['domain'] + '/' + brand[0] + '/' + 'o1i7'
                        info['city_name'] = city['name']
                        info['brand_name'] = brand[1]
                        info['item_type'] = 'list_item'
                        # print(info)
                        # Save the task to MongoDB
                        mongo.save_task('guazi_task', info)
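And here is the promised sketch of inspection steps 1-4 in isolation. It is self-contained and runnable: sample_text is a made-up stand-in for response_second.text, so the two leading objects and the city record only illustrate the shape of the data, they are not real values from the site.

import json
import re

# Made-up stand-in for response_second.text: two unrelated JSON objects
# followed by one city record, mirroring the shape described in steps 1-4.
sample_text = (
    'var config = {"version":1};'
    'var misc = {"flag":0};'
    'var cityA = {"A": [{"id": "21", "domain": "anqing", "name": "安庆"}]};'
)

city_search = re.compile(r'({.*?});')
city_list = city_search.findall(sample_text)
print(city_list)  # step 1: entries from the 3rd one onward contain "domain"

for city_info in city_list[2:]:              # step 2: skip the first two matches
    for letter, cities in json.loads(city_info).items():
        print(cities)                        # step 3: each value is a list of dicts
        for city in cities:
            print(city['name'])              # step 4: the city name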
Keep it up, and happy learning~~~
好帮手慕燕燕
2020-11-06 18:01:12
Hi, the used-car site's JS has been recompiled. You can use the following approach to get the city list:
import json
import requests
# execjs lets us run the site's anti-crawler JS from Python
import execjs
import re
from lxml import etree

url = 'https://www.guazi.com/www/buy'
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "www.guazi.com",
    "Referer": "https://www.guazi.com/www/buy",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3610.2 Safari/537.36",
}

response = requests.get(url=url, headers=header)
response.encoding = 'utf-8'
# The site first serves an "opening, please wait" anti-crawler page
if '正在打开中,请稍后' in response.text:
    value_search = re.compile(r"anti\('(.*?)','(.*?)'\);")
    string = value_search.search(response.text).group(1)
    key = value_search.search(response.text).group(2)
    with open('guazi.js', 'r', encoding='utf-8') as f:
        f_read = f.read()
    js = execjs.compile(f_read)
    js_return = js.call('anti', string, key)
    cookie_value = 'antipas=' + js_return
    header['Cookie'] = cookie_value
    response_second = requests.get(url=url, headers=header)
    # After the recompile, the city data sits in the page's third <script> block
    guazi_html = etree.HTML(response_second.text)
    script_js = guazi_html.xpath("//script[3]/text()")[0]
    city_search = re.compile(r'({.*?});')
    city = city_search.findall(script_js)
    # The cities now come as two JSON objects keyed by the first letter of the
    # city name: letters A-M live in the first object, N-Z in the second
    cityOne = json.loads(city[0])
    cityTwo = json.loads(city[1])
    A_N = [chr(i) for i in range(65, 78)]  # 'A'..'M'
    M_Z = [chr(i) for i in range(78, 91)]  # 'N'..'Z'
    all_city = []
    for i in A_N:
        each_list1 = cityOne.get(i)
        if each_list1:  # keep only letters that actually have cities
            all_city.append(each_list1)
    for i in M_Z:
        each_list2 = cityTwo.get(i)
        if each_list2:
            all_city.append(each_list2)
    brand_search = re.compile(r'href="\/www\/(.*?)/c-1/#bread"\s+>(.*?)</a>')
    brand_list = brand_search.findall(response_second.text)
    for each_list in all_city:
        for city in each_list:
            name = city['domain']
            cityname = city['name']
            if cityname == '北京':  # Beijing
                for brand in brand_list:
                    info = {}
                    info['task_url'] = 'https://www.guazi.com/' + name + '/' + brand[0] + '/' + 'o1i7'
                    info['city_name'] = cityname
                    info['brand_name'] = brand[1]
                    print(info)
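Both answers rely on execjs to run the site's anti() function locally, so it may help to see that workflow in isolation. In this self-contained sketch the JS source is a trivial placeholder, not the real guazi.js; it only demonstrates the compile-then-call pattern:

# pip install PyExecJS; execjs also needs a JS runtime such as Node.js installed
import execjs

# Placeholder standing in for the contents of guazi.js; the real anti()
# computes the antipas cookie value from the two page-supplied arguments.
js_source = """
function anti(string, key) {
    return string + '|' + key;  // demo logic only
}
"""

js = execjs.compile(js_source)
print(js.call('anti', 'abc', '123'))  # -> abc|123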
Keep it up, and happy learning~~~