2 Answers
Hi there! The used-car site has changed its response data format, so our code needs to be updated to match. This happens all the time in crawling, and it is important to master a debugging approach. You can work through the steps below:
1. After extracting city_list with the new regular expression, print city_list first. You will see that the city information starts at the third item; those entries contain the text "domain".

2. Then iterate over city_list starting from the third item and print each entry to inspect its structure. Each one is a dict whose keys are uppercase letters and whose values are lists; each list element holds a city id and a city name.

3. Continue looping over the dict from the previous step and take the city information from its values; print the values first to check the data.

4. Take the city name from the dict values.

5. The complete code is below; a sketch of the imported mongo helper follows after it:
import requests
# execjs lets us evaluate the de-obfuscated js
import execjs
import re
import json
from guazi_scrapy_project.handle_mongo import mongo

# the page we request for the city data
url = 'https://www.guazi.com/www/buy'
# drop the Cookie value from the copied headers, otherwise the site can
# recognize and block us; also watch for stray spaces in the header values
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "www.guazi.com",
    "Referer": "https://www.guazi.com/www/buy",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3610.2 Safari/537.36",
}
response = requests.get(url=url, headers=header)
# print(response.text)
# set the response encoding
response.encoding = 'utf-8'
# '正在打开中,请稍后' is the text of the anti-crawler interstitial page
if '正在打开中,请稍后' in response.text:
    # pull the two anti() arguments out with a regex
    value_search = re.compile(r"anti\('(.*?)','(.*?)'\);")
    string = value_search.search(response.text).group(1)
    key = value_search.search(response.text).group(2)
    # print(string, key)
    # read the de-obfuscated js file
    with open('guazi.js', 'r') as f:
        f_read = f.read()
    # compile the js with execjs, passing in the file contents
    js = execjs.compile(f_read)
    js_return = js.call('anti', string, key)
    # print(js_return)
    cookie_value = 'antipas=' + js_return
    # print(cookie_value)
    header['Cookie'] = cookie_value
    # print(header)
    response_second = requests.get(url=url, headers=header)
    # print(response_second.text)
    city_search = re.compile(r'({.*?});')
    brand_search = re.compile(r'href="\/www\/(.*?)\/c-1/#bread"\s+>(.*?)</a>')
    city_list = city_search.findall(response_second.text)
    # print(city_list)
    brand_list = brand_search.findall(response_second.text)
    # print(brand_list)
    # the city dicts start at the third match (step 1)
    for city_info in city_list[2:]:
        # print(json.loads(city_info))
        # each match is a JSON object: uppercase letter -> list of cities (step 2)
        for k, v in json.loads(city_info).items():
            # print(v)
            for city in v:
                # print(city['name'])
                if city['name'] == '北京':  # Beijing
                    for brand in brand_list:
                        info = {}
                        # url patterns, e.g.:
                        # https://www.guazi.com/anqing/buy
                        # https://www.guazi.com/anqing/audi/#bread
                        # https://www.guazi.com/anqing/audi/o1i7/#bread
                        info['task_url'] = 'https://www.guazi.com/' + city['domain'] + '/' + brand[0] + '/' + 'o1i7'
                        info['city_name'] = city['name']
                        info['brand_name'] = brand[1]
                        info['item_type'] = 'list_item'
                        # print(info)
                        # save the task to mongodb
                        mongo.save_task('guazi_task', info)
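For reference, the mongo helper imported at the top (guazi_scrapy_project.handle_mongo) belongs to the course project and is not shown in this answer. Here is a minimal sketch of what it might look like, assuming pymongo and a MongoDB instance on localhost; the class and database names are assumptions, and only the exported mongo object and its save_task method are taken from the code above:

# handle_mongo.py -- a sketch only; adjust host/port/db to your setup
import pymongo

class ConnectMongo(object):  # hypothetical class name
    def __init__(self):
        # connection details are assumptions
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.db = self.client['guazi']  # assumed database name

    def save_task(self, collection_name, item):
        # each task dict is stored as one document in the named collection
        self.db[collection_name].insert_one(item)

mongo = ConnectMongo()

Also note that execjs only wraps an external JavaScript runtime, so a runtime such as Node.js must be installed for js.call('anti', string, key) to work.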
Keep it up, and happy learning~~~
好帮手慕燕燕
2020-11-06 18:01:12
Hi there! The used-car site's js has been recompiled. You can use the approach below to get the city list:
import json
import requests
# execjs lets us evaluate the de-obfuscated js
import execjs
import re
from lxml import etree

url = 'https://www.guazi.com/www/buy'
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "www.guazi.com",
    "Referer": "https://www.guazi.com/www/buy",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3610.2 Safari/537.36",
}
response = requests.get(url=url, headers=header)
response.encoding = 'utf-8'
# '正在打开中,请稍后' is the text of the anti-crawler interstitial page
if '正在打开中,请稍后' in response.text:
    value_search = re.compile(r"anti\('(.*?)','(.*?)'\);")
    string = value_search.search(response.text).group(1)
    key = value_search.search(response.text).group(2)
    with open('guazi.js', 'r', encoding="utf-8") as f:
        f_read = f.read()
    js = execjs.compile(f_read)
    js_return = js.call('anti', string, key)
    cookie_value = 'antipas=' + js_return
    header['Cookie'] = cookie_value
    response_second = requests.get(url=url, headers=header)
    guazi_html = etree.HTML(response_second.text)
    # after the recompile, the city data sits in the third <script> tag
    script_js = guazi_html.xpath("//script[3]/text()")[0]
    city_search = re.compile(r'({.*?});')
    city = city_search.findall(script_js)
    # the city dict is now split across two JSON objects
    cityOne = json.loads(city[0])
    cityTwo = json.loads(city[1])
    # initial letters A..M (codes 65-77) and N..Z (codes 78-90)
    A_N = [chr(i) for i in range(65, 78)]
    M_Z = [chr(i) for i in range(78, 91)]
    all_city = []
    # cities starting with A-M come from the first object ...
    for i in A_N:
        each_list1 = cityOne.get(i)
        if each_list1:
            all_city.append(each_list1)
    # ... and cities starting with N-Z from the second
    for i in M_Z:
        each_list2 = cityTwo.get(i)
        if each_list2:
            all_city.append(each_list2)
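    # (Optional check, not in the original answer: print the key sets to
    # confirm the A-M / N-Z split before relying on it.)
    # print(sorted(cityOne.keys()), sorted(cityTwo.keys()))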
    brand_search = re.compile(r'href="\/www\/(.*?)/c-1/#bread"\s+>(.*?)</a>')
    brand_list = brand_search.findall(response_second.text)
    for each_list in all_city:
        for city in each_list:
            name = city['domain']
            cityname = city['name']
            if cityname == '北京':  # Beijing
                for brand in brand_list:
                    info = {}
                    info['task_url'] = 'https://www.guazi.com/' + name + '/' + brand[0] + '/' + 'o1i7'
                    info['city_name'] = cityname
                    info['brand_name'] = brand[1]
                    print(info)

Keep it up, and happy learning~~~