Why is my crawling speed so slow?
```python
# -*- coding: utf-8 -*-

# Scrapy settings for guazi_scrapy_project project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = 'guazi_scrapy_project'

# Spider modules
SPIDER_MODULES = ['guazi_scrapy_project.spiders']
NEWSPIDER_MODULE = 'guazi_scrapy_project.spiders'

# Set the default request headers
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

# Usually set to False.
# robots.txt is set by the site: if some content is marked as not crawlable,
# True means Scrapy obeys that rule and skips it. Since we want the data,
# we obviously can't follow the protocol.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Number of concurrent requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 5

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Delay between requests, usually 0.1-0.5
DOWNLOAD_DELAY = 0.1
# The download delay setting will honor only one of:
# Maximum concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 100
# Maximum concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 100

# Timeout, default 180 seconds
DOWNLOAD_TIMEOUT = 10

# Whether to retry, and how many times
# Failed requests end up in errback
RETRY_ENABLED = True
RETRY_TIMES = 5

# Whether to enable cookies, usually False
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Telnet console, enabled by default
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Default request headers
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Connection": "keep-alive",
    "Host": "www.guazi.com",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
}

# Spider middlewares, usually not needed
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'guazi_scrapy_project.middlewares.GuaziScrapyProjectSpiderMiddleware': 543,
#}

# Downloader middlewares
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'guazi_scrapy_project.middlewares.GuaziScrapyProjectDownloaderMiddleware': 543,
    'guazi_scrapy_project.middlewares.guazi_downloader_middleware': 500,
    'guazi_scrapy_project.middlewares.my_useragent': 600,
    # 'guazi_scrapy_project.middlewares.my_proxy': 502,
}

# Extensions
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Item pipelines; enable this when a pipeline is defined
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'guazi_scrapy_project.pipelines.GuaziScrapyProjectPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 0
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 30
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 100
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# HTTP cache, disabled by default
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
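A note on how these limits interact (this explanation is mine, not part of the original post; every name below is a documented Scrapy setting): CONCURRENT_REQUESTS is a global ceiling, so the per-domain and per-IP values of 100 can never take effect while the global cap is 5. Because CONCURRENT_REQUESTS_PER_IP is non-zero, it is used instead of the per-domain limit and delays are enforced per IP. With AutoThrottle enabled, Scrapy recomputes each download slot's delay from observed latency (roughly latency / AUTOTHROTTLE_TARGET_CONCURRENCY), clamps it between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY, and non-200 responses never decrease it. A minimal annotated sketch:

```python
# A minimal sketch (mine, not the poster's) of the ceilings implied by the
# settings above; all names are documented Scrapy settings.

CONCURRENT_REQUESTS = 5              # global cap: at most 5 requests in flight
CONCURRENT_REQUESTS_PER_IP = 100     # non-zero, so it replaces the per-domain
                                     # limit -- but the global cap of 5 still wins

DOWNLOAD_DELAY = 0.1                 # lower bound for AutoThrottle's delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0         # effective start delay is
                                     # max(DOWNLOAD_DELAY, this) = 0.1 s
AUTOTHROTTLE_MAX_DELAY = 30          # upper bound: up to ~30 s between requests
                                     # per slot when responses are slow or failing
AUTOTHROTTLE_TARGET_CONCURRENCY = 100  # target delay ~= latency / 100, so the
                                       # delay shrinks toward DOWNLOAD_DELAY only
                                       # while responses come back successfully
```

Under those rules, a slot that accumulates timeouts and retries (DOWNLOAD_TIMEOUT = 10, RETRY_TIMES = 5) can sit near the 30 s ceiling, i.e. on the order of a hundred-odd requests per hour per slot, which is at least consistent with the throughput described in the question below; that is an inference, not something stated in the thread.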
This is my settings file. I've already tuned the concurrency and the download delay, but the spider only crawls a few hundred items per hour. Why is that?
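A possible first diagnostic step (my suggestion, not from the original thread): the file above already has AUTOTHROTTLE_DEBUG commented out; enabling it logs the observed latency and the delay AutoThrottle actually applies for every response, which shows directly whether throttling is eating the throughput:

```python
# Diagnostic sketch -- both are standard Scrapy settings.
AUTOTHROTTLE_DEBUG = True   # log observed latency and the applied delay
                            # for every response received
LOG_LEVEL = 'INFO'          # AutoThrottle's debug output is emitted at INFO,
                            # so this keeps the rest of the log quiet
```

The stats dump printed when the spider closes (e.g. retry/count and downloader/response_status_count/*) is also worth comparing against the item count to see how many requests are being retried or rejected.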