
Solving the proxy IP problem: getting your IP banned for crawling too often

Published: 2019-11-22 00:00:00

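Preface

Anyone who writes a Python web crawler goes through the same cycle: build the crawler, get blocked, then work around the block. After that the site tightens its restrictions and the crawler works around them again, an arms race that never really ends. In the early stages, adding headers and proxy IPs solves many problems.

When I was crawling Douban Books, my IP was banned outright because I was sending requests too often. After that I looked into proxy IPs.

(At the time I had no idea what had gone wrong and nearly lost it...) Below I describe how I used proxy IPs to crawl data. Please point out anything I got wrong.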


The problem

My IP got banned. Everything had been working fine at first, so I assumed the problem was in my code.


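Approach:

After searching online for material on proxy IPs for crawlers, I settled on the following approach:

Crawl a batch of proxy IPs and filter out the ones that do not work.

Pass a working IP in the proxies parameter of the requests call.

Resume crawling.

Done.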
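The second step above can be sketched as follows. This is a minimal example that assumes a plain-HTTP proxy; the helper name `build_proxies` and the address 127.0.0.1:8080 are placeholders for illustration and do not come from the original script.

```python
def build_proxies(ip_address, ip_port):
    """Return a proxies dict in the shape that requests expects."""
    proxy_url = 'http://%s:%s' % (ip_address, ip_port)
    # Route both http:// and https:// targets through the same proxy
    return {'http': proxy_url, 'https': proxy_url}


proxies = build_proxies('127.0.0.1', '8080')  # placeholder address
print(proxies)

# Passing the dict to requests would then look like this (not run here):
# import requests
# r = requests.get(url, proxies=proxies, timeout=3)
```

The keys of the dict name the scheme of the target URL, while the value is the address of the proxy itself.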

Fine, that was all talk, and everyone knows the theory. On to the code...

With the approach settled, let's get started.

Environment

Python 3.7, PyCharm

You will need to set this environment up yourself...

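Preparation

A site to crawl proxy IPs from (domestic high-anonymity proxies)

A site for checking the IP address

The crawler script that got your IP banned...

Choose the sites above to suit your own situation.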

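Full code for crawling the IPs

PS: This simply uses bs4 to extract the IP address and port, nothing difficult. It also adds logic for filtering out unusable IPs.

The key parts are commented.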

import json

import requests
from bs4 import BeautifulSoup


class GetIp(object):
    """Crawl proxy IPs and keep the ones that work."""

    def __init__(self):
        """Initialise variables."""
        self.url = 'http://www.xicidaili.com/nn/'
        self.check_url = 'https://www.ip.cn/'
        self.ip_list = []

    @staticmethod
    def get_html(url):
        """Request a page and return its HTML, or '' if the request fails."""
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
        try:
            response = requests.get(url=url, headers=header)
            response.encoding = 'utf-8'
            return response.text
        except requests.RequestException:
            return ''

    def get_available_ip(self, ip_address, ip_port):
        """Check whether a proxy IP is usable and record it if so."""
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
        # The scheme in the proxy URL is how we talk to the proxy itself.
        # Assumption: the crawled proxies speak plain HTTP, so 'http://' is
        # used for both keys (the original used 'https://' for the second).
        proxy_url = 'http://' + ip_address + ':' + ip_port
        proxies = {'http': proxy_url, 'https': proxy_url}
        try:
            r = requests.get(self.check_url, headers=header, proxies=proxies, timeout=3)
            html = r.text
        except requests.RequestException:
            print('fail-%s' % ip_address)
        else:
            print('success-%s' % ip_address)
            soup = BeautifulSoup(html, 'lxml')
            div = soup.find(class_='well')
            if div:
                print(div.text)
            ip_info = {'address': ip_address, 'port': ip_port}
            self.ip_list.append(ip_info)

    def main(self):
        """Main method."""
        web_html = self.get_html(self.url)
        soup = BeautifulSoup(web_html, 'lxml')
        table = soup.find(id='ip_list')
        if table is None:
            # The request failed or the page layout changed
            print('ip_list table not found')
            return
        for ip_info in table.find_all('tr'):
            td_list = ip_info.find_all('td')
            if len(td_list) > 0:
                ip_address = td_list[1].text
                ip_port = td_list[2].text
                # Check whether the proxy IP works
                self.get_available_ip(ip_address, ip_port)
        # Write the working proxies to a file
        with open('ip.txt', 'w') as file:
            json.dump(self.ip_list, file)
        print(self.ip_list)


# Program entry point
if __name__ == '__main__':
    get_ip = GetIp()
    get_ip.main()
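Checking proxies one at a time is slow, because every dead proxy costs the full 3-second timeout. One possible speed-up, which is not part of the original script, is to run the checks in a thread pool. The sketch below assumes a hypothetical `check` callable that returns True for a working proxy; here it is replaced by a stub so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_proxies(candidates, check, max_workers=10):
    """Return the candidates for which check(candidate) is true, in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the checks concurrently and preserves input order
        results = list(pool.map(check, candidates))
    return [c for c, ok in zip(candidates, results) if ok]


def fake_check(proxy):
    """Stub standing in for a real request made through the proxy."""
    return proxy['port'] == '8080'


candidates = [
    {'address': '10.0.0.1', 'port': '8080'},
    {'address': '10.0.0.2', 'port': '3128'},
    {'address': '10.0.0.3', 'port': '8080'},
]
working = filter_proxies(candidates, fake_check)
print(working)
```

In a real run, `check` would wrap a request like the one in get_available_ip and return a boolean instead of appending to a shared list.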

Full code for using the proxies

PS: The key is to crawl with a randomly chosen IP and use request_status to decide whether that IP is usable.
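The listing for this step is not present on the page. What follows is only a sketch of the idea described above, assuming that ip.txt holds the JSON list written by the crawling script and that "request_status" means the HTTP status code. The function names and the injectable `get` parameter are made up for illustration.

```python
import json
import random


def load_proxies(path='ip.txt'):
    """Load the list of {'address': ..., 'port': ...} dicts saved as JSON."""
    with open(path) as file:
        return json.load(file)


def fetch_with_random_proxy(url, ip_list, get=None, max_tries=5):
    """Try random proxies until one returns HTTP 200; return the response or None.

    Proxies that fail are removed from ip_list so they are not picked again.
    """
    if get is None:
        import requests  # imported lazily so the stub below needs no network
        get = requests.get
    for _ in range(max_tries):
        if not ip_list:
            break
        ip_info = random.choice(ip_list)
        proxy_url = 'http://%s:%s' % (ip_info['address'], ip_info['port'])
        proxies = {'http': proxy_url, 'https': proxy_url}
        try:
            response = get(url, proxies=proxies, timeout=3)
        except Exception:
            response = None  # connection error or timeout: treat as a bad proxy
        if response is not None and response.status_code == 200:
            return response
        ip_list.remove(ip_info)  # drop the proxy that failed
    return None


class FakeResponse:
    """Minimal stand-in for a response object."""

    def __init__(self, status_code):
        self.status_code = status_code


def fake_get(url, proxies=None, timeout=None):
    """Stub: only the proxy on port 8080 'works'."""
    return FakeResponse(200 if proxies['http'].endswith(':8080') else 403)


pool = [{'address': '10.0.0.1', 'port': '3128'},
        {'address': '10.0.0.2', 'port': '8080'}]
result = fetch_with_random_proxy('http://example.com/', pool, get=fake_get)
print(result.status_code, pool)
```

In real use the pool would come from `load_proxies()` and the `get` argument would be left out, so that requests.get performs the request.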
