Ryan Stefan's Micro Blog

Handling Public Proxies with Scrapy Quickly - Remove Bad on N Failures

Jan 31, 2019

You can catch 404s and connection errors by passing an errback= callback to the scrapy.Request object. From there I grab the failed proxy out of the request meta and append it to a list of failed proxies inside the ProxyEngine class. If a proxy shows up in the failed list N times, it gets removed with the ProxyEngine.remove_bad() method. I also discovered that passing download_timeout in the request meta works a lot better than setting it in the Spider's global settings. Now the spider doesn't hang on slow or broken proxies and is much faster.
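For reference, the difference looks something like this stripped-down sketch (the spider name, URL, and callback names are placeholders; the real spider is further down):

import scrapy

class TimeoutDemo(scrapy.Spider):
    name = 'timeout_demo'
    # Global approach: one timeout for every request the spider makes
    # custom_settings = {'DOWNLOAD_TIMEOUT': 5}

    def start_requests(self):
        # Per-request approach: timeout, retry behaviour, and proxy all live in meta,
        # and errback= routes 404s, DNS failures, and timeouts to a handler
        yield scrapy.Request(
            'https://example.com',
            callback=self.parse_page,
            errback=self.on_error,
            meta={'download_timeout': 5, 'dont_retry': True},
        )

    def parse_page(self, response):
        self.logger.info('status %s', response.status)

    def on_error(self, failure):
        self.logger.info('request failed: %r', failure)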

Next I plan to refactor the ProxyEngine data to track attempts per domain, so I can catch proxies that have been banned by one site but not others (there's a rough sketch of that idea at the bottom of this post). I also need to feed bad_proxies back into the request generator after they've sat out for some set amount of time, and save all of the proxy data to a database. Here's the code:

Proxy Engine

class ProxyEngine:
    def __init__(self, limit=3):
        self.proxy_list = []        # proxies currently in rotation
        self.bad_proxies = []       # proxies pulled after too many failures
        self.good_proxies = []
        self.failed_proxies = []    # one entry per failed request, appended by the spider
        self.limit = limit          # failures allowed before a proxy is considered bad

    def get_new(self, file='./proxies.txt'):
        # Load ip:port pairs from disk and add any we haven't seen as https:// URLs
        with open(file, 'r') as f:
            new_proxies = [f'https://{line.strip()}' for line in f if line.strip()]
        for proxy in new_proxies:
            if proxy not in self.proxy_list and proxy not in self.bad_proxies:
                self.proxy_list.append(proxy)

    def remove_bad(self):
        # Move any proxy that has failed `limit` or more times to bad_proxies
        for proxy in self.proxy_list:
            if self.failed_proxies.count(proxy) >= self.limit and proxy not in self.bad_proxies:
                self.bad_proxies.append(proxy)
        # Rebuild proxy_list instead of removing items while iterating over it
        self.proxy_list = [p for p in self.proxy_list if p not in self.bad_proxies]
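A quick usage sketch, assuming a proxies.txt next to the script with one ip:port per line (the proxy address below is made up for the demo):

prox = ProxyEngine(limit=3)
prox.get_new()                                        # load ./proxies.txt as https:// URLs
prox.proxy_list.append('https://1.2.3.4:8080')        # made-up proxy, just for the demo
prox.failed_proxies += ['https://1.2.3.4:8080'] * 3   # pretend it timed out 3 times
prox.remove_bad()                                     # now it's in bad_proxies, out of rotation
print(prox.proxy_list)
print(prox.bad_proxies)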

Proxy Spider

import scrapy
from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError, TCPTimedOutError


class ProxyTest(Spider):
    name = 'proxy_test'
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ProxyPipeline': 400
        },
        'CONCURRENT_REQUESTS_PER_IP': 2,
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.prox = ProxyEngine(limit=20)

    def start_requests(self):
        self.prox.get_new()
        for proxy in self.prox.proxy_list:
            request = scrapy.Request("https://dashwood.net/post/python-3-new-string-formatting/456ft",
                                     callback=self.get_title, errback=self.get_error, dont_filter=True)
            # Per-request proxy and timeout via meta, so one slow proxy can't stall the crawl
            request.meta['proxy'] = proxy
            request.meta['dont_retry'] = True
            request.meta['download_timeout'] = 5
            yield request

    def get_title(self, response):
        print(response.status)
        print('*' * 15)

    def get_error(self, failure):
        if failure.check(HttpError):
            response = failure.value.response
            print("HttpError occurred", response.status)
            print('*' * 15)

        elif failure.check(DNSLookupError):
            request = failure.request
            print("DNSLookupError occurred on", request.url)
            print('*' * 15)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.prox.failed_proxies.append(request.meta["proxy"])
            print("TimeoutError occurred", request.meta)
            print('*' * 15)

        else:
            request = failure.request
            print("Other Error", request.meta)
            print(f'Proxy: {request.meta["proxy"]}')
            self.prox.failed_proxies.append(request.meta["proxy"])
            print('Failed:', self.prox.failed_proxies)
            print('*' * 15)
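The ITEM_PIPELINES setting points at a ProxyPipeline that isn't included in this post, so to run the spider as-is from a single file you'd need at least a stub for it; something like this, using CrawlerProcess:

from scrapy.crawler import CrawlerProcess


class ProxyPipeline:
    # Stub so '__main__.ProxyPipeline' resolves; the real pipeline isn't shown in this post
    def process_item(self, item, spider):
        return item


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(ProxyTest)
    process.start()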

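And a rough sketch of the per-domain bookkeeping mentioned above; nothing here is wired into the spider yet, and the class and parameter names are made up for illustration:

import time
from collections import defaultdict


class ProxyStats:
    """Track failures per (proxy, domain) so a ban on one site doesn't bench the proxy everywhere."""

    def __init__(self, limit=3, cooldown=600):
        self.failures = defaultdict(int)   # (proxy, domain) -> failure count
        self.benched = {}                  # (proxy, domain) -> timestamp it was benched
        self.limit = limit
        self.cooldown = cooldown           # seconds a benched proxy sits out

    def record_failure(self, proxy, domain):
        key = (proxy, domain)
        self.failures[key] += 1
        if self.failures[key] >= self.limit:
            self.benched[key] = time.time()   # benched for this domain only

    def usable(self, proxy, domain):
        key = (proxy, domain)
        benched_at = self.benched.get(key)
        if benched_at is not None:
            if time.time() - benched_at < self.cooldown:
                return False
            # Cooldown expired: un-bench and give the proxy a clean slate on this domain
            del self.benched[key]
            self.failures[key] = 0
        return self.failures[key] < self.limit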