Character encoding errors after using a proxy with Scrapy

dkaola Registered member
2023-02-28 19:25


This error occurs because the proxy IP string contains non-ASCII characters, and the encoding Scrapy uses for it by default is ASCII-compatible, so the string cannot be converted to bytes. The problem can be solved by specifying a UTF-8 encoding in the middleware and hinting the encoding in the request headers, as follows:

import random

import requests
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class RandomProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, auth_encoding='utf-8', proxy_list=None):
        # Pass a UTF-8 auth_encoding to the base class so proxy
        # credentials with non-ASCII characters can still be encoded
        super().__init__(auth_encoding)
        self.proxies = proxy_list or []

    @classmethod
    def from_crawler(cls, crawler):
        # Read the proxy pool from the PROXY setting
        return cls(auth_encoding='utf-8',
                   proxy_list=crawler.settings.getlist('PROXY'))

    def check_proxy(self, proxy):
        # Probe the proxy with a 3-second timeout
        try:
            requests.get('https://www.eastmoney.com/',
                         proxies={'http': proxy}, timeout=3)
            return True
        except requests.RequestException:
            return False

    def process_request(self, request, spider):
        # Pick a random proxy from the pool
        proxy = random.choice(self.proxies)
        if self.check_proxy(proxy):
            print('Current proxy IP:', proxy)
            request.meta['proxy'] = proxy
            # Specify the expected encoding in the request headers
            request.headers.setdefault('Accept-Encoding', 'gzip, deflate')
            request.headers.setdefault('Content-Type', 'text/html; charset=utf-8')
        else:
            # Retry with another random proxy
            self.process_request(request, spider)


The constructor passes a UTF-8 auth_encoding to the base class, and the process_request method adds the header lines that specify the encoding, so the UnicodeEncodeError is avoided.
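For the middleware to take effect, it also has to be registered in the project settings. A minimal sketch of settings.py, assuming the class lives in a hypothetical myproject.middlewares module (the module path, priority value, and proxy addresses are all placeholders to adapt):

```python
# settings.py -- 'myproject.middlewares' is an assumed module path;
# adjust it to wherever RandomProxyMiddleware actually lives
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in proxy middleware so the custom one runs instead
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'myproject.middlewares.RandomProxyMiddleware': 750,
}

# Placeholder proxy pool read by from_crawler
PROXY = [
    'http://127.0.0.1:8080',
    'http://127.0.0.1:8081',
]
```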

donglei55 Registered member
2023-02-28 19:25

In GPT's view and my own, this error is probably caused by non-ASCII characters in the proxy IP: Scrapy encodes the string with 'ascii', hence the Unicode encoding error. You can try encoding the proxy IP string with another codec, such as 'utf-8' or 'gbk'.
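The difference between the two codecs can be reproduced outside Scrapy. A quick sketch, using a made-up proxy string that contains a non-ASCII character:

```python
# Hypothetical proxy string with a non-ASCII character in the credentials
proxy = 'http://user:密码@127.0.0.1:8080'

try:
    proxy.encode('ascii')        # what an ASCII codec would attempt
except UnicodeEncodeError as e:
    print('ascii fails:', e.reason)

# utf-8 round-trips the same string without error
encoded = proxy.encode('utf-8')
print(encoded.decode('utf-8') == proxy)  # → True
```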

You can modify the part that reads the proxy IP, for example:

import random

import requests
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class RandomProxyMiddleware(HttpProxyMiddleware):

    def __init__(self, auth_encoding='utf-8', proxy_list=None):
        super().__init__(auth_encoding)
        # Encode each proxy IP as 'utf-8' bytes up front
        self.proxies = [p.encode('utf-8') for p in (proxy_list or [])]

    def process_request(self, request, spider):
        # Pick a random proxy and decode it back to a str before use
        proxy = random.choice(self.proxies).decode('utf-8')
        # Check whether the proxy is usable
        if self.check_proxy(proxy):
            print('Current proxy IP:', proxy)
            request.meta['proxy'] = proxy
        else:
            # Retry with another random proxy
            self.process_request(request, spider)

    def check_proxy(self, proxy):
        # Probe the proxy with a 3-second timeout
        try:
            requests.get('https://www.eastmoney.com/',
                         proxies={'http': proxy}, timeout=3)
            return True
        except requests.RequestException:
            return False

In this example, the proxy IPs are encoded as 'utf-8' when stored and decoded back as they are read and used. Doing so should avoid the Unicode encoding error.
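Both suggested codecs round-trip Chinese characters losslessly, even though they produce different byte sequences. A quick sketch (the sample string is made up):

```python
s = '代理IP'                    # hypothetical non-ASCII proxy label
utf8_bytes = s.encode('utf-8')  # 3 bytes per Chinese character
gbk_bytes = s.encode('gbk')     # 2 bytes per Chinese character
print(utf8_bytes != gbk_bytes)  # → True: the encodings differ
# Either codec decodes back to the original string
print(utf8_bytes.decode('utf-8') == s and gbk_bytes.decode('gbk') == s)  # → True
```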