
The Python crawler runs successfully but outputs no data

dboy2233 (Registered member)
2023-02-28 04:20

Just change this:

    # Find the tags containing the article information
    article_tags = soup.find_all("div", class_="docsum-content")

    # Extract each article's title and link
    results = []
    for tag in article_tags:
        title_tags = tag.find_all('a', class_='docsum-title')
        if title_tags:
            title = title_tags[0].get_text().strip()
            link = 'https://pubmed.ncbi.nlm.nih.gov' + title_tags[0]['href']
            results.append((title, link))
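
(For context: in PubMed's result markup the docsum-title class appears to sit on the <a> element itself, so the original selector '.docsum-title > a', which looks for a link inside an element of that class, matches nothing.)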



Making fun of the abuse of ChatGPT:

These answers do not solve the problem at all. They read like boilerplate and were probably generated automatically through the ChatGPT API. Thank you.

yantafeizei (Registered member)
2023-02-28 04:20

There are several possible reasons why your first piece of code returned no results while the second did.

One reason may be that the first version is missing key information contained in the second, such as specifying the number of pages to crawl.
Another possibility is that the site has changed since the code last worked, in which case you need to update the crawler accordingly.

In addition, the site may have taken steps to prevent crawling, such as rate limiting or blocking IP addresses, which can stop your code from accessing the data.

If you suspect the problem is the site's security measures, you can try a crawling library or tool that helps you work around them, such as Scrapy or Selenium; see the sketch below.
You may also want to check the site's terms of service to make sure your crawling activity does not violate its policies.
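
For instance, a minimal Selenium sketch (an illustration under assumptions: Chrome plus the selenium package are installed, and the a.docsum-title selector still matches PubMed's result markup as discussed elsewhere in this thread):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'

    # A real browser renders the page, which sidesteps some anti-bot checks
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        for a in driver.find_elements(By.CSS_SELECTOR, 'a.docsum-title'):
            print(a.text.strip(), a.get_attribute('href'))
    finally:
        driver.quit()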

dxwwdm (Registered member)
2023-02-28 04:20

Referring to GPT and my own thinking: based on the code you provided, I don't see any obvious problems. It could be that your output is redirected somewhere else, or that the output is simply never printed. Try running the script in a terminal to check the output, or write the output to a file, for example:

import requests
from bs4 import BeautifulSoup

url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
}

def get_articles(url):
    # Send the HTTP request and fetch the page
    response = requests.get(url, headers=headers)
    html = response.text

    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # Find the tags containing the article information
    article_tags = soup.select('.docsum-content')

    # Extract each article's title and link
    results = []
    for tag in article_tags:
        title_tags = tag.select('.docsum-title > a')
        if title_tags:
            title = title_tags[0].get_text().strip()
            link = 'https://pubmed.ncbi.nlm.nih.gov' + title_tags[0]['href']
            results.append((title, link))

    return results

if __name__ == '__main__':
    with open('output.txt', 'w', encoding='utf-8') as f:
        for page in range(1, 6):
            page_url = f'{url}&page={page}'
            articles = get_articles(page_url)
            f.write(f'Page {page}: {page_url} ({len(articles)} articles found)\n')
            for article in articles:
                f.write(article[0]+'\n')
                f.write(article[1]+'\n')
                f.write('---\n')

This code writes the output to a text file called output.txt, which you can check to see if the output is correct.

dboywwy (Registered member)
2023-02-28 04:20

This answer quotes ChatGPT

Reply to me if in doubt.

There are many reasons why the data may not be output. Here are some possible causes of the problem:

1. The web server's anti-crawler mechanisms, such as IP blocking or rejection of illegal request headers, may cause your request to be refused by the server or redirected to another page.

2. Network problems, such as an unstable network or overly frequent requests, may cause requests to fail or be throttled by the server.

3. The HTML code is not parsed correctly, so the required information cannot be extracted.

4. The website has been updated, changing the HTML structure or the element selectors, so your crawler no longer parses it correctly.

5. The website simply has little data, so the crawl finds nothing.

To determine the cause of the problem, you can try some debugging techniques:

1. Use a browser to view the original page source code and check that your crawler has extracted the required elements correctly.

2. Use the response object of the requests library to view the HTTP status code returned by the server to determine whether the request was successful.

3. Adjust the header information and try to access the website disguised as a browser.

4. Adjust the request frequency to prevent server constraints caused by frequent requests.

5. Use proxy IP addresses to prevent IP addresses from being blocked by the server.

When debugging, you can start with a small data set, check that the program is extracting the required data correctly, and gradually expand the data set, eventually determining the cause of the problem and fixing it.
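
As a quick sketch of techniques 2 and 3 above (checking the status code and sending a browser-like User-Agent; the URL is the one from the question):

    import requests

    url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
    }

    response = requests.get(url, headers=headers, timeout=10)
    print(response.status_code)   # 200 means the server accepted the request
    print(len(response.text))     # a very small body often indicates a block page
    response.raise_for_status()   # raises HTTPError for 4xx/5xx responses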

lianmaocc1 (Registered member)
2023-02-28 04:20

Change this line of code:
```python
title_tags = tag.select('.docsum-title > a')
```
to:
```python
title_tags = tag.select('.docsum-title')
```

fcbzyy (Registered member)
2023-02-28 04:20

This answer partially references GPT (GPT_Pro) to better solve the problem.
The problem may be caused by how your crawler parses the page. BeautifulSoup takes a features argument that selects the parser for the HTML document: html.parser, lxml, or html5lib. Try setting it to lxml:

    soup = BeautifulSoup(html, features='lxml')

The HTML document must be available as a string before BeautifulSoup can parse it; the text attribute of the requests response provides that:

    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, features='lxml')

Badly malformed HTML, for example nested div tags whose start tags do not match their end tags, can confuse a parser, which is another reason to prefer a lenient parser such as lxml or html5lib. You also need to import requests and BeautifulSoup before using them:

    import requests
    from bs4 import BeautifulSoup

Finally, note that BeautifulSoup accepts a few other keyword arguments, such as features, from_encoding, and parse_only. Of course, the most important thing is to define correct headers when requesting the HTML document:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    }
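
Putting those pieces together, a minimal sketch (note that lxml must be installed separately, e.g. with pip install lxml):

    import requests
    from bs4 import BeautifulSoup

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    }

    response = requests.get('https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment', headers=headers)
    soup = BeautifulSoup(response.text, features='lxml')  # parse with lxml
    print(soup.title.get_text())  # quick sanity check that the page parsed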

If the answer is helpful, please accept it.

civilnainiu (Registered member)
2023-02-28 04:20


Change it to title_tags = tag.select('a'), i.e. iterate over the a tags directly. Because article_tags = soup.select('.docsum-content') has already narrowed the scope to the specific div, we only need to look at the a tags beneath it.

import requests
from bs4 import BeautifulSoup

url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
}


def get_articles(url):
    # Send the HTTP request and fetch the page
    response = requests.get(url, headers=headers)
    html = response.text

    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # Find the tags containing the article information
    article_tags = soup.select('.docsum-content')
    # print('article_tags : ', article_tags)

    # Extract each article's title and link
    results = []
    for tag in article_tags:
        title_tags = tag.select('a')
        if title_tags:
            title = title_tags[0].get_text().strip()
            link = 'https://pubmed.ncbi.nlm.nih.gov' + title_tags[0]['href']
            results.append((title, link))

    return results


if __name__ == '__main__':
    for page in range(1, 6):
        page_url = f'{url}&page={page}'
        articles = get_articles(page_url)
        print(f'Page {page}: {page_url} ({len(articles)} articles found)')
        for article in articles:
            print(article[0])
            print(article[1])
            print('---')
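
One caveat with tag.select('a'): it returns every link inside the docsum block, and taking title_tags[0] works only because the title link happens to come first in PubMed's markup; if that ordering ever changes, the more specific selectors shown in the other answers are safer.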

tony_nc (Registered member)
2023-02-28 04:20

Based on Monster group and GPT:

In the first piece of code:
  • The page parameter loops from 1, but the page numbering of PubMed search results starts at 0, so you should change the page loop to range(0, 6).

You need to send headers when requesting the PubMed search results page, otherwise the server may reject the connection. In the first piece of code you defined headers but never used them; pass headers as an argument to requests.get when sending the request, e.g.

response = requests.get(url, headers=headers)


You can also set the headers globally on a requests.Session:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
    }
    session = requests.Session()           # a Session shares headers across requests
    session.headers.update(headers)
    response = session.get(url)


In the first piece of code, extracting the article tags with .select('.docsum-content') is risky because the tag's class name may change. The correct approach is to use .find_all('div', {'class': 'docsum-content'}). Similarly, change the extraction of article titles and links to .find('a', {'class': 'docsum-title'}), as sketched below.
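
A self-contained sketch of that find_all/find approach (same page-structure assumptions as the rest of this thread):

    import requests
    from bs4 import BeautifulSoup

    url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
    }

    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

    # find_all/find version of the title-and-link extraction
    results = []
    for tag in soup.find_all('div', {'class': 'docsum-content'}):
        a = tag.find('a', {'class': 'docsum-title'})
        if a:
            results.append((a.get_text().strip(),
                            'https://pubmed.ncbi.nlm.nih.gov' + a['href']))
    print(f'{len(results)} articles found')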
