
Write a crawler that performs a depth-first crawl of www.baidu.com. The expected output is:

```
Add the seeds url ['http://www.baidu.com'] to the unvisited url list
Pop out one url "http://www.baidu.com" from unvisited url list
Get 10 new links
Visited url count: 1
Visited deepth: 1
10 unvisited links:
Pop out one url "http://news.baidu.com" from unvisited url list
Get 52 new links
Visited url count: 2
Visited deepth: 2
Pop out one url "http://www.hao123.com" from unvisited url list
Get 311 new links
Visited url count: 3
Visited deepth: 2
Pop out
```

Below is an example Python crawler that performs a depth-first crawl of www.baidu.com and reproduces the output messages you described. It fetches pages with requests, extracts new links with BeautifulSoup, and tracks the visit count and depth. The crawl is capped by a maximum visit count and depth and only prints the core progress messages, so it demonstrates the approach without putting heavy load on the real site (an offline dry-run sketch is also included after the notes below).

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse


class DepthFirstCrawler:
    def __init__(self, seed_url, max_depth=3, max_visits=20):
        self.seed_url = seed_url
        self.max_depth = max_depth
        self.max_visits = max_visits
        self.visited = set()
        self.visited_count = 0
        # A list used as a stack, holding (url, depth) pairs.
        self.unvisited = [(seed_url, 1)]
        print(f"Add the seeds url ['{seed_url}'] to the unvisited url list")

    def get_links(self, url):
        """Fetch url and extract new links, keeping only baidu.com links."""
        try:
            resp = requests.get(url, timeout=5)
            resp.encoding = resp.apparent_encoding
        except Exception:
            return []
        soup = BeautifulSoup(resp.text, 'html.parser')
        links = set()
        for a_tag in soup.find_all('a', href=True):
            link = urljoin(url, a_tag['href'])
            parsed = urlparse(link)
            if 'baidu.com' in parsed.netloc:
                # Skip mailto:, javascript: and other non-http(s) links.
                if parsed.scheme in ('http', 'https'):
                    links.add(link)
        return list(links)

    def crawl(self):
        while self.unvisited and self.visited_count < self.max_visits:
            # Depth-first: pop the most recently added url (last in, first out).
            url, depth = self.unvisited.pop()
            if url in self.visited:
                continue
            print(f'Pop out one url "{url}" from unvisited url list')
            links = self.get_links(url)
            print(f'Get {len(links)} new links')
            self.visited.add(url)
            self.visited_count += 1
            print(f'Visited url count: {self.visited_count}')
            # "deepth" is kept as-is to match the expected sample output.
            print(f'Visited deepth: {depth}')
            if links:
                print(f'{len(links)} unvisited links:')
            else:
                print('0 unvisited links:')
            if depth < self.max_depth:
                for link in links:
                    if link not in self.visited:
                        self.unvisited.append((link, depth + 1))
        # Report whatever is still waiting on the stack when the crawl stops.
        if self.unvisited:
            print(f'{len(self.unvisited)} unvisited links:')
        else:
            print('No unvisited links left')


if __name__ == '__main__':
    seed = 'http://www.baidu.com'
    crawler = DepthFirstCrawler(seed_url=seed, max_depth=3, max_visits=20)
    crawler.crawl()
```

Code notes:
- A list is used as a stack to achieve depth-first order: pop() removes the most recently added URL (a stack-vs-queue comparison follows below).
- Each visit prints the popped URL, the number of new links found, the visit count, and the depth.
- Only http/https links under the baidu.com domain are crawled (see the link-filtering sketch below).
- A maximum visit count and maximum depth prevent infinite loops and excessive requests.
- The output format closely follows the sample you provided.

Key points of the logic:
1. Depth-first order comes from list.pop(), which gives stack (last-in, first-out) visiting order.
2. New links are pushed with depth + 1, and the maximum depth limit prevents unbounded descent.
3. Filtering by domain and scheme keeps invalid or off-site links out of the stack, so the crawl stays focused and well-behaved.
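To make the stack-vs-queue distinction concrete, here is a minimal sketch that walks an invented toy link graph (the URLs and links are made up for illustration, not real baidu.com structure). Popping from the tail of the frontier gives depth-first order; popping from the head gives breadth-first order.

```python
from collections import deque

# Hypothetical toy link graph, used only to show traversal order.
TOY_LINKS = {
    'http://www.baidu.com': ['http://news.baidu.com', 'http://www.hao123.com'],
    'http://news.baidu.com': ['http://tieba.baidu.com'],
    'http://www.hao123.com': [],
    'http://tieba.baidu.com': [],
}

def traverse(seed, depth_first=True):
    """Visit TOY_LINKS, popping the tail (stack, DFS) or the head (queue, BFS)."""
    frontier = deque([(seed, 1)])
    order = []
    while frontier:
        url, depth = frontier.pop() if depth_first else frontier.popleft()
        if url in order:
            continue
        order.append(url)
        for link in TOY_LINKS.get(url, []):
            frontier.append((link, depth + 1))
    return order

print(traverse('http://www.baidu.com', depth_first=True))   # depth-first order
print(traverse('http://www.baidu.com', depth_first=False))  # breadth-first order
```

Switching the crawler to breadth-first would amount to replacing self.unvisited.pop() with popping from the front of the list (or using collections.deque with popleft()).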
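The sketch below illustrates the same urljoin/urlparse filter that get_links applies. The hrefs are hypothetical examples chosen only to show which kinds of links are kept or dropped.

```python
from urllib.parse import urljoin, urlparse

base = 'http://www.baidu.com'
# Hypothetical hrefs a page might contain.
hrefs = ['/s?wd=python', 'http://news.baidu.com', 'javascript:void(0)',
         'mailto:webmaster@example.com', 'https://www.example.com/page']

for href in hrefs:
    link = urljoin(base, href)  # resolve relative links against the page url
    parsed = urlparse(link)
    keep = 'baidu.com' in parsed.netloc and parsed.scheme in ('http', 'https')
    print(f'{href!r:45} -> {"kept" if keep else "dropped"}: {link}')
```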
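If you want to verify the printed messages without sending any real requests to baidu.com, one option is to subclass the crawler and stub out get_links. This is a testing sketch only; it assumes the DepthFirstCrawler class above is defined in the same file (or importable), and the canned links are invented.

```python
class FakeCrawler(DepthFirstCrawler):
    # Invented link lists, returned instead of fetching real pages.
    FAKE_LINKS = {
        'http://www.baidu.com': [f'http://fake{i}.baidu.com' for i in range(10)],
    }

    def get_links(self, url):
        # No network access: return the canned links (empty list if unknown).
        return self.FAKE_LINKS.get(url, [])

FakeCrawler('http://www.baidu.com', max_depth=2, max_visits=3).crawl()
```

Running this prints the same message sequence as a real crawl (seed added, urls popped, link counts, visit count and depth), which makes it easy to compare against the expected output before crawling for real.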
