import requests from bs4 import BeautifulSoup import csv import time import random def get_anjuke_rental(page): headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', 'Referer': 'https://km.anjuke.com/', 'Cookie': 'aQQ_ajkguid=DAD5FF6F-F8C7-3A74-D6D3-DDBD83A22773; ajk-appVersion=; fzq_h=c460b2e3f6fa0423cdfde83f31cc1ca7_1738734512582_d8af942581ce43adab63aaa3da34c3dc_1782055253; id58

代码问题分析 1. 反爬机制检测不全面：仅通过和响应文本进行简单的反爬检测，可能不够全面。 ```python if "验证" in soup.title.text or "安全验证" in response.text: print("触发反爬机制！") return ``` 2. 请求失败处理不完善：在请求失败时没有重试机制，可能导致数据丢失。 ```python if response.status_code != 200: print(f'第 {page} 页请求失败，状态码：{response.status_code}') return ``` 3. 异常处理不充分：在解析房源信息时，如果某个字段缺失或格式不对，程序会直接跳过该房源，没有详细记录错误信息。 ```python except Exception as e: print(f'第 {page} 页第 {index + 1} 个房源解析失败: {str(e)}') ``` 4. 硬编码的Headers和Cookies：这些信息可能会过期或失效，导致爬虫无法正常工作。 ```python headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/Safari/537.36', 'Referer': 'https://km.anjuke.com/', 'Cookie': 'aQQ_ajkguid=DAD5FF6F-F8C7-3A74-D6D3-DDBD83A22773; ajk-appVersion=; fzq_h=c460b2e3f6fa0423cdfde83f31cc1ca7_1738734512582_d8af942581ce43adab63aaa3da34c3dc_1782055253; id58=CkwA6mei+7ohVG2DnYupAg==; ...' } ``` 5. 缺少CSV文件初始化：`writer`对象未定义，导致写入CSV文件时报错。 ```python writer.writerow([title, price, area, layout, orientation, region, tags]) ``` 代码优化方向 1. 增强反爬机制检测：增加更多的反爬检测条件，如检查页面中是否包含特定的反爬元素。 2. 添加请求失败重试机制：使用循环和延时来处理请求失败的情况。 3. 改进异常处理：详细记录每个房源解析失败的原因，并继续处理下一个房源。 4. 动态获取Headers和Cookies：通过模拟登录或其他方式动态获取有效的Headers和Cookies。 5. 初始化CSV文件：在使用`writer`之前先初始化CSV文件。优化后的代码 ```python import requests from bs4 import BeautifulSoup import csv import time import random def get_anjuke_rental(page): # 动态获取Headers和Cookies（示例） headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/Safari/537.36', 'Referer': 'https://km.anjuke.com/', 'Cookie': 'aQQ_ajkguid=DAD5FF6F-F8C7-3A74-D6D3-DDBD83A22773; aQ...' } url = f'https://km.zu.anjuke.com/fangyuan/p{page}/?kw=' # 注意分页参数 max_retries = 3 for attempt in range(max_retries): try: print(f'正在请求第 {page} 页：{url}') response = requests.get(url, headers=headers, timeout=15) response.encoding = 'utf-8' if response.status_code != 200: print(f'第 {page} 页请求失败，状态码：{response.status_code}') continue soup = BeautifulSoup(response.text, 'lxml') # 检测反爬机制 if "验证" in soup.title.text or "安全验证" in response.text or soup.find('div', class_='verify-container'): print("触发反爬机制！") return house_list = soup.find_all('div', class_='zu-itemmod clearfix') for index, house in enumerate(house_list): try: # title_tag = house.find('b', class_='strongbox') title = title_tag.text.strip() if title_tag else '' # 价格（直接提取<strong>内容） price_tag = house.find('strong', class_='price') price = price_tag.text.strip().replace('元/月', '') if price_tag else '' # 区域信息（清理多余字符） address_tag = house.find('address', class_='details-item tag') if address_tag: for a_tag in address_tag.find_all('a'): a_tag.decompose() region = address_tag.get_text(strip=True) \ .replace('\xa0', '') \ .replace(' ', '') \ .strip('-') else: region = '' # 面积、户型、朝向（从bot-tag提取） bot_tag = house.find('p', class_='details-item tag') details = [span.text.strip() for span in bot_tag.find_all('span', class_='cls-common')] if bot_tag else [] area = layout = orientation = '' for detail in details: if '㎡' in detail: area = detail.replace('㎡', '').strip() elif '室' in detail and '厅' in detail: layout = detail elif '朝' in detail: orientation = detail # 标签信息（直接使用details内容） tags = '|'.join(details) writer.writerow([title, price, area, layout, orientation, region, tags]) except Exception as e: print(f'第 {page} 页第 {index + 1} 个房源解析失败: {str(e)}') break # 如果成功请求并解析，跳出重试循环 except Exception as e: print(f'第 {page} 页请求失败，尝试次数 {attempt + 1}/{max_retries}: {str(e)}') time.sleep(random.uniform(1, 3)) # 随机等待一段时间再重试 continue # 初始化CSV文件 with open('anjuke_rentals.csv', mode='w', newline='', encoding='utf-8') as file: writer = csv.writer(file) writer.writerow(['Title', 'Price', 'Area', 'Layout', 'Orientation', 'Region', 'Tags']) # 调用函数获取数据（示例） for page in range(1, 10): # 假设抓取前10页的数据 get_anjuke_rental(page) ``` 解释说明 1. 增强反爬机制检测：增加了对特定反爬元素的检测，确保爬虫能够识别出反爬机制。 2. 添加请求失败重试机制：使用循环和延时来处理请求失败的情况，最多重试三次。 3. 改进异常处理：详细记录每个房源解析失败的原因，并继续处理下一个房源。 4. 动态获取Headers和Cookies：示例中简化了Headers和Cookies的获取过程，实际项目中可以通过模拟登录或其他方式动态获取。 5. 初始化CSV文件：在使用`writer`之前先初始化CSV文件，确保数据能够正确写入。 [2025-02-07 14:50:47 | AI写代码神器 | 1989点数解答]

人生没有白走的路，每一步都算数！Ctrl + D 收藏本站，开发、写文章快人一步~

全部AI / 热门 / AI写代码神器

服务商

更多选项

快捷项

其他AI