Pitfalls of scraping Zhihu data for a project, plus an introduction to the handy scraping tool Web Scraper - 东山笔记

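While analyzing data for a project, I found it very hard to pull information out of a specific Zhihu question. Zhihu renders its pages in a complicated way, and the guides online are either outdated or paywalled, which is really annoying. Below are the problems I ran into.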

Trying Web Scraper

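Web Scraper is a simple no-code data collector that runs as a browser extension, which you can find in the extension store. Press F12 to open it; to use it you first pick the enclosing element and then choose the content inside it. It works by simulating scrolling to the bottom of the page and scraping as it goes, which suits lightweight jobs. With the two-thousand-plus answers I had to process, though, it may fail to reach the bottom or crash outright. For small amounts of data it should be fine: put the task name in the designated spot (the "_id" field of the sitemap below) and replace the question id in the link.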

Trying again with another method

{"_id":"name","startUrl":["https://www.zhihu.com/question/xxxxxxxxx/answers/updated"],"selectors":[
{"id":"block","parentSelectors":["_root"],"type":"SelectorElementScroll","selector":"div.List-item:nth-of-type(n+2)","multiple":true,"delay":2000,"elementLimit":2100},
{"id":"content","parentSelectors":["block"],"type":"SelectorText","selector":"span[itemprop='text']","multiple":true,"regex":""},
{"id":"user","parentSelectors":["block"],"type":"SelectorLink","selector":".AuthorInfo-name a","multiple":true,"linkType":"linkFromHref"},
{"id":"date","parentSelectors":["block"],"type":"SelectorText","selector":".ContentItem-time span","multiple":true,"regex":""}
]}

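This approach works much like the web tool: it has to scroll to the bottom to load content. It runs into the same sluggish page response on the way down, and Zhihu's caching also leaves the scraped data incomplete. It may still be of some use for simple jobs. You have to prepare the login material yourself (for example saved cookies), otherwise it is hard to get past the login step, which is a real hassle.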
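The substitution described for the sitemap above (task name into "_id", question id into the start URL) could also be scripted. This is only a minimal sketch, assuming the same selectors as the sitemap above; the function name make_sitemap and the sample id 123456789 are hypothetical and not part of Web Scraper itself.

```python
import json


def make_sitemap(task_name: str, question_id: str) -> str:
    """Return Web Scraper sitemap JSON for one Zhihu question (illustrative helper)."""
    sitemap = {
        "_id": task_name,  # the task name goes here
        "startUrl": [f"https://www.zhihu.com/question/{question_id}/answers/updated"],
        "selectors": [
            {"id": "block", "parentSelectors": ["_root"], "type": "SelectorElementScroll",
             "selector": "div.List-item:nth-of-type(n+2)", "multiple": True,
             "delay": 2000, "elementLimit": 2100},
            {"id": "content", "parentSelectors": ["block"], "type": "SelectorText",
             "selector": "span[itemprop='text']", "multiple": True, "regex": ""},
            {"id": "user", "parentSelectors": ["block"], "type": "SelectorLink",
             "selector": ".AuthorInfo-name a", "multiple": True, "linkType": "linkFromHref"},
            {"id": "date", "parentSelectors": ["block"], "type": "SelectorText",
             "selector": ".ContentItem-time span", "multiple": True, "regex": ""},
        ],
    }
    return json.dumps(sitemap, ensure_ascii=False)


# 123456789 is a made-up question id used only for the demo
print(make_sitemap("name", "123456789"))
```

The printed text can be pasted into the extension's sitemap import box.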

Experimenting together with BeautifulSoup

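import json
import random
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By  # only used by the commented-out scroll loop


def scrape1(question_id):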
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    ]
    url = f'https://www.zhihu.com/question/{question_id}'  # question_id comes from the argument
    # Create an Options object and set a random user agent
    options = Options()
    options.add_argument("user-agent=" + random.choice(user_agents))
    # Load the cookies saved beforehand
    with open('cookie.json', 'r', encoding='utf-8') as f:
        cookies = json.load(f)
    # options.add_argument("--headless")
    # Pass options in when creating the WebDriver
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    driver.delete_all_cookies()
    for cookie in cookies:
        driver.add_cookie(cookie)
    time.sleep(2)
    driver.refresh()
    time.sleep(5)  # wait for the page to finish loading

    # items = []
    # question = driver.find_element(By.CSS_SELECTOR, 'div[class="QuestionPage"] meta[itemprop="name"]').get_attribute(
    #     'content')
    # while True:
    #     # scroll to the bottom of the page
    #     print('scrolling to bottom')
    #     driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    #     time.sleep(random.randint(5, 8))  # wait for new content to load; tune as needed
    #
    #     # stop loading once the element at the bottom of the page is found
    #     try:
    #         driver.find_element(By.CSS_SELECTOR, 'button.Button.QuestionAnswers-answerButton')
    #         print('reached the end')
    #         break
    #     except:
    #         pass
    #

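    html = driver.page_source
    driver.quit()  # the browser is no longer needed once the HTML is captured
    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')
    # Get the tags of all answers
    answers = soup.find_all('div', class_='List-item')
    contents = []
    answer_ids = []
    for answer in answers:
        body = answer.find('div', class_='RichContent-inner')
        if body is None:
            continue  # skip list items that carry no answer text
        # Assumption: the answer id is the `name` attribute of the AnswerItem div; adjust if the markup differs
        item = answer.find('div', class_='AnswerItem')
        answer_ids.append(item.get('name') if item else None)
        # Get the text content of the answer
        contents.append(body.get_text())
    # Build the frame in one go so both columns always have the same length
    df = pd.DataFrame({'answer_id': answer_ids, 'content': contents})
    df.to_csv(f'{question_id}.csv', index=False, encoding='utf-8')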
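The commented-out scroll loop in the function above stops only when a particular button shows up, so it can spin forever if that element never appears. One alternative, sketched here under my own assumptions, is to stop once the page height has stopped growing for a few rounds and to cap the total number of rounds. The function scroll_until_stable and its parameters are hypothetical; it takes plain callables so it does not depend on any browser library.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_to_bottom: Callable[[], None],
                        pause: float = 5.0,
                        max_rounds: int = 500,
                        patience: int = 3) -> int:
    """Scroll until the height is unchanged `patience` times in a row.

    Returns the number of scroll rounds performed.
    """
    last_height = get_height()
    unchanged = 0
    rounds = 0
    while rounds < max_rounds and unchanged < patience:
        scroll_to_bottom()
        time.sleep(pause)  # give the page time to load new answers
        rounds += 1
        height = get_height()
        if height == last_height:
            unchanged += 1  # nothing new loaded this round
        else:
            unchanged = 0
            last_height = height
    return rounds


# With Selenium this might be wired up as follows (driver is the WebDriver above):
# scroll_until_stable(
#     lambda: driver.execute_script("return document.body.scrollHeight"),
#     lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"),
# )
```

The patience counter is there because a slow response can leave the height unchanged for one round even though more answers are still coming.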

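This is the workable approach I finally found. Its key advantage is that it can be paused midway and resumed, and the data collected so far is usable without waiting for the whole run to finish. I looked up related material and found that the original code follows this basic logic, but it did not fit my use case. My idea was to adapt the code to first obtain the necessary information, then use that information to extract the answers in full.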

Getting the template URL

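To get the template URL, open the question you want, refresh the page, and look for the specific request in the network traffic. However, the response for that request URL always came back empty, which shows this route does not work. Instead, start from the initial URL and look for the next link in what it returns; following that link gets you the content of the answers.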
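Following the next link can be sketched as a small generator. This is an illustration under assumptions rather than a description of Zhihu's API: I assume each JSON page carries its items under "data" (as the code further down does) and a "paging" object with "next" and "is_end" fields. The names iter_answers and fetch_json are hypothetical, and the demo runs on fake pages instead of real requests.

```python
from typing import Callable, Iterator


def iter_answers(first_url: str,
                 fetch_json: Callable[[str], dict],
                 max_pages: int = 1000) -> Iterator[dict]:
    """Yield answers page by page, following the assumed paging.next link."""
    url = first_url
    for _ in range(max_pages):
        page = fetch_json(url)
        yield from page.get("data", [])
        paging = page.get("paging", {})
        # Stop when the server says this is the last page or gives no next link
        if paging.get("is_end") or not paging.get("next"):
            break
        url = paging["next"]


# Fake pages standing in for real responses
fake_pages = {
    "u1": {"data": [{"id": 1}, {"id": 2}], "paging": {"is_end": False, "next": "u2"}},
    "u2": {"data": [{"id": 3}], "paging": {"is_end": True, "next": "u3"}},
}
ids = [a["id"] for a in iter_answers("u1", fake_pages.__getitem__)]
print(ids)
```

With requests, fetch_json could be something like lambda u: requests.get(u, headers=headers).json(), where headers holds your own User-Agent and cookie.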

Scraping the content with that information

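With the necessary information in hand, I started collecting the data, saving a backup after every hundred items so that an accident would not throw away everything done so far. That way, even if something goes wrong along the way, little is lost and the job can pick up where it left off, which also makes the work more efficient.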
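The backup-every-hundred idea can be sketched as follows. This is one possible layout, not the code I actually ran: rows are buffered, appended to a CSV every 100 items, and the next offset is written to a small state file so that a restart resumes from there. The names save_checkpoint, load_offset, crawl and fetch_page are all hypothetical.

```python
import csv
import json
import os


def save_checkpoint(rows, csv_path, state_path, next_offset):
    """Append buffered rows to the CSV, then record where to resume."""
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["author", "id", "text"])
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    with open(state_path, "w", encoding="utf-8") as f:
        json.dump({"next_offset": next_offset}, f)


def load_offset(state_path):
    """Return the saved offset, or 0 when starting fresh."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path, encoding="utf-8") as f:
        return json.load(f)["next_offset"]


def crawl(fetch_page, csv_path, state_path, total, limit=5, every=100):
    """fetch_page(offset, limit) must return a list of row dicts."""
    offset = load_offset(state_path)
    buffer = []
    while offset < total:
        rows = fetch_page(offset, limit)
        if not rows:
            break
        buffer.extend(rows)
        offset += limit
        if len(buffer) >= every:
            save_checkpoint(buffer, csv_path, state_path, offset)
            buffer = []
    if buffer:  # flush whatever is left at the end
        save_checkpoint(buffer, csv_path, state_path, offset)
```

If the run dies between checkpoints, at most the last 100 buffered rows have to be fetched again.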

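import csv
import time

import requests

# Request headers; placeholder values, fill in your own User-Agent and login cookie
headers = {'user-agent': 'Mozilla/5.0', 'cookie': '<your cookie>'}

# URL template
template = 'https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset={offset}&limit=5&sort_by=default&platform=desktop'
limit = 5  # must match limit= in the template

# The output file name answers.csv is an assumption
with open('answers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['author', 'id', 'text'])
    writer.writeheader()
    for page in range(0, 100):
        # Request page number `page`; offset counts answers, so step by limit
        url = template.format(offset=page * limit)
        resp = requests.get(url, headers=headers)

        # Parse and locate the data on this page
        for info in resp.json()['data']:
            author = info['author']
            Id = info['id']
            text = info['excerpt']
            data = {'author': author,
                    'id': Id,
                    'text': text}
            # Write to the CSV
            writer.writerow(data)

        # Slow down the crawler's requests to Zhihu
        time.sleep(1)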