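I was startled to learn that a high-school student on Zhihu claimed to have proved the Goldbach conjecture and promised to post the proof on Zhihu on January 1, 2019.
As the first piece of drama of the new year, this was of course not to be missed.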
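As a seasoned onlooker, I wanted to do more than watch: I also wanted to analyze how education level affects blindly following a spectacle. Taking the users who followed the question before January 1 as the sample, we record what those users do after January 1. The dependent variable is whether a user unfollows; the independent variables are the user's profile information. Treating January 1 as a natural experiment, we split users into a high-education group and a low-education group (or split them by gender), then use difference-in-differences (DiD) to estimate the difference in unfollowing between the target group and the control group. (All right, I can't keep making this up…)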
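To make the DiD idea concrete, here is a toy calculation. The unfollow rates are entirely made up for illustration and the function name did_estimate is hypothetical; the sketch only shows the arithmetic of the estimator, not a real analysis of the scraped data.

```python
def did_estimate(target_before, target_after, control_before, control_after):
    """Difference-in-differences: change in the target group minus change in the control group."""
    return (target_after - target_before) - (control_after - control_before)


# Hypothetical unfollow rates (share of sampled followers who unfollowed
# in a window before / after January 1). These are NOT real data.
rates = {
    "high_edu": {"before": 0.02, "after": 0.30},
    "low_edu": {"before": 0.02, "after": 0.18},
}

effect = did_estimate(
    rates["high_edu"]["before"], rates["high_edu"]["after"],
    rates["low_edu"]["before"], rates["low_edu"]["after"],
)
print(f"DiD estimate of the extra unfollow rate: {effect:.2f}")
```

A positive estimate would mean the target group unfollowed more than the control group did over the same period, beyond whatever gap already existed before January 1.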
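See the project page, Goldbach Research Group, for details.
This is the rare free stretch after final exams and before military training, so using this project to learn a bit about research methodology and Python scraping seemed worthwhile.
This post records how to scrape the answers and comments under a Zhihu question.
Everything below uses the Zhihu question "If a high-school student could prove the Goldbach conjecture, would they be admitted without examination to the mathematics department at Tsinghua or Peking University?" as the example. To scrape a different question, just change the question ID (questionId in the program below).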
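Analyzing the API
Front-end experts had already worked this part out, so I only summarize it briefly here. (I don't fully understand it myself.)
Open the Zhihu question in a browser, press F12 to open the developer tools, switch to the Network tab, refresh the page, and look at the requests whose type is json. Checking their URLs one by one yields the following APIs.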
answer
https://www.zhihu.com/api/v4/questions/306537777/answers?include=data[*].is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%3Bdata[*].mark_infos[*].url%3Bdata[*].author.follower_count%2Cbadge[*].topics&limit=1&offset=0&platform=desktop&sort_by=default
comment
https://www.zhihu.com/api/v4/answers/559871763/root_comments?include=data[*].author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author&order=normal&limit=1&offset=0&status=open
child_comments
https://www.zhihu.com/api/v4/comments/565399549/child_comments?include=%24[*].author%2Creply_to_author%2Ccontent%2Cvote_count&limit=1&offset=0&include=%24[*].author%2Creply_to_author%2Ccontent%2Cvote_count&tdsourcetag=s_pctim_aiomsg
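The three URLs above share the same shape: a percent-encoded include parameter listing the requested fields, plus limit and offset for paging. The following sketch uses only the standard library to decode the include list and to rewrite limit and offset. The helper names requested_fields and set_paging are hypothetical, and whether the server accepts a limit larger than 1 is an assumption that has not been verified here.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

# The root_comments URL from above, split across lines for readability.
COMMENT_URL = (
    "https://www.zhihu.com/api/v4/answers/559871763/root_comments"
    "?include=data[*].author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent"
    "%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author"
    "&order=normal&limit=1&offset=0&status=open"
)


def requested_fields(url):
    """Return the entries of the first `include` query parameter as a list."""
    query = parse_qs(urlsplit(url).query)
    return query["include"][0].split(",")


def set_paging(url, limit, offset):
    """Return `url` with its limit and offset parameters replaced."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in ("limit", "offset")]
    params += [("limit", str(limit)), ("offset", str(offset))]
    return urlunsplit(parts._replace(query=urlencode(params)))


fields = requested_fields(COMMENT_URL)
print(fields)
next_page = set_paging(COMMENT_URL, limit=20, offset=40)
print(next_page)
```

Decoding the include list this way makes it easier to see which fields a request asks for before trimming it down to the ones the analysis needs.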
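Python program
Below is a fairly complete Python program that scrapes all the answers, comments, and child comments under the question.
Note: owing to my limited skill, I have not written a multithreaded version, so this program runs slowly.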
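Since each request is independent, one way to speed things up is a small thread pool. The following is a minimal sketch, not part of the original program: fetch_all and fetch_text are hypothetical names, and the fetch function is passed in as a parameter so the pool logic can be tried without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fetch_text(url):
    """Download one URL and return the response body as text."""
    import requests  # imported here so fetch_all can be used without requests installed

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) '
                             'Gecko/20100101 Firefox/64.0'}
    return requests.get(url, headers=headers, timeout=10).text


# Offline demonstration with a stand-in fetch function.
demo = fetch_all(["a", "b", "c"], lambda u: u.upper())
print(demo)

# With the program's URL builder it could be used roughly like this:
# pages = fetch_all([getAnsUrl(i) for i in range(totalAns)], fetch_text)
```

Keeping max_workers small is probably wise, since many parallel requests may get the client throttled or blocked.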
    import json
    import os

    import requests

    questionId = 307595822
    startAns = 0  # index of the first answer to fetch
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}


    def getAnsUrl(num):
        """URL of the answer at offset num (one answer per request)."""
        url = 'https://www.zhihu.com/api/v4/questions/' + str(questionId) + '/answers' \
              '?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed' \
              '%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2' \
              'Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2' \
              'Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crele' \
              'vant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked' \
              '%2Cis_nothelp%2Cis_labeled%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.' \
              'author.follower_count%2Cbadge%5B%2A%5D.topics&limit=1&offset=' + str(num) + '&platform=' \
              'desktop&sort_by=default'
        return url


    def getComUrl(ansId, offset):
        """URL of the root comment at the given offset under answer ansId."""
        url = 'https://www.zhihu.com/api/v4/answers/' + str(ansId) + '/root_comments' \
              '?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2C' \
              'vote_count%2Cis_parent_author%2Cis_author&order=normal&limit=1&offset=' + str(offset) + '&status=open'
        return url


    def getChildComUrl(comId, offset):
        """URL of the child comment at the given offset under comment comId."""
        url = 'https://www.zhihu.com/api/v4/comments/' + str(comId) + '/child_comments' \
              '?include=%24%5B%2A%5D.author%2Creply_to_author%2Ccontent%2Cvote_count&limit=1' \
              '&offset=' + str(offset) + '&include=%24%5B*%5D.author%2Creply_to_author%2C' \
              'content%2Cvote_count&tdsourcetag=s_pctim_aiomsg'
        return url


    def mkdir(path):
        """Create path (and any missing parents) if it does not exist yet."""
        os.makedirs(path, exist_ok=True)


    def save(path, text):
        """Write the raw response text to path as UTF-8."""
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)


    mkdir('./answers')
    mkdir('./comments')
    mkdir('./child_comments')

    # The first request is only used to read the total number of answers.
    ansResponse = requests.get(getAnsUrl(0), headers=headers)
    totalAns = json.loads(ansResponse.text)['paging']['totals']

    for i in range(startAns, totalAns):
        print('Get answer' + str(i) + '.json')
        ansResponse = requests.get(getAnsUrl(i), headers=headers)
        ansJson = json.loads(ansResponse.text)
        save('./answers/answer' + str(i) + '.json', ansResponse.text)
        if not ansJson['data']:
            continue

        # Ask for the first comment to learn how many root comments the answer has.
        ansId = ansJson['data'][0]['id']
        comResponse = requests.get(getComUrl(ansId, 0), headers=headers)
        totalCom = json.loads(comResponse.text)['paging']['totals']
        if totalCom <= 0:
            continue

        mkdir('./comments/answer' + str(i))
        for j in range(0, totalCom):
            print('Get answer' + str(i) + '--comment' + str(j) + '.json')
            comResponse = requests.get(getComUrl(ansId, j), headers=headers)
            save('./comments/answer' + str(i) + '/comment' + str(j) + '.json', comResponse.text)
            comJson = json.loads(comResponse.text)
            if not comJson['data']:
                continue

            comId = comJson['data'][0]['id']
            totalChCom = comJson['data'][0]['child_comment_count']
            if totalChCom <= 0:
                continue

            # os.makedirs also creates the intermediate answer<i> directory.
            chDir = './child_comments/answer' + str(i) + '/comment' + str(j)
            mkdir(chDir)
            for k in range(0, totalChCom):
                print('Get answer' + str(i) + '--comment' + str(j) + '--child_comment' + str(k) + '.json')
                chComResponse = requests.get(getChildComUrl(comId, k), headers=headers)
                save(chDir + '/child_comment' + str(k) + '.json', chComResponse.text)
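Once the program has finished, every answer sits in its own JSON file under ./answers. The sketch below shows one way to collect a few fields from those files for later analysis. The function name summarize_answers is hypothetical, and it assumes that the fields requested in the include list (voteup_count, comment_count) come back under those same names in each entry of data; files without a data list are skipped.

```python
import glob
import json
import os


def summarize_answers(answers_dir="./answers"):
    """Collect id, voteup_count and comment_count from the saved answer files."""
    rows = []
    for path in sorted(glob.glob(os.path.join(answers_dir, "answer*.json"))):
        with open(path, encoding="utf-8") as f:
            page = json.load(f)
        # Each file holds one API page; error responses have no "data" list.
        for ans in page.get("data", []):
            rows.append({
                "id": ans.get("id"),
                "voteup_count": ans.get("voteup_count"),
                "comment_count": ans.get("comment_count"),
            })
    return rows


if __name__ == "__main__":
    for row in summarize_answers():
        print(row)
```

The resulting list of dictionaries can be written to CSV or loaded into a data frame as a starting point for the analysis.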
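Postscript
The proof was posted on schedule, obvious errors were pointed out (it was "not even wrong"), the original poster deleted their account, and the farce came to an end.
2018 is over; 2019 begins.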