Web crawlers have plenty of obvious advantages, but in practice they run into a number of challenges: extracting information from pages can be tricky, requests fail because your IP gets blocked, and it is not always clear whether a failed request should be retried. These problems are genuinely annoying. Below is a simple crawler I wrote, which I hope offers some food for thought.
Prerequisites for fetching page data
When scraping data, we first define filtering rules based on what we need. For a given official website, for instance, we enter query conditions to search for information, then inspect the resulting request in the browser's developer console. There we find that some key fields must be submitted in the request headers, and that the parameters themselves are sent as a JSON-format string. Using a Java HTTP library, we embed the required fields in the request headers, format the request parameters accordingly, and replay the HTTP request to fetch the data we need.
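To make that concrete, here is a condensed sketch of the request setup using Apache HttpClient, the same library used in the full catchTicket method shown further down; the endpoint URL, cookie, and JSON body below are placeholders rather than the real values.

static String fetchPage() throws Exception {
    DefaultHttpClient httpclient = new DefaultHttpClient();
    try {
        HttpPost post = new HttpPost("http://example.com/api/search");      // placeholder endpoint
        post.addHeader("Content-Type", "application/json; charset=UTF-8");
        post.addHeader("Cookie", "...");                                     // cookie copied from the browser console
        post.addHeader("User-Agent", "Mozilla/5.0 ...");                     // pretend to be a normal browser
        StringEntity body = new StringEntity("{\"query\":\"...\"}");         // JSON-format query parameters
        body.setContentType("application/json; charset=utf-8");
        post.setEntity(body);
        return EntityUtils.toString(httpclient.execute(post).getEntity(), "UTF-8");
    } finally {
        httpclient.getConnectionManager().shutdown();
    }
}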
IP bans are the big headache
The most common problem when running a crawler is probably having your IP address banned, which causes requests to fail. In my own work I got around this with proxy servers: each request picks a random IP from a proxy pool. If a request errors out, that usually means the chosen proxy is bad, so I immediately remove it from the pool. The proxy IPs themselves are scraped from a proxy-listing site, and when the pool runs dry I fetch a fresh batch. I only scraped ten IPs at a time, although I could easily have grabbed more in advance, just in case.
How the proxy IPs are scraped
I collect proxy IPs by scraping a proxy-listing website. First I define the extraction rules, then filter out the working proxies and store them in the pool. This is not entirely straightforward, because some sites have anti-crawler measures in place that we have to find ways around, for example by simulating browser behavior and tweaking the request headers, both of which are common tricks. Once that is done, we can use these IPs for crawling with reasonable confidence.
Retrying failed requests
When a request does not succeed, a retry mechanism should kick in. If the returned data turns out to be bad, the crawler calls this.retryCatch(routeMap, 5). The method takes two arguments: the request data and the number of retries. It starts a new thread so as not to block the original logic; the thread re-issues the request, and as soon as an attempt succeeds the remaining attempts are dropped. If the request keeps failing, it is retried according to the configured policy until the maximum number of attempts is reached, after which the retries stop and the error details are recorded in full and written to the database.
private String catchTicket(String depCode, String arrCode, String flightDate) throws Exception {
    String param = "...";    // placeholder: the JSON-format request parameters
    String cookie = "...";   // placeholder: the cookie string copied from the browser
    DefaultHttpClient httpclient = new DefaultHttpClient();
    HttpResponse response = null;
    String responseString = null;
    HttpPost httpost = null;
    IProxy iproxy = null;
    try {
        httpclient = new DefaultHttpClient();
        HttpParams params = httpclient.getParams();
        HttpConnectionParams.setConnectionTimeout(params, 50 * 1000);
        HttpConnectionParams.setSoTimeout(params, 120 * 1000);

        // pick a proxy from the pool and route the request through it
        iproxy = HttpProxy.getProxy();
        HttpHost httphost = new HttpHost(iproxy.getIp(), iproxy.getPort());
        httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, httphost);

        httpost = new HttpPost(POST_URL);
        httpost.addHeader("Accept", "application/json, text/javascript, */*; q=0.01");
        httpost.addHeader("Accept-Language", "zh-CN,zh;q=0.8");
        httpost.addHeader("Connection", "keep-alive");
        httpost.addHeader("Content-Type", "application/json; charset=UTF-8");
        httpost.addHeader("Cookie", cookie);
        httpost.addHeader("Host", "www.united.com");
        httpost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36");
        httpost.addHeader("X-Requested-With", "XMLHttpRequest");
        //httpost.addHeader("Accept-Encoding", "gzip, deflate");

        StringEntity parEntity = new StringEntity(param);
        parEntity.setContentType("application/json; charset=utf-8");
        httpost.setEntity(parEntity);

        response = httpclient.execute(httpost);   // send the request
        responseString = EntityUtils.toString(response.getEntity(), "UTF-8");
        if (response.getStatusLine().getStatusCode() != 200
                && response.getStatusLine().getStatusCode() != 404) {
            logger.info("response code error({}) throw exception", response.getStatusLine().getStatusCode());
            throw new Exception();
        }
    } catch (Exception e) {
        e.printStackTrace();
        // the proxy is assumed to be bad, so drop it from the pool
        HttpProxy.removeProxy(iproxy);
        throw e;
    } finally {
        if (httpost != null) {
            httpost.abort();
            httpost = null;
        }
        httpclient.getConnectionManager().shutdown();
        httpclient = null;
    }
    // sleep for one second between requests
    TimeUnit.SECONDS.sleep(1);
    return responseString;
}
A thread keeps things efficient
The retry policy runs on a separate thread so that the rest of the program is left undisturbed. Without this, a failed request would stall the program and hold up other tasks, dragging down overall performance. With a dedicated thread, retries run in parallel with everything else, which keeps the program stable and noticeably improves the crawler's throughput.
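The post does not show the retryCatch method itself, so the following is only a minimal sketch of what it might look like inside the same crawler class, assuming it re-invokes catchTicket on a fresh thread and gives up after the given number of attempts; saveErrorToDb is a hypothetical helper standing in for the "write the error to the database" step.

private void retryCatch(final Map routeMap, final int maxRetries) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            String depCode = routeMap.get("dep_code").toString();
            String arrCode = routeMap.get("arr_code").toString();
            String flightDate = routeMap.get("catch_date").toString();
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                try {
                    // re-issue the request; each call picks a fresh proxy from the pool
                    String result = catchTicket(depCode, arrCode, flightDate);
                    parseDataToDb(JSONObject.parseObject(result));
                    return;   // success: stop retrying
                } catch (Exception e) {
                    logger.info("retry " + attempt + "/" + maxRetries + " failed for " + routeMap);
                }
            }
            // every attempt failed: record the error (saveErrorToDb is a hypothetical helper)
            saveErrorToDb(routeMap);
        }
    }).start();
}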
Putting the crawler together
/**
 * Proxy pool
 * chuan.zhang
 */
public class HttpProxy {
    private static final Logger logger = LoggerFactory.getLogger(HttpProxy.class);
    private final static String POST_URL = "http://www.kuaidaili.com/proxylist/1/";
    private static List<IProxy> iproxys = new ArrayList<IProxy>();

    public static void main(String[] args) throws Exception {
        System.out.println(HttpProxy.getProxy());
    }

    /**
     * Pick a random proxy from the pool, refilling it first if it is empty
     */
    public static IProxy getProxy() throws Exception {
        if (iproxys.size() == 0) {
            initProxys();
            logger.info("init proxy over");
        }
        Random rand = new Random();
        // the original used nextInt(size() - 1), which skips the last entry
        // and breaks once only one proxy is left in the pool
        int num = rand.nextInt(iproxys.size());
        return iproxys.get(num);
    }

    public static void removeProxy(IProxy iproxy) {
        if (iproxy != null) {
            iproxys.remove(iproxy);
            logger.info("send request error remove the iproxy: " + iproxy.getIp() + ":" + iproxy.getPort());
        }
    }

    /**
     * Initialize the pool by scraping proxies from http://www.kuaidaili.com/
     */
    private static List<IProxy> initProxys() throws Exception {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        HttpResponse response = null;
        String responseString = null;
        HttpGet httget = null;
        try {
            httpclient = new DefaultHttpClient();
            HttpParams params = httpclient.getParams();
            HttpConnectionParams.setConnectionTimeout(params, 50 * 1000);
            HttpConnectionParams.setSoTimeout(params, 120 * 1000);

            httget = new HttpGet(POST_URL);
            // mimic a real browser so the proxy-list site does not reject the request
            httget.addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            httget.addHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httget.addHeader("Connection", "keep-alive");
            httget.addHeader("Cookie", "channelid=0; sid=0121255086558; _gat=1; _ga=GA1.2.2135905250.1469704395; Hm_lvt_7ed65b1cc4b810e9fd37959c9bb51b31=1469704395,1469781681,0121266; Hm_lpvt_7ed65b1cc4b810e9fd37959c9bb51b31=0121847");
            httget.addHeader("Content-Type", "application/json; charset=UTF-8");
            httget.addHeader("Host", "www.kuaidaili.com");
            httget.addHeader("Referer", "http://www.kuaidaili.com/proxylist/2/");
            httget.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36");

            response = httpclient.execute(httget);
            responseString = EntityUtils.toString(response.getEntity(), "UTF-8");

            // NOTE: the HTML tags inside the original regex were stripped when the post was
            // published; the pattern below is a reconstruction that captures the IP and port
            // cells of each row in the proxy-list table.
            Pattern p = Pattern.compile("<td[^>]*>([\\s\\S]*?)</td>\\s*<td[^>]*>([\\s\\S]*?)</td>", Pattern.DOTALL);
            Matcher m = p.matcher(responseString);
            while (m.find()) {
                // create a new IProxy per match; the original reused one instance,
                // so every entry in the pool ended up pointing at the last row parsed
                IProxy iproxy = new IProxy();
                iproxy.setIp(m.group(1).trim());
                iproxy.setPort(Integer.parseInt(m.group(2).trim()));
                iproxys.add(iproxy);
            }
        } catch (Exception e) {
            e.printStackTrace();
            logger.error("init proxy error");
            throw e;
        } finally {
            if (httget != null) {
                httget.abort();
                httget = null;
            }
            httpclient.getConnectionManager().shutdown();
            httpclient = null;
        }
        return iproxys;
    }
}
Running a scraping job means working through each step in turn: fetching the page content, getting around IP bans, and retrying failed requests; only when all of these go smoothly is the job more or less done. Every link in the chain matters, from setting the request parameters to routing through proxy IPs to applying the retry policy. Crawlers are powerful tools: they let us gather information from a great many websites, which is a huge convenience for data analysis and market research. But the process brings plenty of challenges with it, and there is always room to keep exploring and optimizing.
Have you ever hit a particularly thorny problem while collecting data yourself? If this article gave you something useful, consider giving it a like and sharing it around.
public void execute() {
    while (true) {
        try {
            Map routeMap = this.dao.getRoute();
            logger.info("start catch {}", routeMap);
            String depCode = routeMap.get("dep_code").toString();
            String arrCode = routeMap.get("arr_code").toString();
            String flightDate = routeMap.get("catch_date").toString();
            JSONObject json = null;
            try {
                String result = this.catchTicket(depCode, arrCode, flightDate);
                json = JSONObject.parseObject(result);
            } catch (Exception e) {
                // the request (or its response) failed: hand it off to the retry thread
                logger.info("catch error result: " + routeMap);
                this.retryCatch(routeMap, 5);
                continue;
            }
            this.parseDataToDb(json);
        } catch (Exception e) {
            e.printStackTrace();
            try {
                // back off for 30 minutes before the next round
                TimeUnit.MINUTES.sleep(30);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }
        }
    }
}