us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline Java Examples
The following examples show how to use
us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline.
Each example links to its original project and source file.
Example #1
Source File: CommonSpider.java, from Gather-Platform (GNU General Public License v3.0)
```java
/**
 * Test a spider template.
 *
 * @param info the spider configuration to test
 * @return the webpages collected during the test run
 */
public List<Webpage> testSpiderInfo(SpiderInfo info) throws JMException {
    final ResultItemsCollectorPipeline resultItemsCollectorPipeline = new ResultItemsCollectorPipeline();
    final String uuid = UUID.randomUUID().toString();
    Task task = taskManager.initTask(uuid, info.getDomain(), info.getCallbackURL(),
            "spiderInfoId=" + info.getId() + "&spiderUUID=" + uuid);
    task.addExtraInfo("spiderInfo", info);
    QueueScheduler queueScheduler = new QueueScheduler();
    MySpider spider = (MySpider) makeSpider(info, task)
            .addPipeline(resultItemsCollectorPipeline)
            .setScheduler(queueScheduler);
    spider.startUrls(info.getStartURL());
    // Use spider monitoring with caution; it may cause memory leaks
    // spiderMonitor.register(spider);
    spiderMap.put(uuid, spider);
    taskManager.getTaskById(uuid).setState(State.RUNNING);
    spider.run();
    List<Webpage> webpageList = Lists.newLinkedList();
    resultItemsCollectorPipeline.getCollected().forEach(resultItems ->
            webpageList.add(CommonWebpagePipeline.convertResultItems2Webpage(resultItems)));
    return webpageList;
}
```
Example #2
Source File: CommonSpider.java, from spider (GNU General Public License v3.0)
```java
/**
 * Test a spider template.
 *
 * @param info the spider configuration to test
 * @return the webpages collected during the test run
 */
public List<Webpage> testSpiderInfo(SpiderInfo info) throws JMException {
    final ResultItemsCollectorPipeline resultItemsCollectorPipeline = new ResultItemsCollectorPipeline();
    final String uuid = UUID.randomUUID().toString();
    Task task = taskManager.initTask(uuid, info.getDomain(), info.getCallbackURL(),
            "spiderInfoId=" + info.getId() + "&spiderUUID=" + uuid);
    task.addExtraInfo("spiderInfo", info);
    QueueScheduler queueScheduler = new QueueScheduler();
    MySpider spider = (MySpider) makeSpider(info, task)
            .addPipeline(resultItemsCollectorPipeline)
            .setScheduler(queueScheduler);
    if (info.isAjaxSite() && StringUtils.isNotBlank(staticValue.getAjaxDownloader())) {
        spider.setDownloader(casperjsDownloader);
    } else {
        spider.setDownloader(contentLengthLimitHttpClientDownloader);
    }
    spider.startUrls(info.getStartURL());
    // Use spider monitoring with caution; it may cause memory leaks
    // spiderMonitor.register(spider);
    spiderMap.put(uuid, spider);
    taskManager.getTaskById(uuid).setState(State.RUNNING);
    spider.run();
    List<Webpage> webpageList = Lists.newLinkedList();
    resultItemsCollectorPipeline.getCollected().forEach(resultItems ->
            webpageList.add(CommonWebpagePipeline.convertResultItems2Webpage(resultItems)));
    return webpageList;
}
```
Example #3
Source File: PhantomJSPageProcessor.java, from webmagic (Apache License 2.0)
```java
public static void main(String[] args) throws Exception {
    PhantomJSDownloader phantomDownloader = new PhantomJSDownloader().setRetryNum(3);
    CollectorPipeline<ResultItems> collectorPipeline = new ResultItemsCollectorPipeline();
    Spider.create(new PhantomJSPageProcessor())
            // %B6%AC%D7%B0 is the GBK encoding of "冬装" (winter clothing)
            .addUrl("http://s.taobao.com/search?q=%B6%AC%D7%B0&sort=sale-desc")
            .setDownloader(phantomDownloader)
            .addPipeline(collectorPipeline)
            .thread((Runtime.getRuntime().availableProcessors() - 1) << 1)
            .run();
    List<ResultItems> resultItemsList = collectorPipeline.getCollected();
    System.out.println(resultItemsList.get(0).get("html").toString());
}
```
Example #4
Source File: Spider.java, from webmagic (Apache License 2.0)
```java
protected CollectorPipeline getCollectorPipeline() {
    return new ResultItemsCollectorPipeline();
}
```
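As the examples above suggest, `ResultItemsCollectorPipeline` simply accumulates every `ResultItems` object it processes in memory, so the caller can retrieve them all after the crawl finishes via `getCollected()`. A minimal sketch of that behavior, assuming webmagic is on the classpath (the `CollectDemo` class name is illustrative), invokes the pipeline directly rather than through a real crawl:

```java
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline;

public class CollectDemo {
    public static void main(String[] args) {
        ResultItemsCollectorPipeline pipeline = new ResultItemsCollectorPipeline();

        // Simulate what Spider does after each page is processed:
        // it hands the extracted ResultItems to every registered pipeline.
        ResultItems items = new ResultItems();
        items.put("title", "example page");
        pipeline.process(items, Site.me().toTask());

        // Every processed ResultItems is kept in memory and retrievable afterwards.
        System.out.println(pipeline.getCollected().size());
    }
}
```

Because everything is held in memory, this pipeline is best suited to small test crawls like the `testSpiderInfo` methods above, not long-running production spiders.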