Crawler4j is an open source Java crawler that provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in 5 minutes!
Sample Usage
First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:
import java.util.ArrayList;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // URLs ending in one of these extensions (stylesheets, scripts, media, archives) are skipped.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public MyCrawler() {
    }

    // Decides whether the given URL should be crawled.
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        // Stay within the www.ics.uci.edu site.
        if (href.startsWith("http://www.ics.uci.edu/")) {
            return true;
        }
        return false;
    }

    // Called after the content of a URL has been downloaded successfully.
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        ArrayList<WebURL> links = page.getURLs();
    }
}
As can be seen in the above code, there are two main functions that should be overridden:
shouldVisit: This function decides whether the given URL should be crawled or not.
visit: This function is called after the content of a URL is downloaded successfully. You can easily get the text, links, url and docid of the downloaded page.
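To make the filtering rule easier to try out, here is a minimal standalone sketch of the same decision that shouldVisit makes. It deliberately does not use crawler4j, so it can be run on its own. The class name UrlFilterSketch and the sample URLs are assumptions made for illustration; only the pattern and the URL prefix come from the sample above.

```java
import java.util.regex.Pattern;

// Hypothetical standalone class: mirrors the shouldVisit logic without crawler4j.
class UrlFilterSketch {

    // Same extension filter as in the MyCrawler sample.
    static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Same site prefix as in the MyCrawler sample.
    static final String PREFIX = "http://www.ics.uci.edu/";

    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        if (FILTERS.matcher(href).matches()) {
            return false; // filtered out by file extension
        }
        return href.startsWith(PREFIX); // crawl only pages under the prefix
    }

    public static void main(String[] args) {
        // Sample URLs are illustrative only.
        System.out.println(shouldVisit("http://www.ics.uci.edu/about/"));      // true
        System.out.println(shouldVisit("http://www.ics.uci.edu/logo.PNG"));    // false
        System.out.println(shouldVisit("http://www.example.com/index.html")); // false
    }
}
```

Because the URL is lower-cased before matching, extensions such as ".PNG" are filtered just like ".png".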
You should also implement a controller class which specifies the seeds of the crawl, the folder in which crawl data should be stored, and the number of concurrent threads:
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.ics.uci.edu/");
        controller.start(MyCrawler.class, 10);
    }
}
Politeness
Crawler4j is designed to be efficient and can crawl domains very quickly (e.g., it has been able to crawl 200 Wikipedia pages per second). However, crawling at that rate violates crawling policies and puts a heavy load on servers (and they might block you!). For this reason, since version 1.3 crawler4j waits at least 200 milliseconds between requests by default. This parameter can be tuned with the "setPolitenessDelay" function in the controller.
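The idea behind a politeness delay can be illustrated with a small standalone sketch. This is not crawler4j's implementation, only one simple way to enforce a minimum gap between consecutive requests. The class PolitenessSketch and all of its methods are hypothetical names introduced for this example.

```java
// Hypothetical standalone sketch; not crawler4j's actual implementation.
class PolitenessSketch {
    private final long delayMillis;       // minimum gap between two requests
    private long lastRequestMillis;       // time of the most recent request
    private boolean hasRequested = false; // false until the first request is issued

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // How long a caller must still wait at time nowMillis before the next request.
    synchronized long millisToWait(long nowMillis) {
        if (!hasRequested) {
            return 0L; // the first request never waits
        }
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0L, delayMillis - elapsed);
    }

    // Records that a request was issued at time nowMillis.
    synchronized void recordRequest(long nowMillis) {
        lastRequestMillis = nowMillis;
        hasRequested = true;
    }

    // Blocks until the delay has passed, then marks a new request as issued.
    // Holding the lock while sleeping makes all threads share one global delay.
    synchronized void awaitTurn() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        recordRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        // 200 ms matches the default delay mentioned above.
        PolitenessSketch politeness = new PolitenessSketch(200);
        for (int i = 0; i < 3; i++) {
            politeness.awaitTurn();
            System.out.println("request " + i + " at " + System.currentTimeMillis());
        }
    }
}
```

In crawler4j itself, you would not write this logic yourself. You would change the delay through the controller's setPolitenessDelay function instead. Its exact argument type is not stated here, so check it against the version you use.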
Dependencies
The following libraries are used in the implementation of crawler4j. For convenience, all of them are bundled in the "crawler4j-dependencies-lib.zip" package:
Berkeley DB Java Edition 4.0.71 or higher
fastutil 5.1.5
DSI Utilities 1.0.10 or higher
Apache HttpClient 4.0.1
Apache Log4j 1.2.15
Apache Commons Logging 1.1.1
Apache Commons Codec 1.4
Source Code
The source code is available for checkout from this Subversion repository: https://crawler4j.googlecode.com/svn/trunk/