How to write a crawler in Java?

Writing a Java crawler is actually not very hard if you use existing APIs, and writing your own lets you implement exactly the features you want. It can be very interesting to extract specific information from the internet. Providing complete code is not easy, but here is the basic algorithm for a crawler.

You’ll be reinventing the wheel, to be sure. But here are the basics:

1. A list of unvisited URLs – seed this with one or more starting pages
2. A list of visited URLs – so you don’t go around in circles
3. A set of rules for URLs you’re not interested in – so you don’t index the whole Internet
4. Storage for all of these in a database, since the crawler may stop and need to restart from the same place without losing state (a minimal in-memory sketch of this state follows the list)
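
As a minimal sketch, this is roughly what that state could look like in Java. The class name and the single-site rule are my own placeholders, and database persistence is left out:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class CrawlerState {
	// 1. unvisited URLs, seeded with one or more starting pages
	final Queue<String> unvisited = new ArrayDeque<>();
	// 2. visited URLs, so we don't go around in circles
	final Set<String> visited = new HashSet<>();

	// 3. rules for URLs we care about; restricting to a single
	// site is just a placeholder rule for illustration
	boolean matchesRules(String url) {
		return url.startsWith("http://some.url");
	}
}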

The algorithm is as follows.

while (list of unvisited URLs is not empty) {
    take URL from list
    fetch content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
            if it matches your rules
            and it's not already in either the visited or unvisited list
            add it to the unvisited list
        }
    }
}
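
In Java, that loop might look roughly like the sketch below, using Jsoup for fetching and parsing. This is a minimal in-memory version: the seed URL and the single-site rule are placeholder assumptions, and it ignores real-world concerns like robots.txt, politeness delays, and persisting state.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SimpleCrawler {

	public static void main(String[] args) {
		Queue<String> unvisited = new ArrayDeque<>();
		Set<String> visited = new HashSet<>();
		unvisited.add("http://some.url/"); // seed with a starting page

		while (!unvisited.isEmpty()) {
			String url = unvisited.poll(); // take URL from list
			visited.add(url);

			Document doc;
			try {
				doc = Jsoup.connect(url).get(); // fetch content; non-HTML throws
			} catch (IOException e) {
				continue; // unreachable or non-HTML page: skip it
			}

			// record whatever it is you want to about the content here

			for (Element link : doc.select("a[href]")) { // parse out URLs from links
				String next = link.attr("abs:href");
				if (next.startsWith("http://some.url") // it matches your rules
						&& !visited.contains(next)
						&& !unvisited.contains(next)) {
					unvisited.add(next); // not seen before: queue it
				}
			}
		}
	}
}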

Let’s discuss what happens if you decide to make one. The first problem is the list of URLs.

Where do you find the initial list of websites? You can seed it from existing directories, or even compile it by hand.

Jsoup is an HTML parser that makes the parsing part very easy and interesting to do.
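
For example, fetching a page and pulling out its links takes only a few lines (the URL here is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {

	public static void main(String[] args) throws Exception {
		// download and parse the page
		Document doc = Jsoup.connect("http://some.url/").get();
		System.out.println("Title: " + doc.title());

		// select every anchor tag and print its absolute URL
		for (Element link : doc.select("a[href]")) {
			System.out.println(link.attr("abs:href"));
		}
	}
}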

Update: I have made a simple crawler tutorial.

The following code still has problems and is only a starting point.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
 
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
public class FileCrawler {
 
	public static void main(String[] args) throws IOException {
 
		// make sure record.txt (the file of visited URLs) exists before crawling
		File dir = new File(".");
		String loc = dir.getCanonicalPath() + File.separator + "record.txt";
		FileWriter fstream = new FileWriter(loc, true);
		BufferedWriter out = new BufferedWriter(fstream);
		out.newLine();
		out.close();
 
		// start crawling from the seed page
		processPage("http://some.url");
 
		// delete the record file once the crawl is finished
		File file = new File(loc);
		file.delete();
	}
 
	// given a String and a File,
	// return whether the String is contained in the File
	public static boolean checkExist(String s, File fin) throws IOException {
 
		FileInputStream fis = new FileInputStream(fin);
		// construct the BufferedReader object
		BufferedReader in = new BufferedReader(new InputStreamReader(fis));
 
		String aLine = null;
		while ((aLine = in.readLine()) != null) {
			// process each line
			if (aLine.trim().contains(s)) {
				//System.out.println("contains " + s);
				in.close();
				fis.close();
				return true;
			}
		}
 
		// do not forget to close the buffered reader
		in.close();
		fis.close();
 
		return false;
	}
 
	public static void processPage(String URL) throws IOException {
 
		File dir = new File(".");
		String loc = dir.getCanonicalPath() + File.separator + "record.txt";
 
		// skip links we don't want to crawl (documents, images, mail links, etc.)
		if (URL.contains(".pdf") || URL.contains("@")
				|| URL.contains("adfad") || URL.contains(":80")
				|| URL.contains("fdafd") || URL.contains(".jpg"))
			return;
 
		// only crawl URLs on the target site, and strip any trailing slash
		if (!URL.contains("some.url")) {
			// URL of another site -> do nothing
			return;
		}
		if (URL.endsWith("/")) {
			URL = URL.substring(0, URL.length() - 1);
		}
 
		File file = new File(loc);
 
		// check whether this URL has already been recorded
		boolean e = checkExist(URL, file);
		if (!e) {
			System.out.println("------ :  " + URL);
			// append the new URL to the record file
			FileWriter fstream = new FileWriter(loc, true);
			BufferedWriter out = new BufferedWriter(fstream);
			out.write(URL);
			out.newLine();
			out.close();
 
			Document doc = null;
			try {
				doc = Jsoup.connect(URL).get();
			} catch (IOException e1) {
				e1.printStackTrace();
				return;
			}
 
			// example content check: do something with pages that mention "PhD"
			if (doc.text().contains("PhD")) {
				//System.out.println(URL);
			}
 
			// extract all links on the page and crawl each one recursively
			Elements links = doc.select("a[href]");
			for (Element link : links) {
				processPage(link.attr("abs:href"));
			}
		}
 
	}
}
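
A note on the design: processPage calls itself for every link it finds, so a large site can overflow the call stack; the iterative queue-based sketch earlier avoids that. It also re-reads record.txt from disk for every URL, which gets slow as the file grows; keeping the visited set in memory, backed by the file, would be a natural improvement. To try the code, you only need the jsoup jar on the classpath, e.g. javac -cp jsoup-1.x.x.jar FileCrawler.java (substitute whatever jsoup version you downloaded).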
