Writing a Java crawler program is not very hard using the existing APIs, but writing your own lets you build exactly the features you want, and it is an interesting way to pull specific information from the internet. Rather than hand over finished code, let me start with the basic algorithm for a crawler.
You’ll be reinventing the wheel, to be sure. But here are the basics:
1. A list of unvisited URLs – seed this with one or more starting pages
2. A list of visited URLs – so you don’t go around in circles
3. A set of rules for URLs you’re not interested in – so you don’t index the whole Internet
4. Persistent storage (e.g. a database) for all of the above – so the crawler can stop and restart from the same place without losing state
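To make point 4 concrete, here is a minimal sketch of saving and restoring crawler state. The file names and plain-text layout are my own choices for illustration; a real crawler would more likely use a database, as suggested above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: persist the unvisited and visited lists so a restarted
// crawl can resume where it left off, one URL per line.
public class CrawlState {
    final Deque<String> unvisited = new ArrayDeque<>();
    final Set<String> visited = new LinkedHashSet<>();

    // write both lists to plain text files in the given directory
    void save(Path dir) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve("unvisited.txt"), unvisited);
        Files.write(dir.resolve("visited.txt"), visited);
    }

    // restore state from a previous run, if the files exist
    static CrawlState load(Path dir) throws IOException {
        CrawlState s = new CrawlState();
        Path u = dir.resolve("unvisited.txt");
        Path v = dir.resolve("visited.txt");
        if (Files.exists(u)) s.unvisited.addAll(Files.readAllLines(u));
        if (Files.exists(v)) s.visited.addAll(Files.readAllLines(v));
        return s;
    }
}
```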
The algorithm is as follows:
while (list of unvisited URLs is not empty) {
    take a URL from the list
    fetch its content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
            if it matches your rules
                    and it's not already in the visited or unvisited list {
                add it to the unvisited list
            }
        }
    }
}
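The loop above can be sketched directly in Java. To keep it self-contained, the fetching and link extraction are passed in as functions (in a real crawler the fetcher would do an HTTP GET and parse the page); the point here is just the unvisited/visited bookkeeping.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

public class SimpleCrawler {
    // crawl from a seed URL; 'rules' decides which URLs are worth
    // following, 'fetchLinks' fetches a page and returns its links
    public static Set<String> crawl(String seed,
                                    Predicate<String> rules,
                                    Function<String, List<String>> fetchLinks) {
        Deque<String> unvisited = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        unvisited.add(seed);
        while (!unvisited.isEmpty()) {
            String url = unvisited.poll();       // take a URL from the list
            if (!visited.add(url)) continue;     // skip if already seen
            for (String link : fetchLinks.apply(url)) {
                // only queue links that match the rules and are new
                if (rules.test(link)
                        && !visited.contains(link)
                        && !unvisited.contains(link)) {
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }
}
```

A linear scan with `unvisited.contains` is slow for a large frontier; a companion `HashSet` mirroring the queue would make that check O(1).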
Let’s discuss further if you decide to make one. One remaining problem is the initial list of URLs: where do you find websites to seed it with? You generally have to take them from an existing directory or some other source, or even enter them manually.
Jsoup is an HTML parser that makes the parsing part very easy and even interesting to do.
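For example, here is a small jsoup sketch that pulls absolute link URLs out of a page. The HTML snippet and base URL are made-up inputs for illustration; in a real crawl you would use `Jsoup.connect(url).get()` to fetch and parse in one step, as the code further down does.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    // parse an HTML string and return the absolute URL of every link
    public static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        Document doc = Jsoup.parse(html, baseUrl); // baseUrl resolves relative hrefs
        for (Element a : doc.select("a[href]")) {  // every anchor with an href
            links.add(a.attr("abs:href"));         // absolute form of the URL
        }
        return links;
    }
}
```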
Update: I have made a simple crawler tutorial.
The following code is a rough draft and still has problems.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FileCrawler {

    public static void main(String[] args) throws IOException {
        File dir = new File(".");
        String loc = dir.getCanonicalPath() + File.separator + "record.txt";

        // start the record file that tracks visited URLs
        FileWriter fstream = new FileWriter(loc, true);
        BufferedWriter out = new BufferedWriter(fstream);
        out.newLine();
        out.close();

        processPage("http://some.url");

        // clean up the record file when the crawl finishes
        File file = new File(loc);
        file.delete();
    }

    // given a String and a File, return whether the String is contained in the File
    public static boolean checkExist(String s, File fin) throws IOException {
        FileInputStream fis = new FileInputStream(fin);
        BufferedReader in = new BufferedReader(new InputStreamReader(fis));
        String aLine;
        while ((aLine = in.readLine()) != null) {
            // process each line
            if (aLine.trim().contains(s)) {
                in.close();
                return true;
            }
        }
        // do not forget to close the reader
        in.close();
        return false;
    }

    public static void processPage(String URL) throws IOException {
        File dir = new File(".");
        String loc = dir.getCanonicalPath() + File.separator + "record.txt";

        // skip links we don't want to follow ("adfad" and "fdafd" are
        // site-specific junk patterns; adjust the rules for your own site)
        if (URL.contains(".pdf") || URL.contains(".jpg") || URL.contains("@")
                || URL.contains(":80")
                || URL.contains("adfad") || URL.contains("fdafd"))
            return;

        // normalize the URL: drop a trailing slash, and ignore other sites
        if (URL.contains("some.url") && !URL.endsWith("/")) {
            // already in the form we want
        } else if (URL.contains("some.url") && URL.endsWith("/")) {
            URL = URL.substring(0, URL.length() - 1);
        } else {
            // URL of another site -> do nothing
            return;
        }

        File file = new File(loc);
        // only process URLs we have not seen before
        if (!checkExist(URL, file)) {
            System.out.println("------ : " + URL);

            // record the URL in the file
            FileWriter fstream = new FileWriter(loc, true);
            BufferedWriter out = new BufferedWriter(fstream);
            out.write(URL);
            out.newLine();
            out.close();

            Document doc;
            try {
                doc = Jsoup.connect(URL).get();
            } catch (IOException e1) {
                e1.printStackTrace();
                return;
            }

            if (doc.text().contains("PhD")) {
                // record whatever you care about from the content here
            }

            // follow every link on the page (recursively)
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                processPage(link.attr("abs:href"));
            }
        }
    }
}