How to process a simple loop in WWW::Mechanize more efficiently?

Need a re-write of a Java / Python Spider

  • Specifically for efn-ga; however, since he may not want to do it, everyone with Java and Python skills is welcome to give it a go. I need a rewrite of a Java / Python spider that I wrote myself. I've been tweaking it off and on as time permits, but I now have something else that is more pressing, so I need some help in making this one perfect.

    1) I've only tested it on up to 500 or so URLs, and I'm confident it works up to that number. I recently did a test of 60,000 URLs and it stalled at 3,684, so that might be an issue, or it might be CPU related. It never taxes my server, so I don't think that is the problem.

    2) I'm sure the code is inefficient, as it was a first draft. Please correct it.

    3) The code passes the actual URL visit off to a Python script I wrote. The reason is that the following Python code:

        socket.setdefaulttimeout(30)
        remotefile = urllib._urlopener.open(x)
        content = remotefile.read()

    can't be reproduced (easily) with the Java URLConnection class: no method within URLConnection exists to set a timeout. Of course I could get into sockets, but I don't have the time (pardon the pun). I'd like this issue addressed in one of two ways: a) use the Java Socket class or another such class to set a timeout, and rewrite the Python portion into a method within my Spider class (a minimal timeout sketch follows the code listings below); or b) just rewrite the whole thing in Python.

    4) Finally, my last feature request is a threaded application. Please add threading capability. I think up to 4 threads running would be OK; however, assuming memory is not an issue (and it isn't for me), I could see up to 10 or 20 threads running at the same time.

    FYI, the spider hits only one page of a site. It will not be used to retrieve multiple pages from a site, hence I haven't incorporated a reading of the robots.txt file. I don't see editing and improving my code as taking more than 1 hour. Please advise. I'll start with 25 USD.
Here is the Python code:

import urllib
import re
import sys
import string
import socket

class AppURLopener(urllib.FancyURLopener):
    def __init__(self, *args):
        self.version = "Bot Information"
        urllib.FancyURLopener.__init__(self, *args)

urllib._urlopener = AppURLopener()

x = sys.argv[1]
socket.setdefaulttimeout(30)

try:
    #remotefile = urllib.urlopen(x)
    remotefile = urllib._urlopener.open(x)
    content = remotefile.read()
    p = re.compile('<TITLE>.*?</TITLE>', re.DOTALL | re.IGNORECASE)
    q = re.compile('<meta.*?>', re.DOTALL | re.IGNORECASE)
    r = re.compile('<meta.*?description.*?content="', re.DOTALL | re.IGNORECASE)
    s = re.compile('<TITLE>|</TITLE>', re.DOTALL | re.IGNORECASE)
    plist = p.findall(content)
    qlist = q.findall(content)
    remotefile.close()
    for x in plist:
        m = p.match(x)
        if m:
            #print 'Match found: ', m.group()
            x = re.sub(s, '', x)
            x = string.strip(x)
            print '<TITLE>', x, '<TITLE>'
            break
    for x in qlist:
        m = r.match(x)
        if m:
            #print 'Match found: ', m.group()
            x = re.sub(r, '', x)
            x = string.replace(x, '">', '')
            x = string.strip(x)
            print '<DESC>', x, '<DESC>'
            break  # need to break out for those sites where the designer made a mistake and has multiple descriptions
except Exception, e:
    print '<TITLE>Timeout<TITLE>'
    print '<DESC>Timeout<DESC>'

###End Python Code###

Here is the Java code:

import java.io.*;
import java.sql.*;
import java.util.List;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.ListIterator;
import java.util.Collections;

public class Spider {

    private ArrayList urlList;
    private ArrayList selectList;
    private String urlString;

    //SelectData variables
    private Statement stmt;
    private Connection conn;
    private String url = "jdbc:mysql://domainname.com/dbName";
    private String username = "username ";
    private String password = "password";
    private Statement selectData;
    private ResultSet rs;
    private int total = 0;
    private int initial = 0;
    private int num_of_rows = 1000;

    private class ResultSetData {
        protected int linkID;
        protected String linkURL;
        protected int linkFatherID;

        public ResultSetData(int lid, String lur, int fid) {
            linkID = lid;
            linkURL = lur;
            linkFatherID = fid;
        }//end ResultSetData constructor
    }//end ResultSetData class

    private Spider() {
        urlList = new ArrayList();
        total = countData();
        System.out.println("The total is: " + total);
        System.out.println("The initial is: " + initial);
        System.out.println("The num_of_rows is: " + num_of_rows);
        while (initial < total) {
            if (total - initial > num_of_rows) {
                num_of_rows = num_of_rows;
            }//end if
            else {
                num_of_rows = total - initial;
            }//end else
            System.out.println("Initial = " + initial + " and Total = " + total);
            urlList = SelectData(initial, num_of_rows);
            Iterator it = urlList.iterator();
            int y = urlList.size();
            System.out.println("Size is: " + y);
            //for(int i=0;i<100;i++) {
            while (it.hasNext()) {
                System.out.println("In while loop....and size is " + y);
                //You need to cast the it.next elements to the appropriate Object.
                ResultSetData rsd = (ResultSetData) it.next();
                //urlString = (String) it.next();
                urlString = rsd.linkURL;
                //Test effect of bogus URL
                //urlString = "http://www.unknownunknowndoesnotexit.com";
                String rp = RunPython(urlString);
                //System.out.println("Now back in constructor");
                //System.out.println(rp);
                String [] sp = SplitText(rp);
                String title = "Not available.";
                String desc = "Not available.";
                String strUrl = "Not available.";
                int link_id = rsd.linkID;
                int father_id = rsd.linkFatherID;
                //System.out.println("Now in for loop...");
                for (int h = 0; h < sp.length; h++) {
                    //System.out.println(sp[h]);
                    if (sp[h].startsWith("<TITLE>")) {
                        title = sp[h].replaceAll("<TITLE>", "").trim();
                    }//end if
                    else if (sp[h].startsWith("<DESC>")) {
                        desc = sp[h].replaceAll("<DESC>", "").trim();
                    }//end if
                    else if (sp[h].startsWith("<URL>")) {
                        strUrl = sp[h].replaceAll("<URL>", "").trim();
                    }//end if
                }//end for
                ShowToday st = new ShowToday();
                //System.out.println(st.demo());
                //System.out.println("Data retrieved...");
                //System.out.println("URL = " + strUrl);
                //System.out.println("TITLE = " + title);
                //System.out.println("DESC = " + desc + "\n");
                InsertData(title, desc, strUrl, link_id, father_id);
            }//end while
            initial = initial + num_of_rows;
            System.out.println("The initial is now: " + initial);
        }//end while
        System.exit(0);
    }//end constructor

    public String RunPython(String u) {
        urlString = u;
        String s;
        String text = "<URL>" + u + "<URL>";
        try {
            // run the python application
            Process p = Runtime.getRuntime().exec("python2.3 urlcontent.py " + urlString);
            BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
            BufferedReader stdError = new BufferedReader(new InputStreamReader(p.getErrorStream()));
            while ((s = stdInput.readLine()) != null) {
                //System.out.println(s);
                text += "_:,:_" + s;
            }//end while
            //System.exit(0);
        }//end try
        catch (IOException e) {
            System.out.println("IOException");
            e.printStackTrace();
            System.exit(-1);
        }//end catch
        return text;
    }//end method

    public String [] SplitText(String i) {
        String text = i;
        String [] split_text = text.split("_:,:_");
        return split_text;
    }//end method

    private void InsertData(String t, String d, String u, int l, int f) {
        PreparedStatement stmt;
        Connection conn;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        }//end try
        catch (Exception ex) {
            ex.printStackTrace();
        }//end catch
        try {
            String insert = "INSERT INTO spider (title,description,url,net_Links_ID," +
                    "net_Category_FatherID) values (?,?,?,?,?)";
            String url = "jdbc:mysql://domainname.com/dbname";
            String username = "username";
            String password = "password";
            conn = DriverManager.getConnection(url, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        }//end try
        catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }//end catch
    }//end method

    private ArrayList SelectData(int in, int nmr) {
        /* this try will query 3 tables in my database and retrieve a slice of URLs
           that I want to spider. I can provide a CREATE statement and some data
           for troubleshooting. */
        try {
            selectList = new ArrayList();
            String select = "SELECT net_Links.ID, net_Links.URL, net_Links.TITLE, net_Category.FatherID " +
                    "FROM net_Links,net_Category, net_CatLinks " +
                    "WHERE net_Category.Full_Name LIKE 'Health%' " +
                    //"WHERE net_Category.FatherID = '72658' " +
                    //"AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
                    "AND net_Category.ID = CategoryID AND net_Links.ID = LinkID LIMIT " +
                    initial + "," + num_of_rows;
            System.out.println(select);
            conn = DriverManager.getConnection(url, username, password);
            selectData = conn.createStatement();
            rs = selectData.executeQuery(select);
            while (rs.next()) {
                int id = rs.getInt("net_Links.ID");
                String ul = rs.getString("net_Links.URL");
                int fd = rs.getInt("net_Category.FatherID");
                ResultSetData rsd = new ResultSetData(id, ul, fd);
                selectList.add(rsd);
            }//end while
            selectData.close();
            conn.close();
            System.out.println("The select statement is:\n\n" + select);
        }//end try
        catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }//end catch
        return selectList;
    }//end method

    private int countData() {
        try {
            /* this try counts the number of rows that could be returned if all rows
               were to be selected. The idea is to use this number to determine the
               final loop's SQL statement. */
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total FROM net_Links,net_Category, net_CatLinks " +
                    "WHERE net_Category.Full_Name like 'Health%' " +
                    //"WHERE net_Category.FatherID = '72658' " +
                    "AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(url, username, password);
            selectData = conn.createStatement();
            rs = selectData.executeQuery(select_count);
            while (rs.next()) {
                total = rs.getInt("total");
            }//end while
            System.out.println(total);
        }//end try
        catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }//end catch
        return total;
    }//end method

    public static void main(String args[]) {
        Spider sp = new Spider();
    }//end main
}//end class
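[Editorial note: the timeout sketch promised in point 3 above. It is a minimal, untested illustration, not part of the posted spider. It assumes a Java 5 or later runtime, where URLConnection itself exposes setConnectTimeout() and setReadTimeout() (both in milliseconds); the URL below is a placeholder. A connection or read that stalls past the limit surfaces as java.net.SocketTimeoutException.]

//----------begin TimeoutFetch.java (illustrative sketch only)
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.net.URLConnection;

public class TimeoutFetch {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.example.com/"); // placeholder address
            URLConnection conn = url.openConnection();
            conn.setConnectTimeout(30 * 1000); // fail if no connection within 30 s
            conn.setReadTimeout(30 * 1000);    // fail if any single read stalls for 30 s
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // the spider would pattern-match here instead
            }
            reader.close();
        } catch (SocketTimeoutException e) {
            // plays the same role as the Python script's "Timeout" output
            System.out.println("<TITLE>Timeout<TITLE>");
            System.out.println("<DESC>Timeout<DESC>");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
//----------end TimeoutFetch.java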

  • Answer:

    Dear coolguy90210,

    Among the posts you made above after I had posted my multi-threaded Spider code, the first few mention a problem you observed with the single-threaded version. In particular, the program was going into a trance around the 8698th URL in your database.

    If I were in your place and were inclined to investigate this problem, my first step would be to run the single-threaded code on just the five or ten URLs around that point. Once I had definitively identified the one URL that was causing all the trouble, I would visit it manually with a web browser to investigate the characteristics of the web page. I would also insert many println() statements into the program and rerun it just for that one URL, carefully observing its behavior. Does it get confused while opening the connection? Or while reading the contents of the web site?

    I can't say for certain, but I suspect the problem is that a web site begins feeding information to the crawler, then abruptly stops. The crawler keeps waiting, since it has not yet received an End Of File (EOF) marker and believes there's more text to come. I have adopted this as my hypothesis because the business of requesting and establishing an HTTP connection is conducted according to a fairly well defined and widely respected protocol. I would be surprised if Java's URL class didn't have a reasonable timeout to deal with web servers that can't make an HTTP connection according to spec. I might be wrong. The only way to find out for sure is to carry out close debugging.

    A heavy-duty solution to the problem of unpredictable timeouts is to use supervisor threads that act something like killer droids. You can either have a bunch of these supervisor threads polling the crawler threads and terminating any that are taking too long to do their job, or you can launch a supervisor together with each crawler. Under a buddy system of this kind, the only job of the supervisor is to terminate its associated crawler, as well as itself, if the crawling isn't done after, say, 30 seconds. Otherwise, if the crawler finishes on time, it terminates its supervisor and itself. (A rough sketch of such a supervisor appears after the Spider.java listing at the end of this answer.)

    The next item you mention is that you'd like to know how many crawler threads are running at once in the multi-threaded Spider. In the code below, I'll modify the periodic output so that it shows the number of threads currently running simultaneously, but this should almost always be equal to maxThreadCount. This is because as soon as one crawler is finished, the next one starts up. The transition is practically instantaneous. To see only 19 crawlers running instead of the maximum 20 would be a rare treat.

    You also express an interest in the upper limit on the number of threads that could run at one time. The answer is complicated because the crawler threads are not merely using CPU bandwidth, they're also using Internet bandwidth. If you were running threads that didn't access the disk or the network, you could comfortably run thousands of them, even tens of thousands at a time, on a fast machine equipped with a robust operating system such as Linux. When it comes to web crawling, however, you will probably congest your Internet connection once you've got several dozen threads downloading web pages at the same time. Divide the incoming bandwidth of your home computer by the outgoing bandwidth of your average website, add something like a twofold factor to account for latency, and that's roughly the number of crawlers you can reasonably operate at the same time.
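    [Editorial note: to put illustrative numbers on that rule of thumb (the figures here are assumptions for the sake of example, not measurements): with an 8 Mbit/s consumer downlink, and an average website willing to serve perhaps 0.5 Mbit/s to a single client, 8 / 0.5 = 16 simultaneous downloads would keep the line full, and doubling that to allow for latency puts the comfortable ceiling at roughly 30 concurrent crawlers.]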
    I know that Korea has the fastest consumer broadband in the world, so you might be able to run a couple of hundred crawlers on a good hookup. I'm quite willing to bet that 25 crawlers is a safe number on a DSL modem anywhere in the world, and I'd even feel safe with 50.

    You asked about setting the HTTP user agent for a Java crawler. It is indeed a nice habit to inform web sites of your crawler's identity, but I'm afraid I can't help you on this score. I'm more familiar with the Python web-access methods myself. I looked at the HttpURLConnection class, but it doesn't seem to offer anything relevant. You could certainly send the user-agent information if you wrote your own URL access methods that talk directly to the socket.

    At one point, you inquire about the line

        PageCrawler crawler = new PageCrawler(this, links[i]);

    and essentially proceed to answer your own question. Yes, "this" refers to the Spider object inside which this line is being executed. The Spider object is passing a reference to itself, so that the PageCrawler object can subsequently call a method of the Spider object. In particular, the PageCrawler says

        spider.unlock();

    when it is about to terminate, thereby informing the Spider object that it may proceed to launch a new PageCrawler. The practice of objects retaining references to each other is an integral part of the object-oriented programming (OOP) style. If you hear OOP practitioners talking about objects passing messages to each other, what they actually mean is that object A, having a reference to object B, calls some method B.foo() to notify B of something. In reply, B might call A.bar(), and then we have an OOP dialogue going.

    The OutOfMemoryError exception is a difficult matter. I have no clear-cut answer for you, but I can mention some possibilities. First, you may want to upgrade to a newer version of the Java Runtime Environment if you haven't done so lately. Memory management is a known problem in Java, and while Sun engineers keep improving the heap allocation and garbage collection, it's still far from perfect. Your hypothesis of excessive thread propagation is off the mark, I hope, or else it would mean that I'd coded the threading incorrectly. The lock() and unlock() methods are meant to prevent the thread count from ever exceeding the value of maxThreadCount.

    Upon reviewing the Spider.java code, what strikes me as the greatest source of memory usage is that the Scraper constructor reads an entire web page into a String before proceeding to search for the <title> and <meta...> patterns. One quick fix, which I've implemented below, is to search for matching patterns on a line-by-line basis. Thus, only a single line is stored at a time. This does mean that you can't extract your information when the pattern is spread over several lines, but I believe such cases are rare. Note that I've also added a check for the </head> tag, since the title and description should only be found in the HTML header. The most general approach would be to read the web page one character at a time or in slightly larger chunks, accumulating a sequence of them only when they begin to match one of your patterns. I don't know whether you have the patience or the need for such a solution. An intermediate solution would be to read the entire header into one String, without the body, and pattern-match in that.

    Finally, you are right to presume that the OutOfMemoryError exception, like all other exceptions, can be caught in your program.
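    [Editorial note: a tiny, self-contained demonstration of that last point. The class below is a throwaway example, not part of the spider; it deliberately exhausts the heap just to show that the error can be caught.]

//----------begin OomCatchDemo.java (throwaway demo)
import java.util.ArrayList;

public class OomCatchDemo {
    public static void main(String[] args) {
        ArrayList hog = new ArrayList();
        try {
            // Deliberately exhaust the heap to show that the error is catchable.
            while (true) {
                hog.add(new long[1000000]);
            }
        } catch (OutOfMemoryError e) {
            // OutOfMemoryError extends Error rather than Exception, but like any
            // Throwable it can be caught; here we free the hoard and report it.
            hog.clear();
            System.err.println("Caught: " + e);
        }
    }
}
//----------end OomCatchDemo.java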
    If there's a particular segment of code that does something bad to the memory, you might be able to narrow it down by enclosing smaller and smaller blocks of code with an appropriate try...catch statement. Alternatively, and this is something I often do, you can just sprinkle println() statements throughout the program to display the current value of the most interesting variables. Then, at the point of failure, you'll have a customized debugging trace to review.

    Was there anything else? I can't recall more at the moment. In the event of further trouble, I'll be glad to respond to your Clarification Requests.

    Regards,

    leapinglizard

//----------begin Spider.java
import java.io.*;
import java.sql.*;
import java.util.*;
import java.util.regex.*;
import java.net.*;

class PageData {
    String title, desc;

    public PageData(String title, String desc) {
        this.title = title;
        this.desc = desc;
    }

    void setTitle(String title) {
        this.title = title;
    }

    void setDesc(String desc) {
        this.desc = desc;
    }
}

class Scraper {
    int pFlags = Pattern.CASE_INSENSITIVE;
    Pattern titlePattern = Pattern.compile(
            "<title>(.*?)</title>", pFlags);
    Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", pFlags);
    Pattern endPattern = Pattern.compile(
            "</head>", pFlags);
    PageData pageData;

    public Scraper(String address) {
        URL url;
        InputStreamReader stream = null;
        BufferedReader reader = null;
        String line;
        pageData = new PageData("_no_title_", "_no_description_");
        boolean gotTitle = false, gotDesc = false;
        Matcher matcher;
        try {
            url = new URL(address);
            stream = new InputStreamReader(url.openStream());
            reader = new BufferedReader(stream);
            while ((line = reader.readLine()) != null) {
                if ((matcher = titlePattern.matcher(line)).find()) {
                    pageData.setTitle(matcher.group(1).trim());
                    gotTitle = true;
                }
                if ((matcher = descPattern.matcher(line)).find()) {
                    pageData.setDesc(matcher.group(1).trim());
                    gotDesc = true;
                }
                if (endPattern.matcher(line).find() || (gotTitle && gotDesc))
                    return;
            }
        } catch (MalformedURLException e) {
            pageData.setTitle("_malformed_URL_");
        } catch (IOException e) {
            pageData.setTitle("_inaccessible_URL_");
        }
    }

    public PageData getPageData() {
        return pageData;
    }
}

class LinkData {
    int id, father_id;
    String url;

    public LinkData(String url, int id, int father_id) {
        this.url = url;
        this.id = id;
        this.father_id = father_id;
    }
}

class PageCrawler extends Thread {
    Spider spider;
    LinkData link;

    PageCrawler(Spider spider, LinkData link) {
        this.spider = spider;
        this.link = link;
    }

    public void run() {
        PageData page = new Scraper(link.url).getPageData();
        insertData(page.title, page.desc, link.url, link.id, link.father_id);
        spider.unlock();
    }

    void insertData(String t, String d, String u, int l, int f) {
        Connection conn;
        String dbURL = spider.dbURL;
        String username = spider.username, password = spider.password;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        try {
            String insert = "INSERT INTO spider (title,description,url," +
                    "net_Links_ID,net_Category_FatherID) values (?,?,?,?,?)";
            conn = DriverManager.getConnection(dbURL, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
    }
}

public class Spider {
    int maxThreadCount = 50;
    String dbURL = "jdbc:mysql://domainname.com/dbName";
    String username = "username", password = "password";
    int threadCount, total, count, row_num, ct;

    public Spider() {
        threadCount = 0;
        count = 0;
        row_num = 1000;
        total = countData();
        System.out.println("count = "+count+", total = "+total);
        while (count < total) {
            ct = count;
            row_num = Math.min(row_num, total-count);
            System.out.println("row_num = "+row_num);
            LinkData links[] = selectData(count, row_num);
            System.out.println("found "+links.length+" links");
            for (int i = 0; i < links.length; i++) {
                lock();
                PageCrawler crawler = new PageCrawler(this, links[i]);
                crawler.start();
                if (++ct % 100 == 0) {
                    System.out.println("launched "+ct+" crawlers so far");
                    System.out.println("    ("+threadCount+" simultaneous)");
                }
            }
            count += row_num;
            System.out.println("count = "+count);
        }
        System.exit(0);
    }

    public synchronized void lock() {
        while (threadCount == maxThreadCount)
            try {
                wait();
            } catch (InterruptedException e) {}
        threadCount++;
    }

    public synchronized void unlock() {
        threadCount--;
        notify();
    }

    private LinkData[] selectData(int in, int nmr) {
        // queries 3 tables in database to retrieve URLs
        Vector vector = new Vector();
        Connection conn;
        try {
            String select = "SELECT net_Links.ID, net_Links.URL," +
                    " net_Links.TITLE, net_Category.FatherID" +
                    " FROM net_Links,net_Category, net_CatLinks" +
                    " WHERE net_Category.Full_Name LIKE 'Health%'" +
                    " AND net_Category.ID = CategoryID AND" +
                    " net_Links.ID = LinkID LIMIT "+count+","+row_num;
            System.out.println("select = \""+select+"\"");
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            ResultSet rs = statement.executeQuery(select);
            while (rs.next()) {
                String url = rs.getString("net_Links.URL");
                int id = rs.getInt("net_Links.ID");
                int father_id = rs.getInt("net_Category.FatherID");
                vector.add(new LinkData(url, id, father_id));
            }
            statement.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
        LinkData links[] = new LinkData[vector.size()];
        for (int i = vector.size()-1; i >= 0; i--)
            links[i] = (LinkData) vector.get(i);
        return links;
    }

    private int countData() {
        int ct = 0;
        Connection conn;
        try {
            // counts number of rows; will use to construct final SQL statement
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total" +
                    " FROM net_Links,net_Category, net_CatLinks WHERE" +
                    " net_Category.Full_Name like 'Health%'" +
                    " AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            ResultSet rs = statement.executeQuery(select_count);
            while (rs.next())
                ct += rs.getInt("total");
            System.out.println("count = "+ct);
            statement.close();
            conn.close();
        } catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }
        return ct;
    }

    public static void main(String args[]) {
        new Spider();
    }
}
//----------end Spider.java

coolguy90210-ga at Google Answers
