perl: fetching Web pages / redirect issue
Hi - The script below fetches Web page content from specified URLs. I'd like to have code added to the grab function so that in case of a redirect I'll have a way of knowing what URL is being redirected to, by looking at the contents of @endurl. In the example below, for instance, "http://www.news.com" redirects to "http://news.com.com", therefore $endurl[0] should be "http://news.com.com". $endurl[1] should be "http://www.yahoo.com" since there is no redirect for that address. As an answer to this question, please add the necessary code to the grab function and post the entire modified, tested script. Thank you. Marc.

#!/usr/bin/perl

use LWP::Simple;
require LWP::Parallel::UserAgent;

@urls = ( "http://www.news.com", "http://www.yahoo.com" );

@content = &grab(@urls);

# add code to the grab function below so that @endurl is filled with the
# actual URLs that content is being fetched from.
# in cases where there is no redirect, those will be the same as $urls[n].
# In this example, $endurl[0] should be "http://news.com.com" (since this is
# what www.news.com redirects to) and $endurl[1] should be "http://www.yahoo.com"

print "$endurl[0] \n";
print $content[0];
exit;

sub grab {
    @results = ();
    $ua = LWP::Parallel::UserAgent->new();
    $ua->agent("MS Internet Explorer");
    $ua->timeout($timeout);
    $ua->redirect(1);   # allows automatic following of redirects
    $ua->max_hosts(6);  # sets maximum number of locations accessed in parallel
    $ua->max_req(2);    # sets maximum number of parallel requests per host
    foreach $url2 (@_) {
        $ua->register(HTTP::Request->new(GET => $url2), \&callback);
    }
    $ua->wait();
    return @results;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    # Comment out this line to stop printing the URL
    print $response->base . "\n";
    # majortom-ga's change
    my $oresponse = $response;
    while (defined($oresponse->previous)) {
        $oresponse = $oresponse->previous;
    }
    for ($i = 0; $i < @urls; $i++) {
        # check $oresponse->base, not $response->base
        if (index($oresponse->base, $urls[$i]) != -1) {
            $results[$i] .= $data;
            last;
        }
    }
}
Answer:
Hello, marcfest-ga,

It's a pleasure to hear from you again. I have made the change you asked for, and I have also made an improvement to the callback: instead of searching for $urls[$i] in $oresponse->base (which just happens to work if there is no <base> tag, or if the <base> tag happens to point to a deeper URL on the same web site), I now compare $urls[$i] directly to $oresponse->request->uri, which is the document actually asked for on this particular request. It may not be the *original* document asked for, due to redirects, so we still walk the chain of previous responses as we always have.

Here is the code with the new @endurl feature. Let me know if you have any questions! Thanks for the opportunity to work on this interesting program.

Sources of information: "perldoc HTTP::Request", "perldoc HTTP::Response"

* * * CUT HERE * * *

#!/usr/bin/perl

use LWP::Simple;
require LWP::Parallel::UserAgent;

@urls = ( "http://www.news.com", "http://www.yahoo.com" );

@content = &grab(@urls);

# @endurl is filled with the actual URLs that content is being fetched from.
# in cases where there is no redirect, those will be the same as $urls[n].
# In this example, $endurl[0] should be "http://news.com.com" (since this is
# what www.news.com redirects to) and $endurl[1] should be "http://www.yahoo.com"

print "ENDURLS ARE:\n";
for $u (@endurl) {
    print $u, "\n";
}
print "END OF ENDURLS\n";
print $content[0];
exit;

sub grab {
    @results = ();
    @endurl  = ();
    $ua = LWP::Parallel::UserAgent->new();
    $ua->agent("MS Internet Explorer");
    $ua->timeout($timeout);
    $ua->redirect(1);   # ALLOWS automatic following of redirects
    $ua->max_hosts(6);  # sets maximum number of locations accessed in parallel
    $ua->max_req(2);    # sets maximum number of parallel requests per host
    foreach $url2 (@_) {
        $ua->register(HTTP::Request->new(GET => $url2), \&callback);
    }
    $ua->wait();
    return @results;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    # Comment out this line to stop printing the URL
    print "URL: ", $response->base, "\n";
    # majortom-ga's change
    my $oresponse = $response;
    while (defined($oresponse->previous)) {
        $oresponse = $oresponse->previous;
    }
    for ($i = 0; $i < @urls; $i++) {
        # majortom-ga: instead of trying to find $urls[$i] in
        # $oresponse->base, which won't necessarily contain it
        # if a Content-Base: header or <base> tag is present,
        # we match it exactly against $oresponse->request->uri
        if ($oresponse->request->uri eq $urls[$i]) {
            # majortom-ga: record the URL of the last page
            # we were redirected to
            $endurl[$i] = $response->request->uri;
            $results[$i] .= $data;
            last;
        }
    }
}
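For readers porting this logic to another language: the core idea is language-neutral. Each response after an automatic redirect keeps a link to the response it was redirected from; walking that chain backwards recovers the original request, while the final response's own request URI is the address the chain landed on. Below is a minimal Python sketch of that walk using stand-in objects (MockResponse is hypothetical; no network or LWP is involved):

```python
class MockResponse:
    """Stand-in for an HTTP response that remembers the response
    it was redirected from (like HTTP::Response's previous())."""
    def __init__(self, url, previous=None):
        self.url = url            # URL this response was actually served from
        self.previous = previous  # response that redirected here, or None

def original_and_final(response):
    """Walk back through the redirect chain to find the first request,
    and pair its URL with the URL the chain finally landed on."""
    first = response
    while first.previous is not None:
        first = first.previous
    return first.url, response.url

# A one-hop chain: www.news.com redirected to news.com.com
start = MockResponse("http://www.news.com")
final = MockResponse("http://news.com.com", previous=start)
orig, end = original_and_final(final)   # ("http://www.news.com", "http://news.com.com")

# No redirect: original and final URL are the same
plain = original_and_final(MockResponse("http://www.yahoo.com"))
```

The first element is what you match against your @urls-style input list (to know which slot to fill), and the second is what goes into the @endurl-style output list.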
marcfest-ga at Google Answers