Can I use a web scraper to get real URLs via shortened URLs?
Are there any open source web scrapers that I can use to get original URLs from shortened URLs (e.g. bit.ly, is.gd, tinyurl, etc.)? I'm interested in scraping Digg submission histories for an easy way to look at the websites certain Diggers link to. Unfortunately, the history page links to a Digg page, which then often links to a digg.com-shortened URL via the DiggBar. Is there any way to get through these clicks and scrape for the original URLs?
Answer:
I don't think you need a real web scraper for this. Digg has announced that anonymous users will get a straight 301 redirect to the site in question; the DiggBar is opt-in now. You should be able to use something like LWP in Perl, or cURL, to see where you are being redirected to. That will give you the target URL.
alohaliz at Ask.Metafilter.Com
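To make the redirect idea above concrete, here is a minimal sketch in Python using only the standard library. It is a different route from the LWP or cURL one the answer suggests, not a copy of it. The function names (next_hop, fetch_status_and_location, expand) are made up for illustration, and the sketch assumes the shortener answers HEAD requests, which not every service does.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}

def next_hop(current_url, status, location):
    """Return the URL a redirect response points to, or None if it is not a redirect."""
    if status not in REDIRECT_CODES or not location:
        return None
    # Location may be relative, so resolve it against the URL that was requested.
    return urljoin(current_url, location)

def fetch_status_and_location(url, timeout=10):
    """Send one HEAD request and report its status and Location header, without following it."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def expand(url, fetch=fetch_status_and_location, max_hops=10):
    """Follow redirects one hop at a time and return the last URL reached."""
    for _ in range(max_hops):
        status, location = fetch(url)
        target = next_hop(url, status, location)
        if target is None:
            return url
        url = target
    return url  # gave up after max_hops; this is the furthest URL seen
```

Calling expand("http://xrl.us/bepvnq") would then return wherever that short link currently points. The fetch parameter exists only so the hop-following logic can be exercised without a network connection.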
Other answers
It's not a scraper as such, but if you are writing one you might look at the code from http://longurl.org/tools. It's a Greasemonkey script that expands these types of URLs and displays the destination URL in a Firefox tooltip.
fireoyster
Cool, that helps with getting the destination URL from the shortened URL. Thanks! But what about getting even the shortened URL from just the submission history page (e.g. http://digg.com/users/MrBabyMan/history/submissions)? It looks like the submission page just goes to the story's Digg.com page... unless I'm missing something? (I'm new at this, if it's not totally obvious.) Is there a way to pull all the destination URLs straight off the submission history page? By the way, I'm not actually interested in MrBabyMan; I'm just using him as an example.
alohaliz
On Unix systems:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::UserAgent;

    # Shortened URLs to expand; replace these with your own list.
    my @urls = qw(http://xrl.us/bepvnq http://xrl.us/bepvnq);
    my @locations;
    open my $OUT, ">", "long_urls.txt" or die "Cannot open long_urls.txt: $!";
    my $ua = LWP::UserAgent->new;
    foreach my $url (@urls) {
        my $resp = $ua->get($url);
        # previous() is the redirect response; skip URLs that did not redirect.
        my $prev = $resp->previous or next;
        push @locations, $prev->header('location');
    }
    print $OUT "$_\n" for @locations;
    close $OUT;

On Windows, grab Strawberry Perl from http://strawberryperl.com and run the script from a DOS window with perl scrape.pl, in the directory containing the script.

To do what you want with the submission history, install WWW::Mechanize from CPAN (at the DOS prompt: cpan WWW::Mechanize) and then run the utility script:

    mech-dump --links http://the.submission.page.com > unedited_links

Then edit that file down to only the links you're interested in, paste them into the list in the script above in place of the two URLs there, and you should be good. Of course there are ways to automate that more, but that'll do you to some extent.
singingfish
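For readers without Perl to hand, the link-dumping step that mech-dump performs above can be approximated in Python with the standard library alone. This is a rough sketch, not a port of mech-dump: LinkCollector and extract_links are illustrative names, and the markup in the usage note is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Given a page's HTML as a string (fetched with urllib.request.urlopen, for instance), extract_links(html, page_url) yields a list that can be trimmed by hand in the same way as the mech-dump output.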
I see the problem you have there. That is a web scraping problem, and I don't know of anything that will parse this out of the box. You would have to parse the HTML and pull out the links you care about, follow them, and then parse those pages to get the real URL. That's a tall order if you're not comfortable with scripting yourself. If you are feeling brave, WWW::Mechanize is, indeed, a good Perl library to handle it. You might take a look at the find_link method, which might get you what you need.
MasterShake
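The three steps described above (pull the story links off the history page, open each story page, then find and resolve its outbound link) could be wired together roughly as follows. This is a Python sketch rather than WWW::Mechanize code. Everything in it is an assumption made for illustration: the function names, the two regular expressions passed in, and the idea that story and outbound links can be told apart by URL pattern at all, since the thread does not show Digg's actual markup. The page fetcher and redirect resolver are passed in as plain callables.

```python
import re

def select_links(links, pattern):
    """Keep links whose URL matches the regex, in order, without duplicates."""
    rx = re.compile(pattern)
    seen = set()
    kept = []
    for link in links:
        if rx.search(link) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept

def destinations(history_links, story_pattern, get_links, outbound_pattern, resolve):
    """Map each story page URL to the resolved URL of its first outbound link.

    history_links    -- links already extracted from the submission history page
    story_pattern    -- regex picking out story pages (hypothetical)
    get_links        -- callable: page URL -> list of links on that page
    outbound_pattern -- regex picking out the shortened outbound link (hypothetical)
    resolve          -- callable: shortened URL -> destination URL
    """
    results = {}
    for story_url in select_links(history_links, story_pattern):
        outbound = select_links(get_links(story_url), outbound_pattern)
        if outbound:  # story pages with no matching link are skipped
            results[story_url] = resolve(outbound[0])
    return results
```

Keeping get_links and resolve as parameters means the filtering logic can be tried on saved pages first, before anything is pointed at the live site.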
There's also WWW::Mechanize::Shell which helps you write mechanize scripts from the command line.
singingfish