Help me populate a spreadsheet by scraping an RSS feed.
-
I would like to scrape information from an RSS feed into an Excel-readable text file, for a completely legal, non-copyright-violating use. In a better world I'd have access to the database that generates the feed, but since this ain't a perfect world, it appears that scraping is my best bet. Are there tools that will help me automate this, or programming tutorials that will help me figure it out myself? (It's been 15 years since I last wrote any code beyond simple SQL queries.)

The XML is formatted like this for each new post:

<item>
<title>A title</title>
<link>http://URL</link>
<guid isPermalink="true">http://URL</guid>
<description>Description, which may include embedded links and images.</description>
</item>

I'd like to scrape this into an Excel-readable format, where each row consists of: TITLE, URL (from the "guid" permalink, not from "link"), DESCRIPTION (first 50 characters; I don't need links or images).

In an even more ideal world, I'd be able to do this in a smart enough manner that if I scrape the feed every day, my software/widget/whatever tool can distinguish new content and only scrape that.

I know this is possible, that it would be super easy for the right programmer, and that without any help I could probably even cobble something together in a month or two. But I'm a writer without access to the "right programmer," and I'd really prefer not to take one or two months to figure it out.
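For concreteness, given the sample item above, one row of the desired output (description clipped to its first 50 characters) would look something like this:

"A title","http://URL","Description, which may include embedded links and"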
-
Answer:
croutonsupafreak at Ask.MetaFilter.com
Other answers
Sounds pretty simple, if you're willing to skip Excel. Set up a database for this, and it becomes a pretty simple Python script:

1. Load up Beautiful Soup.
2. Download the RSS file.
3. Parse the RSS file with Beautiful Soup.
4. Iterate over the items.
5. For each item, check whether the guid is already in the database, and discard it if it is.
6. Write the remaining items to the database, truncating the description text.

(A sketch along these lines appears below.) If even that sounds like too much work, a really hacky alternative could also work: pull the RSS into Liferea, then write a small script that dumps a CSV file from Liferea's SQLite database. I used a similar process to export my blog from LiveJournal to Markdown; it's a really short script, if you exclude the HTML-to-Markdown conversion.
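A minimal sketch of those six steps, assuming Python 3 with the requests, beautifulsoup4, and lxml packages installed; the feed URL and file names below are placeholders, not from the original answer:

import sqlite3

import requests
from bs4 import BeautifulSoup

FEED_URL = "http://example.com/feed.rss"  # placeholder feed URL

con = sqlite3.connect("feed.db")  # placeholder database file
con.execute(
    "CREATE TABLE IF NOT EXISTS items"
    " (guid TEXT PRIMARY KEY, title TEXT, description TEXT)"
)

# Steps 2-3: download the feed and parse it as XML.
soup = BeautifulSoup(requests.get(FEED_URL).text, "xml")

# Step 4: iterate over the items.
for item in soup.find_all("item"):
    guid = item.find("guid").get_text(strip=True)
    # Step 5: skip guids that are already in the database.
    if con.execute("SELECT 1 FROM items WHERE guid = ?", (guid,)).fetchone():
        continue
    # Step 6: store the new item; get_text() drops embedded markup,
    # and the slice truncates the description to 50 characters.
    title = item.find("title").get_text(strip=True)
    desc = item.find("description").get_text(strip=True)[:50]
    con.execute("INSERT INTO items VALUES (?, ?, ?)", (guid, title, desc))

con.commit()
con.close()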
pwnguin
I'd do it in Perl. I'm assuming that you can get to a Perl environment, either via the command line on Linux or OS X, or via Cygwin or the ActiveState Perl port on Windows. Here's a super quick stupid hack that probably misses your guid vs. link distinction (I'm using Perl's XML::RSS module, and haven't looked at the output other than to confirm it works in my RSS reader...). This also doesn't try to do anything about caching files locally, or only outputting changed records, or what-have-you.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use XML::RSS;

my $content = get('http://www.flutterby.net/User:DanLyke_status.rss')
    || die "Unable to get RSS\n";
my $rss = XML::RSS->new();
$rss->parse($content);

foreach my $item (@{$rss->{'items'}})
{
    # Backslash-escape and single-quote each field, then join with commas.
    print join(',',
        map { $a = $item->{$_}; $a =~ s/\\/\\\\/g; $a =~ s/\'/\\'/g; "'$a'" }
        ('title', 'link', 'description'));
    print "\n";
}

Known flaws:
- For caching files, if you do a perldoc LWP::Simple you'll see a "mirror" function you can use instead of "get" to save bandwidth; you can then use $rss->parsefile.
- There's a better way to output CSV versions of $item->{title} etc.
- You need to figure out how to handle duplicates.

This help?
straw
I cobbled this together; it's not awesome and it has an external dependency (http://code.google.com/p/feedparser/), but it kinda works. At least the fields I used worked with a sample feed.

#!/usr/bin/env python
import feedparser, csv, sqlite3

con = sqlite3.connect('your_storage.db')
cur = con.cursor()

# Recreate the table from scratch each run; drop this try/except block
# to keep previously scraped entries, and let the PRIMARY KEY on url
# reject duplicates across runs.
try:
    cur.execute("DROP TABLE entries")
except:
    pass
finally:
    cur.execute("""
        CREATE TABLE entries(
            url PRIMARY KEY,
            title,
            description)""")

rss = feedparser.parse('http://twitter.com/statuses/user_timeline/DBAHULK.rss')
for entry in rss.entries:
    try:
        # Truncate the description (feedparser calls it 'summary') to 50 chars.
        cur.execute("""
            INSERT INTO entries (url, title, description)
            VALUES (?, ?, ?)""",
            (entry['id'], entry['title'], entry['summary'][:50]))
    except Exception as e:
        # A duplicate guid violates the PRIMARY KEY and lands here.
        print "error (%s) inserting '%s' -- %s" % (type(e), entry['id'], e.args[0])

# Persist the inserts (needed if you keep the table across runs).
con.commit()

# Dump everything to a CSV file that Excel can open.
F = open('your_csv_output.csv', 'w')
cur.execute("SELECT * FROM entries")
for entry in cur.fetchall():
    row = """"%s","%s","%s"\n""" % (entry[0], entry[1], entry[2])
    F.write(row.encode('utf8'))
F.close()
mce
Ugh. I should explain:

1) I was stupid and let my whitespace get stripped when I first posted that, so of course no Python interpreter would have run it as it appeared.
2) feedparser is a very robust library for parsing RSS and Atom feeds. In an attempt to universalize the differences between feed specs, some fields might not end up with the names you expect.
3) It works by letting the SQL library enforce the uniqueness of the guid, swallowing the errors that pop up when you try to re-add an existing entry.
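To illustrate point 2, a short sketch of feedparser's normalized field names, with a placeholder feed URL: an RSS <guid> surfaces as the entry's id, and <description> as its summary.

import feedparser

d = feedparser.parse('http://example.com/feed.rss')  # placeholder URL
for entry in d.entries:
    # RSS <guid> -> entry.id, <title> -> entry.title,
    # <description> -> entry.summary
    print(entry.id, entry.title, entry.summary[:50])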
mce
You folks are THE BOMB(s). It looks like Excel XML mapping does exactly what I'm looking for, minus the bonus "avoid stuff already scraped" part, and I'm gonna spend a bit of time familiarizing myself with feedparser in the meantime. Gracias!
croutonsupafreak
Another possibility: Google Docs spreadsheets have a built-in function for importing feeds (https://support.google.com/docs/bin/answer.py?hl=en&answer=155181). Here's an example: https://spreadsheets.google.com/pub?key=pzLh3kFGO3Cw_nZZVZj0_HA&output=html&gid=2&single=true.
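In present-day Google Sheets that function is IMPORTFEED, with the signature IMPORTFEED(url, [query], [headers], [num_items]); a hedged one-liner with a placeholder feed URL:

=IMPORTFEED("http://example.com/feed.rss", "items", TRUE, 20)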
mce
A quick hack using the http://mojolicio.us/ Perl framework: https://gist.github.com/2305007. Mojo::UserAgent has a nice could-be-one-liner flow to it, and Mojo::DOM gives you a pretty nice HTML5/XML DOM with CSS-like selection ( find('item guid[isPermalink="true"]')->each(...) ). If you were to go the Perl route:

use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::DOM;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new;

sub do_url {
    my $URL = 'http://www.flutterby.net/User:DanLyke_status.rss';
    my $res = Mojo::UserAgent->new->get($URL)->res;
    my $xml = Mojo::DOM->new->xml(1)->parse($res->body);
    $xml->find('item link')->each(sub {
        my ($i) = @_;
        my $p = $i->parent;
        # Pull title, link, and description out of the enclosing <item>.
        $csv->combine(map { $p->$_->text } qw(title link description));
        say $csv->string;
    });
}

do_url();
zengargoyle
You could use the PowerPivot add-in for Excel (http://www.microsoft.com/download/en/details.aspx?id=29074) and its built-in import-from-RSS functionality. It has a GUI, and it can bring in the stuff you want in no time. The imported results can then be analyzed in a pivot table; to get newer data, click Refresh. Although PowerPivot can easily do a full data refresh, it is not smart enough to do an incremental refresh from RSS feeds. For that, you need the full version of Analysis Services in SQL Server 2012.
crazycanuck
Related Q & A:
- How to save an RSS feed to a database? Best solution by feeds2mysql.com
- How to create an RSS feed in Java? Best solution by Stack Overflow
- What is RSS and an RSS feed? Best solution by Yahoo! Answers
- Can I get an RSS feed for Facebook? Best solution by Web Applications
- What is an RSS feed? Best solution by Super User