
Tips on a UNIX Shell Script to Programmatically Save a Webpage as a Text File

  • I'm writing a shell script to take a webpage and convert it into a text file. I'd appreciate tips on how to store URLs, how to save the file with the URL's title as its name, and general tips on how to improve the script and/or achieve the process better. Specific questions inside.

    I'd like to write a shell script which converts a webpage into a text file. After a lot of tinkering with various note-taking applications, Firefox extensions, and so on, I've found that the best tool for me is just plain good old-fashioned text files. However, I'd love it if I could automate the process a little bit more, so I'm writing a shell script to get a webpage into a text-form equivalent. Right now, I've got:

        links -dump -width 512 "$1" | cut -c 4- > /tmp/temp.file
        lynx -listonly -dump "$1" | sed '1,3d' | cut -c 7- >> /tmp/temp.file
        edit -b /tmp/temp.file

    In this example, $1 is a web address. What this does is:

    1. Uses links to save the text of the page. I use it instead of lynx because "-width 512" lets it render the page without inappropriate line breaks, and links seems to handle punctuation spacing better than lynx. (The cut removes the extra lefthand margin.)

    2. Uses lynx to generate the list of links that are on that page, removing the "References" header, margin, and numbering. (links doesn't seem to have any way of recording the URLs when generating a text copy.) It appends that list to the work in progress.

    3. Sends the result to TextWrangler to open up in the background.

    I'm seeking the community's advice on a few points:

    1. The way I've got it working now is okay, but ideally I'd like to handle URLs the way MeFi's print stylesheet does, with the URL appearing right after the link text. So in a webpage converted into a text file, instead of it being "Google", it'd be "Google [http://www.google.com]". I'm aware that lynx lets you do footnotes ("[1]Google" and later "1. http://www.google.com"), but lynx's handling of line breaks and spacing isn't great.

    2. I'd then ideally like this script to save the results automatically to a text file on my Desktop, with the URL's TITLE attribute as the name of the file.

    3. I'm wondering if, given the format, any odd punctuation in the URL could screw up the process.

    4. Also, I imagine this might be an enjoyable script for others, so any other modifications that would improve the overall process and/or end goal, and/or any utilities that do this better than what I'm hacking up, would be appreciated.

  • Answer:

    Yes. Use Perl.

WCityMike at Ask.Metafilter.Com


Other answers

That first line is now:

    links -dump -width 512 "$1" | cut -c 4- | sed 's/^[ \t]*//;s/[ \t]*$//' > /tmp/temp.file

The added sed trims any initial or trailing whitespace.

WCityMike

I don't know Perl. I just splice junk I find on the Web together.

WCityMike

Sounds like you're writing a screen scraper. My suggestion would be to use BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/documentation.html) and write something in Python. Existing Python libraries can easily handle points 1 and 2 above.

needled
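For illustration, here is a minimal sketch of what needled is describing, covering points 1 and 2 from the question. It assumes the modern bs4 packaging of BeautifulSoup and Python 3's urllib, and it skips error handling and the actual file-writing:

    import sys
    import urllib.request

    from bs4 import BeautifulSoup  # the modern "bs4" packaging of BeautifulSoup

    url = sys.argv[1]
    with urllib.request.urlopen(url) as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Point 1: put each link's URL in brackets right after the link text.
    for a in soup.find_all("a", href=True):
        a.insert_after(" [%s]" % a["href"])

    # Point 2: the page TITLE, usable (after sanitizing) as a filename.
    title = soup.title.string.strip() if soup.title and soup.title.string else url

    print(title)
    print(soup.get_text())

The bracketed output matches the "Google [http://www.google.com]" format asked for in the question.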

I'm afraid I don't know Python either.

WCityMike

I'm a complete spaz in programming and I wrote a scraper in python that creates email newsletters of a page. Mostly with your 'splice junk from the web' strategy. So, ya know, if I can do it you can too. And unlike a shell script, you can transport it between different OSes. For an amateur coder, the sort of text manipulation you're doing with pipes (#1 + #2) would be easier with some string or regex functions, and stored variables. Separating them out would also improve the script's readability and expandability (#4). If you might consider learning python, I could email you my scraper. I commented it up pretty well, so it might help you gauge what a beginner can do. I realize it's an annoying suggestion to "learn this other thing and redo what you've done" but man does that shell stuff look hacky.

cowbellemoo
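To make cowbellemoo's point concrete: the trimming that takes a cut and a sed in the shell pipeline becomes a single string method once the dump sits in a variable. A hypothetical Python fragment (the /tmp/temp.file path is just the one from the question's script):

    # Equivalent of "sed 's/^[ \t]*//;s/[ \t]*$//'": strip leading and
    # trailing whitespace from every line of the dump.
    with open("/tmp/temp.file") as f:
        cleaned = "\n".join(line.strip() for line in f)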

Shell scripting is great for doing some things quickly, but as you're finding out, once you reach a certain level of complexity, it gets quite messy. Pick any modern scripting language (Perl, Python, Ruby, etc.) and it'll have great libraries for handling this sort of thing. I'll go ahead and recommend Ruby and hpricot. The example on the first page of the hpricot wiki (http://wiki.github.com/why/hpricot) gives you an idea of how powerful and easy this can be. There are great tutorials both on that wiki and elsewhere on the web that will have you up and running in no time.

chrisamiller

I can't argue with chrisamiller's recommendation of hpricot (though I'll assume that any reasonable language and XPath library would make this easy). This will fetch a URL, insert each link's URL into the text following the link (it doesn't fully qualify relative URLs, though), and print it out:

    require 'open-uri'
    require 'hpricot'

    # Fetch and parse the page.
    doc = open("http://metafilter.com/") { |f| Hpricot(f) }

    # Grab the page title.
    title = (doc/"head/title")[0].inner_html
    puts title

    # Append each link's href right after the link itself.
    (doc/"a").each do |a|
      a.after(' ' + a.get_attribute('href') + ' ') if a.has_attribute?('href')
    end

    puts doc

Then run that through http://www.mbayer.de/html2text/ and you're close to done. (That could be done within the Ruby script, of course, and you could write the output to a file named with the string stored in title, above, but you'd want to sanitize it to make sure it's a reasonable filename.) Ah, how can you keep them programming in Java after they've seen Ruby?

Zed
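The sanitizing step Zed mentions might look something like this, sketched in Python to match the other examples in this thread (the character whitelist is an arbitrary choice, not a standard):

    import re

    def safe_filename(title):
        # Replace anything outside a conservative whitelist (letters, digits,
        # underscores, spaces, dots, dashes) so the title works as a filename.
        name = re.sub(r'[^\w .-]', '_', title).strip()
        return (name or 'untitled') + '.txt'

    # e.g. safe_filename('MetaFilter | Community Weblog')
    #      -> 'MetaFilter _ Community Weblog.txt'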

Go look at html2text (http://www.aaronsw.com/2002/html2text/); for example: http://www.aaronsw.com/2002/html2text/?url=http://daringfireball.net/

Also, links lists and numbers links with the option "-html-numbered-links 1" passed to it, at least on Ubuntu.

Pronoiac

Note that Pronoiac and I are talking about unrelated html2text programs. The one he brought up is much closer to doing what you want out of the box than anyone else's suggestions.

Zed
