How to scrape data from a website?

How to scrape a web forum?

  • How to scrape a web forum? I need help understanding the process. I want to figure out how to scrape a website, specifically a forum. It's a site that has been around for a long time and holds a lot of knowledge, but over the last couple of years the owner has become progressively less active, and he is AWOL now. I'm afraid it will go away once the domain expires, and I want a backup so I can access the information if I need to. It's an Invision board forum and you need a password to see the forum content, if that affects how it's done.

    I've been looking into how to do it, but all I'm finding is blackhat SEO sites talking about scraping and reusing content, which is not what I'm looking to do. I read some posts here but I'm still left not fully understanding what steps I need to take. I looked at ScraperWiki (https://scraperwiki.com/ - recommended in this post: http://ask.metafilter.com/178622/Scraping-the-web) and I'm not sure exactly what I'm supposed to do. I've been trying to dig into the tutorials and the intro video, and I think it might be what I want, but now I'm more confused than when I started. I'm not even sure what to do for a data request if I decide to go that route.

    I'm on a limited budget, but I would consider software if it didn't break the bank, so if there are any suggestions for software I'd also appreciate them. I can usually hack my way around PHP and MySQL, but this isn't like anything I've done before, which leaves me more confused every time I search. Looking at ScraperWiki, I suspect it's outside my skill level (but I'm willing to give it a shot if I can find some good tutorials). I'm really looking for some guides and advice to understand it better, so anything you can share would be great.

  • Answer:

    I've successfully used SiteScraper for something similar (grabbing a bunch of hiking pages for offline access while in the "wilds"). http://www.sitesucker.us/mac/mac.html

[insert clever name here] at Ask.Metafilter.Com


Other answers

These Premises Are Alarmed

The thing that you want to do isn't called scraping. Screen-scraping is extracting specific data from poorly formatted sources online, which is what you find when you search for that term. You just want to make a cache of the entire site, so that you could (for instance) read it offline (or put it back online somewhere else later). You want a piece of software like this: http://www.httrack.com/

tylerkaraszewski
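
For reference, a minimal HTTrack command line looks something like the following; the forum address and output folder here are placeholders, and the GUI builds expose the same options through forms:

    # mirror the forum into ./forum-mirror, staying on the forum's own domain
    httrack "http://forum.example.com/" -O ./forum-mirror "+forum.example.com/*" -v

The "+forum.example.com/*" filter keeps the crawl from wandering onto other sites, and -v prints progress while it runs.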

You need a converter (http://www.phpbb.com/community/viewtopic.php?f=65&t=1761395). I have used it in the past (InvisionFree to phpBB) and it worked quite well. You may need to modify it a bit depending on the forum's skin, though.

Memo

I'm a little bit familiar with wget for FTP, and I have a Mac, so I could probably do it through Terminal. How would I use it to crawl and grab pages? Will it work for a password-protected section of the site? What will the output be?

[insert clever name here]
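
As a rough sketch of what a wget crawl looks like (the forum address is a placeholder):

    # fetch the forum recursively and rewrite links so the copy works offline
    wget --recursive --level=inf --page-requisites --convert-links \
         --adjust-extension --wait=1 http://forum.example.com/

The output is a directory named after the host, containing the fetched HTML pages plus the images and stylesheets they need, with links rewritten so the pages can be browsed locally. Handling the password-protected boards is sketched after the next wget answer below.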

You really don't want to scrape (extracting and reusing data); you want to mirror/cache/archive. You might want to seek guidance from the Archive Team (http://archiveteam.org/index.php?title=Main_Page), which was founded by a MetaFilter member (http://www.metafilter.com/user/18342). They're very concerned about preserving endangered online communities.

zsazsa

Yeah, you definitely want wget. http://www.dheinemann.com/2011/archiving-with-wget/.

cdmwebs
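
Because the forum content is behind a login, the mirror needs to be fetched with your session cookies. One common approach is to log in with a browser, export its cookies to a cookies.txt file (a cookie-export browser extension can do this), and hand that file to wget; the address and filenames here are placeholders:

    # reuse an existing logged-in session by passing exported cookies to wget
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --wait=1 --load-cookies cookies.txt http://forum.example.com/

--mirror is shorthand for recursive retrieval with unlimited depth and timestamping, and --load-cookies makes each request look like it comes from the logged-in browser, so the members-only boards end up in the archive too.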

I think I do still want to scrape it. I think. My plan was to run a local copy of the site so that, if I needed to, I could do all the same searches. HOWEVER, I'm wondering if maybe what I want to do is capture the pages, and then, if I ever did need to get the content into a database, scrape it at that point? Does that sound more like what I'd want to do?

[insert clever name here]

BTW, that's an excellent suggestion, contacting the Archive Team. I don't know if they'd be interested or not; it's a pretty niche site. But it's also been around since 2000, so maybe.

[insert clever name here]

Oops, that is, of course, Site Sucker.

jeffch
