How to scrape a web forum?
-
I need help understanding the process. I want to figure out how to scrape a website, specifically a forum. It's a site that's been around for a long time and holds a lot of knowledge, but over the last couple of years the owner has gotten progressively less active and is now AWOL. I'm afraid it will go away once the domain expires, and I want a backup so I can access the information if I need to. It's an Invision board forum and you need a password to see the forum content, if that affects how it's done. I've been looking for how to do it, but all I am finding is blackhat SEO sites that talk about scraping and reusing content, which is not what I'm looking to do. I read some posts here but I'm still left not fully understanding what steps I need to take. I looked at ScraperWiki (https://scraperwiki.com/ - recommended in this post: http://ask.metafilter.com/178622/Scraping-the-web) and I'm not sure exactly what I'm supposed to do. I've been trying to dig into the tutorials and the intro video, and I think it might be what I want, but now I'm more confused than when I started. I'm not even sure what to put in a data request if I decide to go that route. I'm on a limited budget, but would consider software if it didn't break the bank, so if there are any suggestions for software, I'd also appreciate it. I can usually hack my way around PHP and MySQL, but this isn't like anything I've done before, and I get more confused every time I search. Looking at ScraperWiki, I suspect it's outside my skill level (but I am willing to give it a shot if I can find some good tutorials). I'm really looking for some guides and advice to understand it better, so anything you can share would be great.
-
Answer:
I've successfully used SiteScraper for something similar (grabbing a bunch of hiking pages for offline access while in the "wilds"). http://www.sitesucker.us/mac/mac.html
[insert clever name here] at Ask.Metafilter.Com
Other answers
I would start with http://fosswire.com/post/2008/04/create-a-mirror-of-a-website-with-wget/ on a UNIX or Mac machine.
These Premises Are Alarmed
The thing that you want to do isn't called scraping. Screen-scraping is extracting specific data from poorly formatted sources online, like what you find when you search for that term. You just want to make a cache of the entire site, so that you could (for instance) read it offline (or put it back online somewhere else later). You want a piece of software like this: http://www.httrack.com/
tylerkaraszewski
You need a http://www.phpbb.com/community/viewtopic.php?f=65&t=1761395. I have used it in the past (invisionfree to phpbb) and it worked quite well. You may need to modify it a bit depending on the forum's skin though.
Memo
I'm a little familiar with wget for FTP, and I have a Mac, so I could probably do it through Terminal. How would I use it to crawl and grab pages? Will it work for a password-protected section of the site? What will the output be?
[insert clever name here]
You really don't want to scrape (extracting and reusing data), you want to mirror/cache/archive. You might want to seek guidance from http://archiveteam.org/index.php?title=Main_Page, which was founded by http://www.metafilter.com/user/18342. They're very concerned about preserving endangered online communities.
zsazsa
Yeah, you definitely want wget. http://www.dheinemann.com/2011/archiving-with-wget/.
cdmwebs
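A minimal sketch of what a wget mirror of a password-protected forum might look like, written as a small Python helper that only builds the two command lines (it does not run them). The wget options shown are standard ones; everything else is an assumption for illustration: the example.com URLs, the `UserName`/`PassWord` form field names (check the real login form's HTML for the actual names), the file and directory names, and the helper function names.

```python
import shlex
from urllib.parse import urlencode


def build_login_command(login_url, username, password, cookie_file="cookies.txt"):
    """Build a wget command that submits the login form and saves the session cookie."""
    # Field names are hypothetical; read them from the forum's login form.
    post_data = urlencode({"UserName": username, "PassWord": password})
    return [
        "wget",
        "--save-cookies", cookie_file,
        "--keep-session-cookies",         # also keep cookies that have no expiry date
        "--post-data", post_data,
        "--output-document", "/dev/null",  # discard the page returned by the login
        login_url,
    ]


def build_mirror_command(start_url, cookie_file="cookies.txt", out_dir="forum-mirror"):
    """Build a wget command that crawls the forum using the saved cookie."""
    return [
        "wget",
        "--mirror",             # recursive download with timestamping
        "--convert-links",      # rewrite links so pages work offline
        "--page-requisites",    # also fetch images, CSS and scripts
        "--adjust-extension",   # save pages with an .html suffix
        "--no-parent",          # stay below the starting directory
        "--wait=1",             # pause between requests to go easy on the server
        "--load-cookies", cookie_file,
        "--directory-prefix", out_dir,
        start_url,
    ]


if __name__ == "__main__":
    # Placeholder values; print the commands so they can be pasted into Terminal.
    login = build_login_command(
        "http://example.com/forum/index.php?act=Login", "myuser", "mypassword"
    )
    mirror = build_mirror_command("http://example.com/forum/")
    print(shlex.join(login))
    print(shlex.join(mirror))
```

Run in that order, the first command logs in and writes the cookie file, and the second should leave a folder (here `forum-mirror`) containing a directory named after the site's host, with the saved HTML pages inside it that can be opened in a browser offline. If the crawl follows the forum's logout link, the session may end partway through, so that is worth watching for.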
I think I do still want to scrape it. I think. My plan was to run a local copy of the site so that, if I needed to, I could do all the same searches. HOWEVER, I'm wondering if maybe what I want to do is capture the pages first, and then, if I ever did need to re-grab the content into a database, scrape those saved pages? Does that sound more like what I'd want to do?
[insert clever name here]
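The "capture first, scrape later" idea can be sketched roughly like this, using only the Python standard library. This is an illustration under stated assumptions, not how the forum software actually lays out its pages: it assumes each post sits inside a `<div class="post">` element (a hypothetical class name; the real markup depends on the forum's skin), and it uses SQLite in place of MySQL just to stay self-contained. The function and table names are made up for the example.

```python
import sqlite3
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text inside every <div class="post"> (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._depth = 0    # how many <div>s deep we are inside the current post
        self._chunks = []  # text fragments of the current post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested div inside a post
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1   # a new post starts here
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Post finished: join fragments and collapse whitespace.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the text of each post found in one saved page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


def store_posts(conn, source, posts):
    """Insert posts into a simple table, recording which saved file they came from."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (source TEXT, position INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(source, i, body) for i, body in enumerate(posts)],
    )
    conn.commit()


if __name__ == "__main__":
    # Tiny inline page standing in for one file from the mirrored copy.
    page = '<div class="post">Hello <b>world</b></div><div class="post">Second</div>'
    conn = sqlite3.connect(":memory:")
    store_posts(conn, "topic1.html", extract_posts(page))
    print(conn.execute("SELECT body FROM posts WHERE body LIKE ?", ("%world%",)).fetchall())
```

The point of the split is that the saved pages stay untouched as the backup, and the extraction step can be rerun or adjusted later once the real markup has been inspected.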
BTW, that's an excellent suggestion, contacting the Archive Team. I don't know if they'd be interested or not; it's a pretty niche site. But it's also been around since 2000, so maybe.
[insert clever name here]
Oops, that is, of course, Site Sucker.
jeffch