Regex: Text from HTML, no attributes
Regex Madness...filter. How do I pull the text out of an HTML document without looking at the tag attributes?

I'm using JavaScript... and I am just stuck. I think my brain is about to explode. I'm trying to pull certain things out of an HTML document. Let's say, for simplicity's sake, it looks like this... 'cept with HTML tags. (Had to change 'em to display here.)

    [!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"]
    [html]
    [head]
    [meta http-equiv="content-type" content="text/html; charset=windows-1250"]
    [meta name="generator" content="PSPad editor, www.pspad.com"]
    [title]Sample Document[/title]
    [/head]
    [body]
    [p] [img src="http://blah.com/sample.jpg"] [/p]
    [p] Some text is [a href="fjkj.html"]here[/a] [/p]
    [/body]
    [/html]

All I want out of that thing is:

    Sample Document
    Some text is here

Is that possible? I thought I had something working... but I was so wrong. I tried to spider down through the DOM, but I never could get that right either.

As a bonus... is there a particular book/tutorial folks recommend for understanding the mighty regex?
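The DOM walk mentioned in the question could look roughly like the sketch below. It is only an illustration, not code from the thread: collectText, el, txt and sampleDoc are hypothetical names, and the plain objects stand in for real DOM nodes so the snippet runs outside a browser. In a page you would presumably pass document.documentElement instead.

```javascript
// Sketch: gather the text of a DOM-like tree while ignoring tags and attributes.
// Any object exposing nodeType, nodeName, nodeValue and childNodes will do.
var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers
var TEXT_NODE = 3;    // same value as Node.TEXT_NODE in browsers

function collectText(node, out) {
  out = out || [];
  if (node.nodeType === TEXT_NODE) {
    // Collapse whitespace runs and drop text nodes that are only whitespace.
    var text = node.nodeValue.replace(/\s+/g, ' ').trim();
    if (text) out.push(text);
  } else if (node.nodeType === ELEMENT_NODE) {
    var name = (node.nodeName || '').toLowerCase();
    // Script and style contents are not visible text, so skip them.
    if (name === 'script' || name === 'style') return out;
    var kids = node.childNodes || [];
    for (var i = 0; i < kids.length; i++) collectText(kids[i], out);
  }
  return out;
}

// Hypothetical stand-ins for DOM nodes, mirroring the sample document.
function el(name, kids) {
  return { nodeType: ELEMENT_NODE, nodeName: name, childNodes: kids || [] };
}
function txt(value) {
  return { nodeType: TEXT_NODE, nodeValue: value };
}

var sampleDoc = el('HTML', [
  el('HEAD', [el('TITLE', [txt('Sample Document')])]),
  el('BODY', [
    el('P', [el('IMG')]),
    el('P', [txt(' Some text is '), el('A', [txt('here')])])
  ])
]);

var sampleText = collectText(sampleDoc).join(' ');
console.log(sampleText); // Sample Document Some text is here
```

Because attributes are never read, things like the img src or the a href cannot leak into the result. Joining the pieces with a single space is a simplification, so punctuation that directly follows an inline tag will pick up an extra space.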
Answer:
For the bonus question, I would suggest Jeffrey Friedl's Mastering Regular Expressions (http://regex.info/).
ph00dz at Ask.Metafilter.Com
Other answers
Here's the actual code, in case someone stumbles on this later and needs it:

    replaceSearchTerms('ph00dz', 'dumkopf');

    // this code came from http://www.nsftools.com/misc/SearchAndHighlight.htm
    // it was originally a highlighter... but repurposed for this thing!
    function doReplace(bodyText, searchTerm, replaceText) {
      // find all occurrences of the search term in the given text,
      // and replace them (we're not using a regular expression search,
      // because we want to filter out matches that occur within HTML
      // tags and script blocks, so we have to do a little extra validation)
      var newText = "";
      var i = -1;
      var lcSearchTerm = searchTerm.toLowerCase();
      var lcBodyText = bodyText.toLowerCase();

      while (bodyText.length > 0) {
        i = lcBodyText.indexOf(lcSearchTerm, i+1);
        if (i < 0) {
          // no more matches: keep the rest of the text as it is
          newText += bodyText;
          bodyText = "";
        } else {
          // skip anything inside an HTML tag
          if (bodyText.lastIndexOf(">", i) >= bodyText.lastIndexOf("<", i)) {
            // skip anything inside a <script> block
            if (lcBodyText.lastIndexOf("/script>", i) >= lcBodyText.lastIndexOf("<script", i)) {
              newText += bodyText.substring(0, i) + replaceText;
              bodyText = bodyText.substr(i + searchTerm.length); // bodyText.substr(i, searchTerm.length)
              lcBodyText = bodyText.toLowerCase();
              i = -1;
            } // end if
          } // end if
        } // end else
      } // end while

      return newText;
    } // end function

    function replaceSearchTerms(searchText, replaceText) {
      var searchArray = [searchText];
      var bodyText = content.document.body.innerHTML;
      for (var i = 0; i < searchArray.length; i++) {
        bodyText = doReplace(bodyText, searchArray[i], replaceText);
      }
      content.document.body.innerHTML = bodyText;
      return true;
    }
ph00dz
Perhaps I'm missing something? Are you after the text for some further action in your js code? If not and you just want a plain text version of it, use textutil if you are on OS X.
jxpx777
There might be a much easier way to do this - the links (http://links.sourceforge.net/) or lynx (http://lynx.browser.org) text-mode web browsers and their "dump" option:

    mrbill@ohno:~> links -dump test.html
       Some text is here
    mrbill@ohno:~> lynx -dump test.html
       [sample.jpg]
       Some text is [1]here

    References
       1. file://localhost/disk/home/mrbill/fjkj.html
mrbill
Also, the lone quote mark at the end should be a double (empty string) drat
moift
There's not supposed to be a space between the < and the . in the regex, but I couldn't get it to show up right.
moift
Screw the dom, real javascript ninjas use regex (you can paste this in the urlbar to see if it does basically what you want):

    javascript:for(var i=0; !document.childNodes[i].innerHTML; ++i); document.body.innerHTML=document.childNodes[i].innerHTML.replace(//gm, ');

The loop is to find the first non-empty node, which should contain the whole document if it's well formed. You'll need to beef up the regex a bit if you want it to remove CSS/Script blocks as well, but if you haven't moved on already I'd be happy to help with that.
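The regex in the post above did not survive being displayed, so what follows is only a sketch of the general idea, including the "beefed up" handling of CSS/script blocks. The function name stripHtml and every pattern in it are assumptions, not the poster's original code. It works on a string, so it can be tried outside a browser.

```javascript
// Sketch: strip markup from an HTML string with regular expressions.
// Patterns are illustrative; they will misfire on unusual markup,
// e.g. an attribute value that contains a literal ">".
function stripHtml(html) {
  return html
    // Drop comments first so markup inside them is not treated as content.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Drop whole script and style blocks, contents included.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop every remaining tag, attributes and all.
    .replace(/<[^>]*>/g, ' ')
    // Collapse the leftover whitespace.
    .replace(/\s+/g, ' ')
    .trim();
}

var page =
  '<html><head><title>Sample Document</title>' +
  '<style>p { color: red; }</style></head>' +
  '<body><p><img src="http://blah.com/sample.jpg"></p>' +
  '<p>Some text is <a href="fjkj.html">here</a></p>' +
  '<script>var x = "<b>";</script></body></html>';

console.log(stripHtml(page)); // Sample Document Some text is here
```

The script/style pass has to run before the generic tag pass; otherwise the tags around the block are removed but the CSS or JavaScript inside them is left behind as if it were text.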
moift
If you have the contents in a string, in Ruby:

    mystring.gsub!(//, ' ')

In PHP:

    $mystring = preg_replace('//', ' ', $mystring);

Then mystring will hold just the content you want. I replace with a space instead of nothing so you don't end up with words running into each other if stripping out things like a <br> that doesn't have a space around it.
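To illustrate the point about replacing with a space, here is a small JavaScript comparison. The pattern TAG is an assumed stand-in, since the patterns in the post above were lost, and snippet is a made-up input.

```javascript
// Assumed tag pattern: "<", then anything that is not ">", then ">".
var TAG = /<[^>]*>/g;

var snippet = 'first line<br>second line';

// Replacing with nothing glues the neighbouring words together.
var joined = snippet.replace(TAG, '');

// Replacing with a space keeps them apart.
var spaced = snippet.replace(TAG, ' ');

console.log(joined); // first linesecond line
console.log(spaced); // first line second line
```

The cost of the space variant is occasional doubled spaces where the markup already had whitespace around a tag, which a later whitespace-collapsing pass can clean up.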
chrisroberts
Didn't want to sound too not-having-a-clue, but viewing the HTML in a browser and cutting/pasting what you see is perhaps far too simple?
vanoakenfold
this is trivial with xsl; google have released an xsl implementation in javascript.
andrew cooke