How to extract the first few sentences from a body of text on a web page
-
We are building a Digg-like site and want to automatically fetch a short excerpt (2-3 sentences) from each article. It could even be the last three sentences of the article, if that would be easier. At the moment we fetch web page content without any problem, but we want a universal script that extracts a few sentences, so that we don't have to write a custom script for each website we pull content from. My rough idea is to locate the text block by its periods: find periods that occur close together, then take the words around them. Does anyone have another idea for extracting just part of the text? We don't want to scrape the full content. Thank you.
-
Answer:
You could look for large portions of the document that have little markup and little vertical whitespace. Download the page's source and strip out the markup using strip_tags(). Then you can search for, say, two to five consecutive sentences using a regular expression.

Here's an example script. It uses a class that isn't included (an abstraction of the curl_multi functions), but that class isn't really relevant to your question.

    <?php
    // MultipleRequester is a wrapper around curl_multi and is not included here.
    require_once("./../MultipleRequester.php");

    $requester = new MultipleRequester();
    $requester->addGetRequest(
        'test',
        'http://www.businessweek.com/news/2011-08-24/gold-tumbles-most-since-march-2008-as-demand-for-haven-wanes.html'
    );
    $requester->execute();
    $content = $requester->getContent('test');

    // Remove the HTML tags, leaving only text.
    $plainText = strip_tags($content);

    // Match 2 to 5 consecutive "sentences": a little optional whitespace, a capital
    // letter, 10 to 1000 permitted characters, then a period.
    // Note: [A-z] also matches the ASCII punctuation between 'Z' and 'a'.
    $search = preg_match('/(\h{0,2}\v{0,2}\h{0,2}[A-Z]{1}[A-z0-9 ,\'")(.$]{10,1000}\.){2,5}/', $plainText, $matches);

    if ($search) {
        print trim($matches[0]);
    } else {
        print "Could not extract anything.";
    }
    print "\n\n";
    ?>

This prints:

    The dollar rose against a basket of six major currencies amid speculation about whether Federal Reserve Chairman Ben S. Bernanke will say this week that the central bank is willing to provide more stimulus to the economy. Central bankers meet this week in Jackson Hole, Wyoming, to address the U.S. recovery.

You may still have trouble with sites that mark up their content heavily. You might want to make the regular expression more lenient, particularly towards whitespace. The regexp is a little messy, but you can tune it or write your own.
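One possible tuning is to normalize whitespace before matching, so the pattern no longer has to account for it. The following is a minimal sketch of that idea, not the script from the answer: the function name extractFirstSentences and its parameters are hypothetical, and the sentence pattern is a simplification. It assumes UTF-8 input, treats "a capital letter, some text, then . ! or ?" as a sentence, and will therefore split on abbreviations such as "S." and may pick up navigation text that precedes the article.

```php
<?php
/**
 * Return up to $maxSentences consecutive sentences from the first run of
 * prose in $html, or null if fewer than two sentences are found.
 * Hypothetical helper: the name and parameters are illustrative only.
 */
function extractFirstSentences(string $html, int $maxSentences = 3): ?string
{
    // strip_tags() keeps the text inside <script> and <style>, so drop those
    // blocks first. Fall back to the raw HTML if the regex engine gives up.
    $clean = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;

    // Put a space before every tag so that text from adjacent elements does
    // not run together once the tags are removed.
    $text = strip_tags(str_replace('<', ' <', $clean));

    // Decode entities such as &amp; and collapse all whitespace runs
    // (including newlines) into single spaces.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    // One "sentence": an uppercase letter, 10 to 1000 characters that are not
    // . ! or ?, then a terminator followed by whitespace or the end of text.
    $sentence = '\p{Lu}[^.!?]{10,1000}[.!?](?=\s|$)';

    // Require at least two sentences in a row to avoid matching stray labels.
    $pattern = '/(?:' . $sentence . '\s*){2,' . max(2, $maxSentences) . '}/u';

    if (preg_match($pattern, $text, $matches)) {
        return trim($matches[0]);
    }
    return null;
}
```

Calling extractFirstSentences($content, 3) on the downloaded page source returns either a short excerpt or null, so the caller can fall back to some other strategy when no run of sentences is found.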
Croky at Stack Overflow