What are the technologies behind the "Send to Kindle" Chrome Extension?
-
The amazing "Send to Kindle" Chrome Extension [1] has pretty decent automatic page content extraction, ignoring things such as headers, menus, etc., and sometimes it even recognizes the text's author. What are the technologies behind it? [1] https://chrome.google.com/webstore/detail/send-to-kindle-for-google/cgdjpilhipecahhcilnafpblkieebhea
-
Answer:
(Diffbot has been working on article content extraction, among other things, since 2010, and powers a number of popular services. I can't address specific customers or the precise inner workings of individual apps/services -- e.g., Send to Kindle or Instapaper or Apple's Reading List -- but I can describe the most common mechanisms that article-extractor technology uses.)

The simplest and earliest article extractors scan page markup for meaningful HTML elements -- specifically those that generally contain text content (<article>, <p>, <blockquote>, <aside>, etc.) or that, on specific pages, contain significant amounts of text. In practice this often means targeting the parent element of text-heavy nodes (e.g., <div class="content">). Elements of interest also tend to be repetitive (numerous <p> elements in a row, or within the same parent element).

As for finding the author or other article metadata: the simplest approach looks for metadata that's already present in the page <head> for purposes of easy sharing: OpenGraph tags, Twitter Card tags, Schema.org itemprops. The author, the date, the primary image, a short description, the title, and other useful metadata are often sitting right there. Schema.org microdata is often available within the core markup itself as well.

Markup cues/rules (a div ID of "author" or "byline") can also be used to identify additional metadata, although this is far less reliable and more difficult to do. For instance, an article's date will usually precede the core content in the DOM/markup, but not always. And often a publication will have today's date somewhere in the page header, independent of the article date -- choosing between these isn't always easy.
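To make the two markup-based techniques above concrete, here is a minimal sketch (not how Send to Kindle or Diffbot actually work): it picks the element that directly contains the most <p> tags as the likely content container, and pulls OpenGraph/Twitter Card metadata out of <head>. The `ArticleSniffer` class and the sample document are invented for illustration; everything uses only the Python standard library.

```python
# Hedged sketch: find the parent element with the most direct <p> children,
# and collect common sharing metadata from <head>.
from html.parser import HTMLParser
from collections import Counter

class ArticleSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []            # path of currently open tags, e.g. ['html', 'body', 'div#content']
        self.p_counts = Counter()  # parent path -> number of direct <p> children
        self.meta = {}             # OpenGraph / Twitter Card / article metadata

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name") or ""
            if key.startswith(("og:", "twitter:", "article:")):
                self.meta[key] = attrs.get("content", "")
            return                          # <meta> is void; don't push it
        if tag == "p" and self.stack:
            self.p_counts[tuple(self.stack)] += 1
        label = tag + ("#" + attrs["id"] if "id" in attrs else "")
        self.stack.append(label)

    def handle_endtag(self, tag):
        # pop back to the matching open tag (tolerates sloppy markup)
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break

doc = """<html><head>
<meta property="og:title" content="Example Article">
<meta name="twitter:creator" content="@someone">
</head><body>
<div id="nav"><p>menu</p></div>
<div id="content"><p>one</p><p>two</p><p>three</p></div>
</body></html>"""

s = ArticleSniffer()
s.feed(doc)
best_parent = s.p_counts.most_common(1)[0][0][-1]
print(best_parent)         # div#content
print(s.meta["og:title"])  # Example Article
```

Real extractors layer many more signals on top of this (text length, link density, class-name heuristics), but the core idea -- score candidate containers, then read the easy metadata first -- is the same.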
Some examples of extractors that use the above techniques:
- Readability (Python port: https://github.com/buriy/python-readability)
- Boilerpipe (https://code.google.com/p/boilerpipe/)
- Alchemy (hosted API: http://alchemyapi.com)
- Embedly (hosted API: http://www.embedly.com)

You can also see a comparison of various (2011-era) extractors here: http://web.archive.org/web/20120606173919/http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms

Diffbot's approach relies on computer-vision techniques (in conjunction with machine learning, NLP, and markup inspection) as the primary engine for identifying the proper content to extract from a page. What this means: when analyzing a web document, our system renders the page fully, as it appears in a browser -- including images, CSS, even Ajax-delivered content -- and then analyzes its layout visually. A full rendering allows the page to be broken down into its constituent visual components. Machine-learning-trained algorithms then weight each of these elements by its likelihood of being a particular component of the page -- title, author, related image, full text, sharing icons, next-page link, etc. -- based on the element's content and the markup within and surrounding it. Finally, the unrelated components are discarded and the identified elements are processed (extraneous text or inline elements removed; HTML normalized; dates normalized; image headers scanned; etc.).

A visual approach allows for better identification of extraneous content (inline advertising links, links to related content, sharing links, attribution elements, etc.) and of difficult-to-identify metadata (author, date, title, etc.), and it tends to work very well on non-English pages, since the visual structure of a page is similar regardless of the language it's written in. For instance, on the following Chinese-language article (http://www.worldjournal.com/view/full_aUS_14/25414476/article-%E8%B0%B7%E6%AD%8C%E9%AB%98%E5%B1%A4%E5%90%B8%E6%AF%92%E6%AD%BB%E4%BA%A1-%E5%A6%93%E5%A5%B3%E8%A6%8B%E6%AD%BB%E4%B8%8D%E6%95%91%E8%A2%AB%E6%8E%A7?instance=us11), a visual-based extraction works very well.
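The weighting step described above can be illustrated with a toy model (all feature names and numbers here are invented, not Diffbot's actual features or weights): once a page is rendered, each visual block is scored for the likelihood of being the article body using weights learned from labeled pages. A simple linear model over layout features captures the idea; production systems use far richer features and models.

```python
# Toy illustration: score rendered page blocks with hypothetical learned weights.
import math

WEIGHTS = {
    "rel_area": 3.0,      # fraction of the viewport the block covers
    "center_dist": -2.0,  # horizontal distance from page center (0..1)
    "text_density": 4.0,  # rendered text per pixel, normalized to 0..1
    "link_density": -3.5, # linked text / total text (nav bars score high here)
}
BIAS = -1.5

def body_probability(block):
    """Logistic score: probability that a block is the article body."""
    z = BIAS + sum(WEIGHTS[f] * block[f] for f in WEIGHTS)
    return 1 / (1 + math.exp(-z))

# Feature vectors a renderer might produce for three visual blocks.
blocks = {
    "sidebar": {"rel_area": 0.10, "center_dist": 0.8, "text_density": 0.2, "link_density": 0.90},
    "article": {"rel_area": 0.45, "center_dist": 0.1, "text_density": 0.8, "link_density": 0.05},
    "header":  {"rel_area": 0.08, "center_dist": 0.0, "text_density": 0.1, "link_density": 0.70},
}
best = max(blocks, key=lambda name: body_probability(blocks[name]))
print(best)  # article
```

Note that none of these features depend on the page's language -- which is exactly why a visual approach transfers well to non-English pages.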
John Davi at Quora