How to extract data from any website?

How do I extract XML data from a website?

  • Answer:

    Extracting data from websites with Kettle (Pentaho Data Integration, PDI) ranges from easy to difficult. The easy case is where you use an HTTP step to call the website address, and the response comes back into the stream as a field (column).

    The most difficult case is where you have to navigate someone's login logic. There you use PDI to create shell scripts that call wget or curl to handle the interaction while saving cookies and session information. A final call through wget or curl then makes the website give you a result page, which could very well be an XML response but is more likely HTML. That final call downloads the HTML file to disk temporarily. You can load it into memory (single row, single field), use JTidy via the User Defined Java Class (UDJC) step to convert the HTML to XHTML, then run that stream through the XPath step. So it is tedious, but doable.

    I am talking with Pentaho about making this process easier until cloud-based SaaS providers figure out that letting the data out through great APIs is important to customers and could be monetized. The reality is that there will always be some great web application, managed by a small company, that does not have the time, money or resources to let data out to machines in an easy way. We have a ton of them in healthcare.
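For readers who want to see the shape of the hard case outside PDI, here is a rough sketch in Python. It is not the Kettle transformation described above, only the same sequence of steps: log in while keeping cookies, fetch the result page, repair the HTML into a well-formed tree, then query it with a path expression. The function names, URLs, and form fields are hypothetical. The standard-library `html.parser` stands in for JTidy, and ElementTree's limited XPath subset stands in for the XPath step. The repair is deliberately simple: it handles void tags such as `<br>` and stray closing tags, but it does not infer implied end tags (for example an unclosed `<td>`) the way a real tidy tool would.

```python
import http.cookiejar
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def fetch_after_login(login_url, result_url, credentials):
    """Log in, keep the session cookies, then request the result page.

    All three arguments are placeholders; the form field names in
    `credentials` depend entirely on the target site.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    form = urllib.parse.urlencode(credentials).encode("utf-8")
    opener.open(login_url, data=form).close()  # POST; cookies land in jar
    with opener.open(result_url) as response:  # cookies are sent back here
        return response.read().decode("utf-8", errors="replace")


class LenientTreeBuilder(HTMLParser):
    """Build a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        element = ET.SubElement(
            self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the element, do not open it.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):  # text following a child belongs in its tail
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_tree(html):
    """Parse an HTML string and return the synthetic <document> root."""
    builder = LenientTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract(html, path):
    """Return the stripped text of every element matching `path`."""
    return [(e.text or "").strip() for e in html_to_tree(html).findall(path)]
```

With a page already in hand, a call such as `extract(page, ".//td[@class='name']")` returns the text of each matching cell. For the login step, something like `fetch_after_login("https://example.com/login", "https://example.com/report", {"user": "...", "password": "..."})` would supply that page, assuming the site uses a plain form POST and cookie-based sessions; sites with tokens or JavaScript-driven logins need more work.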

Brandon Jackson at Quora
