How advanced can web scraping or data mining be?
-
I am new to web scraping. I have done some challenging scraping jobs on PDFs and captured information hidden behind layers of forms. I wonder how tough web scraping can get. Also, what are the challenging situations in obtaining data?
-
Answer:
I think the most complicated aspect of obtaining data from the web is that websites are not consistent. If you wanted to get all prices for a particular good, it would be difficult to build one crawler that visits every online marketplace and obtains the data for that good. Instead, most people create one scraper for each individual marketplace. In short, the difficult part is writing a single set of code that can obtain and parse the information you want from various inconsistent sources.
Neil Aggarwal at Quora
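The "one scraper per marketplace" approach from the answer above can be sketched roughly as follows. This is only an illustration: the two marketplaces, their HTML snippets, and the names `Offer`, `parse_shop_a`, `parse_shop_b`, and `scrape` are all hypothetical, and a real scraper would fetch live pages and would likely use a proper HTML parser rather than regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Offer:
    """Common record that every site-specific parser must produce."""
    site: str
    product: str
    price: float
    currency: str


# Hypothetical markup from two imaginary marketplaces (assumption, not real sites).
SHOP_A_HTML = '<div class="item"><h1>USB Cable</h1><span class="price">$4.99</span></div>'
SHOP_B_HTML = '<li data-name="USB Cable" data-cost="5,49 EUR"></li>'


def parse_shop_a(html: str) -> Offer:
    # Shop A puts the name in <h1> and a dollar price in a "price" span.
    name = re.search(r"<h1>(.*?)</h1>", html).group(1)
    price = re.search(r'class="price">\$([\d.]+)<', html).group(1)
    return Offer("shop-a", name, float(price), "USD")


def parse_shop_b(html: str) -> Offer:
    # Shop B uses data attributes and a decimal comma.
    m = re.search(r'data-name="(.*?)" data-cost="([\d,]+) EUR"', html)
    return Offer("shop-b", m.group(1), float(m.group(2).replace(",", ".")), "EUR")


# One parser per site; the rest of the pipeline only ever sees Offer records.
PARSERS: Dict[str, Callable[[str], Offer]] = {
    "shop-a": parse_shop_a,
    "shop-b": parse_shop_b,
}


def scrape(site: str, html: str) -> Offer:
    return PARSERS[site](html)
```

Each new marketplace then costs one more parser function and one more entry in `PARSERS`, while the code that consumes `Offer` records stays unchanged.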
Other answers
The real challenge is scale. When you have to mine data from a few web pages, you can just run wget and fetch what you need. When it's about a thousand websites, that's a different problem altogether. It brings in various other challenges: setting up a big data infrastructure, smart distributed computing, adaptive crawling, and monitoring all of it. That's a real headache that you can't get rid of unless you decide to hand it to someone else. A lot has been achieved in this field through a good mix of technologies that you are free to pick and choose from, but needless to say the technology barrier is pretty high. The problem gets amplified when the output needs to be structured data (such as formatted XML or CSV files ready to be loaded into your DB). That raises an obvious concern about inconsistent data formats and the dynamic nature of the web. Because of this, only approximate solutions are possible if a single extractor does the whole job. I wrote a short post on this some time back: http://blog.promptcloud.com/2012/09/standard-data-formats-our-right-to-data.html. Another interesting challenge is acquiring datasets in real time. We have achieved near-real-time results, but reaching true real-time latency would be a big step forward. Apart from these, complexities within websites, such as AJAX components, have paved the way for advancements.
Arpan Jha
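As a rough illustration of the two points in the answer above (fetching many pages in parallel and emitting structured output), here is a minimal sketch. The `fetch` function is a stub standing in for a real HTTP request, and the names `crawl` and `to_csv` and the example URLs are assumptions made for this example. A production crawler would also need retries, rate limiting, and monitoring, none of which are shown.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fetch(url: str) -> Dict[str, str]:
    # Stub: a real crawler would download and parse the page here.
    return {"url": url, "title": "Page for " + url.rsplit("/", 1)[-1]}


def crawl(urls: List[str], workers: int = 4) -> List[Dict[str, str]]:
    # pool.map runs fetch concurrently and returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


def to_csv(rows: List[Dict[str, str]]) -> str:
    # Serialize the records to CSV text that could be loaded into a database.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "title"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    pages = crawl(["http://example.com/a", "http://example.com/b"])
    print(to_csv(pages))
```

A thread pool on one machine only goes so far. At the scale of a thousand websites, the same fetch-then-normalize shape would typically be spread across many workers with a shared queue.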
Actually, it can be very easy if you have a simple but powerful web scraping tool. If you try to write a crawler yourself, especially while teaching yourself from scratch, it is going to be a long and costly process. Octoparse is an easy but powerful web scraper that allows you to scrape data from almost any web page and convert it into structured formats. It is also very easy for beginners. One challenge is that some websites use anti-scraping measures, so you can't extract data from them. Facebook is very much a "walled garden", but sometimes you can still scrape some information from it. Here is a tutorial on how to extract data from Facebook: http://www.octoparse.com/tutorial/extract-facebook-data/ It's free now, so give it a try!
Jerry Huang
Web scrapers are tools that can extract important data from others' websites. These can be your competitors' or suppliers' websites. With the data the tool extracts, you can adjust your pricing, images, and so on accordingly and grow your online business. Web scraping tools can also help migrate all the data from your old website to a new one. One such tool that I am aware of is Easy Data Feed, provided by ShoppingCart.Elite. The tool is easy to use and can gather a great deal of vital information to help your online business succeed. Disclosure: I am a consultant for online businesses and have reviewed this platform.
Michael Domol
If I go back 5 or 10 years, website scraping was not an easy task. You needed a lot of manual analysis of the website you wanted to scrape, and then you had to write your program by hand to tell it where to start, which pages to hit, and what data to scrape. But if you look at 2015-2016, this has completely changed, thanks to a few startups that launched innovative products in this field and to the Chrome/Firefox developer tools for analysing the DOM and network traffic. For example, http://www.datascraping.co offers a point-and-click app (https://chrome.google.com/webstore/detail/advanced-web-scraper/gpolcofcjjiooogejfbaamdgmgfehgff) that extracts data in just a few minutes using jQuery-style CSS selectors, with a real-time preview of the extracted data. You can then use their desktop app for advanced features such as batch URL crawling, scheduling, scraping multiple websites in parallel, and more.
Prince Rathee
As Arpan and Neil have already mentioned, web scraping can get pretty complicated when you have many sites with different structures and when you need your output in a certain format. Luckily, Kimono (http://kimonolabs.com) makes a bookmarklet/Chrome extension that you can use for scraping almost any site. You can get your output in JSON, CSV, and RSS, and Kimono will host your data for you. Your API will even update when you tell it to. Full disclosure, I work for Kimono, but if you've ever struggled with creating a bunch of hand-made scrapers, try it out. It's really easy to use and it can save you a ton of work. I'd recommend giving it a try before putting in the work to write your own scraper. Plus, it's free!
Katie Lundsgaard
There are many challenges in web scraping. It is important to note that getting data through http://www.loginworks.com/our-services/web-scraping/ is not very easy. It runs into quite a number of problems, including but not limited to the following. Metadata: only a few datasets are explained thoroughly enough for a person to easily understand what they mean. It can therefore be very difficult for the person scraping to know what the web designer meant by some statements. Scale: differences in how data is represented, in terms of units of measure, can be a big challenge during data scraping. Terabytes of data can also be a problem for some file systems. Complexity of the source: the web user needs an exact answer to a specific question, so if the source to be scraped is complicated and hard to comprehend, the scraping process may fail because proper and accurate information cannot be extracted. For more information, visit: http://www.loginworks.com/blogs/web-scraping-blogs/data-scraping-considerations-challenges-benefits/
Jhon Lee
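The units-of-measure problem mentioned in the answer above can be illustrated with a small normalization step. This is a sketch under assumptions: the function name `to_grams`, the choice of weight as the example quantity, and the set of supported units are all hypothetical. A real pipeline would need a table like this for every quantity it scrapes.

```python
import re

# Conversion factors to grams for the units this sketch chooses to support.
GRAMS_PER_UNIT = {
    "g": 1.0,
    "kg": 1000.0,
    "lb": 453.59237,
    "oz": 28.349523125,
}


def to_grams(text: str) -> float:
    """Convert a scraped weight string such as '2.5 kg' or '500g' to grams."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([a-zA-Z]+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized weight: {text!r}")
    value, unit = float(match.group(1)), match.group(2).lower()
    if unit not in GRAMS_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * GRAMS_PER_UNIT[unit]
```

Raising an error on unknown input, rather than guessing, makes it easier to notice when a source changes how it formats its values.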
Web scraping is a good way to gather the information a business really needs. It helps with making decisions and analysing the market. A firm can collect competitors' pricing information and use it to set its own pricing matrix, which in turn can generate more profit. If you are looking for web scraping, check out the LOGINWORKS website: http://www.loginworks.com/our-services/web-scraping/
Jason Miller