Is it possible to build a stealth search engine (web crawling, not web scraping) to target just one website, without the site knowing, and what coding skills or anonymity measures would be required?
-
The website I am keeping tabs on puts up a new web page for each new product promotion, so I wonder if it is at all possible to build a search engine / web crawler to keep up to date with it. In other words, I want to collect the subdomain URLs on a given website, and only those URLs that contain a certain word or phrase. Please advise on how to stay in stealth mode. Please enlighten. Thanks a million.
Edit: To clarify my question: I have been reading up on the distinction between web crawling and web scraping, and it led me to believe that what I should be looking for is a web crawler / search engine, not a web scraper, because I just want the subdomain URL of each new promotion page, not the content on the pages. The difference between web crawling and web scraping:
http://stackoverflow.com/questions/4327392/what-is-the-difference-between-web-crawling-and-web-scraping
http://blog.promptcloud.com/2012/05/data-scraping-vs-data-crawling.html
http://stackoverflow.com/questions/3207418/crawler-vs-scraper
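For illustration, a minimal sketch of the kind of keyword-filtered link collector being described, assuming Python with the requests and BeautifulSoup libraries; the start URL and keyword below are placeholders, not taken from the question:

```python
# Minimal sketch: fetch one page, pull out its links, keep only URLs
# containing the keyword. START_URL and KEYWORD are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "https://www.example.com/promotions/"  # placeholder target page
KEYWORD = "promo"                                  # placeholder word/phrase

def collect_matching_urls(start_url, keyword):
    resp = requests.get(start_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    found = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(start_url, a["href"])  # resolve relative links
        if keyword in url:
            found.add(url)
    return found

if __name__ == "__main__":
    for url in sorted(collect_matching_urls(START_URL, KEYWORD)):
        print(url)
```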
-
Answer:
What you are asking for is called web scraping. You would use some kind of script that is scheduled to visit the site and harvest the information every so often. The language doesn't really matter, but Python has some great libraries for accomplishing tasks like this. You would scrape and store the results in some kind of database, most likely MySQL. Pre-coded solutions of the kind you are asking about also support using proxies, VPNs and spoofing the user agent to remain anonymous.
Dwayne Charrington at Quora
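A rough sketch of the scheduled scrape-and-store approach described above, using Python's standard-library sqlite3 module as a stand-in for the MySQL database mentioned; the URL, keyword and interval are placeholders:

```python
# Sketch of "visit on a schedule, store what you find": the standard-library
# sqlite3 module stands in for MySQL, and INSERT OR IGNORE keeps only URLs
# that have not been seen before. URL, keyword and interval are placeholders.
import sqlite3
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "https://www.example.com/promotions/"  # placeholder
KEYWORD = "promo"                                  # placeholder
INTERVAL_SECONDS = 6 * 60 * 60                     # e.g. check four times a day

def init_db(path="promotions.db"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, first_seen TEXT)"
    )
    conn.commit()
    return conn

def scrape_once(conn):
    resp = requests.get(START_URL, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        url = urljoin(START_URL, a["href"])
        if KEYWORD in url:
            conn.execute(
                "INSERT OR IGNORE INTO urls (url, first_seen) VALUES (?, datetime('now'))",
                (url,),
            )
    conn.commit()

if __name__ == "__main__":
    conn = init_db()
    while True:
        scrape_once(conn)
        time.sleep(INTERVAL_SECONDS)
```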
Other answers
Some things you should keep in mind:
- Use a reliable proxy that you can control programmatically (e.g., http://torproject.org/); see the sketch below.
- Spoof the user agent. A quick method is to handcraft 20-30 user-agent strings that correspond to common web browsers, then pick one at random every time you crawl the website.
- Use scheduling software to execute your crawling script regularly.
- For storage, go for NoSQL (MongoDB, etc.) over SQL databases (SQLite, MySQL, etc.), as it provides a schema-less way of storing data (i.e., key => value), which, from previous experience, is far better suited to inconsistent and large data.
Abdulaziz
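A hedged sketch of the proxy and user-agent-rotation tips above, assuming a local Tor SOCKS proxy listening on 127.0.0.1:9050 and requests installed with SOCKS support (pip install "requests[socks]"); the user-agent strings and target URL are illustrative only:

```python
# Sketch: route requests through a local Tor SOCKS proxy and pick a
# browser-like User-Agent at random on every request.
# Assumes Tor is running on 127.0.0.1:9050 and requests has SOCKS support.
import random
import requests

USER_AGENTS = [
    # A handful of hand-picked strings for common browsers (illustrative only).
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h also resolves DNS through Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=TOR_PROXIES, timeout=60)

if __name__ == "__main__":
    resp = fetch("https://www.example.com/")  # placeholder target
    print(resp.status_code, len(resp.text))
```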
Use a crawler with a user-agent name similar to a browser's, and set up a periodic refresh on the pages you are indexing. You can set up an HTTP collection on SearchBlox and do it for free.
Timo Selvaraj
Firstly, yes, it is possible to achieve what you are after fairly easily.
Anonymity - a website identifies you with multiple parameters:
- HTTP request headers
- Cookies and other identity markers
- The IP address of your request
- Activity patterns
Every HTTP request carries with it a payload of headers. However, these headers are completely open to manipulation, so this is not your major concern. As suggested, you would want to spoof the User-Agent string in there to appear normal, and alter any other identifiable data.
The IP address is slightly more tricky. The way the web works, you will need to work hard to spoof this, although for your scenario basic spoofing will get you a long way. Proxies, VPNs and other similar tech are what you want to research. Bear in mind, though, that depending on the service you use, you might end up getting the same or a similar IP each time, thus creating patterns.
Activity patterns are an interesting, but unreliable, way to identify a user. For instance, if you used a scheduler to run your crawler every 3 minutes with an IE10 user agent, routing your requests via a publicly available VPN service that has only a limited pool of widely known IP addresses, how difficult do you think it would be to track your activity after, say, 100 connections? Not difficult at all. Add to that your pattern of only crawling certain keywords, and you are standing bare naked in front of a smart webmaster (a sketch of breaking up such a fixed schedule follows below).
There are too many web crawling scripts available in the public domain to warrant mentioning only a few here. Decide on a language you are going to use, consider your coding abilities, and pick a script accordingly. Have fun :)
Pratik Bhonsle
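A small sketch of avoiding the fixed-interval fingerprint described above: rather than firing every N minutes exactly, sleep for a randomised interval between runs. The crawl step and the intervals below are placeholders:

```python
# Sketch: break up a too-regular schedule by adding random jitter to the
# delay between crawl runs. Intervals are illustrative only.
import random
import time

BASE_INTERVAL = 6 * 60 * 60      # nominal six hours between checks (placeholder)
JITTER = 2 * 60 * 60             # up to +/- two hours of random slack

def crawl_once():
    # Placeholder for the actual crawl/scrape step described in the answers above.
    print("crawling...")

if __name__ == "__main__":
    while True:
        crawl_once()
        delay = BASE_INTERVAL + random.uniform(-JITTER, JITTER)
        time.sleep(delay)
```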