How to get correct URLs in network-wide menu (Multisite)?

How can I get a list of all active URLs/websites?

  • I want to index pages against each set of keywords. My crawler needs an initial set of URLs. Is there any way to get all the active URLs? Is there a public directory where I can get them?

  • Answer:

    blekko has donated a bunch of crawl metadata to Common Crawl that contains what you're looking for: a list of domains we crawl with their ranks, and a list of URLs we know about with their ranks. See http://commoncrawl.org/blekko-donates-search-data-to-common-crawl/ for some details. The data I'm talking about is in the Common Crawl public S3 bucket.

Greg Lindahl at Quora
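
A ranked domain list like the one described in this answer could be turned into crawler seeds roughly as follows. This is only a minimal sketch: the tab-separated `domain<TAB>rank` line layout, the "higher rank is better" ordering, and the function names `parse_ranked_domains` and `top_seed_urls` are all assumptions made for illustration, not the documented format of the donated data. Check the actual files before relying on it.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def parse_ranked_domains(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (domain, rank) pairs from lines assumed to look like 'domain<TAB>rank'."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed layout
        domain, rank_text = parts
        try:
            rank = float(rank_text)
        except ValueError:
            continue  # skip lines whose rank is not numeric
        yield domain, rank


def top_seed_urls(lines: Iterable[str], limit: int) -> List[str]:
    """Return seed URLs for the `limit` highest-ranked domains (assumes higher is better)."""
    best = heapq.nlargest(limit, parse_ranked_domains(lines), key=lambda pair: pair[1])
    return ["http://" + domain + "/" for domain, _rank in best]


if __name__ == "__main__":
    sample = ["example.com\t10.5", "example.org\t3", "example.net\t7.25"]
    for url in top_seed_urls(sample, 2):
        print(url)
```

Using `heapq.nlargest` keeps only `limit` entries in memory at a time, which matters when the input list is far larger than the number of seeds you want.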


Other answers

"All" URLs? That's a long, long list. You should give some thought to problems of scale. How long will it take your crawler to crawl N web pages on average? How much memory will you need? As N gets bigger and bigger, those things matter. At some point you become Google, with huge data centers doing the indexing. You might want to take Udacity CS101 to get some insight into growing your index.

R. Drew Davis
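
The scale questions in this answer can be turned into a quick back-of-envelope calculation. The sketch below is only an illustration: the function name `crawl_estimate` and the sample throughput and page-size figures are assumptions, not measurements, so plug in numbers from your own crawler.

```python
from typing import Tuple

SECONDS_PER_DAY = 86400
BYTES_PER_GIB = 2 ** 30


def crawl_estimate(num_pages: int, pages_per_second: float, avg_page_bytes: int) -> Tuple[float, float]:
    """Return (days to crawl, GiB of raw page data) for a crawl of num_pages pages."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    days = num_pages / pages_per_second / SECONDS_PER_DAY
    total_gib = num_pages * avg_page_bytes / BYTES_PER_GIB
    return days, total_gib


if __name__ == "__main__":
    # Assumed figures for illustration: 100 million pages, 50 pages/s, 100 KiB per page.
    days, gib = crawl_estimate(100_000_000, 50, 100 * 1024)
    print(f"{days:.1f} days, {gib:.0f} GiB of raw pages")
```

Both results grow linearly with N, so a tenfold larger seed list means roughly ten times the crawl time and storage unless you add machines or bandwidth.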
