How to Get JSON from External URL?

What are the ways to get selected crawled pages from the whole crawled data?

  • I found that you have started providing a JSON file containing all indexed URLs for a specific domain. How can I download the pages for those selected indexed URLs via PHP? Is there a simple tutorial that shows how to get this selected data?

  • Answer:

    We don't have a tutorial yet on how to pull a single page, but we're working on it! From the JSON file you'll get a URL to the ARC file the page was in, plus an offset and a length. You should be able to do an HTTP range request for that part of the ARC file, uncompress it and retrieve the page. Feel free to ask the Google group for help or email me directly if you have more questions.

Jordan Mendelson at Quora
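
The range-request approach described in the answer above could be sketched in Python roughly as follows. This is a minimal sketch, not an official Common Crawl client: the function names and the `arc_url` parameter are hypothetical, and it assumes each record in the ARC file is stored as its own gzip member, so that the byte range given by the offset and length can be decompressed on its own.

```python
import gzip
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build a Range header value covering `length` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def decompress_record(compressed: bytes) -> bytes:
    """Decompress one record (assumption: the range is a complete gzip member)."""
    return gzip.decompress(compressed)


def fetch_record(arc_url: str, offset: int, length: int) -> bytes:
    """Fetch only the requested slice of the ARC file and return it uncompressed.

    `arc_url` is a placeholder for wherever the ARC file is hosted.
    """
    request = urllib.request.Request(
        arc_url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        return decompress_record(response.read())
```

The offset and length would come from the JSON index entry for the page; the returned bytes should then hold the ARC record header followed by the stored HTTP response.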


Other answers

No, there's no simple tutorial. CommonCrawl is a great resource for a knowledgeable user, but they don't document that last 10% which would make it approachable by anyone. Let's create something for you here.

First, the crawl, as is the nature of all crawls, is both incomplete and redundant. Second, the index is not complete even for what has been crawled, and this is not widely known or documented.

Having said that, the basic process is simple:

1. Construct a query URL and fetch a JSON document with the search results.
2. Use the results in the JSON to construct URLs to fetch the page contents.

The query URL looks like this (for the domain http://usnris.freebase.com):

    http://urlsearch.commoncrawl.org/download?q=com.freebase.usnris

It's a reversed prefix search, so com.freebase will also get you freebaserecords.com, because it's a prefix of com.freebaserecords. You'll have to be a little clever to match multiple subdomains without matching other pay-level domains.

The document you get back looks like this:

    {"url": "http://usnris.freebase.com/", "arcFileParition": 0, "arcSourceSegmentId": 1346876860565, "arcFileDate": 1346911315020, "compressedSize": 11858, "arcFileOffset": 7109504}
    {"url": "http://usnris.freebase.com/", "arcFileParition": 3985, "arcSourceSegmentId": 1346876860782, "arcFileDate": 1346910614827, "compressedSize": 12048, "arcFileOffset": 21140153}

You want to use that information to construct URLs which look like this (for the first example):

    http://urlsearch.commoncrawl.org/page/1346876860565/1346911315020/0/7109504/11858

Having said all that, I doubt that CommonCrawl's URLSearch facility is really set up to handle bulk queries, so the more polite (but harder) thing to do would be to look at the Python example at https://github.com/trivio/common_crawl_index and translate it to PHP (assuming you don't want Python). You'd probably want to use http://aws.amazon.com/sdkforphp/ instead of boto and translate the Python logic to PHP. Good luck!

Tom Morris
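
The two steps in the answer above (build a query URL, then turn each result into a page URL) might look like this in Python. It is only a sketch under stated assumptions: the helper names are made up for illustration, the URL layout is inferred from the single example in the answer (segment ID, file date, partition, offset, compressed size, in that order), and the field name `arcFileParition` is spelled exactly as it appears in the sample results. No network request is made here.

```python
import json

# Base URL taken from the answer above; the service may not be suited to bulk use.
SEARCH_BASE = "http://urlsearch.commoncrawl.org"


def reversed_host(host: str) -> str:
    """Turn 'usnris.freebase.com' into 'com.freebase.usnris' for the prefix search."""
    return ".".join(reversed(host.split(".")))


def query_url(host: str) -> str:
    """Step 1: build the search URL for a host name."""
    return f"{SEARCH_BASE}/download?q={reversed_host(host)}"


def parse_results(text: str) -> list:
    """Parse a sequence of JSON objects separated by whitespace or newlines."""
    decoder = json.JSONDecoder()
    records, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended.
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records


def page_url(record: dict) -> str:
    """Step 2: build the page URL from one search result."""
    parts = (
        record["arcSourceSegmentId"],
        record["arcFileDate"],
        record["arcFileParition"],  # spelled as in the sample results
        record["arcFileOffset"],
        record["compressedSize"],
    )
    return f"{SEARCH_BASE}/page/" + "/".join(str(part) for part in parts)
```

To keep only one domain and its subdomains, one option is to filter the parsed records on the host part of each `url` field after fetching, since the prefix search can also return look-alike domains.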
