Why hasn't Google indexed all of the URLs in my sitemap.xml?
-
My site has a comprehensive sitemap.xml that Google crawls frequently. According to Google's webmaster tools, only about 80% of the URLs in the sitemap are actually included in the index. Here's a screenshot demonstrating this: http://imgur.com/gAz02 (the sitemap in question is http://lanyrd.com/sitemap.xml). Should I be worried about this? Is there anything I can do to improve the ratio of indexed pages?
-
Answer:
Crawling doesn't guarantee indexing, either. Sitemaps tell search engines what to crawl, but the engines may still choose not to index what they find. As Tinus notes in his answer, unique content is important. You should also look at site architecture: make sure that all of these pages can be found in a natural crawl of the site.
Ian Lurie at Quora
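As a rough illustration of the "natural crawl" check suggested in the answer above, the sketch below walks a link graph breadth-first from the home page and lists sitemap URLs that can't be reached by following links. The link graph, the paths, and the function name `reachable_urls` are all made up for the example; in practice the graph would come from your own crawl of the site.

```python
from collections import deque

def reachable_urls(links, start):
    """Return every URL reachable from `start` by following links (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/conferences/", "/people/"],
    "/conferences/": ["/conferences/a/"],
    "/people/": ["/people/alice/"],
}

# Hypothetical sitemap contents.
sitemap_urls = {
    "/", "/conferences/", "/conferences/a/",
    "/people/", "/people/alice/", "/people/bob/",
}

# Sitemap URLs that no page links to, directly or indirectly.
orphans = sorted(sitemap_urls - reachable_urls(links, "/"))
print(orphans)
```

Any URL that shows up in `orphans` is listed in the sitemap but would be missed by a crawler that only follows links.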
Other answers
There's a lot of great advice here. I'd break the process down into four steps:
1. Crawling: Is Googlebot crawling all the pages? I include discovery in the crawling portion. You may want to check your web logs to make certain that it's getting to all the pages. It probably is, but it's worth checking.
2. Indexation: Once a page is crawled, does Google believe it merits being in the index? Is it valuable enough to be presented to users?
3. Rank: Once in the index, where is that content ranked? This essentially tells you how well your content compares against other similar content. It's another value-based determination by the Google algorithm.
4. Traffic: Traffic can be traced to rank, but often you'll find that good meta descriptions increase the amount of traffic you get. Essentially, how good is your CTR?
What's really nice to see is that you're using a sitemap index and breaking your site down into page spaces: books, conferences and people. Now, I'm just eyeballing it, but it looks like your conferences URLs are indexed at nearly a 90% rate while your people URLs are indexed at a 60% rate. That's great information to have! Your job is to figure out why Google likes your conference URLs more than your people URLs.
My take: Looking at your site, you're a social conference directory. So it would make some sense that your site would be strongest in, and would receive the most love from Google on ... conferences. Given that your conferences URLs are so strong, I'd look at the people URLs and see what you could do to make them more valuable. A quick look at them makes me believe there's not enough text on the people URLs to be indexed highly. You might think about creating profile summaries or excerpting profile information from other sites.
AJ Kohn
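The web log check in the crawling step above could look something like the sketch below. It assumes the server writes the common "combined" access log format, and the sample log lines, IP addresses and paths are invented for illustration. Matching on the user-agent string alone is only a first pass, since anyone can claim to be Googlebot.

```python
import re

# Captures the request path and the trailing user-agent field of a
# combined-format access log line (assumed format; adjust for your server).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(log_lines):
    """Return the set of paths requested by a client identifying as Googlebot."""
    paths = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            paths.add(match.group("path"))
    return paths

# Invented sample log lines.
sample_log = [
    '66.249.66.1 - - [10/Oct/2011:13:55:36 +0000] "GET /conferences/a/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Oct/2011:13:56:01 +0000] "GET /people/alice/ HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

# Hypothetical sitemap paths to compare against.
sitemap_paths = {"/conferences/a/", "/people/alice/"}

never_crawled = sorted(sitemap_paths - googlebot_paths(sample_log))
print(never_crawled)
```

Paths left in `never_crawled` are in the sitemap but never appear in a Googlebot request in the logs you examined.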
Sitemaps do not guarantee indexing. 80% is actually pretty good given that your site has no HTML sitemap and it's structured so that it's very difficult for robots to reach internal pages. Look at how Flickr structures its site to get search robots into deep pages with tag clouds, sitemaps, and browsing features. Robots mimic human behavior: when they find it too difficult to get deep into a site, they quit, and they don't care that you provided a map. Make sure every page of your site has at least 10 internal backlinks from category pages, tag lists, related photos, profile pages and other browsing structures.
David S Foreman
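One way to audit the "at least 10 internal backlinks" rule of thumb from the answer above is to count, for each page, how many distinct other pages link to it. This is a minimal sketch: the link graph and the function names are hypothetical, and the threshold is a parameter because the figure of 10 is that answer's suggestion rather than a documented requirement.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count how many distinct other pages link to each page."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # ignore repeated links on one page
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts

def under_linked(links, pages, minimum=10):
    """Return the pages with fewer than `minimum` distinct internal backlinks."""
    counts = inbound_link_counts(links)
    return sorted(page for page in pages if counts[page] < minimum)

# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/a", "/b"],
    "/a": ["/b"],
    "/b": ["/a", "/a"],
}

# A low threshold is used here only because the toy graph is tiny.
print(under_linked(links, ["/", "/a", "/b"], minimum=2))
```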
The Google crawl process can be simplified to something like this:
1) discovery
2) crawling
3) indexing
4) visible in the SERPs
Between each of these steps are multiple ("secret") quality assurance mechanisms (spam detection, near-duplicate detection, duplicate detection, ...). The sitemap.xml pushes URLs to 1) discovery. Some of them get 2) crawled, some of them don't. Some of the crawled URLs make it into the 3) index, some don't. Some of them match a (keyword) search and a user actually 4) sees them in the SERP; most of them don't. That said, a discovery / index ratio of 80% is not that bad.
Franz Enzenhofer
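The funnel described in the answer above can be made concrete by computing what share of URLs survives from each stage to the next. The counts below are invented purely to show the arithmetic (they are chosen so that discovery to index works out to 80%), and `funnel_ratios` is a hypothetical helper name.

```python
def funnel_ratios(stage_counts):
    """Return the share of URLs that survive from each stage to the next."""
    ratios = {}
    for (prev_name, prev_count), (name, count) in zip(stage_counts, stage_counts[1:]):
        ratios[f"{prev_name} -> {name}"] = count / prev_count if prev_count else 0.0
    return ratios

# Invented counts for illustration only.
stages = [
    ("discovered", 1000),
    ("crawled", 900),
    ("indexed", 800),
    ("shown in SERPs", 200),
]

ratios = funnel_ratios(stages)
for step, share in ratios.items():
    print(f"{step}: {share:.0%}")

# Overall discovery / index ratio, the figure quoted in the question.
overall = stages[2][1] / stages[0][1]
print(f"discovered -> indexed: {overall:.0%}")
```

Seeing where the largest drop occurs tells you whether to work on crawlability (discovery to crawl) or on page value (crawl to index).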
I always wondered about this myself. My guess was that it was because search engines don't believe anything exists unless it is linked to. And they probably have a limit to how much internal linking qualifies. To prove whether or not this is true, try linking to one of the items that you know is not in the index, and check back in a couple of weeks to see if it is there. (And let us know what you find).
Jade Rubick
I've never had a site with over a thousand pages get fully indexed despite sitemap, well thought out architecture, fully linked, unique content etc etc. At any one time around 95% of them are indexed but never the whole thing. Would love to know why too but unfortunately I think it's just the way it is...
Steve Evans
It's probably because the content of the pages is not unique enough. I had the same problem with a product catalogue where we had products called Acme Widget 2047 and Acme Widget 2047a. Because the product descriptions differed by only a few words, one of these would not get indexed. I solved this by writing more unique descriptions for each product and adding them to the description meta tag.
Tinus Guichelaar
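To spot pages like the two Acme Widget products above before a search engine does, you could compare page texts pairwise with a simple similarity score. The sketch below uses Jaccard similarity over word sets. That is an assumption chosen for simplicity, not a description of how Google detects near-duplicates, and the product descriptions are invented.

```python
def word_set(text):
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two texts: shared words / all distinct words."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# Invented descriptions that differ by a single word.
desc_2047 = "Acme Widget 2047 steel bracket for shelving"
desc_2047a = "Acme Widget 2047a steel bracket for shelving"

similarity = jaccard(desc_2047, desc_2047a)
print(f"{similarity:.2f}")
```

Pairs that score close to 1.0 are candidates for rewriting with more distinctive descriptions.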