Why is Google being prevented from crawling my pages by robots.txt restrictions?
-
After several attempts, I finally figured out how to verify my site for Google Webmaster Tools (Google searches for the WordPress forum answer lead you to the wrong one).
But…
Now I’ve discovered that many of my pages on this site could not be crawled by Google in recent attempts, 164 pages to be precise; some examples are below. All of the failures are due to restrictions by robots.txt. Since editing robots.txt requires access to the server root, it’s not something I can change myself. Are there WordPress.com settings that are preventing Google from crawling a large percentage of my pages? If so, what can we do about it? And fast!
http://songbook1.wordpress.com/pp/fx/features-2-older-2/film-musicals/
URL restricted by robots.txt (unavailable) Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/1929-standards/stardust-hoagy-carmichael-mitchell-parish/
URL restricted by robots.txt (unavailable) Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/0-new-features/1948-standards/the-night-has-a-thousand-eyes/
URL restricted by robots.txt (unavailable) Jun 5, 2011
http://songbook1.wordpress.com/2010/11/01/hellzapoppin/
URL restricted by robots.txt (unavailable) Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/1935-standards/just-one-of-those-things/
URL restricted by robots.txt (unavailable) Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/1929-standards/with-a-song-in-my-heart/
URL restricted by robots.txt (unavailable) Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/1930-hits-and-standards/rockin-in-rhythm/
URL restricted by robots.txt
Blog url: http://songbook1.wordpress.com/
-
Your site’s robots.txt file is available here:
http://songbook1.wordpress.com/robots.txt
There is nothing in that file that would actually block one of these URLs – this is the standard robots.txt we use by default for all sites on WordPress.com.
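For comparison, a fully permissive robots.txt, one that blocks nothing, has this general form (an illustrative sketch only; the live file at the URL above is authoritative, and the actual WordPress.com default may contain additional rules):

    User-agent: *
    Disallow:

An empty Disallow line means no URL is off limits to that user agent.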
Can you get any more information about this from Google or from Webmaster Tools?
-
Hi ranh,
I don’t know what you mean by more information.
But I made a large error. It’s not 164 URLs restricted by robots.txt; it’s 164 “in Sitemaps” with this crawler error due to restriction by robots.txt, and there’s another list of 838 URLs restricted by robots.txt. It’s not clear whether the 164 is a subset of the 838 or whether the two should be added together for the total.
What information would be helpful?
Jim
-
Can you give us a sampling of some pages from both lists?
Also, it would be great if you could send us a full-screen screenshot where this issue is visible in Webmaster Tools.
-
Google says, in response to a general question regarding such errors:
Google was unable to crawl the URL due to a robots.txt restriction. This can happen for a number of reasons. For instance, your robots.txt file might prohibit the Googlebot entirely; it might prohibit access to the directory in which this URL is located; or it might prohibit access to the URL specifically. Often, this is not an error. You may have specifically set up a robots.txt file to prevent us from crawling this URL. If that is the case, there’s no need to fix this; we will continue to respect robots.txt for this file.
If a URL redirects to a URL that is blocked by a robots.txt file, the first URL will be reported as being blocked by robots.txt (even if the URL is listed as Allowed in the robots.txt analysis tool).
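For anyone who wants to test a specific URL against a live robots.txt, here is a minimal sketch using only Python’s standard library (the sample URL is one of the affected pages listed earlier in this thread):

    # Check whether a crawler may fetch a URL under a site's robots.txt.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://songbook1.wordpress.com/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    url = "http://songbook1.wordpress.com/2010/11/01/hellzapoppin/"
    # True means Googlebot is allowed to crawl this URL.
    print(rp.can_fetch("Googlebot", url))

Note that this checks the robots.txt as it exists right now; Google’s error reports may reflect an older version of the file.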
-
1. The links above are from the list of 164 crawler errors in Sitemaps, due to restrictions by robots.txt.
I’ll give you a sample from the larger list below. But first…
2. A screenshot? Sure, but how should I do it? Put it on a page on the site? Right now it’s a Paint file in a Windows XP documents folder.
-
Here’s the beginning of the large crawler error file in the Webmaster Tools account:
http://songbook1.wordpress.com/pp/photo-galleries/eckstine-billy/billy_eckstine_4/
URL restricted by robots.txt Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/judy-garland/mr-monotony-judy-garland/judy-48-monotony-jr-1-e1-s1g20/
URL restricted by robots.txt Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/judy-garland/mr-monotony-judy-garland/judy-mrmonotony-11-t100f30s-6-sh20/
URL restricted by robots.txt Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/ethel-waters/ethel-waters-02-fur/
URL restricted by robots.txt Jun 5, 2011
http://songbook1.wordpress.com/category/palomera/
URL restricted by robots.txt Jun 5, 2011
http://songbook1.wordpress.com/category/mp4/
URL restricted by robots.txt Jun 5, 2011
http://songbook1.wordpress.com/pp/photo-galleries/eckstine-billy/billy-eckstine-no-cover-no-minim-510136-2/
URL restricted by robots.txt Jun 5, 2011
http://songbook1.wordpress.com/pp/fx/features-2-older-2/irving-berlin-1907-1914/1919-berlin-you-cannot-make-your-shimmy-shake-on-tea-ziegfeld-follies-of-1919-f40-s1/
-
There seems to be some inconsistency in the Webmaster error messages.
The larger list of crawler errors contains URLs that are supposed to be ignored by Google – however, they’re not blocked in the robots.txt, they’re just not included in the sitemap:
http://songbook1.wordpress.com/sitemap.xml
The sitemap should only include the posts and pages on your blog – it doesn’t include category archives, image attachments, etc.
-
I didn’t know what they wanted when I was asked to submit a sitemap. So I sent them a directory of pages on my site that I provide for visitors. This may account for many of the errors. I don’t know.
I’ll submit the proper sitemap. I don’t know if that will correct the previous errors.
But it’s very late here. I’m going to wait until tomorrow. Hope you don’t mind.
-
But I didn’t see category archives or image attachments in that sitemap. Only pages: an alphanumerical index of the 500-plus published pages on the site.
-
If you submitted a sitemap yourself, it would only be processed if it’s in the proper sitemap format (the standard sitemaps.org XML protocol).
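A minimal sitemap in that format looks like this (the entry below is illustrative, using one of the pages from this thread with a placeholder date, not copied from the real sitemap):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://songbook1.wordpress.com/2010/11/01/hellzapoppin/</loc>
        <lastmod>2011-06-05</lastmod>
      </url>
    </urlset>

Each page gets its own <url> entry; a plain directory page of links is not recognized as a sitemap.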
-
Thanks, ranh. I will provide that to them if necessary, but now I’m not sure it’s needed. It has just come to my attention that all of those thousand or so errors were detected by Google on or before June 5. I didn’t submit that (incorrect) sitemap until a day or two ago, shortly after I finally got the site verified. So that’s not the issue.
-
15 days ago, on June 1, you set your blog to block search engines, which does alter your robots.txt; that is what caused Google’s inability to access your posts.
You set your blog back to public 11 days ago, on June 5.
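For reference, while that setting is on, the site serves a robots.txt that disallows everything, which in its standard form looks like this (illustrative of the general pattern, not a copy of the exact file):

    User-agent: *
    Disallow: /

A lone “Disallow: /” blocks every URL on the site for all crawlers, which matches the errors Google reported.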
-
That sounds correct.
So what do I do to get an update with respect to robots.txt? Google seems to indicate that these 1,000 or so errors are current. But I’m new to this game and might have missed a step somewhere.
-
The number of Crawl Errors found by Google decreased for Songbook1.wordpress.com today:
From 164 to 143 “in Sitemaps” restricted by robots.txt, and
From 838 to 778 total (“in Sitemaps” is a portion of these) restricted by robots.txt.
I don’t know why. Perhaps Google is gradually clearing the crawl errors it detected while I had the WP setting to block search engines for a few days earlier this month.
-
Perhaps I should be equally concerned with another large category of “crawl errors” labeled “Not Found.” There are presently 787 of them. I’ve made a lot of edits to images on this site in the past 8 months or so, because of a long period, roughly mid-2009 to late 2010, when I had a defective monitor. During that period I routinely darkened the great majority of images prior to embedding them. Probably at least 2,000 images needed re-editing, and the task is not yet finished.
That would account for many of these, I suspect. I’ve also periodically renamed URLs to shorten them or to make them conform to a system implemented some time after their creation. Some posts/pages are eventually deleted because they are working edits to be discarded after the final version of the post or page is published. Or a new version of a page may replace an earlier one following amendment.
-
I don’t know if I should use the URL removal tool in any of these cases. That tool is supposed to remove an unwanted URL, for example a non-existent one, from Google searches. But I’ve just used it for the first time today and have no idea how effective it is. I wonder whether it might speed up the process of clearing these errors.
-
At this point, just hang in there and wait for Google to iron out the kinks.
Google is not an instant service. When you set your site as private, that just happened to be the time that Google stopped by.
Your site is public now, and Google has access, but it hasn’t stopped by since it was blocked earlier.
-
The topic ‘Why is Google being prevented from crawling my pages by robots.txt restrictions?’ is closed to new replies.