Are NoIndex Articles Crawled By Google Bots?

Does posting copy-pasted articles with a noindex rule applied to them, to prevent Google from indexing them, save you from Google penalties for copy-pasted content? Are noindex articles cached by search bots? Would you recommend posting copied articles and noindexing them?

These are the doubts raised by my readers, and here is the discussion between us:

Reader: Actually, I am posting some articles as noindexed because their content is present almost everywhere (for example, installation steps). I was thinking of noindexing them so that they would not affect the other indexed articles on the same site.

Answer: "Noindex" does not mean Google will always exclude a page from its database. Google will still access the page, analyze the content, and decide whether or not to use it. If you say "noindex", then 99% of the time Google will not include the page in its index.

Now, if you have a lot of copied content, Google will conclude your blog is a useless blog. Using a noindex tag may reduce the chances of being penalized, but having too many of them will raise a red flag with Google. Google allows some buffer zone within which it will ignore some copied content, but when you go beyond the limit Google has set, your site will be penalized.

In reality, Google looks for several signals before it penalizes a site. Having a lot of copied content is a strong signal. Noindexing a lot of pages will raise other doubts as well, unless they are confined to a specific directory or section of your site. And if many of your articles link to articles that are noindexed, that too will create doubts for Google.
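For reference, the noindex rule discussed above is usually applied as a robots meta tag in the page's head, or as an equivalent HTTP response header; a minimal sketch:

```html
<!-- In the <head> of the page you do not want indexed -->
<meta name="robots" content="noindex">

<!-- Alternatively, the same directive can be sent as an
     HTTP response header (useful for non-HTML files):
     X-Robots-Tag: noindex -->
```

Note that either form still requires Google to crawl the page in order to see the directive, which is exactly why the crawler continues to access noindexed pages.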

Reader: Why not use the rel=canonical tag in such a case and give credit to the original creator?

Answer: The rel=canonical tag will also work, but again, it gives Google a hint that "the other site is better than mine". Nothing may happen if you do it for just a few articles, but if there are many, Google could interpret your site as a copycat or content-syndication site. Google recommends cross-domain rel=canonical only in two cases:

1. The content is reproduced for syndication purposes.
2. The same content is published by the same site owner on multiple sites, either to provide localized content or as part of moving from one domain to another.

But if you reproduce content just to get more content onto your site, neither giving credit nor hiding it will help you. Eventually, your site will be penalized. Google has no reason to waste its resources processing duplicate content from all over the web.
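For the syndication case above, the reproduced copy would carry a canonical link pointing back to the original article; a sketch, with placeholder URLs:

```html
<!-- On the syndicated (copied) page, in the <head> -->
<link rel="canonical" href="https://original-site.example.com/original-article/">
```

Keep in mind this is a hint to Google, not a binding directive: Google may or may not honor the declared canonical.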

Reader: I have to agree with your statement, but if it is a necessity, then one should use the rel=canonical tag to avoid a penalty.

I'm in no way advocating doing this on an entire website. I'm not sure why he asked this question or what the purpose behind it is.

Actually, I had linked to a tutorial from my article, but the linked site keeps going down frequently. Hence, I thought I should copy the linked content and post it with a noindex tag, so that my readers would get the content regardless of whether the other blog is up or down.

Blocking a page using robots.txt does not mean noindexing it. Basically, you are restricting Google from looking at the content, but Google can still index the page. (You will usually see the message "A description for this result is not available because of this site's robots.txt".)
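As a sketch, a robots.txt rule like the following (the path is a placeholder) blocks crawling of a page but does not stop its URL from being indexed if other pages link to it:

```
# robots.txt at the site root
User-agent: *
Disallow: /copied-tutorial/
```

Since Google cannot fetch a blocked page, it also cannot see any noindex meta tag inside it, which is why such a page may still appear in results with the "description is not available" message.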

I always prefer the rel=canonical tag in the case of cross-posting. We had a client who ran a multinational store and published important company news on all of his portals.

Need one clarification: would having too many noindexed pages definitely make Google suspicious? I mean, why would it? I could have only 100 links for users from the outside world, while 1,000 links are only for internal use (not just on an intranet, but also for suppliers who are on the web).

Answer: That's a grey area; no one knows anything for sure, just guesses based on experience. Believe the one you think knows best, or do your own test.

By "grey area" I mean something we don't know for sure and have not done sufficient tests to prove. In this case, the discussion is based on what we think we know about Google, not on official information from them.

I have 1,100 links, for example: 100 are for everyone (common use) and 1,000 are for internal use (for example, by suppliers and customers). The 1,000 links are noindexed because, other than suppliers and customers, I would not want the general public to access them just by searching. A supplier, however, has the full URL. In that case, why would noindexing 1,000 links arouse Google's suspicion? I have deliberately taken a high number of noindexed links because the person who asked the question wanted to know whether he can keep the copied content, noindex it, and still be safe. I feel he would.

Reader: rel=canonical is extremely useful in pagination and cross-posting. Unfortunately, many people put a noindex, nofollow tag on the page, which prevents Google from seeing the rel=canonical tag on it.

Answer: If you want to separate content into private and public sections, the best approach is to put them in different folders and block the private folder from search engines. That is a common and accepted practice. The problem arises when you mix indexed and noindexed content within the same folder, especially when the noindexed content is scraped content.
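A minimal sketch of that folder-level approach, assuming the private content all lives under a single directory (the folder name is a placeholder):

```
# robots.txt: keep all private content in one folder and block that folder
User-agent: *
Disallow: /internal/
```

Keeping every private page under one blocked folder means the rest of the site contains only content intended for indexing, avoiding the suspicious mix of indexed and noindexed pages described above.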
