Removing a web site / single URL from the Google Index

In certain cases, in the course of lifting a penalty (solving usability or content related problems, removing unwanted content, requesting re-evaluation), it may become necessary to remove an offending web page or cut off access to a faulty URL. Such changes take effect in a web site's ranking during subsequent crawls and updates or, in case of a ban, are viewed as the proper action taken when the content is re-evaluated. However, entirely removing a URL from the index is rarely necessary: the historic supplemental results and cached versions of a URL (whose server already responds as deleted or permanently redirected) do not determine the outcome.

For more generic purposes, Google will remove any URL from its index if its crawl requests consistently receive a 404 (Not Found) or 410 (Gone) HTTP status code from the server. A permanent redirect (status code 301) gradually transfers all parameters of the source URL to the target of the redirect, and once the transition is complete the source page may be dropped from the index. URLs of obsolete pages are likely to become "historic" supplemental results first, then fade out of the index completely. To remove valid URLs that are otherwise accessible, the proper methods are the robots.txt disallow features and/or limiting Googlebot indexing with META tags.

Sometimes web site owners feel that a page they administer, and which is currently in the Google index, should no longer be listed as a result. Removing the page from the server, so that the server answers Googlebot's requests with a 404 (Not Found) or 410 (Gone) status code, will eventually mark the web page as supplemental, and it will no longer be shown as a result for normal searches.
Such historic supplemental pages may stay within the index, and may be reached by queries unique to the deleted page, for up to a year. During this time a copy of the last crawled version remains in the Google cache. Historic supplemental pages play no role in evaluating a web site for relevance or importance, and are generally ignored completely by the algorithms. Unless the deleted but still cached content raises legal issues (copyright breach, defamation) or security problems with the information displayed, you should not need to remove the page entirely from the index. In those cases you may request that the cache be deleted through the Google webmaster help center, using the tool for removing URLs.

The manual URL removal system only works if the request corresponds with either a 404 (Not Found) server response, or a Googlebot related restriction on the pages and in the robots.txt file. Also note that excluded pages are still cached and indexed, and only remain hidden from users for an estimated period of 180 days or more (at least six months). The URL removal tool thus cannot be used to clear the history of a web site, only to exclude it from the search result pages. Combining it with the Reinclusion request when dealing with penalties and bans is therefore redundant and generally advised against.

Resolution: Completely removing an otherwise valid URL from the Google index, or preparing to remove a page that is soon to be deleted, should be done by restricting the crawling, indexing, caching and display of the pages on a case by case basis. Placing the proper META tags in the HEAD section of a page communicates the necessary information to the algorithms; over subsequent crawls and updates the page is gradually excluded from the results, and the associated cache is cleared as well.
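The rules of thumb described so far can be condensed into a small Python sketch. It only restates the text above and does not reflect how Google actually implements anything. The function names and the returned strings are hypothetical, and two points are assumptions: that 410 is treated like 404 for removal requests, and that either kind of Googlebot restriction is sufficient on its own.

```python
from http import HTTPStatus


def expected_index_outcome(status_code: int) -> str:
    """Summarise what the text says happens to a URL whose server
    consistently answers Googlebot with the given status code."""
    if status_code in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        # 404 / 410: becomes a historic supplemental result, then fades out.
        return "removed over time"
    if status_code == HTTPStatus.MOVED_PERMANENTLY:
        # 301: parameters move to the target, the source may be dropped.
        return "replaced by redirect target"
    if status_code == HTTPStatus.OK:
        # Page still served: needs robots.txt or META restrictions instead.
        return "stays indexed unless restricted"
    return "not covered here"


def removal_request_applies(status_code: int,
                            has_googlebot_meta_restriction: bool,
                            blocked_in_robots_txt: bool) -> bool:
    """Precondition for the manual URL removal tool as described above.

    Assumption: 410 counts like 404, and either restriction is enough.
    """
    page_is_gone = status_code in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE)
    return (page_is_gone
            or has_googlebot_meta_restriction
            or blocked_in_robots_txt)


if __name__ == "__main__":
    for code in (200, 301, 404, 410):
        print(code, expected_index_outcome(code))
```

A page that still answers 200 and carries no restriction fails the precondition, which matches the point that the removal tool cannot be used on a live, unrestricted page.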
Using robots.txt disallow features will be reported as a temporary block against the crawling of these URLs, and only works while it is in place. It is advised to add the page specific directives first; once these have been recognized, the removal of the pages or the setup of the robots.txt disallow rules can follow. It is also important to note that if a URL is constantly referred to by links from other web sites, it will be tried against the robot directives over and over again, and as soon as those no longer restrict the crawl while the page at the URL is still in place, it will be re-indexed. About a year after a URL becomes unavailable for Googlebot to crawl, it is usually dropped from the index entirely, including the cache and occasional supplemental versions.

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOARCHIVE, NOSNIPPET">

is to be placed in the HEAD section of the pages that are to be excluded from the index. Another method is the

<META NAME="GOOGLEBOT" CONTENT="NOFOLLOW">

directive, which instructs Googlebot not to follow the links found on the page that carries it. Use with caution.

Below is an example of a robots.txt entry that blocks Googlebot from crawling certain URLs. Place the robots.txt file in the root directory of your web site (or subdomain); crawlers request it only from there, so a copy placed in a subdirectory is not read. You can list URL paths, use just the / character to extend the directive to everything under the root (this puts all the URLs of the domain or subdomain on the list of exceptions), or use directory names with / at their end to exclude those directories.

You may also speed up the removal of a URL from the index by first restricting its crawl and/or deleting the page, and then requesting the removal through the URL removal request page of Google Webmaster Tools.
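Before relying on a set of disallow rules, it can help to check locally which URLs they would block. The sketch below uses Python's standard urllib.robotparser module. The rules in ROBOTS_TXT, the helper name is_blocked and the URLs are made up for illustration, and Python's parser is not guaranteed to interpret every rule exactly as Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one directory and one file for Googlebot only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
Disallow: /old-page.html
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows the URL
    for the given user agent, according to Python's parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/private/report.html",
        "http://www.example.com/old-page.html",
        "http://www.example.com/index.html",
    ):
        print(url, is_blocked(ROBOTS_TXT, "Googlebot", url))
```

Replacing the two Disallow lines with a single `Disallow: /` makes every URL on the host report as blocked, which is the "entire domain" case discussed above.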
An example of robots.txt rules for disallowing an entire directory (if placed in the domain root, i.e. reachable through www.example.com/robots.txt, this directive will disallow the crawl of the entire domain):

# Disallow Googlebot

Another example disallowing a directory, using a path relative to the site root, where the robots.txt file sits:

# Disallow Googlebot

Another example disallowing specific files, using a path relative to the site root, where the robots.txt file sits:

# Disallow Googlebot

http://www.askapache.com/seo/updated-robotstxt-for-wordpress.html

AskApache.com robots.txt files

For instance, I am disallowing /category/ in the robots.txt file below because askapache.com/category/htaccess/ is the same as askapache.com/htaccess/, and that would be duplicate content. Adding a 301 redirect using mod_rewrite or RedirectMatch can further protect me from this duplicate content issue.

User-agent: *

z.AskApache.com/robots.txt

User-agent: *

Robots Meta Tags

Using the robots meta tag:

<meta name="robots" content="noindex,follow" />

Allow other robots to index the page on your site, preventing only Google's bots from indexing the page:

<meta name="googlebot" content="noindex,follow" />

Allow robots to index the page on your site but not to follow outgoing links:

<meta name="robots" content="nofollow" />

header.php Trick for Conditional Robots Meta

<?php if(is_single() || is_page() || is_category() || is_home()) { ?>

Robots.txt footnote

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it is current for your site so that you do not accidentally block the Googlebot crawler.

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.
Examples of non-malicious duplicate content could include:

However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.

Google tries hard to index and show pages with distinct information. This filtering means, for instance, that if your site has a "regular" and a "printer" version of each article, and neither of these is blocked in robots.txt or with a noindex meta tag, Google will choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate its rankings and deceive its users, it will also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.

Pages you block with robots.txt may still be added to the Google index if other sites link to them. As a result, the URL of the page and, potentially, other publicly available information can appear in Google search results. However, no content from your pages will be crawled, indexed, or displayed. To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag, and ensure that the page does not appear in robots.txt. When Googlebot crawls the page, it will recognize the noindex meta tag and drop the URL from the index.
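Since the noindex meta tag only works if it is really present in the served HTML, a quick local check can be useful. The sketch below uses Python's standard html.parser module to look for a noindex directive addressed to all robots or to Googlebot specifically. The class and function names are hypothetical, and the check only approximates how a crawler reads the tags.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and
    <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lowercased meta name -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            values = {
                item.strip().lower()
                for item in attr.get("content", "").split(",")
                if item.strip()
            }
            self.directives.setdefault(name, set()).update(values)


def googlebot_noindex(html: str) -> bool:
    """Return True if the page asks all robots, or Googlebot, not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    applicable = (parser.directives.get("robots", set())
                  | parser.directives.get("googlebot", set()))
    return "noindex" in applicable


if __name__ == "__main__":
    page = '<head><meta name="googlebot" content="noindex,follow" /></head>'
    print(googlebot_noindex(page))
```

Keep in mind the point made above: the tag is only seen if the page itself is crawlable, so a URL that is also disallowed in robots.txt never gets its noindex read.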
You can instruct Google not to include content from your site in its index, or to remove content from your site that is currently in the index, in the following ways:

http://www.askapache.com/robots.txt

User-agent: *
# Google Image
# Google AdSense
# Does anyone care I love Google Apache htaccess, not to mention robots.txt, wordpress and hacking code
# http://www.sitemaps.org/faq.php