Google devised a new way to combat this problem in January 2005, and the idea was immediately welcomed by MSN and Yahoo. The new implementation works in favor of Google's PageRank (PR) technology and reduces the workload of the search engines, because they no longer follow links that the webmaster has not approved.

How it Works

The change that Google suggested is to add rel="nofollow" to any hyperlink that you don't want the search engines to count as a vote or follow through. We insert the new attribute within the anchor tag.

A normal link looks like this:

<a href="http://www.mysite.com/index.html">My website </a>

A nofollow link looks like this:

<a href="http://www.mysite.com/index.html" rel="nofollow" >My website </a>
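For links that your own scripts generate, the attribute can be added at output time. The following is only a minimal sketch: the helper name nofollowLink is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: build an anchor tag that carries rel="nofollow".
function nofollowLink($url, $text) {
    // Escape both values so they cannot break out of the attribute or the tag.
    $safeUrl  = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    $safeText = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safeUrl . '" rel="nofollow">' . $safeText . '</a>';
}

echo nofollowLink('http://www.mysite.com/index.html', 'My website');
```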

How Can We Automate The Process

The problem comes in when many people start to post comments on your blog or website. Do you manually go through all the comments and insert the nofollow attribute? No way! We need a simple and elegant solution to add the "nofollow" attribute to all the links that users submit. This is not an issue for many bloggers because the most popular blogging software, WordPress, already has a solution at the time of writing. However, if you are not using WordPress or are running your own website that accepts user comments, the solution proposed in this article might help.

The comments that the user submitted can be captured using the $_POST variable. By using the preg_match_all and preg_replace functions (assuming you are using PHP), we can identify the anchor tags within the comments and modify them accordingly before adding the comments to the database.
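As a first step, here is a rough sketch of pulling the anchor tags out of a submitted comment. The function name findAnchors is hypothetical, the $post argument stands in for $_POST, and the 'comments' field name is an assumption about your form. The pattern is deliberately loose and only locates the tags; it does not rewrite them.

```php
<?php
// Return every anchor tag found in a submitted comment.
// $post stands in for $_POST; the 'comments' field name is an assumption.
function findAnchors(array $post) {
    $comment = isset($post['comments']) ? $post['comments'] : '';
    // Loose pattern: anything from "<a" up to the nearest closing "</a>".
    preg_match_all('/<\s*a\b[^>]*>.*?<\/a>/is', $comment, $found);
    return $found[0];
}

$anchors = findAnchors(array('comments' => 'Nice post! <a href="http://mysite.com">visit me</a>'));
print_r($anchors);
```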

Analyzing The Anchor Tag

Anchor tags can come in various forms, which makes our task a bit harder.

1. With spacing, the HTML tag might look like this:

< a href = "mysite.com"> mysite </a>.

2. The user might not use any quotation or use single quotation to enclose the url:

<a href=mysite.com>mysite</a> or <a href='http://mysite.com'>mysite</a>

3. There can be other attributes within the tag:

<a href="http://mysite.com" rel="happy" title="this is the main title" whatever="abcxyz">mysite</a>
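To see why these variants matter, the sketch below runs a naive pattern (my own illustration, not one proposed in this article) against the forms listed above plus a plain link. It expects exact spacing, double quotes and no extra attributes, so it only recognises the plain link.

```php
<?php
// The variants listed above, plus a plain link for comparison.
$samples = array(
    '<a href="http://www.mysite.com/index.html">My website </a>',
    '< a href = "mysite.com"> mysite </a>',
    '<a href=mysite.com>mysite</a>',
    "<a href='http://mysite.com'>mysite</a>",
    '<a href="http://mysite.com" rel="happy" title="this is the main title" whatever="abcxyz">mysite</a>',
);

// A naive pattern: exact spacing, double quotes only, no extra attributes.
$naive = '/<a href="[^"]*">/';

$matched = 0;
foreach ($samples as $html) {
    if (preg_match($naive, $html)) {
        $matched++;
    }
}
echo $matched . ' of ' . count($samples) . " samples matched\n";
```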

The Hard Core Part

If you need to take a deep breath, this is the time to do so. In this section, we are going to write a regular expression that can catch all the different <a href> tags mentioned above and convert them to "nofollow" links. If you are not interested, feel free to jump to the implementation step.

I came up with a regular expression as follows:

$preg = "/<[\s]*a[\s]*href[\s]*=[\s]*[\"\']?([^\"\'\s>]*)[\"\']?[^>]*>(.*?)<\/a>/i";

Let me briefly go through what the expression means:

Part 1: "<[\s]*a[\s]*href[\s]*=[\s]*" matches different variants of "<a href=" with spacing taken into account.

Part 2: "[\"\']?([^\"\'\s>]*)[\"\']?" tries to match the url, with or without quotation marks. Eg, my-site.com/index.html, "http://abc.mysite.com" or 'mysite.com'. The url is everything up to the next quotation mark, space or ">" character, so characters such as ":" and "/" are included.

Part 3: "[^>]*>" matches any other attributes until it hits the ">" character. The ">" character signifies the end of the opening anchor tag.

Part 4: "(.*?)" matches the link text.

Part 5: "<\/a>" matches the closing tag, ie </a>.

Part 6: "/i" makes the whole expression case insensitive.

Now that we have the main regex, it is time to write the entire function.

function filterHref ($str) {
    // strip backslashes from the submitted text
    $str = stripslashes($str);
    $preg = "/<[\s]*a[\s]*href[\s]*=[\s]*[\"\']?([^\"\'\s>]*)[\"\']?[^>]*>(.*?)<\/a>/i";
    // start with empty arrays so that a comment without links is returned unchanged
    $pattern = array();
    $replace = array();
    preg_match_all($preg, $str, $match);
    foreach ($match[1] as $key=>$val) {
        // $match[0] is the whole tag, $match[1] the url, $match[2] the link text
        $pattern[] = '/'.preg_quote($match[0][$key],'/').'/';
        $replace[] = "<a href='$val' rel='nofollow'>{$match[2][$key]}</a>";
    }
    return preg_replace($pattern, $replace, $str);
}

The function accepts a string argument and strips any backslashes from it. This is to prevent any errors when we use the preg_replace function. The preg_match_all function tries to match all the anchor tags and stores them in the $match[0] array. $match[1] stores the url while $match[2] contains the link text. With all this information at hand, we rebuild the anchor tag with the "nofollow" attribute in this important line:

$replace[] = "<a href='$val' rel='nofollow'>{$match[2][$key]}</a>";

* preg_quote escapes the special regex characters in each matched tag, which makes sure the $pattern array works in the preg_replace function.
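The same rewrite can also be done in a single pass. The sketch below is one possible variant, not the function described above. The name filterHrefCallback is hypothetical, and the url class [^"'\s>]* is an assumption chosen so that ":" and "/" end up in the captured url. It relies on preg_replace_callback, whose return value is inserted as-is, so a "$1" in the link text is not treated as a back-reference. It also adds the "s" modifier so that link text may span several lines, and it requires whitespace between "a" and "href".

```php
<?php
// Hypothetical single-pass variant of the nofollow filter.
function filterHrefCallback($str) {
    // Assumed url class: everything up to a quote, whitespace or ">".
    $preg = '/<\s*a\s+href\s*=\s*["\']?([^"\'\s>]*)["\']?[^>]*>(.*?)<\/a>/is';
    return preg_replace_callback($preg, function ($m) {
        // $m[1] is the url, $m[2] is the link text.
        return "<a href='" . $m[1] . "' rel='nofollow'>" . $m[2] . "</a>";
    }, $str);
}

echo filterHrefCallback('<a href="http://mysite.com" title="x">mysite</a>'), "\n";
echo filterHrefCallback('< a href = "mysite.com"> mysite </a>'), "\n";
echo filterHrefCallback('No links here, costs $1.'), "\n";
```

Like the pattern it is based on, this only catches tags in which href is the first attribute, so treat it as a starting point to customize.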

Implementation

How do we use the function? Simply copy and paste the PHP function above and put it in your page. You may have an if-statement somewhere in your script to check if the user submitted any data from a form, like so:

if (isset($_POST['submit'])) {
// check for data integrity
// insert comments into the database
}

Before you insert the comments into the database, filter them with the "filterHref" function that we just created. This is also the place to add other checks, such as whether the comments are from a registered user or not. We can also choose to filter only the comments from unregistered users.

if (isset($_POST['submit'])) {
    // check for data integrity
    $comments = $_POST['comments'];
    if ($user == 'not registered') {
        // only rewrite the links submitted by unregistered users
        $comments = filterHref($comments);
    }
    // you can now safely insert $comments into the database
}

We use the "filterHref" function after the data integrity check because the preg_replace and preg_match_all functions can be a bit slow, especially if there are lots of links in the comments.
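If speed is a concern, one option is to skip the regex work altogether when a comment cannot contain a link. This is only a sketch under my own assumptions: the wrapper name filterIfNeeded is hypothetical, and $filter is any callable that rewrites anchor tags, such as a "filterHref" function. Since the pattern needs the word "href" to match, a plain substring search for it is a safe and typically cheaper first test.

```php
<?php
// Hypothetical wrapper: only run the (slower) filter when "href" is present.
function filterIfNeeded($comment, callable $filter) {
    // stripos is a case-insensitive substring search; no regex involved.
    if (stripos($comment, 'href') === false) {
        return $comment; // nothing that could be a link, return unchanged
    }
    return $filter($comment);
}

// strtoupper stands in for the real filter, just to show when it is called.
echo filterIfNeeded('Great article, thanks!', 'strtoupper'), "\n";
echo filterIfNeeded('<a href=mysite.com>mysite</a>', 'strtoupper'), "\n";
```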

Conclusion

I may have been a bit technical in this tutorial. Hopefully, by going through the steps in writing the "filterHref" function, you are able to customize the function easily to suit your own needs. Search engines are evolving and so are link spammers. By making good use of the "nofollow" attribute, we can stop spammers from damaging the integrity of our site.

About The Author
Bernard Peh is a contract Web Developer based in Melbourne. He works with experienced web designers and developers every day, designing and developing commercial websites.