NoSpamEmailHyperlink: 1. Design

Paul Riley

22 Oct 2003

Fighting back against the e-mail harvesters.

Introduction

This is the first in a series of six articles, following the design, development and practical use of a fully functional ASP.NET custom control.

The full list of articles is as follows:

NoSpamEmailHyperlink: 1. Specification
NoSpamEmailHyperlink: 2. Properties & Rendering
NoSpamEmailHyperlink: 3. Email Encoding and Decoding
NoSpamEmailHyperlink: 4. Design-Time Functionality
NoSpamEmailHyperlink: 5. Implementation
NoSpamEmailHyperlink: 6. Customization

These articles are not intended to be a comprehensive look at custom control development (there are 700+ page books that barely cover it), but they do cover a significant number of fundamentals, some of which are poorly documented elsewhere.

The intent is to do so in the context of a single fully reusable and customizable control (as opposed to many contrived examples) with some awareness that few people will want many parts of the overall article but many people will want few parts of it.

This part of the article simply defines the purpose of the NoSpamEmailHyperlink. It is intended for those who want to use the control "as seen", already know how to include a custom control in a web page and either know or do not want to know how to write one themselves.

It also serves as a general introduction to the other articles, avoiding unnecessary repetition.

Designing the NoSpamEmailHyperlink

If you want your web site recognized by search engines, you need to get links onto other recognized sites so that their Internet spider software can find you.

Unfortunately for your registered users, as soon as the search engine spiders can find your site, you open yourself up to the "email harvesters". These are programs similar to the spiders used by search engines, except that they hunt the Internet for email addresses and collate them for use by email spammers.

If you have any email addresses in the generated source of your site, especially (but not necessarily) in href=mailto: type links, there is a good chance that those addresses will later start receiving spam emails from any number of servers.

Note that spiders do not read your source code directly, so building the email address at the server does not protect your users. A spider works by sending HTTP requests and scanning the responses. The server processes the page before returning it, and any email addresses will be part of that response, even if they are held in a database. Remember: a spider sees what a browser sees, and that cannot be avoided.

The situation is only getting worse. Most of the spam servers are insecure and unattended, leaving them vulnerable to email viruses such as the recent SoBig.F, which can also harvest email addresses from address books as it propagates. SoBig.F seems to have used those addresses just to spoof its own headers but, in theory, those addresses can be returned to a server and again sold on to the email spammers. It may have happened already; it will certainly happen soon.

By not hiding the email addresses of your users, you are not only inviting the spammers to attack them, you are also putting them at risk from email worms. If their PCs are not secured, everyone in their address books could be the next victims of your lack of conscience.

Email spam and email harvesting are escalating problems and there is no simple way of dissuading the spammers. The problem is that mass-mailing makes money and anyone can do it. The only way to slow it down is to cut the profit margin.

The more successful your site is, the more people you will have registered and the more interested email spammers will be. Not displaying email addresses, or building them at run time in the browser, will protect your users, but the email harvesters will just move on to the next site where the developer is not so conscientious.

What if we could encode the email addresses while keeping them valid? This would cause the email harvesters to pick up faked email addresses, which is very harmful to their business. Only the more expensive email harvesters will verify an email address (e.g. through an LDAP server) as well as validate it.

No one is going to pay an email spammer vast amounts of money to advertise their site to no one. If we make the information the spammers gather less useful, it becomes a lot less valuable. Only that will dissuade the email harvesters from using our web sites to gather their target list of addresses.

The JavaScript Weapon

Fortunately for web developers, there is not yet a single email harvester on the market that claims to apply JavaScript to HTML to discover hidden email addresses. When someone does get around to writing one, it will almost certainly be excessively expensive and thus will reduce the profit margins of the email spammer using it.

Simple scripts such as the following have proved very successful at hiding email addresses from the harvesters:

JavaScript

<script language="JavaScript"><!--
var name = "pdriley";
var domain = "santt.com";
document.write('<a href=\"mailto:' + name + '@' + domain + '\">');
document.write(name + '@' + domain + '</a>');
// --></script>

As mentioned before, there is no point in email harvesters translating this to find the address; they simply move on to the next site. In other words, it is effective in protecting the members of the site that uses it, but is not effective in the overall fight against email harvesting.

That said, with a bit of imagination, there should be no reason we cannot use JavaScript to turn a seemingly valid email address into a very real one when the page loads, using a set of rules defined at the server. If you can process an email address at the server and then reverse the process at the client, keeping it valid at all times, you should be able to fool the harvesters quite effectively.

Valid Email Addresses

Email addresses must be of the format user@domain.

user may contain any combination of alphanumeric characters, periods, underscores, dashes and (rarely) spaces, according to the W3C. It must, however, start and end with an alphanumeric character, and consecutive periods are not allowed (so a..b@c.com is invalid).

domain follows the same rules but must contain at least one period, and everything after that period must be in a given set of acceptable extensions (e.g. .com, .net, .co.uk).

To obfuscate an email address and still keep it valid, you can exchange any alphanumeric character before the first period in domain for any other alphanumeric character.

As long as you retain the positioning of punctuation (non-alphanumeric) characters and the format of the domain extension, you can ensure that the address does not become invalid.
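As a rough illustration of the rule above, the following sketch lists the character positions that may safely be exchanged. The function name encodablePositions is hypothetical and not part of the control; it assumes the address contains a single @ and treats only A-Z, a-z and 0-9 as alphanumeric.

```javascript
// Hypothetical helper: returns the indexes of characters that can be swapped
// for other alphanumerics without changing the shape of the address.
// Everything from the first period of the domain onwards is left alone,
// as is all punctuation.
function encodablePositions(address) {
  const at = address.indexOf("@");
  if (at < 0) return []; // not in user@domain form
  const firstDot = address.indexOf(".", at);
  const end = firstDot < 0 ? address.length : firstDot;
  const positions = [];
  for (let i = 0; i < end; i++) {
    if (/[A-Za-z0-9]/.test(address[i])) positions.push(i);
  }
  return positions;
}
```

For pdriley@santt.com this yields every position in pdriley and santt, but not the @ or anything in .com.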

Note: None of this will fool the few expensive email harvesting packages that implement verification, but even then the address will fail to verify and will be ignored. It is highly unlikely it will ever be decoded.

Netscape: Early versions of Netscape (4.x and earlier) will not allow you to update the innerHTML property of a link object. It is therefore necessary to "hide" email addresses in the text of the control for those versions, rather than encoding them.

The Encoding Algorithm

In its simplest form, you could encode an email address by incrementing each alphanumeric character code, wrapping at the ends of the alphabet (e.g. pdriley@santt.com becomes qesjmfz@tbouu.com). This is a simple form of "substitution encryption" but that is not particularly interesting. It is far too easy to decode if everyone is using it and asks little of the power of .NET.
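The simple increment scheme can be sketched as follows. This is only an illustration; the names bump and incrementEncode are hypothetical, and the sketch assumes that lower-case letters, upper-case letters and digits each wrap within their own range and that everything from the first period of the domain onwards is left untouched.

```javascript
// Move one character forward within its own range, wrapping at the end
// (z -> a, Z -> A, 9 -> 0).
function bump(ch, first, size) {
  const base = first.charCodeAt(0);
  return String.fromCharCode(base + ((ch.charCodeAt(0) - base + 1) % size));
}

// Hypothetical encoder: increments every alphanumeric character that comes
// before the first period of the domain and copies everything else as-is.
function incrementEncode(address) {
  const at = address.indexOf("@");
  const firstDot = address.indexOf(".", at);
  const end = firstDot < 0 ? address.length : firstDot;
  let out = "";
  for (let i = 0; i < address.length; i++) {
    const ch = address[i];
    if (i >= end) out += ch;
    else if (ch >= "a" && ch <= "z") out += bump(ch, "a", 26);
    else if (ch >= "A" && ch <= "Z") out += bump(ch, "A", 26);
    else if (ch >= "0" && ch <= "9") out += bump(ch, "0", 10);
    else out += ch; // punctuation keeps its position
  }
  return out;
}

console.log(incrementEncode("pdriley@santt.com")); // qesjmfz@tbouu.com
```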

Why not use a pseudo-random alphanumeric-generating sequence, including all encodable characters exactly once? For example, given the following base code key:

yJzdeB4CcDnmEFbZtvuHlI1hA8SiLo9MwfN3O6Y5QaRqKTjUpxVk2WgXrP7Gs0

pdriley@santt.com (shifting each character right by one) would become xePLIBJ@0Rmvv.com.
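A minimal sketch of this key-based shift is shown below. The function name keyShiftEncode and its step parameter are illustrative assumptions rather than the control's own API; the key is the one quoted above.

```javascript
const KEY = "yJzdeB4CcDnmEFbZtvuHlI1hA8SiLo9MwfN3O6Y5QaRqKTjUpxVk2WgXrP7Gs0";

// Hypothetical encoder: replaces each encodable character with the one
// `step` places to its right in the key, wrapping at the end of the key.
// A negative step moves left, which reverses the encoding.
function keyShiftEncode(address, step = 1, key = KEY) {
  const n = key.length;
  const at = address.indexOf("@");
  const firstDot = address.indexOf(".", at);
  const end = firstDot < 0 ? address.length : firstDot;
  let out = "";
  for (let i = 0; i < address.length; i++) {
    // Characters outside the key (@, periods, dashes) and the domain
    // extension are copied unchanged.
    const pos = i < end ? key.indexOf(address[i]) : -1;
    out += pos < 0 ? address[i] : key[(((pos + step) % n) + n) % n];
  }
  return out;
}

console.log(keyShiftEncode("pdriley@santt.com"));     // xePLIBJ@0Rmvv.com
console.log(keyShiftEncode("xePLIBJ@0Rmvv.com", -1)); // pdriley@santt.com
```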

This is the basis of so many codes used throughout history. The encoder and decoder hold a copy of the same character sequence. One applies a set of rules to the text he is sending, the other reverses the rules to decode the message.

Unfortunately, in practical terms, this cannot be as secure as historical coding algorithms because the decoder does not already have a copy of the decoding sequence or the rules it needs to apply to decode it. We effectively have to attach the decoding key to the coded text, along with those rules.

A third party can use this information to decode our message, but they need to look for the information first and then understand the rules we are using to decode it. Looking for the information is unlikely enough, given that we have kindly handed them valid but pseudo-randomly generated email addresses but, as far as any commercial email harvester currently available is concerned, the rules may as well be written in ancient Aramaic. Fortunately, the intended recipient of our encoded message (the browser) understands ancient Aramaic (JavaScript) very well and can follow the rules we are feeding it very effectively.

Once the intended recipient knows what the rules are for decoding the data provided, there should be no issues with reading whatever the encoder writes. But we should complicate the rules slightly, to further confuse any third party who might go looking.

Instead of always stepping one space up the pseudo-alphabet, we could vary the distance and direction that we move from each character to get the next.

e.g.

Character 0: Step up 6

Character 1: Step down 5

Character 2: Step up 4

Character 3: Step down 3

Character 4: Step up 2

etc...

We could also start with a different seed (6 in the above example) for each address on the page, so that an address appearing twice does not necessarily have to look the same in both cases.

We can even easily alternate the rate of change in the above sequence, just by using the indexer with which we are moving through our email address.

e.g.

Start with seed 23

Character 0: Step down 23

seed = seed + 0 = 23

Character 1: Step up 23

seed = seed - 1 = 22

Character 2: Step down 22

seed = seed + 2 = 24

Character 3: Step up 24

seed = seed - 3 = 21

Character 4: Step down 21

etc...

Now pdriley@santt.com with a seed of 23 will be encoded to 8SNk0oR@Ah60K.com. Even given the base code key and the seed, this takes a few minutes to decode by hand. If the third party is missing either one of those pieces of information (or any understanding of the rules we are applying), they have very little hope.
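The seeded scheme can be sketched as follows. This is not the control's source: the names transform, encode and decode are hypothetical, and the sign conventions are assumptions inferred from the sample address above. The sketch assumes that the first encodable character steps down the key when encoding, that the direction flips after each encoded character, that the seed is adjusted by the character's position in the string with an alternating sign, and that characters such as @ are copied without flipping anything.

```javascript
const KEY = "yJzdeB4CcDnmEFbZtvuHlI1hA8SiLo9MwfN3O6Y5QaRqKTjUpxVk2WgXrP7Gs0";

// Shared walk over the address. firstSign is -1 to encode and +1 to decode;
// the seed evolves identically in both directions, so decoding simply
// retraces the same steps the other way.
function transform(address, seed, firstSign, key) {
  const n = key.length;
  const at = address.indexOf("@");
  const firstDot = address.indexOf(".", at);
  const end = firstDot < 0 ? address.length : firstDot;
  let sign = firstSign; // direction of the next step along the key
  let nudge = 1;        // sign applied to the position when updating the seed
  let out = "";
  for (let i = 0; i < address.length; i++) {
    const pos = i < end ? key.indexOf(address[i]) : -1;
    if (pos < 0) { out += address[i]; continue; } // punctuation and extension
    out += key[(((pos + sign * seed) % n) + n) % n];
    seed += nudge * i; // 23, 22, 24, 21, 25, ... for a starting seed of 23
    sign = -sign;
    nudge = -nudge;
  }
  return out;
}

function encode(address, seed, key = KEY) { return transform(address, seed, -1, key); }
function decode(address, seed, key = KEY) { return transform(address, seed, 1, key); }

console.log(encode("pdriley@santt.com", 23)); // 8SNk0oR@Ah60K.com
console.log(decode("8SNk0oR@Ah60K.com", 23)); // pdriley@santt.com
```

Because the decoder needs only the key, the seed and these few lines of logic, the same routine can be shipped to the browser as script while remaining opaque to a harvester that does not execute it.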

If you do not believe it, try this encoded email address with the above base coding string and no seed: TWL0rxwrm@0pPAaZh40ME.com. Clue: I expect eventually someone will get it using educated guesswork and verification.

This is the algorithm incorporated by the NoSpamEmailHyperlink.

Visible Addresses

In addition to scanning hyperlink href attributes, most email harvesters will pick up an address from the general flow of text. If a harvester sees anything in the format user@domain, it will pick it up.

Say, for example, your Email Hyperlink should read as follows:

HTML

<a href="mailto:pdriley@santt.com">
    Paul Riley (pdriley@santt.com)
</a>

It serves no purpose to encode one occurrence of the email address and not encode the other. Our anti-spam control should be versatile enough to hide both addresses in the body of the HTML and adjust them when the page loads.

The NoSpamEmailHyperlink is capable of detecting where an email address appears in the visible text and replacing it, though this functionality is optional.
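As a rough sketch of the idea, the following hypothetical helper encodes the address in the href and also replaces every exact occurrence of it in the visible text. The name buildEncodedLink is an assumption, the encoder is passed in as a function so that any scheme can be plugged in, and the text is assumed to be HTML-safe already.

```javascript
// Hypothetical helper: builds the markup for a link whose href and visible
// text both carry the encoded address, so no copy of the real one is left
// in the page source. Matching in the text is exact and case-sensitive.
function buildEncodedLink(address, text, encode) {
  const encoded = encode(address);
  const visible = text.split(address).join(encoded);
  return '<a href="mailto:' + encoded + '">' + visible + "</a>";
}
```

A client-side script would then reverse the encoding in both places when the page loads.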

Customization

If the NoSpamEmailHyperlink somehow becomes so popular that it starts causing serious damage to the email spammer's business, it would unfortunately be quite easy to start detecting the controls and either ignore them or even decode them.

It is important to have some variation in the workings of the control as it appears on different sites.

To avoid having many different versions of the same control, each with duplicated code that cannot be maintained if a change becomes necessary, it is imperative that developers can inherit the common functionality of the control and customize many of the key aspects easily.

The NoSpamEmailHyperlink allows inheritors to override the code key and many of the names used for key variables in the JavaScript, to avoid simple detection techniques. It also allows inheritors to override the Encode / Decode functionality and create a whole new control without having to recode the basic functionality of an email hyperlink.

Professional Appearance

The fact that the NoSpamEmailHyperlink costs nothing is no reason to skip proper design-time functionality. Indeed, implementing good design-time features gives us a good chance to look at some of the least documented but most powerful features of the .NET Framework for control design.

The NoSpamEmailHyperlink uses a variety of custom classes and attributes to ensure that it works properly from the Visual Studio .NET toolbox and always appears in any WYSIWYG page designer, even when "empty" or databound into a DataList or DataGrid.

Conclusion

Simply drop a NoSpamEmailHyperlink into a web page and see your users' email addresses hidden from the Internet spiders but not from other users.

Download and view the code provided to understand the inner workings of the control, or to help you follow the later articles and develop your own custom controls.

To customize the control, use the DLL provided and derive from the NoSpamEmailHyperlink to inherit much of the common functionality (including the design-time appearance) and allow for easy patching in the future.

The DLL must be accessible to your web application, so must either be installed in the /bin folder of the application or in a versioned subfolder of [Windows]\Microsoft.NET\Framework.

If your site is hosted and you do not have control over the server, recommend to your web space provider that they install the DLL in such a common folder so that you and other users can access it directly. This is much more efficient for them than having multiple users install the DLL in their applications, and it extends the war against email harvesting to people who do not generally visit CodeProject.

Revision History

1.0 12-Oct-2003 - Created.
1.1 23-Oct-2003 - Added note about Netscape to section "Visible Addresses".

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Written By

Paul Riley

Web Developer

United Kingdom

Paul lives in ~~the heart of En~~ a backwater village in the middle of England. Since writing his first Hello World on an Oric 1 in 1980, Paul has become a programming addict, got married and lost most of his hair (these events may or may not be related in any number of ways).

Since writing the above, Paul got divorced and moved to London. His hair never grew back.

Paul's ambition in life is to be the scary old guy whose house kids dare not approach except at halloween.
