> In particular, I have a form with a textarea that I wish to allow *some*
> HTML tags (strong,em,a href,img,etc.) - such as used in many blogging apps.
> I'm searching for a methodology to check input for possible maliciousness.

Don't use the "check for known badnesses; if none, let the message go through
unaltered" approach.

As well as having to know all the ways scripting can be inserted into a page,
you would have to know the parsing bugs of all major browsers that could result
in malformed input - which your checking couldn't detect - being interpreted
as malicious code.

(For example: disallowing "<script" fails to take into account that IE will
happily parse "<[ASCII 0 character]script" as a script tag.)
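
To make that concrete, here is a minimal Python sketch of the sort of check
I mean. The function name naive_is_safe is made up for illustration and is not
taken from any real filter; the payload is just the NUL-character trick
described above.

```python
def naive_is_safe(message):
    """Hypothetical blacklist check: reject anything containing '<script'."""
    return "<script" not in message.lower()


# The obvious attack is caught...
plain_attack = "<script>alert(1)</script>"

# ...but a NUL character between '<' and 'script' slips straight past,
# and a browser that ignores the NUL would still see a script tag.
nul_attack = "<\x00script>alert(1)</script>"

print(naive_is_safe(plain_attack))  # False: blocked
print(naive_is_safe(nul_attack))    # True: let through unaltered
```

Adding "\x00" to the blacklist fixes this one case only; the next parsing
quirk needs another patch, which is the whole problem with the approach.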

I've been collecting examples of JavaScript injection techniques which I can
post if anyone's interested. But the point is that these sorts of bugs have
been found again and again in every application that allows straight-through
user markup, including every webmail provider and bulletin board system. You
just can't make this approach secure without infinite debugging time.

It's better to parse the input yourself and then send it back out in a process
that you know cannot generate malicious or malformed code.

> Any suggestions? UBB type markup? Regex? Other?

Easiest is probably to parse the input using a standard XML parser, then
remove all but a small number of allowed element and attribute names, then
use a standard XML serialiser to write out the results; this will ensure that
stray quotes, ampersands, control characters etc. will get escaped correctly
rather than causing potential security problems.
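
As a rough sketch of that parse / filter / serialise pipeline, here is how it
might look in Python with the standard xml.etree.ElementTree module. The
ALLOWED table, the wrapping <div> and the function names are my own
assumptions for the example, and I've chosen to drop disallowed elements
together with their contents; you might prefer to keep their text instead.

```python
import xml.etree.ElementTree as ET

# Assumed allowlist: element name -> attribute names permitted on it.
ALLOWED = {
    "p": set(),
    "strong": set(),
    "em": set(),
    "a": {"href"},
    "img": {"src", "alt"},
}


def _copy_allowed(src, dst):
    """Copy text and allowlisted children of src into the fresh element dst."""
    dst.text = src.text
    last = None  # last child appended to dst, so we know where tail text goes
    for child in src:
        if child.tag in ALLOWED:
            kept = ET.SubElement(dst, child.tag)
            for name, value in child.attrib.items():
                if name in ALLOWED[child.tag]:
                    kept.set(name, value)
            _copy_allowed(child, kept)
            kept.tail = child.tail
            last = kept
        elif child.tail:
            # Element dropped with its contents; keep the text that followed it.
            if last is None:
                dst.text = (dst.text or "") + child.tail
            else:
                last.tail = (last.tail or "") + child.tail


def sanitize(fragment):
    """Parse a well-formed XHTML fragment and re-serialise only allowed markup.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring("<div>" + fragment + "</div>")
    clean = ET.Element("div")
    _copy_allowed(root, clean)
    return ET.tostring(clean, encoding="unicode")


print(sanitize('<strong onclick="x()">hi</strong><script>alert(1)</script> &amp; bye'))
# <div><strong>hi</strong> &amp; bye</div>
```

Note that the output is built from a new tree and written by the serialiser,
so nothing from the user's original bytes is echoed directly. The href and src
values that survive still need their own URI check.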

Of course this requires XHTML being used everywhere (though you could
conceivably use HTML Tidy as an input stage so users don't have to input
well-formed markup).

Alternatively, invent your own noddy markup language which you can parse and
then output with proper escaping, so user input such as '<' and '&' is
never echoed directly to the browser. Here you can start simple - plain text
with newline converted to new paragraph - then add just whatever features are
needed. UBB started like this, but got a bit carried away and added every
conceivable feature *and HTML markup as well*, which isn't brilliant for
security.
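
A minimal sketch of the "start simple" version, in Python: every line of input
becomes an escaped paragraph, plus one extra feature (*word* for emphasis) to
show how you'd add things later. The function name render_plain and the
asterisk syntax are assumptions for the example, not part of UBB or any other
existing markup.

```python
import html
import re

# Assumed extra feature: *text* becomes <em>text</em>.
_EMPHASIS = re.compile(r"\*([^*\n]+)\*")


def render_plain(text):
    """Turn plain text into <p> elements, escaping everything the user typed."""
    paragraphs = []
    for line in text.replace("\r\n", "\n").split("\n"):
        line = line.strip()
        if not line:
            continue  # blank lines just separate paragraphs
        # Escape first, so '<', '>', '&' and quotes can never reach the
        # browser raw; only then add markup that we generate ourselves.
        escaped = html.escape(line)
        escaped = _EMPHASIS.sub(r"<em>\1</em>", escaped)
        paragraphs.append("<p>" + escaped + "</p>")
    return "\n".join(paragraphs)


print(render_plain("a < b & c\n\n*hi* <script>"))
# <p>a &lt; b &amp; c</p>
# <p><em>hi</em> &lt;script&gt;</p>
```

The order matters: because escaping happens before any markup is added, the
only tags in the output are ones this code wrote.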

Other issues to look out for include URIs -

  only ever allow a few known-good URI types like http, https, ftp etc.;
  don't attempt to just detect and disallow the known-bad like
  javascript: as there are more than you think and many ways to
  obfuscate them;

- and Unicode -

  IE and old versions of Opera accept invalid (overlong) UTF-8
  sequences, so a user can send byte 0xC0 followed by 0xBC to get a
  '<' without triggering many naive filters. Ideally your server
  environment should be using Unicode strings for everything internally
  (I don't know if that's the case with ASP), in which case it should
  reject the invalid sequence on input; otherwise you'd have to check
  for such sequences manually.
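
Both of those checks can be sketched in a few lines of Python. The names
ALLOWED_SCHEMES, is_allowed_uri and decode_strict are made up for the example,
and the scheme list is only the one from the text above. I've deliberately made
the URI check strict: anything without an explicit, allowlisted scheme at the
very start is refused, including relative links.

```python
import re

# Assumed allowlist of URI schemes; extend only with ones you know are safe.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

# Scheme syntax from the URI grammar: a letter, then letters, digits, '+', '-', '.'.
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")


def is_allowed_uri(uri):
    """Accept a URI only if it starts with an explicitly allowlisted scheme."""
    match = _SCHEME.match(uri)
    if match is None:
        return False  # no scheme, or junk (whitespace, entities) before the colon
    return match.group(1).lower() in ALLOWED_SCHEMES


def decode_strict(raw):
    """Decode request bytes as UTF-8, raising UnicodeDecodeError if invalid."""
    return raw.decode("utf-8", errors="strict")


print(is_allowed_uri("http://example.com/"))      # True
print(is_allowed_uri("javascript:alert(1)"))      # False
print(is_allowed_uri("java\tscript:alert(1)"))    # False

try:
    decode_strict(b"\xc0\xbcscript>")  # overlong encoding of '<'
except UnicodeDecodeError:
    print("rejected invalid UTF-8")
```

Because the URI check only ever asks "is this one of the known-good schemes?",
obfuscated variants of javascript: and friends fail by default rather than
needing to be listed.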

Allowing user input securely can be pretty hard. It doesn't help that 90% of
webapp example code doesn't even try to get it right.