A common problem on sites that allow user-submitted HTML is making sure the HTML is well formed, so that it won't break the page or mess up the formatting. From what I've seen, there isn't much help on the Internet beyond the suggestion to lock down almost all HTML tags. On blogs that isn't such a great idea, because people like to express themselves with more than plain text.
Here's a specific problem I have seen little written about: closing mismatched and unclosed tags. For example, a single stray <div> or </div> could totally wreck a page's layout. Another problem is removing nasty attributes from a tag, like onClick or onHover, where potential XSS exploits could surface.
So without further discussion, here's a nice function in PHP I have been working on that can clean the structure of an XHTML input and clear illegal tag attributes:
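function clean_tag_structure($input, $badAttributes = '') {
    $i = 0;          // number of tags currently open
    $output = '';
    $tagStack = $matches = array();
    // Per match: [1] text before the tag, [2] tag name, [3] attributes, [4] text after the tag.
    // The "s" modifier lets "." cross line breaks so multi-line input is handled.
    preg_match_all("/(.*?)<([^\s>]+)(.*?)>([^<]*)/s", $input, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
    if (count($matches) == 0) {
        return $input; // no tags at all, nothing to repair
    }
    for ($t = 0; $t < count($matches); $t++) {
        $match = $matches[$t];
        $output .= $match[1][0];
        $tag = strtolower(trim($match[2][0]));
        $attrs = $match[3][0];
        if ($tag[0] == '/') {
            // Closing tag: close whatever was opened last, regardless of the name written.
            // A closing tag with nothing open is dropped.
            if ($i != 0) {
                $output .= "</{$tagStack[$i]}>";
                $i--;
            }
        } else if (substr($tag, -1) == '/') {
            // Self-closing tag written without a space, e.g. <br/>.
            $output .= "<{$tag}>";
        } else {
            if ($badAttributes != '') {
                // Case-insensitive, so "onclick" also removes onClick.
                $attrs = preg_replace("/($badAttributes)=\"[^\"]*\"/i", '', $attrs);
                $attrs = preg_replace("/($badAttributes)='[^']*'/i", '', $attrs);
            }
            // <br /> and <img ... /> close themselves, so only real opening tags are pushed.
            if (substr(rtrim($attrs), -1) != '/') {
                $tagStack[++$i] = $tag;
            }
            $output .= "<{$tag} {$attrs}>";
        }
        $output .= $match[4][0];
    }
    // Close anything still open, innermost first.
    for ($j = $i; $j >= 1; $j--) {
        $output .= "</{$tagStack[$j]}>";
    }
    return $output;
}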
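To see what the attribute filter does and does not catch, here is a minimal sketch of that step on its own. The helper name strip_attributes is made up for this illustration and is not part of the function above. It uses the same two quoted-value patterns, made case-insensitive.

```php
<?php
// Hypothetical helper for illustration only: removes the listed attributes
// from one tag's attribute string, using the same two patterns as above.
function strip_attributes($attrs, $badAttributes) {
    // name="..." (double-quoted value), case-insensitive
    $attrs = preg_replace("/($badAttributes)=\"[^\"]*\"/i", '', $attrs);
    // name='...' (single-quoted value), case-insensitive
    $attrs = preg_replace("/($badAttributes)='[^']*'/i", '', $attrs);
    return $attrs;
}

// Quoted handler: removed, the align attribute is kept.
echo strip_attributes(" align=\"left\" onClick='deleteHardDrive();'", "onclick|onhover"), "\n";

// Unquoted handler: neither pattern matches, so it is left in place.
echo strip_attributes(" onclick=deleteHardDrive()", "onclick|onhover"), "\n";
```

The second call is the one to watch. Browsers accept unquoted attribute values, so input that skips the quotes slips past both patterns. A list of allowed attributes is probably a safer starting point than a list of forbidden ones.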
Here's a sample of clean_tag_structure() at work:
$input = "</b>This <em><strong>TEST</strong></em> </div> <b> is <em onHover=\"deleteIT()\"> a </b></em> test! <div align=\"left\" onClick='deleteHardDrive();'> <div align='left'> </b>";
echo "Before: $input \n";
echo "After: " . clean_tag_structure($input, "onclick|onhover");
Output:
Before: </b>This <em><strong>TEST</strong></em> </div> <b> is <em onHover="deleteIT()"> a </b></em> test! <div align="left" onClick='deleteHardDrive();'> <div align='left'> </b>
After: This <em ><strong >TEST</strong></em> <b > is <em > a </em></b> test! <div align="left" > <div align='left'> </div></div>
Cool!
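If you want to double-check a result like the one above, a small stack-based checker can confirm that every tag is closed in the right order. This is only a sketch. The name tags_are_balanced is made up for the example, and it assumes no attribute value contains a ">" character.

```php
<?php
// Hypothetical checker for illustration: returns true when every opening tag
// is closed by a matching closing tag in the correct (innermost-first) order.
function tags_are_balanced($html) {
    $stack = array();
    // [1] is "/" for a closing tag, [2] is the tag name, [3] is "/" for a self-closing tag.
    preg_match_all('/<(\/?)([a-z][a-z0-9]*)\b[^>]*?(\/?)>/i', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[3] === '/') {
            continue; // self-closing tags such as <br /> never go on the stack
        }
        if ($tag[1] === '/') {
            // A closing tag must match the most recently opened tag.
            if (array_pop($stack) !== $name) {
                return false;
            }
        } else {
            $stack[] = $name;
        }
    }
    // Anything left on the stack was never closed.
    return count($stack) === 0;
}

var_dump(tags_are_balanced("</b>This <em>is broken"));
var_dump(tags_are_balanced("This <em ><strong >TEST</strong></em> works"));
```

The first call fails on the stray </b> and the second passes. Running the checker on the "Before" and "After" strings from the sample gives the same split.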