SpawnPoint

My Blog

Cleaning Tag Structure

A problem on most sites that accept user-submitted HTML is verifying that the HTML is valid, meaning it won't break the page or mess up the formatting. Unfortunately, from what I've seen, there isn't much help on the Internet beyond the suggestion to lock down almost all HTML tags. On blogs that isn't such a great idea, because people like to express themselves with more than just plain content.

Well, here's a specific problem I have seen little written about: closing mismatched and unclosed tags. For example, if you leave in one stray <div> or </div>, you could totally wreck a page's layout. Another problem is removing nasty attributes from a tag, like onClick or onHover, where potential XSS exploits could surface.

So without further discussion, here's a nice function in PHP I have been working on that can clean the structure of an XHTML input and clear illegal tag attributes:
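function clean_tag_structure($input, $badAttributes = '') {
    $i = 0;        // index of the top of the open-tag stack (0 = empty)
    $output = '';
    $tagStack = $matches = array();
    // Groups: 1 = text before the tag, 2 = tag name, 3 = attributes, 4 = text after the tag
    preg_match_all("/(.*?)<([^\s>]+)(.*?)>([^<]*)/s", $input, $matches, PREG_SET_ORDER);

    // No tags at all: nothing to balance
    if (count($matches) == 0) {
        return $input;
    }

    foreach ($matches as $match) {
        $output .= $match[1];
        $tag = strtolower(trim($match[2]));
        $attributes = $match[3];
        if ($tag[0] == '/') {
            // Closing tag: close whatever is on top of the stack,
            // or drop the tag entirely if nothing is open
            if ($i != 0) {
                $output .= "</{$tagStack[$i]}>";
                $i--;
            }
        } else {
            if ($badAttributes != '') {
                // $badAttributes is a pipe-separated list such as "onclick|onhover";
                // strip double- and single-quoted forms, ignoring case
                $attributes = preg_replace("/($badAttributes)=\"[^\"]*\"/i", '', $attributes);
                $attributes = preg_replace("/($badAttributes)='[^']*'/i", '', $attributes);
            }
            $attributes = ltrim($attributes);
            if (substr($tag, -1) == '/' || substr(rtrim($attributes), -1) == '/') {
                // Self-closing tag (<br/> or <br />): emit it, but never push it on the stack
                $tag = rtrim($tag, '/');
                $attributes = rtrim(rtrim($attributes), '/');
                $output .= "<{$tag} {$attributes}/>";
            } else {
                $tagStack[++$i] = $tag;
                $output .= "<{$tag} {$attributes}>";
            }
        }
        $output .= $match[4];
    }

    // Close anything still open, innermost tag first
    for ($j = $i; $j >= 1; $j--) {
        $output .= "</{$tagStack[$j]}>";
    }

    return $output;
}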
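
The attribute filter is the part most worth trying on its own, since attribute names can show up in any mix of upper and lower case. Here is a minimal sketch of just that step. The helper name strip_bad_attributes is made up for illustration, it takes the same pipe-separated list as the function above, and the lookbehind and the unquoted-value case are extra assumptions of mine rather than something the function above does:

```php
<?php
// Hypothetical helper (illustration only): removes the listed attributes
// from the attribute portion of a single tag.
// $badAttributes is a pipe-separated list such as "onclick|onhover".
function strip_bad_attributes($attributes, $badAttributes) {
    if ($badAttributes == '') {
        return $attributes;
    }
    // (?<![\w-]) stops "onclick" from matching inside "data-onclick",
    // the "i" flag makes onClick, ONCLICK and onclick all match,
    // and the alternation covers double-quoted, single-quoted and unquoted values.
    $pattern = '/\s*(?<![\w-])(' . $badAttributes . ')\s*=\s*("[^"]*"|\'[^\']*\'|[^\s"\'>]+)/i';
    return preg_replace($pattern, '', $attributes);
}

// Prints ' align="left"' without the surrounding single quotes (the leading space is kept)
echo strip_bad_attributes(" align=\"left\" onClick='deleteHardDrive();'", "onclick|onhover");
```

Using a lookbehind instead of a plain word boundary is a judgment call: it leaves harmless names such as data-onclick alone while still catching the real handler.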

Here's a sample of it at work:

$input = "</b>This <em><strong>TEST</strong></em> </div> <b> is <em onHover=\"deleteIT()\"> a </b></em> test! <div align=\"left\" onClick='deleteHardDrive();'> <div align='left'> </b>";
echo "Before: $input \n";
echo "After: " . clean_tag_structure($input, "onclick|onhover");

Output:

Before: </b>This <em><strong>TEST</strong></em> </div> <b> is <em onHover="deleteIT()"> a </b></em> test! <div align="left" onClick='deleteHardDrive();'> <div align='left'> </b>
After: This <em ><strong >TEST</strong></em> <b > is <em > a </em></b> test! <div align="left" > <div align='left'> </div></div>


Cool!
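
If you want to double-check a result like the one above, a small balance checker does the job. This is only a sketch: tags_are_balanced is a name I made up for illustration, and it assumes attribute values never contain a > character. It walks the tags with a stack and fails as soon as a closing tag doesn't match the most recently opened one:

```php
<?php
// Hypothetical checker (illustration only): returns true when every closing tag
// matches the most recently opened tag and nothing is left open at the end.
function tags_are_balanced($html) {
    $stack = array();
    // Group 1 = "/" for closing tags, group 2 = tag name, group 3 = "/" for self-closing tags
    preg_match_all('/<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $t) {
        $name = strtolower($t[2]);
        if ($t[1] == '/') {
            // array_pop() returns null on an empty stack, which never equals a tag name
            if (array_pop($stack) !== $name) {
                return false;
            }
        } elseif ($t[3] != '/') {
            $stack[] = $name;
        }
    }
    return count($stack) == 0;
}

// The "Before" and "After" strings from the sample run, copied verbatim
$before = <<<'HTML'
</b>This <em><strong>TEST</strong></em> </div> <b> is <em onHover="deleteIT()"> a </b></em> test! <div align="left" onClick='deleteHardDrive();'> <div align='left'> </b>
HTML;
$after = <<<'HTML'
This <em ><strong >TEST</strong></em> <b > is <em > a </em></b> test! <div align="left" > <div align='left'> </div></div>
HTML;

var_dump(tags_are_balanced($before)); // bool(false)
var_dump(tags_are_balanced($after));  // bool(true)
```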

Comments »

Streetbum @ 2007-08-23 00:04:52
Simply put, you amaze me...

  • Wayetender

    Code & Development Team
