Security risks associated with utf8_decode and XSS filters

At BlackHat USA 2009, Eduardo Vela Nava (sirdarckcat) and David Lindsay presented a paper entitled “Our Favorite XSS Filters and How to Attack Them”. It is a very interesting paper and well worth reading.

In this paper, among other things, they presented a very interesting way to bypass XSS filters using Unicode characters.

XSS filters

Consider the following piece of code:

[Code listing: xss_utf8_decode (image not reproduced)]

This code uses the utf8_decode function to decode the input to single-byte characters. It then checks whether the decoded input contains dangerous characters and rejects the input if it does. Using utf8_decode this way is (or used to be) recommended as protection against obfuscated Unicode encoding.

Here is a quote from OWASP’s discussion page about “Testing_for_Cross_site_scripting”:

The following PHP functions help mitigate Cross-Site Scripting Vulnerabilities:

utf8_decode() converts UTF-8 encoding to single byte ASCII characters. Decoding Unicode input prior to filtering it can help you detect attacks that the attacker has obfuscated with Unicode encoding.

However, in this case, as Eduardo and David showed, utf8_decode is the problem and not the solution. You can bypass the filter with a query string like:

vuln.php?input=%F6%3Cimg+onmouseover=prompt(/xss/)//%F6%3E

I’ve edited the code to show the input before and after utf8_decode, to make it clear what’s going on. The output below comes from a payload of the same form:

input (before utf8_decode): ö<img acu onmouseover=prompt(400854747531)//ö>

decoded input (after utf8_decode): ?g acu onmouseover=prompt(400854747531)//?

The initial string contained two filtered characters, < (%3C) and > (%3E). However, because of the %F6 byte, utf8_decode swallows them: each %F6 and the characters that follow it (the < plus two more characters in the first case) are replaced with a single question mark. The filter only inspects the decoded string, so it is bypassed and the code is vulnerable to XSS (cross-site scripting).
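To make the mechanics concrete, here is a small Python sketch of a decoder that behaves the way the output above suggests. It is a toy model, not PHP's actual implementation: the names legacy_utf8_decode and is_rejected are made up for this illustration, and the key assumption is that the decoder trusts the lead byte for the sequence length and never checks that the bytes which follow are real continuation bytes.

```python
def legacy_utf8_decode(data: bytes) -> bytes:
    """Toy model of a lossy UTF-8 -> ISO-8859-1 decoder (hypothetical, not PHP's code).

    The lead byte alone decides how many bytes belong to a character;
    the following bytes are never validated as 10xxxxxx.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >= 0xF0:
            width, code = 4, lead & 0x07
        elif lead >= 0xE0:
            width, code = 3, lead & 0x0F
        elif lead >= 0xC0:
            width, code = 2, lead & 0x1F
        else:
            width, code = 1, lead
        tail = data[i + 1:i + width]
        if len(tail) < width - 1:
            code = 0x3F  # sequence runs past the end of the input: emit '?'
        else:
            for b in tail:
                code = (code << 6) | (b & 0x3F)
        # Anything that does not fit in a single byte becomes '?'
        out.append(code if code <= 0xFF else 0x3F)
        i += width
    return bytes(out)


def is_rejected(raw: bytes) -> bool:
    """Hypothetical filter: decode first, then look for < and >."""
    decoded = legacy_utf8_decode(raw)
    return b"<" in decoded or b">" in decoded


payload = b"\xf6<img acu onmouseover=prompt(400854747531)//\xf6>"
print(legacy_utf8_decode(payload))  # b'?g acu onmouseover=prompt(400854747531)//?'
print(is_rejected(payload))         # False: the filter never sees < or >
print(is_rejected(b"<b>"))          # True: plain ASCII markup is still caught
```

The raw payload still contains both angle brackets, so if the application echoes the original input after the check passes, the markup reaches the browser intact.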

utf8_decode and addslashes

However, this problem is not limited to XSS filters. A similar case appears when utf8_decode is applied to strings that have already been escaped (e.g. with addslashes()).

Some sample source code:

[Code listing: sql_injection_addslashes_utf8_decode (image not reproduced)]

This code uses addslashes (which is not a proper way to protect against SQL injection, but people still use it) together with utf8_decode. If you try to insert a single quote, addslashes will protect against SQL injection:

index.php?username=test%27&password=a

user: test'

pass: a

SQL query: SELECT * FROM users WHERE uname = 'test\'' and pass = 'a'

I’ve updated the code to show the inputs and the SQL query. However, this code can be exploited using a query string like:

index.php?username=test%FC%27%27+or+1=1+--+&password=a

This will generate the following output:

user: test?' or 1=1 --

pass: a

SQL query: SELECT * FROM users WHERE uname = 'test?' or 1=1 -- ' and pass = 'a'

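Again, utf8_decode replaced %FC and the characters after it with a single question mark. Here the swallowed characters are the backslash, the first quote and the second backslash added by addslashes, so the second quote is left unescaped and the code is vulnerable to SQL injection. Note that the PHP directive magic_quotes_gpc is on by default, and it essentially runs addslashes() on all GET, POST, and COOKIE data, so the same issue can arise even without an explicit addslashes() call.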
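The same sequence of steps can be replayed in a few lines of Python. This is only a sketch under stated assumptions: addslashes here is a re-implementation of the escaping PHP's function performs (quote, double quote, backslash and NUL), and legacy_utf8_decode is a toy model of a decoder that trusts the lead byte and never validates the bytes after it. Neither is the code from the listing above.

```python
def addslashes(data: bytes) -> bytes:
    """Python stand-in for PHP's addslashes(): escape ' " \\ and NUL."""
    out = bytearray()
    for b in data:
        if b == 0:
            out += b"\\0"
        else:
            if b in (0x27, 0x22, 0x5C):  # ' " \
                out.append(0x5C)
            out.append(b)
    return bytes(out)


def legacy_utf8_decode(data: bytes) -> bytes:
    """Toy lossy decoder (hypothetical): length comes from the lead byte only."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >= 0xF0:
            width, code = 4, lead & 0x07
        elif lead >= 0xE0:
            width, code = 3, lead & 0x0F
        elif lead >= 0xC0:
            width, code = 2, lead & 0x1F
        else:
            width, code = 1, lead
        tail = data[i + 1:i + width]
        if len(tail) < width - 1:
            code = 0x3F  # truncated sequence: emit '?'
        else:
            for b in tail:
                code = (code << 6) | (b & 0x3F)
        out.append(code if code <= 0xFF else 0x3F)
        i += width
    return bytes(out)


username = b"test\xfc'' or 1=1 -- "    # username=test%FC%27%27+or+1=1+--+
escaped = addslashes(username)         # b"test\xfc\\'\\' or 1=1 -- "
decoded = legacy_utf8_decode(escaped)  # 0xFC eats the \ ' \ that follow it
query = "SELECT * FROM users WHERE uname = '%s' and pass = '%s'" % (
    decoded.decode("latin-1"), "a")

print(decoded)  # b"test?' or 1=1 -- "
print(query)    # SELECT * FROM users WHERE uname = 'test?' or 1=1 -- ' and pass = 'a'
```

The escaping was correct when it was applied; it is the decoding step afterwards that removes one of the backslashes.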

While looking into this problem, I’ve found a very useful comment on the PHP page for the utf8_decode function:

Warning!
This function contains a possible security risk when you try to convert escaped strings (see addslashes() and related functions).
It reacts nasty on broken multibyte sequences. In UTF-8, follow-up bytes ALWAYS have the binary pattern 10xxxxxx, but this fact is not handled by utf8_decode in the way you would expect: If you pass a start byte (110xxxxx, 1110xxxx, 11110xxx – or even invalid sequences like 11111100), followed by one or more non-multibyte chars (0xxxxxxx), the start sequence “char” will be replaced by ‘?’ (0x3F) and up to three following chars will disappear even if they are single-byte-chars (0xxxxxxx). So if you escape a string with a typical escape char like backslash, you would expect that your escaping would always survive a call to utf8decode because the escape char is in the assumed safe ascii range 0-127, but that is NOT the case!
Try things like utf8_decode(“test: ü”123456″) to check it out.
To avoid problems take care that string-escaping always is the last step of data manipulation when you depend on leak-proof escaping.

This comment explains very well what’s going on. We’ve also updated Acunetix WVS to test for this kind of vulnerability in the latest build (build 20090813).
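The advice from that comment, escaping as the very last step, could be sketched like this in Python. The function name escape_last and the ESCAPES table are invented for this example, and rejecting broken UTF-8 outright (rather than replacing it) is my own assumption about what a careful application would want; the comment itself only asks for the ordering.

```python
# Characters escaped by this sketch: ' " \ and NUL (same set as addslashes)
ESCAPES = str.maketrans({"'": "\\'", '"': '\\"', "\\": "\\\\", "\0": "\\0"})


def escape_last(raw: bytes) -> str:
    """Decode strictly first, escape afterwards (hypothetical helper)."""
    # Strict decoding raises UnicodeDecodeError on broken multibyte
    # sequences instead of silently swallowing the bytes that follow.
    text = raw.decode("utf-8")
    # Escaping is the last manipulation, so nothing can undo it later.
    return text.translate(ESCAPES)


print(escape_last("tüst'".encode("utf-8")))  # tüst\'

try:
    escape_last(b"test\xfc'' or 1=1 -- ")
except UnicodeDecodeError:
    print("rejected: invalid UTF-8")  # the %FC payload never reaches the query
```

Escaping by hand is shown here only because the article is about addslashes; it does not change the point made earlier that addslashes is not a proper defence against SQL injection.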

