Home Misc Index
  Special Character Handling
 P van Diemen

Special characters in this context are characters that have a special meaning in the formal language they are used in.  Usually it has a special meaning in the syntactical sense.  Typically they include the ‘escape’ codes in that language.

It is not complex, but you should be aware of the intricacies when using HTML-forms to enter data in an SQL database.

HTML handling

For normal HTML texts, the HTML special codes (', ", <, >, &) are usually no problem, though it is wise the encode the characters <, > and & with their corresponding escaped values &lt;, &gt; and &amp; (their so called Numerical Character Reference, NCR).  HTML is rather forgiving for (syntactical) errors (it doesn’t display error messages, but tries to make the best of it).  However, within HTML tags you must escape all the special characters in attribute data, in particular the quotes as they may otherwise be misinterpreted for syntax.

This occurs in particular for text <input> in HTML-forms, e.g.

which is an HTML syntax error.  HTML sees 'It' as string for the value to be presented to the user, doesn’t recognise (and ignores) “s late'” and continues after the >.
When the value parameter would contain a greater than-sign, it would seriously ruin your form:
<INPUT TYPE=TEXT NAME=F1 VALUE='the >greater than-sign'>

When coding HTML by hand, you can often obtain a desirable result by using the appropriate quotes (single or double) to enclose the attribute value.  However, when the value is unknown beforehand (e.g. a value from user input or a database), you have a problem:  the value may include ' and/or " (and/or any of the other special characters) which obviously interferes with the syntax.
So you always have to escape such use of (database) values in a form.

When generating such forms in PHP you can use the htmlspecialchars-function (or htmlentities) for this:  it will encode the special characters (' " < > &) to their NCR-equivalent (&apos; &quot; &lt; &gt; and &amp; respectively).
The second parameter in htmlspecialchars reflects the translation flags;  exact translation depends on the ENT_HTML401/­ENT_HTML5 flag but that is hardly relevant.  Important is that a single quote is only translated when the ENT_QUOTES flag is set.
The third parameter is the character set (e.g. 'UTF-8'), and the fourth and last parameter in the htmlspecialchars- (and htmlentities-)function is for double encoding of HTML-entities;  it should be false for our purposes (otherwise an entity like &eacute; is shown literally and not like é, usually an unwanted effect).
Such a translation has no consequence for the display of these characters, or for editing them (HTML browsers treat an NCR sequence as a single character).

Consequently, the <input> should be coded as:

$val = "It's late";
echo "<INPUT TYPE=TEXT NAME=F1 VALUE='" . htmlspecialchars( $val, ENT_QUOTES, 'UTF-8', false ) . "'>";

Htmlspecialchars and htmlentities are equivalent except that htmlentities will convert all characters which have an HTML-entity equivalent (we don't need that and we have double encoding of HTML-entities false anyway).  For handling <input> values, htmlspecialchars conversion of  ', ", <, > and & is adequate.

Above is also valid for the <textarea>-tag;  it is more likely that such text contains single and double quotes, the angular brackets and the ampersand.  Make sure that you perform htmlspecialchars only once for each field, otherwise the consequences are the same as with double_encoding parameter true:  you get the HTML-entities visibly in the input field.  If the string contains a <br> and you want to preserve the layout, use str_replace( "<br>", "\n", $string ) before the encoding (you may not want to reverse the substitution).
On the other hand, if you display such a text as normal HTML text, you want the newline "\n" replaced by a "<br>".

If you want to display values containing HTML-tags (e.g. from a database) as text (show the tags and not let them be interpreted by the browser), you should use the htmlentities-function with double encoding true.

If you need to reverse the effect of htmlspecialchars or htmlentities, you can use the html_entity_decode (or htmlspecialchars_decode)-function.  The html_entity_decode and htmlspecialchars_decode functions are equivalent except that html_entity_decode will decode all HTML-entities (not juist &apos; &quot; &lt; &gt; and &amp; but also e.g. &eacute;).  It is usually desirable to convert all HTML-entities for a database (with potentially less desirable cases &nbsp; and &shy; as these are not visible anymore).
Apart from the string, these decode functions have one additional parameter:  the translation flags.  Usually, ENT_QUOTES | ENT_HTML5 will do.

Parameter Transfer handling

The encoding of parameters through htmlspecialchars suggests that you have to decode these HTML-entities back to their character code equivalent before entering the value in a database (or you would pollute the database with NCR sequences). 
But, surprise, the parameters through the GET- and POST-mechanisms have already undergone a htmlspecialchars_decode.

In old PHP versions, the parameters were also processed by addslashes() but that is no longer applicable.

If you pass parameters through the hyperlink mechanism, e.g. echo "<a href='newpage.php?par=$str'>click here</a>";, or the GET-mechanism, and the parameter string $str may contain 'funny' characters (like a hyperlink), you have to use the urlencode-function.  It will encode all non-alphanumeric characters except hyphen -, underscore _ and period . as %«hexbyte».
And here as well, you don't have to use urldecode as parameters through the GET-mechanisms are already decoded.

However, when you have such a string as <input…> parameter in a <form>, you don't have the possibility to urlencode that string.  So you can't use the GET-mechanism and have to use the POST-mechanism.
The POST-mechanism is strongly preferred anyway as it doesn't provide users –and potential hackers– any clue to what parameters and what values are transferred, and avoids simple meddling with them to attempt an SQL-inserts.

Special case

There is a special case when handling booleans like from a HTML <input type=checkbox …> in SQL.  HTML passes a boolean value as no value (or empty value) for false, or the string 'on' for true;  PHP accepts (numeric) 0, false, Null and empty string as false, and anything else as true.
SQL doesn't have a boolean data type (it accepts boolean literals false as 0 and true as 1, and it has the data type BIT accepting values like 0b01).  Often, a TINYINT is used to store a boolean in SQL, but any assignment to that type needs a numeric value (not an empty value, or on).

There is a simple solution for this:  use double type juggling in PHP like   $truth = (int) (bool) @$_POST['truth'];.

SQL Query handling

When handling SQL from a scripting language there is a basic problem in creating the query:  SQL requires its commands as an ASCII string, but that query may contain parameters having quoted strings as values.  E.g.

$name = "O'Neill";
$query = "UPDATE tbl3 SET name='$name' WHERE id=12345";
which is interpreted by SQL as:
UPDATE tbl3 SET name='O'Neill' WHERE id=12345
and which leads to an SQL syntax error.

Again, as single quotes ' are more common than double quotes ", one may occasionally circumvent the above problem by using double quotes as string separator in the query.  But if a parameter value may (also) contain double quotes, it proves not to be a solution.
One may encode the quotes as &apos; or &#39 and &quot; or &#34; respectively (i.e. the htmlspecialchars()-treatment as used in HTML):  in HTML it will look the same, but it is not really the same in the database.  So not a good solution either.

SQL allows an escape code in strings through the backslash \:  the next character will be taken literally in the string.  This is valid for strings enclosed by single quotes or by double quotes.  And the same escape method applies to PHP !

The point is now to add the backslash just before each occurrence of a quote in the string.  And before any occurrence of the backslash itself as it is the escape character (though it is a rare character). 
You should use the DBMS-specific escape function (e.g. mysqli_real_escape_string( $db_link, $string ) which takes care of NUL (ASCII 0), \n, \r, \, ', ", and CTRL+Z. ).  PHP also offers a simple function addslashes() which does nearly the same (NUL, \, ' and ") and will commonly work, but mysqli_real_escape_string() is recommended (against SQL-inserts).

Problem solved.  And it may prevent simple SQL-inserts at the same time.
But here as well, make sure you perform mysqli_real_escape_string() only once, otherwise you get backslashes in your database.  If for some reason you need to have the orignal form back, use stripslashes().

NB:  There is also the function addcslashes (and stripcslashes) which has an extra parameter for the characters to quote.  That parameter may even include ranges of character, which is more powerful than we need.  In our case you would use '\'"\\'.