- added information about Unicode to coding guidelines

- two little changes

git-svn-id: file:///svn/phpbb/trunk@8035 89ea8834-ac86-4346-8a33-228a782c2dd0
author: Nils Adermann <naderman@naderman.de> 2007-08-16 12:19:26 +0000
committer: Nils Adermann <naderman@naderman.de> 2007-08-16 12:19:26 +0000
commit: 487ca9229997f0bd1c5ed228cf7dc3a033fce329 (patch)
tree: 91c74e3ffb1e41695812718e34bbe896d56ac563 /phpBB/docs
parent: c9dcf849b9d196f27131c21b79ebf1793f3c1cda (diff)
1 files changed, 88 insertions, 7 deletions
diff --git a/phpBB/docs/coding-guidelines.html b/phpBB/docs/coding-guidelines.html
index 14deabf135..d7d40d926e 100644
--- a/phpBB/docs/coding-guidelines.html
+++ b/phpBB/docs/coding-guidelines.html
@@ -3,7 +3,7 @@
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
 <title>Coding Guidelines</title>
-<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <meta http-equiv="Content-Style-Type" content="text/css" />
 <meta name="resource-type" content="document" />
 <meta name="description" lang="en" content="Olympus coding guidelines document" />
@@ -215,6 +215,7 @@ p a {
 	</li>
 	<li><a href="#styling">Styling</a></li>
 	<li><a href="#templating">Templating</a></li>
+	<li><a href="#charsets">Character Sets and Encodings</a></li>
 	<li><a href="#translation">Translation (<abbr title="Internationalisation">i18n</abbr>/<abbr title="Localisation">L10n</abbr>) Guidelines</a>
 	<ol type="i">
 		<li><a href="#standardisation">Standardisation</a></li>
@@ -1558,9 +1559,83 @@ div
 
 <hr />
 
-<a name="translation"></a><h1>5. Translation (<abbr title="Internationalisation">i18n</abbr>/<abbr title="Localisation">L10n</abbr>) Guidelines</h1>
+<a name="charsets"></a><h1>5. Character Sets and Encodings</h1>
 
-	<a name="standardisation"></a><b>5.i. Standardisation</b>
+<div class="paragraph">
+
+<h3>What are Unicode, UCS and UTF-8?</h3>
+<p>The <a href="http://en.wikipedia.org/wiki/Universal_Character_Set">Universal Character Set (UCS)</a>, described in ISO/IEC 10646, consists of a large number of characters. Each of them has a unique name and a code point, which is an integer. <a href="http://en.wikipedia.org/wiki/Unicode">Unicode</a>, which is an industry standard, complements the Universal Character Set with further information about the characters' properties and alternative character encodings. More information on Unicode can be found on the <a href="http://www.unicode.org/">Unicode Consortium's website</a>. One of the Unicode encodings is the <a href="http://en.wikipedia.org/wiki/UTF-8">8-bit Unicode Transformation Format (UTF-8)</a>. It encodes each character with up to four bytes, aiming for maximum compatibility with the <a href="http://en.wikipedia.org/wiki/ASCII">American Standard Code for Information Interchange</a> (ASCII), which is a 7-bit encoding of a relatively small subset of the UCS.</p>
+
+<h3>phpBB's use of Unicode</h3>
+<p>Unfortunately, PHP does not facilitate the use of Unicode prior to version 6. Most functions simply treat strings as sequences of bytes, assuming that each character takes up exactly one byte. This behaviour still allows UTF-8 encoded text to be stored in PHP strings, but many string operations have unexpected results. To circumvent this problem, we have created alternatives to PHP's native string functions which operate on code points instead of bytes. These functions can be found in <code>/includes/utf/utf_tools.php</code>. They are also covered in the <a href="http://area51.phpbb.com/docs/code/">phpBB3 Sourcecode Documentation</a>. Many native PHP functions still work with UTF-8 as long as you stick to certain restrictions. For example, <code>explode</code> still works as long as the first and the last character of the delimiter string are ASCII characters.</p>
+
+<p>phpBB only uses the ASCII and UTF-8 character encodings. Because ASCII is a subset of UTF-8, all strings are effectively UTF-8 encoded. The only exceptions to this rule are code sections which deal with external systems that use other encodings and character sets. Such external data should be converted to UTF-8 using the <code>utf8_recode()</code> function supplied with phpBB. It supports a variety of other character sets and encodings; a full list can be found below.</p>
+
+<p>With <code>request_var()</code> you can either allow all UCS characters in user input or restrict user input to ASCII characters. This feature is controlled by the function's third parameter, called <code>$multibyte</code>. You should allow multibyte characters in posts, PMs, topic titles, forum names, etc., but this is not necessary for internal uses like a <code>$mode</code> variable, which should only ever hold one of a predefined list of ASCII strings anyway.</p>
+
+<blockquote><pre>
+// an input string containing a multibyte character
+$_REQUEST['multibyte_string'] = 'K&#228;se';
+
+// print request variable as a UTF-8 string allowing multibyte characters
+echo request_var('multibyte_string', '', true);
+// print request variable as ASCII string
+echo request_var('multibyte_string', '');
+</pre></blockquote>
+
+<p>This code snippet will generate the following output:</p>
+
+<blockquote><pre>
+K&#228;se
+K??se
+</pre></blockquote>
+
+<h3>Unicode Normalization</h3>
+
+<p>If you retrieve user input with multibyte characters, you should additionally normalize the string using <code>utf8_normalize_nfc()</code> before you work with it. This is necessary to make sure that equivalent characters can only occur in one particular binary representation. For example, the character &#197; can be represented either as <code>U+00C5</code> (LATIN CAPITAL LETTER A WITH RING ABOVE) or as <code>U+212B</code> (ANGSTROM SIGN). phpBB uses Normalization Form Canonical Composition (NFC) for all text. So the correct version of the above example would look like this:</p>
+
+<blockquote><pre>
+$_REQUEST['multibyte_string'] = 'K&#228;se';
+
+echo utf8_normalize_nfc(request_var('multibyte_string', '', true));
+echo request_var('multibyte_string', '');
+</pre></blockquote>
+
+<h3>Case Folding</h3>
+
+<p>Case-insensitive comparison of strings is no longer possible with <code>strtolower</code> or <code>strtoupper</code>, as some characters have multiple lower case or multiple upper case forms depending on their position in a word. Instead, you should use case folding, which gives you a case-insensitive version of the string that can be used for case-insensitive comparisons. An NFC normalized string can be case folded using <code>utf8_case_fold_nfc()</code>.</p>
+
+<p class="bad">// Bad - The strings might be the same even if strtolower differs</p>
+
+<blockquote><pre>
+if (strtolower($string1) == strtolower($string2))
+{
+	echo '$string1 and $string2 are equal or differ in case';
+}
+</pre></blockquote>
+
+<p class="good">// Good - Case folding is really case insensitive</p>
+
+<blockquote><pre>
+if (utf8_case_fold_nfc($string1) == utf8_case_fold_nfc($string2))
+{
+	echo '$string1 and $string2 are equal or differ in case';
+}
+</pre></blockquote>
+
+<h3>Confusables Detection</h3>
+
+<p>phpBB offers a special function, <code>utf8_clean_string</code>, which can be used to make sure string identifiers are unique. This function uses Normalization Form Compatibility Composition (NFKC) instead of NFC and replaces similar-looking characters with a particular representative of the equivalence class. It is currently used for usernames and group names to avoid confusion between similar-looking names.</p>
+
+</div>
+<a href="#top">Top</a>
+<br /><br />
+
+<hr />
+
+<a name="translation"></a><h1>6. Translation (<abbr title="Internationalisation">i18n</abbr>/<abbr title="Localisation">L10n</abbr>) Guidelines</h1>
+
+	<a name="standardisation"></a><b>6.i. Standardisation</b>
 	<br /><br />
 	<div class="paragraph">
 	
@@ -1854,7 +1929,7 @@ div
 	<a href="#top">Top</a>
 	<br /><br />
 
-	<a name="otherconsiderations"></a><b>5.ii. Other considerations</b>
+	<a name="otherconsiderations"></a><b>6.ii. Other considerations</b>
 	<br /><br />
 	<div class="paragraph">
 
@@ -2118,7 +2193,7 @@ div
 	<a href="#top">Top</a>
 	<br /><br />
 
-	<a name="writingstyle"></a><b>5.iii. Writing Style</b>
+	<a name="writingstyle"></a><b>6.iii. Writing Style</b>
 	<br /><br />
 	<div class="paragraph">
 	
@@ -2229,13 +2304,19 @@ div
 
 <hr />
 
-<a name="changes"></a><h1>6. Guidelines Changelog</h1>
+<a name="changes"></a><h1>7. Guidelines Changelog</h1>
 <div class="paragraph">
 
+<h2>Revision 1.24</h2>
+
+<ul class="menu">
+	<li>Added <a href="#charsets">5. Character Sets and Encodings</a> section to explain the recommended treatment of strings in phpBB.</li>
+</ul>
+
 <h2>Revision 1.16</h2>
 
 <ul class="menu">
-	<li>Added <a href="#translation">5. Translation (<abbr title="Internationalisation">i18n</abbr>/<abbr title="Localisation">L10n</abbr>) Guidelines</a> section to explain expected format and authoring considerations for language packs that are to be created for phpBB.</li>
+	<li>Added <a href="#translation">6. Translation (<abbr title="Internationalisation">i18n</abbr>/<abbr title="Localisation">L10n</abbr>) Guidelines</a> section to explain expected format and authoring considerations for language packs that are to be created for phpBB.</li>
 </ul>
 
 <h2>Revision 1.11-1.15</h2>
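
The byte-versus-code-point distinction that the new "Character Sets and Encodings" section relies on can be reproduced with core PHP alone. The following is a minimal sketch, not phpBB code: `count_code_points()` is a hypothetical helper invented here for illustration, and the snippet assumes PHP 7.0 or later (for the `\u{...}` escapes) with PCRE built with UTF-8 support. It does not call any function from `utf_tools.php`.

```php
<?php
// Hypothetical helper (not part of phpBB): count code points rather than bytes.
function count_code_points(string $utf8): int
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so "." matches
    // one code point instead of one byte; /s lets "." match newlines as well.
    return (int) preg_match_all('/./us', $utf8);
}

// "Käse": U+00E4 occupies two bytes in UTF-8.
$word = "K\u{00E4}se";

$byte_length = strlen($word);                   // counts bytes
$code_point_length = count_code_points($word);  // counts code points

// Two code points that look identical but are different byte strings,
// which is why text is normalized before it is compared.
$a_ring = "\u{00C5}";    // LATIN CAPITAL LETTER A WITH RING ABOVE
$angstrom = "\u{212B}";  // ANGSTROM SIGN
$same_bytes = ($a_ring === $angstrom);

// explode() stays safe on UTF-8 text when the delimiter is plain ASCII.
$parts = explode(',', "K\u{00E4}se,Br\u{00F6}tchen");
```

Here `$byte_length` is 5 while `$code_point_length` is 4, and `$same_bytes` is `false` because the two forms of the letter are encoded as different byte sequences until they are normalized.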