<div dir="ltr"><font face="arial, helvetica, sans-serif">There are a couple of postings about this topic, and I have certainly spent many hours fighting with special character issues and stray ? <span style="line-height:17px">Åö </span><font style="line-height:17px">☐ </font><span style="line-height:17px">characters. A couple of weeks back I prepared an explanation for a question I received about a superscript 3 (³) getting messed up in a complex processing flow through many programs, spreadsheets, databases, editors and browsers.</span></font><div>
<span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><span style="line-height:17px"><font face="arial, helvetica, sans-serif">I think this information could be helpful for many others, and I may have more to understand so feel free to correct or clarify. This is a bit long. I hope I'm not violating a forum rule.</font></span><div>
<font face="arial, helvetica, sans-serif"><br></font></div><div><span style="font-family:arial,helvetica,sans-serif;line-height:17px">Because of the mix of authoring, editing, transfer, storage and processing software, along with legacy techniques, data markup, and older programs, the character issue becomes a bit complex. </span><font face="arial, helvetica, sans-serif"><br>
</font></div><div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">Framemaker 9.0 and up can handle UTF-8 encoded character data. <span style="line-height:18px">Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, each with its own code point (value). That's far more than the basic 256 ANSI characters that include m</span><font color="#000000"><span style="line-height:18px">ost of the western European characters used for French, Spanish, and other languages. The ANSI values, a.k.a. code points, run from 0 to 255; the printable characters use 32 through 255, which correspond roughly to U+0020 through U+00FF in Unicode hexadecimal notation (Windows-1252 differs from Unicode in the range 128 to 159).</span></font></font><div>
<font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px"><br></span></font></div><div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px">Many other characters that can be troublesome have values above the basic 256, such as:</span></font></div>
<div><p style="margin-bottom:1.2pt;margin-left:19.2pt;line-height:14.4pt"><font face="arial, helvetica, sans-serif"><span style="line-height:17px">☢ ≤ ≥ ∂ ∆ € ℓ ∑ ☒ £ ₇ ⁸ √x₍̅₁̅₂̅₃₎̅ </span><span style="line-height:14px">№ ℥ ℃ ⅓ ⅘ ⅚ ⅞ ↺ ✔☑ ☐ ✈ </span><span style="line-height:16px">ど カ </span><span style="line-height:16px">␍␊</span><br>
</font></p><p style="margin-bottom:1.2pt;margin-left:19.2pt"><span style="line-height:16px"><font face="arial, helvetica, sans-serif">Just in case e-mail messes up the characters above, here is an image of the characters.</font></span></p>
<p style="margin-bottom:1.2pt;margin-left:19.2pt"><span style="line-height:16px"><font face="arial, helvetica, sans-serif"><img src="cid:ii_13ffd3df59335d6c" alt="Inline image 1"><br></font></span></p><div><span style="line-height:16px"><font face="arial, helvetica, sans-serif"><br>
</font></span></div></div><div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">There are five parts to the "character" puzzle:</font></span></div><div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">1) The value used to represent a character. ASCII, ANSI, UNICODE. (I recommend using <b>Unicode</b>)</font></span></div>
<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">2) The encoding, or how the character's value is stored using 1, 2, 3, or 4 bytes, where a byte is 8 bits (ones and zeros). (I recommend <b>UTF-8</b>)</font></span></div>
<div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px">3) The font. The set of glyphs that define how the characters will appear. (I recommend a <b>Unicode / UTF-8 based font</b>)</span></font></div>
<div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px">4) The character set declarations in XML, XSL, CSS, HTML, and software coding options for file open, read and write statements.</span></font></div>
<div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px">5) The capabilities and limitations of various programs (browsers, spreadsheets, editors, etc.) and data transfer methods.</span></font></div>
<div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px"><br></span></font></div><div><font face="arial, helvetica, sans-serif"><font color="#000000"><span style="line-height:18px">The 1963 ASCII standard character set used a 7-bit encoding and hence was limited to 128 values, 2</span><span style="line-height:18px">⁷. Later the</span></font><span style="line-height:17px"> ANSI standard used all 8 bits, providing </span><font color="#000000"><span style="line-height:18px">2</span></font><span style="line-height:17px">⁸ or 256 codepoints or characters</span><font color="#000000"><span style="line-height:18px"> (0 - 255). In order to display mathematical or other special symbols not defined by the ASCII or ANSI standards, custom fonts such as Symbol, Wingdings and Wingdings 2 were developed that displayed the limited 256 values differently. So π, pi, could be displayed in a browser using &lt;font face="symbol"&gt;p&lt;/font&gt; or in MS Word by changing the font for the character with the value of 112, i.e. the "p", to a Symbol font.</span></font></font></div>
<div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px"><br></span></font></div><div><font face="arial, helvetica, sans-serif"><font color="#000000"><span style="line-height:18px">So the value 112 is a "p" only when a font is being used that assigns the "p" glyph to the decimal value of 112 (hex 0x70). </span><span style="line-height:18px">Using the Symbol font, 112 is a pi symbol π, and in the Wingdings 3 font it is a solid triangle ▲</span></font><span style="line-height:17px">. </span></font></div>
<div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px"><br></span></font></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">Unicode is the set of values assigned to characters. The values run from U+0000 to U+10FFFF, which takes up to 21 bits and provides 1,114,112 codepoints, of which </span><span style="line-height:19.1875px">1,112,064</span><span style="line-height:18px"> are available for characters (the other 2,048 are reserved as UTF-16 surrogates)</span><font color="#000000"><span style="line-height:18px">. </span></font><span style="line-height:18px">Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, all with unique codepoint values. In Unicode pi is U+03C0 (960 decimal) and the triangle is U+25B2 (9,650 decimal). </span><span style="line-height:18px">It is customary to represent the Unicode values as hexadecimal preceded by "U+"</span><span style="line-height:18px">.</span><span style="line-height:18px"> Since codepoints 03C0 (960) and 25B2 (9,650) cannot be stored in 1 byte, i.e. 8 bits, multiple bytes are required. This is where "encoding" comes in. </span></font></div>
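<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">To check a codepoint value yourself, here is a minimal sketch in Python (my choice of language for illustration; any language with Unicode strings would do). The helper name describe is made up for this example. It prints the "U+" hexadecimal form and the decimal value for the characters discussed above.</span></font></div>

```python
def describe(ch):
    """Return the codepoint of one character as 'U+XXXX (decimal N)'."""
    value = ord(ch)  # ord() gives the Unicode codepoint as an integer
    return f"U+{value:04X} (decimal {value:,})"

for ch in ["p", "π", "▲", "≤"]:
    print(ch, describe(ch))

# Going the other way: build a character from its codepoint value.
print(chr(0x03C0))
```

<div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">Note that the "p" still comes out as U+0070 (112). The pi is a different codepoint, not a "p" shown in a different font.</span></font></div>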
<div><font color="#000000" face="arial, helvetica, sans-serif"><span style="line-height:18px"><br></span></font></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">ANSI, Windows-1252, UTF-8, UTF-16, and UTF-32 are encodings, i.e. the way that the values are stored using 8, 16, 24, or 32 bits (ones and zeros). ASCII uses 7 bits and is limited to 128 characters. </span><span style="line-height:18px">ANSI and Windows-1252 are 1-byte encodings that use all 8 bits with a limit of 256 codepoint values. UTF-8 uses 1 byte for the first 128 characters and then uses 2, 3, or 4 bytes as required. UTF-16 uses 2 bytes (16 bits) for codepoints up to U+FFFF and 4 bytes (a surrogate pair) for anything above that, and UTF-32 always uses 4 bytes, even for the basic 256 characters. See the UTF-8 reference below for details on the binary encoding</span><span style="line-height:18px">. UTF stands for Unicode Transformation Format. The UTF encodings all assume the Unicode codepoint values are being used.</span></font></div>
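<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">The byte counts can be checked with a short Python sketch. The helper name byte_counts and the sample characters are my own picks for illustration (the emoji is there only because it lies above U+FFFF). The "-le" codec variants are used so that Python does not add a byte order mark to the count.</span></font></div>

```python
def byte_counts(ch):
    """Return how many bytes one character needs in UTF-8, UTF-16 and UTF-32."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),  # "-le" avoids the byte order mark
        "UTF-32": len(ch.encode("utf-32-le")),
    }

for ch in ["p", "é", "≤", "😀"]:
    # Show the actual UTF-8 bytes in hex next to the byte counts.
    print(ch, ch.encode("utf-8").hex(" "), byte_counts(ch))
```

<div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">An "é" fits in a single byte in Windows-1252 but needs two bytes in UTF-8, which is exactly where the trouble starts when the two get mixed up.</span></font></div>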
<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">Then there is the font. If the font does not have a glyph defined for the character codepoint (value) it will typically display as a box or a question mark.<b> Arial Unicode MS</b> is a Windows font that supports glyphs for most of the first 65,533 Unicode codepoints.<b> Verdana</b> defines 780 Unicode codepoints while <b>Century Schoolbook</b> defines 650 Unicode codepoints and 20 "Private Use" glyphs. The basic 256 characters are the same across the three fonts: the appearance differs, but an "a" is an "a". The rest of the characters in each font use the same values, but each font contains a different collection of values and characters. One may contain the infinity symbol while another may not.</span><br>
</font></div><div><div><span style="line-height:18px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">When a single character displays like </span><span style="line-height:17px">Åö it is due to mismatched encoding, not the font. When a </span><span style="line-height:17px">character's codepoint</span><span style="line-height:17px"> is larger than 127, U+007F, and is saved to a file as UTF-8 it uses 2, 3, or 4 bytes. If a program such as a browser, editor or spreadsheet is expecting, reading, and displaying the data as single-byte characters such as ANSI, then the 2 bytes are displayed as if they were 2 characters. In other words the 16 binary ones and zeros are decoded incorrectly as two separate characters (or 24 bits as three), and the wrong font glyphs are displayed, resulting in the </span><span style="line-height:17px">Åö</span><span style="line-height:17px"> looking stuff</span><span style="line-height:17px">.</span></font></div>
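<div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:17px">The mismatch is easy to reproduce. The Python sketch below assumes that "ANSI" means Windows-1252 (Python's cp1252 codec); the function names are made up for illustration. It encodes a character as UTF-8, reads the bytes back as single-byte characters, and then reverses the mistake.</span></font></div>

```python
def misread_as_ansi(text):
    """Encode text as UTF-8, then decode the bytes as if they were Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

def repair(garbled):
    """Undo the mistake: get the original bytes back and decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")

for ch in ["³", "é", "≤"]:
    garbled = misread_as_ansi(ch)
    print(ch, "->", garbled, "->", repair(garbled))
```

<div><font face="arial, helvetica, sans-serif"><span style="line-height:17px">The superscript 3 turns into two characters and the ≤ into three, one per byte. The repair only works while the garbled text is intact; once a program has replaced a byte with a question mark the original is gone.</span></font></div>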
<div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:17px">I am trying to move to using Unicode codepoints, UTF-8 encoding, and compatible fonts for everything I do, but it's not easy.</span><br>
</font></div><div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><span style="line-height:17px"><font face="arial, helvetica, sans-serif">When the codepoint for a character creates a problem in one of the storage, transfer, or processing programs, my solution was to use a named entity such as &amp;le; for the less-than-or-equal-to symbol and transform the &amp;le; to a character, a character wrapped in a Symbol font, a numeric entity for HTML, or use a Framemaker Read/Write rule. In the past I also used named entities for many of the western European characters such as &amp;eacute;. Now that I understand more I no longer need to do that as much. </font></span></div>
<div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:17px">I still use named entities like &amp;le; in some of my XML data, but convert them to Unicode values for Framemaker and HTML product files and use either the Arial Unicode MS or MS Gothic fonts for those characters.</span><span style="line-height:17px"> I could use the numeric entities in my XML but it makes authoring difficult, since looking at &amp;#x2264; doesn't tell me what the character is supposed to be as well as &amp;le; does. In addition some processes will convert the numeric entities to the actual character and then subsequent programs might choke and convert the character to a question mark. So the conversion to numeric entities is always near the last step in my processing. My goal is to someday use every character directly, and have it transfer correctly between text editors, document and publication editors, spreadsheets, databases, browsers and book readers. Some software and operating systems have to catch up to the Unicode and UTF-8 standard.</span></font></div>
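<div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:17px">As an illustration of that late conversion step, here is a minimal Python sketch. It is not my actual process, just one way it could be done. It uses the HTML entity table from Python's standard library (html.entities.name2codepoint); the function name named_to_numeric is hypothetical, and the five entities built into XML are left alone on the assumption that they still mark up real syntax.</span></font></div>

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML are left untouched.
XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(text):
    """Replace HTML named entities with hexadecimal numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_BUILTINS or name not in name2codepoint:
            return match.group(0)  # unknown or built-in: keep as written
        return f"&#x{name2codepoint[name]:04X};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

print(named_to_numeric("x &le; 3 &amp; caf&eacute;"))
```

<div><font face="arial, helvetica, sans-serif"><span style="line-height:17px">Going all the way to the actual character instead is what html.unescape from the same standard library does.</span></font></div>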
<div><font face="arial, helvetica, sans-serif"><span style="line-height:17px"><br></span></font></div><div style><font face="arial, helvetica, sans-serif"><span style="line-height:17px">I also just learned that in MS Word I can enter a Unicode value like 2264 or 2A81 then press Alt-x and it converts the value to the Unicode character using the MS Gothic font when required. I also use </span></font><a href="http://graphemica.com/%E2%89%A4">http://graphemica.com</a> to look up Unicode values by searching for the character, value or name, such <font face="arial, helvetica, sans-serif">as <span style="line-height:115%"><font size="4">∞</font></span><span style="font-size:10pt;line-height:115%">, 221E, or infinity</span></font> <span style="line-height:17px;font-family:arial,helvetica,sans-serif"> </span></div>
<div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif">Codepoints (a.k.a. character values), encoding and fonts cover the first three parts of the puzzle. Parts 4 and 5 of<span style="line-height:18px"> the "character" puzzle have to do with file declarations, programming statements, and understanding software limitations.</span></font></div>
<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">An XML UTF-8 file must include the declaration </span><span style="line-height:18px">&lt;?xml version="1.0" encoding="UTF-8"?&gt;</span></font></div>
<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">An HTML5 UTF-8 file must include &lt;meta charset="UTF-8"&gt;</font></span></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:18px">An HTML4 or XHTML UTF-8 file must include </span><span style="line-height:18px">&lt;meta http-equiv="Content-type" content="text/html;charset=UTF-8"&gt;</span></font></div>
<div><font face="arial, helvetica, sans-serif"><span style="line-height:18px"><br></span></font></div><div style><font face="arial, helvetica, sans-serif"><span style="line-height:18px">And, the files must be <b><u>saved as UTF-8</u></b> if that is what is intended.</span></font></div>
<div><font face="arial, helvetica, sans-serif"><span style="line-height:18px"><br></span></font></div><div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">To open and edit HTML files saved as UTF-8 using some text editors the file must also include the XML declaration described above. These declarations tell browsers, editors, and other software what the encoding is or the program may assume it's encoded as ANSI using one byte for each character. </font></span></div>
<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">Many programs will detect the UTF-8 encoding when opening a file, but some may need to have an option selected. When saving a file in other than the native format, such as saving a new text file in TextPad, saving an Excel spreadsheet as a Tab Delimited file, or copying and pasting from one program to another, special options and settings may have to be specified. Programs written in JavaScript, Perl, C#, Visual Basic or other languages must open files for reading or writing, and then read and write the data, using the appropriate options for ANSI, UTF-8 or another encoding as required.</font></span></div>
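<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">As one example of such options, here is a minimal Python sketch that reads a file in one encoding and writes it in another. The function name convert_encoding, the file names, and the default of Windows-1252 (cp1252) as the "ANSI" source are assumptions for illustration. Other languages have equivalent open, read and write options.</font></span></div>

```python
import os
import tempfile

def convert_encoding(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Read a text file in one encoding and write it back out in another."""
    # newline="" keeps the original line endings instead of translating them.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text

# Demonstration with throwaway files (the file names are hypothetical).
with tempfile.TemporaryDirectory() as folder:
    ansi_file = os.path.join(folder, "export_ansi.txt")
    utf8_file = os.path.join(folder, "export_utf8.txt")
    with open(ansi_file, "wb") as f:
        f.write(b"caf\xe9 costs \x80 5")  # "café costs € 5" stored as Windows-1252
    convert_encoding(ansi_file, utf8_file)
    with open(utf8_file, "rb") as f:
        print(f.read())  # the same text, now as UTF-8 bytes
```

<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">The point is that the encoding is stated explicitly on both the read and the write. Leaving it out means the program falls back on a default that varies by platform.</font></span></div>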
<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div style><span style="line-height:18px"><font face="arial, helvetica, sans-serif">I use Structured Framemaker to open and produce PDF files for publications that are maintained as single source XML files. So I can't really make specific Framemaker .FM encoding recommendations, but I think FM, as of version 9, saves files as UTF-8 and can support the Unicode character values.</font></span></div>
<div><span style="line-height:18px"><font face="arial, helvetica, sans-serif"><br></font></span></div><div><span style="line-height:18px"><font face="arial, helvetica, sans-serif">If single source data is being used for multiple processing streams, then the source data must be such that it can be transformed to support the limitations of software, programs, and processes that consume and display the data. </font></span></div>
<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif"><span style="line-height:17px">I think Unicode and UTF-8 encoding are the best standards to use at this time. If the hexadecimal numbers like 2A3B are messing with your brain I can provide some insight on the decimal, binary, hexadecimal, byte, and bit jargon as well, but I'd probably take it off line since it's not Frame specific. </span></font></div>
<div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"> </font></span></div><div><span style="line-height:17px"><font face="arial, helvetica, sans-serif">Ed Nodland</font></span></div><div><span style="line-height:17px"><font face="arial, helvetica, sans-serif"><br>
</font></span></div><div><p style="margin-bottom:1.2pt;margin-left:19.2pt;line-height:14.4pt"><font face="arial, helvetica, sans-serif"><b style="line-height:14.4pt">Additional References</b><br></font></p><p style="margin-bottom:1.2pt;margin-left:19.2pt;line-height:14.4pt">
<font face="arial, helvetica, sans-serif"><a href="https://en.wikipedia.org/wiki/Unicode" target="_blank">https://en.wikipedia.org/wiki/Unicode</a><br></font></p><p style="margin-bottom:1.2pt;margin-left:19.2pt;line-height:14.4pt">
<font face="arial, helvetica, sans-serif"><a href="https://en.wikipedia.org/wiki/UTF-8" target="_blank">https://en.wikipedia.org/wiki/UTF-8</a><br></font></p><p style="margin-bottom:1.2pt;margin-left:19.2pt;line-height:14.4pt">
<font face="arial, helvetica, sans-serif"><a href="https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings" target="_blank">https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings</a><span style="line-height:16px"><br>
</span></font></p><p style="margin-bottom:1.2pt;margin-left:19.2pt;line-height:14.4pt"><font face="arial, helvetica, sans-serif"><a href="https://en.wikipedia.org/wiki/Character_encoding" target="_blank">https://en.wikipedia.org/wiki/Character_encoding</a><br>
</font></p></div></div></div><div><br></div></div></div>