Fonts, Character Sets, Unicode and UTF-8 Explained

Sat Jul 20 12:28:14 PDT 2013

There are a couple of postings about this topic, and I have certainly spent
many hours fighting with special character issues and stray ? Åö ☐
characters. A couple of weeks back I prepared an explanation for a question
I received about a superscript 3 (³) getting messed up in a complex
processing flow through many programs, spreadsheets, databases, editors and
browsers.

I think this information could be helpful for many others, and I may have
more to learn, so feel free to correct or clarify.  This is a bit long.  I
hope I'm not violating a forum rule.

Because of the mix of authoring, editing, transfer, storage and processing
software, plus the legacy techniques, data markup, and older programs still
in use, the character issue becomes a bit complex.

Framemaker 9.0 and up can handle UTF-8 encoded character data.  Unicode 6.0
and ISO/IEC 10646:2010 define 109,449 characters, each with its own code
point (value).  That's far more than the basic 256 ANSI characters that
include most of the western European characters used for French, Spanish,
and other languages.  The 256 ANSI characters use the values, a.k.a. code
points, 0 to 255; the printable ones fall between 32 and 255, or in
Unicode hexadecimal representation U+0020 to U+00FF.

Many other characters have values above the basic 256 characters that can
be troublesome such as:

☢  ≤ ≥ ∂ ∆ € ℓ ∑ ☒ £ ₇ ⁸   √x₍̅₁̅₂̅₃₎̅  № ℥ ℃ ⅓ ⅘ ⅚ ⅞ ↺ ✔☑ ☐ ✈ ど カ ␍␊

Just in case e-mail messes up the characters above, here is an image of the
characters.

[image: Inline image 1]

There are five parts to the "character" puzzle:
1) The value used to represent a character.  ASCII, ANSI, Unicode.  (I
recommend using *Unicode*)
2) The encoding, or how the character's value is stored using 1, 2, 3, or 4
bytes, where a byte is 8 bits (ones and zeros).  (I recommend *UTF-8*)
3) The font.  The set of glyphs that define how the characters will appear.
(I recommend a *Unicode / UTF-8 based font*)
4) The character set declarations in XML, XSL, CSS, HTML, and software
coding options for file open, read and write statements.
5) The capabilities and limitations of various programs (browsers,
spreadsheets, editors, etc.) and data transfer methods.

The 1963 ASCII standard character set used a 7-bit encoding and hence was
limited to 128 values, 2⁷.  Later the ANSI standard used all 8 bits,
providing 2⁸ or 256 codepoints or characters (0 - 255).  In order to display
mathematical or other special symbols not defined by the ASCII or ANSI
standards, custom fonts were developed such as Symbol, Wingdings and
Wingdings 2 that displayed the limited 256 values differently.  So π, pi,
could be displayed in a browser using <font face="symbol">p</font> or in MS
Word by changing the font for the character with the value of 112, i.e. the
"p", to a Symbol font.

So the value of 112 is not always a "p" unless a font is being used that
assigns the "p" glyph to the decimal value of 112 (hex 0x70).  Using the
Symbol font, 112 is a pi symbol π, and in the Wingdings 3 font it is a solid
triangle ▲.

Unicode is the set of values assigned to characters.  Its code space runs
from U+0000 to U+10FFFF, which takes 21 bits and provides 1,114,112
codepoints, of which 1,112,064 are usable for characters (the other 2,048
are reserved as UTF-16 surrogates).  Unicode 6.0 and ISO/IEC 10646:2010
define 109,449 characters, all with unique codepoint values.  In Unicode pi
is U+03C0 (960 decimal) and the triangle is U+25B2 (9,650 decimal).  It is
customary to represent the Unicode values as hexadecimal preceded by "U+".
Since codepoints 03C0 (960) and 25B2 (9,650) cannot be stored using 1 byte,
i.e. 8 bits, multiple bytes are required.  This is where "encoding" comes in.
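
To make the codepoint arithmetic concrete, here is a small Python sketch
(Python is my choice for illustration, it is not part of any Framemaker
workflow).  It only uses the built-in ord() and chr() functions; the
variable names are mine.

```python
# A code point is just an integer; ord() and chr() convert between
# a character and its Unicode value.
for ch in ["p", "π", "▲", "³"]:
    cp = ord(ch)
    print(f"{ch}  decimal {cp:>5}  U+{cp:04X}")

# chr() goes the other way, from a value to a character.
pi_char = chr(0x03C0)

# The Unicode code space runs from U+0000 to U+10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1      # 1,114,112 values in total
SURROGATES = 0xDFFF - 0xD800 + 1      # 2,048 values reserved for UTF-16
USABLE_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATES
print(USABLE_CODE_POINTS)
```

Anything above 255 in the first column will not fit in a single byte,
which is why the encoding question comes up at all.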

ANSI, Windows-1252, UTF-8, UTF-16, and UTF-32 are encodings, i.e. the way
that the values are stored using 8, 16, 24, or 32 bits (ones and zeros).
ASCII uses 7 bits and is limited to 128 characters.  ANSI is the common name
for the Windows-1252 code page (a close relative of ISO-8859-1); it is a
1-byte encoding that uses all 8 bits with a limit of 256 codepoint values.
UTF-8 uses 1 byte for the first 128 characters and then uses 2, 3, or 4
bytes as required.  UTF-16 uses 2 bytes for codepoints up to U+FFFF and 4
bytes (a surrogate pair) for anything higher, and UTF-32 always uses 4
bytes, even for the basic 256 characters.  See the UTF-8 reference below for
details on the binary encoding.  UTF stands for Unicode Transformation
Format.  The UTF encodings all assume the Unicode codepoint values are being
used.
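
The byte counts are easy to check for yourself.  The following is a minimal
Python sketch, assuming the standard codec names "ascii", "cp1252" (ANSI),
"utf-8", "utf-16-be" and "utf-32-be"; the helper name byte_report is my own
invention.

```python
ENCODINGS = ("ascii", "cp1252", "utf-8", "utf-16-be", "utf-32-be")

def byte_report(ch):
    """Return {encoding: hex bytes}; None where the encoding has no
    value for the character."""
    report = {}
    for enc in ENCODINGS:
        try:
            data = ch.encode(enc)
            report[enc] = " ".join(f"{b:02X}" for b in data)
        except UnicodeEncodeError:
            report[enc] = None
    return report

# "\U0001F600" is a codepoint above U+FFFF, so UTF-8 and UTF-16
# both need 4 bytes for it.
for ch in ["p", "é", "π", "▲", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", byte_report(ch))
```

For example, π comes back as CF 80 in UTF-8 and 03 C0 in UTF-16, and has no
ANSI value at all, which is exactly why the Symbol font trick was needed.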

Then there is the font.  If the font does not have a glyph defined for the
character codepoint (value) it will typically display as a box or a
question mark.  *Arial Unicode MS* is a Windows font that supports glyphs
for most of the first 65,536 Unicode codepoints.  *Verdana* defines 780
Unicode codepoints while *Century Schoolbook* defines 650 Unicode
codepoints and 20 "Private Use" glyphs.  The basic 256 characters are the
same between the three fonts: not the appearance, but an "a" is an "a".
The rest of the characters in each font use the same values, but each font
contains a different collection of values and characters.  One may contain
the infinity symbol while another may not.

When a single character displays like Åö it is due to mismatched encoding,
not the font.  When a character's codepoint is larger than 127 (U+007F) and
is saved to a file as UTF-8, it uses 2, 3, or 4 bytes.  If a program such as
a browser, editor or spreadsheet is expecting, reading, and displaying the
data as single-byte characters such as ANSI, then the 2 bytes are displayed
as if they were 2 characters.  In other words, the 16 binary ones and zeros
are decoded incorrectly as two separate characters (or 24 bits as three),
and the wrong font glyphs are displayed, resulting in the Åö looking stuff.
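
The mismatch can be reproduced in a few lines of Python.  This is only a
sketch: it assumes the reading program treats the data as ANSI (codec
"cp1252"), and the sample text is made up.  Other wrong guesses, such as a
Mac encoding, produce different garbage.

```python
original = "10 m³ ≤ 20 m³"

# A well-behaved program saves the text as UTF-8 ...
stored = original.encode("utf-8")

# ... and a second program reads the same bytes as single-byte ANSI.
misread = stored.decode("cp1252")
print(misread)

# Every byte became its own character, so the string grew.
print(len(original), len(stored), len(misread))

# If nothing was lost along the way, reversing the wrong step repairs it.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The repair at the end only works while the bytes are still intact.  Once a
program has replaced the characters it could not handle with question
marks, the original values are gone.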

I am trying to move to using Unicode codepoints, UTF-8 encoding, and
compatible fonts for everything I do, but it's not easy.

When the codepoint for a character creates a problem in one of the storage,
transfer, or processing programs my solution was to use a named entity such
as &le; for the less-than-or-equal-to symbol and transform the &le; to a
character, a character wrapped in a Symbol font, a numeric entity for
HTML, or use a Framemaker Read/Write rule.  In the past I also used named
entities for many of the western European characters such as &eacute;.  Now
that I understand more I no longer need to do that as much.

I still use named entities like &le; in some of my XML data, but convert
them to Unicode values for Framemaker and HTML product files and use either
the Arial Unicode MS or MS Gothic fonts for those characters.  I could use
the numeric entities in my XML but it makes authoring difficult, since
looking at &#x2264; doesn't tell me what the character is supposed to be as
well as &le; does.  In addition some processes will convert the numeric
entities to the actual character and then subsequent programs might choke
and convert the character to a question mark.  So the conversion to numeric
entities is always near the last step in my processing. My goal is to
someday use every character directly, and have it transfer correctly
between text editors, document and publication editors, spreadsheets,
databases, browsers and book readers.  Some software and operating systems
have to catch up to the Unicode and UTF-8 standard.
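
As an illustration of that late conversion step, here is a rough Python
sketch that turns named entities into hexadecimal numeric entities.  It is
not the tool I actually use; the function name named_to_numeric is
hypothetical, and it relies on the HTML entity table that ships with Python
(html.entities.name2codepoint), so entity names defined only in a private
DTD are left untouched.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML are left alone so the markup
# itself is not disturbed.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(text):
    """Replace known named entities with hexadecimal numeric entities."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)   # keep unknown or predefined names
        return "&#x{:04X};".format(name2codepoint[name])
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

print(named_to_numeric("a &le; b &amp; caf&eacute;"))
```

Going the other direction, html.unescape() in the same standard library
turns either form of entity into the actual character.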

I also just learned that in MS Word I can enter a Unicode value like 2264
or 2A81, then press Alt-x, and it converts the value to the Unicode
character, using the MS Gothic font when required.  I also use
http://graphemica.com to look up Unicode values by searching for the
character, value or name, such as ∞, 221E, or infinity.
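
The same three-way lookup (character, value, or name) can be sketched with
Python's standard unicodedata module.  The describe() helper is just a name
I made up for the example.

```python
import unicodedata

def describe(ch):
    """Return the U+ value and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Character -> value and name
print(describe("∞"))
print(describe("≤"))

# Value -> character (what Alt-x does in Word)
from_value = chr(int("221E", 16))

# Name -> character
from_name = unicodedata.lookup("INFINITY")

print(from_value, from_name)
```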

Codepoints (a.k.a character values), encoding and fonts cover the first
three parts of the puzzle. Parts 4 and 5 of the "character" puzzle have to
do with file declaration, programming statements, and understanding
software limitations.

An XML UTF-8 file should include the declaration
<?xml version="1.0" encoding="UTF-8"?>
An HTML5 UTF-8 file should include <meta charset="UTF-8">
An HTML4 or XHTML UTF-8 file should include
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

And, the files must be *saved as UTF-8* if that is what is intended.
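
Here is a short Python sketch of what happens when the declaration and the
saved encoding disagree.  It uses the standard xml.etree.ElementTree
parser; the <note> element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

body = "<note>10 m³</note>"

# Declaration and actual bytes agree: both are UTF-8.
good = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

# Declaration claims ISO-8859-1, but the file was saved as UTF-8.
bad = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("utf-8")

good_text = ET.fromstring(good).text
bad_text = ET.fromstring(bad).text

print(good_text)   # the superscript 3 survives
print(bad_text)    # the parser trusted the declaration: two characters
```

The parser is doing what it was told in both cases, which is why the
declaration and the "save as" setting have to match.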

To open and edit HTML files saved as UTF-8 using some text editors, the file
must also include the XML declaration described above.  These declarations
tell browsers, editors, and other software what the encoding is; otherwise
the program may assume it's encoded as ANSI using one byte for each
character.

Many programs will detect the UTF-8 encoding when opening a file, but some
may have to have an option selected.  When saving a file in other than the
native format, such as saving a new text file in TextPad, saving an Excel
spreadsheet as a Tab Delimited file, or copying and pasting from one
program to another, special options and settings may have to be specified.
Writing JavaScript, Perl, C#, Visual Basic or other programs will require
that the files are opened for reading or writing, and then data read and
written, using the appropriate options for ANSI, UTF-8 or another encoding
as required.
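
In Python, for example, the option is the encoding argument to open().
This is only a sketch with a throwaway temporary file; the file name and
sample text are made up.

```python
import os
import tempfile

text = "20 m³ ≤ ∞"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Say explicitly how the characters become bytes when writing ...
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# ... and use the same encoding when reading them back.
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Reading the same file as ANSI shows what a program that guesses
# wrong would see.
with open(path, "r", encoding="cp1252") as f:
    wrong = f.read()

print(round_trip == text)
print(wrong)
```

If the encoding argument is left off, Python falls back to a platform
default, which is the programming equivalent of letting an editor guess.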

I use Structured Framemaker to open and produce PDF files for publications
that are maintained as single source XML files. So I can't really make
specific Framemaker .FM encoding recommendations, but I think FM, as of
version 9, saves files as UTF-8 and can support the Unicode character
values.

If single source data is being used for multiple processing streams, then
the source data must be such that it can be transformed to support the
limitations of software, programs, and processes that consume and display
the data.

I think Unicode and UTF-8 encoding are the best standards to use at this
time.  If the hexadecimal numbers like 2A3B are messing with your brain I
can provide some insight on the decimal, binary, hexadecimal, byte, and bit
jargon as well, but I'd probably take it off line since it's not Frame
specific.

Ed Nodland

*Additional References*

https://en.wikipedia.org/wiki/Unicode

https://en.wikipedia.org/wiki/UTF-8

https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

https://en.wikipedia.org/wiki/Character_encoding