Big and Little Endians [was: Re: Procedure How to Write a Manual!]

Fri May 22 12:14:18 PDT 2009

On Fri, 22 May 2009 13:36:11 +0000, Bodvar Bjorgvinsson 
<bodvar at gmail.com> wrote:

>Regarding the "endianness", I had a problem some 13 years ago with some
>UNIX software that was supposed to work on Linux. It did not. I sent a
>query to an Icelandic guy on the "Basic Linux Training" list I
>subscribed to and he came up with a solution. Then he explained to me
>that there was a difference between Linux and UNIX: one used big
>endian and the other little endian for the same software code.

In current computer systems, there are two kinds of
"endianness", called "LSB (Least Significant Byte)
first", or little-endian, and "MSB (Most Significant
Byte) first", or big-endian.  For any given system,
what determines this is not the operating system
(Linux, Windows, etc.); it's the processor (CPU).
Intel x86 CPUs are LSB first; others, like Sun SPARC
and Motorola 68K, are MSB first.  So Linux on a Sun
SPARC would be MSB first, but on an Intel box it
would be LSB first.
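
If you want to see which kind a given machine is,
here is a minimal sketch in Python.  The function
name host_byte_order is made up for illustration;
it packs the number 1 in the CPU's native order and
looks at which byte comes first.

```python
import struct
import sys


def host_byte_order():
    """Return "little" (LSB first) or "big" (MSB first) for this CPU."""
    # "=H" packs a 16-bit unsigned number in native byte order.
    first_byte = struct.pack("=H", 1)[0]
    # If the least significant byte (0x01) comes first, we are LSB first.
    return "little" if first_byte == 1 else "big"


print(host_byte_order())
print(sys.byteorder)  # Python's own report; it should say the same thing
```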

Technically, the difference is indeed *byte* order,
not *bit* order (which is constant).  Suppose you
have a hex number 0xABCD.  The most significant
byte is 0xAB; the least significant byte is 0xCD.
Now imagine that you store this number in memory
at address 0.  ;-)  You will get:

Location  SPARC  Intel
00000000  0xAB   0xCD
00000001  0xCD   0xAB
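
The table above can be reproduced on any machine,
since Python lets you ask for either order by name.
This is only a sketch; the variable names are mine.

```python
value = 0xABCD

# MSB first (big-endian), the SPARC column
msb_first = value.to_bytes(2, byteorder="big")

# LSB first (little-endian), the Intel column
lsb_first = value.to_bytes(2, byteorder="little")

print(msb_first.hex())  # abcd
print(lsb_first.hex())  # cdab
```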

Well-designed programs where portability matters
will work with *either* CPU.  They do this by not
caring what the storage order in memory is, and
always accessing multibyte numbers through a set
of functions that work regardless of byte order.
For example, Mif2Go was originally developed on
a Sun SPARC system, then ported to Windows very
easily because it followed those design rules.
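
As a rough illustration of such access functions,
here is a sketch in Python.  It is not Mif2Go's
actual code; the function names are hypothetical,
and it assumes a file format that stores 32-bit
unsigned numbers MSB first.  The point is that it
builds the number from individual bytes with shifts,
so the result is the same on any CPU.

```python
def read_u32_msb_first(buf, offset=0):
    """Read a 32-bit unsigned number stored MSB first (hypothetical helper)."""
    chunk = buf[offset:offset + 4]
    if len(chunk) != 4:
        raise ValueError("need 4 bytes")
    # Shifts work on values, not on memory layout, so this is portable.
    return (chunk[0] << 24) | (chunk[1] << 16) | (chunk[2] << 8) | chunk[3]


def write_u32_msb_first(value):
    """Return the 4 bytes for value, MSB first (hypothetical helper)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    return bytes((
        (value >> 24) & 0xFF,
        (value >> 16) & 0xFF,
        (value >> 8) & 0xFF,
        value & 0xFF,
    ))


data = write_u32_msb_first(0x12345678)
print(data.hex())                     # 12345678
print(hex(read_u32_msb_first(data)))  # 0x12345678
```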

There's actually a third flavor, but it was used
only on the DEC PDP-11.  Since the last of those
is probably in the Smithsonian, you won't see it
in current software.  It is the same as Intel
for two-byte numbers (shorts) but switches the
byte pairs for 4-byte numbers (longs).  So the
number 0x12345678 is stored as the bytes 0x34,
0x12, 0x78, 0x56.
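
The PDP-11 layout described above can be sketched
like this in Python.  The function names are made
up for illustration: the high 16-bit word goes
first, and each word is itself stored LSB first.

```python
def to_pdp11_bytes(value):
    """Lay out a 32-bit number in PDP-11 order (hypothetical helper)."""
    high_word = (value >> 16) & 0xFFFF
    low_word = value & 0xFFFF
    # High word first, but each 16-bit word is stored LSB first.
    return high_word.to_bytes(2, "little") + low_word.to_bytes(2, "little")


def from_pdp11_bytes(data):
    """Rebuild the 32-bit number from 4 bytes in PDP-11 order."""
    high_word = int.from_bytes(data[0:2], "little")
    low_word = int.from_bytes(data[2:4], "little")
    return (high_word << 16) | low_word


print(to_pdp11_bytes(0x12345678).hex())  # 34127856
```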

Endianness also affects Unicode, in the UTF-16
and UTF-32 encodings of it, but *not* in UTF-8.
It is the reason for the UTF-16 BOM (Byte Order
Mark), U+FEFF.  In UTF-16 Big-endian (MSB first),
the bytes are 0xFE 0xFF.  In UTF-16 Little-endian
(LSB first), they are 0xFF 0xFE.  UTF-32 adds two
zero bytes, before it for Big (0x00 0x00 0xFE 0xFF)
and after for Little (0xFF 0xFE 0x00 0x00).
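
Here is a minimal sketch of checking for those BOMs
in Python, using the constants in the standard
codecs module.  The function name sniff_bom and the
labels it returns are my own.  The UTF-32 marks are
tested first because the UTF-32 Little-endian BOM
begins with the same two bytes as the UTF-16 one;
that makes this a heuristic, since UTF-16 LE text
that starts with a NUL character looks the same.

```python
import codecs

# Longer BOMs first: UTF-32 LE starts with the UTF-16 LE bytes.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),     # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),  # FF FE 00 00
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),     # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),  # FF FE
]


def sniff_bom(data):
    """Return a label for the BOM at the start of data, or None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None


# U+FEFF encoded both ways gives the byte pairs described above.
print("\ufeff".encode("utf-16-be").hex())  # feff
print("\ufeff".encode("utf-16-le").hex())  # fffe
print(sniff_bom(b"\xff\xfeA\x00"))         # UTF-16 little-endian
```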

The Unicode BOM may also be used as an encoding
signature, but I digress...   ;-)  Good thing
it's Friday, eh?

HTH!

-- Jeremy H. Griffith, at Omni Systems Inc.
  <jeremy at omsys.com>  http://www.omsys.com/