
Architecture Workgroup


MEG0005 Unicode and Internationalization

 
Version: 0.03
Last Updated: 12/17/2008
Status:
 
Content
 

C1. Content

C1.1 Summary of Guidelines

Ref Guideline
5.1 Characters MUST be limited to the Basic Multilingual Plane (BMP).
5.2 Pre-composed characters SHOULD be used rather than dynamic composition
5.3 Ligatures SHOULD NOT be used
5.4 ISO8859-1 SHOULD be used with care
5.5 The Unicode private area (E000-F8FF) MUST NOT be used
5.6 Compatibility characters MUST NOT be used
5.7 Byte Order Marks (BOMs) MUST NOT be used
5.8 The object replacement character MUST NOT be used
5.9 Interlinear annotation characters MUST NOT be used
5.10 Line and paragraph separators MUST NOT be used
5.11 BIDI embedding controls (U+202A..U+202E) MUST NOT be used
5.12 Characters deprecated in Unicode 3.0 MUST NOT be used

C1.2 Guidelines

Ref Guideline
5.1 Characters MUST be limited to the Basic Multilingual Plane (BMP).

The BMP or Plane 0 includes the most common characters from most living languages.
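As a minimal sketch of how a sender might check this guideline, the following function reports every character whose code point lies above U+FFFF, the last code point of Plane 0. The function name is hypothetical and is not part of any MISMO specification.

```python
def non_bmp_characters(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXXX') for every character outside the BMP."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if ord(char) > 0xFFFF  # Plane 0 (the BMP) ends at U+FFFF
    ]


if __name__ == "__main__":
    # U+1F600 lives on Plane 1, so it is reported.
    print(non_bmp_characters("Loan \U0001F600"))
```

An empty result means the text stays within the BMP.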
5.2 Pre-composed characters SHOULD be used rather than dynamic composition

Certain sequences of characters can be represented as a single character, called a pre-composed character (or composite or decomposable character). For example, the character “ü” can be encoded as the single code point U+00FC “ü” rather than as the base character U+0075 “u” followed by the non-spacing character U+0308 “¨”.
The Unicode Standard encodes pre-composed characters for compatibility with established standards such as Latin 1, which includes many pre-composed characters such as “ü” and “ñ”.
Such pre-composed character representations SHOULD be favored over use of both composing and combining characters.
Dynamic composition refers to the creation of composite forms such as accented letters from a sequence of characters.
As described above, pre-composed characters should be used where possible. If composing or combining characters are used, they must be in Unicode’s Normalization Form C (NFC) for MCF compliance. Note that this is the normalization form recommended in the W3C’s Character Model for the World Wide Web, and it corresponds to the W3C notion of “early normalization” in the character processing model of the World Wide Web.
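The “ü” example above can be sketched with the standard Python unicodedata module. The helper names to_nfc and is_nfc are illustrative assumptions, not names defined by MISMO or the W3C.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """True if normalizing to NFC would leave the text unchanged."""
    return to_nfc(text) == text


if __name__ == "__main__":
    decomposed = "u\u0308"    # base letter followed by combining diaeresis
    precomposed = "\u00fc"    # single pre-composed code point
    print(is_nfc(decomposed), is_nfc(precomposed))
    print(to_nfc(decomposed) == precomposed)
```

A producer could apply to_nfc before serializing a message, and a consumer could use is_nfc to decide whether to accept it.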
5.3 Ligatures SHOULD NOT be used

Ligatures are combinations of certain characters into a single character. For example, the name Aegis could be represented with ligatures as Ægis. Don’t do it. There is no mechanism available for asserting semantic equivalence between a document with ligatures and one without. Semantic equivalence is a requirement for comparing XML Signatures and XML Encryption output. Use of ligatures will result in interoperability problems.
If ligatures are used, be aware that these will be normalized to Unicode’s Normalization Form C (NFC) in MCF or other canonicalization processes required in XML signatures and XML Encryption.
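One possible way to spot ligatures before a message is sent is a heuristic based on Unicode character names. This is only a sketch under stated assumptions: the function name is hypothetical, the test for the word LIGATURE in the character name is a heuristic rather than an official classification, and U+00C6 and U+00E6 are added by hand because their current Unicode names do not contain that word.

```python
import unicodedata

# Assumption: AE and ae are listed explicitly because their Unicode names
# ("LATIN CAPITAL LETTER AE", "LATIN SMALL LETTER AE") lack the word LIGATURE.
EXTRA_LIGATURES = {"\u00c6", "\u00e6"}


def find_ligatures(text: str) -> list[str]:
    """Return the characters in text that look like ligatures (heuristic)."""
    return [
        char
        for char in text
        if char in EXTRA_LIGATURES or "LIGATURE" in unicodedata.name(char, "")
    ]


if __name__ == "__main__":
    print(find_ligatures("\u00c6gis"))  # the name written with a ligature
    print(find_ligatures("Aegis"))      # the same name without one
```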
5.4 ISO8859-1 SHOULD be used with care

Whereas MCF restricts encoding to UTF-8, the MISMO Profile also allows ISO-8859-1. Many documents that are labeled as ISO-8859-1 are actually encoded in a platform-specific variant of ISO-8859-1 known as Microsoft CP-1252 (code page 1252); see Appendix A.
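The mislabeling problem can be illustrated with a short sketch. In ISO-8859-1 the byte values 0x80 to 0x9F are control codes, while CP-1252 assigns printable characters such as curly quotes to most of them, so their presence in data labeled ISO-8859-1 is a warning sign. The function name is hypothetical, and the check is a heuristic, not proof of the actual encoding.

```python
def looks_like_cp1252(raw: bytes) -> bool:
    """Heuristic: bytes 0x80-0x9F are rarely intended in genuine ISO-8859-1."""
    return any(0x80 <= byte <= 0x9F for byte in raw)


if __name__ == "__main__":
    # Curly quotes exist in CP-1252 but not in ISO-8859-1.
    sample = "\u201cquoted\u201d".encode("cp1252")
    print(looks_like_cp1252(sample))
    # Decoding with the wrong label silently yields control characters.
    print(repr(sample.decode("iso-8859-1")))
    print(repr(sample.decode("cp1252")))
```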
5.5 The Unicode private area (E000-F8FF) MUST NOT be used

Trading partners cannot use the Unicode private area without exchanging definitions out of band. This violates the “Self Contained” principle of interoperability.
Trading partners wishing to use the Unicode private area may do so, but the messages in such an implementation would not be considered MISMO interoperable.
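A check for this guideline can be sketched as a simple range test over the private area named in the guideline, E000-F8FF. The function name is an illustrative assumption.

```python
def private_use_characters(text: str) -> list[str]:
    """Return 'U+XXXX' for each character in the BMP private area E000-F8FF."""
    return [
        f"U+{ord(char):04X}"
        for char in text
        if 0xE000 <= ord(char) <= 0xF8FF
    ]


if __name__ == "__main__":
    print(private_use_characters("logo: \ue000"))
```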
5.6 Compatibility characters MUST NOT be used

Towards the end of the BMP is a range of code points for compatibility characters. These are character variants that are encoded only to enable transcoding to earlier standards and legacy implementations, which made use of them.
MISMO Standards and MISMO Interoperable messages do not require these characters.
If a process behind a service boundary needs compatibility characters to interface with legacy systems, it is free to use them. However, the MISMO Interoperable messages produced and consumed in implementations based on MISMO standards should not use them.
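One possible way to flag compatibility characters is to look for a tagged (compatibility) decomposition in the Unicode Character Database, which Python exposes through unicodedata.decomposition. This is an interpretation, not a definition taken from this MEG: it is broader than the block of code points near the end of the BMP (for example, it also flags the no-break space), and it does not catch the CJK compatibility ideographs, whose decompositions are canonical. The function name is hypothetical.

```python
import unicodedata


def compatibility_characters(text: str) -> list[str]:
    """Return characters whose decomposition carries a tag such as <compat>."""
    return [
        char
        for char in text
        # Compatibility mappings start with a tag in angle brackets,
        # e.g. "<wide> 0041"; canonical mappings have no tag.
        if unicodedata.decomposition(char).startswith("<")
    ]


if __name__ == "__main__":
    # U+FF21 is the fullwidth form of "A", kept for legacy encodings.
    print(compatibility_characters("\uff21BC"))
```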
5.7 Byte Order Marks (BOMs) MUST NOT be used

UTF-8 is byte order independent. Byte Order Marks MUST NOT be used when encoding in UTF-8. They MAY be used for UTF-16.
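A receiver could detect or strip a leading UTF-8 BOM along the following lines. The sketch uses the codecs.BOM_UTF8 constant from the Python standard library; the function names are illustrative assumptions.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """True if the byte stream starts with the UTF-8 BOM (EF BB BF)."""
    return raw.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return the byte stream without a leading UTF-8 BOM, if one is present."""
    return raw[len(codecs.BOM_UTF8):] if has_utf8_bom(raw) else raw


if __name__ == "__main__":
    message = codecs.BOM_UTF8 + b"<MESSAGE/>"
    print(has_utf8_bom(message))
    print(strip_utf8_bom(message))
```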
5.8 The object replacement character MUST NOT be used

The character U+FFFC within the Unicode “Specials category” is used to stand in place of an object (e.g., an image) included in a text. The object replacement character was included in Unicode only in order to reserve a code point for a very frequent application-internal use. Including an object replacement character in markup text does not work because the additional information is not available. The object replacement character is also problematic when used in plain text, because there is no way in plain text to provide the actual object information or a reference to it.
MISMO provides the EMBEDED_FILE container to use instead of the object replacement character.
5.9 Interlinear annotation characters MUST NOT be used

The characters U+FFF9 to U+FFFB within the Unicode “Specials category” are used to delimit interlinear annotations in certain circumstances. Including interlinear annotation characters in marked-up text does not work because the additional formatting information (how to position the annotation,…) is not available. The interlinear annotation characters are also problematic when used in plain text, and are not intended for that purpose. In particular, on older display systems that simply ignore or replace the Interlinear Annotation Characters, the meaning of the text may be changed.
When the placement of characters in relation to one another is important, MISMO standards provide alternatives that do work. For example, the VIEW container of a SMARTDOC uses XHTML or SVG markup to place characters within tight formatting constraints. Also, EMBEDED_FILE can be used to hold PDF or other binary data.
5.10 Line and paragraph separators MUST NOT be used

The line separator (U+2028) and paragraph separator (U+2029) provide unambiguous means to denote hard line breaks and paragraph delimiters in plain text. Including these characters in markup text does not work where it would duplicate the existing markup commands for delimiting paragraphs and lines.
The separator characters can also be problematic when used in plain text, because legacy data is usually converted code point for code point into Unicode, and all receivers of Unicode plain text effectively have to be able to interpret the existing use of control codes for this purpose. As a result, fewer Unicode implementations support these characters than would otherwise be the case.
5.11 BIDI embedding controls (U+202A..U+202E) MUST NOT be used

The BIDI embedding controls are required to supplement the Unicode Bidirectional Algorithm in plain text. These characters duplicate available markup, which is better suited to handle the stateful nature of their effect.
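Guidelines 5.8 through 5.11 each exclude a small, fixed set of code points, so a single scan can cover all four. The following is a minimal sketch; the table and function names are hypothetical, and the ranges are the ones quoted in the guidelines above.

```python
# Code point ranges excluded by guidelines 5.8 to 5.11 (inclusive bounds).
FORBIDDEN_RANGES = {
    "5.8 object replacement character": (0xFFFC, 0xFFFC),
    "5.9 interlinear annotation characters": (0xFFF9, 0xFFFB),
    "5.10 line and paragraph separators": (0x2028, 0x2029),
    "5.11 BIDI embedding controls": (0x202A, 0x202E),
}


def find_forbidden(text: str) -> list[tuple[int, str, str]]:
    """Return (index, 'U+XXXX', guideline) for each excluded character."""
    hits = []
    for index, char in enumerate(text):
        for guideline, (low, high) in FORBIDDEN_RANGES.items():
            if low <= ord(char) <= high:
                hits.append((index, f"U+{ord(char):04X}", guideline))
    return hits


if __name__ == "__main__":
    for hit in find_forbidden("line one\u2028line two\ufffc"):
        print(hit)
```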
5.12 Characters deprecated in Unicode 3.0 MUST NOT be used

As described in Unicode in XML and other Markup Languages, such characters were retained from draft versions of ISO 10646, originally intended to allow explicit activation of contextual shaping, numeric digit rendering and symmetric swapping. The most likely effect of their occurrence in generated text would be that of a ‘garbage’ character. When received by a browser as part of marked up text, they may be ignored. When received in an editing context, they may be removed, possibly with a warning.
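The removal behavior described above can be sketched as follows. The range U+206A..U+206F is not stated in this MEG; it is taken from the deprecated format characters listed in the Unicode Standard (the symmetric swapping, Arabic form shaping and digit shape controls). The function name is hypothetical.

```python
# Assumption: U+206A..U+206F are the deprecated format characters that
# guideline 5.12 refers to; the MEG itself does not list the code points.
DEPRECATED_FORMAT_CHARACTERS = {
    code_point: None for code_point in range(0x206A, 0x2070)
}


def remove_deprecated_format_characters(text: str) -> str:
    """Return the text with the deprecated format characters deleted."""
    return text.translate(DEPRECATED_FORMAT_CHARACTERS)


if __name__ == "__main__":
    cleaned = remove_deprecated_format_characters("12\u206e34")
    print(repr(cleaned))
```

A stricter service could reject the message instead of silently cleaning it, consistent with the note that services may reject messages that contravene “MUST” guidelines.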
 
 
Metadata
 

M1. Metadata

Element Description
Title Unicode and Internationalization
Identifier MEG0005
Category
Date Created 7/30/2006 6:18:00 PM
Last Updated 12/17/2008 7:53:00 PM
Publisher MISMO
Copyright ©2008 MISMO. All Rights Reserved.

M1.1 Release History

Release Date Version No. Comments
7/30/2006 0.01 Initial Version
10/20/2006 0.02 Follow up information
12/17/2008 0.03 Global MEG format changes; changed “Disclaimer” tab to “Terms” and updated content.

M1.2 Known Issues or Omissions

Additional review required

M1.3 Contacts

Name Organization Contact Details
<ENTER NAME> MISMO <ENTER CONTACT DETAILS>

M1.4 References

Title Location
RFC 2119 http://rfc.net/rfc2119.html
 
 
Purpose
 

P1. Purpose

P1.1 Introduction

As described in MEG0002, MISMO has defined a profile for XML within the context of Real Estate Finance. The MISMO Profile is based on the Unicode Character Set. As with many other standards, Unicode is a complex standard with multiple versions, multiple optional features and multiple divergent implementations. This MEG provides guidelines for the use of a profiled subset of Unicode’s features aimed at minimizing the interoperability problems that can result from this complexity.

Purpose

This MEG describes the use of Unicode within the MISMO Architecture. This MEG is concerned solely with the XML that flows between services that use MISMO transactions. It is not concerned with what goes on behind service boundaries, and places no restrictions on what XML features or technologies are used within the confines of a service. The XML files that MISMO publishes to define standards (XSD, XSLT etc.) will obey these guidelines.

P1.2 Audience

Software developers working with the MISMO architecture who are familiar with XML and the MISMO Interoperability Principles described in rig0012.

P1.3 Terminology

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119. See «add reference to MISMO glossary for terminology used in this MEG».

P1.4 Assumptions

The information supplied in this document reflects the MISMO interoperability principles at the time of writing. It is a living document, which will be updated as required to reflect the evolving nature of XML technologies and service requirements identified by the MISMO constituency. Comments on this document should be sent to the MISMO designated contact identified in the document preface.

P1.5 Rationale

Background

Historically, the only characters supported by computer systems were digits and unaccented English letters, each of which was represented with a number between 32 and 127 called its ASCII code.

Since all the ASCII characters could be stored in 7 bits, the eighth bit of each byte was available, which allowed the code range to be extended to 256 values. In the absence of a single global standard, multiple different coding systems were created to handle the values from 128 up, depending on country. These different systems (called code pages) resulted in interoperability problems because different languages used different code pages with different interpretations of the same high numbers. To resolve this problem, the Unicode standard was created to define a single character set, usable worldwide, that encompasses all language writing systems.
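The code page problem can be made concrete with a short sketch: the same byte value above 127 decodes to a different character under each legacy code page. The byte 0xE6 and the three ISO 8859 code pages are chosen here purely for illustration.

```python
# One byte above 127, interpreted under three different legacy code pages.
raw = b"\xe6"

interpretations = {
    name: raw.decode(name)
    for name in ("iso8859_1", "iso8859_2", "iso8859_5")
}

if __name__ == "__main__":
    for name, char in interpretations.items():
        # Print the Unicode code point each code page assigns to the byte.
        print(name, f"U+{ord(char):04X}", char)
```

Each code page maps the byte to a different Unicode code point, which is why a receiver that guesses the wrong code page displays the wrong letter.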

Unicode gives every letter of any writing system in the world a unique code number (called code point). Therefore, one page can contain various writing systems, which was previously not possible without switching codepages depending on language, like “Latin 1”, “Latin 2” or “Cyrillic”.

The majority of common-use characters fit into the first 65,536 (64K) code points, an area of the codespace that is called the basic multilingual plane, or BMP for short. There are about 6,300 unused code points for future expansion in the BMP, plus over 870,000 unused supplementary code points on the other planes. The Unicode Standard also reserves code points for private use. Unicode allows vendors or end users to assign these internally for their own characters and symbols, or to use them with specialized fonts.

Unicode and XML

The W3C’s Character Model for the World Wide Web describes its goal to:

“facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.”

Fundamental to this goal is the use of Unicode on the basis that it is:

  • The only universal character repertoire available,
  • Provides a way of referencing characters independent of the encoding of the text,
  • Being updated/completed carefully, and
  • Widely accepted and implemented by industry.

The W3C has adopted Unicode as the document character set for XML 1.0. W3C specifications and applications now use Unicode as the common reference character set.

Unicode and the MISMO Profile

As described in MEG0002, the MISMO Profile is based on the use of XML 1.0 and Unicode. The MISMO Profile seeks to retain conformance with these standards while excluding those aspects that could result in interoperability issues among the constituent services in the Real Estate Finance Industry.

This document describes the guidelines for use of Unicode within the context of MISMO based messages. Many of these guidelines originate from best practice as described in the W3C’s Unicode in XML and other Markup Languages, which should be read in conjunction with this MEG.

Note: The services may reject messages that contravene “MUST” guidelines described in this document.

Comments

None.

Terms

1. Terms and Conditions

1.1 Disclaimer

MISMO accepts no liability for the accuracy, adequacy, or completeness of the information contained in this MISMO Engineering Guideline (MEG).

1.2 Reproduction

Material in this MEG may be reproduced free of charge without obtaining explicit permission from MISMO, provided that the source is acknowledged, the document title given, and the material used in context.

1.3 Copyright

©2008 MISMO. All material in this MEG is the property of MISMO. All Rights Reserved.

 
 

Last modified at 8/13/2010 8:08 AM  by PROD-SPOINT\administrator