P1. Purpose
P1.1 Introduction
As described in MEG0002, MISMO has defined a profile for XML within the context of Real Estate Finance. The MISMO Profile is based on the Unicode Character Set. Like many other standards, Unicode is complex, with multiple versions, multiple optional features and multiple divergent implementations. This MEG provides guidelines for the use of a profiled subset of Unicode’s features, aimed at minimizing the interoperability problems that can result from this complexity.
Purpose
This MEG describes the use of Unicode within the MISMO Architecture. It is concerned solely with the XML that flows between services that use MISMO transactions. It is not concerned with what goes on behind service boundaries, and places no restrictions on what XML features or technologies are used within the confines of a service. The XML files that MISMO publishes to define standards (XSD, XSLT, etc.) will obey these guidelines.
P1.2 Audience
Software developers working with the MISMO architecture who are familiar with XML and the MISMO Interoperability Principles described in rig0012.
P1.3 Terminology
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119. See «add reference to MISMO glossary for terminology used in this MEG».
P1.4 Assumptions
The information supplied in this document reflects the MISMO interoperability principles at the time of writing. It is a living document, which will be updated as required to reflect the evolving nature of XML technologies and the service requirements identified by the MISMO constituency. Comments on this document should be sent to the MISMO designated contact identified in the document preface.
P1.5 Rationale
Background
Historically, the only characters supported by most computer systems were digits, unaccented English letters and common punctuation, each of which was represented by a number between 32 and 126 called its ASCII code.
Since all the ASCII codes could be stored in 7 bits, the 8th bit of each byte was available, which allowed the code range to be extended to 256 values (0 to 255). In the absence of a single global standard, multiple different coding systems were created to handle the values from 128 upwards, depending on country. These different systems (called code pages) resulted in interoperability problems because different languages used different code pages with different interpretations of the same high numbers. To resolve this problem, the Unicode standard was created to define a single character set, usable world-wide, that encompasses the writing systems of all languages.
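The code page problem described above can be illustrated with a short Python sketch. It decodes one and the same byte under three legacy 8-bit code pages and reports which character each one produces. The byte value (0xE9) and the three code pages are illustrative choices made for this example and are not taken from this MEG.

```python
import unicodedata

# One byte above 127. The value is an illustrative assumption.
RAW = bytes([0xE9])

# Legacy 8-bit code pages (Python codec names), chosen for illustration.
CODE_PAGES = ("latin-1", "iso8859_5", "cp437")


def decode_with_code_pages(raw: bytes) -> dict:
    """Return the text that each code page yields for the same bytes."""
    return {name: raw.decode(name) for name in CODE_PAGES}


results = decode_with_code_pages(RAW)
for name, text in results.items():
    # Print the Unicode code point and name so the output stays ASCII-only.
    print(f"0x{RAW[0]:02X} as {name}: U+{ord(text):04X} {unicodedata.name(text)}")
```

A receiver that guesses the wrong code page silently obtains a different character from the same byte, which is the interoperability problem that Unicode was created to remove.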
Unicode gives every letter of every writing system in the world a unique code number (called a code point). Therefore, one page can contain various writing systems, which was previously not possible without switching between language-specific code pages such as “Latin 1”, “Latin 2” or “Cyrillic”.
The majority of common-use characters fit into the first 65,536 code points (U+0000 to U+FFFF), an area of the codespace that is called the basic multilingual plane, or BMP for short. At the time of writing, there are about 6,300 unused code points for future expansion in the BMP, plus over 870,000 unused supplementary code points on the other planes. The Unicode Standard also reserves code points for private use. Unicode allows vendors or end users to assign these internally for their own characters and symbols, or to use them with specialized fonts.
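As a rough illustration of code points, planes and private-use assignments, the following Python sketch classifies a few sample characters. The helper name describe_code_point and the sample characters are hypothetical and exist only for this example. The private-use test relies on the Unicode general category "Co" reported by Python's standard unicodedata module.

```python
import unicodedata


def describe_code_point(ch: str) -> dict:
    """Summarize where a single character sits in the Unicode codespace."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,  # each plane holds 65,536 code points; plane 0 is the BMP
        "in_bmp": cp <= 0xFFFF,
        "private_use": unicodedata.category(ch) == "Co",
    }


# Sample characters (illustrative): a Latin letter, the euro sign,
# a supplementary-plane symbol, and a private-use code point.
for sample in ("A", "\u20ac", "\U0001F3E0", "\ue000"):
    print(describe_code_point(sample))
```

A character reported with private_use set to True has no meaning defined by the Unicode Standard itself, so its interpretation depends on a private agreement between sender and receiver.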
Unicode and XML
The W3C’s Character Model for the World Wide Web describes its goal as being to:
“facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.”
Fundamental to this goal is the use of Unicode, on the basis that it:
- Is the only universal character repertoire available,
- Provides a way of referencing characters independent of the encoding of the text,
- Is being updated/completed carefully, and
- Is widely accepted and implemented by industry.
The W3C has adopted Unicode as the document character set for XML 1.0. W3C specifications and applications now use Unicode as the common reference character set.
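Because Unicode is the document character set of XML 1.0, a character can be identified by its code point regardless of how the document bytes are encoded. The Python sketch below parses three byte-level representations of the same content and shows that they yield the same text. The element name FirstName and the sample value are hypothetical and are not taken from any MISMO schema. The sketch uses only the standard xml.etree.ElementTree module and does not imply which encodings the MISMO Profile permits.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample value containing U+00E9 (e with acute accent).
NAME = "Jos\u00e9"


def parse_text(document: bytes) -> str:
    """Parse an XML document given as bytes and return the root element's text."""
    return ET.fromstring(document).text


# The same character carried three ways: as UTF-8 bytes, as UTF-16 bytes,
# and as a numeric character reference in an ASCII-only document.
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><FirstName>' + NAME + "</FirstName>"
).encode("utf-8")
utf16_doc = (
    '<?xml version="1.0" encoding="UTF-16"?><FirstName>' + NAME + "</FirstName>"
).encode("utf-16")
ascii_doc = b'<?xml version="1.0" encoding="US-ASCII"?><FirstName>Jos&#xE9;</FirstName>'

for doc in (utf8_doc, utf16_doc, ascii_doc):
    # ascii() keeps the printed output ASCII-only on any console.
    print(len(doc), "bytes ->", ascii(parse_text(doc)))
```

The three byte sequences differ in length and content, yet a conforming parser reports the same sequence of Unicode characters for each of them.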
Unicode and the MISMO Profile
As described in MEG0002, the MISMO Profile is based on the use of XML 1.0 and Unicode. The MISMO Profile seeks to retain conformance with these standards while excluding those aspects that could result in interoperability issues among the constituent services in the Real Estate Finance Industry.
This document describes the guidelines for the use of Unicode within the context of MISMO-based messages. Many of these guidelines originate from best practice as described in the W3C’s Unicode in XML and other Markup Languages, which should be read in conjunction with this MEG.
Note: Services may reject messages that contravene the “MUST” guidelines described in this document.
Comments
None.
Terms
1. Terms and Conditions
1.1 Disclaimer
MISMO accepts no liability for the accuracy, adequacy, or completeness of the information contained in this MISMO Engineering Guideline (MEG).
1.2 Reproduction
Material in this MEG may be reproduced free of charge without obtaining explicit permission from MISMO, provided that the source is acknowledged, the document title given, and the material used in context.
1.3 Copyright
©2008 MISMO. All material in this MEG is the property of MISMO. All Rights Reserved.