Unicode input

The Unicode logo

Unicode input is the insertion of a specific Unicode character on a computer by a user; it is a common way to input characters not directly supported by a physical keyboard. Unicode characters can be inserted in three ways: from the screen by means of an applet from which one can select the character, by pasting from the operating system's clipboard, or by typing a certain sequence of keys on a physical keyboard. Unicode is similar to ASCII, but provides many more options and can store more signs.[1]

A Unicode input system needs to provide a large repertoire of characters, ideally all valid Unicode code points. This is different from a keyboard layout which defines keys and their combinations only for a limited number of characters appropriate for a certain locale.

KCharSelect picks some of Unicode Mathematical Operators

Unicode numbers

Unicode characters are distinguished by code points, which are conventionally represented by the letter U followed by four or five hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), containing modern scripts – including many Chinese and Japanese characters – and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, playing cards and many CJK characters) have 5-digit codes.

Decimal input

In some applications on Microsoft Windows, particularly those using the RichEdit control, decimal Unicode code points (for example, 256 for U+0100) are supported with Alt codes.

Unicode in HTML

HTML uses a different syntax for code points. Character codes may be specified after ampersand (&) and the number sign (#), and are followed by the semicolon (;). The number can be either in decimal or in hexadecimal. Preceding zeros may be omitted. If the input is in hexadecimal, the number is preceded by an "x". Some characters can also be used by "entity name".
Example: The HTML code of the copyright sign U+00A9 (or ©) can be:
© (decimal input)
© (hexadecimal input)
© (entity name)

Availability

The ability to input a Unicode character does not guarantee that it can be displayed; it can only be displayed when the application supports Unicode text and can access a font which contains a glyph for the character.[2] Very few fonts have full Unicode coverage; most only contain the glyphs needed to support a few writing systems and natural languages, at most.

Applications generally only access one font at a time for a given span of text, so when the current font does not support a particular character, the character will usually be shown as an empty box, a question mark or other generic replacement character, e.g. "�". This behavior was common in older web browsers and editors, but most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which aren't supported in the current font. Which fonts are used for fallback and the thoroughness of Unicode coverage varies by software and operating system; some software will search for a suitable glyph in all of the installed fonts, others only search within certain fonts.

Selection from a screen

Applet for character selection

Many systems provide a way to select Unicode characters visually. ISO 14755 refers to this as a screen-selection entry method.

Microsoft Windows has provided a Unicode version of the Character Map program (find it by hitting ⊞ Win+R then type charmap then hit ↵ Enter) since version NT 4.0 – appearing in the consumer edition since XP. This is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block. More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap).

macOS provides a "character palette" with much the same functionality, along with searching by related characters, glyph tables in a font, etc. It can be enabled in the input menu in the menu bar under System Preferences → International → Input Menu (or System Preferences → Language and Text → Input Sources) or can be viewed under Edit → Emoji & Symbols in many programs.

Equivalent tools – such as gucharmap (GNOME) or kcharselect (KDE) – exist on most Linux desktop environments.

Hexadecimal code input

Different glyphs of Unicode U+0061.

Clause 5.1 of ISO 14755 describes a Basic method whereby a beginning sequence is followed by the hex number representation of the code point and the ending sequence. On some systems, this is limited to the BMP (characters up to U+FFFF).

In Microsoft Windows

In order to enable a universal (independent of language settings) input method in Windows, one can add a string type (REG_SZ) value called EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and assign the value data 1 to it. Users need to log off/in on Windows 8.1/8.0, Windows 7, and Vista or reboot on earlier systems after editing the registry for this input method to start working. Unicode characters can then be entered by holding down Alt, pressing the + on the numeric keypad, followed by the hex code - using the numeric keypad for digits from 0 to 9 and letter keys for A to F digits - and then releasing Alt.[2]

Some individual Windows programs or apps already support the input of Unicode: for instance the RichEdit control on Microsoft Windows (as used in for example WordPad) supports the following input method: one first enters the character’s hexadecimal code (between two and six hexadecimal digits), then immediately presses Alt+x. For example, entering f1 and then pressing the combination will produce the character ñ. Unless it is six hexadecimal digits long, the code must not be preceded by any digit or letters a–f as they will be treated as part of the code to be converted. For example, entering af1 followed by Alt+x will produce ૱ (U+0AF1), but entering a0000f1 followed by Alt+x will produce a ñ. This also works in Microsoft Word 2002/2003 for Windows. In Microsoft Word 2007/2010 for Windows Alt+c has to be used.

In macOS

In macOS the "Emoji & Symbols" (Command+Ctrl+Space) menu can be found in the Edit menu in many programs. This brings up the Characters palette allowing the user to choose any character from a variety of views. The user can then also search for the character or Unicode plane by name.[3] In Mac OS 8.5 and later: one chooses the Unicode Hex Input keyboard layout; in OS X Yosemite, this can be added in Keyboard->Input Sources. Holding down the ⌥ Option, one then types the four-digit hex Unicode code point and the equivalent character appears. One can then release the ⌥ Option key.[4] Characters outside of the BMP exceed the four-digit limit of the Unicode hex input mechanism but can be entered using the search entry box in the Character Viewer (Edit → Emoji & Symbols) or by using surrogate pairs.[5] To use surrogate pairs, hold down the ⌥ Option key, the first surrogate, the + key (shift key is ignored), the second surrogate and then release the Option key.

In X11 (Linux and other Unix variants)

The possibility of hexadecimal code input on operating systems using the X Window System depends on the system and applications. Hex input is not implemented in the common X.Org Server.[6] Individual input methods and GUI toolkits can provide hex input independent of the X server.

For example, GTK+ is an ISO 14755-conformant system. The beginning sequence is Ctrl+⇧ Shift+U and the ending sequence is ↵ Enter or Space. Programs based on GTK+, such as GNOME applications, support Unicode input.

There are two common methods for direct input of Unicode characters:

In Inkscape, for example, only the second method works.

In non-GTK applications, however, there usually is no escape sequence to input arbitrary input characters. For example, Qt, KDE rely on the standard X Input Method (XIM) framework, and do not implement their own solutions.[7] In xterm, these input methods are not supported, but using escape sequences is an alternative. rxvt-unicode implements optional ISO 14755, enabled by default.

However, regardless of the toolkit used, the Compose key subsystem can be used to configure certain key stroke combinations to input a subset of unicode.

In platform-independent applications

The capability of Vim to create custom mnemonics, as described below, which could be employed on an ad-hoc basis, requires the decimal code point.

Character mnemonics

RFC 1345 defines a large number (1,893) of suggested mnemonics for code points in Unicode 1.0 (as well as characters in ISO 2DIS 10646 and many other character sets in use at the time of publication). Although the document does not restrict the length of a mnemonic (for example, "10000R" for U+2821), most (1,338) of the mnemonics are two characters long, and most (416) of the remaining are three-characters. While never complete, and targeting obsolescent set definitions, the mnemonics themselves can still be used.

RFC 1345 predates the introduction of the Euro sign (€, U+20AC), but the above applications included it as the mnemonic "Eu".

See also

External links

Wikibooks has a book on the topic of: Unicode/List of useful symbols

References

This article is issued from Wikipedia - version of the 11/16/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.