ARMENIAN CHARACTER SETS
IMPLEMENTATION GUIDE


Document version 006
June 30, 1999

Abstract

This document presents the set of Armenian characters used in information systems in accordance with the AST 34.001 standard of the State Standards Commission of the Republic of Armenia. It also describes the classification and sorting of Armenian characters and gives recommendations for implementing basic text-processing algorithms.

Table of Contents

1. Introduction
2. Basic Character Set

2.1. Naming
2.2. Classification and Sorting
2.3. Ligatures

3. Encoding

3.1. Basic Principles
3.2. Cross Reference of Coding Tables

4. Character Set and Language Tags

4.1. Coded Character Set Tags
4.2. Language Tags

5. Acknowledgements
6. Author's Address
7. References

1. Introduction

These comments on the standards are published for the following reasons:

1. Armenian character sets have been used in different computer systems since at least 1982, although a national standard was established only in 1997. This time lag resulted in the emergence of incompatible coding systems. Some of the existing discrepancies are also due to the existence of two different grammars of the Armenian language.

2. The emergence of internationalized operating systems and a large number of multilingual applications causes difficulties when national language support is implemented by programmers who are not familiar with Armenian.

The present memo is a recommendation rather than a binding standard.

The recommendations set forth herein are based on the national standard AST 34.001 (reg. no. 166-97), as well as the ArmSCII Version 2 standard.


2. Basic Character Set

2.1. Naming

The Armenian character set presented below follows the AST 34.001 standard. The first column of Table 1 contains the full name of each character, and the second column provides a short name that can be used in systems confined to the Latin character set. The detailed classification of the characters follows in section 2.2.

Although the space, digits and Latin script are also part of the Armenian character set, they were not included in the AST 34.001 standard because they are present in all systems.

Table 1. Basic Character Set

Full name   Short name
Armenian Eternity Sign armeternity
Armenian Ligature "ew" armew
Armenian Section Sign armsection
Armenian Full Stop (Verjaket) armfullstop
Armenian Right Parenthesis armparenright
Armenian Left Parenthesis armparenleft
Armenian Right Quotation Mark armquotright
Armenian Left Quotation Mark armquotleft
Armenian EM Dash armemdash
Armenian Dot (Mijaket) armdot
Armenian Separation Mark (But) armsep
Armenian Comma armcomma
Armenian EN Dash armendash
Armenian Hyphen (Yentamna) armyentamna
Armenian Ellipsis armellipsis
Armenian Apostrophe armapostrophe
Armenian Exclamation Mark (Amanak) armexclam
Armenian Accent (Shesht) armaccent
Armenian Question Mark (Paruyk) armquestion
Armenian Capital Letter [ayb] Armayb
Armenian Small Letter [ayb] armayb
Armenian Capital Letter [ben] Armben
Armenian Small Letter [ben] armben
Armenian Capital Letter [gim] Armgim
Armenian Small Letter [gim] armgim
Armenian Capital Letter [da] Armda
Armenian Small Letter [da] armda
Armenian Capital Letter [yech] Armyech
Armenian Small Letter [yech] armyech
Armenian Capital Letter [za] Armza
Armenian Small Letter [za] armza
Armenian Capital Letter [e] Arme
Armenian Small Letter [e] arme
Armenian Capital Letter [at] Armat
Armenian Small Letter [at] armat
Armenian Capital Letter [to] Armto
Armenian Small Letter [to] armto
Armenian Capital Letter [zhe] Armzhe
Armenian Small Letter [zhe] armzhe
Armenian Capital Letter [ini] Armini
Armenian Small Letter [ini] armini
Armenian Capital Letter [lyun] Armlyun
Armenian Small Letter [lyun] armlyun
Armenian Capital Letter [khe] Armkhe
Armenian Small Letter [khe] armkhe
Armenian Capital Letter [tsa] Armtsa
Armenian Small Letter [tsa] armtsa
Armenian Capital Letter [ken] Armken
Armenian Small Letter [ken] armken
Armenian Capital Letter [ho] Armho
Armenian Small Letter [ho] armho
Armenian Capital Letter [dza] Armdza
Armenian Small Letter [dza] armdza
Armenian Capital Letter [ghat] Armghat
Armenian Small Letter [ghat] armghat
Armenian Capital Letter [tche] Armtche
Armenian Small Letter [tche] armtche
Armenian Capital Letter [men] Armmen
Armenian Small Letter [men] armmen
Armenian Capital Letter [hi] Armhi
Armenian Small Letter [hi] armhi
Armenian Capital Letter [nu] Armnu
Armenian Small Letter [nu] armnu
Armenian Capital Letter [sha] Armsha
Armenian Small Letter [sha] armsha
Armenian Capital Letter [vo] Armvo
Armenian Small Letter [vo] armvo
Armenian Capital Letter [cha] Armcha
Armenian Small Letter [cha] armcha
Armenian Capital Letter [pe] Armpe
Armenian Small Letter [pe] armpe
Armenian Capital Letter [je] Armje
Armenian Small Letter [je] armje
Armenian Capital Letter [ra] Armra
Armenian Small Letter [ra] armra
Armenian Capital Letter [se] Armse
Armenian Small Letter [se] armse
Armenian Capital Letter [vev] Armvev
Armenian Small Letter [vev] armvev
Armenian Capital Letter [tyun] Armtyun
Armenian Small Letter [tyun] armtyun
Armenian Capital Letter [re] Armre
Armenian Small Letter [re] armre
Armenian Capital Letter [tso] Armtso
Armenian Small Letter [tso] armtso
Armenian Capital Letter [vyun] Armvyun
Armenian Small Letter [vyun] armvyun
Armenian Capital Letter [pyur] Armpyur
Armenian Small Letter [pyur] armpyur
Armenian Capital Letter [ke] Armke
Armenian Small Letter [ke] armke
Armenian Capital Letter [o] Armo
Armenian Small Letter [o] armo
Armenian Capital Letter [fe] Armfe
Armenian Small Letter [fe] armfe

2.2. Classification and Sorting

The basic character set can be divided into the following functional subsets:

unclassified-symbols ::= {armeternity, armew, armsection}

punctuation-signs ::= {armfullstop, armparenright, armparenleft, armquotright, armquotleft, armemdash, armdot, armsep, armcomma, armendash}

modifier-letters ::= {armyentamna, armellipsis, armapostrophe}

combining-punctuation ::= {armexclam, armaccent, armquestion}

letters ::= {capital-letters, small-letters}

capital-letters ::= {Armayb, Armben, Armgim, Armda, Armyech, Armza, Arme, Armat, Armto, Armzhe, Armini, Armlyun, Armkhe, Armtsa, Armken, Armho, Armdza, Armghat, Armtche, Armmen, Armhi, Armnu, Armsha, Armvo, Armcha, Armpe, Armje, Armra, Armse, Armvev, Armtyun, Armre, Armtso, Armvyun, Armpyur, Armke, Armo, Armfe}

small-letters ::= {armayb, armben, armgim, armda, armyech, armza, arme, armat, armto, armzhe, armini, armlyun, armkhe, armtsa, armken, armho, armdza, armghat, armtche, armmen, armhi, armnu, armsha, armvo, armcha, armpe, armje, armra, armse, armvev, armtyun, armre, armtso, armvyun, armpyur, armke, armo, armfe}
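The subsets above can be expressed as a lookup table. The following is only a sketch in Python: the names `STEMS`, `SUBSETS` and `classify` are illustrative and are not part of AST 34.001, and the short names are taken from Table 1.

```python
# Letter stems in Table 1 order. "Arm" + stem is the capital letter,
# "arm" + stem is the small letter.
STEMS = (
    "ayb ben gim da yech za e at to zhe ini lyun khe tsa ken ho dza ghat "
    "tche men hi nu sha vo cha pe je ra se vev tyun re tso vyun pyur ke o fe"
).split()

# Functional subsets as listed in section 2.2.
SUBSETS = {
    "unclassified-symbols": {"armeternity", "armew", "armsection"},
    "punctuation-signs": {
        "armfullstop", "armparenright", "armparenleft", "armquotright",
        "armquotleft", "armemdash", "armdot", "armsep", "armcomma",
        "armendash",
    },
    "modifier-letters": {"armyentamna", "armellipsis", "armapostrophe"},
    "combining-punctuation": {"armexclam", "armaccent", "armquestion"},
    "capital-letters": {"Arm" + stem for stem in STEMS},
    "small-letters": {"arm" + stem for stem in STEMS},
}


def classify(short_name):
    """Return the subset name for a Table 1 short name, or None if unknown."""
    for subset, members in SUBSETS.items():
        if short_name in members:
            return subset
    return None
```

For example, `classify("armaccent")` yields `"combining-punctuation"`, while a name that does not appear in Table 1 yields `None`.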

The sorting order is important for letter characters only and should follow the order presented in Table 1.
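One possible way to obtain the Table 1 order is sketched below in Python, assuming the text is held as Unicode code points (column 5 of Table 2). In Unicode the capital and small letters occupy two separate ranges, so a plain code point comparison would place all capitals before all small letters. The function names are illustrative, and skipping non-letter characters in the key is an assumption of this sketch rather than a requirement of the standard.

```python
def table1_rank(ch):
    """Position of a letter in Table 1 order, or None for non-letters."""
    cp = ord(ch)
    if 0x0531 <= cp <= 0x0556:          # capital letters Armayb..Armfe
        return 2 * (cp - 0x0531)
    if 0x0561 <= cp <= 0x0586:          # small letters armayb..armfe
        return 2 * (cp - 0x0561) + 1
    return None


def sort_key(word):
    """Sort key built from letters only (assumption: other characters are skipped)."""
    ranks = (table1_rank(ch) for ch in word)
    return [rank for rank in ranks if rank is not None]


letters = ["\u0562", "\u0531", "\u0561", "\u0532"]   # armben, Armayb, armayb, Armben
ordered = sorted(letters, key=sort_key)              # Armayb, armayb, Armben, armben
```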

Capitalization applies to letter characters only. Shifting from upper case to lower case replaces a capital-letter character with the character that follows it in Table 1. Conversely, shifting from lower case to upper case replaces a small-letter character with the character that precedes it in Table 1.
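In a coding table that keeps the Table 1 order, this rule reduces to adding or subtracting one. The sketch below uses the ArmSCII-8 codes from Table 2, where the letters occupy 0xB2 to 0xFD with capitals on even codes and small letters on odd codes. The function names are illustrative only.

```python
FIRST_LETTER = 0xB2   # Armayb in ArmSCII-8
LAST_LETTER = 0xFD    # armfe in ArmSCII-8


def is_letter(code):
    return FIRST_LETTER <= code <= LAST_LETTER


def to_lower(code):
    """Capital letters sit on even codes; the small letter is the next code."""
    if is_letter(code) and code % 2 == 0:
        return code + 1
    return code


def to_upper(code):
    """Small letters sit on odd codes; the capital letter is the previous code."""
    if is_letter(code) and code % 2 == 1:
        return code - 1
    return code


def lower_bytes(data):
    """Apply to_lower to every byte of an ArmSCII-8 encoded string."""
    return bytes(to_lower(b) for b in data)
```

Codes outside the letter range, such as armfullstop (0xA3) or armquestion (0xB1), are returned unchanged.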

Text search and dictionary applications should take into account the following factors: (1) in the Armenian language, a word is a sequence of letters, combining-punctuation, and modifier-letters; (2) in comparison of words in the text or dictionary, the combining-punctuation and modifier-letters may be ignored.

In reference to the combining-punctuation, the following factors are important: (1) the combining-punctuation mark follows the letter to which it applies (which can only be a vowel in Armenian), (2) a letter can be followed by more than one combining-punctuation mark.
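A word comparison that ignores combining-punctuation and modifier-letters might look like the following Python sketch. It assumes Unicode text and uses the code points from column 5 of Table 2. The names `IGNORABLE`, `strip_marks` and `same_word` are illustrative.

```python
# Code points taken from Table 2, column 5.
COMBINING_PUNCTUATION = {0x055C, 0x055B, 0x055E}   # armexclam, armaccent, armquestion
MODIFIER_LETTERS = {0x058A, 0x2026, 0x055A}        # armyentamna, armellipsis, armapostrophe
IGNORABLE = COMBINING_PUNCTUATION | MODIFIER_LETTERS


def strip_marks(word):
    """Remove characters that may be ignored when comparing words."""
    return "".join(ch for ch in word if ord(ch) not in IGNORABLE)


def same_word(a, b):
    """Compare two words, ignoring combining-punctuation and modifier-letters."""
    return strip_marks(a) == strip_marks(b)


plain = "\u056B\u0576\u0579"            # armini armnu armcha
marked = "\u056B\u055E\u0576\u0579"     # the same word with armquestion after the vowel
```

Here `same_word(plain, marked)` is true, because the question mark that follows the vowel is dropped before the comparison.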

2.3. Ligatures

A ligature is a traditional or convenient graphical presentation of a sequence of letters, e.g. the Latin ligature "fi", the German ligature "ss", the Armenian ligature "armmen+armnu", etc. Ligatures can be officially registered and codified (as in the UCS), but systems that support ligatures substitute them automatically only on output to the screen, printer, or other graphical devices.

The Armenian ligature armew, which is a combination of armyech and armvyun, was included in the AST 34.001 standard for the following reasons: (1) armew is a "ligature symbol" rather than a ligature, and (2) armew carries an "and" denotation similar to the "&" character.
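Because armew is a combination of armyech and armvyun, a search application may wish to treat the symbol and the letter pair as equivalent. The Python sketch below expands the symbol into the two letters using the Unicode code points from Table 2. Whether such an expansion is appropriate depends on the application, since this document treats armew as a symbol in its own right; the function name is illustrative.

```python
ARMEW = "\u0587"                     # Armenian ligature "ew"
ARMYECH_ARMVYUN = "\u0565\u0582"     # armyech followed by armvyun


def expand_armew(text):
    """Replace every armew with the letter pair armyech + armvyun."""
    return text.replace(ARMEW, ARMYECH_ARMVYUN)
```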


3. Encoding

3.1. Basic Principles

A coded character set is a mapping of a set of characters into a set of integers, e.g. the ArmSCII-7, ArmSCII-8 and ArmSCII-8A tables.

The term "unification" is used in the following sense: as a rule, an Armenian character set is mapped in operating environments where other character sets are already available; thus, certain characters, in particular punctuation marks, may have an identical graphical representation and similar functions. In such cases, some characters of the Armenian character set may be mapped into already existing coded characters. The details of the unification of Armenian punctuation marks are reviewed below.

The mapping of characters in coding tables has several aspects (in order of priority): (1) scope of the character mapping, (2) sequence of mapping, (3) character unification requirements, (4) general requirements of a given operating environment.

The encoding in every new operating environment should, to the extent possible, use the already existing coding tables (see the next section). Should this be impossible, the newly created coding tables should follow as much as possible the following general principles:

1. The Armenian character set should be comprehensive (with due regard to unification).

2. The Armenian character set should be mapped into a contiguous sequence of codes in the order in which the characters are presented in Table 1. The codes of unified characters should be left unassigned, i.e. should not be used for other purposes. The letter sequence is the most important.

3. Unification implies both graphical and functional identity of characters. For example, mapping the parentheses (armparenleft and armparenright) into the parentheses existing in ASCII is not an error. On the other hand, the similarity of the Armenian full stop (armfullstop) and the colon is purely graphical. The armdot and armsep characters have functions different from those of the Latin dot and the grave accent character, respectively. Another important factor in character unification is the use of the Latin alphabet and punctuation marks in formal languages. It should be borne in mind, for example, that a comma is often used as a separator in lists (e.g. in a keyword list in an HTML document header), and in order to avoid confusion, the armcomma character may be mapped into the Latin comma.

4. The requirements of a given operating environment may contradict the above principles. For example, the pseudo-graphical characters in DOS that were supported by video adapters (the "ninth pixel" factor) resulted in the creation of an alternative 8-bit coding table, ArmSCII-8A. Another example is Macintosh OS, where codes such as ellipsis, nbsp and soft hyphen are recognized and interpreted in a special way by numerous applications, which made meaningful use of the ArmSCII standard in this system impossible (the ArmSCII-8A table is used in Macintosh OS).

The ArmSCII coding table does not fully conform to the above principles, and the Armenian block in the current version of Unicode (2.1) conforms to none of principles (1), (2) and (3).

3.2. Cross Reference of Coding Tables

Table 2. Cross reference

1 - Short name
2 - ArmSCII-7
3 - ArmSCII-8
4 - ArmSCII-8A
5 - Unicode Version 2.1

  1 2 3 4 5
armeternity 21 A1 DC -
armew 22 A2 15 0587
armsection - - - 00A7
armfullstop 23 A3 3A 0589
armparenright 24 A4 29 0029
armparenleft 25 A5 28 0028
armquotright 26 A6 AF 00BB
armquotleft 27 A7 AE 00AB
armemdash 28 A8 2D 2014
armdot 29 A9 2E 002E
armsep 2A AA 60 055D
armcomma 2B AB 2C 002C
armendash 2C AC 5F 002D
armyentamna 2D AD DD 058A
armellipsis 2E AE DE 2026
armapostrophe 7E FE FE 055A
armexclam 2F AF 7E 055C
armaccent 30 B0 27 055B
armquestion 31 B1 DF 055E
Armayb 32 B2 80 0531
armayb 33 B3 81 0561
Armben 34 B4 82 0532
armben 35 B5 83 0562
Armgim 36 B6 84 0533
armgim 37 B7 85 0563
Armda 38 B8 86 0534
armda 39 B9 87 0564
Armyech 3A BA 88 0535
armyech 3B BB 89 0565
Armza 3C BC 8A 0536
armza 3D BD 8B 0566
Arme 3E BE 8C 0537
arme 3F BF 8D 0567
Armat 40 C0 8E 0538
armat 41 C1 8F 0568
Armto 42 C2 90 0539
armto 43 C3 91 0569
Armzhe 44 C4 92 053A
armzhe 45 C5 93 056A
Armini 46 C6 94 053B
armini 47 C7 95 056B
Armlyun 48 C8 96 053C
armlyun 49 C9 97 056C
Armkhe 4A CA 98 053D
armkhe 4B CB 99 056D
Armtsa 4C CC 9A 053E
armtsa 4D CD 9B 056E
Armken 4E CE 9C 053F
armken 4F CF 9D 056F
Armho 50 D0 9E 0540
armho 51 D1 9F 0570
Armdza 52 D2 A0 0541
armdza 53 D3 A1 0571
Armghat 54 D4 A2 0542
armghat 55 D5 A3 0572
Armtche 56 D6 A4 0543
armtche 57 D7 A5 0573
Armmen 58 D8 A6 0544
armmen 59 D9 A7 0574
Armhi 5A DA A8 0545
armhi 5B DB A9 0575
Armnu 5C DC AA 0546
armnu 5D DD AB 0576
Armsha 5E DE AC 0547
armsha 5F DF AD 0577
Armvo 60 E0 E0 0548
armvo 61 E1 E1 0578
Armcha 62 E2 E2 0549
armcha 63 E3 E3 0579
Armpe 64 E4 E4 054A
armpe 65 E5 E5 057A
Armje 66 E6 E6 054B
armje 67 E7 E7 057B
Armra 68 E8 E8 054C
armra 69 E9 E9 057C
Armse 6A EA EA 054D
armse 6B EB EB 057D
Armvev 6C EC EC 054E
armvev 6D ED ED 057E
Armtyun 6E EE EE 054F
armtyun 6F EF EF 057F
Armre 70 F0 F0 0550
armre 71 F1 F1 0580
Armtso 72 F2 F2 0551
armtso 73 F3 F3 0581
Armvyun 74 F4 F4 0552
armvyun 75 F5 F5 0582
Armpyur 76 F6 F6 0553
armpyur 77 F7 F7 0583
Armke 78 F8 F8 0554
armke 79 F9 F9 0584
Armo 7A FA FA 0555
armo 7B FB FB 0585
Armfe 7C FC FC 0556
armfe 7D FD FD 0586
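Table 2 can be turned into a small conversion routine. The Python sketch below converts ArmSCII-8 bytes to Unicode text using columns 3 and 5. It relies on two assumptions that are not stated in the table: bytes below 0x80 are treated as ASCII, and codes without a Unicode equivalent in Table 2 (such as armeternity, 0xA1) are replaced by U+FFFD. The function and table names are illustrative.

```python
# Non-letter rows of Table 2: ArmSCII-8 code -> Unicode code point.
PUNCTUATION = {
    0xA2: 0x0587, 0xA3: 0x0589, 0xA4: 0x0029, 0xA5: 0x0028,
    0xA6: 0x00BB, 0xA7: 0x00AB, 0xA8: 0x2014, 0xA9: 0x002E,
    0xAA: 0x055D, 0xAB: 0x002C, 0xAC: 0x002D, 0xAD: 0x058A,
    0xAE: 0x2026, 0xAF: 0x055C, 0xB0: 0x055B, 0xB1: 0x055E,
    0xFE: 0x055A,
}


def armscii8_to_unicode(data):
    """Convert ArmSCII-8 encoded bytes to a Unicode string."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                 # assumption: lower half is ASCII
        elif 0xB2 <= b <= 0xFD:
            # Letters alternate capital/small in ArmSCII-8, while Unicode
            # keeps capitals at 0531.. and small letters at 0561..
            index, small = divmod(b - 0xB2, 2)
            out.append(chr((0x0561 if small else 0x0531) + index))
        elif b in PUNCTUATION:
            out.append(chr(PUNCTUATION[b]))
        else:
            out.append("\ufffd")               # assumption: unmapped code
    return "".join(out)
```

For example, the bytes D0 B3 DB (Armho, armayb, armhi) convert to the code points 0540, 0561 and 0575.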

 


4. Character Set and Language Tags

4.1. Coded Character Set Tags

In systems and protocols that use mnemonic tags for coded character sets, the following tags should be used (name, official source):

Name:   armscii-8
Source:   Armenian Standard Code for Information Interchange, 8-bit coded character set
     
Name:   armscii-8a
Source:   Armenian Standard Code for Information Interchange, alternative 8-bit coded character set

4.2. Language Tags

Dictionaries, spelling checkers and other linguistic systems, as well as operating environments that distinguish human languages for locale identification, should take into consideration the existence of four mutually incomprehensible forms (dialects) of the Armenian language: Eastern Armenian, Western Armenian, Grabar and Middle Armenian. Table 3 presents suggested MIME-style (RFC-1766) mnemonic tags.

Table 3. Language tags

MIME-style name   Full name
hy-eastern   Eastern Armenian
hy-western   Western Armenian
hy-grabar   Grabar
hy-middle   Middle Armenian
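An application could keep Table 3 as a simple lookup. The Python sketch below is only an illustration: the names `LANGUAGE_TAGS` and `full_name` are not defined by this document. The lookup ignores case because RFC 1766 treats language tags as case insensitive.

```python
# Suggested tags from Table 3.
LANGUAGE_TAGS = {
    "hy-eastern": "Eastern Armenian",
    "hy-western": "Western Armenian",
    "hy-grabar": "Grabar",
    "hy-middle": "Middle Armenian",
}


def full_name(tag):
    """Return the full language name for a Table 3 tag, or None if unknown."""
    return LANGUAGE_TAGS.get(tag.lower())
```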

5. Acknowledgements

This document is the result of long and intensive consultations and cooperation with the staff of the Standards Working Group of the Armenian Computer Center. Special thanks for most valuable inputs and comments go to (in alphabetical order):

Vahram Mekhitarian (vm@acc.am)
Aram Hayrapetian (aramhayr@hotmail.com)
Hovhannes Gizoghian (hkizogh@acc.am)
Tigran Haroutunian (nt1@noyan-tapan.am)
Rouben Taroumian-Hakobian (tarumian@acc.am)
Michael Everson (everson@indigo.ie)


6. Author's Address

Hovik Melikyan
ArmSCII Working Group
Yerevan, Republic of Armenia
hovik@moon.yerphi.am


7. References

[AST 34.001-97]

Information Technologies -- Character Set And Information Encoding: Character Set -- State Standardization Committee of the Republic of Armenia, July 1997

[ArmSCII]

Armenian Standard Code for Information Interchange -- Center of Humane Technologies "Armenian Computer", June 1991

[ArmSCII Version 2]

Armenian Standard Code for Information Interchange, Version 2 -- ArmSCII Working Group, May 1999

[RFC-1766]

Alvestrand, H., "Tags for the Identification of Languages", RFC 1766, March 1995.

[Unicode]

The Unicode Consortium, "The Unicode Standard -- Version 2.0", Addison-Wesley, 1996.

[Unicode Version 2.1]

Unicode Technical Report #8, The Unicode Standard, Version 2.1 -- http://www.unicode.org/unicode/reports/tr8.html