From: Andras Kornai <andras@calera.com>
Subject: Re: One more kermit question
To: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Date: Thu, 11 Mar 93 21:33:45 PST

----------------------------------------------------------------------
CYRILLIC ENCODING FAQ Version 1.3, March 13 1993

ACKNOWLEDGEMENTS Most of the information was provided by the following:

David J. Birnbaum <djbpitt+@pitt.edu>
Frank da Cruz <fdc@watsun.cc.columbia.edu>
Bur Davis <bdavis@adobe.com>
George Fowler <gfowler@ucs.indiana.edu>
Richard B. Paine <RPAINE@CCNODE.Colorado.EDU>
Slava Paperno <PAPY@CORNELLA.cit.cornell.edu>
Keld J. Simonsen <Keld.Simonsen@dkuug.dk>
Glenn E. Thobe <thobe@getunx.info.com>
Dimitri Vulis <DLV@CUNYVMS1.BITNET>
Johan W. van Wingen <precal@rulmvs.leidenuniv.nl>


Thanks to all who contributed -- I am responsible for the errors that
still remain. 

Andras Kornai (andras@calera.com, kornai@csli.stanford.edu)


Q: What are the commonly used computer encodings for Cyrillic?  
A: Broadly speaking, there are three kinds of schemes in use: those that
replace Cyrillic characters by 7-bit ascii values, those that use the
full 8-bit range 0-255, and those using multi-byte codes.  Presently
only the first two types are in wide use, but for reference purposes I
will also discuss the third type.


Q: What kind of transliteration schemes are there?  
A: The most important one is called KOI-7: the Russian alphabet is given
by the ASCII characters (note the exchange of upper and lower cases):

UPPER CASE:  abwgde$vzijklmnoprstufhc~{}"yx|`q
lower case:  ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\@Q

The following extensions to the official standard KOI-7 are supported in
Glenn Thobe's conversion programs for invertibility: '"'=YER, '#'=yo,
'$'=YO, '<'=left guillemet, '>'=right guillemet.
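
As a minimal sketch (not taken from any of the programs mentioned here), the
two strings above can be turned into a lookup routine that maps a KOI-7
character to its Unicode value. The function names are invented for this
illustration, the Unicode values are the ones listed later in this FAQ, and
the extended characters '$', '#' and '"' are treated as letters, as in the
extensions just described.

```c
#include <string.h>

/* The two KOI-7 alphabets in Russian alphabetical order, A ... YA,
   with YO in seventh place (index 6). */
static const char koi7_upper[] = "abwgde$vzijklmnoprstufhc~{}\"yx|`q";
static const char koi7_lower[] = "ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\\@Q";

/* Map alphabet position i (0..32) to Unicode, given the code of A
   and the code of YO for the case in question. */
static long letter(int i, long a, long yo)
{
        if (i == 6) return yo;
        return a + (i < 6 ? i : i - 1);
}

/* Hypothetical helper: Unicode value of KOI-7 character c.
   Anything that is not a KOI-7 letter is returned unchanged. */
long koi7tou(int c)
{
        const char *p;

        if (c <= 0 || c > 127) return c;
        if ((p = strchr(koi7_upper, c)) != NULL)
                return letter((int)(p - koi7_upper), 0x0410, 0x0401);
        if ((p = strchr(koi7_lower, c)) != NULL)
                return letter((int)(p - koi7_lower), 0x0430, 0x0451);
        return c;
}
```

Digits, blanks and other punctuation pass through untouched, so the routine
can be applied character by character to a whole KOI-7 text.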

A slightly different (multicharacter) scheme is employed by Steve
Gaarder's (gaarder@theory.tc.cornell.edu) conversion code from Old
KOI-8, included below. This particular scheme provides easy
readability but suffers from some transliteration weirdness, such as
mapping short ii and yeri on the same character. Since proper
transliteration often requires context-sensitive rules, and differs
from language to language within the same script, a fuller discussion
is beyond the scope of the present document. For an overview of the
major Cyrillic to Latin transliteration schemes used in the US, see pp
457-460 of the Style Manual of the US Government Printing Office, for
sale by the Superintendent of Documents, USGPO, Washington DC 20402,
Stock Number 021-000-00120-1 (paper) or 021-000-00120-0 (hardbound).
See also the Chicago Manual of Style, and Transliteracija russkikh
slov latinskimi bukvami, GOST 16876-71.


#include <stdio.h>

/* Latin equivalents of Old KOI-8 codes 0xc0-0xff: 32 lower-case letters
   followed by the same 32 letters in upper case */
char transtbl[64][5] =
        {"yu", "a", "b", "ts", "d" , "e", "f", "g", "kh", "i", "y" , "k", "l",
        "m", "n", "o", "p", "ya", "r" , "s", "t", "u", "zh", "v", "'",
        "y", "z", "sh", "e", "shch", "ch", "`", 
        "YU", "A", "B", "TS", "D" , "E", "F", "G", "KH", "I", "Y" , "K", "L",
        "M", "N", "O", "P", "YA", "R" , "S", "T", "U", "ZH", "V", "'",
        "Y", "Z", "SH", "E", "SHCH", "CH", "`" };

int main(void)
{
        int c;

        while ((c = getchar()) != EOF)
        {       if (c >= 0xc0) printf("%s", transtbl[c - 0xc0]); /* Cyrillic */
                else if (c >= 0x80) putchar(c - 0x80); /* strip the top bit */
                else putchar(c);                       /* ascii, unchanged */
        }
        return 0;
}


Q: What are the eight-bit schemes?  

A: For the IBM mainframe world, which includes the ES (edinaja sistema)
clones of 360-370 mainframes, the basic scheme, called DKOI-8, extends
EBCDIC by putting the Cyrillic letters in the unused slots, mostly in
the rectangle 0x8a to 0xff (first hex digit >=8, second digit >=a). The
mysteries of EBCDIC/ASCII conversion go beyond the scope of this
document, and in the table that follows I will ignore 8-bit ascii values
below 0xa0 and refer the reader to Dimitri Vulis' excellent document,
which sheds some light on the IBM meaning of the characters 0x80-0x9f
which are reserved in both ISO 8859-1 (Latin-1) and 8859-5 (Cyrillic).

/* From 8859-5 to DKOI-8. ebcdic(isoval) = isotoibm[isoval-160] */

int isotoibm[96] = {
0x41,0xaa,0x4a,0xb1,0x9f,0xb2,0x6a,0xb5,
0xbd,0xb4,0x9a,0x8a,0x5f,0xca,0xaf,0xbc,
0x90,0x8f,0xea,0xfa,0xbe,0xa0,0xb6,0xb3,
0x9d,0xda,0x9b,0x8b,0xb7,0xb8,0xb9,0xab,
0x64,0x65,0x62,0x66,0x63,0x67,0x9e,0x68,
0x74,0x71,0x72,0x73,0x78,0x75,0x76,0x77,
0xac,0x69,0xed,0xee,0xeb,0xef,0xec,0xbf,
0x80,0xfd,0xfe,0xfb,0xfc,0xad,0xae,0x59,
0x44,0x45,0x42,0x46,0x43,0x47,0x9c,0x48,
0x54,0x51,0x52,0x53,0x58,0x55,0x56,0x57,
0x8c,0x49,0xcd,0xce,0xcb,0xcf,0xcc,0xe1,
0x70,0xdd,0xde,0xdb,0xdc,0x8d,0x8e,0xdf
};
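
Going the other way requires inverting the table. The sketch below assumes
only that the forward table has the shape of isotoibm above (96 entries for
ISO values 160-255); the function name invert96 and the use of -1 for EBCDIC
positions that no ISO value maps to are choices made for this illustration.

```c
/* Build the inverse of a 96-entry table such as isotoibm:
   inv[ebcdic] = isoval, or -1 where no ISO value 160..255 maps to
   that EBCDIC position.  Returns 0 on success, or -1 if the forward
   table contains a value outside 0..255 or maps two ISO values to
   the same code (in which case it cannot be inverted). */
int invert96(const int fwd[96], int inv[256])
{
        int i;

        for (i = 0; i < 256; i++) inv[i] = -1;
        for (i = 0; i < 96; i++)
        {       int e = fwd[i];
                if (e < 0 || e > 255 || inv[e] != -1) return -1;
                inv[e] = i + 160;
        }
        return 0;
}
```

A return value of -1 is a cheap way of catching typing errors in a table
that is supposed to be one-to-one.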

There are minor variations to DKOI, called Cyrillic Extended Code Page
037 (most common on BITNET), CECP 500 (which is the definitive one), the
"JNET" and the "FORTRAN" mappings. The differences between these are
tabulated below. Notice that EBCDIC/DKOI, unlike ASCII, is not uniquely
defined even on the 0-127 range:


8859-5  037   500   JNET  FORTRAN

0x21    0x5a  0x4f  0x5a  0x4f     exclamation point (bang)
0x5b    0xba  0x4a  0xad  0x4a     opening square bracket
0x5d    0xbb  0x5a  0xbd  0x5a     closing square bracket
0x5e    0xb0  0x5f  0x5f  0x5f     circumflex accent
0x7c    0x4f  0xbb  0x6a  0x4f     logical or (vertical bar)
[a2]    0x4a  0xb0  0x43  0x43     cent sign (in 037)/capital dje (in 500)
[ac]    0x5f  0xba  0x54  0x54     logical not (in 037)/capital kje (in 500)
0xd5    0xef  0xef  0xbb  0xad     small ie
0xe3    0x46  0x46  0x4a  0xbb     small u
0xe5    0x47  0x47  0xfc  0xbd     small kha
0xfc    0xdc  0xdc  0x6a  0xfc     small kje
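
One way to cope with these variants is to keep a single base table and apply
the differences as overrides. The following sketch copies the first five rows
of the table above (the ones in the 0-127 range); the enum, struct and
function names are invented here, and the base EBCDIC value is assumed to be
supplied by the caller.

```c
enum variant { CECP037, CECP500, JNET, FORTRAN, NVARIANTS };

struct override { int iso; int ebcdic[NVARIANTS]; };

/* The five rows of the table above that lie in the 0-127 range. */
static const struct override overrides[] = {
        { 0x21, { 0x5a, 0x4f, 0x5a, 0x4f } },  /* exclamation point */
        { 0x5b, { 0xba, 0x4a, 0xad, 0x4a } },  /* opening square bracket */
        { 0x5d, { 0xbb, 0x5a, 0xbd, 0x5a } },  /* closing square bracket */
        { 0x5e, { 0xb0, 0x5f, 0x5f, 0x5f } },  /* circumflex accent */
        { 0x7c, { 0x4f, 0xbb, 0x6a, 0x4f } }   /* vertical bar */
};

/* Return the variant-specific EBCDIC code for ISO value iso, or
   fallback (the caller's base-table value) if iso is not one of the
   positions listed in overrides. */
int variant_code(int iso, enum variant v, int fallback)
{
        unsigned i;

        for (i = 0; i < sizeof overrides / sizeof overrides[0]; i++)
                if (overrides[i].iso == iso) return overrides[i].ebcdic[v];
        return fallback;
}
```

The remaining rows of the table could be added to the overrides array in the
same way.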




For the Internet, the most important code seems to be Old KOI-8, widely used
in the Relcom groups (but probably not a whole lot elsewhere). Old KOI-8
(GOST 19768-74) from 1974 more or less follows Latin transliteration order
and does not include upper-case hard sign, or letters common to other Slavic
Cyrillic alphabets (Bulgarian, Macedonian, Serbian, Ukrainian...).  In the
0-127 range it is identical with ascii, and for the 192-254 region see the
transtbl array above.  Some software, including uunpack (also used in
Sergej Ryzhkov's bml, aka Beauty Mail system for PCs) which is distributed
by Relcom, forces upper-case hard sign to 255; others (and the standard!)
declare this incorrect, or perhaps reserve 255 for DEL.  In an earlier
version of Andrew Hume's <andrew@research.att.com> tcs, which supports
conversion across a wide variety of Cyrillic encodings, this was called the
"mystery DOS Cyrillic encoding", except that his sha and shcha seem to be
interchanged. Tcs is available for anon ftp from research.att.com as
/dist/tcs.shar.Z. The semantics of 128-191 in Old KOI are unclear
to me. If there is an official code page (it was suggested that Xenix users
might have one), please post it.
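
Comparing the KOI-7 strings given earlier with the Old KOI-8 table later in
this FAQ shows that each Russian letter in the upper range sits exactly 128
above its KOI-7 counterpart, so clearing the top bit turns Old KOI-8 into
KOI-7. A minimal sketch (the function name is invented for illustration):

```c
/* Convert one Old KOI-8 byte to KOI-7 by clearing the top bit of
   the Russian letters (192..255).  Other bytes are returned as they
   are.  Note that the result is ambiguous if the text also contains
   Latin letters, which occupy the same KOI-7 positions, and that
   255 becomes 127 (DEL). */
int oldkoi8_to_koi7(int c)
{
        if (c >= 0xc0 && c <= 0xff) return c & 0x7f;
        return c;
}
```

This is also why Old KOI-8 text remains more or less readable after passing
through a mail gateway that strips the eighth bit.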

For the PC community, Code Page 866 seems to be quite important. This is
what Microsoft is using in its russified version of MS-DOS. In 0-31
ascii control chars are replaced by a random selection of dingbats. In
32-126 it is identical to ascii, and in 127 it has something that looks
like a little house (the interpretation of such positions seems to be
subject to much uncertainty). The Russian part (128-255) is identical to
Brjabrin's alternativnyj variant, except for 242-251, where some of the
accents/symbols of AV are replaced by non-Russian Cyrillic characters
and other symbols. Unfortunately CP 866 covers only Ukrainian and
Belorussian, with the vague suggestion that e.g.  Macedonian users could
redefine the six non-Russian Cyrillic positions.  This problem is
largely resolved in Code Page 1251, the Microsoft Cyrillic Windows 3.1
character set (also endorsed by WordPerfect and Adobe), which contains
all Cyrillic letters used by modern Slavic languages. CP 1251 is fully
compatible with ascii on 0-127 (leaves control positions undefined), has
the Russian alphabet (in order, but without io) in 192-255, and puts the
non-Russian Cyrillic, Russian io, and a few symbols in 128-191.
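
Because CP 1251 keeps the Russian alphabet in order, the Russian letters can
be converted to Unicode by simple arithmetic rather than by table lookup. The
sketch below is illustrative only: the function name is invented, the
positions of io (168 and 184) are taken from the CP1251 conversion table
later in this FAQ, and all other codes in 128-191 are left to that table.

```c
/* Unicode value of a Russian letter in CP 1251, or -1 if c is not
   one of the 66 Russian letter codes.  (Here -1 only means "not a
   Russian letter"; it says nothing about the code being unused.) */
long cp1251_russian(int c)
{
        if (c >= 192 && c <= 255) return 0x0410 + (c - 192);
        if (c == 168) return 0x0401;    /* capital io */
        if (c == 184) return 0x0451;    /* small io */
        return -1;
}
```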

Brjabrin's Alternativnyj Variant (AV) is also widely used on PCs.  It
has Russian in 128 to 175 in alphabetical order except for yo, graphics
characters in 176 to 223, again Russian in 224-241. The same set of
graphics characters, but not in the same order, is used in Brjabrin's
Osnovnoj Variant: they are similar to, but not identical with, IBM
Extended ASCII graphics chars (neither the set of shapes nor the code
values are the exact same). AV and OV have no non-Russian Cyrillic or
accented characters, but four accent marks are provided: 242 (acute
below the symbol), 243 (grave below the symbol), 244 (acute above the
symbol), and 245 (grave above the symbol). These, as well as upper case
and lower case yo, codes 240 and 241, are in the same position in
Osnovnoj Variant as well. Codes 246 - 249 are arrows, pointing right,
left, down, up, in that order.  Codes 250 and 251 are, in both sets
described by Briabrin, the division sign and the plus/minus sign (the
latter becomes a radical sign in 866). 252 is the Number symbol, 253 is
a sunburst, and 254 is "end of proof". 255 is in principle unused -- in
practice people put things there.
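
Since both AV and CP 1251 keep Russian in alphabetical order, converting the
letters between them is a matter of adding offsets. A hedged sketch follows;
the function name is invented, the CP 1251 positions of yo (168 and 184) are
taken from the CP1251 table later in this FAQ, and nothing is attempted for
the non-letter codes.

```c
/* Convert one byte of Alternativnyj Variant text to CP 1251.
   Russian letters are moved to their CP 1251 positions; bytes
   below 128 are ascii in both sets; everything else (graphics,
   accents, arrows, symbols, out-of-range values) is reported as -1
   and left for the caller to deal with. */
int av_to_cp1251(int c)
{
        if (c >= 0 && c < 128) return c;
        if (c >= 128 && c <= 175) return c + 64;  /* cap A .. small pe */
        if (c >= 224 && c <= 239) return c + 16;  /* small er .. small ja */
        if (c == 240) return 168;                 /* capital yo */
        if (c == 241) return 184;                 /* small yo */
        return -1;
}
```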

For the academic community, the lack of accents is remedied by the
Academic version of AV developed at Cornell, which includes upper and
lower case acute-accented vowels, and lower case grave-accented vowels.
These replace all but six of the graphics characters (the six that were
retained are those that are necessary for drawing a single-line box).
The accented vowels in this set include a grave-accented lower case yo.
Also included are the letters with diacritics used in French, German,
and Spanish. The complete chart and DOS/Windows software may be
requested from Exceller Software Corp.  800-426-0444. (This is NOT a
product endorsement -- I haven't even seen the stuff!) Cornell also
developed an Academic version of CP1251.  In this, non-Russian Slavic
languages are not supported: their letters have been replaced by Russian
accented vowels.  These include upper and lower case acute-accented
vowels, and lower case grave-accented vowels. Also included are upper
and lower case grave-accented yo. The AcademicFont Cyrillic character
set was developed by University Microcomputers, who pioneered the use of
Slavic languages on IBM-compatible computers in the US in the
mid-eighties. This set is included among the 11 sets in Exceller's
product. It supports Slavic and some non-Slavic languages, but not
accented vowels.

For the Macintosh community, there is a separate code page. It is ascii
below 128, has the Russian capital letters in 128-159 in alphabetical order
(as usual, io is treated separately) and the Russian lowercase letters in
224-254, but lower case ja is moved to 223, its place at 255 taken by the
sunburst symbol. In the 160-222 range we find the same set of (ISO 8859-5)
non-Russian Cyrillic characters as in CP 1251. The symbols that appear here
are also largely the same as in 1251, but the orderings are completely
different and a few symbols are unique to one or the other, e.g.  permille
in 1251, capital delta in the Mac encoding.  While a Macintosh version
capable of character conversion is still on the drawing boards, for most
other platforms Columbia Kermit is capable of converting between a large
variety of Cyrillic encodings.  Anon ftp to watsun.cc.columbia.edu: for
C-Kermit 5A(188) (Unix, VMS, OS/2, Amiga etc) get file kermit/b/ckaaaa.hlp,
read it, take it from there. For MS-DOS Kermit 3.11, get (in binary mode)
kermit/bin/msvibm.zip, then unzip. For IBM Mainframe Kermit 4.2 and later,
get kermit/b/ik0*.* plus one of the following: kermit/b/ikc*.* for VM/CMS,
kermit/b/ikt*.* for MVS, kermit/b/ikx*.* for CICS or kermit/b/ikm*.* for
MUSIC. There is also a large collection of character-set tables under
kermit/charsets.

Finally, the most broadly accepted standard outside these communities seems
to be GOSTSCI (GOSTCII), a term used colloquially to refer to Brjabrin's
Osnovnoj Variant or to ISO 8859-5 (which is also ECMA 113), although these
two are not identical when it comes to non-Russian Cyrillic. The term "New
KOI-8" means the 1987 revision of KOI-8 (GOST 19768-87) -- all these use the
same (alphabetical, except for yo) order as 8859/5, starting with A at 176.
However, the non-Russian Cyrillic characters (160-175 and 240-255 in new
KOI-8) are not part of OV, their space is taken up by some graphics chars
described for AV above. ISO 8859-5 provides for the Cyrillic characters
required for writing all major Slavic Cyrillic alphabets (Belorussian,
Bulgarian, Macedonian, Serbian, Ukrainian...), but not for those alphabets
that were devised for non-Slavic languages in the Soviet Union (Abkhazian,
Bashkir, Chukchee, Khanty, Tajik, ....), or archaic letters.


Q: Is this a big mess or what?
A: To straighten this out, it seems necessary to adopt a fixed point of
reference, which I take to be Unicode V1.1 = ISO 10646-1.2. While in
principle 10646 is a four-byte standard and Unicode uses 16-bit integers,
the "Basic Multilingual Plane" of 10646 is by definition identical to the
values assigned in Unicode 1.1, both being two-byte quantities (called UCS-2
by ISO). The following list gives the essential part of the names of the
Cyrillic characters and the last two hex digits of their Unicode/10646
encoding.

For reasons of space, the official Unicode/10646 names have been
abbreviated. For a full list of names, anon ftp to unicode.org, cd to
pub/MappingTables, and get namesall.lst (which is slightly over 200k).  To
get back the full official name from the abbreviations, always add the
prefix CYRILLIC, unless the position is UNUSED. Further, expand CAP (SMA) to
CAPITAL (SMALL). Finally, the word LETTER should be added after CAP/SMA,
unless it is THOUSANDS, LIGATURE, or NON-SPACING.  The numerical code values
given in the second column have also been abbreviated to the last two
digits, since the preceding two hex digits (really signifying "Cyrillic")
are always 04 in Unicode/10646.
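
The expansion rules are mechanical enough to write down as code. The
following is only a sketch of the rules as stated above; the function names
are invented for illustration, and no attempt is made to check the result
against the official name list.

```c
#include <stdio.h>
#include <string.h>

/* Expand an abbreviated name from the list into out (size n),
   following the rules above.  Returns out. */
char *expand_name(const char *abbr, char *out, size_t n)
{
        const char *rest;
        const char *size;

        if (strncmp(abbr, "UNUSED", 6) == 0)
        {       snprintf(out, n, "UNUSED");
                return out;
        }
        if (strncmp(abbr, "CAP ", 4) == 0) size = "CAPITAL";
        else if (strncmp(abbr, "SMA ", 4) == 0) size = "SMALL";
        else
        {       /* no CAP/SMA: only the prefix is added */
                snprintf(out, n, "CYRILLIC %s", abbr);
                return out;
        }
        rest = abbr + 4;
        if (strncmp(rest, "LIGATURE", 8) == 0)
                snprintf(out, n, "CYRILLIC %s %s", size, rest);
        else
                snprintf(out, n, "CYRILLIC %s LETTER %s", size, rest);
        return out;
}

/* Full Unicode value from the two hex digits in the second column. */
long full_code(int last_two)
{
        return 0x0400 + last_two;
}
```

For instance, "SMA IO" with code 51 expands to the name CYRILLIC SMALL
LETTER IO and the value 0x0451.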

The third column gives the two-character mnemonic abbreviations suggested in
Keld Simonsen's RFC1345 where they exist, to facilitate cross-reference to
this document (available by anon ftp e.g. from sunsite.unc.edu as
/pub/doc/rfp/rfp1345.txt.Z) which has tables for Serbian, Macedonian, as
well as other Cyrillic encodings (IBM CP 880, INIS-cyrillic = ISO-IR-51,
ECMA-cyrillic = ISO-IR-111) whose domain of usage is unclear to me, and
whose table for Old KOI seems to be in fact a New KOI table. I will add
conversion tables for these (or for any other) encodings provided a real
user community exists and actually generates some public domain
machine-readable texts.

UNUSED				00
CAP IO				01  IO
CAP DJE				02  D%
CAP GJE				03  G%
CAP E				04  IE
CAP DZE				05  DS
CAP I				06  II
CAP YI				07  YI
CAP JE				08  J%
CAP LJE				09  LJ
CAP NJE				0A  NJ
CAP TSHE			0B  Ts
CAP KJE				0C  KJ
UNUSED				0D
CAP SHORT U			0E  V%
CAP DZHE			0F  DZ
CAP A				10  A=
CAP BE				11  B=
CAP VE				12  V=
CAP GE				13  G=
CAP DE				14  D=
CAP IE				15  E=
CAP ZHE				16  Z%
CAP ZE				17  Z=
CAP II				18  I=
CAP SHORT II			19  J=
CAP KA				1A  K=
CAP EL				1B  L=
CAP EM				1C  M=
CAP EN				1D  N=
CAP O				1E  O=
CAP PE				1F  P=
CAP ER				20  R=
CAP ES				21  S=
CAP TE				22  T=
CAP U				23  U=
CAP EF				24  F=
CAP KHA				25  H=
CAP TSE				26  C=
CAP CHE				27  C%
CAP SHA				28  S%
CAP SHCHA			29  Sc
CAP HARD SIGN			2A  ="
CAP YERI			2B  Y=
CAP SOFT SIGN			2C  %"
CAP REVERSED E			2D  JE
CAP IU				2E  JU
CAP IA				2F  JA
SMA A				30  a=
SMA BE				31  b=
SMA VE				32  v=
SMA GE				33  g=
SMA DE				34  d=
SMA IE				35  e=
SMA ZHE				36  z%
SMA ZE				37  z=
SMA II				38  i=
SMA SHORT II			39  j=
SMA KA				3A  k=
SMA EL				3B  l=
SMA EM				3C  m=
SMA EN				3D  n=
SMA O				3E  o=
SMA PE				3F  p=
SMA ER				40  r=
SMA ES				41  s=
SMA TE				42  t=
SMA U				43  u=
SMA EF				44  f=
SMA KHA				45  h=
SMA TSE				46  c=
SMA CHE				47  c%
SMA SHA				48  s%
SMA SHCHA			49  sc
SMA HARD SIGN			4A  ='
SMA YERI			4B  y=
SMA SOFT SIGN			4C  %'
SMA REVERSED E			4D  je
SMA IU				4E  ju
SMA IA				4F  ja
UNUSED				50   
SMA IO				51  io
SMA DJE				52  d%
SMA GJE				53  g%
SMA E				54  ie
SMA DZE				55  ds
SMA I				56  ii
SMA YI				57  yi
SMA JE				58  j%
SMA LJE				59  lj
SMA NJE				5A  nj
SMA TSHE			5B  ts
SMA KJE				5C  kj
UNUSED				5D
SMA SHORT U			5E  v%
SMA DZHE			5F  dz
CAP OMEGA			60
SMA OMEGA			61
CAP YAT				62  Y3
SMA YAT				63  y3
CAP IOTIFIED E			64
SMA IOTIFIED E			65
CAP LITTLE YUS			66
SMA LITTLE YUS			67
CAP IOTIFIED LITTLE YUS		68
SMA IOTIFIED LITTLE YUS		69
CAP BIG YUS			6A  O3
SMA BIG YUS			6B  o3
CAP IOTIFIED BIG YUS		6C
SMA IOTIFIED BIG YUS		6D
CAP KSI				6E
SMA KSI				6F
CAP PSI				70
SMA PSI				71
CAP FITA			72  F3
SMA FITA			73  f3
CAP IZHITSA			74  V3
SMA IZHITSA			75  v3
CAP IZHITSA DOUBLE GRAVE	76
SMA IZHITSA DOUBLE GRAVE	77
CAP UK DIGRAPH			78
SMA UK DIGRAPH			79
CAP ROUND OMEGA			7A
SMA ROUND OMEGA			7B
CAP OMEGA TITLO			7C
SMA OMEGA TITLO			7D
CAP OT				7E
SMA OT				7F
CAP KOPPA			80  C3
SMA KOPPA			81  c3
THOUSANDS SIGN			82
NON-SPACING TITLO		83
NON-SPACING PALATALIZATION	84
NON-SPACING DASIA PNEUMATA	85 
NON-SPACING PSILI PNEUMATA	86 
UNUSED				87
UNUSED				88
UNUSED				89
UNUSED				8A
UNUSED				8B
UNUSED				8C
UNUSED				8D
UNUSED				8E
UNUSED				8F
CAP GE WITH UPTURN		90  G3
SMA GE WITH UPTURN		91  g3
CAP GE BAR			92
SMA GE BAR			93
CAP GE HOOK			94
SMA GE HOOK			95
CAP ZHE WITH RIGHT DESCENDER	96
SMA ZHE WITH RIGHT DESCENDER	97
CAP ZE CEDILLA			98
SMA ZE CEDILLA			99
CAP KA WITH RIGHT DESCENDER	9A
SMA KA WITH RIGHT DESCENDER	9B
CAP KA VERTICAL BAR		9C
SMA KA VERTICAL BAR		9D
CAP KA BAR			9E
SMA KA BAR			9F
CAP REVERSED GE KA		A0
SMA REVERSED GE KA		A1
CAP EN WITH RIGHT DESCENDER	A2
SMA EN WITH RIGHT DESCENDER	A3
CAP EN GE			A4
SMA EN GE			A5
CAP PE HOOK			A6
SMA PE HOOK			A7
CAP O HOOK			A8
SMA O HOOK			A9
CAP ES CEDILLA			AA
SMA ES CEDILLA			AB
CAP TE WITH RIGHT DESCENDER	AC
SMA TE WITH RIGHT DESCENDER	AD
CAP STRAIGHT U			AE
SMA STRAIGHT U			AF
CAP STRAIGHT U BAR		B0
SMA STRAIGHT U BAR		B1
CAP KHA WITH RIGHT DESCENDER	B2
SMA KHA WITH RIGHT DESCENDER	B3
CAP TE TSE			B4
SMA TE TSE			B5
CAP CHE WITH RIGHT DESCENDER	B6
SMA CHE WITH RIGHT DESCENDER	B7
CAP CHE VERTICAL BAR		B8
SMA CHE VERTICAL BAR		B9
CAP H				BA
SMA H				BB
CAP IE HOOK			BC
SMA IE HOOK			BD
CAP IE HOOK OGONEK		BE
SMA IE HOOK OGONEK		BF
PALOCHKA			C0
CAP SHORT ZHE			C1
SMA SHORT ZHE			C2
CAP KA HOOK			C3
SMA KA HOOK			C4
UNUSED				C5
UNUSED				C6
CAP EN HOOK			C7
SMA EN HOOK			C8
UNUSED				C9
UNUSED				CA
CAP CHE WITH LEFT DESCENDER	CB
SMA CHE WITH LEFT DESCENDER	CC
UNUSED				CD
UNUSED				CE
UNUSED				CF
CAP A WITH BREVE		D0
SMA A WITH BREVE		D1
CAP A WITH DIAERESIS		D2
SMA A WITH DIAERESIS		D3
CAP LIGATURE A IE		D4
SMA LIGATURE A IE		D5
CAP IE WITH BREVE		D6
SMA IE WITH BREVE		D7
CAP SCHWA			D8
SMA SCHWA			D9
CAP SCHWA WITH DIAERESIS	DA
SMA SCHWA WITH DIAERESIS	DB
CAP ZHE WITH DIAERESIS		DC
SMA ZHE WITH DIAERESIS		DD
CAP ZE WITH DIAERESIS		DE
SMA ZE WITH DIAERESIS		DF
CAP ABKHASIAN DZE		E0
SMA ABKHASIAN DZE		E1
CAP I WITH MACRON		E2
SMA I WITH MACRON		E3
CAP I WITH DIAERESIS		E4
SMA I WITH DIAERESIS		E5
CAP O WITH DIAERESIS		E6
SMA O WITH DIAERESIS		E7
CAP BARRED O			E8
SMA BARRED O			E9
CAP BARRED O WITH DIAERESIS	EA
SMA BARRED O WITH DIAERESIS	EB
CAP U WITH ACUTE		EC
SMA U WITH ACUTE		ED
CAP U WITH MACRON		EE
SMA U WITH MACRON		EF
CAP U WITH DIAERESIS		F0
SMA U WITH DIAERESIS		F1
CAP U WITH DOUBLE ACUTE		F2
SMA U WITH DOUBLE ACUTE		F3
CAP CHE WITH DIAERESIS		F4
SMA CHE WITH DIAERESIS		F5
CAP DJE WITH ACUTE		F6
SMA DJE WITH ACUTE		F7
CAP YERU WITH DIAERESIS		F8
SMA YERU WITH DIAERESIS		F9
UNUSED				FA
UNUSED				FB
UNUSED				FC
UNUSED				FD
UNUSED				FE
UNUSED				FF



Q: Is everything clear now? 

A: Probably not. To ease the pain, here follow some tentative conversion
tables *from* the 8-bit schemes described above *to* Unicode. Since the
Unicode/10646 character set is much larger, no tables are provided in
the other direction.
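
Where a reverse mapping is needed for one particular 8-bit set, it can be
obtained by searching the forward table. The sketch below assumes a table
with the same layout as the ones that follow (128 entries for codes 128-255,
negative values for positions without a Unicode equivalent); the function
name is invented here.

```c
/* Find the 8-bit code whose Unicode value in table is u.
   Values below 128 are taken to be ascii and returned as is.
   Returns -1 if the 8-bit set has no such character. */
int fromu(const long table[128], long u)
{
        int i;

        if (u < 0) return -1;
        if (u < 128) return (int)u;
        for (i = 0; i < 128; i++)
                if (table[i] == u) return i + 128;
        return -1;
}
```

A linear search is slow but adequate for 128 entries; a program converting
large files would build the reverse table once instead.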

In the 0-127 range everything is ASCII (except for the CP866 dingbats in
the range 0-31 which are at any rate optional, and for EBCDIC/DKOI-8, for
which see above) so here tables are only provided for 128-255. Notice
that values not starting with 0x04 are often given, meaning that
the Unicode equivalent is outside the Unicode Cyrillic range
0x0400-0x04ff, but included at some other place, typically among the
arrows (0x2190-0x21ff) or other semigraphic material (0x2500-0x25ff). If
a particular encoding leaves (by official definition, not necessarily in
practical usage) some code unused, this is designated by "-1" in the
conversion table. For some positions the tables show a "-2", meaning
that I have no information on the intended meaning.  (This is not the
same as there being no Unicode codepoint for the character in question,
a situation we potentially encounter with AV and OV 242-245, see note
there.)
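
A lookup routine using tables of this shape might look as follows. This is a
sketch under the conventions just described; the function name and the choice
of writing the result as a two-byte (UCS-2) value with the high byte first
are assumptions of the example, not something prescribed here.

```c
/* Convert the 8-bit code c (0..255) to a two-byte Unicode value
   using a 128-entry table for codes 128..255.  Returns 0 and stores
   the value in out (high byte first) on success; otherwise returns
   the negative table entry (-1 unused, -2 unknown) and leaves out
   alone.  A c outside 0..255 is also reported as -1. */
int toucs2(const long table[128], int c, unsigned char out[2])
{
        long u;

        if (c < 0 || c > 255) return -1;
        u = (c < 128) ? c : table[c - 128];
        if (u < 0) return (int)u;
        out[0] = (unsigned char)((u >> 8) & 0xff);
        out[1] = (unsigned char)(u & 0xff);
        return 0;
}
```

Keeping the two negative values apart lets a conversion program warn about
codes of unknown meaning while silently dropping unused ones, or the other
way around.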



/* From old Koi-8 to Unicode */

long oldkoi8tou[128] = {
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
0x044e,0x0430,0x0431,0x0446,0x0434,0x0435,0x0444,0x0433,
0x0445,0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,
0x043f,0x044f,0x0440,0x0441,0x0442,0x0443,0x0436,0x0432,
0x044c,0x044b,0x0437,0x0448,0x044d,0x0449,0x0447,0x044a,
0x042e,0x0410,0x0411,0x0426,0x0414,0x0415,0x0424,0x0413,
0x0425,0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,
0x041f,0x042f,0x0420,0x0421,0x0422,0x0423,0x0416,0x0412,
0x042c,0x042b,0x0417,0x0428,0x042d,0x0429,0x0427,0x042a
};


/* From CP866 to Unicode */

long cp866tou[128] = {
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x0401,0x0451,0x0404,0x0454,0x0407,0x0457,0x040e,0x045e,
0x00b0,0x2022,0x00b7,0x221a,0x2116,0x00a4,0x25a0,   -1
};


/* From CP1251 to Unicode */

long cp1251tou[128] = {
0x0402,0x0403,0x201a,0x0453,0x201e,0x2026,0x2020,0x2021,
    -1,0x2030,0x0409,0x2039,0x040a,0x040c,0x040b,0x040f,
0x0452,0x2018,0x2019,0x201c,0x201d,0x2022,0x2013,0x2014,
    -1,0x2122,0x0459,0x203a,0x045a,0x045c,0x045b,0x045f,
0x00a0,0x040e,0x045e,0x0408,0x00a4,0x0490,0x00a6,0x00a7,
0x0401,0x00a9,0x0404,0x00ab,0x00ac,0x00ad,0x00ae,0x0407,
0x00b0,0x00b1,0x0406,0x0456,0x0491,0x00b5,0x00b6,0x00b7,
0x0451,0x2116,0x0454,0x00bb,0x0458,0x0405,0x0455,0x0457,
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
};


/* From Mac to Unicode */

long mactou[128] = {
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x2020,0x00b0,0x0490,0x00a3,0x00a7,0x2022,0x00b6,0x0406,
0x00ae,0x00a9,0x2122,0x0402,0x0452,0x2260,0x0403,0x0453,
0x221e,0x00b1,0x2264,0x2265,0x0456,0x03bc,0x0491,0x0408,
0x0404,0x0454,0x0407,0x0457,0x0409,0x0459,0x040a,0x045a,
0x0458,0x0405,0x00ac,0x221a,0x0192,0x2248,0x0394,0x00ab,
0x00bb,0x2026,0x0020,0x040b,0x045b,0x040c,0x045c,0x0455,
0x2013,0x2014,0x201c,0x201d,0x2018,0x2019,0x00f7,0x201e,
0x040e,0x045e,0x040f,0x045f,0x2116,0x0401,0x0451,0x044f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x00a4,
};


/* From Alternativnyj Variant to Unicode */

long avtou[128] = {
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0,   -1
};

/* The interpretation of the four symbols following the second
alphabetic block in AV remains unclear. One suggestion was to treat
these as (non-spacing) grave and acute, as appearing above upper- or
lowercase letters, but the graphical rendering in Brjabrin's original
article makes clear that the distinction is between acute and grave,
above or below the letter: this is what the table now has.

But the preponderance of graphical symbols in AV suggests that the
intention was to provide facilities for character graphics, in which
case the interpretation is simply straight lines connecting two
adjacent midpoints of the bounding box.  If the box is the unit
square, these would run from (.5,0) to (0,.5) and to (1,.5), and from
(.5,1) to (0,.5) and to (1,.5), in this order. (The line segments are
of course directionless.) Such symbols are not present in Unicode --
the closest things are 0x25de 0x25df 0x25dc 0x25dd (in this order) but
these are curved, not straight.

Whether the graphics or the accent usage is more prevalent in actual
usage only those plugged into the Russian PC community can tell. If
the graphics usage turns out to be prevalent, these four symbols would
be reasonable candidates for incorporation into Unicode, perhaps at
positions 0x25ef to 0x25f3. */


/* From Osnovnoj Variant to Unicode */

long ovtou[128] = {
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0,   -1
};

/* The same problem with the interpretation of 242-245 as in AV (these
rows are definitely identical). The low positions of OV are probably
identical to 176-223 in AV... */


/* From ISO 8859-5 (and New KOI-8) to Unicode */

long newkoi8tou[128] = {
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
0x00a0,0x0401,0x0402,0x0403,0x0404,0x0405,0x0406,0x0407,
0x0408,0x0409,0x040a,0x040b,0x040c,0x00ad,0x040e,0x040f,
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x2116,0x0451,0x0452,0x0453,0x0454,0x0455,0x0456,0x0457,
0x0458,0x0459,0x045a,0x045b,0x045c,0x00a7,0x045e,0x045f
};

/* Use newkoi8tou in combination with isotoibm to derive the Unicode
meaning of the Cyrillic range in the DKOI extension of EBCDIC. If
someone has DKOI-8 text available, I'd love to actually try... */
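
The combination suggested in the comment above could be sketched like this.
The function takes the two tables as parameters so that it stands alone; the
function and parameter names and the use of -2 for EBCDIC positions not
covered by the first table are assumptions of the example, and the result has
not been tried on real DKOI-8 text.

```c
/* Fill dkoitou[0..255] with the Unicode meaning of each DKOI-8
   code that is the image of an ISO 8859-5 value 160..255.
   iso2ibm has 96 entries (ISO 160..255) like isotoibm, and iso2u
   has 128 entries (ISO 128..255) like newkoi8tou.  Positions not
   reached are set to -2. */
void dkoitou_build(const int iso2ibm[96], const long iso2u[128],
                   long dkoitou[256])
{
        int i;

        for (i = 0; i < 256; i++) dkoitou[i] = -2;
        for (i = 0; i < 96; i++)
                dkoitou[iso2ibm[i] & 0xff] = iso2u[i + 32];
}
```

Called as dkoitou_build(isotoibm, newkoi8tou, table), this fills in the 96
positions listed in isotoibm and leaves the rest of EBCDIC marked as unknown.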