[Back to Home Page]

Lingloss

Welcome to the Lingloss project page!

In 1967, I designed what was meant to be an international auxiliary language called Lingloss. Like many other such projects, it was never really ready enough to inflict upon the public. In 2012 Lingloss still remains a work in progress; however, I believe I have recently made some progress on one aspect of the overall problem. The reasons for this belief are more fully detailed at

http://www.richardsandesforsyth.net/docs/bunnies.pdf .

So I am using this webpage to share some software which, when more fully developed, may help designers of the coming international auxiliary language. (Yes, there will have to be one eventually: the human race can always be relied upon to do the right thing, as Churchill said of the Americans, once it has exhausted the alternatives.) The software is concerned with the problem of establishing a suitable core vocabulary. This is an obstacle that prior efforts have never convincingly overcome.

What you will find when you download and unzip

[glossoft.zip]

is a pair of programs written in Python3 (along with various ancillary files) which address the following aspects of the vocabulary-building problem:

1.  How to choose a core collection of lexical items, i.e. what Hogben (1963) calls a "list of essential semantic units" (LESU), which is concise enough to be learnt in a matter of weeks and at the same time extensive enough to support the great majority of essential communicative functions;

2.  How to choose a suitable international word for each of the items in the LESU.

Towards a Core Vocabulary

The program corevox1.py takes in several lists of essential semantic units (formatted one item per line) and produces a consensus list consisting of all the items that occur in at least minfreq of the input lists, where minfreq is an integer from 1 (in which case the output is all the items that occur in any of the input lists) to N, the number of input lists (in which case the output is only those items common to all the input lists).
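
The consensus step itself is simple enough to sketch in a few lines of Python 3. What follows is an illustration of the idea, not the actual code of corevox1.py:

    from collections import Counter

    def consensus(word_lists, minfreq):
        # Count, for each item, how many of the input lists it appears in;
        # each list is reduced to a set so that duplicates within one
        # list are counted only once.
        counts = Counter()
        for items in word_lists:
            counts.update(set(items))
        # Keep every item found in at least minfreq of the input lists.
        return sorted(item for item, n in counts.items() if n >= minfreq)

    # Illustrative usage with two tiny word lists:
    print(consensus([['horse', 'dog', 'cat'], ['horse', 'cat', 'fish']], 2))
    # -> ['cat', 'horse']
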

Where do the input lists come from? Well, to test the program, four files containing previous attempts to come up with a LESU are provided (baslist, hoglist, longlist and maclist). These are, respectively: the Basic English wordlist (Ogden, 1937); the LESU of "Essential World English" (Hogben, 1963); the defining vocabulary of the Longman Dictionary of Contemporary English (Longman, 2003); and the defining vocabulary of the Macmillan English Dictionary for Advanced Learners (Macmillan, 2002). [subfolder: lexicons]

Ogden and Hogben were trying to establish minimal subsets of words needed for the majority of communicative purposes in simplified versions of English. Compilers of the Longman and Macmillan dictionaries were trying to establish basic word lists in terms of which all the other entries in their dictionaries could be defined. Thus all four lists represent principled attempts to create concise but effective vocabularies. They didn't all settle on the same words, but any term that appears in more than one of these sets is likely to have a strong claim for inclusion in anyone's core vocabulary.

Note that, although most of the entries in these lists are relatively common, they are not mere frequency lists. They result from attempts to cover the most commonly used concepts without redundancy. Therefore some high-frequency terms will be excluded if they are redundant.

I should perhaps apologize for the anglocentric bias here, although in mitigation it should be noted that there is nothing in this software that limits it to the English language. I am most at home with English examples, but I would hope that others could apply the same methods to other languages: the comparisons would be instructive.

Towards an International Vocabulary

The second program, avwords3.py, is more innovative, as far as the field of interlinguistics is concerned. It finds the 'verbal average' of a number of different words. As far as I know, nobody has ever defined what a verbal average might be; so, to be a little more specific, the heart of this program is a function that takes in a number of strings (usually words, though they could be short phrases) and produces a string which is, in a certain sense, the most typical representative of those input strings. As currently implemented, it works in two stages. Firstly, using a string-similarity scoring function, the string in the group which is most similar to all the others of that group is chosen. Secondly, certain manipulations, such as dropping a character or swapping two adjacent characters, are tried to see whether they increase the similarity score of that string in relation to the rest; if so, the modified string is accepted.

For example, given the following inputs

['cheval', 'caballo', 'cavallo', 'cavalo', 'cal', 'equus', 'cavall']

which are the French, Spanish, Italian, Portuguese, Romanian, Latin and Catalan words for 'horse', the program computes that

'cal'

is the most central or typical item. In this case, no deletions or letter-exchanges make it more typical, so it is retained.
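
For readers who want to experiment, here is a minimal sketch of this two-stage procedure in Python 3. The similarity function used below (difflib's ratio) is only a stand-in, since the scoring function of avwords3.py itself is not reproduced here; a different scorer can pick a different centre, so this sketch's 'verbal averages' will not necessarily match the program's own output.

    import difflib

    def similarity(a, b):
        # Stand-in scorer; the real program's scoring function may differ.
        return difflib.SequenceMatcher(None, a, b).ratio()

    def total_score(candidate, words):
        # How similar the candidate is to the group as a whole.
        return sum(similarity(candidate, w) for w in words)

    def verbal_average(words):
        # Stage 1: choose the member most similar to all the others.
        best = max(words, key=lambda w: total_score(w, words))
        best_score = total_score(best, words)
        # Stage 2: try single-character deletions and adjacent swaps,
        # keeping any variant that raises the similarity score.
        improved = True
        while improved:
            improved = False
            deletions = [best[:i] + best[i+1:] for i in range(len(best))]
            swaps = [best[:i] + best[i+1] + best[i] + best[i+2:]
                     for i in range(len(best) - 1)]
            for variant in deletions + swaps:
                score = total_score(variant, words)
                if score > best_score:
                    best, best_score, improved = variant, score, True
        return best

    print(verbal_average(['cheval', 'caballo', 'cavallo', 'cavalo',
                          'cal', 'equus', 'cavall']))
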

The program works by reading in several (utf8) files in the format exemplified below.


 young    giovane
 you      voi
 yes      sì
 yellow   giallo
 year     anno
 would    sarebbe
 work     lavorare
 word     parola
 wool     lana
 woods    bosco
 wood     legno
 woman    donna
 with     con
 wire     filo
 wing     ala
 wine     vino
 window   finestra

This is an extract from a simple English-Italian lexicon: each line consists of a source-language term followed by a target-language equivalent, with a tab character separating them.

Each of these input lexicons uses the same source language (English in the examples provided) with a different target language (various Romance languages). These sample bilingual lexicons can be found in the lexicons folder after you have unzipped the software.

Incidentally, the part that hasn't been automated is going from the LESU produced as output by corevox1.py to the several lexicons needed as input by avwords3.py. There are lots of public-domain bilingual lexicons, so it would be possible to write software that took a LESU and an existing lexicon (English-to-target-language in the present case) and produced suitable input for avwords3.py, but to do it properly would, I suspect, require human scrutiny anyway, so that task is left as "an exercise for the reader".
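
For anyone who wants to attempt that exercise, the mechanical part might look something like the sketch below, assuming the one-item-per-line LESU format and the tab-separated lexicon format described above. The file names are purely illustrative:

    def load_lesu(path):
        # A LESU file: one semantic unit per line.
        with open(path, encoding='utf-8') as f:
            return {line.strip().lower() for line in f if line.strip()}

    def filter_lexicon(lesu_path, lexicon_path, out_path):
        # Keep only those lexicon entries whose source term is in the LESU.
        lesu = load_lesu(lesu_path)
        with open(lexicon_path, encoding='utf-8') as src, \
             open(out_path, 'w', encoding='utf-8') as dst:
            for line in src:
                parts = line.strip().split('\t')
                if len(parts) == 2 and parts[0].strip().lower() in lesu:
                    dst.write(parts[0].strip() + '\t' + parts[1].strip() + '\n')

    # Illustrative call; the file names are hypothetical:
    # filter_lexicon('corelist.txt', 'english_italian.txt', 'avwords_input.txt')

The human scrutiny mentioned above would still be needed afterwards, for instance to deal with source terms that have several competing translations.
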

The output of avwords3.py is a lexicon in the same format as the inputs, where each source-language item is associated with the 'verbal average' of the terms in the various target languages -- intended as a first approximation to an English-Lingloss dictionary. Example output produced from the seven small example inputs in the lexicons folder follows below.


Mon Dec 24 16:28:24 2012
window 	 fenestra
wine 	 vin
wing 	 ala
wire 	 fil
with 	 con
woman 	 mulier
wood 	 lea
woods 	 bos
wool 	 lana
word 	 parala
work 	 trabaar
would 	 voudrais
year 	 ano
yellow 	 gallo
yes 	 si
you 	 voi
young 	 jove

On the basis of the example data provided here, Lingloss, if it ever gets into circulation, would look very much like a Romance language, a kind of simplified, modernized Latin. However, that decision is by no means set in stone. The main point of computerizing parts of the process is to permit exploration of alternative design decisions.

The English word 'would' isn't expressed by a single word in these languages, which illustrates the need for human pre-processing or post-processing. In fact, avwords3.py also produces a listing file in which the quality of the 'verbal averages' is shown. This is meant to provide serious users with information to enable them to decide which of the proposed term equivalents need further attention.

These programs are prototypes, intended to illustrate a particular methodology, which I believe is novel. Much work remains to be done. For example, comparison of alternative string-similarity scoring functions would be a good idea; as would a test of whether each target word should be rendered into a common phonetic representation or just taken as spelled; and so on. The main point is to stimulate such work.

Running the programs

To execute the programs you will have to obtain Python (version 3, not 2) if you don't already have it. This can be found at

www.python.org

I have tested these programs under Windows 7, but I believe they should run without alteration under Linux as well.

Then you will have to unzip the file

glossoft.zip

preferably at your top-level directory. This will create subfolders as follows.

lexicons    sample LESUs and small-scale bilingual lexicons

libs        common routines and variables for the programs in p3

op          default directory to receive output

p3          Python3 programs

parapath    directory to hold parameter files

Each program requires certain input parameters, which are put into a text file that can be edited with Notepad, Notepad++ or any other text editor. Example parameter files for using the example data provided will be found in the parapath folder once the zipped file has been unpacked. Each line of a parameter file starts with a parameter name, then one or more spaces, then the value for that parameter. Unknown parameters are ignored. Parameters not given a value in the parameter file receive a default value.

A table of parameters used by the programs follows.

parameter name   type                         default                   description
casefold         0 .. 1                       1                         whether to fold uppercase to lower case on input; 1 implies yes, 0 implies no
jobname          alphanumeric string          same name as program      name to link output files
minfreq          integer                      2                         minimum number of input LESU files in which a term must appear to be kept for output
outgloss         Windows or Linux file-spec   avwords_glos              output file for consensus lexicon
vocfile          Windows or Linux file-spec   corevox_vocs              output file for consensus LESU
voclists         Windows or Linux file-spec   lesu.dat / lexicons.txt   input text file containing a list of input file-specs, 1 per line
withkey          0 .. 1                       0                         whether to include the source-language term along with the target-language equivalents in avwords (1), or not (0)
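
To make the format concrete, here is a sketch of how a parameter file of this kind might be read; it is an illustration only, not the code the programs actually use, and the defaults shown are taken from the table above:

    def read_parameters(path, defaults):
        # Start from the default values, then override with any values
        # found in the parameter file; unknown parameter names are ignored.
        params = dict(defaults)
        with open(path, encoding='utf-8') as f:
            for line in f:
                fields = line.strip().split(None, 1)
                if len(fields) == 2 and fields[0] in params:
                    params[fields[0]] = fields[1]
        return params

    # Illustrative usage with the corevox1.py defaults from the table above:
    defaults = {'casefold': '1', 'jobname': 'corevox1', 'minfreq': '2',
                'vocfile': 'corevox_vocs', 'voclists': 'lesu.dat'}
    # params = read_parameters(r'c:\glossoft\parapath\coretest.txt', defaults)
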

The content of coretest.txt, a simple initial parameter file for corevox1.py, is copied below.

voclists c:\glossoft\parapath\lesu.txt
vocfile c:\glossoft\op\corelist.txt
minfreq 2

The content of wordavs.txt, a starter parameter file for avwords3.py, is copied below.

voclists c:\glossoft\parapath\glossies.txt
outgloss c:\glossoft\op\glossout.txt
withkey 0

Pretty simple, eh?


References

Hogben, L. (1943). Interglossa. Harmondsworth: Penguin Books.

Hogben, L. (1963). Essential World English. London: Michael Joseph Ltd.

Longman (2003). Dictionary of Contemporary English. Harlow: Pearson Educational Ltd.

Macmillan (2002). Macmillan English Dictionary for Advanced Learners. Oxford: Macmillan Education.

Ogden, C.K. (1937). The ABC of Basic English. London: Kegan Paul, Trench, Trubner & Co. Ltd.


Appendix

Constructed Auxiliary Languages:

Year  Language                     Surname         Forename(s)
1661  Universal Character          Dalgarno        George
1668  Real Character               Wilkins         Bishop
1699  Characteristica Universalis  Leibniz         Gottfried
1765  Nouvelle Langue              de Villeneuve   Faiguet
1866  Solresol                     Sudre           Francois
1868  Universalglot                Pirro           Jean
1880  Volapuk                      Schleyer        Martin
1886  Pasilingua                   Steiner         Paul
1887  Bopal                        de Max          Saint
1887  Esperanto                    Zamenhof        Lazarus
1888  Lingua                       Henderson       George
1888  Spelin                       Bauer           Georg
1890  Mundolingue                  Lott            Julius
1892  Latinesce                    Henderson       George
1893  Balta                        Dormoy          Emile
1893  Dil                          Fieweger        Julius
1893  Orba                         Guardiola       Jose
1896  Veltparl                     von Arnim       Wilhelm
1899  Langue Bleu                  Bollack         Leon
1902  Idiom Neutral                Rosenberger     Waldemar
1903  Latino sine Flexione         Peano           Giuseppe
1906  Ro                           Foster          Edward
1907  Ido                          de Beaufront    Louis
1913  Esperantido                  de Saussure     Rene
1922  Occidental                   de Wahl         Edgar
1928  Novial                       Jespersen       Otto
1943  Interglossa                  Hogben          Lancelot
1944  Mondial                      Heimer          Helge
1951  Interlingua                  Gode            Alexander
1957  Frater                       Thai            Pham Xuan
1961  Loglan                       Brown           James
1967  Lingloss                     Forsyth         Richard
1983  Uropi                        Landais         Joel
1996  Unish                        Jung            Young Hee
1998  Lingua Franca Nova           Boeree          George
2002  Mondlango                    Yafu            He
2011  Angos                        Wood            Benjamin

[Back to Home Page]