Lingloss
Welcome to the Lingloss project page!
In 1967, I designed what was meant to be an international auxiliary language called Lingloss. Like many other such projects, it was never really ready enough to inflict upon the public. In 2012 Lingloss still remains a work in progress; however, I believe I have recently made some progress on one aspect of the overall problem. The reasons for this belief are more fully detailed at
http://www.richardsandesforsyth.net/docs/bunnies.pdf .
So I am using this webpage to share some software which, when more fully developed, may help designers of the coming international auxiliary language. (Yes, there will have to be one eventually: the human race can always be relied upon to do the right thing, as Churchill said of the Americans, once they have exhausted the alternatives.) The software is concerned with the problem of establishing a suitable core vocabulary. This is an obstacle that prior efforts have never convincingly overcome.
What you will find when you download and unzip glossoft.zip is a pair of programs written in Python3 (along with various ancillary files) which address the following aspects of the vocabulary-building problem:
1. How to choose a core collection of lexical items, i.e. what Hogben (1963) calls a "list of essential semantic units" (LESU), which is concise enough to be learnt in a matter of weeks and at the same time extensive enough to support the great majority of essential communicative functions;
2. How to choose a suitable international word for each of the items in the LESU.
Towards a Core Vocabulary
The program corevox1.py takes in several lists of essential semantic units (formatted one item per line) and produces a consensus list consisting of all the items that occur in at least minfreq of the input lists, where minfreq is an integer from 1 (in which case the output is all the items that occur in any of the input lists) to N, the number of input lists (in which case the output is only those items common to all the input lists).
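The essence of this consensus step is easy to express in Python3. The sketch below merely illustrates the idea and is not the code of corevox1.py itself; the file names (and their .txt extensions) are assumptions for the example.

from collections import Counter

def consensus(wordlists, minfreq=2):
    # Count in how many input lists each item occurs; set() ensures
    # that duplicates within a single list are counted only once.
    counts = Counter()
    for wordlist in wordlists:
        counts.update(set(wordlist))
    # Keep every item that reaches the minfreq threshold.
    return sorted(item for item, n in counts.items() if n >= minfreq)

# Illustrative usage; file names are hypothetical:
lists = []
for fname in ['baslist.txt', 'hoglist.txt', 'longlist.txt', 'maclist.txt']:
    with open(fname, encoding='utf-8') as infile:
        lists.append([line.strip().lower() for line in infile if line.strip()])
print(consensus(lists, minfreq=2))

With minfreq=1 this yields the union of the input lists; with minfreq equal to the number of lists it yields their intersection.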
Where do the input lists come from? Well, to test the program, four files containing previous attempts to come up with a LESU are provided (baslist, hoglist, longlist and maclist). These are, respectively: the Basic English word list (Ogden, 1937); the LESU of "Essential World English" (Hogben, 1963); the defining vocabulary of the Longman Dictionary of Contemporary English (Longman, 2003); and the defining vocabulary of the Macmillan English Dictionary for Advanced Learners (Macmillan, 2002). [subfolder: lexicons]
Ogden and Hogben were trying to establish minimal subsets of words needed for the majority of communicative purposes in simplified versions of English. Compilers of the Longman and Macmillan dictionaries were trying to establish basic word lists in terms of which all the other entries in their dictionaries could be defined. Thus all four lists represent principled attempts to create concise but effective vocabularies. They didn't all settle on the same words, but any term that appears in more than one of these sets is likely to have a strong claim for inclusion in anyone's core vocabulary.
Note that, although most of the entries in these lists are relatively common, they are not mere frequency lists. They result from attempts to cover the most commonly used concepts without redundancy. Therefore some high-frequency terms will be excluded if they are redundant.
I should perhaps apologize for the anglocentric bias here, although in mitigation it should be noted that nothing in this software limits it to the English language. I am most at home with English examples, but I would hope that others could apply the same methods to other languages: the comparisons would be instructive.
Towards an International Vocabulary
The second program, avwords3.py, is more innovative, as far as the field of interlinguistics is concerned. It finds the 'verbal average' of a number of different words. As far as I know, nobody has ever defined what a verbal average might be; so, to be a little more specific, the heart of this program is a function that takes in a number of strings (usually words, though they could be short phrases) and produces a string which is, in a certain sense, the most typical representative of those input strings. As currently implemented, it works in two stages. Firstly, using a string-similarity scoring function, the string in the group that is most similar to all the others of that group is chosen. Secondly, certain manipulations, such as dropping a character or swapping two adjacent characters, are tried to see whether they increase the similarity score of that string in relation to the rest; if so, the modified string is accepted.
For example, given the following inputs
['cheval', 'caballo', 'cavallo', 'cavalo', 'cal', 'equus', 'cavall']
which are the French, Spanish, Italian, Portuguese, Romanian, Latin and Catalan words for 'horse', the program computes that
'cal'
is the most central or typical item. In this case, no deletions or letter-exchanges make it more typical, so it is retained.
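The skeleton of such a procedure can be sketched in Python3 as below. This is not the code of avwords3.py: in particular, difflib's SequenceMatcher is used here merely as a stand-in similarity scorer, and with a different scorer the chosen string can differ (avwords3.py's own scoring function picked 'cal' above).

from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in scorer in [0, 1]; avwords3.py uses its own scoring function.
    return SequenceMatcher(None, a, b).ratio()

def total_score(candidate, words):
    return sum(similarity(candidate, w) for w in words)

def verbal_average(words):
    # Stage 1: pick the input string most similar to all the others.
    best = max(words, key=lambda w: total_score(w, words))
    # Stage 2: try single-character deletions and adjacent swaps,
    # keeping any variant that raises the total similarity score.
    improved = True
    while improved:
        improved = False
        variants = [best[:i] + best[i+1:] for i in range(len(best))]
        variants += [best[:i] + best[i+1] + best[i] + best[i+2:]
                     for i in range(len(best) - 1)]
        for v in variants:
            if total_score(v, words) > total_score(best, words):
                best, improved = v, True
                break
    return best

print(verbal_average(['cheval', 'caballo', 'cavallo', 'cavalo',
                      'cal', 'equus', 'cavall']))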
The program works by reading in several files (encoded in UTF-8) in the format exemplified below.
young     giovane
you       voi
yes       sì
yellow    giallo
year      anno
would     sarebbe
work      lavorare
word      parola
wool      lana
woods     bosco
wood      legno
woman     donna
with      con
wire      filo
wing      ala
wine      vino
window    finestra
This is an extract from a simple English-Italian lexicon: each line consists of a source-language term followed by a target-language equivalent, with a tab character separating them.
Each of these input lexicons uses the same source language (English in the examples provided) with a different target language (various Romance languages in the examples provided). These sample bilingual lexicons can be found in the lexicons folder after you have unzipped the software.
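For readers who want to experiment, a file in this format can be loaded in a few lines of Python3. This helper (and the path in the usage comment) is merely illustrative, not part of avwords3.py:

def read_lexicon(filename):
    # Map each source-language term to its target-language equivalent.
    lexicon = {}
    with open(filename, encoding='utf-8') as infile:
        for line in infile:
            line = line.strip()
            if line:
                source, target = line.split('\t', 1)
                lexicon[source] = target
    return lexicon

# e.g. italian = read_lexicon('lexicons/italian.txt')  # hypothetical path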
Incidentally, the part that hasn't been automated is going from the LESU produced as output by corevox1.py to the several lexicons needed as input by avwords3.py. There are lots of public-domain bilingual lexicons, so it would be possible to write software that took a LESU and an existing lexicon (English-to-target-language in the present case) and produced suitable input for avwords3.py, but to do it properly would, I suspect, require human scrutiny anyway, so that task is left as "an exercise for the reader".
The output of avwords3.py is a lexicon in the same format as the inputs, where each source-language item is associated with the 'verbal average' of the terms in the various target languages -- intended as a first approximation to an English-Lingloss dictionary. Example output produced from the seven small example inputs in the lexicons folder follows below.
Mon Dec 24 16:28:24 2012
window    fenestra
wine      vin
wing      ala
wire      fil
with      con
woman     mulier
wood      lea
woods     bos
wool      lana
word      parala
work      trabaar
would     voudrais
year      ano
yellow    gallo
yes       si
you       voi
young     jove
On the basis of the example data provided here, Lingloss, if it ever gets into circulation, would look very much like a Romance language, a kind of simplified, modernized Latin. However, that decision is by no means set in stone. The main point of computerizing parts of the process is to permit exploration of alternative design decisions.
The English word 'would' isn't expressed by a single word in these languages, which illustrates the need for human pre-processing or post-processing. In fact, avwords3.py also produces a listing file in which the quality of the 'verbal averages' is shown. This is meant to provide serious users with information to enable them to decide which of the proposed term equivalents need further attention.
These programs are prototypes, intended to illustrate a particular methodology, which I believe is novel. Much work remains to be done. For example, comparison of alternative string-similarity scoring functions would be a good idea; as would a test of whether each target word should be rendered into a common phonetic representation or just taken as spelled; and so on. The main point is to stimulate such work.
Running the programs
To execute the programs you will have to obtain Python (version 3, not 2) if you don't already have it. This can be found at
http://www.python.org/ .
I have tested these programs under Windows 7, but I believe they should run without alteration under Linux as well.
Then you will have to unzip the file
glossoft.zip
preferably at your top-level directory. This will have subfolders as follows.
lexicons    sample LESUs and small-scale bilingual lexicons
libs        common routines and variables for the programs in p3
op          default directory to receive output
p3          Python3 programs
parapath    directory to hold parameter files
Each program requires certain input parameters, which are put into a text file that can be edited with Notepad, Notepad++ or any other text editor. Example parameter files for using the example data provided will be found in the parapath folder once the zipped file has been unpacked. Each line of a parameter file starts with a parameter name, then one or more spaces, then the value for that parameter. Unknown parameters are ignored. Parameters not given a value in the parameter file receive a default value.
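This format is simple enough to parse in a few lines of Python3. The sketch below shows the general idea only; the defaults dictionary in the usage example is hypothetical, not the programs' actual internals.

def read_parameters(filename, defaults):
    # Start from the default values; unknown parameter names are ignored.
    params = dict(defaults)
    with open(filename, encoding='utf-8') as infile:
        for line in infile:
            parts = line.strip().split(None, 1)
            if len(parts) == 2 and parts[0] in defaults:
                params[parts[0]] = parts[1]
    return params

# Hypothetical defaults for corevox1.py:
defaults = {'casefold': '1', 'minfreq': '2',
            'vocfile': 'corevox_vocs', 'voclists': 'lesu.dat'}
print(read_parameters('coretest.txt', defaults))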
A table of parameters used by the programs follows.
parameter name | type | default | description
casefold | 0 .. 1 | 1 | whether to fold upper case to lower case on input; 1 means yes, 0 means no
jobname | alphanumeric string | same name as program | name used to link output files
minfreq | integer | 2 | minimum number of input LESU files in which a term must appear to be kept for output
outgloss | Windows or Linux file-spec | avwords_glos | output file for the consensus lexicon
vocfile | Windows or Linux file-spec | corevox_vocs | output file for the consensus LESU
voclists | Windows or Linux file-spec | lesu.dat / lexicons.txt | input text file containing a list of input file-specs, one per line
withkey | 0 .. 1 | 0 | whether to include the source-language term along with the target-language equivalents in avwords3.py (1) or not (0)
The content of coretest.txt, a simple initial parameter file for corevox1.py, is copied below.
voclists c:\glossoft\parapath\lesu.txt
vocfile c:\glossoft\op\corelist.txt
minfreq 2
The content of wordavs.txt, a starter parameter file for avwords3.py, is copied below.
voclists c:\glossoft\parapath\glossies.txt
outgloss c:\glossoft\op\glossout.txt
withkey 0
Pretty simple, eh?
References
Hogben, L. (1943). Interglossa. Harmondsworth: Penguin Books.
Hogben, L. (1963). Essential World English. London: Michael Joseph Ltd.
Longman (2003). Dictionary of Contemporary English. Harlow: Pearson Education Ltd.
Macmillan (2002). Macmillan English Dictionary for Advanced Learners. Oxford: Macmillan Education.
Ogden, C.K. (1937). The ABC of Basic English. London: Kegan Paul, Trench, Trubner & Co. Ltd.
Appendix
Constructed Auxiliary Languages:

Year | Language | Surname | Forename(s)
1661 | Universal Character | Dalgarno | George
1668 | Real Character | Wilkins | John
1699 | Characteristica Universalis | Leibniz | Gottfried
1765 | Nouvelle Langue | de Villeneuve | Faiguet
1866 | Solresol | Sudre | Francois
1868 | Universalglot | Pirro | Jean
1880 | Volapuk | Schleyer | Martin
1886 | Pasilingua | Steiner | Paul
1887 | Bopal | de Max | Saint
1887 | Esperanto | Zamenhof | Lazarus
1888 | Lingua | Henderson | George
1888 | Spelin | Bauer | Georg
1890 | Mundolingue | Lott | Julius
1892 | Latinesce | Henderson | George
1893 | Balta | Dormoy | Emile
1893 | Dil | Fieweger | Julius
1893 | Orba | Guardiola | Jose
1896 | Veltparl | von Arnim | Wilhelm
1899 | Langue Bleu | Bollack | Leon
1902 | Idiom Neutral | Rosenberger | Waldemar
1903 | Latino sine Flexione | Peano | Giuseppe
1906 | Ro | Foster | Edward
1907 | Ido | de Beaufront | Louis
1913 | Esperantido | de Saussure | Rene
1922 | Occidental | de Wahl | Edgar
1928 | Novial | Jespersen | Otto
1943 | Interglossa | Hogben | Lancelot
1944 | Mondial | Heimer | Helge
1951 | Interlingua | Gode | Alexander
1957 | Frater | Thai | Pham Xuan
1961 | Loglan | Brown | James
1967 | Lingloss | Forsyth | Richard
1983 | Uropi | Landais | Joel
1996 | Unish | Jung | Young Hee
1998 | Lingua Franca Nova | Boeree | George
2002 | Mondlango | Yafu | He
2011 | Angos | Wood | Benjamin