Inaris Post
This post is about lists in GF. It’s aimed for multiple audiences, so possibly some parts won’t interest you. If you have a good grasp of the basics, feel free to jump directly to Advanced topics.
Basics
Quoting from the GF book.
C.4.3 List categories
Since categories of lists of elements of another category are a common idiom, the following syntactic sugar is available:
cat [C] {n}
abbreviates a set of three judgements:
cat ListC ; fun BaseC : C -> ... -> C -> ListC ; --n C’s fun ConsC : C -> ListC -> ListC
The functions
BaseC
andConsC
are automatically generated in the abstract syntax, but their linearizations, as well as the linearization type ofListC
, must be defined manually. The type expression[C]
is in all contexts interchangeable withListC
.
Choice of n
The parameter n
in cat [C]{n}
determines the size of the base list. For instance,
cat [C]{1}
generates the following functions:
BaseC : C -> ListC
ConsC : C -> ListC -> ListC
Likewise,
cat [C]{0}
generates the following functions:
BaseC : ListC
ConsC : C -> ListC -> ListC
In fact, the choice of n
only affects the BaseC
function. ConsC
is always the same, adding a single C
to an already existing list.
If you’re used to lists from other programming languages, you might wonder what’s the purpose of n > 0
.
An empty list is such a useful concept, why force the minimum size of a list to be 1, 2 or even more?
The answer is that it depends on an application. In the next sections, I’ll cover the use of lists for natural and formal languages.
Lists in natural language
The purpose of lists in the Resource Grammar Library (RGL) is to allow coordination. I’ll start with an example and explain the functions right after.
While the example is for NP
(noun phrase), exactly the same principle holds for other list categories in the RGL (AP, Adv,RS, S).
Baseline: single NP
Here’s the RGL API and the underlying tree for “I walk”. No lists yet, this is just for comparison.
-- RGL API, what you'd write in application grammar
lin I_walk_Cl = mkCl i_NP walk_V ;
-- Underlying RGL tree
PredVP -- : NP -> VP -> Cl
(UsePron i_Pron) -- : NP
(UseV walk_V) -- : VP
-- : Cl
If you’ve only ever used the RGL API and never seen PredVP
, UsePron
etc. before, you can read an explanation here. Knowing the RGL abstract syntax is not necessary for writing application grammars, but for this deep dive post, it’s useful to understand that the two levels exist.
List of NPs
Here’s the corresponding GF code for “they, you and I walk”. This time, the subject is constructed from a list of three NP
s, which are put back together into one NP
.
-- RGL API, what you'd write in application grammar
lin They_You_and_I_walk_Cl =
mkCl
(mkNP and_Conj
(mkListNP they_NP
(mkListNP you_NP
i_NP)
) -- : ListNP
) -- : NP
walk_V ;
The RGL API overloads the mkListNP
oper. In the underlying RGL abstract syntax tree, we see their true names BaseNP
and ConsNP
.
-- Underlying RGL tree
PredVP
(ConjNP and_Conj
(ConsNP (UsePron they_Pron)
(BaseNP (UsePron you_Pron)
(UsePron i_Pron))
) -- : ListNP
) -- : NP
(UseV walk_V)
As long as you have GF_LIB_PATH
set up, you can open the RGL abstract syntax in the GF shell and run the commands below.
$ gf alltenses/LangEng.gfo
…
Lang> l PredVP (ConjNP and_Conj (ConsNP (UsePron they_Pron) (BaseNP (UsePron youSg_Pron) (UsePron i_Pron)))) (UseV walk_V)
In RGL, n = 2
Looking at the examples, we see that the base list size in RGL is 2.
We saw that the innermost mkListNP
was applied to two arguments:
(mkListNP you_NP i_NP)
And that translated to a BaseNP
, which took 2 arguments. So we know that BaseNP
and ConsNP
were generated from the following expression.
cat [NP]{2}
We can even verify that this was the expression: look at the RGL abstract syntax!
Lists for less than 2 aren’t needed in the RGL.
If we had n = 1
, even a single NP could be a list, like in the following.
mkCl
(mkNP and_Conj (mkListNP i_NP))
walk_V
But why write mkNP and_Conj (mkListNP i_NP)
, when you can just write i_NP
?
Clearly, we only start needing lists when we want to coordinate 2 or more elements.
Back to C
from ListC
: introducing ConjC
For all categories C
that have a ListC
, the RGL includes a corresponding function
ConjC : Conj -> ListC -> C
which turns a list back into a single instance of the category, with the help of a conjunction. For example, “[Alice, Bob, Charlie]” is a list of NP
s, whereas “Alice, Bob and Charlie” is a single NP
.
Notice that ConjC
is different from BaseC
and ConsC
, which are derived automatically whenever ListC
is defined. In contrast, ConjC
is manually defined in the RGL.
ConjC in the API: just another instance of mkC
In the RGL API, all these ConjC
funs are accessible from mkC
. Here are two versions of mkNP
, which under the hood call ConjNP
:
If you only want to coordinate two NP
s, you can skip constructing ListNP
, and call mkNP
directly for the two NP
s (and a conjunction).
-- RGL API
mkNP or_Conj i_NP they_NP
mkNP or_Conj (mkListNP i_NP they_NP)
-- both correspond to the RGL abstract syntax
ConjNP or_Conj (BaseNP i_NP they_NP)
Conj
Naturally, RGL also includes the category Conj
, with examples such as
and_Conj : Conj ;
or_Conj : Conj ;
There are also conjunctions that put a string before the list.
both7and_DConj : Conj ; -- both...and
either7or_DConj : Conj ; -- either...or
For instance,
mkNP either7or_DConj everybody_NP nobody_NP
returns “either everybody or nobody”.
Implementation of ListC
in RGL
ListC
is usually implemented as exactly like the lincat of C
, but with two s
fields, called s1
and s2
.
For example, if C
is defined as
lincat C = {s : Number => Str ; g : Gender} ;
then ListC
will split its s
field into two, and retain its other fields as
lincat [C] = {s1,s2 : Number => Str ; g : Gender} ;
You can see examples in ConjunctionEng:
lincat
[S] = {s1,s2 : Str} ;
[Adv] = {s1,s2 : Str} ;
[NP] = {s1,s2 : NPCase => Str ; a : Agr} ;
[AP] = {s1,s2 : Agr => Str ; isPre : Bool} ;
[RS] = {s1,s2 : Agr => Str ; c : NPCase} ;
[CN] = {s1,s2 : Number => Case => Str} ;
If you look at the lins of a given Conjunction module, it probably looks rather cryptic. Most of it is just calling opers like twoTable
and consTable
. These opers, and many more, are found in the module gf-rgl/prelude/Coordination.gf.
The only documentation of Coordination (that I’m aware of) is in the comments of the module itself, and it looks like the following:
-- Create a ListX from two Xs. Example:
-- x = {s = "here"} ;
-- y = {s = "there"} ;
-- twoSS x y ==> {s1 = "here" ; s2 = "there"}
twoSS : (_,_ : SS) -> ListX = \x,y -> twoStr x.s y.s ;
It’s likely that the opers still look very cryptic, so as prerequisite knowledge, you should read the first half of this post on types in GF. You can stop when you see the subheading “Dependent types in abstract syntax”.
If you’re implementing a new resource grammar and are struggling to understand Conjunction, the best way is to look at other languages’ implementation of Conjunction, and read the comments of Coordination. If your language uses converbs or conjunctions as suffixes, you can read the section later in this post Natural language strategies beyond A, B and C. Later on, I might add a section in this post to be more of a hands-on guide for resource grammarians, but for now I’ve prioritised other things.
That was the end of the natural language/RGL part. Next section is about lists in formal languages.
Lists in formal language
Suppose that I’m defining my own object-oriented programming language. Here’s a class definition:
class Business = {
bus_name : String ;
is_legal : Boolean ;
} ;
Let’s ignore any other language constructs for the sake of this example, and just concentrate on the class definition.
All classes should have a name, and some amount of fields. I’m also happy to accept an empty class, with no fields—that’d just be written as follows:
class Business ;
Since this whole blog post is about lists, you might see where this is going.
Abstract syntax
Here’s my abstract syntax for the GF grammar which describes this programming language fragment.
abstract MyOOP = {
cat
Class ; -- class ClassName : { [Field] }
Field ; -- field_name : BuiltinType
[Field]{0} ; -- generates funs BaseField, ConsField
BuiltinType ;
fun
ClassDef : String -> [Field] -> Class ;
MkField : String -> BuiltinType -> Field ;
BoolType, StringType : BuiltinType ;
}
In this case, it’s a good idea to allow an empty list. If I had done like in the RGL, and made the minimum size of [Field]
>0, then I would’ve needed two versions of ClassDef
: one for 0 fields, other for >0 fields. Instead, with [Field]{0}
, I can express empty and non-empty classes with the same fun.
Concrete syntax
I still want to print out different strings for empty and non-empty classes. But that’s no problem—I just make a parameter for it. Here’s my concrete syntax.
concrete MyOOPCnc of MyOOP = {
lincat
[Field] = {s : Str ; isEmpty : IsEmpty} ;
param
IsEmpty = Empty | NonEmpty ;
lin
-- : String -> [Field] -> Class
ClassDef name flds = {
s = "class" ++ name.s ++
case flds.isEmpty of {
Empty => flds.s ; -- just empty string
NonEmpty => "= {" ++ flds.s ++ "}"
} ++ ";"
} ;
-- : String -> BuiltinType -> Field
MkField name typ = {s = name.s ++ ":" ++ typ.s} ;
-- : BuiltinType
BoolType = {s = "Boolean"} ;
StringType = {s = "String"} ;
-- These funs automatically generated from [Field]{0}
-- : [Field]
BaseField = {s = [] ; isEmpty = Empty} ;
-- : Field -> [Field] -> [Field]
ConsField f fs =
let sep : Str = case fs.isEmpty of {
Empty => [] ;
NonEmpty => ";" } ;
in {s = f.s ++ sep ++ fs.s ; isEmpty = NonEmpty} ;
}
I’m using the parameter IsEmpty
twice:
- In
ClassDef
to decide whether to print= { }
after the class name. - In
ConsField
to decide whether to put;
after the first argument.
That’s all I need to make ClassDef
handle empty and non-empty lists.
Side note: in ClassDef
, I use the string from flds
even when flds
is empty, and only contains the empty string. The reason is explained in my gotchas post. In GF, every argument needs to contribute with a string, otherwise it isn’t recognised when parsing. This happens even if there is no ambiguity.
Natural language concrete
Next, I want to do a natural language interface in my programming language! Wouldn’t this be cool:
MyOOP> p "class Business ;" | l -lang=Eng
Business is a class with no fields
MyOOP> p "class Business = { is_legal : Boolean } ;" | l -lang=Eng
Business is a class with a Boolean field is_legal
MyOOP> p "class Business = { is_legal : Boolean ; bus_name : String } ;" | l -lang=Eng
Business is a class with fields is_legal of type Boolean and bus_name of type String
English presents new challenges with lists of different sizes. But there’s still nothing to worry about: the English concrete just needs to use a different set of internal params. Here’s a fragment:
ClassDef name fields =
let classname : NP = symb name ;
with_fields : Adv = case fields.size of {
Zero => mkAdv with_Prep (mkNP noPl_Det field_N) ;
One => withField fields.firstField ;
Many => withField fields.s
} ;
descr : NP = mkNP a_Det (mkCN class_N with_fields)
in mkS (mkCl classname descr) ;
You can see that I’m using a three-valued parameter for list size: 0, 1 or >=2 (i.e. “many”). I chose those cases, because I wanted to use three different verbalisation strategies:
- no fields
- a type field name
- fields name¹ of type type¹, … and nameⁿ of type typeⁿ
I’m not going to paste the whole English concrete, but you can see everything—the abstract and the two concretes—in this link.
Advanced topics
First, I’ll talk about the problem that everything gets flattened to a string. Second, I’ll talk about natural languages like Turkish and Korean, whose conjunctions work a bit differently than the standard RGL model, and how to implement them.
Problem: everything flattens to string
You probably find lists in GF more restricted than you’re used to in other languages. For instance, you can’t retain the individual elements—they just get concatenated into one long string, separated by commas in the RGL, or a separator of your choice in formal languages. So you can’t peek into a list and decide, for an arbitrary n
, “if the n
th element has parameter Foo
, then do something”.
Of course, for a finite n
, you can always have a lincat with n
fields, and pattern match them as much as you like. For example:
lincat
Elem = {s : Str ; foo : FooBar} ;
[Elem] = {
s1,s2,s3,s4,s5 = {s : Str ; foo : FooBar} ;
numElems : NumElems ;
} ;
param
FooBar = Foo | Bar ;
NumElems = Zero | One | Two | Three | Four | Five ;
lin
BaseElem = {
s1,s2,s3,s4,s5 = { -- dummy values in all 5 fields
s = "NOTHING HERE YET" ;
foo = Foo} ;
numElems = Zero
} ;
ConsElem e es =
case es.numElems of {
Zero => es ** {s1 = e ; numElems = One} ;
...
Four => es ** {s5 = e ; numElems = Five} ;
Five => error "The list already has 5 elements"
} ;
This hypothetical list of elements has 5 fields s1‥s5
, and a parameter NumElems
to keep track of which of the 5 fields contains some actual value. (If you’re familiar with grammar blowup due to lots of parameters, here’s an exercise to you: which one is cheaper, this solution or the alternative in the footnote1?)
While this type of solution can work for a finite n
, the much more scalable method is to do any list manipulation from an external program.
Natural language strategies beyond A, B and C
The RGL Conjunction module was primarily designed for the following natural language strategy:
- Coordinated phrases appear in the same form as uncoordinated.
- The two last phrases are separated by a conjunction. Any additional phrases are separated by commas (A, B, … and X).
But that’s not the only strategy natural languages use to connect words and phrases. Other strategies include:
- Coordinated phrases appear in a special form, different from stand-alone phrase. For example, in Turkish you would not have full finite sentences like “I went to the market and bought cigarettes”. Instead, all but the last sentence are in a non-finite form, like “having gone to market, I bought cigarettes”.
- The conjunction can be inserted between every coordinated phrase, like in Korean.
Special form with coordination
Linguistically, this seems like a more exotic thing than just repeating the conjunction. But as long as the coordinated forms are still separated by commas (or anything else that doesn’t depend on the final conjunction chosen), this modification is the easier of the two.
Let’s take a hypothetical AP
in a hypothetical language. Here’s the spec:
- The house is blue.PRED
- A blue.ATTR house
- The house is blue.CONJ , green.CONJ and red.PRED
- A blue.CONJ , green.CONJ and red.ATTR house
So we have three forms: attributive, predicative and conjunctive. Here’s an inflection table:
lincat
AP = {s : AForm => Str} ;
param
AForm = AAttr | APred | AConj ;
You don’t need to do anything special in BaseAP
and ConsAP
. The basic design is that the last AP
is stored in the s2
field, and the others in s1
.
lincat
[AP] = {s1, s2 : AForm => Str} ;
lin
-- BaseAP and ConsAP are like for any RGL language.
BaseAP a1 a2 = twoTable AForm a1 a2 ;
-- returns: {s1 = a1.s ; s2 = a2.s}
ConsAP a as = consrTable AForm comma a as ;
-- returns the following:
-- {s1 = \\af => a.s ! af ++ "," ++ as.s1 ! af ; s2 = as.s2}
The differences are in ConjAP
, which chooses the AConj
form for all but the last AP
.
lin
ConjAP conj as = {
s = \\af => as.s1 ! AConj -- All but the last AP in AConj
++ conj.s -- Conj string, like "and"
++ as.s2 ! af -- Full inflection table retained
} ;
The last AP
retains the full inflection table, including the AConj
form. That’s because any AP
can potentially appear before a conjunction—even one that is already built of several AP
s. (The house is [big and blue] or [small and green].)
But if an AP
has ever been coordinated, and it was not the last one, then it won’t ever go back to a non-conjunctive form.
Of course, in real languages the inflection tables tend to be much larger, with agreement, polarity, tense and whatnot. I have implemented similar languages in the RGL and GF works very well—I just chose a minimal example for the blog post.
Repeated conjunctions
Consider the API of lists in the RGL. First, we construct a list: [scallions, onions, garlic]
, and only after that, we apply a conjunction to the list.
As we saw earlier in this post, this is implented as a record with two s
fields: s1
contains the string "scallions , onions"
and s2
contains the string "garlic"
. Then all we need to do is to put the conjunction between the two strings in s1
and s2
.
But now the conjunction is repeated after every element. So we need to be prepared for scallions and onions and garlic, as well as scallions or onions or garlic, even before ConjC <someConj>
has been called. What to do?
Conjunction as parameter
Luckily, there are a finite amount of conjunctions. So we can make it into a parameter:
param
ConjType = And | Or | Nor ;
This ConjType
will be on the left-hand side of our list type. Like this:
-- Assuming that lincat of NP is {s : Case => Str ; a : Agr}
lincat
[NP] = {s : ConjType => Case => Str ; a : Agr} ;
And here comes the important part: the ConjType
parameter is used to insert the desired conjunctions after each new addition to the list.
lin
BaseNP np1 np2 = {
s = table {
And => \\c => np1.s ! c ++ "and" ++ np2.s ! c ;
Or => \\c => np1.s ! c ++ "or" ++ np2.s ! c ;
Nor => \\c => np1.s ! c ++ "nor" ++ np2.s ! c
} ;
a = conjAgr np1.a np2.a
} ;
ConsNP np nps = nps ** {
s = table {
And => \\c => np.s ! c ++ "and" ++ nps.s ! And ! c ;
Or => \\c => np.s ! c ++ "or" ++ nps.s ! Or ! c ;
Nor => \\c => np.s ! c ++ "nor" ++ nps.s ! Nor ! c
}
} ;
And how to get the right string out of the ListNP
? Let’s look at the implementation of ConjNP
.
lincat
Conj = {c : ConjType} ; -- Conj string already in the list
lin
ConjNP conj nps = {s = nps.s ! conj.c} ;
How about if your language uses different conjunctions for different parts of speech? That’s no problem at all. Each BaseC
and ConsC
decides which strings to insert, so BaseNP
might use the string “and” for the parameter And
, and BaseAP
might put a different string for the same parameter.
Complex morphology
What if the conjunctions have many allomorphs depending on the words they attach to? That definitely happens, for instance in Korean. Here’s a peek into the Korean RG, where we have a type
conjTable : POS => ConjType => Phono => Str
That’s a bit more things to consider than just output the string “and” after the parameter And
, but nothing that GF can’t handle. Here’s some code in action, where we choose the right allomorphs:
oper
mkFirstAP : AdjPhrase -> VForm => ConjType => Str = \ap ->
\\af,conj => ap.compar ++ case isPos af of {
True => glue (ap.s ! VStem Pos) (conjTable ! VStar ! conj ! ap.p) ;
False => glue (ap.s ! VStem Neg) (conjTable ! VStar ! conj ! ap.pNeg)
} ;
The lincat for AP
includes a parameter p
which tell us whether the VStem Pos
form ends in a consonant or vowel, and pNeg
, which tells whether the VStem Neg
form ends in a consonant or vowel.
If you find that the code is getting very complicated, maybe a better approach would be to store the forms with the conjunction included, already before any lists are made. So your AP
would contain forms like
- blue.ATTR
- blue.PRED
- blue.CONJ+and
- blue.CONJ+or
and BaseAP
would just choose whatever blue.*+and it needs for a given parameter like And
.
Of course, the larger your inflection table for just blue is, the more it may blow up when you add conjunctions. Writing GF grammars for morphologically complex languages can be tricky, but also very rewarding.
If you are ever writing a resource grammar and struggle with these type of questions, I would love to help. You can write to the GF mailing list (or to me personally if you’re shy), and I’ll try to give some tips.
Footnotes
1. Below is an alternative method to implement the list of up to 5 Elems. In light of grammar blowup, which version is better? ↩
lincat
Elem = {s : Str ; foo : FooBar} ;
[Elem] = {
s1,s2,s3,s4,s5 = {s : Str ; foo : MaybeFooBar}
} ;
param
FooBar = Foo | Bar ;
-- This is not the place to teach polymorphism in GF
MaybeFooBar = Just FooBar | Nothing ;
lin
BaseElem = {
s1,s2,s3,s4,s5 = { -- dummy values in all 5 fields
s = "NOTHING HERE YET" ;
foo = Nothing}
} ;
ConsElem e es =
case es.s1.foo of {
Nothing
=> es ** {s1 = {s = e.s ; foo = Just e.foo}} ;
_ => case es.s2.foo of {
Nothing
=> es ** {s2 = {s = e.s ; foo = Just e.foo}} ;
_ => case es.s3.foo of {
Nothing
=> es ** {s3 = {s = e.s ; foo = Just e.foo}} ;
_ => case es.s4.foo of {
Nothing
=> es ** {s4 = {s = e.s ; foo = Just e.foo}} ;
_ => case es.s5.foo of {
Nothing
=> es ** {s5 = {s = e.s ; foo = Just e.foo}} ;
_ => error "The list already has 5 elements"}}}}
} ;
The solution is here.