Lecture Notes

for

CSCI 1301: Great Ideas in Computer Science

(c) 1993 - 2007 John E. Howland, All Rights Reserved

Department of Computer Science

Trinity University

One Trinity Place

San Antonio, Texas 78212-7200

Internet: jhowland@ariel.cs.trinity.edu

9 Language Translation

This discussion of language translation deals with some of the aspects of the translation of higher level programming languages to lower level machine languages. We do not here consider the computer application area of translation of one natural language, such as Russian, to another natural language such as English. Machine translation of natural languages is a much more complex problem. We begin with some background concepts from mathematics.

9.1 Inductive Specification

The problem of language translation is broken down into two sub-problems; syntax analysis and semantic analysis. Syntax analysis deals with the problem of recognizing whether or not a sentence (expression) in the language is correctly formed according to the rules of syntax for the language. Semantic analysis deals with the meaning of correctly formed sentences.

We start with the problem of syntax analysis. We need some method of specifying, precisely, the collection of syntactically correct sentences of a language. Inductive specification is a tool which may be applied to form strings of symbols chosen from an alphabet.

Inductive specification is a method of precisely specifying a set of values. As an example, consider the following specification of a certain subset of the natural numbers.

The set S is defined to be the smallest set of natural numbers satisfying:

a) zero is an element of S

b) If x is an element of S, then x+2 is an
element of S.

We can also reason about the set S in a non-inductive fashion as follows. 0 is in S (property a). 2 is in S (property b). 4 is in S (property b), etc. This leads us to think of the set of non-negative multiples of 2. Call this set M. It follows that S contains the set M. Similarly, since S is the smallest set of natural numbers satisfying a) and b), S must be a subset of every set satisfying a) and b). Hence, it must be true that M = S since M is a subset of S and S is a subset of M.

We can generalize the idea of inductive specification as follows. To specify a set A by induction, we define it to be the smallest set satisfying two properties of the form:

a) Some specific values must be in A.

b) If certain values are in A, then certain
other values are also in A.

This means (non-inductively) that A is the set of all elements satisfying property a) together with all elements which are obtained by repeatedly applying b). However, this non-inductive specification is not precise enough for our purposes.

9.2 Language Syntax Specification

We wish to apply inductive specification of a set to the problem of specifying the syntax of a programming language. In particular, we wish to use inductive specification to specify syntactic elements of a programming language. Each such set will be a set of strings which are correctly formed strings of symbols in the programming language. The rules which determine the set of all acceptable strings of symbols for a language is called the grammar of the language or the syntax of the language.

9.2.1 Example

Let us suppose that we wish to give the J programming language grammar for a subset of J known as list_of_numbers. Intuitively we are thinking of lists such as 1 2 3, '', etc., but not 9 2.4 a 3. Consider the following inductive specification:

The syntax element list_of_numbers is the smallest set of objects satisfying the two properties:

1. The empty list, '' is a list-of-numbers

2. If l is a list_of_numbers and n is a
number, then the pair n , l is a
list_of_numbers.

We can infer the following:

a) '' is a list_of_numbers by property 1.

b) 10 , '' ==> 10 is a
list_of_numbers by property 2 since
10 is a number.

c) 3 , 10 ==> 3 10 is a
list_of_numbers by property 2 since
3 is a number.

d) Nothing is a list-of-numbers unless it is
constructed in this manner.

9.3 Bacus-Naur Form and Syntactic Derivations

Bacus-Naur form is a notation which can be used to give inductive specifications for the syntactic elements of a language. The problem we face when we wish to define the rules of syntax (grammar) of a language is that we have to use notation (often involving the same alphabet as used in the language) to describe the rules for forming strings of symbols. We need to be able to distinguish symbols appearing in the syntax rule which are part of the rule itself from symbols which are part of a properly formed string of the language. Bacus and Naur developed a notational scheme, called BNF, for such syntax descriptions.

BNF can be used to inductively define a number of sets of syntactic elements of a language at once. These sets are called syntactic categories, or sometimes nonterminals and we write the names of these sets by enclosing them in "<" ">" . For example, the list_of_numbers J data type would be a nonterminal and would be written as <list_of_numbers>. Each syntactic category is defined by a finite set of rules, or productions. Each rule asserts that certain values must be in the syntactic category.

9.3.1 Example

Here is the BNF definition of the list-of-numbers syntax. This description has two rules.

<list_of_numbers> ::= ''

<list_of_numbers> ::= <number> , <list_of_numbers>

The first rule says that the empty list is in <list_of_numbers> and the second rule says that if n is in <number> and l is in <list_of_numbers>, the n , l which is n append l is in <list_of_numbers>. Each BNF rule starts with a syntactic category followed by ::= which is read as "is". The right hand side of the rule or production specifies how a member of the syntactic category is to be constructed from other syntactic categories and terminal symbols.

In the above rules the comma "," (the J word for append) and single quote "'" are terminal symbols and <number> and <list_of_numbers> are nonterminals. Sometimes we use the symbol "|" to shorten or simplify rules. "|" is read as or. The above two rules can be combined as:

<list_of_numbers> ::= '' | <number> , <list_of_numbers>

Another shorthand is the Kleene star {...}* which means zero or more repetitions of what ever is between the braces. And the Kleene plus {...}+ which means one or more repetitions of what ever is between the braces. For example, the <list_of_numbers> category might be defined as:

<list_of_numbers> ::= {<number>}*

The shorthand items "|", Kleene star and plus are not really necessary, but are used with BNF to help shorten the descriptions.

As an example, consider the following specifications of Scheme data:

9.3.2 Example

<list> ::= ({<datum>}*)

<dotted-datum> ::= ({<datum>}+ . <datum>)

<vector> ::= #({<datum>}*)

<datum> ::= <number> | <symbol> | <boolean> |
<string> | <list> | <dotted-datum> | <vector>

This means that


(1 abc 2 10)
is a valid Scheme datum.

We are discussing how one formulates the syntax rules of a programming language. The properties of a program which can be determined by analyzing just the text of a program are said to be static properties. Similarly, the dynamic properties of a program are those which are determined by run-time inputs. Analysis of static properties is important because a translator can use this information to catch certain kinds of program errors and perhaps write a more efficient program.

9.4 Lambda Calculus Language

If the variable x is a formal parameter, a reference to x is a static program property in most programming languages. We wish to explore this idea a little more fully. Suppose we consider a tiny programming language defined by the following BNF rules.

<exp> ::= <var_ref>
| <exp> <exp>
| monad def '<exp>'


This is a language which has references to variables, function calls and monad definitions.

This language is the lambda calculus which was invented by Alonzo Church in his 1941 book "The Calculi of Lambda-Conversion. The lambda calculus is a fundamental basis for many programming languages. Ii strongly influenced the design of Lisp and particularly the Scheme programming language. We are using the lambda calculus language as an example more because it is a tiny language than because of its theoretical significance.

The notation Church used for his lambda calculus would look something like the following when expressed in BNF:

<exp> ::= <var-ref>
| <exp> <exp>
| lambda <var> . <exp>

We have used a J syntax with "monad", "define" and single quotes rather than the lambda so that we can more easily use J to model this language. The counter part to the <var> is the pronoun "y." which may appear in the quoted <exp>.

9.4.1 Predicate exprp

Following is an implementation of a predicate exprp which returns true if an expression is a correctly formed sentence in the lambda calculus language.

exprp =: monad define
NB. parse sentence into words
words =. word y.
NB. check for variable reference
if. 1 = tally words
    do. varp open words
NB. check for verb application
  elseif. 2 = tally words
    do. (exprp open 0 from words) and exprp open 1 from words
NB. check for verb definition
  elseif. 3 = tally words
    do. ('monad' match open 0 from words) and ('def' match open 1 from words) and exprp do open 2 from words
  elseif. 1
    do. 0
end.
)
This definition uses two other verbs; word and the predicate for variable references varp. The complete discussion of how these two verbs accomplish their task is a bit beyond our level of understanding at this point, but that should not deter our use of the verbs in the definition exprp.

The verb word =: ;: separates a character string into a boxed list of items in the same manner that a J system would parse the character string into words as if the string were a J sentence. For example:


   word 'a+b' ==>
+-+-+-+
|a|+|b|
+-+-+-+
   word '+/%#' ==>
+-+-+-+-+
|+|/|%|#|
+-+-+-+-+
   word 'data * items' ==>
+----+-+-----+
|data|*|items|
+----+-+-----+
   word '-data' ==>
+-+----+
|-|data|
+-+----+
The verb

varp =: monad def ' _1 < (4 !: 0) < , y.'
is a predicate for a name which has been bound to some J item. For example:

   a ==>
100
   varp 'a' ==>
1
   b ==>
|value error
   varp 'b' ==>
0
We are now ready to illustrate the operation of the predicate, exprp, for properly formed sentences.

9.4.2 Examples:


   exprp 'a' ==>
1
   exprp 'a a' ==>
1
   exprp 'b' ==>
0
   b ==>
|value error
   exprp 'a b' ==>
0
   exprp 'a a a a' ==>
0
   exprp 'monad def ''y.''' ==>
1
   exprp 'monad def ''a''' ==>
1
   exprp 'monad def ''b''' ==>
0

9.5 Semantics

Next we give some definitions which are important when we begin to attach some semantics (meanings) to our lambda expression language.

We want the lambda calculus to allow us to:

a) reference variables

b) apply monads to an input

c) define monads

A variable reference is said to be bound in an expression if it refers to a formal parameter introduced in the expression.

A variable reference is said to be free in an expression if it is not a formal parameter in the expression.

9.5.1 Example


(monad def 'y.') a
The reference to y. is bound.

The reference to a is free.

A variable is said to occur bound in an expression if the expression contains a bound reference to the variable.

Similarly, a variable is said to occur free in an expression if the expression contains a free reference to the variable.

When evaluated at run time each variable reference must have a value. Variables bound to formal parameters are said to be lexically bound. Other wise they must be bound at the top level or supplied by the system. Such variables are said to be globally bound.

9.5.2 Example

In:

(monad def 'y.') a
y. is lexically bound.

a must be globally bound or an error will result


   (monad def 'y.') a ==>
100
because

   a ==>
100
A reference to a variable that is neither lexically bound or globally bound is an error.

The value of an expression depends only on the values of the variables which occur free in the expression.

The context surrounding the expression must supply the values of the free variables.

9.5.3 Example

The value of (monad def 'y.') a depends on a because a is free in this expression.

9.5.4 Example

The value of (monad def 'y.') y. in

(monad def '(monad def ''y.'') y.')
is determined by the binding of the parameter y. .

Hence,


(monad def '(monad def ''y.'') y.') 2 ==> 2
Note that y. was free in (monad def 'y.') y. but bound in

(monad def '(monad def ''y.'') y.')
To put it another way, in the above expression, the two variables y. are distinct from each other.

The value of an expression is independent of bindings for variables that do not occur free in the expression.

9.5.5 Example


   y. =: 10
    (monad def 'y.') y.
10
   (monad def '(monad def ''y.'') y.') 2
2
   y.
10
The meaning, that is, semantics, of (monad def 'y.') is always the same. It is the identity function, i.e., it returns whatever value is passed as an argument.

Next we discuss in more detail the idea of free and bound variable references. In J, all bindings are top level bindings unless they are bound in an explicit definition. As stated before, within the context of an explicit definition, the names x. and y. are lexically bound. In addition, any names assigned by the local binding symbol (spelled =.) are also bound variable references after each such binding. Any other variable references are said to be free. For example, consider the definition foo:


foo =: monad define
('one' ; 'two' ; 'three') =. y.
one + two * three - four
)
The variable references one, two and three are bound in the first line of this definition, while four is free. The value of this function will be dependent on the value of the free variable four.