#* This script parses and 'formats' (or pretty-prints) the teaching software language "pl/0" which was defined by N.Wirth (of 'pascal' fame) for teaching compiler construction ("Compilerbau"). The script demonstrates the potential of the pep machine and language. pl/0 is so limited as not to be very useful for actual programming. STATUS 19 june 2022 ran the translator scripts for testing 7 sept 2021 mainly working, and translatable to golang (not pl/0, but the pep script) maybe some bugs. tcl translation not working. BUGS The tr/translate.tcl.pss does not translate properly this script. This is a good oportunity to debug that translator script. SOLVED BUGS statementset parsing was tricked by nested 'end' tokens. The example below demonstrated the problem >> begin a:=1; if x>4 then write x end . This should parse as a valid pl/0 program, but didnt because "write x end" was parsed (tokens were reduced) as >> "statement.end." --> "statementset." The solution was to move the 5 token if/then, while/do rule to the top of the parse> section. GRAMMAR FOR PLZERO The pep script below does not follow strictly the grammar listed here (which is N.Wirth's original grammar). For example there is no "block" token in the script (and no nested procedures). The script also allows multiline comments in the form "{...}" and accepts "write" and "writeln" as synonyms for "!". Also, all keywords as parsed by the script are case-insensitive, since later versions of pl/0 allowed uppercase keywords (CONST, VAR etc). * the pl/0 grammar in wsn/ebnf form ------- program = block "." . block = [ "const" ident "=" number {"," ident "=" number} ";"] [ "var" ident {"," ident} ";"] { "procedure" ident ";" block ";" } statement . statement = [ ident ":=" expression | "call" ident | "?" ident | "!" expression | "begin" statement {";" statement } "end" | "if" condition "then" statement | "while" condition "do" statement ]. condition = "odd" expression | expression ("="|"#"|"<"|"<="|">"|">=") expression . expression = [ "+"|"-"] term { ("+"|"-") term}. term = factor {("*"|"/") factor}. factor = ident | number | "(" expression ")". ,,, TESTING This script can be successfully translated and run into go, java, python, ruby, c. However the tcl translator has some bug preventing translation. * test program parsing and pretty printing >> pep -f eg/plzero.pss -i 'const c=0; var a; proc x;var b; b:=1; !a ;' * check if variable declarations are parsing >> pep -f eg/plzero.pss -i "var a, b, c ;" * translate to golang and run, using the function in helpers.pars.sh >> pep.goff eg/plzero.pss test.plzero.primes.txt NOTES The pl/0 language was designed to be easily parsable by a "recursive-descent" parser/compiler (this was the type of compiler prefered by Niklaus Wirth). However, the pep parse machine uses shift-reductions of parse tokens. This is one explanation/excuse why the current script may appear excessively complex! In any case, once a working pl/0 parser script is created, it has the advantage of being translatable into other languages (go, java, python, ruby, c). One basic function of a compiler is to decide if a variable or function has been defined before it is called. Pep seems to struggle with this at the moment. To solve this we need essentially a "stack" of in-scope valid variables (and their type). When the parsing of a procedure is complete, we remove all those internal variable names/constants from the variable list. This grammar doesnt do nested procedures at the moment, because they seem silly. We can use the empty token trick for tokens that don't have to exist. Eg: add an empty "procset*" token as soon as the first "procedure" keyword is seen. This slightly simplifies the grammar rules for "procset." and "proceed." The "block" rule contains a series of optional elements that must appear in order eg: block := [constant.declarations] [var.declarations] [procedure.declarations] [statement] . all of these are optional. If we factor out the optionality we can end up with a large number of separate rules which can be tedious to code. However in this case, there are only about 6 possibilites. Statements have no end marker, so how do we know when the expression is finished in "statement = ident ':=' expression ." ? So, the problem is, if we have an input stream: >> a := b+c*d That will parse to "ident ':=' expression" where expression == "b+c" which will reduce to "statement", which ignores the "*d" suffix and produces a parse error! This is solved by using a look-ahead token, in rules that use the "begins-with" test (B"..."). "a = 4" is a legal 'condition' in pl/0 but is also part of a constant declaration. So I need to stop this parsing as a condition if it is preceded by "const" HISTORY 24 june 2022 Fixed a small bug in this and plzero.ruby.pss Trying to use global vars in ruby, but need to add $ to names but not to functions. 9 sept 2021 Apparently solved an interesting reduction precedence problem by moving rules to the top of the parse> section. i.e moved B"if.condition.then.statement." to the top of parse> this means that the "statement.end." rule does not take precedence (which was leading to mis-parsings). 7 sept 2021 Finishing pretty printing. Script appears to work well, if somewhat wordy. Can adapt this script to transpile to go etc. 6 sept 2021 more work. transmogrifying number* and ident* into exp* where necessary. Basic syntax appears good, apart from nested procedures. Working on pretty printing output, as a prelude to transpiling to another language. 5 sept 2021 rewriting, using "tail reduction" 30 august 2019 created a previous version *# read; # make character numbers (for error messages) relative # to line numbers "\n" { clear; nochars; .reparse } [:space:] { clear; .reparse } "+","-" { put; clear; add "opadd*"; push; .reparse } "*","/" { put; clear; add "opmul*"; push; .reparse } # modern pascal style comments, but not (* ... *) because # they are tricky to parse. "{" { # save the line number for possible error message later clear; lines; put; clear; add "{"; until "}"; E"}" { # can convert to another format put; clear; # create a "comment" parse token # comment-out this line to remove multiline comments from the # translated code # add "comment*"; push; .reparse } # make an unterminated multiline comment an error # to ease debugging of scripts. clear; add "[pl/0 error] Unterminated pascal comment { ... } \n"; add "stating at line number "; get; add "\n"; print; clear; quit; } "*","/" { put; clear; add "opmul*"; push; .reparse } # literal tokens # '!' is the print command in pl/0 # '?' is the read command in pl/0 # '#' means != # '.' marks the end of a program. ".",",","=",";","(",")" { put; add "*"; push; .reparse } "!" { clear; add "write"; put; clear; add "!*"; push; .reparse } "?" { clear; add "read"; put; clear; add "?*"; push; .reparse } # also need to parse <= >= comparison operators # '=' is comparison and also constant assignment. "<",">","#" { while [=]; !"<".!">".!">=".!"<=".!"#" { # error, message and quit put; clear; add "[pl/0 error] unrecognised operator "; get; add " at line/char "; lines; add "/"; chars; add "\n"; print; quit; } "#" { clear; add "!="; } put; clear; add "compare*"; push; } ":" { (eof) { clear; add "trailing : at end of file!\n"; print; quit; } read; ":=" { put; add "*"; push; .reparse } put; clear; add "[pl/0 error] unrecognised operator "; get; add " at line/char "; lines; add "/"; chars; add "\n"; print; quit; } [0-9] { while [0-9]; put; clear; add "number*"; push; .reparse } [:alpha:] { while [:alpha:]; # make keywords case insensitive put; lower; # synonym with ! "write","writeln" { put; clear; add "!*"; push; .reparse } # keywords in pl/0 "const","var","if","then","while","do","begin","end", "proc","procedure","call","odd" { # or add ".key*" to remind us that these are keywords, # not parse tokens # add a space here for pretty printing later put; "procedure" { clear; add "proc"; } add "*"; push; .reparse } # non-lower case identifier is already saved on the tape add "ident*"; push; .reparse } !"" { # error unrecognised character put; clear; lines; add ":"; chars; add " [pl/0 error] incorrect character '"; get; add "'\n"; print; quit; } parse> # To visualise token reduction uncomment this below: # lines; add ":"; chars; add " "; print; clear; # add "\n"; unstack; print; clip; stack; # we will do these 5 token reductions first, so that # "statement*end*" does not have precedence pop; pop; pop; pop; pop; # sometimes statements are terminated by expressions, which # in turn must be terminated by something else, so there is a # trailing token there B"if*condition*then*statement*".!"if*condition*then*statement*" { replace "if*condition*then*statement*" "statement*"; push; push; --; --; get; add " "; ++; get; ++; add " "; get; add "\n"; ++; get; --; --; --; put; clear; # transfer 'invisible' token value ++; ++; ++; ++; get; --; --; --; put; --; clear; ++; ++; .reparse } B"while*condition*do*statement*".!"while*condition*do*statement*" { replace "while*condition*do*statement*" "statement*"; push; push; --; --; get; add " "; ++; get; ++; add " "; get; add "\n"; ++; get; --; --; --; put; clear; # transfer invisible token value, and realign tape pointer ++; ++; ++; ++; get; --; --; --; put; --; clear; ++; ++; .reparse } push; push; push; push; push; #----------------- # 1 token pop; # errors (eof).!".*" { clear; lines; add ":"; chars; add " [pl/0 error] Missing '.' at end of program ? \n"; add " \n"; print; quit; } #----------------- # 2 tokens pop; # Some errors "ident*ident*" { clear; lines; add ":"; chars; add " [pl/0 error] 2 variable names in a row (is there a \n"; add " missing operator?) \n"; print; quit; } "vardec*const*" { clear; lines; add ":"; chars; add " [pl/0 error] Constant declarations must precede variable \n"; add " declarations in pl/0 \n"; print; quit; } "condec*const*" { clear; lines; add ":"; chars; add " [pl/0 error] only 1 constant declaration block is allowed \n"; add " in each scope \n"; print; quit; } "vardec*var*" { clear; lines; add ":"; chars; add " [pl/0 error] only 1 variable declaration block is allowed \n"; add " in each scope \n"; print; quit; } ";*end*" { clear; lines; add ":"; chars; add " [pl/0 error] Last statement in block does not require \n"; add " a semi-colon ';' \n"; print; quit; } #* B"const*".!"const*".!E"ident*" { clear; lines; add ":"; chars; add " [pl/0 error] The 'const' keyword must be followed by \n"; add " a variable name. \n"; print; quit; } *# # add an empty procset* (set of procedures) if there is # not one already. This simplifies program reduction later # it doesnt matter if there is 1 or 2 tokens here. E"proc*".!B"procset*" { !"proc*" { # need to conserve the preceding token push; clear; get; ++; put; --; clear; put; add "procset*proc*"; push; push; .reparse } # only 1 token clear; get; ++; put; --; clear; put; add "procset*proc*"; push; push; .reparse } # procedure headers (name etc) "procset*proceed*" { clear; get; add "\n"; ++; get; --; put; clear; add "procset*"; push; .reparse } # "=" can be used for constant assignment but also as # a comparison operator "=*ident*","=*exp*" { replace "=*" "compare*"; push; push; .reparse } "exp*=*" { replace "=*" "compare*"; push; push; .reparse } # expression transmog "opmul*ident*","opadd*ident*","opmul*number*","opadd*number*", "compare*ident*","compare*number*",":=*ident*",":=*number*", "if*ident*","if*number*","while*ident*","while*number*", "(*ident*","(*number*","!*ident*","!*number*" { push; clear; add "exp*"; push; .reparse } # expression transmutation "ident*opmul*","ident*opadd*","number*opmul*","number*opadd*", "ident*compare*","number*compare*" { replace "ident*" "exp*"; replace "number*" "exp*"; push; push; .reparse } "const*conlist*" { clear; get; add " "; ++; get; --; put; clear; add "condec*"; push; .reparse } # non-tail reduction for variables "var*ident*" { clear; get; add " "; ++; get; --; put; clear; add "varlist*"; push; .reparse } # variable decs "varlist*;*" { clear; get; ++; get; --; put; clear; add "vardec*"; push; .reparse } "block*.*" { clear; add "program*"; push; .reparse } "call*ident*" { clear; get; add " "; ++; get; --; put; clear; add "statement*"; push; .reparse } "?*ident*" { clear; get; ++; get; --; put; clear; add "statement*"; push; .reparse } # a multi statement block, between begin/end "begin*statementset*" { # ident statements, clear; get; add "\n"; ++; get; --; replace "\n" "\n "; add "\nend"; put; clear; add "statement*"; push; .reparse } # tail reduction for statementsets "statement*end*" { clear; #get; ++; get; --; put; clear; add "statementset*"; push; .reparse } #----------------- # 3 tokens pop; # Some 3 token errors, not all of them ";*ident*=*","begin*ident*=*" { clear; lines; add ":"; chars; add " [pl/0 error]\n"; add " incorrect assignment? The assignment operator in \n"; add " pl/0 is ':=' not '=' \n"; print; quit; } # ! expression must be parsed while looked at the # trailing token, like all expression parsing B"!*exp*".!"!*exp*".!E"opmul*".!E"opadd*" { # need to conserve the "invisible" last token replace "!*exp*" "statement*"; push; push; # also transfer attributes --; --; get; add " "; ++; get; --; put; clear; ++; ++; get; --; put; ++; clear; .reparse } # procedure headers (name etc) "proc*ident*;*" { clear; get; add " "; ++; get; ++; get; --; --; put; clear; add "prochead*"; push; .reparse } # procedure headers (name etc) "prochead*statement*;*" { # indent the statement if it is not a begin/end construct clear; ++; get; !B"begin" { clear; add " "; get; replace "\n" "\n "; put; } --; clear; get; add "\n"; ++; get; ++; get; --; --; put; clear; add "proceed*"; push; .reparse } # expressions, this could be the trickiest aspect of # the grammar. transmog ident/number to exp "exp*opmul*exp*", "exp*opadd*exp*" { clear; get; ++; get; ++; get; --; --; put; clear; add "exp*"; push; .reparse } "(*exp*)*" { clear; get; ++; get; ++; get; --; --; put; clear; add "exp*"; push; .reparse } # this is reverse reduction of statements, which can be # useful since exp* requires a trailing token in order to # resolve. "statement*;*statementset*" { clear; get; add ";\n"; ++; ++; get; --; --; put; clear; add "statementset*"; push; .reparse } # variable decs "varlist*,*ident*" { clear; get; ++; get; add " "; ++; get; --; --; put; clear; add "varlist*"; push; .reparse } #----------------- # 4 tokens pop; # 4 token errors B"ident*=*number*".!E";*".!E",*" { clear; lines; add ":"; chars; add " [pl/0 error] incorrect assignment operator '=' \n"; add " did you mean ':=' ? \n"; print; quit; } # procedure headers (name etc). Need to indent the variable decs etc "prochead*vardec*statement*;*","prochead*condec*statement*;*" { # indent the variable/constant declaration clear; add " "; ++; get; replace "\n" "\n "; put; --; clear; get; add "\n"; ++; get; add "\n"; ++; get; ++; get; --; --; --; put; clear; add "proceed*"; push; .reparse } # ident and number have already been transmog'ed into exp* # and =* into compare* B"exp*compare*exp*".!"exp*compare*exp*".!E"opmul*".!E"opadd*" { # need to conserve the "invisible" last token replace "exp*compare*exp*" "condition*"; push; push; --; --; get; add " "; ++; get; add " "; ++; get; --; --; put; clear; # transfer trailing token value ++; ++; ++; get; --; --; put; ++; clear; .reparse } # also see the 5 token reduction, because of the trailing token # required by exp* "if*condition*then*statement*", "while*condition*do*statement*" { # indent the statement if it is not a begin/end construct clear; ++; ++; ++; get; !B"begin" { clear; add " "; get; replace "\n" "\n "; put; } --; --; --; clear; clear; get; add " "; ++; get; add " "; ++; get; add "\n"; ++; get; --; --; --; put; clear; add "statement*"; push; .reparse } # lookahead for expressions in statements # the problem is: x:=4*3+1 will be parsed at # statement*+1 if no lookahead. # If the expression # is not followed by a */+- then it is a complete statement # and can be reduced. lets transmog ident and number to make simpler B"ident*:=*exp*".!"ident*:=*exp*".!E"opmul*".!E"opadd*" { # need to conserve the "invisible" last token replace "ident*:=*exp*" "statement*"; push; push; --; --; get; add " "; ++; get; add " "; ++; get; --; --; put; clear; ++; ++; .reparse } # tail reduction for constant decs "ident*=*number*;*" { clear; get; add " "; ++; get; add " "; ++; get; ++; get; --; --; --; put; clear; add "conlist*"; push; .reparse } # "ident*;*" { clear; get; ++; get; --; put; clear; add "varlist*"; push; .reparse } #----------------- # 5 tokens pop; # procedure headers (name etc), need to indent condec vardec "prochead*condec*vardec*statement*;*" { # indent the variable and constant declarations clear; add " "; ++; get; replace "\n" "\n "; put; clear; add " "; ++; get; replace "\n" "\n "; put; --; --; clear; get; add "\n"; ++; get; add "\n"; ++; get; add "\n"; ++; get; ++; get; --; --; --; --; put; clear; add "proceed*"; push; .reparse } # constant declarations, tail reduction, but tail redux not # necessary "ident*=*number*,*conlist*" { clear; get; add " "; ++; get; add " "; ++; get; ++; get; add " "; ++; get; --; --; --; --; put; clear; add "conlist*"; push; .reparse } # program reduction (eof) { "statement*.*" { clear; get; ++; get; --; put; clear; add "program*"; } "vardec*statement*.*", "condec*statement*.*", "procset*statement*.*" { clear; get; add "\n{ main program }\n"; ++; get; ++; get; --; --; put; clear; add "program*"; } "condec*vardec*statement*.*", "vardec*procset*statement*.*", "condec*procset*statement*.*" { clear; get; add "\n"; ++; get; add "\n{ main program }\n"; ++; get; ++; get; --; --; --; put; clear; add "program*"; } "condec*vardec*procset*statement*.*" { clear; get; add "\n"; ++; get; add "\n"; ++; get; add "\n{ main program }\n"; ++; get; ++; get; --; --; --; --; put; clear; add "program*"; } } (eof) { unstack; "program*" { clear; add "{ ok pl/0 }\n"; add "{ code checked and formatted by eg/plzero.pss }\n"; get; add "\n"; print; quit; } put; clear; add "Pl/0 program did not parse well \n"; add "The final parse tokens were: "; get; add "\n"; print; quit; } push; push; push; push; push;