#*

  This script parses and 'formats' (or pretty-prints) the teaching software
  language "pl/0" which was defined by N.Wirth (of 'pascal' fame) for teaching
  compiler construction ("Compilerbau"). The script demonstrates the potential
  of the pep machine and language. pl/0 is so limited as not to be very useful
  for actual programming.

STATUS

  19 june 2022
    ran the translator scripts for testing
  7 sept 2021
    mainly working, and translatable to golang (not pl/0, but the pep script)
    maybe some bugs. tcl translation not working.

BUGS

  The tr/translate.tcl.pss does not translate properly this script.
  This is a good oportunity to debug that translator script.

SOLVED BUGS

  statementset parsing was tricked by nested 'end' tokens.
  The example below demonstrated the problem
  >> begin a:=1; if x>4 then write x end .
  This should parse as a valid pl/0 program, but didnt because
  "write x end" was parsed (tokens were reduced) as 
  >> "statement.end." --> "statementset." 

  The solution was to move the 5 token if/then, while/do rule
  to the top of the parse> section.

GRAMMAR FOR PLZERO

  The pep script below does not follow strictly the grammar listed here (which
  is N.Wirth's original grammar). For example there is no "block" token in the
  script (and no nested procedures). The script also allows multiline comments
  in the form "{...}" and accepts "write" and "writeln" as synonyms for "!".
  Also, all keywords as parsed by the script are case-insensitive, since later
  versions of pl/0 allowed uppercase keywords (CONST, VAR etc).
   
  * the pl/0 grammar in wsn/ebnf form
  -------
   program = block "." .
   block = [ "const" ident "=" number {"," ident "=" number} ";"]
        [ "var" ident {"," ident} ";"]
        { "procedure" ident ";" block ";" } statement .

   statement = [ ident ":=" expression | "call" ident 
              | "?" ident | "!" expression 
              | "begin" statement {";" statement } "end" 
              | "if" condition "then" statement 
              | "while" condition "do" statement ].

   condition = "odd" expression |
               expression ("="|"#"|"<"|"<="|">"|">=") expression .
   expression = [ "+"|"-"] term { ("+"|"-") term}.
   term = factor {("*"|"/") factor}.
   factor = ident | number | "(" expression ")".

  ,,,

TESTING

  This script can be successfully translated and run into
  go, java, python, ruby, c. However the tcl translator has some bug
  preventing translation.

  * test program parsing and pretty printing
  >> pep -f eg/plzero.pss -i 'const c=0; var a; proc x;var b; b:=1; !a ;'

  * check if variable declarations are parsing
  >> pep -f eg/plzero.pss -i "var a, b, c ;"

  * translate to golang and run, using the function in helpers.pars.sh
  >> pep.goff eg/plzero.pss test.plzero.primes.txt

NOTES

  The pl/0 language was designed to be easily parsable by a 
  "recursive-descent" parser/compiler (this was the type of compiler
  prefered by Niklaus Wirth). However, the pep parse machine uses
  shift-reductions of parse tokens. This is one explanation/excuse
  why the current script may appear excessively complex!
  In any case, once a working pl/0 parser script is created, it
  has the advantage of being translatable into other languages
  (go, java, python, ruby, c).

  One basic function of a compiler is to decide if a variable
  or function has been defined before it is called. Pep seems 
  to struggle with this at the moment. To solve this we need 
  essentially a "stack" of in-scope valid variables (and their
  type). When the parsing of a procedure is complete, we remove
  all those internal variable names/constants from the 
  variable list.

  This grammar doesnt do nested procedures at the moment, 
  because they seem silly.

  We can use the empty token trick for tokens that 
  don't have to exist. Eg: add an empty "procset*" token
  as soon as the first "procedure" keyword is seen. This slightly
  simplifies the grammar rules for "procset." and "proceed."
  
  The "block" rule contains a series of optional elements that 
  must appear in order eg:
   block := 
     [constant.declarations] [var.declarations] [procedure.declarations]
     [statement] .
  all of these are optional. If we factor out the optionality we can end
  up with a large number of separate rules which can be tedious to code.
  However in this case, there are only about 6 possibilites.

  Statements have no end marker, so how do we know when the 
  expression is finished in "statement = ident ':=' expression ." ?
  So, the problem is, if we have an input stream:
    >> a := b+c*d
  That will parse to "ident ':=' expression" where expression == "b+c"
  which will reduce to "statement", which ignores the "*d" suffix
  and produces a parse error!

  This is solved by using a look-ahead token, in rules that use
  the "begins-with" test (B"...").

  "a = 4" is a legal 'condition' in pl/0 but is also
  part of a constant declaration. So I need to stop this parsing
  as a condition if it is preceded by "const"

HISTORY

  24 june 2022
    Fixed a small bug in this and plzero.ruby.pss Trying to use
    global vars in ruby, but need to add $ to names but not to
    functions.
  9 sept 2021
    Apparently solved an interesting reduction precedence problem
    by moving rules to the top of the parse> section.
    i.e moved B"if.condition.then.statement." to the top of parse>
    this means that the "statement.end." rule does not take 
    precedence (which was leading to mis-parsings).

  7 sept 2021
    Finishing pretty printing. Script appears to work well, if 
    somewhat wordy. Can adapt this script to transpile to go etc.
  6 sept 2021
    more work. transmogrifying number* and ident* into exp* where
    necessary. Basic syntax appears good, apart from nested 
    procedures. Working on pretty printing output, as a prelude
    to transpiling to another language.
  5 sept 2021
    rewriting, using "tail reduction"
  30 august 2019
    created a previous version 

*#

  read;
  # make character numbers (for error messages) relative
  # to line numbers
  "\n" { clear; nochars; .reparse }
  [:space:] { clear; .reparse }

  "+","-" {
    put; clear; add "opadd*"; push; .reparse
  }
  "*","/" {
    put; clear; add "opmul*"; push; .reparse
  }

  # modern pascal style comments, but not (* ... *) because 
  # they are tricky to parse.
  "{" {

    # save the line number for possible error message later
    clear; lines; put; clear; 
    add "{";
    until "}"; 
    E"}" {
      # can convert to another format
      put; clear; 
      # create a "comment" parse token
      # comment-out this line to remove multiline comments from the 
      # translated code
      # add "comment*"; push; 
      .reparse  
    }
    # make an unterminated multiline comment an error
    # to ease debugging of scripts.
    clear; 
    add "[pl/0 error] Unterminated pascal comment { ... } \n";
    add "stating at line number "; get; add "\n";
    print; clear;
    quit;
  }

  "*","/" {
    put; clear; add "opmul*"; push; .reparse
  }

  # literal tokens
  # '!' is the print command in pl/0
  # '?' is the read command in pl/0
  # '#' means != 
  # '.' marks the end of a program. 
  ".",",","=",";","(",")" { 
    put; add "*"; push; .reparse 
  }
  "!" { clear; add "write"; put; clear; add "!*"; push; .reparse }
  "?" { clear; add "read"; put; clear; add "?*"; push; .reparse }

  # also need to parse <= >= comparison operators
  # '=' is comparison and also constant assignment.
  "<",">","#" {
    while [=]; 
    !"<".!">".!">=".!"<=".!"#" {
      # error, message and quit
      put; clear; 
      add "[pl/0 error] unrecognised operator "; get;
      add " at line/char "; lines; add "/"; chars; add "\n";
      print; quit;
    }
    "#" { clear; add "!="; }
    put; clear; add "compare*"; push; 
  }

  ":" { 
    (eof) { 
      clear; add "trailing : at end of file!\n"; 
      print; quit;
    }
    read; 
    ":=" { put; add "*"; push; .reparse }
    put; clear; 
    add "[pl/0 error] unrecognised operator "; get;
    add " at line/char "; lines; add "/"; chars; add "\n";
    print; quit;
  }

  [0-9] { 
    while [0-9]; put; clear; add "number*"; 
    push; .reparse 
  }

  [:alpha:] { 
    while [:alpha:]; 
    # make keywords case insensitive 
    put; lower;
    # synonym with !
    "write","writeln" {
      put; clear; add "!*"; push; .reparse 
    }
    # keywords in pl/0
    "const","var","if","then","while","do","begin","end",
    "proc","procedure","call","odd" { 
      # or add ".key*" to remind us that these are keywords,
      # not parse tokens
      # add a space here for pretty printing later
      put; "procedure" { clear; add "proc"; }
      add "*"; push; .reparse
    }
    # non-lower case identifier is already saved on the tape
    add "ident*"; push; .reparse 
  }

  !"" { 
    # error unrecognised character
    put; clear; lines; add ":"; chars; 
    add " [pl/0 error] incorrect character '"; get; add "'\n";
    print; quit;
  }

parse>

 # To visualise token reduction uncomment this below:
 # lines; add ":"; chars; add " "; print; clear; 
 # add "\n"; unstack; print; clip; stack; 

 # we will do these 5 token reductions first, so that 
 # "statement*end*" does not have precedence
 pop; pop; pop; pop; pop;
  # sometimes statements are terminated by expressions, which
  # in turn must be terminated by something else, so there is a 
  # trailing token there
  B"if*condition*then*statement*".!"if*condition*then*statement*" {
    replace "if*condition*then*statement*" "statement*";
    push; push; --; --;
    get; add " "; ++; get; ++; add " "; get; add "\n"; ++; get;
    --; --; --; put; clear;
    # transfer 'invisible' token value
    ++; ++; ++; ++; get; --; --; --; put; --; clear; ++; ++;
    .reparse
  }
  B"while*condition*do*statement*".!"while*condition*do*statement*" {
    replace "while*condition*do*statement*" "statement*";
    push; push; --; --;
    get; add " "; ++; get; ++; add " "; get; add "\n"; ++; get;
    --; --; --; put; clear;
    # transfer invisible token value, and realign tape pointer
    ++; ++; ++; ++; get; --; --; --; put; --; clear; ++; ++;
    .reparse
  }
  push; push; push; push; push;

#-----------------
# 1 token
  pop; 
  
  # errors 
  (eof).!".*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] Missing '.' at end of program ? \n";
    add " \n"; print; quit;
  }

#-----------------
# 2 tokens
  pop; 
  
# Some errors
  "ident*ident*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] 2 variable names in a row (is there a \n";
    add " missing operator?) \n"; print; quit;
  }

  "vardec*const*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] Constant declarations must precede variable \n";
    add " declarations in pl/0 \n"; print; quit;
  }
  "condec*const*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] only 1 constant declaration block is allowed \n";
    add " in each scope \n"; print; quit;
  }
  "vardec*var*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] only 1 variable declaration block is allowed \n";
    add " in each scope \n"; print; quit;
  }
  ";*end*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] Last statement in block does not require \n";
    add " a semi-colon ';' \n"; print; quit;
  }

  #*
  B"const*".!"const*".!E"ident*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] The 'const' keyword must be followed by \n";
    add " a variable name. \n"; print; quit;
  }
  *#

  # add an empty procset* (set of procedures) if there is
  # not one already. This simplifies program reduction later
  # it doesnt matter if there is 1 or 2 tokens here.
  E"proc*".!B"procset*" {
    !"proc*" { 
      # need to conserve the preceding token
      push; clear; get; ++; put; --; clear; put;
      add "procset*proc*"; push; push; .reparse
    }
    # only 1 token
    clear; get; ++; put; --; clear; put;
    add "procset*proc*"; push; push; .reparse
  }

  # procedure headers (name etc)
  "procset*proceed*" {
    clear; get; add "\n"; ++; get; --; put; clear;
    add "procset*"; push; .reparse
  }

  # "=" can be used for constant assignment but also as
  # a comparison operator
  "=*ident*","=*exp*" { replace "=*" "compare*"; push; push; .reparse }
  "exp*=*" { replace "=*" "compare*"; push; push; .reparse }

  # expression transmog
  "opmul*ident*","opadd*ident*","opmul*number*","opadd*number*",
  "compare*ident*","compare*number*",":=*ident*",":=*number*",
  "if*ident*","if*number*","while*ident*","while*number*",
  "(*ident*","(*number*","!*ident*","!*number*" {
    push; clear; add "exp*"; push; .reparse
  }

  # expression transmutation
  "ident*opmul*","ident*opadd*","number*opmul*","number*opadd*",
  "ident*compare*","number*compare*" {
    replace "ident*" "exp*"; replace "number*" "exp*";
    push; push; .reparse
  }

  "const*conlist*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "condec*"; push; .reparse
  }

  # non-tail reduction for variables
  "var*ident*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "varlist*"; push; .reparse
  }

  # variable decs
  "varlist*;*" {
    clear; get; ++; get; --; put; clear;
    add "vardec*"; push; .reparse
  }

  "block*.*" {
    clear; add "program*"; push; .reparse
  } 

  "call*ident*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "statement*"; push; .reparse
  }

  "?*ident*" {
    clear; get; ++; get; --; put; clear;
    add "statement*"; push; .reparse
  }

  # a multi statement block, between begin/end
  "begin*statementset*" {
    # ident statements,
    clear; get; add "\n"; ++; get; --; replace "\n" "\n  "; 
    add "\nend"; put; clear; add "statement*"; push; .reparse
  }

  # tail reduction for statementsets 
  "statement*end*" {
    clear; 
    #get; ++; get; --; put; clear;
    add "statementset*"; push; .reparse
  }

#-----------------
# 3 tokens
  pop;

  # Some 3 token errors, not all of them

  ";*ident*=*","begin*ident*=*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error]\n";
    add "  incorrect assignment? The assignment operator in \n";
    add "  pl/0 is ':=' not '=' \n"; print; quit;
  }

  # ! expression must be parsed while looked at the 
  # trailing token, like all expression parsing
  B"!*exp*".!"!*exp*".!E"opmul*".!E"opadd*" {
    # need to conserve the "invisible" last token 
    replace "!*exp*" "statement*"; push; push; 
    # also transfer attributes 
    --; --; get; add " "; ++; get; --; put; 
    clear; ++; ++; get; --; put; ++; clear;
    .reparse
  }

  # procedure headers (name etc)
  "proc*ident*;*" {
    clear; get; add " "; ++; get; ++; get; --; --; put; clear;
    add "prochead*"; push; .reparse
  }

  # procedure headers (name etc)
  "prochead*statement*;*" {
    # indent the statement if it is not a begin/end construct
    clear; ++; get;
    !B"begin" { clear; add "  "; get; replace "\n" "\n  "; put; }
    --; 
    clear; get; add "\n"; ++; get; ++; get; --; --; put; clear;
    add "proceed*"; push; .reparse
  }

  # expressions, this could be the trickiest aspect of 
  # the grammar. transmog ident/number to exp
  "exp*opmul*exp*", "exp*opadd*exp*" {
    clear; get; ++; get; ++; get; --; --; put; clear;
    add "exp*"; push; .reparse
  }

  "(*exp*)*" {
    clear; get; ++; get; ++; get; --; --; put; clear;
    add "exp*"; push; .reparse
  }

  # this is reverse reduction of statements, which can be 
  # useful since exp* requires a trailing token in order to
  # resolve.
  "statement*;*statementset*" {
    clear; get; add ";\n"; ++; ++; get; --; --; put; clear;
    add "statementset*"; push; .reparse
  }

  # variable decs 
  "varlist*,*ident*" {
    clear; get; ++; get; add " "; ++; get; --; --; put; clear;
    add "varlist*"; push; .reparse
  }

#-----------------
# 4 tokens
  pop; 

  # 4 token errors
  B"ident*=*number*".!E";*".!E",*" {
    clear; lines; add ":"; chars; 
    add " [pl/0 error] incorrect assignment operator '=' \n";
    add " did you mean ':=' ? \n"; print; quit;
  }

  # procedure headers (name etc). Need to indent the variable decs etc
  "prochead*vardec*statement*;*","prochead*condec*statement*;*" {
    # indent the variable/constant declaration
    clear; add "  "; ++; get; replace "\n" "\n  "; put; --; 

    clear; get; add "\n"; ++; get; add "\n"; ++; get; ++; get; 
    --; --; --; put; clear; add "proceed*"; push; .reparse
  }

  # ident and number have already been transmog'ed into exp*
  # and =* into compare*
  B"exp*compare*exp*".!"exp*compare*exp*".!E"opmul*".!E"opadd*" {
    # need to conserve the "invisible" last token 
    replace "exp*compare*exp*" "condition*"; push; push; 
    --; --; get; add " "; ++; get; add " "; ++; get; --; --; put; clear;
    # transfer trailing token value
    ++; ++; ++; get; --; --; put; ++; clear; 
    .reparse
  }

  # also see the 5 token reduction, because of the trailing token
  # required by exp*
  "if*condition*then*statement*",
  "while*condition*do*statement*" {
    # indent the statement if it is not a begin/end construct
    clear; ++; ++; ++; get;
    !B"begin" { clear; add "  "; get; replace "\n" "\n  "; put; }
    --; --; --; clear;
    clear; get; add " "; ++; get; add " "; ++; get; add "\n"; ++; get;
    --; --; --; put; clear;
    add "statement*"; push; .reparse
  }

  # lookahead for expressions in statements 
  # the problem is: x:=4*3+1 will be parsed at 
  # statement*+1 if no lookahead.

  # If the expression
  # is not followed by a */+- then it is a complete statement
  # and can be reduced. lets transmog ident and number to make simpler
  B"ident*:=*exp*".!"ident*:=*exp*".!E"opmul*".!E"opadd*" {
    # need to conserve the "invisible" last token 
    replace "ident*:=*exp*" "statement*"; push; push;
    --; --; get; add " "; ++; get; add " "; ++; get; --; --; put; clear;
    ++; ++; .reparse
  }

  # tail reduction for constant decs
  "ident*=*number*;*" {
    clear; get; add " "; ++; get; add " "; ++; get; ++; get; 
    --; --; --; put; clear;
    add "conlist*"; push; .reparse
  }

  # 
  "ident*;*" {
    clear; get; ++; get; --; put; clear;
    add "varlist*"; push; .reparse
  }

#-----------------
# 5 tokens
  pop; 

  # procedure headers (name etc), need to indent condec vardec
  "prochead*condec*vardec*statement*;*" {
    # indent the variable and constant declarations
    clear; add "  "; ++; get; replace "\n" "\n  "; put; 
    clear; add "  "; ++; get; replace "\n" "\n  "; put; 
    --; --; 

    clear; get; add "\n"; ++; get; add "\n"; ++; 
    get; add "\n"; ++; get; ++; get; --; --; --; --; put; clear;
    add "proceed*"; push; .reparse
  }
  # constant declarations, tail reduction, but tail redux not
  # necessary
  "ident*=*number*,*conlist*" {
    clear; 
    get; add " "; ++; get; add " "; ++; 
    get; ++; get; add " "; ++; get;
    --; --; --; --; put; clear;
    add "conlist*"; push; .reparse
  }

  # program reduction
  (eof) {
    "statement*.*" {
      clear; get; ++; get; --; put; clear; 
      add "program*"; 
    }
    "vardec*statement*.*",
    "condec*statement*.*",
    "procset*statement*.*" {
      clear; get; add "\n{ main program }\n"; 
      ++; get; ++; get; --; --; put; 
      clear; add "program*"; 
    }

    "condec*vardec*statement*.*",
    "vardec*procset*statement*.*",
    "condec*procset*statement*.*" {
      clear; 
      get; add "\n"; ++; get; add "\n{ main program }\n"; 
      ++; get; ++; get; --; --; --; put; 
      clear; add "program*";
    }
    "condec*vardec*procset*statement*.*" {
      clear; get; add "\n"; ++; get; 
      add "\n"; ++; get; add "\n{ main program }\n"; 
      ++; get; ++; get; --; --; --; --; put; 
      clear; add "program*"; 
    }
  }

  (eof) {
    unstack;
    "program*" {
      clear; 
      add "{ ok pl/0 }\n";
      add "{ code checked and formatted by eg/plzero.pss }\n";
      get; add "\n"; print; quit;
    }
    put; clear; 
    add "Pl/0 program did not parse well \n";
    add "The final parse tokens were: "; get; add "\n";
    print; quit;
  }
  push; push; push; push; push;