#* 

   pars/compile.pss 

   This is a parse-script which compiles parse-scripts (!).

   What is more, it can compile itself... so we can do
     >> pep -f compile.pss compile.pss > asm.new.pp
     
   This is useful because the resulting 'assembler' program (saved in
   sav.pp and printed to stdout) can be used as a replacement for 'asm.pp',
   which is the default parse-script language compiler. The advantage 
   is that it is easier to maintain and add new syntax to compile.pss
   than it is to 'asm.pp'.

   This script uses the virtual machine and engine implemented at 
   http://bumble.sf.net/books/pars/object/. It implements a script language
   with a syntax reminiscent of sed and awk (much simpler than awk, but
   more complex than sed).
   
   This code was created in a straightforward manner by adapting the 
   "assembled" code in 'asm.pp'. Some extra error checks were added.
   Also, the EOF test was placed at the end of the script to remove
   the 'last character' bug. It was evident that using the script
   language is much more comfortable than hand-coding parse machine
   assembler programs.

HOW TO ADD A NEW COMMAND TO PEP

   In general, we would like to avoid adding new commands 
   (instructions) to the pep script-language/machine since we would
   like to keep the machine and language as simple as possible.
   However, after great thought and cogitation, sometimes new commands
   or features seem like a good idea. To add a new command the 
   process is as follows:

   Add a constant in command.h, and an entry in the info[] array in
   command.c. Implement the command in machine.interp.c in the big
   switch statement.

   Then modify compile.pss to recognise the command when it is 
   in a script. 
   Then copy asm.pp to asm.old.pp
   Then do
     >> pep -f compile.pss compile.pss > asm.new.pp
     >> cp asm.new.pp asm.pp
   Now test your "newcommand" with eg
     >> pep -e "r; newcommand; t;d;" -i "abcd"
  
   Now, if you wish, add the new command to the various 
   translation scripts in the tr/ folder. The modifications 
   required in the translation scripts are very similar to the 
   modifications made to the "compile.pss" script, except that 
   the target language is different.

REPLACING ASMPP

   We can use this script as a replacement for "asm.pp" or 
   "asm.handcode.pp" which is a script assembler written by hand in
   the parse machine assembly format (1 command per line, labels, jumps,
   tests, etc). 

   * replace asm.pp with compile.pss
   -----
    # generate the new script assembler
    cp asm.pp asm.old; pep -f compile.pss compile.pss > asm.new.pp
    cp asm.new.pp asm.pp
    # test the new assembler (the script "r;t;t;t;d;" will be compiled
    # by the new asm.pp which we have just created).
    pep -e "r;t;t;t;d;" -i "abcd"
    # output: aaabbbcccddd
   ,,,

   The advantage of all this is that it is much easier to maintain and add
   new syntax to "pars/compile.pss" than it is to "pars/asm.pp" directly.
   
   For example, asm.handcode.pp still uses "rabbit hops" to compile "quoteset"
   tokens (an old version of the "ortestset" token), which is very inefficient,
   but compile.pss uses the new look-ahead technique. Also, there are negated
   tests implemented in compile.pss but not implemented in asm.handcode.pp.
   
   I will no longer continue to maintain asm.handcode.pp because its real
   purpose was to "bootstrap" the current script. I will maintain working
   copies of asm.pp as generated by this script in case of future errors.
     
NOTES

   The accumulator register was being used to generate true-jump 
   targets for testsets, but this is no longer the case.
   
   This script can be used as the basis for many others which transform
   scripts in some way. 
   
   For example, to 'pretty-print' scripts, or to generate compilable c code
   for a script using the functions in machine.methods.c. So, instead of
   compiling to the "assembler" format for the machine (which is then
   interpreted by the code in pep.c) we can compile to a series of c function
   calls. This is c source code which can be compiled with gcc, producing an
   executable version of the target script.

   This is an interesting idea, because we can transform a script into
   compilable or executable code in a different language with a different
   'Machine' object. So, for example, we could write a Machine object in Ruby
   or Java or Python or x86 assembler and then generate compilable or
   executable code for that target environment. The compilable code would
   consist of a series of method calls for the given object and test and
   jumps. 
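
   As a rough illustration of the idea above, here is a minimal Python
   sketch of the script "r;t;t;t;d;" written as direct method calls on a
   machine object. The class 'Machine' and the function 'run_script' are
   hypothetical names invented for this sketch. They are not the API of
   machine.methods.c or of any tr/ translation script, and only the three
   commands used by the example are modelled.

```python
class Machine:
    """Hypothetical, minimal stand-in for the pep machine (sketch only)."""

    def __init__(self, text):
        self.input = text      # the input stream
        self.pos = 0           # index of the next character to read
        self.workspace = ""    # the text buffer the commands act on
        self.output = []       # collected output, in place of stdout

    def eof(self):
        """True when no more input characters remain."""
        return self.pos >= len(self.input)

    def read(self):
        """'read': append the next input character to the workspace."""
        self.workspace += self.input[self.pos]
        self.pos += 1

    def print(self):
        """'print': emit the workspace."""
        self.output.append(self.workspace)

    def clear(self):
        """'clear': empty the workspace."""
        self.workspace = ""


def run_script(text):
    """The script "r;t;t;t;d;" as a plain series of method calls."""
    mm = Machine(text)
    # the script body is repeated until the input is exhausted
    while not mm.eof():
        mm.read()
        mm.print()
        mm.print()
        mm.print()
        mm.clear()
    return "".join(mm.output)


print(run_script("abcd"))  # aaabbbcccddd
```

   The loop stands in for the implicit restart of the script at each
   cycle, and the tests and jumps of the assembled format would become
   ordinary 'if' statements in the generated code.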

   It will also be interesting to see if there is a significant performance
   advantage in running compiled, rather than interpreted, scripts. See
   tr/translate.c.pss or tr/translate.go.pss for creating executable parse
   programs from scripts.

GRAMMAR NOTES

  The machine cannot directly implement the ebnf structures of repetition
  "{...}", optionality "[...]" or grouping "(...)", so we need to express all
  grammar rules only in terms of alternation |. Quotesets are a handy way to
  express this in a script, eg

     * bnf rule: alpha ::= a | b | c ;
     >> 'a','b','c' { clear; add "alpha*"; push; .reparse } 

  It is sometimes straightforward to factor out the above ebnf structures,
  but the result is a greater number of rules.
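
  As a small worked example of this factoring, the Python sketch below
  expands a rule containing optional "[...]" symbols into the equivalent
  list of alternatives. The function name 'expand_optionals' and the
  convention of writing an optional symbol as '[name]' are assumptions
  made only for this illustration. They are not part of pep.

```python
from itertools import product


def expand_optionals(rule):
    """Expand a rule whose optional symbols are written '[name]' into
    the equivalent list of alternatives with no optional parts."""
    choices = []
    for symbol in rule:
        if symbol.startswith("[") and symbol.endswith("]"):
            # an optional symbol is either absent (None) or present
            choices.append([None, symbol[1:-1]])
        else:
            choices.append([symbol])
    # every combination of the choices is one alternative
    return [[s for s in combo if s is not None]
            for combo in product(*choices)]


# ebnf: command ::= word [quote] ';'
for alternative in expand_optionals(["word", "[quote]", ";"]):
    print(" ".join(alternative))
```

  One rule with a single optional part becomes two alternatives, which
  is why this script has separate blocks for "word*;*" and for
  "word*quote*;*". Each further optional part doubles the count.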

SEE ALSO
   
   At http://bumble.sf.net/books/pars/
   object/pep.c 
     the current implementation of the machine interpreter and debugger. 
   object/*.c 
     the virtual machine and components 
   tr/translate.java.pss
     compiles pep scripts to a stand-alone java source code file
   tr/translate.go.pss
   tr/translate.ruby.pss
   tr/translate.python.pss
     As above, for go, ruby and python.
   asm.handcode.pp
     a handcoded "assembly" compiler of the parse script language for 
     a previous version of the script language. This was how I 
     initially "bootstrapped" the pep language (before using the
     current file, compile.pss to create new versions of the pep 
     language).
    
USAGE

   This script is used to replace the hand-coded assembler file
   "asm.handcode.pp" since it is much easier to maintain and add new syntax
   for the parse-script language. Comments are preserved (largely) in the
   output file. 
   
   We can also do the seemingly strange operation
     >> pep -f compile.pss compile.pss
   which actually creates an 'assembler' version of itself in 'sav.pp'
   which is then suitable for use as an 'asm.pp' substitute.
   (This is how we modify the syntax of the pep language, if need be). 
   This is quite tricky to think about since it is so self-referential.

   This is analogous to the equally strange operation
     >> pep -f tr/translate.c.pss tr/translate.c.pss > eg/clang/tr.clang.c
   which generates a compilable c language program from the c translation
   script itself.

   It is possible to compile this script into a stand-alone
   executable with:
   ----
     pep -f tr/translate.c.pss tr/translate.c.pss > eg/clang/tr.clang.c
     cd eg/clang/
     gcc -o tr.clang tr.clang.c -Lobject -lmachine -Iobject
   ,,,,

TESTING

   * view how this script compiles an inline script
   >> pep -f compile.pss -i "r; [aeiou] {a '(vowel)'; } t;d;"

   The result will also be saved in "sav.pp"

   * see how the compiled script runs
   >> pep -a sav.pp -i "abcde"
   output: a(vowel)bcde(vowel)

   * test "test chaining" compilation  
   >> pep -f compile.pss -i "r;'a','b','c'{t;}t;d;"
   >> pep -a sav.pp -i "axbxcx"
   output should be: aaxbbxccx

   * view/debug how compile.pss compiles test chains (or something else)
   >> pep -If compile.pss -i "r;'a','b','c'{t;}t;d;"

   This provides interactive debugging of the compilation process.

FIXED BUGS

  I was getting segmentation faults because of off-by-one errors etc
  >> pep -f compile.pss compile.pss
  Mainly fixed with "valgrind", but there is still a bug in "until" (in
  object/machine.interp.c execute()): need to implement an endsWith()
  function. And one other bug.
  * didnt need 2 jumps after "tests", just 1 jumpfalse or jumptrue
    used "replace" to remove the unnecessary jump

  * eg: add "\\"; threw an error and also: replace "\\" "\\\\";
    This was a problem with the "until" implementation in machine.interp.c
    It was actually necessary to count the number of escape chars 
    before the suffix. If even, break, if not, dont.
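
  The even/odd rule described above can be sketched in Python as
  follows. The function name 'ends_unescaped' and its parameters are
  hypothetical. This is an illustration of the counting rule only, not
  the code in machine.interp.c.

```python
def ends_unescaped(text, suffix, escape="\\"):
    """Return True if text ends with suffix and that suffix is not
    escaped, that is, an even number of escape characters precedes it."""
    if not text.endswith(suffix):
        return False
    # walk backwards over the escape characters just before the suffix
    i = len(text) - len(suffix) - 1
    count = 0
    while i >= 0 and text[i] == escape:
        count += 1
        i -= 1
    # even: the escapes pair up with each other, so "until" should stop
    # odd: the last escape applies to the suffix, so keep reading
    return count % 2 == 0


print(ends_unescaped('abc"', '"'))       # True
print(ends_unescaped('abc\\"', '"'))     # False
print(ends_unescaped('abc\\\\"', '"'))   # True
```

  The third case is the one that was failing: the text ends with two
  escape characters and then the quote, so the quote itself is not
  escaped.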

BUGS
   
  * missing braces in scripts dont produce good error messages,
    just a cryptic "script could not be compiled".

  * should I unescape single quotes in single quote blocks??
    eg ' abc\'xyz' will become " abc'xyz"

  * doesnt catch B[abc] or E[a-z] type errors in scripts. Also 
    doesnt catch "r;r;d" type errors.
  * Also, un-balanced braces give cryptic error messages

  compile.pss should not write the compiled script to stdout
  because then asm.pp will do the same thing. easy enough to fix
  in asm.pp as well (comment out final 2 "print" commands).

  Comments may not be parsing correctly.

  Comments and multiline comments should not jump back to read
  after deleting the comment, because there could be no more 
  input, and read will throw an error. They should jump to 
  the EOF end-of-file check. Or they could just call ".reparse"
  which is safe but not very efficient.

TODO

  Add an "echar" command that changes the default escape
  character. Also, in some languages a character actually
  escapes itself, eg '' is ' escaped!

  We could allow single argument "replace" command eg:
    >> replace "x";
  which is equivalent to
    >> replace "x" "";
  
  Need to catch multiline quote errors when used with the 
  "until" command.

  Separate error checking into a new script, and make pep load
  an assembled version of this error checker. This will allow
  the same error checker to be used with the scripts
    tr/translate.java.pss tr/translate.tcl.pss etc.

HISTORY
    
  15 june 2021
    Adding the commands "upper" "lower" and "cap"
    "nochars" "nolines"

  13 march 2020
    Added compilation for multiline arguments for the "add" 
    command. Appears to be working.

  15 sept 2019
    Realised that I can have an eof error check block at 
    the end of the script just before all the tokens are 
    pushed back on the stack. See the 1 and 2 token eof error
    check in this script.

  13 september 2019
    Adding "mark" and "go" commands here.
    Improved unterminated quote '/" error messages. In general
    it is much more helpful to catch the error when it happens 
    and print an informative message (with line-number etc).

  5 september 2019

    Added a "stack" and "unstack" command to the machine and
    to compile.pss

  29 august 2019
   
    Improved some error checks. Could make the error check code
    more succinct.

    Changed the way testeof and testtape are parsed to include
    them with other tests. This also allows them to be negated with
    !(==) and !(eof) and to be concatenated with other tests,
    eg: (eof),B"abc" {}
    Added extra syntax <eof> <EOF> and <==> for these tests.

  25 august 2019

    Realised that I dont need 2 jumps for OR test concatenation (with ',')
    That will greatly improve script interpretation efficiency.

    Added AND concatenation logic to tests so now we can do

     * test if workspace begins with 'a' AND ends with 'z'
     >> B"a".E"z" {}

    Changed the way .reparse and .restart are parsed and compiled.
    These are now parsed as 2 tokens ".*word*". This allows me to
    use '.' for AND logic concatenation in tests. It also allows
    me to provide special semantic meaning to commands beginning with
    a dot, which seems like a good thing.

    Added "delim" command here and in machine.c and machine.interp.c, 
    to change the stack delimiter.

  24 august 2019

    The "state*" token should be separated into "testeof*" and 
    "testtape*" and then the 2 tests can be elided.(done)

    The conversion to a "test*{*" rule and elision of 
    multiple tests will make this script much more compact and hopefully
    just as readable. Also, as a side effect, negation of all
    tests will be available soon. Also, it is possible to chain together
    different types of tests.

    Converted quoteset to "ortestset*" and "andtestset*". 
    I will introduce a new notation namely:

    * check if workspace begins with "abc" AND ends with "xyz"
    >> B"abc" . E"xyz" { commands }

    so the dot will become an "AND" (&&) concatenator of tests
    and "," will remain as the "OR" (||) concatenator of tests
    In these || and && test lists any type of test can be 
    included for example
     
     * check if workspace starts with "a", only contains chars a|b|c
     * and ends with the letter "z" (using "." AND concatenator)
     >> B"a" . [abc] . E"z" { ... } 

    Experimenting with the new technique to create negated token
    classes.

    * test negated tokens for the equality test
    >> pep -f compile.pss -i 'r;!"b",!"a"{nop;}'

  23 august 2019
    
    Adding begintests B"..." { } and endtests E"..." {} to the quoteset logic.
    But need to juggle the combinations. Also will add classes and negated
    classes. More or less working. But should actually change parsing to
    make quotesets more flexible, see the section of the script for details.

    The new quoteset compilation seems to be working.
    Needs more testing. We can now use compile.pss as a replacement
    for asm.pp.

    Converting to a new quoteset (eg: 'n','m' {...} ) lookahead compiling
    technique.  Also we can compile comments with rules for
    "comment*command*" and "command*comment*" and "comment*comment*" ->
    "comment*". Instead of the current shenanigans.

  14 august 2019

    trying to preserve comments here but cant reduce comments
    with tokens like {* }* !* etc because we never retrieve
    the attributes for those tokens. more thought required.

    Added a !"text" {...} syntax. very simple to add here. 
    did the same in compilable.c.pss

    Added a "begin" block to this (for start configurations of scripts).
    Also need to improve the compilation of "quoteset*" tokens which produce
    nifty but very poor code. need 'tapereplace' command for this?
    
  30 july 2019
    Fixed the last character bug by putting the EOF test at the very end of
    the file. The translation is complete and the script appears to be
    working but no doubt will contain bugs.  Initially translated from
    asm.pp.

*# 

  read;
  #--------------
  [:space:] {
    clear; .reparse
  }

  #---------------
  # We can elide all these single character tests, because
  # the stack token is just the character itself with a *
  # Braces {} are used for blocks, ',' and '.' for concatenating
  # tests with OR or AND logic. 'B' and 'E' for begin and end
  # tests. 
  "{", "}", ";", ",", ".", "!", "B", "E" {
    put; add "*"; push;
    .reparse 
  }

  #---------------
  # format: "text"
  "\"" {
    # save the line number in case there is no terminating
    # quote.
    clear; ll; put; clear; add '"';
    until '"'; 
    !E'"' { 
      clear; add 'Unterminated quote (") starting at line ';
      get; add ' !\n'; 
      print; quit;
    }
    put; clear;
    add "quote*"; push;
    .reparse 
  }

 #---------------
 # format: 'text', single quotes are converted to double quotes
 # but we must escape embedded double quotes.
  "'" {
    # save the line number in case there is no terminating
    # quote.
    clear; ll; put; clear; 
    until "'"; 
    !E"'" { 
      clear; add "Unterminated quote (') starting at line ";
      get; add '!\n'; 
      print; quit;
    }
    # should we unescape single quotes here??
    clip; escape '"'; put; clear;
    add "\""; get; add "\"";
    put; clear;
    add "quote*";
    push;
    .reparse 
  }

  #---------------
  # formats: [:space:] [a-z] [abcd] [:alpha:] etc 
  "[" {
    until "]"; put; clear;
    add "class*"; push;
    .reparse 
  }

 #---------------
 # formats: (eof) (==) etc. I may change this syntax to just
 # EOF and ==
  "(" {
    clear; until ")"; clip;
    put; 
    "eof","EOF" { clear; add "eof*"; push; .reparse } 
    "==" { clear; add "tapetest*"; push; .reparse } 

    add " << unknown test near line "; ll;
    add " of script.\n";
    add " bracket () tests are \n";
    add "   (eof) test if end of stream reached. \n";
    add "   (==)  test if workspace is same as current tape cell \n";
    print; clear;
    quit;
  }

  #---------------
  # multiline and single line comments, eg #... and #* ... *#
  "#" {
    clear; read;
    "\n" { clear; .reparse }

    # checking for multiline comments of the form "#* \n\n\n *#"
    # these are just ignored at the moment (deleted) 
    "*" { 
      # save the line number for possible error message later
      clear; ll; put; clear;
      until "*#"; 
      E"*#" {
        # convert to # single-line comments
        clip; clip;
        #put; clear; add "#*"; get; add "*#";
        replace "\n" "\n#";
        # create a "comment" parse token
        put; clear; add "comment*"; push; 
        .reparse  
      }
      # make an unterminated multiline comment an error
      # to ease debugging of scripts.
      clear; 
      add "unterminated multiline comment #* ... *# \n";
      add "starting at line number "; get; add "\n";
      print; clear;
      quit;
    }

    # single line comments. some will get lost.
    put; clear; add "#"; get; until "\n"; clip;
    put; clear; add "comment*"; push; 
    .reparse 
  }

 #----------------------------------
 # parse command words (and abbreviations)

 # legal characters for keywords (commands). 'D' and 'M' are included
 # because they are the abbreviations for "replace" and "go".
 ![abcdefghijklmnopqrstuvwxyzBDEKGMPRUWS+-<>0^] {
   # error message about a misplaced character
   put; clear;
   add "!! Misplaced character '";
   get;
   add "' in script near line "; ll;
   add " (character "; cc; add ") \n";
   print; clear; bail;
 }

   # my testclass implementation cannot handle complex lists
   # eg [a-z+-] this is why I have to write out the whole alphabet.
   # 'O', 'F' and '=' are needed for the <EOF> and <==> tests.

   while [abcdefghijklmnopqrstuvwxyzBDEOFKGMPRUWS+-<>=0^];
   #----------------------------------
   # KEYWORDS 
   # here we can test for all the keywords (command words) and their
   # abbreviated one letter versions (eg: clip k, clop K etc). Then
   # we can print an error message and abort if the word is not a 
   # legal keyword for the parse-edit language

   # make ll an alias for "lines" and cc an alias for chars
   "lines" { clear; add "ll"; }
   "chars" { clear; add "cc"; }
   # one letter command abbreviations
   "a" { clear; add "add"; }
   "k" { clear; add "clip"; }
   "K" { clear; add "clop"; }
   "D" { clear; add "replace"; }
   "d" { clear; add "clear"; }
   "t" { clear; add "print"; }
   "p" { clear; add "pop"; }
   "P" { clear; add "push"; }
   "u" { clear; add "unstack"; }
   "U" { clear; add "stack"; }
   "G" { clear; add "put"; }
   "g" { clear; add "get"; }
   "x" { clear; add "swap"; }
   ">" { clear; add "++"; }
   "<" { clear; add "--"; }
   "m" { clear; add "mark"; }
   "M" { clear; add "go"; }
   "r" { clear; add "read"; }
   "R" { clear; add "until"; }
   "w" { clear; add "while"; }
   "W" { clear; add "whilenot"; }

   # we can probably omit tests and jumps since they are not
   # designed to be used in scripts (only assembled parse programs).
   #*
   "b" { clear; add "jump"; }
   "j" { clear; add "jumptrue"; }
   "J" { clear; add "jumpfalse"; }
   "=" { clear; add "testis"; }
   "?" { clear; add "testclass"; }
   "b" { clear; add "testbegins"; }
   "B" { clear; add "testends"; }
   "E" { clear; add "testeof"; }
   "*" { clear; add "testtape"; }
   *#

   "n" { clear; add "count"; }
   "+" { clear; add "a+"; }
   "-" { clear; add "a-"; }
   "0" { clear; add "zero"; }
   "c"     { clear; add "cc"; }
   "chars" { clear; add "cc"; }
   "l"     { clear; add "ll"; }
   "lines" { clear; add "ll"; }
   "^" { clear; add "escape"; }
   "v" { clear; add "unescape"; }
   "z" { clear; add "delim"; }
   "S" { clear; add "state"; }
   "q" { clear; add "quit"; }
   "Q" { clear; add "bail"; }
   "s" { clear; add "write"; }
   "o" { clear; add "nop"; }
   "rs" { clear; add "restart"; }
   "rp" { clear; add "reparse"; }

   # some extra syntax for testeof and testtape
   "<eof>","<EOF>" { put; clear; add "eof*"; push; .reparse }
   "<==>" { put; clear; add "tapetest*"; push; .reparse }

   #*
   "nochars", "nolines" {
     put; clear; 
     add "The command '"; get; add "' (near line "; ll; add ")\n";
     add "has not been implemented, but needs to be. \n";
     print; clear; quit;
   }
   *#

   "add","clip","clop","replace","clear",
   "upper","lower","cap","print",
   "pop","push","unstack","stack","put","get","swap",
   "++","--","mark","go",
   "read","until","while","whilenot",
   "jump","jumptrue","jumpfalse",
   "testis","testclass","testbegins","testends",
   "testeof","testtape",
   "count","a+","a-","zero","cc","ll", "nochars","nolines",
   "escape","unescape","delim","state","quit","bail",
   "write","nop","reparse","restart" {
     put; clear;
     add "word*";
     push; .reparse
   }
   
   #------------ 
   # the .reparse command and "parse label" is a simple way to 
   # make sure that all shift-reductions occur. It should be used inside
   # a block test, so as not to create an infinite loop.

   "parse>" {
     clear; add "parse:"; put;
     clear; add "command*"; push;
     .reparse 
   }

   # --------------------
   # try to implement begin-blocks, which are only executed
   # once, at the beginning of the script (similar to awk's BEGIN {} rules)
   "begin" {
     put; add "*"; push; .reparse 
   }

   put; 
   add "Pep syntax error: unknown command '"; get; add "' \n";
   add "on line "; ll; add " (or character "; cc; add ") "; 
   add "of input (file or stream). \n"; 
   print; clear; quit;

# ----------------------------------
# PARSING PHASE:
# the lexing phase finishes here, and below is the 
# parse/compile phase of the script. Here we pop tokens 
# off the stack and check for sequences of tokens eg word*semicolon*
# If we find a valid series of tokens, we "shift-reduce" or "resolve"
# the token series eg word*semicolon* --> command*
#
# At the same time, we manipulate (transform) the attributes on the 
# tape, as required. So Tape=|pop|;| becomes |\npop| where the 
# bars | indicate tape cells. (2 tapes cells are merged into 1).
#
# Each time the stack is reduced, the tape must also be reduced
# 

parse>

#-------------------------------------
# 2 tokens
#-------------------------------------
  pop; pop;

  # All of the below are currently errors, but may not
  # be in the future if we expand the syntax of the parse
  # language. Also consider:
  #    begintext* endtext* quoteset* notclass*, !* ,* ;* B* E*
  # It is nice to trap the errors here because we can emit some
  # hopefully not-very-cryptic error messages with a line number.
  # Otherwise the script writer has to debug with
  #   pep -a asm.pp scriptfile -I
  #

  "word*word*", "word*}*", "word*begintext*", "word*endtext*",
  "word*!*", "word*,*", 
  "quote*word*", "quote*class*", "quote*state*", "quote*}*",
  "quote*begintext*", "quote*endtext*",
  "class*word*", "class*quote*", "class*class*", "class*state*", "class*}*",
  "class*begintext*", "class*endtext*", "class*!*", 
  "notclass*word*", "notclass*quote*", "notclass*class*", 
  "notclass*state*", "notclass*}*"
  {
    push; push;
    add "error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script (missing semicolon/brace/unescaped quote??) \n";
    print; clear; quit;
  }  

  "{*;*", ";*;*", "}*;*" {
    push; push;
    add "error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: misplaced semi-colon? ; \n";
    print; clear; quit;
  }

  # comma errors. The ",*{*" case has its own message below.
  ",*;*", ",*}*" {
    push; push;
    add "error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: misplaced comma? \n";
    print; clear; quit;
  }

  ",*{*" {
    push; push;
    add "Pep: error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: extra comma in list? \n";
    print; clear; quit;
  }

  "command*;*","commandset*;*" {
    push; push;
    add "Pep: error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: extra semi-colon? \n";
    print; clear; quit;
  }

  "!*!*" {
    push; push;
    add "Pep: error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: \n double negation '!!' is not implemented \n";
    add " and probably won't be, because what would be the point? \n";
    print; clear; quit;
  }

  "!*{*","!*;*" {
    push; push;
    add "Pep: error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: misplaced negation operator (!)? \n";
    add " The negation operator precedes tests, for example: \n";
    add "   !B'abc'{ ... } or !(eof),!'abc'{ ... } \n";
    print; clear; quit;
  }

  ",*command*" {
    push; push;
    add "error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: misplaced comma? \n";
    print; clear; quit;
  }

  "!*command*" {
    push; push;
    add "error near line "; ll;
    add " (at char "; cc; add ") \n"; 
    add " The negation operator (!) cannot precede a command \n";
    print; clear; quit;
  }

  ";*{*", "command*{*", "commandset*{*" {
    push; push;
    add "error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script: no test for brace block? \n";
    print; clear; quit;
  }

  "{*}*" {
    push; push;
    add "error near line "; ll;
    add " of script: empty braces {}. \n";
    print; clear; quit;
  }

  "B*class*","E*class*" {
    push; push;
    add "error near line "; ll;
    add " of script:\n  classes ([a-z], [:space:] etc). \n";
    add "  cannot use the 'begin' or 'end' modifiers (B/E) \n";
    print; clear; quit;
  }

  "}*command*" {
    push; push;
    add "error near line "; ll;
    add " of script: extra closing brace '}' ?. \n";
    print; clear; quit;
  }

  "comment*{*" {
    push; push;
    add "error near line "; ll;
    add " of script: comments cannot occur between \n";
    add " a test and a brace ({). \n";
    print; clear; quit;
  }

  #------------ 
  # the .restart command just jumps to the start: label 
  # (which is usually followed by a "read" command)
  # but '.' is also the AND concatenator, which seems ambiguous,
  # but the parsing works.
  ".*word*" {
    clear; ++; get; --;
    "restart" {
      clear; add "jump start";
      put; clear;
      add "command*";
      push; .reparse 
    }
    "reparse" {
      clear; add "jump parse";
      put; clear;
      add "command*";
      push; .reparse 
    }
    push; push;
    add "error near line "; ll;
    add " (char "; cc; add ")"; add " of script:  \n";
    add " misplaced dot '.' (use for AND logic or in .reparse/.restart) \n";
    print; clear; quit;
  }


  #-----------------------------------------
  # compiling comments so as to transfer them to the compiled 
  # file. 
  # implement these rules to conserve comments
  "comment*command*","command*comment*","commandset*comment*" {
    clear; get; add "\n"; ++; get; --; put; clear;
    add "command*"; push; .reparse
  }

  "comment*comment*" {
    clear; get; add "\n"; ++; get; --; put; clear;
    add "comment*"; push; .reparse
  }

  # -----------------------
  # negated tokens.
  #
  # This is a new more elegant way to negate a whole set of 
  # tests (tokens) where the negation logic is stored on the 
  # stack, not in the current tape cell. We just add "not" to 
  # the stack token.

  # eg: ![:alpha:] ![a-z] ![abcd] !"abc" !B"abc" !E"xyz"
  #  This format is used to indicate a negative test for 
  #  a brace block. eg: ![aeiou] { add "< not a vowel"; print; clear; }

  "!*quote*","!*class*","!*begintext*", "!*endtext*",
  "!*eof*","!*tapetest*" {
    # a simplification: just replace the token name with its
    # negative.
    replace "!*" "not"; push;
    # now get the token-value
    # added an extra ++ here.
    get; --; put; ++; clear;
    .reparse
  }

  #-----------------------------------------
  # format: E"text" or E'text'
  #  This format is used to indicate a "workspace-ends-with" text before
  #  a brace block.
  "E*quote*" {
     clear; add "endtext*";
     push;
     get; --; put; ++;
     clear; .reparse
  } 

  #-----------------------------------------
  # format: B"sometext" or B'sometext' 
  #   A 'B' preceding some quoted text is used to indicate a 
  #   'workspace-begins-with' test, before a brace block.
  "B*quote*" {
     clear; add "begintext*";
     push;
     get; --; put; ++;
     clear; .reparse
  } 

  #--------------------------------------------
  # ebnf: command := word, ';' ;
  # formats: "pop; push; clear; print; " etc
  # all commands need to end with a semi-colon except for 
  # .reparse and .restart
  #
  "word*;*" {
     clear;
     # check if command requires parameter
     get;
     "add", "until", "while", "whilenot", "mark", "go",
     "escape", "unescape", "delim", "replace" {
       put; clear; add "Pep: '"; get; add "'";
       add " << command needs an argument, on line "; ll; 
       add " of script.\n";
       print; clear; quit;
     }
     clear; add "command*";
     # no need to format tape cells because current cell contains word
     push; 
     .reparse
   }

  #-----------------------------------------
  # ebnf: commandset := command , command ;
  "command*command*", "commandset*command*" {
    clear;
    add "commandset*"; push;
    # format the tape attributes. Add the next command on a newline 
    --; get; add "\n"; 
    ++; get; --;
    put; ++; clear; 
    .reparse
  } 

  #-------------------
  # here we begin to parse "test*" and "ortestset*" and "andtestset*"
  # 

  #-------------------
  # eg: B"abc" {} or E"xyz" {}
  "begintext*{*","endtext*{*","quote*{*","class*{*",
  "eof*{*","tapetest*{*" {
    # set accumulator == 0
    zero; 
    B"begin" { clear; add "testbegins "; }
    B"end" { clear; add "testends "; }
    B"quote" { clear; add "testis "; }
    B"class" { clear; add "testclass "; }
    # clear the tapecell for testeof and testtape because
    # they take no arguments. 
    B"eof" { clear; put; add "testeof "; }
    B"tapetest" { clear; put; add "testtape "; }
    get; add  "\n";
    add "jumptrue 2 \n"; 
    # this extra jump has utility when we parse ortestsets and
    # andtestsets.
    add "jump block.end.";
    # the final jumpfalse + target will be added when
    # "test*{*commandset*}*" is parsed, or when
    # "ortestset*{*commandset*}*"
    # "andtestset*{*commandset*}*"
    put; a+; a+; a+; a+;
    clear; add "test*{*";
    push; push; .reparse
  }

  #-------------------
  # negated tests
  # eg: !B"xyz" {} 
  #     !E"xyz" {} 
  #     !"abc" {}
  #     ![a-z] {}
  "notbegintext*{*","notendtext*{*","notquote*{*","notclass*{*",
  "noteof*{*","nottapetest*{*" {
    # set accumulator == 0
    zero; 
    B"notbegin" { clear; add "testbegins "; }
    B"notend" { clear; add "testends "; }
    B"notquote" { clear; add "testis "; }
    B"notclass" { clear; add "testclass "; }
    # clear the tapecell for testeof and testtape because
    # they take no arguments. 
    B"noteof" { clear; put; add "testeof "; }
    B"nottapetest" { clear; put; add "testtape "; }
    get; add  "\n";
    add "jumpfalse 2 \n"; 
    # this extra jump has utility when we parse ortestsets and
    # andtestsets.
    add "jump block.end.";
    # the final jumpfalse + target will be added later
    # use the accumulator to store the incremented jump target
    put; a+; a+; a+; a+;
    clear; add "test*{*";
    push; push; .reparse
  }

  #-------------------
  # 3 tokens
  #-------------------

  pop;

  #-----------------------------
  # some 3 token errors!!!
 
  # there are many other of these errors but I am not going
  # to write them all.
  "{*begintext*;*","{*endtext*;*","{*class*;*" {
    push; push; push;
    add "error near line "; ll;
    add " (char "; cc; add ")"; 
    add " of script (misplaced semicolon?) \n";
    print; clear; quit;
  }  

  "{*quote*;*","commandset*quote*;*","command*quote*;*" {
    push; push; push;
    add "[error] near line "; ll; add " (char "; cc; add ")"; 
    add " of script (quoted text without a command?) \n";
    print; clear; quit;
  }  

  # to simplify subsequent tests, transmogrify a single command
  # to a commandset (multiple commands).
  "{*command*}*" {
    clear; add "{*commandset*}*"; push; push; push;
    .reparse
  }

  # rules:
  #   ',' ortestset  ::= ',' test '{'
  #   '.' andtestset ::= '.' test '{'
  # these trigger a transmogrification from a test token to an
  # ortestset or andtestset token.

  ",*test*{*" {
    clear; add ",*ortestset*{*"; push; push; push;
    .reparse
  }

  # trigger a transmogrification from "test" to "andtestset" by
  # looking backwards in the stack
  ".*test*{*" {
    # the jump counter is 1 too high for AND tests
    a-; clear; add ".*andtestset*{*"; push; push; push;
    .reparse
  }

  # errors! mixing AND and OR concatenation
  ",*andtestset*{*",
  ".*ortestset*{*" {
    # push the tokens back to make debugging easier
    push; push; push; 
    add " error: mixing AND (.) and OR (,) concatenation \n";
    add " in script near line "; ll;
    add " (character "; cc; add ") \n";
    print; clear; quit;
  }

  #--------------------------------------------
  # ebnf: command := keyword , quoted-text , ";" ;
  # format: add "text";

  "word*quote*;*" {
    clear; get;
    "replace" {
       # error 
       add " < command requires 2 parameters, not 1 \n";
       add "near line "; ll;
       add " of script. \n";
       print; clear; quit;
    }

    "add", "until", "while", "whilenot", "escape", "mark", "go",
    "unescape", "delim" {
       # todo: check here (or in error.pss) for multiline quoted
       # arguments to "mark", "go" etc, because they are not allowed.
       clear; add "command*";
       push;
       # a command plus argument, eg add "this" 
       --; get; 
       # allow multiline text in (only) the add command
       # we do this by turning a multiline "add" command into a 
       # sequence of single line "add" commands (because that is what
       # the assembler format allows). Actually, I could just write
       # replace "\n" "\\n"; which should work but would be much less
       # readable in the assembled file.
       "add" {
         add " "; ++; get;
         replace "\n" '\\n"\nadd "';
         --; put; ++;
         clear; .reparse
       }
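       # A sketch of what the "add" rule above is intended to produce,
       # assuming the quote attribute keeps its surrounding double
       # quotes. A script line such as
       #   add "one
       #   two";
       # should be assembled roughly as
       #   add "one\n"
       #   add "two"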
       # maybe it would be useful for the until command to 
       # allow multiline as well
       "until" { 
         add " "; ++; get;
         replace "\n" '\\n';
         --; put; ++;
         clear; .reparse
       }
       add " "; ++; get;
       --; put; ++;
       clear; .reparse
     }

     # error, superfluous argument
     add ": command does not take an argument \n";
     add "near line "; ll;
     add " of script. \n";
     print; 
     #state
     quit;
   }

   #----------------------------------
   # format: while [:alpha:] ;  or  whilenot [a-z] ;

   "word*class*;*" {
     clear; get;
     "while", "whilenot" {
        clear; add "command*";
        push;
        # a command plus argument, eg while [a-z] 
        --; get; add " "; ++;
        get; --;
        put; ++;
        clear;
        .reparse
     }

     # error 
     add " < command cannot have a class argument \n";
     add "line "; ll; add ": error in script \n";
     print; clear; quit;
   }


  # -------------------------------
  # 4 tokens
  # -------------------------------

  pop;

  #-------------------------------------
  # ebnf:     command := replace , quote , quote , ";" ;
  # example:  replace "and" "AND" ; 

  "word*quote*quote*;*" {
    clear; get;
    "replace" {
      clear; add "command*"; push;
      #---------------------------
      # a command plus 2 arguments, eg replace "this" "that"
      --; get; add " ";
      ++; get; add " ";
      ++; get; --;
      --; put; ++;
      clear;
      .reparse
    }
    add " << command does not take 2 quoted arguments. \n";
    add " on line "; ll; add " of script.\n";
    print; quit;
  }

  #-------------------------------------
  # format: begin { #* commands *# }
  # "begin" blocks are only executed once (they are assembled
  # before the "start:" label). They must come before all other
  # commands.

  # "begin*{*command*}*",
  "begin*{*commandset*}*" {
     clear; 
     ++; ++; get; --; --; put; clear;
     add "beginblock*";
     push; .reparse
   }

   # -------------
   # parses and compiles concatenated tests
   # eg: 'a',B'b',E'c',[def],[:space:],[g-k] { ...
   "begintext*,*ortestset*{*",
   "endtext*,*ortestset*{*",
   "quote*,*ortestset*{*",
   "class*,*ortestset*{*",
   "eof*,*ortestset*{*",
   "tapetest*,*ortestset*{*" {
     B"begin" { clear; add "testbegins "; }
     B"end" { clear; add "testends "; }
     B"quote" { clear; add "testis "; }
     B"class" { clear; add "testclass "; }
     # clear the tapecell for testeof and testtape because
     # they take no arguments. 
     B"eof" { clear; put; add "testeof "; }
     B"tapetest" { clear; put; add "testtape "; }
     get; add "\n";
     add "jumptrue "; count; add "\n";
     ++; ++; get; --; --; put; clear; 
     # this works as long as we don't mix AND and OR concatenations.
     # the token stays "ortestset*{*" (not "test*{*") so that mixing
     # can be detected by the error rules.
     add "ortestset*{*";
     push; push;
     a+; a+; .reparse
   }
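
   # A sketch of the assembled output for a two-test OR chain such as
   #   "a","b" { clear; }
   # assuming the accumulator holds 4 after the "test*{*" rule and that
   # "clear" assembles to "clear" (N stands for the character count
   # which is appended later with "cc"):
   #   testis "a"
   #   jumptrue 4
   #   testis "b"
   #   jumptrue 2 
   #   jump block.end.N
   #     clear
   #   block.end.N: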

   # A collection of negated tests.
   "notbegintext*,*ortestset*{*",
   "notendtext*,*ortestset*{*",
   "notquote*,*ortestset*{*",
   "notclass*,*ortestset*{*",
   "noteof*,*ortestset*{*",
   "nottapetest*,*ortestset*{*" {
     B"notbegin" { clear; add "testbegins "; }
     B"notend" { clear; add "testends "; }
     B"notquote" { clear; add "testis "; }
     B"notclass" { clear; add "testclass "; }
     B"noteof" { clear; put; add "testeof "; }
     B"nottapetest" { clear; put; add "testtape "; }
     get; add "\n";
     add "jumpfalse "; count; add "\n";
     ++; ++; get; --; --; put; clear; 
     # this works as long as we don't mix AND and OR concatenations 
     add "ortestset*{*";
     push; push;
     a+; a+; .reparse
   }

   # the OR rules above work as long as AND (.) and OR (,)
   # concatenations are not mixed in one test.

   # -------------
   # AND logic 
   # parses and compiles concatenated AND tests
   # eg: B'a' . E'c' . [a-z] { ...
   # it is possible to merge this block with the negated block
   # for compactness, but readability would probably suffer.
   "begintext*.*andtestset*{*",
   "endtext*.*andtestset*{*",
   "quote*.*andtestset*{*",
   "class*.*andtestset*{*",
   "eof*.*andtestset*{*",
   "tapetest*.*andtestset*{*" {
     B"begin" { clear; add "testbegins "; }
     B"end" { clear; add "testends "; }
     B"quote" { clear; add "testis "; }
     B"class" { clear; add "testclass "; }
     B"eof" { clear; put; add "testeof "; }
     B"tapetest" { clear; put; add "testtape "; }
     get; add "\n";
     add "jumpfalse "; count; add "\n";
     ++; ++; get; --; --; put; clear; 
     add "andtestset*{*";
     push; push;
     a+; a+; .reparse
   }

   # negated tests concatenated with AND logic (.). The 
   # negated tests can be chained with non-negated tests.
   # eg: B'http' . !E'.txt' { ... }
   "notbegintext*.*andtestset*{*",
   "notendtext*.*andtestset*{*",
   "notquote*.*andtestset*{*",
   "notclass*.*andtestset*{*",
   "noteof*.*andtestset*{*",
   "nottapetest*.*andtestset*{*" {
     B"notbegin" { clear; add "testbegins "; }
     B"notend" { clear; add "testends "; }
     B"notquote" { clear; add "testis "; }
     B"notclass" { clear; add "testclass "; }
     B"noteof" { clear; put; add "testeof "; }
     B"nottapetest" { clear; put; add "testtape "; }
     get; add "\n";
     add "jumptrue "; count; add "\n";
     ++; ++; get; --; --; put; clear; 
     add "andtestset*{*";
     push; push;
     a+; a+; .reparse
   }

  #-------------------------------------
  # we should not have to check for the {*command*}* pattern
  # because that has already been transformed to {*commandset*}*

  "test*{*commandset*}*",
  "andtestset*{*commandset*}*",
  "ortestset*{*commandset*}*" { 
     B"test*{*" {
       clear;
       # get rid of the unnecessary jump, but only for a single "test"
       # (not for ortestsets or andtestsets)
       get; 
       # for positive tests (eg [a-z] {...})
       replace "jumptrue 2 \njump" "jumpfalse"; put;
       # for negative tests (eg ![a-z] {...})
       replace "jumpfalse 2 \njump" "jumptrue"; put;
     }
     clear; 
     # indent the assembled code of the block for readability
     ++; ++; add "  "; get; replace "\n" "\n  "; put; --; --; 
     clear; get;
     # the final jump (to the closing brace) has already been
     # coded in the "test*{*" rule or the other rules.
     # we just need to add the label number with "cc"
     cc;
     add "\n";
     ++; ++; get;
     add "\nblock.end."; cc; add ":";
     --; --; put; clear;
     add "command*";
     push;
     # always reparse/compile
     .reparse
   }
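
   # A sketch of the assembled output of the rule above for a single
   # test such as
   #   "abc" { clear; }
   # assuming "clear" assembles to "clear" (N stands for the character
   # count appended with "cc"):
   #   testis "abc"
   #   jumpfalse block.end.N
   #     clear
   #   block.end.N: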

  # -------------
  # multi-token end-of-stream errors
  # not a comprehensive list of errors...
  (eof) {
    E"begintext*",E"endtext*",E"test*",E"ortestset*",E"andtestset*" {
      add "  Error near end of script at line "; ll;
      add ". Test with no brace block? \n";
      print; clear; quit;
    }

    E"quote*",E"class*",E"word*" {
      put; clear; 
      add "Error at end of script! (line "; ll; 
      add ") missing semicolon? \n";
      add "Parse stack: "; get; add "\n";
      print; clear; quit;
    }

    E"{*", E"}*", E";*", E",*", E".*", E"!*", E"B*", E"E*" {
      put; clear; 
      add "Error: misplaced terminal character at end of script! (line "; 
      ll; add "). \n";
      add "Parse stack: "; get; add "\n";
      print; clear; quit;
    }
  }

  # put the 4 (or fewer) tokens back on the stack
  push; push; push; push;

  (eof) {
    #add "end of script!! \n"
    print; clear;
    #---------------------
    # check if the script parsed correctly (there should be only one
    # token on the stack, namely "commandset*" or "command*", possibly
    # preceded by a "beginblock*" token)
    pop; pop;

    "commandset*",
    "command*" {
      push; --;
      add "# Assembled with the script 'compile.pss' \n";
      add "start:\n"; get;
      # an extra space because of a bug in compile()
      add "\njump start \n";
      # put a copy of the final compilation into the tapecell
      # so it can be inspected interactively.
      put; 
      # remove this print from asm.pp after generating a new asm.pp
      # with pep -f compile.pss compile.pss > asm.new.pp; cp asm.new.pp asm.pp
      print; # remove!
      # save the compiled script to 'sav.pp'
      write; clear; quit;
    }

    "beginblock*commandset*",
    "beginblock*command*" {
      clear; add "# Assembled with the script 'compile.pss' \n";
      get; add "\n"; ++; 
      add "start:\n"; get;
      # an extra space because of a bug in compile()
      add "\njump start \n";
      # put a copy of the final compilation into the tapecell
      # so it can be inspected interactively.
      put; 
      # remove this 'print' from asm.pp after generating a new asm.pp
      # with pep -f compile.pss compile.pss > asm.new.pp; cp asm.new.pp asm.pp
      print; # remove!
      # also save the compiled script to 'sav.pp'
      write; 
      clear; quit;
    }

    push; push;
    # state
    clear;
    add "After compiling with 'compile.pss' (at EOF): \n ";
    add "  parse error in input script, check syntax: \n ";
    add "  To debug script try the -I switch with \n ";
    add "   >> pep -If script -i 'some input' \n ";
    add "  or to debug the compilation process try: \n ";
    add "   >> pep -Ia asm.pp script \n ";
    print; 
    clear;
    # clear sav.pp because script could not be compiled
    write;
    # bail means exit with error
    bail;

  } # end of (eof) block

  # there is an implicit .restart command here (jump start)