&& The Parsing Virtual Machine and Script Language

OVERVIEW

This booklet is about the pattern-parsing virtual machine and script language. The executable is currently called 'pp' (August 2019) (in the folder /books/pars).

The virtual machine and language allow simple "LR" bottom-up shift-reduce parsers to be implemented in a limited script language with a syntax which is very similar to the "sed" unix stream editor. As far as I am aware, this is a new approach to parsing context-free languages, and judging by the scripts and tests written so far it has great potential. The script language is deliberately limited in a number of ways (it does not have regular expressions, for example), but it seems to be an interesting tool for learning about compiler techniques.

DOWNLOAD

A ".tar.gz" archive file of the c source code can be downloaded from the https://sourceforge.net/projects/bumble/ folder.

COMPILING SOURCE CODE

At some point I will package this using autotools and automake to allow simple installation. The main folder, source file and booklet are called pars/, gh.c and pars-book.txt

* obtain, extract and compile the "pp" source code
-------
  # extract the tar ball
  tar xvzf gh.tar.gz
  cd pars/object
  # compile the pp tool using the object files in the "object" folder.
  gcc -o ../pp \
    gh.c tapecell.c tape.c buffer.c colours.c \
    charclass.c command.c parameter.c instruction.c \
    labeltable.c program.c machine.c exitcode.c \
    machine.interp.c
  cd ..
  # create a symlink if you like into the bin folder.
  ln -s pp /usr/local/bin/pp
,,,

At the moment (Aug 2019) the pp executable looks for the file "asm.pp" in the same folder where it is run. This obviously needs to be changed so that pp looks for asm.pp in some standard folder. asm.pp is necessary to compile and run scripts.

* rebuild the libmachine.a library file
-----
  cd gh/object
  # rebuild all object files
  gcc -c *.c
  # not all of these files are really necessary.
  ar rcs libmachine.a buffer.o charclass.o colours.o command.o \
    exitcode.o instruction.o labeltable.o machine.o machine.interp.o \
    machine.methods.o parameter.o program.o tape.o tapecell.o
,,,,

The libmachine.a file can be used when compiling executables from c source code generated by the script "compilable.c.pss"

DOCUMENTATION

This file (pars-book.txt) will become the principal documentation for the pattern-parser machine and language. There is also a lot of documentation in the "gh.c" file (which implements the virtual machine) and some in the "asm.pp" assembler file, which converts scripts into assembler listings which can then be loaded by the machine into its program.

COMMAND LINE USAGE

I will create two separate tools: one for interactively running, debugging and viewing scripts, and another for interpreting scripts. But at the moment (August 2019) there is only one tool, called "pp" (source code gh.c).

* specify script and input on the command line
>> pp -e "read; print; print; clear;" -i "abcXYZ"
(This prints "aabbccXXYYZZ")

* load a script from a file and start an interactive session to view/debug
>> pp -If scriptfile somefile.txt

* load an "assembly" file into the machine's program and view/debug
>> pp -Ia asmfile somefile.pss

* convert a script to compilable c code
>> pp -f compile.ccode.pss script

ONE LINE EXAMPLES

* double space a text file
>> pp -e "r;'\n'{a'\n';}t;d;" someFile
>> pp -e "read; '\n' { add '\n'; } print; clear;" someFile ONE LINE SCRIPTS The pp tool may have useful applications in unix "one-liners" but its main power is in the implementation of simple context free languages and compilers. * convert all spaces to dots >> pp -e "read; [:space:] {d;a'.';t;d;}t;d; " somefile.txt >> pp -e "read; [:space:] {clear; add '.';print;clear;}print;clear; " \ somefile.txt * double every instance of vowels >> read; [aeiou] { put; get; } print;clear; * Only print text within double quotes: >> read; '"' { until '"'; print; } clear; * removes multiple consecutive instances of the character "a": >> read; "a" { while "a"; clear; } print;clear; * Number each line of the input: >> read; "\n" { ll; add " "; } print; clear; >> read; "\n" { ll; a " "; } t; d; # the same, with short commands * Count the number of lines in the input stream >> read; "\n" { a+; } clear; (eof) { add "lines: "; count; print; } * Delete leading whitespace (spaces, tabs) from end of each line: >> read; "\n" { while '- '; clear; newline; } print; * Delete trailing whitespace (spaces, tabs) from end of each line: >> r; [:space:]{ w[:space:];G;d;r;!"\n"{clear;get;} print; clear; * insert 5 blank spaces at beginning of each line (make page offset): >> r; "\n" { add ' '; } print; clear; * print the first ten lines of the input: >> r; "\n" { a+; clear;count; "10"{ quit;} clear;add "\n"; } print; clear; * Strip out c comments from a file: >> r; "/" { r; "/*"{ until '*/'; d; } "//" { until '\n'; d; } print;clear; * Delete trailing whitespace (spaces, tabs) from end of each line: ---------- # not tested, read; [:space:] { while [:space:]; put; clear; read; !"\n" { clear; get; add "\n"; } print; clear; } ,,, * insert 5 blank spaces at beginning of each line (make page offset): >> "\n" { add " "; } print; clear; * print the first ten lines of the input: >> read; "\n" { a+; d; count; "10" {quit;} d; add "\n"; } print; d; * Strip out c comments from a file: >> read; "/" { read; "/*"{ until '*/'; clear; }} print;clear; - ASSEMBLY FILES The implementation of the language uses an intermediary "assembly" phase when loading scripts. asm.pp is responsible for converting the script into an assembly (text) format. asm.pp is itself an assembly file. These files consist of "instructions" on the virtual machine, along with "parameters", jumps, tests and labels, which make writing assembly files much easier (instead of having to use line numbers). So the asm.pp file actually implements the script language The proof is in the pudding: this shows that the parse-machine is capable of implementing "languages" (or at least simple ones). It should not normally be necessary to write any assembler code, since the script language is much more readable. However, it is useful for debugging scripts to view the assembly listing as it is loaded into the machine (the -I switch allows this). PARSING WITH THE SCRIPT LANGUAGE A simple example: * parse a sentence in a very simple natural language -------------- "article*noun*" { clear; add "nounphrase*"; push; } "nounphrase*verb*object*" { clear; add "sentence*"; push; } ,,, The example above corresponds to a backus-naur grammar rules: nounphrase <- article noun sentence <- noun verb object If we only want to write a "recogniser" for a particular context-free language (that is: a script that determines if a particular input is a legal element of that language) then the previous script style is sufficient. But usual, we will be trying to transform an input into another format. 
TRANSFORMATIONS AND COMPILATIONS

The parsing virtual machine and language is designed to transform one textual (data) format into another, or to compile one context-free language into another. Examples might be:

* transform a csv (comma-separated-values) file into a json data format.
* convert from "infix" arithmetic notation to "postfix" notation.
* compile a simple computer language into an assembly language.
* properly indent a computer language source code file.

DEBUGGING

The parse virtual machine is more complicated than the sed virtual machine (sed really only has 2 string registers, the "work-space" and the "hold-space"), so you may find yourself needing to debug a script.

* load a script and view/execute/step through it interactively
>> pp -If someScript input.txt

* interactively view how a script is being compiled by "asm.pp"
>> pp -Ia asm.pp someScript

(Now you can step through the compiled program "asm.pp" and watch as it parses and compiles "someScript". Generally, use "rr" to run the whole script, and "rrw text" to run the script until the workspace is some particular text. This helps to narrow down where the asm.pp compiler is not parsing the input script correctly.)

Once in an interactive session, there are many commands to run and debug a script. For example:

 n    - execute the next instruction in the program
 m    - view the state of the machine (stack/workspace/registers/tape/program)
 rrw  - run the script until the workspace is exactly some text
 rre  - run the script until the workspace ends with some text
 rr   - run the whole script from the current instruction
 M.r  - reset the virtual machine and input stream (but not the compiled program)

COMMON BUGS
....

* make sure that you are pushing as many times as there are tokens
>> add "noun*verb*noun*"; push; push;   # << error, 3 tokens, 2 pushes

* make sure there is at least one read command in the script
>> "."{ clear; } print; clear;          # << error: no read in the script

SCRIPT EXAMPLES

* print alphanumeric words in a file, one per line
----------------------
 read;
 [:alpha:] { while [:alpha:]; add "\n"; print; clear; a+; }
 clear;
,,,

* print the number of alphabetical words in a file
---------------------
 read;
 [:alpha:] { while [:alpha:]; clear; a+; }
 clear;
 (eof) { add "words in file: "; count; add "\n"; print; clear; }
,,,

* convert c comments (// and /* ... */) to bash comments
-----------
 read;
 "/" {
   read;
   "//" { clear; add "# "; until "\n"; print; clear; }
   "/*" {
     clear; add "# "; until "*/"; clip; clip;
     replace "\n" "\n#"; print; clear;
   }
 }
 print; clear;
,,,

MACHINE DESCRIPTION

The parsing virtual machine consists of a number of parts, which I will describe here.

WORKSPACE BUFFER
....

The workspace buffer is the heart of the virtual machine and is analogous to an "accumulator" register in a cpu chip. All incoming and outgoing data is processed through the workspace. All machine instructions either affect or are affected by the state of the workspace.

STACK
....

* machine instructions relating to the stack
>> push, pop

The stack (which is implemented as a single string buffer and is actually the prefix of the workspace buffer) is used to store and access the parse tokens that are constructed by the virtual machine while parsing the input.

** parse tokens

Within the virtual machine of the parse script language, the "stack" structure is designed to hold the "parse tokens". A parse token can be any string ended by the delimiter character, eg:

>> add "set*";
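As a small illustration of how tokens move between the workspace and the stack, the fragment below is a sketch (not tested) and assumes the default "*" token delimiter:

------
 # two parse tokens in the workspace
 add "noun*verb*";
 # two tokens, so two pushes (see COMMON BUGS above)
 push; push;
 # pop prepends: the workspace becomes "verb*", then "noun*verb*"
 pop; pop;
,,,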
DELIMITER REGISTER
....

The delimiter register determines what character will be used to delimit parse tokens on the stack when using the "push" and "pop" commands. It can be set with the "delim" command.

* set the parse token delimiter to '/'
>> begin { delim '/'; }

The delimiter should be set in a 'begin' block. I normally use the "*" character, but "/" or ";" might be good options.

FLAG REGISTER
....

* machine instructions relating to the flag register
>> testis, testbegins, testends, testtape, testeof
>> jumptrue, jumpfalse

The flag register affects the operation of the conditional jump instructions "jumpfalse" and "jumptrue", and it is affected by the test instructions such as "testis", "testtape", "testeof" etc. It is analogous to a "flags" register in a cpu, but (currently) it only contains one boolean (true/false) value.

ACCUMULATOR REGISTER
....

The machine contains one integer accumulator register that can be incremented (with "a+"), decremented (with "a-") and set to zero (with "zero"). This register is useful for counting occurrences of miscellaneous elements that occur during parsing and translating. An example of the use of the accumulator register is given during the parsing of "quotesets" in the "compile.pss" script; the accumulator in that case keeps track of the target for true-jumps.

* count how many "x" characters occur in the input stream
>> pp -e "r; 'x' {a+;} print;d; (eof) { add '='; count; print; }" -i "axax"

PEEP REGISTER
....

The peep buffer is a single-character buffer which stores the next character in the input stream. When a 'read' command is performed, the current value of the peep buffer is appended to the 'workspace' buffer and the next character from the input stream is placed into the peep buffer.

TAPE
....

* machine instructions with an effect on the tape:
>> get, put, ++, --, pop, push

The tape is an array of string cells (with memory allocated dynamically), each of which can be read or written using the workspace buffer. The tape also includes a tape cell pointer; the cell it points to is usually called the "current cell".

CURRENT CELL OF TAPE ARRAY
....

* instructions affecting the current cell of the tape
>> ++, --, push, pop

The current cell of the tape is a very important mechanism for storing attributes of parse tokens, and then manipulating those attributes. The parse virtual machine has the ability to "compile" one text format into another. (I call this compilation because the target text format may be an assembly language - for either a real or a virtual machine.)

In the parsing phase of a script, the attributes of different parse tokens are accessed from the tape and combined and manipulated in the workspace buffer. This means that a script which is transforming some input stream will finish with the entire transformed input in the 1st cell of the tape structure. The script can then print that cell to stdout (with "get; print;", which allows further processing by other tools in the pipe chain) or else write the contents of the cell to the file 'sav.pp' (with "get; write;").

The current cell is also affected by the stack commands "pop" and "push". The push command increments the tape pointer (current cell) by one, and the pop command decrements the tape pointer by one. This is a subtle and powerful mechanism that allows the tape pointer to stay in sync with the stack. After a "push" or "pop" command the tape pointer will be pointing at the correct tape cell for that item on the stack.
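The following sketch (not tested) shows the put-before-push pattern that this mechanism supports: the text of a number is stored in the current tape cell, a token is pushed, and because a later pop moves the tape pointer back down, a get then retrieves the stored digits. The [:digit:] class and the "number*" token name are only illustrative choices.

------
 read;
 [:digit:] {
   # accumulate the digits in the workspace
   while [:digit:];
   # store the digits in the current tape cell
   put;
   clear; add "number*";
   # push the token; the tape pointer moves up in step
   push;
 }
 clear;
 (eof) {
   # pop moves the tape pointer back down to the token's cell ...
   pop;
   # ... so get retrieves the stored digits
   "number*" { clear; get; add "\n"; print; }
 }
,,,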
In some cases, it may be easiest to see how these mechanisms work by running the machine in interactive mode (pp -If script input) and stepping through a script or executing commands at the prompt.

The tape pointer is also incremented and decremented by the ++ and -- commands. These commands are mainly used during the compilation phase to access and combine attributes, in order to transform the input into the desired output.

SYNTAX OF THE SCRIPT LANGUAGE

The script language, which is implemented in the file books/pars/asm.pp, looks somewhat similar to sed and awk. Unlike sed, it also allows long names for commands (eg "clear" instead of "d", "add" instead of "a"). Each command has a long and a short form.

LANGUAGE FEATURES
....

The script language is implemented in the file asm.pp. Some commands, such as "reparse" and "restart" (to be implemented), are not instructions on the virtual machine; they are commands that are parsed in asm.pp.

* tests on the workspace buffer, followed by a block of commands
>> [a-z] { print; clear; }

Most scripts will start with "read;" or "r;" (the abbreviated equivalent). This reads one character from the input stream. Whereas sed and awk are line oriented, pp is character oriented.

Scripts also have an implicit loop. When the interpreter reaches the end of the script, it jumps back to the first command (usually "read") and continues looping until the input stream is finished. This is similar to the behaviour of sed.

CHARACTER CLASSES
....

Character classes are written [:space:], [:alnum:], etc. Currently (August 2019) the pp machine and language uses plain ctype.h character classes as a way of grouping characters. These classes are important in a unicode setting because they allow specifying types of characters in a locale-neutral way. The "while" command uses character classes, as do tests. I may allow the script language to define new character classes.

* check if the workspace is only alphanumeric characters
>> [:alnum:] { nop; }

* read the input stream while the peep register is whitespace
>> while [:space:];

** Not regular expressions

QUOTES
....

Both single and double quotes may be used in scripts.

>> '"' { add "<< single quote!\n"; print; }

If using the negation operator ! in an "in-line" script, then enclose the whole thing in single quotes.

* use single quotes in one-liners to avoid special character problems
>> pp -f compile.pss -i '![a-z],"a","b"{nop;}'

The until command already understands escaped characters, so it will read past an escaped delimiter (eg \").

COMMENTS
....

Both single-line comments (a line starting with "#") and multiline comments (hash+asterisk .... asterisk+hash) are available. Multiline comments are useful for disabling blocks of code during script development.

BEGIN BLOCKS
....

Like awk, the script language can have a 'begin' block. These blocks are executed only once, whereas the rest of the script is executed in a loop for every character in the input stream.

* an example begin block and script
-----
 begin {
   add "starting script ...\n"; print; clear;
   # set the token delimiter to /
   delim "/";
 }
 read; print; clear;
,,,

CONDITIONS AND TESTS
....

CLASS TESTS
....

The class test checks whether the @workspace buffer matches any one of the characters or character classes listed between the square brackets. A class test is written

>> [characters] { ... }
For example, the following script will delete all vowels from the input text

>> [aeiou] { clear; } ![aeiou] { print; clear; }

The script above also demonstrates that class tests can be negated with a prefixed "!" character.

If the workspace is exactly any one of the characters contained within the square brackets, then the test returns true and the instructions contained within the curly braces are executed; if not, the block is skipped.

EOF END OF STREAM TEST
....

* formats possible
>> (eof) (EOF)

This test returns true if the 'peep' look-ahead register currently contains the end-of-stream marker for the input stream. The test is written

>> (eof) { ... }

This test can be combined with other tests, either with AND logic (.) or with OR logic (,).

* test if the workspace is "horse" when the end of stream is reached
>> (eof)."horse" { add ' and cart'; print; }

* test if the workspace is "horse" OR the end of stream has been reached
>> (eof),"horse" { ... }

This test is important for checking whether the script has successfully parsed the input stream when the end of the stream is reached.

TAPE TEST
....

>> (==) { ... }

BEGINS WITH TEST
....

 workspace begins with:            B"abc" { ... }
 workspace ends with:              E"abc" { ... }
 all chars in workspace in class:  [:space:] { ... }

** Concatenating tests

Conditional tests can be chained together with OR (,) or AND (.) logic.

* test if the workspace starts with "ba" OR "ab"
>> B"ab",B"ba" { ... }

* test if the workspace starts with 'a' AND ends with 'z'
>> B"a".E"z" { }

* check if the workspace is a url but not a text file
>> B"http://" . !E".txt" { add "<< url"; print; clear; }

** Negated tests

 !B"abc", !E"xyz", ![a-z], !"abc" ...

MATERIAL FROM THE "PARSE-BOOK.TXT"

A book about the parse system. This material was written for an earlier version of the parse virtual machine (as of August 2019).

COMMANDS

* all commands have single-letter variants, eg: p (pop), P (push), etc.

COMMAND SUMMARY
....

All commands have an abbreviated (one letter) form as well.

* ++ increments the tape pointer by one (see 'increment')
* -- decrements the tape pointer by one (see 'decrement')
* mark "text" adds a text tag to the current tape cell
* go "text" sets the current tape cell to the marked cell
* add "text" / add 'text' adds text to the end of the workspace
* check jumps back to the @@@ label. This is used to ensure that all shift reductions take place.
* clip removes the last character from the workspace
* count appends the counter to the end of the workspace
* crash exits the script without reading anything more from standard input; equivalent to the sed command 'q'
* clear sets the workspace to a zero-length string. Equivalent to the sed command s/^.*$//;
* put puts the contents of the workspace into the current cell of the tape (as indicated by the @tape-pointer)
* get gets the current cell of the tape and adds it to the -end- of the workspace with -no- separator character
* swap swaps the contents of the current tape cell and the workspace buffer
* a+ increments the accumulator variable/register in the virtual machine by one
* a- decrements the accumulator variable by one
* ll appends the line number register to the workspace buffer
* cc appends the character number register to the workspace buffer
* print prints the contents of the workspace buffer to the standard output stream (stdout). Equivalent to the sed command 'p'
* pop pops one token from the stack and adds it to the -beginning- of the workspace
* push pushes one token from the workspace onto the stack; it reads up to the first star "*" character in the @workspace buffer and pushes that section of the buffer onto the stack
* unstack pops the entire stack as a prefix onto the workspace buffer
* stack pushes the entire workspace (regardless of any token delimiters in the workspace) onto the stack
* replace replaces a string in the workspace with another string
* read reads one more character from stdin
* state prints the current state of the virtual machine to the standard output stream; may be useful for debugging. I may remove this command.
* until 'text' reads characters from stdin until the workspace ends with the given text (and, if a second text is given, does not end with that second text). This may be used for capturing quoted strings etc.
* while class/text; reads characters from the input stream while the @peep character -is- the given character (or is in the given character class)
* whilenot class/text; reads characters from the input stream while the 'peep' register is -not- the given character
* zero; sets the counter to zero

ADD
....

The "add" command adds a piece of text to the -end- of the 'workspace' buffer. No other register or buffer within the virtual machine is affected. The command is written

>> add 'text';
or
>> add "text";

The argument may span more than one line, so the following is also a valid script statement.

 add 'one line
 another line';

* add a space after every character of the input
>> add ' '; print; clear;
>> a ' ';t;d;

 /style*type*/ { clear; add "command*"; push; }

The script above does a shift-reduce operation while parsing some hypothetical language. The "add" command is used to add a new token name to the 'workspace' buffer, which is then pushed onto the stack (using the 'push' operation, naturally). In the script above, the added text can be seen to be a token name (as opposed to some other arbitrary piece of text) because it ends in the "*" character. Actually any character could be used to delimit the token names, but "*" is the default.

The add command takes -one- and only one parameter, or argument.

REPARSE COMMAND
....

This command makes the interpreter jump to the parse> label. The label must occur before the check command, not after. The reparse command takes no arguments; the label itself is written parse>

** Why use this command?

The reason for using this command lies in the need to apply multiple "shift reduce" operations during one cycle of a script.

CLIP COMMAND
....

This command removes one character from the end of the 'workspace' buffer and sends it into the void; it deletes it. The character is removed from the -end- of the workspace, so it is the last character which would have been added to the workspace by a previous 'read' operation.

The command is useful, for example, when parsing quoted text, used in conjunction with the 'until' command. The following script prints only the text which is contained within double quote characters in the input.

* parse and print quoted text
>> '"' { clear; until '"'; clip; print; }
>> "\"" { clear; until "\""; clip; print; }

If the above script receives the text 'this "big" and "small" things' as its input, then the output will be 'big small'. That is, only that which is within double quotes will be printed.
The script above can be translated into plain english as follows

 a- if the workspace is a " character then
 - clear the workspace
 - read the input stream until the workspace -ends- in "
 - remove the last character from the workspace (the ")
 - print the workspace to the console
 - end if

The script prints the contents of the quoted text without the quote characters, because the 'clear' and 'clip' commands got rid of them.

CLOP COMMAND
....

The "clop" command removes one character from the front of the workspace buffer. The clop command is the counterpart of the 'clip' command.

>> clop;

COUNT COMMAND
....

The count command adds the value of the counter variable to the end of the workspace buffer. For example, if the counter variable is 12 and the workspace contains the text 'line:', then after the count command the workspace will contain the text

>> line:12

The count command only affects the 'workspace' buffer in the virtual machine.

CRASH COMMAND
....

This command immediately exits the script without processing any more script commands or input stream characters. This command is similar to the sed command 'q'.

DECREMENT COMMAND
....

The decrement command reduces the tape pointer by one, and thereby moves the current 'tape' cell one to the 'left'. The decrement command is written "--" or "<". If the current tape cell is the first cell of the tape, then the tape pointer is not changed and the command has no effect.

** See also 'increment', 'put', 'get'

INCREMENT COMMAND
....

This command is simple but important. The command adds -one- to the 'tape'-pointer. The command is written

>> ++ or >

Notice that the only part of the machine state which changes is that the 'tape-pointer' (the arrow in the 'tape' structure) is incremented by one cell. This command allows the tape to be accessed as a 'tape'; this is tautological but true. Being able to increment and 'decrement' the tape pointer allows the script writer and the virtual machine to access any value on the tape using the 'get' and 'put' commands.

It should be remembered that the 'push' command also automatically increments the tape pointer, in order to keep the tape pointer and the stack in synchronization. Since the get command is the only way to retrieve values from the tape, the ++; and --; commands are necessary; the tape cannot be accessed using some kind of array index, because the get and 'put' commands do not take any arguments.

For example, the following script will print the string associated with the "value" token.

>> pop;pop; "field*value*" { ++; get; --; print; }

MARK COMMAND
....

The mark command adds a text "tag" to the current tape cell. This allows the tape cell to be accessed later in the script with the "go" command. The mark and go commands should allow "offside" or indent parsing (such as for the python language).

* use mark and go to use the 1st tape cell as a buffer
------
 begin { mark "topcell"; ++; }
 read;
 [:space:] { d; }
 whilenot [:space:]; put;
 # create a list of urls in the 1st tape cell
 B"http:" { mark "here"; go "topcell"; add " "; get; put; go "here"; }
 clear; add "word*"; push;
 { go "topcell"; get; print; quit; }
,,,,

GO COMMAND
....

The "go" command sets the current tape cell pointer to a cell which has previously been marked with a "mark" command (using a text tag). The go/mark commands may be useful for offside parsing, as well as for assembling, for example, a table of contents for a document while parsing the document's structure.

* basic usage
>> read; mark "a"; ++; go "a"; put; clear;
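As a further sketch of that table-of-contents idea (not tested), the script below collects every line that begins with a "#" character into a marked cell and prints the collection at the end of the input; like the url example above, the most recently found line ends up first. The choice of "#" as a heading marker is only for illustration.

------
 begin { mark "toc"; ++; }
 read; until "\n";
 # prepend any line starting with '#' to the marked "toc" cell
 B"#" { mark "here"; go "toc"; get; put; clear; go "here"; }
 clear;
 # at the end of the input, print the collected headings
 (eof) { go "toc"; get; print; quit; }
,,,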
MINUS COUNTER COMMAND
....

The minus command decreases the counter variable by one. This command takes no arguments and is written

>> a-;

The 'testeof' test (eof) checks to see whether the peep buffer contains the end-of-input-stream marker. The 'while' command reads from the input stream while the peep buffer is or is not some set of characters. For example

>> while 'abc';

reads the input stream while the peep buffer is any one of the characters 'abc'.

PLUS COUNTER COMMAND
....

This command increments the machine counter variable by one. It is written

>> a+

The plus command takes no arguments. Its counterpart is the 'minus' command.

POP COMMAND
....

The pop command pops the last item from the virtual machine stack and places its contents, without modification, at the -beginning- of the workspace buffer.

* pop the stack into the workspace
>> pop;

* check if a grammar rule is applicable
>> pop;pop; "word*colon*" { clear; add 'command*'; push; }

The pop command affects the 'workspace' buffer and the 'stack' structure, and decrements the 'tape-pointer' variable. The pop command is usually employed in the parsing phase of the script (not the lexing phase); that is, after the "parse>" label.

If the stream parser language is being used to parse and translate a language, then the script writer needs to ensure that each token ends with a delimiter character (by default "*") in order for the push and pop commands to work correctly.

The pop command also affects the "tape" of the virtual machine, in that the tape pointer is automatically decremented by one for each pop command. This is convenient because it means that the tape pointer will be pointing to the corresponding cell for the token. In other words, in the context of parsing and compiling a "formal language", the tape will be pointing to the "value" or "attribute" of the token which is currently in the workspace.

PRINT COMMAND
....

The print command prints the contents of the workspace to the standard output stream (the console, normally). This is analogous to the sed 'p' command (but its abbreviated form is 't', not 'p', because 'p' means "pop").

** Examples

* print each character in the input stream twice:
>> print; print; clear;

* replace all 'a' chars with 'A's:
>> "a" { clear; add "A"; } print; clear;

* only print text within double quotes:
>> '"' { until '"'; print; } clear;

** Details

The print command does not take any arguments or parameters. The print command is basically the way in which the parse language communicates with the outside world and the way in which it generates an output stream. The print command does not change the state of the parse virtual machine in any way.

Unlike sed, the parse language does not do any "default" printing. That is, if the print command is not explicitly specified, the script will not print anything and will silently exit as if it had no purpose. This should be compared with the default behaviour of sed, which will print each line of the input stream unless the script writer specifies otherwise (using the -n switch).

In practice the print command is not used as extensively as in sed. This is because, if an input stream is successfully parsed by a given script, then only -one- print statement will be necessary, at the end of the input stream. However, the print command can also be used to output progress or error messages and other things.
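For example, an error message of that kind can make use of the 'll' line number register. The one-liner below is only a sketch (not tested), and it assumes a script that expects nothing but alphanumeric characters and whitespace in its input.

>> ![:alnum:] . ![:space:] { clear; add "line "; ll; add ": unexpected character\n"; print; crash; }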
GET COMMAND
....

This command obtains the value in the current 'tape' cell and adds it (appends it) to the end of the 'workspace' buffer. The "get" command only affects the 'workspace' buffer of the virtual machine.

* if the workspace has the "noun" token, get its value and print it
>> "noun*" { clear; get; print; clear; }

PUT COMMAND
....

The put command places the entire contents of the workspace into the current tape cell, overwriting any previous value which that cell might have had. The command is written

>> put;

* put the text "one" into the current tape cell and the next one
>> clear; add "one"; put; ++; put;

The put command only affects the current tape cell of the virtual machine. After a put command the workspace buffer is -unchanged-. This contrasts with the machine stack 'push' command, which pushes a certain amount of text (one token) from the workspace onto the stack and deletes the same amount of text from the workspace buffer.

The put command is the inverse of the 'get' command, which retrieves or gets the contents of the current cell of the tape data structure. Since the tape is generally designed for storing the values or attributes of parse tokens, the put command is essentially designed to store the values of attributes. The put command can be used in conjunction with the 'increment' ++ and 'decrement' -- commands to store values in tape cells other than the current one.

READ COMMAND
....

The read command reads one character from the input stream and places that character in the 'peep' buffer. The character which -was- in the peep buffer is added to the -end- of the 'workspace' buffer.

The read command is the fundamental mechanism by which the input stream is "tokenized" (also known as "lexical analysis"). The commands which also perform tokenization are "until", "while" and "whilenot"; these commands perform implicit read operations.

An implicit read operation may also be performed at the beginning of each "cycle" of the script interpreter. This depends upon the implementation, and is an issue which I haven't resolved yet. If an implicit read is not performed, then each script will need to have an explicit read command at the beginning of the script.

REPLACE COMMAND
....

This command replaces one piece of text with another in the workspace. The replace command is useful for indenting blocks of text during formatting operations. The replace command only replaces plain text, not regular expressions.

* replace the letter 'a' with 'A' in the workspace buffer
>> replace "a" "A";

* indent a block of text
>> clear; get; replace "\n" "\n "; put; clear;

UNTIL COMMAND
....

* example
>> until 'text';

Reads the input stream until the workspace ends with the given text.

* print any text that occurs between '<' and '>' characters
>> /</ { until ">"; print; } clear;

* print only text between quote characters (excluding the quotes)
>> r; /"/ { until '"'; clip; clop; print; } clear;

* create a parse token 'quoted' from quoted text
>> r; /"/ { until '"'; clip; clop; put; add 'quoted*'; push; } clear;

* print quoted text, reading past escaped quotes (\")
>> /"/ { until '"'; print; } clear;

The 'while' and 'whilenot' commands are similar to the until command, but they depend on the value of the 'peep' virtual machine buffer (which is a single-character buffer) rather than on the contents of the 'workspace' buffer, as the until command does.

** notes

The 'until' command will usually form part of the 'lexing' phase of a script. That is, the until command permits the script to turn text patterns into 'tokens'.
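For instance, the sketch below (not tested) uses until to lex a c block comment into a single hypothetical 'comment*' token, storing the comment text on the tape as the token's value, in the same style as the bash-comment example earlier. It assumes the surrounding script reads a character at the start of each cycle and clears the workspace afterwards.

>> "/" { read; "/*" { clear; until "*/"; clip; clip; put; clear; add "comment*"; push; } }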
While in traditional parsing tools (such as Lex and Yacc) the lexing and parsing phases are carried out by separate tools, with the 'pp' tool the two functions are combined. The until command essentially performs multiple read operations, and after each read it checks to see whether the workspace meets the criteria specified in the argument.

** proposed implementation changes

A variant of the until command may be implemented, without any argument. That is,

>> untiltape;

will read the input stream until the workspace ends with the same text as is contained in the current tape cell. This form of the 'until' command would be useful for parsing, for example, a sed substitution command such as 's@this@that@g' or 's/this/that/g', where the delimiter character is whatever character follows the 's' character.

WHILENOT COMMAND
....

This is a command that was dreamed up in a rash moment. It is yet to be seen if it really is necessary. It reads characters from the input stream into the workspace -while- the 'peep' buffer is -not- a certain character. This is a "tokenizing" command and allows the input stream to be parsed up to a certain character without reading that character.

* an example
----
 /"/ { clear; whilenot '"'; print; }
,,,,

The above script tokenizes a quoted text section and prints the contents to the standard output.

WHILE COMMAND
....

The 'while' command in the pattern-parse language reads the input stream while the 'peep' buffer is any one of the characters or character classes mentioned in the argument. The command is written

>> while [cdef];

The command takes one argument. This argument may include character classes as well as literal characters. For example,

>> while [:space:];

reads the input stream while the peep buffer is a whitespace character. The read characters are appended to the 'workspace' buffer.

NEGATION

Negation for the while command is currently supported using the "whilenot" command.

** Escaping characters

ENDSWITH TEST
....

The 'ends with' test checks whether the workspace ends with a given string. This test is written

>> E"xyz" { ... }

The script language contains a structure to perform a test based on the content of the workspace, and to execute commands depending on the result of that test. An example of the syntax is

-->
 /ocean/ { add " blue"; print; }
--<

In the script above, if the @workspace buffer is the text "ocean" then the commands within the braces are executed; if not, then not. The test structure is a simple string equivalence test; there are -no- regular expressions, and the workspace buffer must be -exactly- the text which is written between the // characters, or else the test will fail (return false) and the commands within the braces will not be executed.

This test structure is clearly influenced by the "sed" http://sed.sf.net stream editor, which has a virtually identical syntax except for some key elements: in sed, regular expressions are supported, and the first opening brace must be on the same line as the test structure.

There is also another test structure in the script language which checks to see whether the workspace buffer -begins- with the given text; the syntax looks like this

:> <ocean> { add ' blue'; print; }

This syntax is visually not very pleasant, and its design goes against the simplicity which I was attempting to obtain in writing the parse script language.
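Returning to the newer E"" syntax for the 'ends with' test, here is a small usage sketch (not tested): print only the input lines that end with a colon character.

>> read; until "\n"; clip; E":" { add "\n"; print; } clear;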
= the List test in the chomski language =

The list test may be written as

:> :nouns.txt: { ... }

The list test determines whether the workspace matches any one of a list of strings which are contained in a text file. In the example given above, the name of the text file is "nouns.txt". If the workspace was "dog" at the time that the test was executed, and the string "dog" was contained in the text file "nouns.txt", then the test would return -true- and the instructions within the braces would be executed. However, if the word "dog" was not contained in the text file, then the test would return false, and the instructions within the braces would -not- be executed. This test may be useful in a linguistic context, in order to determine whether a particular string is a noun, verb, or other part of speech.

&& Using tests in the pp tool

TEST EXAMPLES
....

* if the workspace is not empty, add a dot to the end of the workspace
>> !"" { add '.'; }

* if the end of the input stream is reached, print the message "end of file"
>> <> { add "end of file"; print; }

* if the workspace begins with 't', trim a character from the end
>> B"t" { clip; }

TAPE TEST
....

The tape test determines whether the current cell of the @tape structure is the same as the @workspace buffer. The tape test is written

:> ==

This test was originally included in order to parse the sed structure

>> s@old@new@g or s/old/new/g or s%old%new%g

In other words, in sed, any character can be used to delimit the text in a substitute command.

STACK STRUCTURE IN THE VIRTUAL MACHINE

The @virtual-machine of the chomski language contains a stack structure which is primarily intended to hold the parse tokens during the parsing and transformation of a text pattern or language. However, the stack can hold any other string data. Each element of the stack structure is a string buffer of unlimited size. The stack is manipulated using the pop and push commands. When a value is popped off the stack, that value is appended to the -front- of the workspace buffer. If the stack is empty, then the pop command has no effect.

&& The tape in the chomski machine

The tape structure in the @virtual-machine is an infinite array of elements. Each of these elements is a string buffer of infinite size. The elements of the tape structure may be accessed using the @increment, @decrement, @get and @put commands.

** The tape and the stack

The tape structure and the @stack structure in the virtual machine are designed to be used in tandem, and several mechanisms have been provided to enable this. For example, when a @pop operation is performed, the @tape-pointer is automatically decremented, and when a @push operation is performed, the tape pointer is automatically incremented. Since the chomski language and machine have been designed to carry out parsing and transformation operations on text streams, the tape and stack are intended to hold the values and tokens of the parsing process.

** the virtual machine

The "virtual machine" [->wikipedia] for the bumble language looks like this

-machine-
 -stack:...,...,...,...
 -tape:...,...,...,...,...,
 -workspace:
 -peep:
 -counter:

The machine has six main elements

D- a @stack: which can contain "parse tokens" [->wikipedia] if the language is used for parsing, or any other text data.
- a @tape: which is an array of text data which is synchronized with the machine stack using a tape pointer. The tape is manipulated with the @get and @put commands.
- a @tape-pointer : This variable determines the current tape cell which will be used by the @get and @put commands.
The tape pointer is incremented with the ++ command and decremented with the -- command.
- a @workspace buffer: This is where all "text change" operations within the machine are carried out. It is similar in concept to a register within a "cpu" [->wikipedia] or to the "Sed" [->wikipedia] stream editor pattern space. The workspace is affected by various commands, such as @clear, @add, @indent, @get, @push, @pop etc.
- a @peep character: this character is not directly manipulable, but it constitutes a very simple "look ahead" mechanism and is used by the @while and @whilenot commands.
- a @counter : This counter or "accumulator" is an integer variable which can be incremented with the command @plus, decremented with the command @minus, and set to zero with the command @zero.

** Language commands and the machine

Most of the commands in the script language affect the virtual machine. For example, the -pop- command takes the last element off the stack and appends it to the -front- of the workspace. We can see this by looking at the virtual machine before and after a pop command.

-machine-
 -stack: noun* , verb* , noun*
 -tape: cat , hunts , bird , --> ,...,
 -workspace: noun*
 -peep: t

After the @pop command the machine will look like this

-machine-
 -stack: noun* , verb*
 -tape: cat , hunts , --> bird ,...,...,
 -workspace: noun*noun*
 -peep: t

The stack is one element shorter and the workspace has changed.

&& The workspace buffer in the virtual machine

The workspace buffer is a buffer within the virtual machine of the stream parsing language. All of the text transformation processes take place in this buffer. For example, the commands @clear, @add, @indent, @newline all affect the text in the workspace buffer.

The machine looks like this

-machine-
 -workspace:
 -stack:...,...,...,...
 -tape:...,...,...,...,...,
 -peep:
 -counter:

The workspace buffer is analogous to a processor register in a non-virtual cpu. In order to manipulate a value, it is generally necessary to first load that value into a cpu register. In the same way, in order to manipulate some text in the chomski machine, it is necessary to first load that text into the workspace buffer. This can be achieved with the @get, @pop and @read commands.

&& About backus-naur form and the pattern parser
------------------------------------------------

Backus-Naur form is a way of expressing grammar rules for formal languages. A variation of BNF is "EBNF", which modifies slightly the syntax of BNF. Sometimes on this site I use (yet another) syntax for BNF or EBNF rules, but the idea is the same. There is a close relationship between the syntax of the 'pp' language and a BNF grammar. For example

>> /word*colon*/ { clear; add 'command*'; push; }

corresponds to the grammar rule

>> command := word colon

REFLECTION, SELF HOSTING AND SELF PARSING

The script "compile.pss" is a parse script which implements the parse-script language. This was achieved by first writing a hand-coded "assembler" program for the machine (contained in the file "asm.pp"). Once a working asm.pp program implemented a basic syntax for the language, the compile.pss script was written. This allows us to maintain and add new syntax to the language using the language itself. A new "asm.pp" file is generated by running

>> pp -f compile.pss compile.pss >> asm.new.pp; cp asm.new.pp asm.pp

The command above runs the script compile.pss and also uses compile.pss as its text input stream. In this sense the system is "self-hosting" and "self-parsing".
It is also a good idea to preserve the old copy of asm.pp in case there are errors in the new compiler.

** Test Structures

All tests may be negated by placing the '!' character -before- the test. For example

:> !// { add 'x'; }

If the workspace is not empty, then add the character 'x' to the workspace.

Multiple tests may be used before a brace block.

-->
 /text/ /word/ { print; }
--<

If the workspace is either 'text' or 'word' then print the workspace.

D- "equals" doc:test-is.txt.html test:
   :> /sometext/ {...}
   If the @workspace is exactly the same as the text contained within the forward slashes, then the instructions within the braces will be executed. If not, then the next instruction executed will be the first instruction after the matching closing brace.
- "begins with" doc:test-begin.txt.html test:
   :> <sometext> {...}
   If the text in the workspace begins with the text contained in the angle brackets, then the code in the braces will be executed.
- "ends with" doc:test-begin.txt.html test:
   :> (sometext) {...}
   If the text in the workspace ends with the text contained in the parentheses, then the code in the braces will be executed.
- "list" doc:test-list.txt.html test:
   :> =file.name= {...}
   This test checks the workspace against the lines of the text file 'file.name'. If the @workspace matches any one of the lines then the test will return true and the instructions within the braces will be executed.
- "tape = workspace" doc:test-class.txt.html test:
   :> == {...}
   Tests whether the current cell of the tape is equal to the current value of the workspace.
- "class" doc:test-class.txt.html test:
   :> [aeiuo] {...}
   If the text in the workspace is exactly any one of the characters contained within the square brackets, then the test will return true.
- "end of input stream" doc:test-eof.txt.html test:
   :> <> {...}
   This test checks to see whether the end of the input stream has been reached. If the input stream has no more characters to read, then the test returns true and the code within the braces will be executed.

All of the above tests can be "negated" by placing the "!" character -before- the test. For example, if the script contains the code

:> !/sometext/ {...}

then the instructions within the braces will be executed only if the @workspace buffer is -not- the text "sometext".

** Comments in scripts

Comments, which will be ignored by the interpreter, may be placed in a script file by -enclosing- the comment in "#" characters. Notice that this is different from the normal unix syntax of beginning each comment line with a hash character. For example

-->
 # This comment will be ignored by the interpreter
   Comments may span lines and must end with a hash character #
 /a/ { pop; }
--<

&& Compiling the chomski interpreter

&& a virtual machine cycle

Each statement in a given script is -not- only executed once. It is executed once for each cycle of the program, and the number of cycles is determined by the number of characters in the input stream. In some implementations an implicit @read operation will be performed for each cycle of the engine, but in other implementations no implicit read may be performed, and each script will have to include a read command at the beginning of the script. If this is not done the script may loop infinitely.

** sed the stream editor

The concept of "cycles" is drawn directly from the sed language or tool (sed is a unix utility). In the sed language, each statement in a sed script is executed once for each -line- in a given input stream.
In other words there is a kind of implicit "loop" which goes around the sed script. This loop, in some fictional programming language, might look like:

-->
 while more input lines
 do
   sed script
 loop
--<

In the current parse language the cycles are executed for each -character- in the input stream (as opposed to each line).

These cycles are also the reason that an interpreter for the language (see the "Parser.cpp" file and the "Program.cpp" file) cannot interpret the language in the normal line-at-a-time fashion. It would be very slow, because the interpreter would have to parse the input script once for each character in the input stream, and there might be thousands or millions of characters in the input stream. A modern computer could probably do this without being excessively slow, but a better approach is needed. The better approach is to "compile" the input script into some kind of array of instructions (with jumps and tests included). This is what the Parser and Program cpp classes do.

** shifting and reducing

There is one complicating factor, which is the concept of multiple shift-reduces ("shift-reduce parsing" [->wikipedia]) during one cycle of the interpreter. This concept has already been treated within the @flag command documentation.

Another tricky concept is grammar rule precedence; in other words, which grammar rule shift-reduction should be applied first, or with greater precedence. In terms of any concrete application, the order of the script statements determines precedence. ...

** a description of the c++ language interpreter files

In general the file endings are

D- dev: the devcpp project files
- name.main.cpp: a program to test the name.cpp class
- name.cpp: a class which is part of the -machine- implementation

D- file:/machine/Instruction.cpp : represents one instruction in a compiled Program.
- file:/machine/Instruction.dev : the bloodshed "devcpp" http://www.bloodshed.net/devcpp.html project file for the Instruction class
- file:/machine/Instruction.exe : an ms-windows executable to test the Instruction class. This file is the result of compiling the Instruction.dev project in the devcpp compiler.
- file:/machine/Instruction.h : the header file for the Instruction class
- file:/machine/Instruction.main.cpp : a c++ program to test the Instruction class
- file:/machine/Machine.cpp : The implementation of the @virtual-machine which is the basis of the parsing script language. See the file:/machine/java directory for an older implementation in java.
- file:/machine/Machine.dev : the project file for the virtual Machine class
- file:/machine/Machine.exe : an ms-windows exe to test the Machine class
- file:/machine/Machine.h : The c++ header file for the Machine class
- file:/machine/Machine.main.cpp : a program to test the Machine class. Allows operations to be performed on the machine, and the changes in the state of the machine to be watched.
- file:/machine/makefile.win : an ms-windows specific makefile which was generated by the devcpp compiler. Could be used as the basis of a unix makefile.
- file:/machine/Parser.cpp : A class which parses and compiles the script language. This class parses a script written in the language and compiles it into an array format using the Program class. This parsing and compiling operation removes the need to reparse the script for each character in the input stream.
- file:/machine/Parser.dev : the devcpp project file for the Parser class
- file:/machine/Parser.exe : an ms-windows executable which tests the Parser class.
The executable also acts as an interpreter for a given script, since it invokes the "run" method on the Program object which is created by the parser.
- file:/machine/Parser.h :
- file:/machine/Parser.main.cpp : This is a program to test the Parser class, which also acts as an interpreter for the script language.
- file:/machine/Program.cpp : represents a compiled script.
- file:/machine/Program.dev : the devcpp project file.
- file:/machine/Program.exe :
- file:/machine/Program.h :
- file:/machine/Program.main.cpp : a program to test the Program class. This is very useful for learning about the parse language and ":virtual-machine" because it provides a command interpreter which allows the user to add instructions to a test program and execute them while watching the changes in the state of the virtual machine.
- file:/machine/Tape.cpp : represents the @tape data structure within the ":virtual-machine". The tape is essentially an array with a pointer.
- file:/machine/Tape.exe : an ms-windows executable to test the Tape class
- file:/machine/Tape.h :

As can be seen from the above files, any of the classes can be tested by compiling the ".main.cpp" file, which provides a program to test the class. To compile a .main.cpp file on a windows computer, the bloodshed devcpp compiler can be used, by loading the ".dev" project file. For example, to compile the file:/machine/Machine.main.cpp file, load the file:/machine/Machine.dev project file in the devcpp compiler and use the menu compile option. By loading the ".dev" project file, all dependency classes will be automatically loaded into the compiler.

** other files

D- file:/machine/Interpret.cpp : a class which parses a given script and generates c++ source code for it. The source code can then be compiled in order to execute the script.
- file:/machine/Interpret.dev : The "devcpp" http://www.bloodshed.net/devcpp.html project file
- file:/machine/Interpret.exe : An ms-windows executable
- file:/machine/Interpret.h :
- file:/machine/Interpret.main.cpp :
- file:/machine/Interpretor.dev :
- file:/machine/Compile.cpp :
- file:/machine/Parser.all.cpp :
- file:/machine/MachineBasic.cpp : an older version of the @virtual-machine

&& A Glossary for the pp language
--------------------------------------

 parse: find the structure of a text pattern
 compile: transform a pattern into another
 machine: a virtual machine which is used to execute a script
 pattern: a formal language, in the mathematical or linguistic sense.
 shift-reduce: a method of parsing.

&& How to compile the pp tool
-----------------------------------------

These instructions will most probably change, as the c source will be divided into smaller files for the sake of maintainability.

-On an x86 computer-

a- download "tiny c" http://fabrice.bellard.free.fr/tcc/
- unzip the tiny c distribution in some folder
- download file:library.c and file:chomski.c and place the two files in the "tcc" folder (which was created when the tiny c distribution was unzipped).
- compile with a command line such as
  :> \tcc\tcc -I\tcc\include -L\tcc\lib chomski.c
- The resulting executable is around 40k

-On a non-x86 computer-

a- download file://library.c and file://chomski.c
- download or obtain a c compiler, for example the GNU c compiler.
- compile the code. The exact procedure I don't know, because I haven't done it.

** a documentation directory

This directory contains documentation for a script language which is designed for implementing parsers.
In fact, that is a very simplified explanation, because the language can also be used to translate one text format into another. The language has similarities with "sed" [->wikipedia], which is a stream editor for text. The language described here is also a text stream editor, but it works with characters rather than lines of text.

Most of the documents are written in English. If a document is also written in Catalan it has a name like "nomarxiu.ca.txt", and you can view it as a web page by typing "nomarxiu.ca.txt.html".

D- "commands" file:commands.txt.html : a list of the keywords of the language
- "virtual machine" file:virtual-machine.txt.html : describes the components of the virtual machine

** Other documents

D- file:how-to-compile.txt.html : how to compile the source code of the program
- file:file-description.txt.html : describes the "c++" (->wikipedia:ca) source files in the file:/machine directory
- file:language.txt.html : a general description of the language
- file:script-trace.txt.html : tries to show how the language and the virtual machine work together

** documentation in html

To view a document as a web page, add ".html" to the end of the file name, for example "nom.txt.html".

&& The input stream of a script
-------------------------------

Each script for the pattern parser assumes that there is an input stream which is to be processed by the instructions or commands contained within the script. In this sense the parse language follows the unix tradition of using "pipes" to process text streams using small programs. Programs such as "wc" (unix) and "grep" are examples of unix-style programs which process a text input stream and produce a text output stream. The 'pp' tool works in the same way as these programs and can form part of a normal "unix pipe" line. (These pipes are also available within the ms-windows or "Dos" command line.)

Let us look at an example. A very simple script in the parse language would be

>> /"/ { clear; add "'"; } print; clear;

This script just replaces each instance of " within a text file with the character ' (which is not very useful, since there are already many programs which can do this).

The above script can be saved in a file called "quote.txt", for example, and then executed. It is executed at the ms-windows or *nix command console by typing

>> type file.txt | Parser quote.txt | more     ##(ms-windows)
or
>> cat file.txt | Parser quote.txt | less      ##(unix)

These command lines are equivalent and demonstrate the use of pipes and input and output streams. The file "file.txt", which contains normal text with " characters in it, forms the input stream for the parse-language script (quote.txt), which in turn creates an output stream which is consumed by the "more" or "less" programs (these programs just display the output one page at a time). The user will just see the original file.txt with all its " double quotes changed to ' single quotes, but the actual "file.txt" is unchanged, since the script operates on the input stream and not on the file itself.

THE PATTERN PARSER SCRIPT LANGUAGE

The parsing script language is a "sed"-like language which is designed to parse and transform "context free languages", or, in non-technical terminology, is designed to re-format text patterns.

NAMING OF THE LANGUAGE

The tool is named 'pp' for Pattern Parser, and the virtual machine may simply be referred to as the parse virtual machine.
I originally named this language 'chomski' but decided to follow the Unix tradition of short names. AN EXAMPLE SCRIPT >>> /=/ { put; clear; add "operator*"; push; } pop; /operator*number*/ { clear; add "equation*"; push; } push; <<< As can be seen the language has a number of things in common with sed. u- it is a cycle based language; that is, each command in the script is executed once for each cycle of the interpreter. - it uses "tests" in the form /text/ - it has the concept of a "workspace" or "pattern space" like sed, being a central text buffer against which tests and manipulations occur - it also uses other buffers, like the sed @hold-space but in the case of the parse language the buffers are more capable because they include a @stack and a ":tape". THE PURPOSE OF THE LANGUAGE The language is designed to allow the simple creation of parsers and translators, without the necessity to become involved in the complexities of something like Lex or Yacc. Since the interpreter for the language is based on a virtual machine, the language is platform independent and has a level of abstraction. The language combines the features of a tokenizer, parser and translator. AVAILABLE IMPLEMENTATIONS http://bumble.sf.net/books/pars/gh.c and other related files in the "gh/" folder such as asm.pp compile.pss compile.ccode.pss ... TRACE OF A SCRIPT This section provides a trace of a script to show how the virtual machine changes. a script which "lexes" ---------- read; '"' { put; clear; add "quote*"; push; } "a", "b" { put; clear; add "letter*"; push; } clear; ,,,, And here is the input stream (text) :> "a","b" ** the trace -machine- -stack: -tape:-->, ... -workspace: -peep:" -notes- u- the @peep buffer contains the next character in the input stream (in this case a quote character) - the dots within the @tape structure indicate that the tape is an "infinite" array. - the arrow in the first cell of the tape indicates that the @tape-pointer is currently pointing to that cell (since the tape pointer starts at zero). - the @stack and @workspace are initially empty -machine- -stack: -tape:-->,... -workspace:" -peep:a -notes- u- the peep buffer is a single character buffer and now contains the next character from the input stream - in each @cycle of the interpreter the peep character is -added- to the end of the workspace, so the workspace is now " script statement: :> /"/ This statement is a @test statement. It checks to see if the workspace buffer is exactly the text within the "/" characters. In this case the condition is true so the statements within the braces are executed :> put; This statement puts the contents of the @workspace buffer into the current element of the @tape data structure. The current element is the element where the pointer is pointing. So the machine now looks like -machine- -stack: -tape:--> ",... -workspace:" -peep:a -notes- u- The first element of the tape now contains the character ", which was the contents of the workspace. The workspace has -not- changed because the @put operation does not delete or change the workspace in any way - also the tape pointer is still pointing to the first element because the put operation does not change the ":tape-pointer" statement: :> clear; This statement simply clears the workspace buffer of all text, that is, sets it to a zero length string, so the machine now looks like -machine- -stack: -tape:--> ",
-workspace: -peep:a next statement: :> add "quote*"; This is also a straightforward statement; it simply adds the given text to the workspace buffer. So the machine is now -machine- -stack: -tape:--> ",... -workspace:quote* -peep:a -notes- u- the * character at the end of the text is important because it indicates that what is being @add ed to the workspace is actually a token name. - all statements or commands in the language -must- end with a semi-colon, unlike sed next statement: :> push; The @push command pushes the contents of the workspace up to and including the first star character onto the stack so the machine now looks like -machine- -stack:quote* -tape: ",-->, ... -workspace: -peep:a -notes- u- the workspace is now empty because the @push command -does- change the workspace, unlike the -put- command. - The tape pointer is now pointing to the second item in the tape, because the push command automatically increments the tape pointer. - The @stack now has one element, namely the previous contents of the workspace buffer. - The @peep buffer is unchanged because no new character will be read from the input stream until all the script commands have been processed. - The character which is in the first element of the tape in a sense "corresponds" to the first element of the stack, in the sense that the quote character is the -value- of the "parse token" [->wikipedia] namely "quote*". In general the stack is for parse tokens and the tape is for values. But the machine can be used for any purpose which the script writer desires. -statement- : :> /a/ This is another test; it checks to see if the workspace is exactly the text "a" (no "regular expressions" [->wikipedia] are involved). In this case the answer is no. -statement- : :> /b/ another test, and the result is once again false because the workspace is currently empty. ** a new cycle so... now all the script statements have been processed (the second block of braces was skipped because the tests returned false). Now begins a new ":cycle". This means that a new character is read from the input stream into the @peep buffer and the previous peep character is added to the end of the workspace. Also the interpreter starts processing the script from the beginning again. So after the new cycle the machine now looks like -machine- -stack:quote* -tape:",-->, ... -workspace:a -peep:" The interpreter starts from the first script statement again. :> /"/ This test now returns -false- because the workspace contains an "a" character, not a quote character. (The test is -not- performed against the peep buffer, only the workspace buffer.) The test returns false and the brace block is skipped. The next statement is: :> /a/ another test, but this time it returns -true- because the workspace -does- equal 'a'. The statements in the block are executed. The next statement is :> put; The contents of the workspace are put into the element of the tape where the tape pointer is pointing so the machine now is -machine- -stack:quote* -tape:",-->a, ... -workspace:a -peep:" The next statement: :> clear; results in the ":virtual-machine" -machine- -stack:quote* -tape:",-->a, ... -workspace: -peep:" and the next statement :> add "letter*"; just adds the given text to the @workspace so the machine is now -machine- -stack:quote* -tape:",-->a, ... -workspace:letter* -peep:" -statement- : :> push; once again the contents of the workspace up to the first star are pushed onto the stack so the machine is now -machine- -stack:quote*,letter* -tape:",a,-->, ...
-workspace: -peep:" -notes- u- the tape pointer is incremented - the workspace is deleted up to and including the first star (in this case, all of it) - the stack gets another element - the @peep is unchanged because a new character is only read at the beginning of a cycle or because of an explicit @read command. ** and so on ... COMPARISON BETWEEN PP AND SED The "pp" tool and virtual machine were largely inspired by the "sed" stream editor. Sed is a program designed to find and replace patterns in text files. The patterns which Sed replaces are "regular expressions". The pp tool includes a more powerful virtual machine which is capable of parsing and compiling many "context-free" languages, whereas sed is mainly used to manipulate "regular" languages. ** Similarities * The workspace: Both sed and pp have a 'workspace' buffer (which in sed is called the "pattern space"). This workspace is the area where manipulations of the text input stream are carried out. * The script loop or cycle: The languages are based on an implicit loop or cycle. That is to say that each command in a sed or a pp script is executed once for each line (sed) or once for each character (pp). * Syntax: The syntax of sed and pp is similar. Statement blocks are surrounded by curly braces {} and statements are terminated with a semi-colon ";". Also the sed idea of "tests" based on the contents of the "pattern space" (or @workspace ) is used in the pp language. * Text streams: Both sed and 'pp' are text stream based utilities, like many other unix tools. This means that both sed and 'pp' consume an input text stream and produce an output text stream. These streams are directed to the programs using "pipes" | in a console window on both unix and windows systems. For example, for sed we could write in a console window >> echo "abcabcabc" | sed "s/b/B/g" and the output would be >> aBcaBcaBc In the case of the pattern parser the command line could be written >> echo "abcabcabc" | pp -e "'b'{d;add'B';t;d;}t;d;" and the output, once again, will be :> aBcaBcaBc In other words the sed and the pp script do the same thing. ** Differences * Lines vs characters: Sed is a "line" based language whereas pp is a character based script language. This means that the sed script is executed once for each line of the input text stream, but in the case of "pp", the script is executed once for each character of the input text stream, except for the "begin" block. * Strict syntax: The syntax of the pp language is stricter than that of sed. For example, in a pp script all commands must end in a semi-colon and all statements after a test must be enclosed in curly braces. In sed, it is not always necessary to terminate commands with a semi-colon. For example, both of the following are valid sed statements. >> s/b/B/g >> s/b/B/g; * White space: In the pattern-parse language (as implemented in "asm.pp" or compile.pss) whitespace is insignificant. In sed, this is not the case. For example in pp one can write: ---- "text" { print; } ,,,, That is, the opening brace is on a different line to the test. This would not be legal syntax in sed. * Commands names: Sed uses single character "mnemonics" for its commands. For example, "p" is print, "s" is substitute, "d" is delete. In the pp language, in contrast, commands also have a long form, such as "print" for print the workspace, "clear" for clear the workspace, or "pop" in order to pop the last element of the stack onto the workspace.
While the sed approach is useful for writing very short, terse scripts, the readability of the scripts is not good. * The virtual machines: While both sed and the pattern parser are in a sense based on "virtual machines", the machine for pp is more extensive. The sed machine essentially has 2 buffers, the pattern space and the hold space. The pp virtual-machine, however, has a workspace, a stack, a tape, a counter variable and various other registers. * Regular expressions: The tests or "ranges" in sed, as well as the substitutions, are based on regular expressions. In pp, however, no regular expressions are used. The reason for this difference is that pp is designed to parse and transform a different set of patterns than sed. The patterns that pp is designed to deal with are referred to formally as "context free languages". * Negation of tests: The negation operator ! in the pp language is placed before the test to which it applies, whereas in sed, the negation operator comes after the test or range. So, in the pattern-parse language it is correct to write >> !"tree" { add ': not a tree'; print; clear;} But in sed the correct syntax is >> /tree/! { s/$/not a tree/; } ** The self referentiality of the parse-language The parse-language is capable of parsing and compiling itself, with one or two caveats. This is because the parse-language is designed to parse and transform, translate or compile context free languages (but not necessarily all context free languages) and the language itself is a context free language. SHIFT REDUCTIONS ON THE STACK .... Imagine we have a 'pp' script as follows: ---- pop;pop; "command*command*", "commandset*command*" { clear; add 'commandset*'; push; } "word*colon*" { clear; add 'command*'; push; } push; push; ,,, This script corresponds directly to the (e)bnf grammar rules --- commandset := command , command; commandset := commandset , command; command := word , colon; ,,,, But in the script above there is a problem: the first rule needs to be applied after the second rule. But it seems now that another reduction should occur, namely >> if-block --> if-statement block which can be simply implemented in the language using the statements: ----- pop; pop; "if-statements*block*" { clear; add "if-block*"; push; } ,,,, But the crucial question is, what happens if the statements just written come before the statements which were presented earlier on? The problem is that the second reduction will not occur because the script has already passed the relevant statements. This is the reason for the .reparse command. TASKS TO DO PATTERN PARSER VIRTUAL MACHINE AND TOOL * create a java version of the machine? * create a proper "check.syntax.pss" script which will expand the syntax checks in compile.pss IDEAS FOR CHANGING THE PATTERN PARSER * add a new command "untiltape" which has no arguments, and which reads the input stream until the workspace ends with the text contained in the current cell of the tape. * add an execute command, which executes the workspace as a shell command and inserts the output in the output stream. This adds unlimited power to the language, but is it really necessary???? PARSE TOKENS OVERVIEW The parsing script language is a "sed" like language which is designed to parse and transform "context free languages" or, in non-technical terminology, is designed to re-format text patterns. NAMING OF THE LANGUAGE The tool is named 'pp' for Pattern Parser, and the virtual machine may be referred to as the parsing virtual machine.
I originally named this language 'chomski' but decided to follow the Unix tradition of short names. AN EXAMPLE SCRIPT ---- read; "=" { put; clear; add "operator*"; push; } parse> pop;pop; "operator*number*" { clear; add "equation*"; push; .reparse } push; push; ,,,, As can be seen the language has a number of things in common with sed. * it is a cycle based language; that is, each command in the script is executed once for each cycle of the interpreter. * it uses "tests" in the form "text" [abcd] [a-d] [:space:] ![:alpha:] * it has the concept of a "workspace" or "pattern space" like sed, being a central text buffer against which tests and manipulations occur * it also uses other buffers, like the sed "hold-space", but in the case of the parse language the buffers are more capable because they include a stack and a "tape" (which is an array of text cells). THE PURPOSE OF THE LANGUAGE The language is designed to allow the simple creation of parsers and translators, without the necessity to become involved in the complexities of something like Lex or Yacc. Since the interpreter for the language is based on a virtual machine, the language is platform independent and has a level of abstraction. The language combines the features of a tokenizer, parser and translator. AVAILABLE IMPLEMENTATIONS http://bumble.sourceforge.net/books/pars/object/ This folder contains a working implementation in the c language. The code can be compiled with the bash functions in the file "helpers.gh.sh". http://bumble.sourceforge.net/books/pars/object.java A reasonably complete machine object in java. We can use this in conjunction with the script language to create compilable java code (the script "compilable.c.pss" provides an example of how to do this for the c language). COMPARISON BETWEEN PP AND SED The pp tool was largely inspired by the "sed" stream editor. Sed is a program designed to find and replace patterns in text files. The patterns which Sed replaces are "regular expressions". SIMILARITIES .... * The workspace: Both sed and chomski have a 'workspace' buffer (which in sed is called the "pattern space"). This workspace is the area where manipulations of the text input stream are carried out. * The script cycle: The languages are based on an implicit cycle. That is to say that each command in a sed or a chomski script is executed once for each line (sed) or once for each character (chomski). * Syntax: The syntax of sed and chomski is similar. Statement blocks are surrounded by curly braces {} and statements are terminated with a semi-colon ";". Also the sed idea of "tests" based on the contents of the "pattern space" (or @workspace ) is used in the chomski language. * Text streams: Both sed and chomski are text stream based utilities, like many other unix tools. This means that both sed and chomski consume an input text stream and produce an output text stream. These streams are directed to the programs using "pipes" | in a console window on both unix and windows systems. For example, for sed we could write in a console window >> echo abcabcabc | sed "s/b/B/g" and the output would be >> aBcaBcaBc In the case of "chomski", the command line could be written >> echo abcabcabc | chomski -s "/b/{clear;add'B';print;clear;}print;clear;" and the output, once again, will be :> aBcaBcaBc In other words the sed and the chomski script do the same thing. - DIFFERENCES .... * Lines vs characters: Sed is a "line" based language whereas chomski is a character based script language.
This means that the sed script is executed once for each line of the input text stream, but in the case of chomski, the script is executed once for each character of the input text stream. * Strict syntax: The syntax of the parsing language is stricter than that of sed. For example, in a parsing script all commands must end in a semi-colon (except .reparse, parse> and .restart - which affect program flow) and all statements after a test must be enclosed in curly braces. In sed, it is not always necessary to terminate commands with a semi-colon. For example, both of the following are valid sed statements. >> s/b/B/g >> s/b/B/g; * White space: In the parsing language whitespace is completely insignificant, even within quoted arguments (?). In sed, this is not the case. For example in pp one can write ------ "text" { print; } ,,,, That is, the opening brace is on a different line to the test. This would not be legal syntax in sed. * Commands names: Sed uses single character "mnemonics" for its commands. For example, "p" is print, "s" is substitute, "d" is delete. In the chomski language, in contrast, commands are full English words, such as @print for print the workspace, @clear for clear the workspace, or @pop in order to pop the last element of the @stack onto the workspace. While the sed approach is useful for writing very short, terse scripts, the readability of the scripts is not good. Chomski sacrifices conciseness for readability. * The virtual machines: While both sed and chomski are in a sense based on "virtual machines" [->wikipedia], the machine for chomski is more extensive. The sed machine essentially has 2 buffers, the pattern space and the hold space. The chomski @virtual-machine, however, has a workspace, a stack, a tape, and a counter variable, amongst other things. * Regular expressions: The tests or "ranges" in sed, as well as the substitutions, are based on regular expressions. In pp, however, no regular expressions are used. The reason for this difference is that pp is designed to parse and transform a different set of patterns than sed. The patterns that pp is designed to deal with are referred to formally as "context free languages" [->wikipedia]. * Negation of tests: The negation operator ! in the "chomski" language is placed before the test to which it applies, whereas in sed the negation operator comes after the test or range. So, in "chomski" it is correct to write >> !/tree/ { ... } But in sed the correct syntax is >> /tree/! { ... } COMMANDS ** Test Structures All tests may be negated by placing the '!' character -before- the test. For example :> !// { add 'x'; } If the workspace is not empty then add the character 'x' to the workspace. Multiple "equals" tests may be used before a brace block. "text","word" { print; } --< If the workspace is either 'text' or 'word' then print the workspace. D- "equals" doc:test-is.txt.html test: :> /sometext/ {...} If the @workspace is exactly the same as the text contained within the forward-slashes then the instructions within the braces will be executed. If not, then the next instruction executed will be the first instruction after the matching closing brace. - "begins with" doc:test-begin.txt.html test: :> <sometext> {...} If the text in the workspace begins with the text contained in the angle brackets, then the code in the braces will be executed. - "ends with" doc:test-begin.txt.html test: :> (sometext) {...} If the text in the workspace ends with the text contained in the parentheses, then the code in the braces will be executed.
- "list" doc:test-list.txt.html test: :> =file.name= {...} This test checks the workspace against the lines of the text file 'file.name'. If the @workspace matches any one of the lines then the test will return true and the instructions within the braces will be executed. - "tape = workspace" doc:test-class.txt.html test: :> == {...} tests if the current element of the tape is equal to the current value of the workspace - "class" doc:test-class.txt.html test: :> [aeiuo] {...} If the text in the workspace is exactly any one of the characters contained within the square braces then the test will return true. - "end of input stream" doc:test-eof.txt.html test: :> <> {...} This test checks to see if the end of the input stream has been reached. If the input stream has no more characters to read then the test returns true and the code within the braces will be executed. All of the above tests can be "negated" by placing the "!" character -before- the test. For example, if the script contains the code :> !/sometext/ {...} then the instructions within the braces will be executed only if the @workspace buffer is -not- the text "sometext". ** Comments in scripts Comments, which will be ignored by the interpreter, may be placed in a script file by -enclosing- the comment in "#" characters. Notice that this is different from the normal unix syntax of beginning each line with a hash character. For example --> # This comment will be ignored by the interpreter Comments may span lines and must end with a hash character # /a/ { pop; } --< BACKUS-NAUR FORM AND THE PATTERN PARSER Backus-naur form is a way of expressing grammar rules for formal languages. A variation of BNF is "EBNF" which modifies the syntax of bnf slightly. Sometimes on this site I use (yet another) syntax for bnf or ebnf rules, but the idea is the same. There is a close relationship between the syntax of the 'pp' language and a bnf grammar. For example >> "word*colon*" { clear; add "command*"; push; } corresponds to the grammar rule >> command := word colon PP LANGUAGE PARSING ITSELF An important problem for the pp language is to "generate itself" from a set of bnf rules. In other words, given rules such as ----- command := word semicolon; command := word quotedtext semicolon; ,,,, then it should be possible to write a script in the language which generates output as follows ---- pop;pop; "word*semicolon*" { clear; add "command*"; push; .reparse } pop; "word*quotedtext*semicolon*" { clear; add "command*"; push; .reparse } push;push;push; ,,,, TOKEN ATTRIBUTE TRANSFORMATIONS (Note: the following is not strictly necessary, since we can already write complete translators/compilers with the script language as it is. However it may be nice to have a more expressive format like the one shown below.) After the above problem is solved it would be necessary to add a syntax for transforming the values of the tokens, which is what will actually allow the writing of compilers or translators. For example: * a more expressive compiling language built on top of the parse language ---- command := word semicolon { $0 = "" $1 " $2; }; ... The $n structure will fetch the tape-cell for the corresponding identifier in the bnf rule at the start of the brace block. REPARSE COMMAND AND PARSE LABEL This command makes the interpreter jump back to the "parse>" label. The label must occur before the .reparse command, not after.
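To make the label placement concrete, here is a minimal sketch of a script skeleton (the token names are invented for illustration and this is not code taken from asm.pp or compile.pss): the lexing statements sit before the "parse>" label, and the reductions which end in .reparse sit after it, so that .reparse can jump back to the label and re-examine the stack.

* a sketch showing where the parse> label and .reparse sit
----
# lexing statements (run for each character of input)
read;
";" { clear; add "semicolon*"; push; }
[a-z] { while [a-z]; clear; add "word*"; push; }

parse>
# reductions: .reparse jumps back to the parse> label above
pop; pop;
"word*semicolon*" { clear; add "command*"; push; .reparse }
push; push;
,,,

Because the reduction ends with .reparse, the newly created command* token is examined again in the same cycle, which is what allows several shift-reduce steps to happen for a single input character.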
The .reparse command takes no arguments and is written >> .reparse ** Multiple shift reductions The reason for using this command lies in the need to apply multiple "shift reduce" operations during one cycle of a script. See shift-reduce.txt for a more complete explanation. SHIFT REDUCTIONS ON THE STACK Imagine we have a "recogniser" pp script as follows: ----- read; ";" { clear; add "semicolon*"; push; } [a-z] { while [a-z]; clear; add "word*"; push; } parse> pop;pop; "command*command*" { clear; add 'commandset*'; push; .reparse } "word*semicolon*" { clear; add 'command*'; push; .reparse } push; push; (eof) { pop; pop; push; !"" { clear; add "incorrect syntax! \n"; print; quit; } pop; "command*", "commandset*" { clear; add "Correct syntax! \n"; print; quit; } } ,,,,, This script recognises a language which consists of a series of "commands" (which are lower case words) terminated in semicolons. At the end of the script, there should only be one token on the stack (either command* or commandset*). This script corresponds directly to the bnf grammar rules ---- word := [a-z]+; semicolon := ';'; commandset := command command; command := word semicolon; ,,,, But in the script above there is a problem: the first rule needs to be applied after the second rule. Statements like these in the parse language execute a reduction according to a grammar rule. Let us examine the state of the machine before and after a reduction of this kind (the example below comes from parsing an if-statement, where open-brace* statements* close-brace* is about to be reduced to block*): -machine- -stack: if-statement*, open-brace*, statements*, close-brace* -tape: if(a==b), {, a=1; b=2*a; ..., } -workspace: and afterwards -machine- -stack: if-statement*, block* -tape: if(a==b), { a=1; b=2*a; ... } -workspace: But it seems now that another reduction should occur, namely :> if-block --> if-statement block which can be simply implemented in the language using the statements: pop; pop; /if-statement*block*/ { clear; add "if-block*"; push; } But the question is, what happens if the statements just written come *before* the statements which were presented earlier on? The problem is that the second reduction will not occur because the script has already passed the relevant statements. This is the reason for the .reparse command. If this explanation has not clarified matters, it should be said that the best way to understand the workings of the machine is to look at a trace of its execution, which can be obtained from the Program class. SELF REFERENTIALITY The parse script language is a language which is designed to parse/compile/translate languages. This means that it can recognise, parse, compile, and translate itself. The script books/pars/compilable.c.pss is an example of this. Another interesting application of this self-referentiality is creating a new compiling system in a different target language. NEGATION OF TESTS a test such as >> "some-text" { #* commands *# } can be modified with the logic operator "!" (not) as in >> !"some-text" { #* commands *# } Note that the negation operator '!' must come before the test which it modifies, instead of afterwards as in sed.
Multiple negations are not allowed >> !!"some-text" {...} # incorrect ** Examples If the workspace is not a single digit then clear the workspace >> ![0-9] { clear; } If the workspace does not begin with the text 'apple' then print the workspace >> !B"apple" { print; } # not implemented (yet, august 2019) If the workspace does not match any of the lines of the file 'verbs.txt' then add the text 'noun' to the workspace >> !=verbs.txt= { add 'noun'; } If the end of the input stream has not been reached then push the contents of the workspace onto the stack >> !(eof) { push; } # not implemented (august 2019) If the value of the workspace is not exactly equal to the value of the current element of the tape, then exit the script immediately >> !(==) { quit; } # not implemented (august 2019) COMMENTS IN THE PARSER LANGUAGE Both single-line and multiline comments are available in the parser script language as implemented in books/pars/asm.pp and compile.pss * single and multiline examples ---- #* check if the workspace is the text "drib" *# "drib" { clear; # clear the workspace buffer } ,,,, The script books/pars/compile.pss attempts to preserve comments in the output "assembler" code to make that code more readable. COMMANDS ADD COMMAND .... The "add" command adds a piece of text to the -end- of the 'workspace' buffer. No other register or buffer within the virtual machine is affected. The command is written >> add 'text'; or >> add "text"; ** arguments for the command ** examples * Add a space after every character of the input >> add ' '; print; clear; * Add a token name to the workspace while parsing >> /style*type*/ { clear; add "command*"; push; } The second script above does a shift-reduce operation while parsing some hypothetical language. The "add" command is used to add a new token name to the 'workspace' buffer which is then pushed onto the stack (using the 'push' operation, naturally). In the above script the text added can be seen to be a token name (as opposed to some other arbitrary piece of text) because it ends in the "*" character. Actually any character could be used to delimit the token names, but "*" is the default implementation. The add command takes *one* and only one parameter, or argument. QUIT COMMAND .... This command immediately exits out of the script without processing any more script commands or input stream characters. This command is similar to the "Sed" command 'q'. INCREMENT COMMAND .... This command is simple but important. The command adds -one- to the 'tape'-pointer. The command is written >> ++; Notice that the only part of the machine state which changes is that the tape-pointer is incremented by one cell. This command allows the tape to be accessed as an array of strings. Being able to increment and decrement the tape pointer allows the script writer and the virtual machine to access any value on the tape using the 'get' and 'put' commands. It should be remembered that the 'push' and 'pop' commands also automatically increment and decrement the tape pointer, in order to keep the tape pointer and the stack in synchronization. Since the get command is the only way to retrieve values from the tape, the ++; and --; commands are necessary. The tape cannot be accessed using some kind of array index because the 'get' and 'put' commands do not have any arguments. As an example, the following script will print the string associated with the "value" token. >> pop;pop; "field*value*" { ++; get; --; print; } MINUS COMMAND .... The minus command decreases the counter variable by one.
This command takes no arguments and is written :> minus; PEEP BUFFER The peep buffer is a single character buffer which stores the next character in the input stream. When a 'read' command is performed the current value of the peep buffer is appended to the 'workspace' buffer and the next character from the input stream is placed into the peep buffer. The 'test-eof' <> checks to see if the peep buffer contains the end of input stream marker. The 'while' command reads from the input stream while the peep buffer is, or is not, some set of characters. For example :> while 'abc'; reads the input stream while the peep buffer is any one of the characters 'abc'. PLUS COMMAND .... This command increments the machine counter variable by one. It is written >> a+; The plus command takes no arguments. Its counterpart is the 'minus' command. ** See also D- 'count' : adds the value of the counter to the end of the 'workspace' buffer. - 'minus' : decrements the internal counter by one. - 'zero' : sets the internal counter to zero POP COMMAND .... NOTES The pop command is usually employed in the parsing phase of the script (not the lexing phase). DETAILS .... If the stream parser language is being used to parse and translate a language then the script writer needs to ensure that each token ends with a star character in order for the push and pop commands to work correctly. The pop command also affects the "tape" of the virtual machine in that the tape-pointer is automatically decremented by one for each pop command. This is convenient because it means that the tape pointer will be pointing to the corresponding element for the token. In other words, in the context of parsing and compiling a "formal language", the tape will be pointing to the "value" or "attribute" for the token which is currently in the workspace. PUSH COMMAND .... The push command pushes part of the workspace buffer onto the virtual machine stack. The command is written >> push; The "push" command is the inverse of the "pop" command. The command takes no arguments. The push command also automatically increments the 'tape' pointer by one. The tape pointer is incremented so that the push command can be used in conjunction with the 'put' command. ** The virtual machine and push The delimiter character (by default '*') is important here, because it is used to separate the token names as they are being pushed onto the 'stack'. That is, the machine reads the workspace up to the first "*" and pushes that portion of the workspace onto the stack, and the workspace is diminished by that text. In the diagrams above the arrow in the tape indicates the position of the tape pointer. The "push" and "pop" operations increment or decrement the tape pointer in order to keep it "synchronized" with the stack. It is important to note that after a push operation the pointer is actually pointing to one element -after- the last item on the tape. ** When the workspace is empty When the workspace contains no text then the push operation actually does nothing. This is desirable because it allows the script writer to include possibly redundant "push" operations without affecting the validity of the script. For example ---- pop; "token*" { clear; add "idea*"; push; .reparse } ,,, The script above pops the stack once onto the workspace and then checks if the workspace matches the text "token*". If so, it clears the workspace, adds the text "idea*" and then pushes the text onto the stack.
After this push operation the workspace will be empty and so the push which occurs after the final closing brace will not do anything. UNSTACK COMMAND .... Pop the entire stack as a prefix onto the workspace. This may be useful for displaying the state of the stack at the end of parsing or when an error has occurred. Currently (Aug 2019) the tape pointer is not affected by this command. STACK COMMAND .... Push the entire workspace onto the stack regardless of token delimiters. PUT COMMAND .... The put command places the entire contents of the workspace into the current tape cell, overwriting any previous value which that cell might have had. The command is written >> put; The contents of the workspace are not changed. * put the text "one" in the current tape cell >> clear; add "one"; put; ++; After a put command the workspace buffer is unchanged. This contrasts with the stack "push" command, which pushes a certain amount of text from the workspace onto the stack and deletes the same amount of text from the workspace buffer. The put command is the reverse of the 'get' command, which retrieves or gets the contents of the current item of the tape data structure. Since the tape is generally designed for storing the values or attributes of parse tokens, the put command is essentially designed to store the values of attributes. The put command can be used in conjunction with the 'increment' ++ and 'decrement' -- commands to store values in tape cells other than the current one. READ COMMAND .... The read command reads one character from the input stream and places that character in the 'peep' buffer. The character which -was- in the peep buffer is added to the -end- of the 'workspace' buffer. The read command is the fundamental mechanism by which the input stream is "tokenized", also known as "lexical analysis". The commands which also perform tokenization are "until", "while" and "whilenot". These commands perform implicit read operations. An implicit read operation may also be performed at the beginning of each "cycle" of the script interpreter. This depends upon the implementation and is an issue which I haven't resolved yet. If an implicit read is not performed then each script will need to have an explicit read command at the beginning of the script. REPLACE COMMAND .... This command replaces one piece of text with another in the workspace. I wanted to keep the machine very simple and only allow the 'workspace' to be altered with the 'clear', 'add', and 'indent' commands (and also implicitly with the 'push', 'pop', 'get', 'put' ... etc commands). But for some applications a replace command seems to be necessary. && The 'until' command ---------------------- SUMMARY until 'text'; until 'text' not 'words'; until; reads the input stream until the workspace ends with the given text or with the text of the current tape element. EXAMPLES ....
* print any text that occurs between '<' and '>' characters >> read; "<" { until ">"; print; } clear; * print text between quote characters (excluding the quotes) >> read; '"' { until '"'; clip; clop; print; } clear; * create a parse token 'quoted' from quoted text ------ read; '"' { until '"'; clip; clop; put; clear; add 'quoted*'; push; .reparse } clear; ,,, * print quoted text, ignoring 'escaped' quotes >> /"/ { until '"' not '\"'; print; } clear; The 'while' and 'whilenot' commands are similar to the until command but they depend on the value of the 'peep' virtual machine buffer (which is a single-character buffer) rather than on the contents of the 'workspace' buffer like the until command. NOTES The 'until' command will usually form part of the 'lexing' phase of a script. That is, the until command permits the script to turn text patterns into 'tokens'. While in traditional parsing tools (such as Lex and Yacc) the lexing and parsing phases are carried out by separate tools, with the 'pp' tool the two functions are combined. The until command essentially performs multiple read operations and, after each read, checks to see whether the workspace meets the criteria specified in the argument. PROPOSED IMPLEMENTATION CHANGES A variant of the until command may be implemented, without any argument. That is >> until; will read the input stream until the workspace ends with the same text as is contained in the current tape element. This form of the 'until' command would be useful for parsing, for example, a sed substitution command such as 's@this@that@g' or 's/this/that/g' where the delimiter character is whatever character follows the 's' character. WHILE COMMAND .... The 'while' command in the pp language reads the input stream while the 'peep' buffer is any one of the characters or character sets mentioned in the argument. The command takes one argument and is written in a script as: >> while 'cdef'; The argument may include character classes as well as literal characters. For example >> while [:space:]; reads the input stream while the peep buffer is a whitespace character. The read characters are appended to the 'workspace' buffer. NEGATION The while command also supports the negation operator ! and this is written as follows :> while !'text'; where text is a sequence of characters and character classes. WHILENOT COMMAND .... This is a command that was dreamed up in a rash moment. It is yet to be seen if it really is necessary. It reads characters from the input stream into the workspace -while- the 'peep' buffer is -not- a certain character. This is a "tokenizing" [->wikipedia] command and allows the input stream to be parsed up to a certain character without reading that character. An example ----- '"' { clear; whilenot '"'; print; } ,,,, The above script tokenizes a quoted text section and prints the contents to the standard output. ** See also 'while', 'until', 'virtual-machine' ZERO COMMAND .... The "zero" command sets the internal counter to zero. This counter may be used to keep track of nesting during parsing, or for other more mundane purposes such as numbering lines or instances of a particular string or pattern. * set the counter to zero whenever the workspace is "x" >> read; "x" { zero; } GRAMMAR AND SCRIPT CONSTRUCTION My knowledge of formal grammar theory is quite limited. I am more interested in practical techniques. But there is a reasonably close correlation between bnf-type grammar rules and script construction.
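As a small illustration of that correlation (the grammar and token names below are invented, and the fragment assumes a lexing phase above it which pushes item* tokens, together with a parse> label), a left-recursive pair of bnf rules typically becomes a pair of reductions:

* a bnf rule pair and the corresponding reductions (sketch only)
----
# grammar:  list := item ;
#           list := list item ;
pop; pop;
"item*" { clear; add "list*"; push; .reparse }
"list*item*" { clear; add "list*"; push; .reparse }
push; push;
,,,

The first test handles the base case of the rule and the second handles the left-recursive case; because both end with .reparse, a run of items collapses into a single list* token as the items arrive on the stack.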
COMPILATION TECHNIQUES One of the enjoyable aspects of this parsing/compiling machine is discovering interesting practical "heuristic" techniques for compiling syntactical structures, within the limitation of the machine capabilities. This section will detail some techniques. REPETITIONS .... The trick below only creates a new list token if the preceding token is not a list of the same type. * a technique for building a list token from repeated items ---- # ebnf rules: # alist := a {a} # blist := b {b} read; # terminal symbols "a","b" { add "*"; push; } !"" { put; clear; add "incorrect character '"; get; add "'"; add " at position "; cc; add "\n"; add " only a's and b's allowed. \n"; print; quit; } parse> # 1 token (with extra token) pop; "a*" { pop; !"a*".!"alist*a*" { push; } clear; add "alist*"; push; .reparse } "b*" { pop; !"b*".!"blist*b*" { push; } clear; add "blist*"; push; .reparse } push; { unstack; put; clear; add "parse stack is: "; get; print; quit; } ,,,, OFFSIDE OR INDENT PARSING .... Some languages use indentation to indicate blocks of code, or compound statements. Python is an important example. These languages are parsed using "indent" and "outdent" or "dedent" tokens. Currently (Aug 2019) the parse machine cannot parse these languages. A simple extension to the machine would be to add "marks" to tape cells as well as a "go" command that jumps to a previously set mark. This would allow parsing of offside languages. OPTIONALITY .... The parse machine cannot directly encode the idea of an optional "[...]" element in a bnf grammar. * a rule with an optional element >> r := 'a' ['b'] . In some cases we can just factor out the optional into alternation "|" >> r := 'a' | 'a' 'b' . However once we have more than 2 or 3 optional elements in a rule, this becomes impractical, for example >> r := ['a'] ['b'] ['c'] ['d'] . In order to factor out the optionality above we would end up with about 4 + (3+2+1) + (2+1) + 1 = 14 rules another approach is to encode some state in a parse token. * a strategy for parsing optionalities ---------- # parse the ebnf rule # rule := ['a'] ['b'] ['c'] ['d'] ';' . begin { add "0.rule*"; push; } read; [:space:] { clear; } "a","b","c","d",";" { add "*"; push; .reparse } !"" { add " unrecognised character."; print; quit; } parse> pop; pop; E"rule*a*" { B"0" { clear; add "1.rule*"; push; .reparse } clear; add "misplaced 'a' \n"; print; quit; } E"rule*b*" { B"0",B"1" { clear; add "2.rule*"; push; .reparse } clear; add "misplaced 'b' \n"; print; quit; } E"rule*c*" { B"0",B"1",B"2" { clear; add "3.rule*"; push; .reparse } clear; add "misplaced 'c' \n"; print; quit; } E"rule*d*" { B"0",B"1",B"2",B"3" { clear; add "4.rule*"; push; .reparse } clear; add "misplaced 'd' \n"; print; quit; } E"rule*;*" { clear; add "rule*"; push; } push; push; (eof) { pop; "rule*" { add " its a rule!"; print; quit; } } ,,,, REPETITION PARSING .... Similar to the notes above about parsing grammar rules containing optional elements, we have a difficulty when parsing elements or tokens which are enclosed in a "repetition" structure. In ebnf syntax this is usually represented with either braces "{...}" or with a kleene star "*". We can use a similar technique to the one above to parse repeated elements within a rule. The rule parsed below is equivalent to the regular expression >> /a?b*c*d?;/ So the script below acts as a recogniser for the above regular expression. I wonder if it would be possible to write a script that turns simple regular expressions into parse-scripts? 
In the code below we don't have any separate blist or clist tokens. * parsing repetitions within a grammar rule ---------- # parse the ebnf rule # rule := ['a'] {'b'} {'c'} ['d'] ';' . # equivalent regular expression: /a?b*c*d?;/ begin { add "0/rule*"; push; } read; [:space:] { clear; } "a","b","c","d",";" { add "*"; push; .reparse } !"" { add " unrecognised character."; print; quit; } parse> # ------------ # 1 token pop; # ------------ # 2 tokens pop; E"rule*a*" { B"0" { clear; add "a/rule*"; push; .reparse } clear; add "misplaced 'a' \n"; print; quit; } E"rule*b*" { B"0",B"a",B"b" { clear; add "b/rule*"; push; .reparse } unstack; add " << parse stack.\n"; add "misplaced 'b' \n"; print; quit; } E"rule*c*" { B"0",B"a",B"b",B"c" { clear; add "c/rule*"; push; .reparse } clear; add "misplaced 'c' \n"; unstack; add " << parse stack.\n"; print; quit; } E"rule*d*" { B"0",B"a",B"b",B"c" { clear; add "d/rule*"; push; .reparse } clear; add "misplaced 'd' \n"; print; quit; } E"rule*;*" { clear; add "rule*"; push; } push; push; (eof) { pop; "rule*" { add " its a rule!"; print; quit; } unstack; add " << parse stack.\n"; print; quit; } ,,,, PL ZERO .... Pl/0 is a minimalistic language created by Niklaus Wirth for teaching compiler construction. In this section, I will explore converting the pl/0 grammar into a form that can be used by the parsing machine and language. I will do this in stages, first parsing "expressions" and then "conditions" etc. Wirth's grammar is designed to be used in a recursive descent parser/compiler, so it will be interesting to see if it can be adapted for the machine, which is essentially an LR shift-reduce parser/compiler. The grammar below seems to be adequate for LR parsing, except for the expression tokens (expression, term, factor etc). Apart from expression we should be able to factor out the various [] {} and () constructs and create a machine parse script. * the pl/0 grammar in wsn/ebnf form ------- program = block "." . block = [ "const" ident "=" number {"," ident "=" number} ";"] [ "var" ident {"," ident} ";"] { "procedure" ident ";" block ";" } statement . statement = [ ident ":=" expression | "call" ident | "?" ident | "!" expression | "begin" statement {";" statement } "end" | "if" condition "then" statement | "while" condition "do" statement ]. condition = "odd" expression | expression ("="|"#"|"<"|"<="|">"|">=") expression . expression = [ "+"|"-"] term { ("+"|"-") term}. term = factor {("*"|"/") factor}. factor = ident | number | "(" expression ")". ,,, PROGRAMS AND BLOCKS IN PLZERO .... I will try to keep Wirth's grammatical structure, but factor the 2 rules and introduce new parse tokens for readability, such as "constdec*" for a constant declaration, and "vardec*" for a variable declaration. I will also try to keep Wirth's rule names for reference. * the program and block grammar rules -------- program = block "." . block = [ "const" ident "=" number {"," ident "=" number} ";"] [ "var" ident {"," ident} ";"] { "procedure" ident ";" block ";" } statement . ,,, Once this script is written we can check if it is parsing correctly, and then move on to variable declarations, procedure declarations and so forth (but not the forth language, which doesn't require a grammar). * a script just for pl/0 constant and variable declarations --------- # a constant declaration rule in Wirth's ebnf grammar # constdec = "const" ident "=" number {"," ident "=" number} ";" . # the script below factors out the {} repetition construct.
read; [:alpha:] { while [:alpha:]; # keywords in pl/0 "const","var","if","then","while","do","begin","end" { put; add "*"; push; .reparse } put; clear; add "ident*"; push; .reparse } [0-9] { while [0-9]; put; clear; add "number*"; push; .reparse } # literal tokens ",","=",";" { add "*"; push; } parse> # or parse this like varset/vardec "ident*=*number*" { clear; add "equalityset*"; push; .reparse } "equalityset*,*ident*=*number*" { clear; add "equalityset*"; push; .reparse } "const*equalityset*;*" { clear; add "constdec*"; push; .reparse } # to actually compile we might have to split this rule "var*ident*","varset*,*ident*" { clear; add "varset*"; push; .reparse } "varset*;*" { clear; add "vardec*"; push; .reparse } { pop; pop; "varset*","vardec*" { add " declaration \n"; print; clear; quit; } } ,,, PLZERO EXPRESSIONS .... * wsn/ebnf grammar for pl/0 expressions --------- expression = [ "+"|"-"] term { ("+"|"-") term}. term = factor {("*"|"/") factor}. factor = ident | number | "(" expression ")". ,,, The parse-machine cannot directly implement repetitions {} optionals [] or groupings () so the expression grammar needs to be "factored" so that only alternations remain. * a factored grammar --------- factor = ident | number | "(" expression ")". mulfactorset = "*" factor | "/" factor. mulfactorset = mulfactorset mulfactorset. term = factor | factor mulfactorset. addtermset = "+" term | "-" term. addtermset = addtermset addtermset. expression = term | term addtermset expression = "+" term | "-" term. expression = "+" term addtermset | "-" term addtermset. ,,, But this grammar has a problem for the parse-machine!! For example "term" will reduce immediately to "expression" without looking for extra factors. So this grammar is probably not suitable for an LR parser.... There are a number of extra rules to compensate for the lack of ebnf constructs. We also have to split rules that have a different number of parse tokens in each alternation. (eg factor) TRIGGER RULES .... ',' orset '{' ::= ',' test '{' ; NEGATED CLASSES .... Often when creating a language or data format, we want to be able to negate operators or tests (so that the test has the opposite effect to what it normally would). In the parse script language I use the prefixed "!" character as the negation operator. We can create a whole series of negated "classes" or tokens in the following way. So the "negation" logic is actually stored on the stack, not on the tape. This is useful because we can compile all these negated tokens in a similar way (see the new "quoteset*" implementation. * example of creating a set of negated classes --------- "!*quote*","!*class*","!*begintext*", "!*endtext*" { # save the second token (quote/class/begintext) on the # tape (but not the current tape cell which has other stuff in # it push; ++; put; --; clear; pop; clear; add "not"; # extract the saved token name from the tape ++; ++; get; --; --; # now we have a token "notquote*" / "notclass*" / "notbegintext* push; get; --; put; ++; clear; .reparse } ,,, LOOKAHEAD AND REVERSE REDUCTIONS .... Quotesets have been replaced in the current (aug 2019) implementation of compile.pss with 'ortestset' and 'andtestset'. But the compilation techniques are similar to those shown below. The current implementation of quotesets in compile.pss seems quite interesting. It waits until the stack contains a brace token "{*" until it starts reducing the quoteset list. 
* a quoteset >> 'a','b','c','d' { nop; } so the compile.pss script actually parses "'c','d' { " first, and then resolves the other quotes ('a','b'). This is good because the script can work out the jump-target for the forward true jump (the accumulator is used to keep track of the forward true jump). It also uses the brace as a lookahead, and then just pushes it back on the stack, to be used later when parsing the whole brace block. * bnf rules for parsing quotesets >> quoteset '{' := quote ',' quote '{' ; >> quoteset '{' := quote ',' quoteset '{' ; But this has 2 elements on the left-hand side. This works, but is not considered good grammar (?). QUOTESET REFINEMENTS .... quoteset should be renamed to something else, eg "testset*", because it will contain class tests, begin-tests, end-tests etc. Also we will parse it like this * some rules for parsing/compiling quotesets/testsets tokens -------- quoteset '{' := quote '{' ; quoteset '{' := begintest '{' ; quoteset '{' := endtest '{' ; .... (more test types) ... quoteset '{' := quote ',' quoteset '{' ; quoteset '{' := begintest ',' quoteset '{' ; quoteset '{' := endtest ',' quoteset '{' ; .... (more test types) ... ,,, We will also be able to get rid of various parse rules that are currently in the script compile.pss. The consequence of this parsing technique is that we will be able to include any type of test in a "testset*" sequence ... These multiple testset sequences may be used as a poor-woman's regular expression pattern matcher. * test if the workspace begins with 'http://', ends with '.txt' and * doesn't contain the letter 'x' ------- B"http://", E".txt", ![x] { add " (found text file link!) \n"; print; clear; } ,,,, It doesn't really make sense to combine a text-equals test with any other test, but the other combinations are useful. RABBIT HOPS .... The "quoteset" token syntax parses a string such as >> 'a','b','c','d' { nop; } The comma is the equivalent of the alternation operator (|) in bnf syntax. In my first attempt to parse "quoteset" tokens with the "compile.pss" script compiler, I used a rabbit hop technique. From an efficiency and compiled code size point-of-view this is a very bad way to compile the code, but it seems that it may be useful in other situations. It provides a way to generate functional code when the final jump-target is not known at "shift-reduce" time. * create rabbit hops for the true jump ------ "quote*,*quote*" { clear; add "testis "; get; # just jump over the next test add "\njumptrue 3 \n"; ++; ++; add "testis "; get; add "\n"; # add the next jumptrue when the next quote is found --; --; put; clear; add "quoteset*"; push; # always reparse/compile .reparse } # quoteset ::= quoteset , quote ; "quoteset*,*quote*" { clear; get; ++; ++; add "jumptrue 4 \n "; add "jumptrue 3 \n "; add "testis "; get; add "\n"; # add the next jumptrue when/if the next quote is found --; --; put; clear; add "quoteset*"; push; # always reparse/compile .reparse } ,,,, YACC AND LEX COMPARISON The tools "yacc" and "lex", and the very numerous clones, rewrites and implementations of those tools, are very popular in the implementation of parsers and compilers. This section discusses some of the important differences between the parse-machine and language and those tools. The pp language and machine is (deliberately) a much more limited system than a lex/yacc combination. A lex/yacc-type system often produces c-code or some other language code which is then compiled or run to implement the parser/compiler.
The pp system, on the other hand, is a "text-filter"; it simply transforms one text format into another. For this reason, it cannot perform the complex programmatic "actions" that a yacc-tool can achieve. ADVANTAGES COMPARED TO LEX YACC SYSTEM .... While clearly more limited than a lex-yacc style system, in my opinion the current machine has some advantages: * It is much simpler and should be much easier to understand. * It does not make use of shift-reduce tables. * It should be possible to implement it on much smaller systems. * Because it is a text-filter, it should be more accessible for "playing around" or experimentation. Perhaps it lacks the psychological barrier that a lex-yacc system has for a non-specialist programmer. * It may be simpler and faster to port the system to new platforms/languages. All that is required is a parse-machine "object" in the target language and a script similar to "compilable.c.pss". NAMING OF SYSTEM I don't have a good name for this virtual machine and script language yet. The executable is called 'pp' for pattern-parse, and the source file is called gh.c for no particular reason. The source files are split into .c files where each one corresponds to a particular "object" (data structure) within the machine (eg tapecell, tape, buffer (with stack and workspace)). The name should be short, easy to type, and hopefully positive and fun sounding... HISTORY OF IDEA The file object/gh.c contains detailed information about the development of this idea. EVOLUTION OF THE MACHINE AND LANGUAGE * I initially called this the "chomski" machine and language but decided to change it to something more "unixy". * The a+ and a- commands were initially called "plus" and "minus". * The "ll" and "cc" (line number and character number) registers and commands are recent, but very important, additions because they allow script error messages to pinpoint the line and character number of the error. IMPORTANT FILES EXAMPLE SCRIPTS .... The folder "eg/" contains a limited collection of example scripts to show the capabilities of the language and machine. COMPILE PSS .... This is the main script compiler. It has replaced the handcoded asm.pp file because it is easier to write and maintain. ASM PP .... This file is important because it actually implements the parsing vm scripting language. It is a text file which consists of a series of "instructions" or commands for the parsing virtual machine. These instructions include instructions which alter the registers of the virtual machine; tests, which set the flag register of the machine to true if the test returns true, or else false; and conditional and unconditional jumps which change the instruction pointer for the machine if the flag register is true. Asm.pp also contains labels (lines ending in :). These labels make it much easier to write code containing jumps (a label can be used instead of an instruction number). Because of the similarity of this format to many "assembly" languages, I refer to this as assembly language for the parse virtual machine. LEXICAL PHASE .... Like most systems which are designed to parse and compile context-free languages, Asm.pp has 2 distinct phases: a "lexing" phase and a parse/compile phase. This is shown by the separation of the unix tools "lex" and "yacc", where lex performs the lexical analysis (which consists of the recognition and creation of lexical tokens) and yacc performs parsing and compilation of tokens.
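As a small hedged sketch of those same two phases in a single parse script (the tokens and grammar here are invented and this is not the actual asm.pp code): the statements before the parse> label lex characters into tokens, and the statements after it shift-reduce those tokens.

* lexing and parsing combined in one script (sketch only)
----
# lexing phase: turn raw characters into tokens
read;
[:space:] { clear; }
[0-9] { while [0-9]; put; clear; add "number*"; push; .reparse }
"+" { put; clear; add "plus*"; push; .reparse }

parse>
# parsing phase: shift-reduce the tokens on the stack
pop; pop; pop;
"number*plus*number*" { clear; add "number*"; push; .reparse }
push; push; push;
,,,

A real translator would also combine the attribute values held on the tape (using get, add and put) before pushing the new number* token; that step is left out here to keep the sketch short.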
LEXICAL PHASE ....

Like most systems which are designed to parse and compile context-free
languages, asm.pp has 2 distinct phases: a "lexing" phase and a
parse/compile phase. This mirrors the separation of the unix tools "lex" and
"yacc", where lex performs the lexical analysis (the recognition and
creation of lexical tokens) and yacc performs the parsing and compilation of
those tokens.

In the asm.pp program, as with many scripts that can be written in the parse
language, the lexical and compilation phases are combined in the same
script. In the first phase, the program performs lexical analysis of the
input (which is an uncompiled script in the parse script language) and
converts certain patterns into "tokens". In this system, a "token" is just a
string (text) terminated with an asterisk (*) character. Asm.pp constructs
these tokens by using the "add" machine instruction, which appends text to
the workspace buffer. Next, the parse token is "pushed" onto the "stack". I
have quoted these words because the stack buffer is implemented as a string
(char *) buffer, and "pushing" and "popping" the stack really just involves
moving the workspace pointer back and forth between asterisk characters. I
used this implementation because I thought that it would be fast and simple
to implement in c. It also means that we don't have to worry about how much
memory is allocated for the stack buffer or for each of its items: as long
as there is enough memory allocated for the workspace buffer (which is
actually just the end of the stack buffer), there will be enough room for
the stack.

The lexical phase of asm.pp also involves preserving the "attribute" of the
parse token. For example, if we have some quoted text such as 'hannah', then
our parse token may be "quoted.text*" and the attribute is the actual text
between the quotes, 'hannah'. The attribute is preserved on the machine's
tape data structure, which is an array of string cells (with no fixed size)
in which data can be inserted at any point by using the tape pointer.

PARSING PHASE ....

The parsing phase of the asm.pp compiler involves recognising and
shift-reducing the token sequences that are on the machine "stack". These
tokens are just strings post-delimited with the '*' character. Because the
tokens are text, they are popped onto the workspace buffer and then
manipulated using the workspace text commands.
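Because both the stack and the tokens are plain text, a shift-reduce step
can be pictured as nothing more than string surgery on the end of the stack
buffer. The following c fragment is only a sketch of that idea (the function
names are invented here, and this is not the code in gh.c): it pushes three
tokens and then reduces "quote*,*quote*" at the top of the stack to
"quoteset*", as in the rabbit-hop rules above.

* pushing tokens and shift-reducing on a string stack (illustrative c)
---------
  #include <stdio.h>
  #include <string.h>

  /* the stack is one char buffer of '*'-terminated tokens,
     eg "quote*,*quote*"                                              */
  static void push_token(char *stack, const char *token) {
      strcat(stack, token);
      strcat(stack, "*");
  }

  /* if the stack ends with 'pattern', chop it off and push 'replacement' */
  static int reduce(char *stack, const char *pattern, const char *replacement) {
      size_t sl = strlen(stack), pl = strlen(pattern);
      if (sl < pl || strcmp(stack + sl - pl, pattern) != 0)
          return 0;                  /* no match at the top of the stack */
      stack[sl - pl] = '\0';         /* "pop" the matched tokens         */
      push_token(stack, replacement);
      return 1;
  }

  int main(void) {
      char stack[256] = "";
      push_token(stack, "quote");    /* stack is now "quote*"            */
      push_token(stack, ",");        /* stack is now "quote*,*"          */
      push_token(stack, "quote");    /* stack is now "quote*,*quote*"    */
      reduce(stack, "quote*,*quote*", "quoteset");
      printf("%s\n", stack);         /* prints "quoteset*"               */
      return 0;
  }
,,,

In the real machine the workspace sits at the end of this same buffer, as
described above, and the script-level push, pop and test commands do the
pointer movement for you.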
TEST.COMMANDS.PSS ....

The file test.commands.pss has a set of examples of correct and incorrect
syntax, as determined by the asm.pp compiler. It also contains comments
about the use of the syntax elements.

VIM AND PP

Since I usually edit with vim (I don't have a copy of "sam" or "acme"), here
are some techniques for using vim with the pp tool. The vim mappings and
commands below are useful for checking that pp "one-liners", and pp scripts
and script fragments that are within a text document, actually compile and
run. Perhaps this is a clumsy way of achieving "literate programming"
(Knuth).

The first mapping below can only run the script with a static input "abc",
which is not very useful, but it at least tests whether the script compiles
properly. The compiled script will be saved in "sav.pp". The multiline
snippet is embedded in "---" and ",,," tags, each on a line of its own.

* create a vim mapping to run a script embedded in a text document
 >> map ,pp :?^ *---?+1,/^ *,,,/-1w! test.pss \| !pp -f test.pss -i "abc"

* create a vim mapping to execute the current line as a pp "one-liner"
 >> map ,pl :.w !sed 's/^ *>>//' \| bash

* create a vim command to execute the current line as a pp "one-liner"
 >> command! Ppl .w !sed 's/^ *>>//' | bash

The mappings and commands above are for putting in the vimrc file. To create
them within the editor, prepend a ":" to each mapping or command, eg

 >> :command! Ppl .w !sed 's/^ *>>//' | bash

* create a vim command to run a script embedded in a text document,
* with the input provided as an argument to the vim command
 >> com! -nargs=1 Pp ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pp -f test.pss -i "<args>"

This is useful since scripts always need some input.

* the same as above, but interactive. NOT working!!
 >> com! -nargs=1 Ppi ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pp -If test.pss -i "<args>"

STATUS

25 august 2019

Great progress has been made. compile.pss has all sorts of nice new syntax,
such as negated text-equals tests, eg !"abc" { ... }. Almost all tests can
be negated. There is now an AND concatenation operator (.). Begin blocks and
begin-tests in quotesets are in progress. compile.pss has replaced
asm.handcode.pp for compiling scripts.

HISTORY

See object/gh.c for detailed development history.

2019

For a number of years I have been working on a project to write a virtual
machine for pattern parsing. The source code is located at
~/sf/htdocs/books/pars/gh.c or https://bumble.sf.net/books/pars/gh.c

This virtual machine can be (and is) used to implement a script language for
parsing and compiling some context-free languages. (The implementation is in
'asm.pp'.) The project is now at a stage where useful scripts can be written
in the parse script language.

The purpose of the virtual machine is to be able to parse and transform
patterns which cannot normally be dealt with through "regular expressions",
i.e. patterns which are not "regular languages". Possibly the simplest
example of one of these would be palindromes (eg "aba", "hannah", "anna").
The machine also allows a script language to describe patterns and
transformations, and this language has similarities to sed and to awk. In
fact the whole idea was inspired by sed and its limitations for context-free
languages.

DESIGN PHILOSOPHY FOR THE MACHINE

When designing the parse machine, I wanted to make its capabilities as
limited as possible while still being able to properly parse and translate
context-free languages. Related to this idea was the aim to make the machine
implementable in the smallest possible way. Also, I deliberately excluded
the use of regular expressions, so that the script writer would not be
tempted to try to "parse" context-free patterns with them. The general
design of the syntax and command-line usage is inspired by various unix
tools, such as sed/grep/awk/perl.

IDEAS

Palindromes are also interesting because they can be parsed with the
simplest possible recursive descent parser.

* pseudo code
---------
  parsePalindrome (text) {
    if text.length <= 1 { return true }
    n = text.firstCharacter
    m = text.lastCharacter
    if n <> m { return false }
    newText = text - first and last characters
    return parsePalindrome(newText)
  }
,,,

But the "pp" machine does not use recursive descent parsing. In fact, the gh
machine was written because recursive descent parsing seemed aesthetically
unpleasing. The machine is still not finished, but it is getting there.

The actual virtual machine has been written in c, along with an interpreting
loop that allows the machine to be inspected and tested, as well as programs
to be loaded and saved. The machine uses a type of "assembly language" for
saving machine instructions into a text file. There are more comprehensive
comments in the c source code file gh.c.

DOCUMENT HISTORY

4 September 2019
  Adding some ideas about parsing optional elements.

23 August 2019
  Trying to organise this document.