&& The Parsing Virtual Machine and Script Language

OVERVIEW

This booklet is about the pattern-parsing virtual machine and script language. The executable is currently called 'pp' (August 2019) (in the folder /books/pars).

The virtual machine and language allow simple "LR" bottom-up shift-reduce parsers to be implemented in a limited script language with a syntax which is very similar to the "sed" unix stream editor. As far as I am aware, this is a new approach to parsing context-free languages, and judging by the scripts and tests written so far it has great potential. The script language is deliberately limited in a number of ways (it does not have regular expressions, for example), but it seems to be an interesting tool for learning about compiler techniques.

DOWNLOAD

A ".tar.gz" archive file of the c source code can be downloaded from the https://sourceforge.net/projects/bumble/ folder.

COMPILING SOURCE CODE

At some point I will package this using autotools and automake to allow simple installation. The main folder, source file and booklet are called pars/, gh.c and pars-book.txt

* obtain, extract and compile the "pp" source code
-------
  # extract the tar ball
  tar xvzf gh.tar.gz
  cd pars/object
  # compile the pp tool using the object files in the "object" folder.
  gcc -o ../pp \
    gh.c tapecell.c tape.c buffer.c colours.c \
    charclass.c command.c parameter.c instruction.c \
    labeltable.c program.c machine.c exitcode.c \
    machine.interp.c
  cd ..
  # create a symlink if you like into the bin folder.
  ln -s pp /usr/local/bin/pp
,,,

At the moment (Aug 2019) the pp executable looks for the file "asm.pp" in the same folder where it is run. This obviously needs to be changed so that pp looks for asm.pp in some standard folder. asm.pp is necessary to compile and run scripts.

* rebuild the libmachine.a library file
-----
  cd gh/object
  # rebuild all object files
  gcc -c *.c
  # not all of these files are really necessary.
  ar rcs libmachine.a buffer.o charclass.o colours.o command.o \
    exitcode.o instruction.o labeltable.o machine.o machine.interp.o \
    machine.methods.o parameter.o program.o tape.o tapecell.o
,,,,

The libmachine.a file can be used when compiling executables from c source code generated by the script "compilable.c.pss"

DOCUMENTATION

This file (pars-book.txt) will become the principal documentation for the pattern-parser machine and language. There is also a lot of documentation in the "gh.c" file (which implements the virtual machine) and some in the "asm.pp" assembler file, which converts scripts into assembler listings which can then be loaded by the machine into its program.

COMMAND LINE USAGE

I will create two separate tools: one for interactively running, debugging and viewing scripts, and another for interpreting scripts. But at the moment (August 2019) there is only one tool, called "pp" (source code gh.c).

* specify script and input on the command line
>> pp -e "read; print; print; clear;" -i "abcXYZ"
(This prints "aabbccXXYYZZ")

* load a script from a file and start an interactive session to view/debug
>> pp -If scriptfile somefile.txt

* load an "assembly" file into the machine's program and view/debug
>> pp -Ia asmfile somefile.pss

* convert a script to compilable c code
>> pp -f compile.ccode.pss script

ONE LINE EXAMPLES

* double space a text file
>> pp -e "r;'\n'{a'\n';}t;d;" someFile
>> pp -e "read; '\n' { add '\n'; } print; clear;" someFile ONE LINE SCRIPTS The pp tool may have useful applications in unix "one-liners" but its main power is in the implementation of simple context free languages and compilers. * convert all spaces to dots >> pp -e "read; [:space:] {d;a'.';t;d;}t;d; " somefile.txt >> pp -e "read; [:space:] {clear; add '.';print;clear;}print;clear; " \ somefile.txt * double every instance of vowels >> read; [aeiou] { put; get; } print;clear; * Only print text within double quotes: >> read; '"' { until '"'; print; } clear; * removes multiple consecutive instances of the character "a": >> read; "a" { while "a"; clear; } print;clear; * Number each line of the input: >> read; "\n" { ll; add " "; } print; clear; >> read; "\n" { ll; a " "; } t; d; # the same, with short commands * Count the number of lines in the input stream >> read; "\n" { a+; } clear; (eof) { add "lines: "; count; print; } * Delete leading whitespace (spaces, tabs) from end of each line: >> read; "\n" { while '- '; clear; newline; } print; * Delete trailing whitespace (spaces, tabs) from end of each line: >> r; [:space:]{ w[:space:];G;d;r;!"\n"{clear;get;} print; clear; * insert 5 blank spaces at beginning of each line (make page offset): >> r; "\n" { add ' '; } print; clear; * print the first ten lines of the input: >> r; "\n" { a+; clear;count; "10"{ quit;} clear;add "\n"; } print; clear; * Strip out c comments from a file: >> r; "/" { r; "/*"{ until '*/'; d; } "//" { until '\n'; d; } print;clear; * Delete trailing whitespace (spaces, tabs) from end of each line: ---------- # not tested, read; [:space:] { while [:space:]; put; clear; read; !"\n" { clear; get; add "\n"; } print; clear; } ,,, * insert 5 blank spaces at beginning of each line (make page offset): >> "\n" { add " "; } print; clear; * print the first ten lines of the input: >> read; "\n" { a+; d; count; "10" {quit;} d; add "\n"; } print; d; * Strip out c comments from a file: >> read; "/" { read; "/*"{ until '*/'; clear; }} print;clear; - ASSEMBLY FILES The implementation of the language uses an intermediary "assembly" phase when loading scripts. asm.pp is responsible for converting the script into an assembly (text) format. asm.pp is itself an assembly file. These files consist of "instructions" on the virtual machine, along with "parameters", jumps, tests and labels, which make writing assembly files much easier (instead of having to use line numbers). So the asm.pp file actually implements the script language The proof is in the pudding: this shows that the parse-machine is capable of implementing "languages" (or at least simple ones). It should not normally be necessary to write any assembler code, since the script language is much more readable. However, it is useful for debugging scripts to view the assembly listing as it is loaded into the machine (the -I switch allows this). PARSING WITH THE SCRIPT LANGUAGE A simple example: * parse a sentence in a very simple natural language -------------- "article*noun*" { clear; add "nounphrase*"; push; } "nounphrase*verb*object*" { clear; add "sentence*"; push; } ,,, The example above corresponds to a backus-naur grammar rules: nounphrase <- article noun sentence <- noun verb object If we only want to write a "recogniser" for a particular context-free language (that is: a script that determines if a particular input is a legal element of that language) then the previous script style is sufficient. But usual, we will be trying to transform an input into another format. 
TRANSFORMATIONS AND COMPILATIONS

The parsing virtual machine and language is designed to transform one textual (data) format into another, or to compile one context-free language into another. Examples might be:

* transform a csv (comma-separated-values) file into a json data format.
* convert from "infix" arithmetic notation to "postfix" notation.
* compile a simple computer language into an assembly language.
* properly indent a computer language source code file.

DEBUGGING

The parse virtual machine is more complicated than the sed virtual machine (sed really only has 2 string registers, the "work-space" and the "hold-space"), so you may find yourself needing to debug a script.

* load a script and view/execute/step through it interactively
>> pp -If someScript input.txt

* interactively view how a script is being compiled by "asm.pp"
>> pp -Ia asm.pp someScript

(Now you can step through the compiled program "asm.pp" and watch as it parses and compiles "someScript". Generally, use "rr" to run the whole script, and "rrw text" to run the script until the workspace is some particular text. This helps to narrow down where the asm.pp compiler is not parsing the input script correctly.)

Once in an interactive session, there are many commands to run and debug a script. For example:

 n    - execute the next instruction in the program
 m    - view the state of the machine (stack/workspace/registers/tape/program)
 rrw  - run the script until the workspace is exactly some text
 rre  - run the script until the workspace ends with some text
 rr   - run the whole script from the current instruction
 M.r  - reset the virtual machine and input stream (but not the compiled program)

COMMON BUGS
....

* make sure that you are pushing as many times as there are tokens
>> add "noun*verb*noun*"; push; push;   # << error, 3 tokens, 2 pushes

* make sure there is at least one read command in the script
>> "."{ clear; } print; clear;          # << error: no read in the script

SCRIPT EXAMPLES

* print alphanumeric words in a file, one per line
----------------------
 read;
 [:alpha:] { while [:alpha:]; add "\n"; print; clear; a+; }
 clear;
,,,

* print the number of alphabetical words in a file
---------------------
 read;
 [:alpha:] { while [:alpha:]; clear; a+; }
 clear;
 (eof) { add "words in file: "; count; add "\n"; print; clear; }
,,,

* convert c comments (// and /* ... */) to bash comments
-----------
 read;
 "/" {
   read;
   "//" { clear; add "# "; until "\n"; print; clear; }
   "/*" {
     clear; add "# "; until "*/"; clip; clip;
     replace "\n" "\n#"; print; clear;
   }
 }
 print; clear;
,,,

MACHINE DESCRIPTION

The parsing virtual machine consists of a number of parts, which I will describe here.

WORKSPACE BUFFER
....

The workspace buffer is the heart of the virtual machine and is analogous to an "accumulator" register in a cpu chip. All incoming and outgoing data is processed through the workspace. All machine instructions either affect or are affected by the state of the workspace.

STACK
....

* machine instructions relating to the stack
>> push, pop

The stack (which is implemented as a single string buffer and is actually the prefix of the workspace buffer) is used to store and access the parse tokens that are constructed by the virtual machine while parsing the input.

** parse tokens

Within the virtual machine of the parse script language, the "stack" structure is designed to hold the "parse tokens". A parse token can be any string ended by the delimiter character, eg:

>> add "set*";
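As a small illustration of how tokens move between the workspace and the stack, the fragment below is a sketch (not tested) and assumes the default "*" token delimiter:

------
 # two parse tokens in the workspace
 add "noun*verb*";
 # two tokens, so two pushes (see COMMON BUGS above)
 push; push;
 # pop prepends: the workspace becomes "verb*", then "noun*verb*"
 pop; pop;
,,,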
DELIMITER REGISTER
....

The delimiter register determines what character will be used to delimit parse tokens on the stack when using the "push" and "pop" commands. It can be set with the "delim" command.

* set the parse token delimiter to '/'
>> begin { delim '/'; }

The delimiter should be set in a 'begin' block. I normally use the "*" character, but "/" or ";" might be good options.

FLAG REGISTER
....

* machine instructions relating to the flag register
>> testis, testbegins, testends, testtape, testeof
>> jumptrue, jumpfalse

The flag register affects the operation of the conditional jump instructions "jumpfalse" and "jumptrue", and it is affected by the test instructions such as "testis", "testtape", "testeof" etc. It is analogous to a "flags" register in a cpu, but (currently) it only contains one boolean (true/false) value.

ACCUMULATOR REGISTER
....

The machine contains one integer accumulator register that can be incremented (with "a+"), decremented (with "a-") and set to zero (with "zero"). This register is useful for counting occurrences of miscellaneous elements that occur during parsing and translating. An example of the use of the accumulator register is given during the parsing of "quotesets" in the "compile.pss" script; the accumulator in that case keeps track of the target for true-jumps.

* count how many "x" characters occur in the input stream
>> pp -e "r; 'x' {a+;} print;d; (eof) { add '='; count; print; }" -i "axax"

PEEP REGISTER
....

The peep buffer is a single-character buffer which stores the next character in the input stream. When a 'read' command is performed, the current value of the peep buffer is appended to the 'workspace' buffer and the next character from the input stream is placed into the peep buffer.

TAPE
....

* machine instructions with an effect on the tape:
>> get, put, ++, --, pop, push

The tape is an array of string cells (with memory allocated dynamically), each of which can be read or written using the workspace buffer. The tape also includes a tape cell pointer; the cell it points to is usually called the "current cell".

CURRENT CELL OF TAPE ARRAY
....

* instructions affecting the current cell of the tape
>> ++, --, push, pop

The current cell of the tape is a very important mechanism for storing attributes of parse tokens, and then manipulating those attributes. The parse virtual machine has the ability to "compile" one text format into another. (I call this compilation because the target text format may be an assembly language - for either a real or a virtual machine.)

In the parsing phase of a script, the attributes of different parse tokens are accessed from the tape and combined and manipulated in the workspace buffer. This means that a script which is transforming some input stream will finish with the entire transformed input in the 1st cell of the tape structure. The script can then print that cell to stdout (with "get; print;", which allows further processing by other tools in the pipe chain) or else write the contents of the cell to the file 'sav.pp' (with "get; write;").

The current cell is also affected by the stack commands "pop" and "push". The push command increments the tape pointer (current cell) by one, and the pop command decrements the tape pointer by one. This is a subtle and powerful mechanism that allows the tape pointer to stay in sync with the stack. After a "push" or "pop" command the tape pointer will be pointing at the correct tape cell for that item on the stack.
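The following sketch (not tested) shows the put-before-push pattern that this mechanism supports: the text of a number is stored in the current tape cell, a token is pushed, and because a later pop moves the tape pointer back down, a get then retrieves the stored digits. The [:digit:] class and the "number*" token name are only illustrative choices.

------
 read;
 [:digit:] {
   # accumulate the digits in the workspace
   while [:digit:];
   # store the digits in the current tape cell
   put;
   clear; add "number*";
   # push the token; the tape pointer moves up in step
   push;
 }
 clear;
 (eof) {
   # pop moves the tape pointer back down to the token's cell ...
   pop;
   # ... so get retrieves the stored digits
   "number*" { clear; get; add "\n"; print; }
 }
,,,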
In some cases, it may be easiest to see how these mechanisms work by running the machine in interactive mode (pp -If script input) and stepping through a script or executing commands at the prompt.

The tape pointer is also incremented and decremented by the ++ and -- commands. These commands are mainly used during the compilation phase to access and combine attributes, in order to transform the input into the desired output.

SYNTAX OF THE SCRIPT LANGUAGE

The script language, which is implemented in the file books/pars/asm.pp, looks somewhat similar to sed and awk. Unlike sed, it also allows long names for commands (eg "clear" instead of "d", "add" instead of "a"). Each command has a long and a short form.

LANGUAGE FEATURES
....

The script language is implemented in the file asm.pp. Some commands, such as "reparse" and "restart" (to be implemented), are not instructions on the virtual machine; they are commands that are parsed in asm.pp.

* tests on the workspace buffer, followed by a block of commands
>> [a-z] { print; clear; }

Most scripts will start with "read;" or "r;" (the abbreviated equivalent). This reads one character from the input stream. Whereas sed and awk are line oriented, pp is character oriented.

Scripts also have an implicit loop. When the interpreter reaches the end of the script, it jumps back to the first command (usually "read") and continues looping until the input stream is finished. This is similar to the behaviour of sed.

CHARACTER CLASSES
....

Character classes are written [:space:], [:alnum:], etc. Currently (August 2019) the pp machine and language uses plain ctype.h character classes as a way of grouping characters. These classes are important in a unicode setting because they allow specifying types of characters in a locale-neutral way. The "while" command uses character classes, as do tests. I may allow the script language to define new character classes.

* check if the workspace is only alphanumeric characters
>> [:alnum:] { nop; }

* read the input stream while the peep register is whitespace
>> while [:space:];

** Not regular expressions

QUOTES
....

Both single and double quotes may be used in scripts.

>> '"' { add "<< single quote!\n"; print; }

If using the negation operator ! in an "in-line" script, then enclose the whole thing in single quotes.

* use single quotes in one-liners to avoid special character problems
>> pp -f compile.pss -i '![a-z],"a","b"{nop;}'

The until command already understands escaped characters, so it will read past an escaped delimiter (eg \").

COMMENTS
....

Both single-line comments (a line starting with "#") and multiline comments (hash+asterisk .... asterisk+hash) are available. Multiline comments are useful for disabling blocks of code during script development.

BEGIN BLOCKS
....

Like awk, the script language can have a 'begin' block. These blocks are executed only once, whereas the rest of the script is executed in a loop for every character in the input stream.

* an example begin block and script
-----
 begin {
   add "starting script ...\n"; print; clear;
   # set the token delimiter to /
   delim "/";
 }
 read; print; clear;
,,,

CONDITIONS AND TESTS
....

CLASS TESTS
....

The class test checks whether the @workspace buffer matches any one of the characters or character classes listed between the square brackets. A class test is written

>> [characters] { ... }
For example, the following script will delete all vowels from the input text

>> [aeiou] { clear; } ![aeiou] { print; clear; }

The script above also demonstrates that class tests can be negated with a prefixed "!" character.

If the workspace is exactly any one of the characters contained within the square brackets, then the test returns true and the instructions contained within the curly braces are executed; if not, the block is skipped.

EOF END OF STREAM TEST
....

* formats possible
>> (eof) (EOF)

This test returns true if the 'peep' look-ahead register currently contains the end-of-stream marker for the input stream. The test is written

>> (eof) { ... }

This test can be combined with other tests, either with AND logic (.) or with OR logic (,).

* test if the workspace is "horse" when the end of stream is reached
>> (eof)."horse" { add ' and cart'; print; }

* test if the workspace is "horse" OR the end of stream has been reached
>> (eof),"horse" { ... }

This test is important for checking whether the script has successfully parsed the input stream when the end of the stream is reached.

TAPE TEST
....

>> (==) { ... }

BEGINS WITH TEST
....

 workspace begins with:            B"abc" { ... }
 workspace ends with:              E"abc" { ... }
 all chars in workspace in class:  [:space:] { ... }

** Concatenating tests

Conditional tests can be chained together with OR (,) or AND (.) logic.

* test if the workspace starts with "ba" OR "ab"
>> B"ab",B"ba" { ... }

* test if the workspace starts with 'a' AND ends with 'z'
>> B"a".E"z" { }

* check if the workspace is a url but not a text file
>> B"http://" . !E".txt" { add "<< url"; print; clear; }

** Negated tests

 !B"abc", !E"xyz", ![a-z], !"abc" ...

MATERIAL FROM THE "PARSE-BOOK.TXT"

A book about the parse system. This material was written for an earlier version of the parse virtual machine (as of August 2019).

COMMANDS

* all commands have single-letter variants, eg: p (pop), P (push), etc.

COMMAND SUMMARY
....

All commands have an abbreviated (one letter) form as well.

* ++ increments the tape pointer by one (see 'increment')
* -- decrements the tape pointer by one (see 'decrement')
* mark "text" adds a text tag to the current tape cell
* go "text" sets the current tape cell to the marked cell
* add "text" / add 'text' adds text to the end of the workspace
* check jumps back to the @@@ label. This is used to ensure that all shift reductions take place.
* clip removes the last character from the workspace
* count appends the counter to the end of the workspace
* crash exits the script without reading anything more from standard input; equivalent to the sed command 'q'
* clear sets the workspace to a zero-length string. Equivalent to the sed command s/^.*$//;
* put puts the contents of the workspace into the current cell of the tape (as indicated by the @tape-pointer)
* get gets the current cell of the tape and adds it to the -end- of the workspace with -no- separator character
* swap swaps the contents of the current tape cell and the workspace buffer
* a+ increments the accumulator variable/register in the virtual machine by one
* a- decrements the accumulator variable by one
* ll appends the line number register to the workspace buffer
* cc appends the character number register to the workspace buffer
* print prints the contents of the workspace buffer to the standard output stream (stdout). Equivalent to the sed command 'p'
* pop pops one token from the stack and adds it to the -beginning- of the workspace
* push pushes one token from the workspace onto the stack; it reads up to the first star "*" character in the @workspace buffer and pushes that section of the buffer onto the stack
* unstack pops the entire stack as a prefix onto the workspace buffer
* stack pushes the entire workspace (regardless of any token delimiters in the workspace) onto the stack
* replace replaces a string in the workspace with another string
* read reads one more character from stdin
* state prints the current state of the virtual machine to the standard output stream; may be useful for debugging. I may remove this command.
* until 'text' reads characters from stdin until the workspace ends with the given text (and, if a second text is given, does not end with that second text). This may be used for capturing quoted strings etc.
* while class/text; reads characters from the input stream while the @peep character -is- the given character (or is in the given character class)
* whilenot class/text; reads characters from the input stream while the 'peep' register is -not- the given character
* zero; sets the counter to zero

ADD
....

The "add" command adds a piece of text to the -end- of the 'workspace' buffer. No other register or buffer within the virtual machine is affected. The command is written

>> add 'text';
or
>> add "text";

The argument may span more than one line, so the following is also a valid script statement.

 add 'one line
 another line';

* add a space after every character of the input
>> add ' '; print; clear;
>> a ' ';t;d;

 /style*type*/ { clear; add "command*"; push; }

The script above does a shift-reduce operation while parsing some hypothetical language. The "add" command is used to add a new token name to the 'workspace' buffer, which is then pushed onto the stack (using the 'push' operation, naturally). In the script above, the added text can be seen to be a token name (as opposed to some other arbitrary piece of text) because it ends in the "*" character. Actually any character could be used to delimit the token names, but "*" is the default.

The add command takes -one- and only one parameter, or argument.

REPARSE COMMAND
....

This command makes the interpreter jump to the parse> label. The label must occur before the check command, not after. The reparse command takes no arguments; the label itself is written parse>

** Why use this command?

The reason for using this command lies in the need to apply multiple "shift reduce" operations during one cycle of a script.

CLIP COMMAND
....

This command removes one character from the end of the 'workspace' buffer and sends it into the void; it deletes it. The character is removed from the -end- of the workspace, so it is the last character which would have been added to the workspace by a previous 'read' operation.

The command is useful, for example, when parsing quoted text, used in conjunction with the 'until' command. The following script prints only the text which is contained within double quote characters in the input.

* parse and print quoted text
>> '"' { clear; until '"'; clip; print; }
>> "\"" { clear; until "\""; clip; print; }

If the above script receives the text 'this "big" and "small" things' as its input, then the output will be 'big small'. That is, only that which is within double quotes will be printed.
The script above can be translated into plain english as follows

 a- if the workspace is a " character then
 - clear the workspace
 - read the input stream until the workspace -ends- in "
 - remove the last character from the workspace (the ")
 - print the workspace to the console
 - end if

The script prints the contents of the quoted text without the quote characters, because the 'clear' and 'clip' commands got rid of them.

CLOP COMMAND
....

The "clop" command removes one character from the front of the workspace buffer. The clop command is the counterpart of the 'clip' command.

>> clop;

COUNT COMMAND
....

The count command adds the value of the counter variable to the end of the workspace buffer. For example, if the counter variable is 12 and the workspace contains the text 'line:', then after the count command the workspace will contain the text

>> line:12

The count command only affects the 'workspace' buffer in the virtual machine.

CRASH COMMAND
....

This command immediately exits the script without processing any more script commands or input stream characters. This command is similar to the sed command 'q'.

DECREMENT COMMAND
....

The decrement command reduces the tape pointer by one, and thereby moves the current 'tape' cell one to the 'left'. The decrement command is written "--" or "<". If the current tape cell is the first cell of the tape, then the tape pointer is not changed and the command has no effect.

** See also 'increment', 'put', 'get'

INCREMENT COMMAND
....

This command is simple but important. The command adds -one- to the 'tape'-pointer. The command is written

>> ++ or >

Notice that the only part of the machine state which changes is that the 'tape-pointer' (the arrow in the 'tape' structure) is incremented by one cell. This command allows the tape to be accessed as a 'tape'; this is tautological but true. Being able to increment and 'decrement' the tape pointer allows the script writer and the virtual machine to access any value on the tape using the 'get' and 'put' commands.

It should be remembered that the 'push' command also automatically increments the tape pointer, in order to keep the tape pointer and the stack in synchronization. Since the get command is the only way to retrieve values from the tape, the ++; and --; commands are necessary; the tape cannot be accessed using some kind of array index, because the get and 'put' commands do not take any arguments.

For example, the following script will print the string associated with the "value" token.

>> pop;pop; "field*value*" { ++; get; --; print; }

MARK COMMAND
....

The mark command adds a text "tag" to the current tape cell. This allows the tape cell to be accessed later in the script with the "go" command. The mark and go commands should allow "offside" or indent parsing (such as for the python language).

* use mark and go to use the 1st tape cell as a buffer
------
 begin { mark "topcell"; ++; }
 read;
 [:space:] { d; }
 whilenot [:space:]; put;
 # create a list of urls in the 1st tape cell
 B"http:" { mark "here"; go "topcell"; add " "; get; put; go "here"; }
 clear; add "word*"; push;
 { go "topcell"; get; print; quit; }
,,,,

GO COMMAND
....

The "go" command sets the current tape cell pointer to a cell which has previously been marked with a "mark" command (using a text tag). The go/mark commands may be useful for offside parsing, as well as for assembling, for example, a table of contents for a document while parsing the document's structure.

* basic usage
>> read; mark "a"; ++; go "a"; put; clear;
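As a further sketch of that table-of-contents idea (not tested), the script below collects every line that begins with a "#" character into a marked cell and prints the collection at the end of the input; like the url example above, the most recently found line ends up first. The choice of "#" as a heading marker is only for illustration.

------
 begin { mark "toc"; ++; }
 read; until "\n";
 # prepend any line starting with '#' to the marked "toc" cell
 B"#" { mark "here"; go "toc"; get; put; clear; go "here"; }
 clear;
 # at the end of the input, print the collected headings
 (eof) { go "toc"; get; print; quit; }
,,,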
MINUS COUNTER COMMAND
....

The minus command decreases the counter variable by one. This command takes no arguments and is written

>> a-;

The 'testeof' test (eof) checks to see whether the peep buffer contains the end-of-input-stream marker. The 'while' command reads from the input stream while the peep buffer is or is not some set of characters. For example

>> while 'abc';

reads the input stream while the peep buffer is any one of the characters 'abc'.

PLUS COUNTER COMMAND
....

This command increments the machine counter variable by one. It is written

>> a+

The plus command takes no arguments. Its counterpart is the 'minus' command.

POP COMMAND
....

The pop command pops the last item from the virtual machine stack and places its contents, without modification, at the -beginning- of the workspace buffer.

* pop the stack into the workspace
>> pop;

* check if a grammar rule is applicable
>> pop;pop; "word*colon*" { clear; add 'command*'; push; }

The pop command affects the 'workspace' buffer and the 'stack' structure, and decrements the 'tape-pointer' variable. The pop command is usually employed in the parsing phase of the script (not the lexing phase); that is, after the "parse>" label.

If the stream parser language is being used to parse and translate a language, then the script writer needs to ensure that each token ends with a delimiter character (by default "*") in order for the push and pop commands to work correctly.

The pop command also affects the "tape" of the virtual machine, in that the tape pointer is automatically decremented by one for each pop command. This is convenient because it means that the tape pointer will be pointing to the corresponding cell for the token. In other words, in the context of parsing and compiling a "formal language", the tape will be pointing to the "value" or "attribute" of the token which is currently in the workspace.

PRINT COMMAND
....

The print command prints the contents of the workspace to the standard output stream (the console, normally). This is analogous to the sed 'p' command (but its abbreviated form is 't', not 'p', because 'p' means "pop").

** Examples

* print each character in the input stream twice:
>> print; print; clear;

* replace all 'a' chars with 'A's:
>> "a" { clear; add "A"; } print; clear;

* only print text within double quotes:
>> '"' { until '"'; print; } clear;

** Details

The print command does not take any arguments or parameters. The print command is basically the way in which the parse language communicates with the outside world and the way in which it generates an output stream. The print command does not change the state of the parse virtual machine in any way.

Unlike sed, the parse language does not do any "default" printing. That is, if the print command is not explicitly specified, the script will not print anything and will silently exit as if it had no purpose. This should be compared with the default behaviour of sed, which will print each line of the input stream unless the script writer specifies otherwise (using the -n switch).

In practice the print command is not used as extensively as in sed. This is because, if an input stream is successfully parsed by a given script, then only -one- print statement will be necessary, at the end of the input stream. However, the print command can also be used to output progress or error messages and other things.
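For example, an error message of that kind can make use of the 'll' line number register. The one-liner below is only a sketch (not tested), and it assumes a script that expects nothing but alphanumeric characters and whitespace in its input.

>> ![:alnum:] . ![:space:] { clear; add "line "; ll; add ": unexpected character\n"; print; crash; }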
GET COMMAND
....

This command obtains the value in the current 'tape' cell and adds it (appends it) to the end of the 'workspace' buffer. The "get" command only affects the 'workspace' buffer of the virtual machine.

* if the workspace has the "noun" token, get its value and print it
>> "noun*" { clear; get; print; clear; }

PUT COMMAND
....

The put command places the entire contents of the workspace into the current tape cell, overwriting any previous value which that cell might have had. The command is written

>> put;

* put the text "one" into the current tape cell and the next one
>> clear; add "one"; put; ++; put;

The put command only affects the current tape cell of the virtual machine. After a put command the workspace buffer is -unchanged-. This contrasts with the machine stack 'push' command, which pushes a certain amount of text (one token) from the workspace onto the stack and deletes the same amount of text from the workspace buffer.

The put command is the inverse of the 'get' command, which retrieves or gets the contents of the current cell of the tape data structure. Since the tape is generally designed for storing the values or attributes of parse tokens, the put command is essentially designed to store the values of attributes. The put command can be used in conjunction with the 'increment' ++ and 'decrement' -- commands to store values in tape cells other than the current one.

READ COMMAND
....

The read command reads one character from the input stream and places that character in the 'peep' buffer. The character which -was- in the peep buffer is added to the -end- of the 'workspace' buffer.

The read command is the fundamental mechanism by which the input stream is "tokenized" (also known as "lexical analysis"). The commands which also perform tokenization are "until", "while" and "whilenot"; these commands perform implicit read operations.

An implicit read operation may also be performed at the beginning of each "cycle" of the script interpreter. This depends upon the implementation, and is an issue which I haven't resolved yet. If an implicit read is not performed, then each script will need to have an explicit read command at the beginning of the script.

REPLACE COMMAND
....

This command replaces one piece of text with another in the workspace. The replace command is useful for indenting blocks of text during formatting operations. The replace command only replaces plain text, not regular expressions.

* replace the letter 'a' with 'A' in the workspace buffer
>> replace "a" "A";

* indent a block of text
>> clear; get; replace "\n" "\n "; put; clear;

UNTIL COMMAND
....

* example
>> until 'text';

Reads the input stream until the workspace ends with the given text.

* print any text that occurs between '<' and '>' characters
>> /</ { until ">"; print; } clear;

* print only text between quote characters (excluding the quotes)
>> r; /"/ { until '"'; clip; clop; print; } clear;

* create a parse token 'quoted' from quoted text
>> r; /"/ { until '"'; clip; clop; put; add 'quoted*'; push; } clear;

* print quoted text, reading past escaped quotes (\")
>> /"/ { until '"'; print; } clear;

The 'while' and 'whilenot' commands are similar to the until command, but they depend on the value of the 'peep' virtual machine buffer (which is a single-character buffer) rather than on the contents of the 'workspace' buffer, as the until command does.

** notes

The 'until' command will usually form part of the 'lexing' phase of a script. That is, the until command permits the script to turn text patterns into 'tokens'.
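For instance, the sketch below (not tested) uses until to lex a c block comment into a single hypothetical 'comment*' token, storing the comment text on the tape as the token's value, in the same style as the bash-comment example earlier. It assumes the surrounding script reads a character at the start of each cycle and clears the workspace afterwards.

>> "/" { read; "/*" { clear; until "*/"; clip; clip; put; clear; add "comment*"; push; } }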
While in traditional parsing tools (such as Lex and Yacc) the lexing and parsing phases are carried out by separate tools, with the 'pp' tool the two functions are combined. The until command essentially performs multiple read operations, and after each read it checks to see whether the workspace meets the criteria specified in the argument.

** proposed implementation changes

A variant of the until command may be implemented, without any argument. That is,

>> untiltape;

will read the input stream until the workspace ends with the same text as is contained in the current tape cell. This form of the 'until' command would be useful for parsing, for example, a sed substitution command such as 's@this@that@g' or 's/this/that/g', where the delimiter character is whatever character follows the 's' character.

WHILENOT COMMAND
....

This is a command that was dreamed up in a rash moment. It is yet to be seen if it really is necessary. It reads characters from the input stream into the workspace -while- the 'peep' buffer is -not- a certain character. This is a "tokenizing" command and allows the input stream to be parsed up to a certain character without reading that character.

* an example
----
 /"/ { clear; whilenot '"'; print; }
,,,,

The above script tokenizes a quoted text section and prints the contents to the standard output.

WHILE COMMAND
....

The 'while' command in the pattern-parse language reads the input stream while the 'peep' buffer is any one of the characters or character classes mentioned in the argument. The command is written

>> while [cdef];

The command takes one argument. This argument may include character classes as well as literal characters. For example,

>> while [:space:];

reads the input stream while the peep buffer is a whitespace character. The read characters are appended to the 'workspace' buffer.

NEGATION

Negation for the while command is currently supported using the "whilenot" command.

** Escaping characters

ENDSWITH TEST
....

The 'ends with' test checks whether the workspace ends with a given string. This test is written

>> E"xyz" { ... }

The script language contains a structure to perform a test based on the content of the workspace, and to execute commands depending on the result of that test. An example of the syntax is

-->
 /ocean/ { add " blue"; print; }
--<

In the script above, if the @workspace buffer is the text "ocean" then the commands within the braces are executed; if not, then not. The test structure is a simple string equivalence test; there are -no- regular expressions, and the workspace buffer must be -exactly- the text which is written between the // characters, or else the test will fail (return false) and the commands within the braces will not be executed.

This test structure is clearly influenced by the "sed" http://sed.sf.net stream editor, which has a virtually identical syntax except for some key elements: in sed, regular expressions are supported, and the first opening brace must be on the same line as the test structure.

There is also another test structure in the script language which checks to see whether the workspace buffer -begins- with the given text; the syntax looks like this

:> <ocean> { add ' blue'; print; }

This syntax is visually not very pleasant, and its design goes against the simplicity which I was attempting to obtain in writing the parse script language.
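Returning to the newer E"" syntax for the 'ends with' test, here is a small usage sketch (not tested): print only the input lines that end with a colon character.

>> read; until "\n"; clip; E":" { add "\n"; print; } clear;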
= the List test in the chomski language =

The list test may be written as

:> :nouns.txt: { ... }

The list test determines whether the workspace matches any one of a list of strings which are contained in a text file. In the example given above, the name of the text file is "nouns.txt". If the workspace was "dog" at the time that the test was executed, and the string "dog" was contained in the text file "nouns.txt", then the test would return -true- and the instructions within the braces would be executed. However, if the word "dog" was not contained in the text file, then the test would return false, and the instructions within the braces would -not- be executed. This test may be useful in a linguistic context, in order to determine whether a particular string is a noun, verb, or other part of speech.

&& Using tests in the pp tool

TEST EXAMPLES
....

* if the workspace is not empty, add a dot to the end of the workspace
>> !"" { add '.'; }

* if the end of the input stream is reached, print the message "end of file"
>> <> { add "end of file"; print; }

* if the workspace begins with 't', trim a character from the end
>> B"t" { clip; }

TAPE TEST
....

The tape test determines whether the current cell of the @tape structure is the same as the @workspace buffer. The tape test is written

:> ==

This test was originally included in order to parse the sed structure

>> s@old@new@g or s/old/new/g or s%old%new%g

In other words, in sed, any character can be used to delimit the text in a substitute command.

STACK STRUCTURE IN THE VIRTUAL MACHINE

The @virtual-machine of the chomski language contains a stack structure which is primarily intended to hold the parse tokens during the parsing and transformation of a text pattern or language. However, the stack can hold any other string data. Each element of the stack structure is a string buffer of unlimited size. The stack is manipulated using the pop and push commands. When a value is popped off the stack, that value is appended to the -front- of the workspace buffer. If the stack is empty, then the pop command has no effect.

&& The tape in the chomski machine

The tape structure in the @virtual-machine is an infinite array of elements. Each of these elements is a string buffer of infinite size. The elements of the tape structure may be accessed using the @increment, @decrement, @get and @put commands.

** The tape and the stack

The tape structure and the @stack structure in the virtual machine are designed to be used in tandem, and several mechanisms have been provided to enable this. For example, when a @pop operation is performed, the @tape-pointer is automatically decremented, and when a @push operation is performed, the tape pointer is automatically incremented. Since the chomski language and machine have been designed to carry out parsing and transformation operations on text streams, the tape and stack are intended to hold the values and tokens of the parsing process.

** the virtual machine

The "virtual machine" [->wikipedia] for the bumble language looks like this

-machine-
 -stack:...,...,...,...
 -tape:...,...,...,...,...,
 -workspace:
 -peep:
 -counter:

The machine has six main elements

D- a @stack: which can contain "parse tokens" [->wikipedia] if the language is used for parsing, or any other text data.
- a @tape: which is an array of text data which is synchronized with the machine stack using a tape pointer. The tape is manipulated with the @get and @put commands.
- a @tape-pointer : This variable determines the current tape cell which will be used by the @get and @put commands.
The tape pointer is incremented with the ++ command and decremented with the -- command.
- a @workspace buffer: This is where all "text change" operations within the machine are carried out. It is similar in concept to a register within a "cpu" [->wikipedia] or to the "Sed" [->wikipedia] stream editor pattern space. The workspace is affected by various commands, such as @clear, @add, @indent, @get, @push, @pop etc.
- a @peep character: this character is not directly manipulable, but it constitutes a very simple "look ahead" mechanism and is used by the @while and @whilenot commands.
- a @counter : This counter or "accumulator" is an integer variable which can be incremented with the command @plus, decremented with the command @minus, and set to zero with the command @zero.

** Language commands and the machine

Most of the commands in the script language affect the virtual machine. For example, the -pop- command takes the last element off the stack and appends it to the -front- of the workspace. We can see this by looking at the virtual machine before and after a pop command.

-machine-
 -stack: noun* , verb* , noun*
 -tape: cat , hunts , bird , --> ,...,
 -workspace: noun*
 -peep: t

After the @pop command the machine will look like this

-machine-
 -stack: noun* , verb*
 -tape: cat , hunts , --> bird ,...,...,
 -workspace: noun*noun*
 -peep: t

The stack is one element shorter and the workspace has changed.

&& The workspace buffer in the virtual machine

The workspace buffer is a buffer within the virtual machine of the stream parsing language. All of the text transformation processes take place in this buffer. For example, the commands @clear, @add, @indent, @newline all affect the text in the workspace buffer.

The machine looks like this

-machine-
 -workspace:
 -stack:...,...,...,...
 -tape:...,...,...,...,...,
 -peep:
 -counter:

The workspace buffer is analogous to a processor register in a non-virtual cpu. In order to manipulate a value, it is generally necessary to first load that value into a cpu register. In the same way, in order to manipulate some text in the chomski machine, it is necessary to first load that text into the workspace buffer. This can be achieved with the @get, @pop and @read commands.

&& About backus-naur form and the pattern parser
------------------------------------------------

Backus-Naur form is a way of expressing grammar rules for formal languages. A variation of BNF is "EBNF", which modifies slightly the syntax of BNF. Sometimes on this site I use (yet another) syntax for BNF or EBNF rules, but the idea is the same. There is a close relationship between the syntax of the 'pp' language and a BNF grammar. For example

>> /word*colon*/ { clear; add 'command*'; push; }

corresponds to the grammar rule

>> command := word colon

REFLECTION, SELF HOSTING AND SELF PARSING

The script "compile.pss" is a parse script which implements the parse-script language. This was achieved by first writing a hand-coded "assembler" program for the machine (contained in the file "asm.pp"). Once a working asm.pp program implemented a basic syntax for the language, the compile.pss script was written. This allows us to maintain and add new syntax to the language using the language itself. A new "asm.pp" file is generated by running

>> pp -f compile.pss compile.pss >> asm.new.pp; cp asm.new.pp asm.pp

The command above runs the script compile.pss and also uses compile.pss as its text input stream. In this sense the system is "self-hosting" and "self-parsing".
It is also a good idea to preserve the old copy of asm.pp in case there are errors in the new compiler.

** Test Structures

All tests may be negated by placing the '!' character -before- the test. For example

:> !// { add 'x'; }

If the workspace is not empty, then add the character 'x' to the workspace.

Multiple tests may be used before a brace block.

-->
 /text/ /word/ { print; }
--<

If the workspace is either 'text' or 'word' then print the workspace.

D- "equals" doc:test-is.txt.html test:
   :> /sometext/ {...}
   If the @workspace is exactly the same as the text contained within the forward slashes, then the instructions within the braces will be executed. If not, then the next instruction executed will be the first instruction after the matching closing brace.
- "begins with" doc:test-begin.txt.html test:
   :> <sometext> {...}
   If the text in the workspace begins with the text contained in the angle brackets, then the code in the braces will be executed.
- "ends with" doc:test-begin.txt.html test:
   :> (sometext) {...}
   If the text in the workspace ends with the text contained in the parentheses, then the code in the braces will be executed.
- "list" doc:test-list.txt.html test:
   :> =file.name= {...}
   This test checks the workspace against the lines of the text file 'file.name'. If the @workspace matches any one of the lines then the test will return true and the instructions within the braces will be executed.
- "tape = workspace" doc:test-class.txt.html test:
   :> == {...}
   Tests whether the current cell of the tape is equal to the current value of the workspace.
- "class" doc:test-class.txt.html test:
   :> [aeiuo] {...}
   If the text in the workspace is exactly any one of the characters contained within the square brackets, then the test will return true.
- "end of input stream" doc:test-eof.txt.html test:
   :> <> {...}
   This test checks to see whether the end of the input stream has been reached. If the input stream has no more characters to read, then the test returns true and the code within the braces will be executed.

All of the above tests can be "negated" by placing the "!" character -before- the test. For example, if the script contains the code

:> !/sometext/ {...}

then the instructions within the braces will be executed only if the @workspace buffer is -not- the text "sometext".

** Comments in scripts

Comments, which will be ignored by the interpreter, may be placed in a script file by -enclosing- the comment in "#" characters. Notice that this is different from the normal unix syntax of beginning each comment line with a hash character. For example

-->
 # This comment will be ignored by the interpreter
   Comments may span lines and must end with a hash character #
 /a/ { pop; }
--<

&& Compiling the chomski interpreter

&& a virtual machine cycle

Each statement in a given script is -not- only executed once. It is executed once for each cycle of the program, and the number of cycles is determined by the number of characters in the input stream. In some implementations an implicit @read operation will be performed for each cycle of the engine, but in other implementations no implicit read may be performed, and each script will have to include a read command at the beginning of the script. If this is not done the script may loop infinitely.

** sed the stream editor

The concept of "cycles" is drawn directly from the sed language or tool (sed is a unix utility). In the sed language, each statement in a sed script is executed once for each -line- in a given input stream.
In other words there is a kind of implicit "loop" which goes around the sed script. This loop, in some fictional programming language, might look like:

-->
 while more input lines
 do
   sed script
 loop
--<

In the current parse language the cycles are executed for each -character- in the input stream (as opposed to each line).

These cycles are also the reason that an interpreter for the language (see the "Parser.cpp" file and the "Program.cpp" file) cannot interpret the language in the normal line-at-a-time fashion. It would be very slow, because the interpreter would have to parse the input script once for each character in the input stream, and there might be thousands or millions of characters in the input stream. A modern computer could probably do this without being excessively slow, but a better approach is needed. The better approach is to "compile" the input script into some kind of array of instructions (with jumps and tests included). This is what the Parser and Program cpp classes do.

** shifting and reducing

There is one complicating factor, which is the concept of multiple shift-reduces ("shift-reduce parsing" [->wikipedia]) during one cycle of the interpreter. This concept has already been treated within the @flag command documentation.

Another tricky concept is grammar rule precedence; in other words, which grammar rule shift-reduction should be applied first, or with greater precedence. In terms of any concrete application, the order of the script statements determines precedence. ...

** a description of the c++ language interpreter files

In general the file endings are

D- dev: the devcpp project files
- name.main.cpp: a program to test the name.cpp class
- name.cpp: a class which is part of the -machine- implementation

D- file:/machine/Instruction.cpp : represents one instruction in a compiled Program.
- file:/machine/Instruction.dev : the bloodshed "devcpp" http://www.bloodshed.net/devcpp.html project file for the Instruction class
- file:/machine/Instruction.exe : an ms-windows executable to test the Instruction class. This file is the result of compiling the Instruction.dev project in the devcpp compiler.
- file:/machine/Instruction.h : the header file for the Instruction class
- file:/machine/Instruction.main.cpp : a c++ program to test the Instruction class
- file:/machine/Machine.cpp : The implementation of the @virtual-machine which is the basis of the parsing script language. See the file:/machine/java directory for an older implementation in java.
- file:/machine/Machine.dev : the project file for the virtual Machine class
- file:/machine/Machine.exe : an ms-windows exe to test the Machine class
- file:/machine/Machine.h : The c++ header file for the Machine class
- file:/machine/Machine.main.cpp : a program to test the Machine class. Allows operations to be performed on the machine, and the changes in the state of the machine to be watched.
- file:/machine/makefile.win : an ms-windows specific makefile which was generated by the devcpp compiler. Could be used as the basis of a unix makefile.
- file:/machine/Parser.cpp : A class which parses and compiles the script language. This class parses a script written in the language and compiles it into an array format using the Program class. This parsing and compiling operation removes the need to reparse the script for each character in the input stream.
- file:/machine/Parser.dev : the devcpp project file for the Parser class
- file:/machine/Parser.exe : an ms-windows executable which tests the Parser class.
The executable also acts as an interpreter for a given script, since it invokes the "run" method on the Program object which is created by the parser.
- file:/machine/Parser.h :
- file:/machine/Parser.main.cpp : This is a program to test the Parser class, which also acts as an interpreter for the script language.
- file:/machine/Program.cpp : represents a compiled script.
- file:/machine/Program.dev : the devcpp project file.
- file:/machine/Program.exe :
- file:/machine/Program.h :
- file:/machine/Program.main.cpp : a program to test the Program class. This is very useful for learning about the parse language and ":virtual-machine" because it provides a command interpreter which allows the user to add instructions to a test program and execute them while watching the changes in the state of the virtual machine.
- file:/machine/Tape.cpp : represents the @tape data structure within the ":virtual-machine". The tape is essentially an array with a pointer.
- file:/machine/Tape.exe : an ms-windows executable to test the Tape class
- file:/machine/Tape.h :

As can be seen from the above files, any of the classes can be tested by compiling the ".main.cpp" file, which provides a program to test the class. To compile a .main.cpp file on a windows computer, the bloodshed devcpp compiler can be used, by loading the ".dev" project file. For example, to compile the file:/machine/Machine.main.cpp file, load the file:/machine/Machine.dev project file in the devcpp compiler and use the menu compile option. By loading the ".dev" project file, all dependency classes will be automatically loaded into the compiler.

** other files

D- file:/machine/Interpret.cpp : a class which parses a given script and generates c++ source code for it. The source code can then be compiled in order to execute the script.
- file:/machine/Interpret.dev : The "devcpp" http://www.bloodshed.net/devcpp.html project file
- file:/machine/Interpret.exe : An ms-windows executable
- file:/machine/Interpret.h :
- file:/machine/Interpret.main.cpp :
- file:/machine/Interpretor.dev :
- file:/machine/Compile.cpp :
- file:/machine/Parser.all.cpp :
- file:/machine/MachineBasic.cpp : an older version of the @virtual-machine

&& A Glossary for the pp language
--------------------------------------

 parse: find the structure of a text pattern
 compile: transform a pattern into another
 machine: a virtual machine which is used to execute a script
 pattern: a formal language, in the mathematical or linguistic sense.
 shift-reduce: a method of parsing.

&& How to compile the pp tool
-----------------------------------------

These instructions will most probably change, as the c source will be divided into smaller files for the sake of maintainability.

-On an x86 computer-

a- download "tiny c" http://fabrice.bellard.free.fr/tcc/
- unzip the tiny c distribution in some folder
- download file:library.c and file:chomski.c and place the two files in the "tcc" folder (which was created when the tiny c distribution was unzipped).
- compile with a command line such as
  :> \tcc\tcc -I\tcc\include -L\tcc\lib chomski.c
- The resulting executable is around 40k

-On a non-x86 computer-

a- download file://library.c and file://chomski.c
- download or obtain a c compiler, for example the GNU c compiler.
- compile the code. The exact procedure I don't know, because I haven't done it.

** a documentation directory

This directory contains documentation for a script language which is designed for implementing parsers.
In fact, that is a very simplified explanation, because the language can also be used to translate one text format into another. The language has similarities with "sed" [->wikipedia], which is a stream editor for text. The language described here is also a text stream editor, but it works with characters rather than lines of text.

Most of the documents are written in English. If a document is also written in Catalan it has a name like "nomarxiu.ca.txt", and you can view it as a web page by typing "nomarxiu.ca.txt.html".

D- "commands" file:commands.txt.html : a list of the keywords of the language
- "virtual machine" file:virtual-machine.txt.html : describes the components of the virtual machine

** Other documents

D- file:how-to-compile.txt.html : how to compile the source code of the program
- file:file-description.txt.html : describes the "c++" (->wikipedia:ca) source files in the file:/machine directory
- file:language.txt.html : a general description of the language
- file:script-trace.txt.html : tries to show how the language and the virtual machine work together

** documentation in html

To view a document as a web page, add ".html" to the end of the file name, for example "nom.txt.html".

&& The input stream of a script
-------------------------------

Each script for the pattern parser assumes that there is an input stream which is to be processed by the instructions or commands contained within the script. In this sense the parse language follows the unix tradition of using "pipes" to process text streams using small programs. Programs such as "wc" (unix) and "grep" are examples of unix-style programs which process a text input stream and produce a text output stream. The 'pp' tool works in the same way as these programs and can form part of a normal "unix pipe" line. (These pipes are also available within the ms-windows or "Dos" command line.)

Let us look at an example. A very simple script in the parse language would be

>> /"/ { clear; add "'"; } print; clear;

This script just replaces each instance of " within a text file with the character ' (which is not very useful, since there are already many programs which can do this).

The above script can be saved in a file called "quote.txt", for example, and then executed. It is executed at the ms-windows or *nix command console by typing

>> type file.txt | Parser quote.txt | more     ##(ms-windows)
or
>> cat file.txt | Parser quote.txt | less      ##(unix)

These command lines are equivalent and demonstrate the use of pipes and input and output streams. The file "file.txt", which contains normal text with " characters in it, forms the input stream for the parse-language script (quote.txt), which in turn creates an output stream which is consumed by the "more" or "less" programs (these programs just display the output one page at a time). The user will just see the original file.txt with all its " double quotes changed to ' single quotes, but the actual "file.txt" is unchanged, since the script operates on the input stream and not on the file itself.

THE PATTERN PARSER SCRIPT LANGUAGE

The parsing script language is a "sed"-like language which is designed to parse and transform "context free languages", or, in non-technical terminology, is designed to re-format text patterns.

NAMING OF THE LANGUAGE

The tool is named 'pp' for Pattern Parser, and the virtual machine may simply be referred to as the parse virtual machine.
I originally named this language 'chomski' but decided to follow the Unix tradition of short names. AN EXAMPLE SCRIPT >>> /=/ { put; clear; add "operator*"; push; } pop; /operator*number*/ { clear; add "equation*"; push; } push; <<< As can be seen the language has a number of things in common with sed. u- it is a cycle based language; that is, each command in the script is executed once for each cycle of the interpreter. - it uses "tests" in the form /text/ - it has the concept of a "workspace" or "pattern space" like sed, being a central text buffer against which tests and manipulations occur - it also uses other buffers, like the sed @hold-space but in the case of the parse language the buffers are more capable because they include a @stack and a ":tape". THE PURPOSE OF THE LANGUAGE The language is designed to allow the simple creation of parsers and translators, without the necessity to become involved in the complexities of something like Lex or Yacc. Since the interpreter for the language is based on a virtual machine, the language is platform independent and has a level of abstraction. The language combines the features of a tokenizer, parser and translator. AVAILABLE IMPLEMENTATIONS http://bumble.sf.net/books/pars/gh.c and other related files in the "gh/" folder such as asm.pp compile.pss compile.ccode.pss ... TRACE OF A SCRIPT This section provides a trace of a script to show how the virtual machine changes. a script which "lexes" ---------- read; '"' { put; clear; add "quote*"; push; } "a", "b" { put; clear; add "letter*"; push; } clear; ,,,, And here is the input stream (text) :> "a","b" ** the trace -machine- -stack: -tape:-->, ... -workspace: -peep:" -notes- u- the @peep buffer contains the next character in the input stream (in this case a quote character) - the dots within the @tape structure indicate that the tape is an "infinite" array. - the arrow in the first cell of the tape indicates that the @tape-pointer is currently pointing to that cell (since the tape pointer starts at zero). - the @stack and @workspace are initially empty -machine- -stack: -tape:-->,... -workspace:" -peep:a -notes- u- the peep buffer is a single character buffer and now contains the next character from the input stream - in each @cycle of the interpreter the peep character is -added- to the end of the workspace, so the workspace is now " script statement: :> /"/ This statement is a @test statement. It checks to see if the workspace buffer is exactly the text within the "/" characters. In this case the condition is true so the statements within the braces are executed :> put; This statement puts the contents of the @workspace buffer into the current element of the @tape data structure. The current element is the element where the pointer is pointing. So the machine now looks like -machine- -stack: -tape:--> ",... -workspace:" -peep:a -notes- u- The first element of the tape now contains the character ", which was the contents of the workspace. The workspace has -not- changed because the @put operation does not delete or change the workspace in any way - also the tape pointer is still pointing to the first element because the put operation does not change the ":tape-pointer" statement: :> clear; This statement simply clears the workspace buffer of all text, that is, sets it to a zero length string, so the machine now looks like -machine- -stack: -tape:--> ",
-workspace: -peep:a next statement: :> add "quote*"; This is also a straightforward statement; it simply adds the given text to the workspace buffer. So the machine is now -machine- -stack: -tape:--> ",... -workspace:quote* -peep:a -notes- u- the * character at the end of the text is important because it indicates that what is being @add ed to the workspace is actually a token name. - all statements or commands in the language -must- end with a semi-colon, unlike sed next statement: :> push; The @push command pushes the contents of the workspace up to and including the first star character onto the stack so the machine now looks like -machine- -stack:quote* -tape: ",-->, ... -workspace: -peep:a -notes- u- the workspace is now empty because the @push command -does- change the workspace, unlike the -put- command. - The tape pointer is now pointing to the second item in the tape, because the push command automatically increments the tape pointer. - The @stack now has one element, namely the previous contents of the workspace buffer. - The @peep buffer is unchanged because no new character will be read from the input stream until all the script commands have been processed. - The character which is in the first element of the tape in a sense "corresponds" to the first element of the stack, in the sense that the quote character is the -value- of the "parse token" [->wikipedia] namely "quote*". In general the stack is for parse tokens and the tape is for values. But the machine can be used for any purpose which the script writer desires. -statement- : :> /a/ This is another test; it checks to see if the workspace is exactly the text "a" (no "regular expressions" [->wikipedia] are involved). In this case the answer is no. -statement- : :> /b/ another test, and the result is once again false because the workspace is currently empty. ** a new cycle so... now all the script statements have been processed (the second block of braces was skipped because the tests returned false). Now begins a new ":cycle". This means that a new character is read from the input stream into the @peep buffer and the previous peep character is added to the end of the workspace. Also the interpreter starts processing the script from the beginning again. So after the new cycle the machine now looks like -machine- -stack:quote* -tape:",-->, ... -workspace:a -peep:" The interpreter starts from the first script statement again. :> /"/ This test now returns -false- because the workspace contains an "a" character, not a quote character. (The test is -not- performed against the peep buffer, only the workspace buffer.) The test returns false and the brace block is skipped. The next statement is: :> /a/ another test, but this time it returns -true- because the workspace -does- equal 'a'. The statements in the block are executed. The next statement is :> put; The contents of the workspace are put into the element of the tape where the tape pointer is pointing so the machine now is -machine- -stack:quote* -tape:",-->a, ... -workspace:a -peep:" The next statement: :> clear; results in the ":virtual-machine" -machine- -stack:quote* -tape:",-->a, ... -workspace: -peep:" and the next statement :> add "letter*"; just adds the given text to the @workspace so the machine is now -machine- -stack:quote* -tape:",-->a, ... -workspace:letter* -peep:" -statement- : :> push; once again the contents of the workspace up to the first star are pushed onto the stack so the machine is now -machine- -stack:quote*,letter* -tape:",a,-->, ...
-workspace: -peep:" -notes- u- the tape pointer is incremented - the workspace is deleted up to and including the first star (in this case, all of it) - the stack gets another element - the @peep is unchanged because a new character is only read at the beginning of a cycle or because of an explicit @read command. ** and so on ... COMPARISON BETWEEN PP AND SED The "pp" tool and virtual machine were largely inspired by the "sed" stream editor. Sed is a program designed to find and replace patterns in text files. The patterns which Sed replaces are "regular expressions". The pp tool includes a more powerful virtual machine which is capable of parsing and compiling many "context-free" languages, whereas sed is mainly used to manipulate "regular" languages. ** Similarities * The workspace: Both sed and pp have a 'workspace' buffer (which in sed is called the "pattern space"). This workspace is the area where manipulations of the text input stream are carried out. * The script loop or cycle: The languages are based on an implicit loop or cycle. That is to say that each command in a sed or a pp script is executed once for each line (sed) or once for each character (pp). * Syntax: The syntax of sed and pp is similar. Statement blocks are surrounded by curly braces {} and statements are terminated with a semi-colon ";". Also the sed idea of "tests" based on the contents of the "pattern space" (or @workspace ) is used in the pp language. * Text streams: Both sed and 'pp' are text stream based utilities, like many other unix tools. This means that both sed and 'pp' consume an input text stream and produce an output text stream. These streams are directed to the programs using "pipes" | in a console window on both unix and windows systems. For example, for sed we could write in a console window >> echo "abcabcabc" | sed "s/b/B/g" and the output would be >> aBcaBcaBc In the case of the pattern parser the command line could be written >> echo "abcabcabc" | pp -e "'b'{d;add'B';t;d;}t;d;" and the output, once again, will be :> aBcaBcaBc In other words the sed and the pp script do the same thing. ** Differences * Lines vs characters: Sed is a "line" based language whereas pp is a character based script language. This means that the sed script is executed once for each line of the input text stream, but in the case of "pp", the script is executed once for each character of the input text stream, except for the "begin" block. * Strict syntax: The syntax of the pp language is stricter than that of sed. For example, in a pp script all commands must end in a semi-colon and all statements after a test must be enclosed in curly braces. In sed, it is not always necessary to terminate commands with a semi-colon. For example, both of the following are valid sed statements. >> s/b/B/g >> s/b/B/g; * White space: In the pattern-parse language (as implemented in "asm.pp" or compile.pss) whitespace is insignificant. In sed, this is not the case. For example in pp one can write: ---- "text" { print; } ,,,, That is, the opening brace is on a different line to the test. This would not be legal syntax in sed. * Commands names: Sed uses single character "mnemonics" for its commands. For example, "p" is print, "s" is substitute, "d" is delete. In the pp language, in contrast, commands also have a long form, such as "print" for print the workspace, "clear" for clear the workspace, or "pop" in order to pop the last element of the stack onto the workspace.
While the sed approach is useful for writing very short, terse scripts, the readability of the scripts is not good. * The virtual machines: While both sed and the pattern parser are in a sense based on "virtual machines", the machine for pp is more extensive. The sed machine essentially has 2 buffers, the pattern space and the hold space. The pp virtual-machine, however, has a workspace, a stack, a tape, a counter variable and various other registers. * Regular expressions: The tests or "ranges" in sed, as well as the substitutions, are based on regular expressions. In pp, however, no regular expressions are used. The reason for this difference is that pp is designed to parse and transform a different set of patterns than sed. The patterns that pp is designed to deal with are referred to formally as "context free languages". * Negation of tests: The negation operator ! in the pp language is placed before the test to which it applies, whereas in sed, the negation operator comes after the test or range. So, in the pattern-parse language it is correct to write >> !"tree" { add ': not a tree'; print; clear;} But in sed the correct syntax is >> /tree/! { s/$/not a tree/; } ** The self referentiality of the parse-language The parse-language is capable of parsing and compiling itself, with one or two caveats. This is because the parse-language is designed to parse and transform, translate or compile context free languages (but not necessarily all context free languages) and the language itself is a context free language. SHIFT REDUCTIONS ON THE STACK .... Imagine we have a 'pp' script as follows: ---- pop;pop; "command*command*", "commandset*command*" { clear; add 'commandset*'; push; } "word*colon*" { clear; add 'command*'; push; } push; push; ,,, This script corresponds directly to the (e)bnf grammar rules --- commandset := command , command; commandset := commandset , command; command := word , colon; ,,,, But in the script above there is a problem: the first rule needs to be applied after the second rule. But it seems now that another reduction should occur, namely >> if-block --> if-statement block which can be simply implemented in the language using the statements: ----- pop; pop; "if-statements*block*" { clear; add "if-block*"; push; } ,,,, But the crucial question is, what happens if the statements just written come before the statements which were presented earlier on? The problem is that the second reduction will not occur because the script has already passed the relevant statements. This is the reason for the .reparse command. TASKS TO DO PATTERN PARSER VIRTUAL MACHINE AND TOOL * create a java version of the machine? * create a proper "check.syntax.pss" script which will expand the syntax checks in compile.pss IDEAS FOR CHANGING THE PATTERN PARSER * add a new command "untiltape" which has no arguments, and which reads the input stream until the workspace ends with the text contained in the current cell of the tape. * add an execute command, which executes the workspace as a shell command and inserts the output in the output stream. This adds unlimited power to the language, but is it really necessary???? PARSE TOKENS OVERVIEW The parsing script language is a "sed" like language which is designed to parse and transform "context free languages" or, in non-technical terminology, is designed to re-format text patterns. NAMING OF THE LANGUAGE The tool is named 'pp' for Pattern Parser, and the virtual machine may be referred to as the parsing virtual machine.
I originally named this language 'chomski' but decided to follow the Unix tradition of short names. AN EXAMPLE SCRIPT ---- read; "=" { put; clear; add "operator*"; push; } parse> pop;pop; "operator*number*" { clear; add "equation*"; push; .reparse } push; push; ,,,, As can be seen the language has a number of things in common with sed. * it is a cycle based language; that is, each command in the script is executed once for each cycle of the interpreter. * it uses "tests" in the form "text" [abcd] [a-d] [:space:] ![:alpha:] * it has the concept of a "workspace" or "pattern space" like sed, being a central text buffer against which tests and manipulations occur * it also uses other buffers, like the sed "hold-space", but in the case of the parse language the buffers are more capable because they include a stack and a "tape" (which is an array of text cells). THE PURPOSE OF THE LANGUAGE The language is designed to allow the simple creation of parsers and translators, without the necessity to become involved in the complexities of something like Lex or Yacc. Since the interpreter for the language is based on a virtual machine, the language is platform independent and has a level of abstraction. The language combines the features of a tokenizer, parser and translator. AVAILABLE IMPLEMENTATIONS http://bumble.sourceforge.net/books/pars/object/ This folder contains a working implementation in the c language. The code can be compiled with the bash functions in the file "helpers.gh.sh". http://bumble.sourceforge.net/books/pars/object.java A reasonably complete machine object in java. We can use this in conjunction with the script language to create compilable java code (the script "compilable.c.pss" provides an example of how to do this for the c language). COMPARISON BETWEEN PP AND SED The pp tool was largely inspired by the "sed" stream editor. Sed is a program designed to find and replace patterns in text files. The patterns which Sed replaces are "regular expressions". SIMILARITIES .... * The workspace: Both sed and chomski have a 'workspace' buffer (which in sed is called the "pattern space"). This workspace is the area where manipulations of the text input stream are carried out. * The script cycle: The languages are based on an implicit cycle. That is to say that each command in a sed or a chomski script is executed once for each line (sed) or once for each character (chomski). * Syntax: The syntax of sed and chomski is similar. Statement blocks are surrounded by curly braces {} and statements are terminated with a semi-colon ";". Also the sed idea of "tests" based on the contents of the "pattern space" (or @workspace ) is used in the chomski language. * Text streams: Both sed and chomski are text stream based utilities, like many other unix tools. This means that both sed and chomski consume an input text stream and produce an output text stream. These streams are directed to the programs using "pipes" | in a console window on both unix and windows systems. For example, for sed we could write in a console window >> echo abcabcabc | sed "s/b/B/g" and the output would be >> aBcaBcaBc In the case of "chomski", the command line could be written >> echo abcabcabc | chomski -s "/b/{clear;add'B';print;clear;}print;clear;" and the output, once again, will be :> aBcaBcaBc In other words the sed and the chomski script do the same thing. - DIFFERENCES .... * Lines vs characters: Sed is a "line" based language whereas chomski is a character based script language.
This means that the sed script is executed once for each line of the input text stream, but in the case of chomski, the script is executed once for each character of the input text stream. * Strict syntax: The syntax of the parsing language is stricter than that of sed. For example, in a parsing script all commands must end in a semi-colon (except .reparse, parse> and .restart - which affect program flow) and all statements after a test must be enclosed in curly braces. In sed, it is not always necessary to terminate commands with a semi-colon. For example, both of the following are valid sed statements. >> s/b/B/g >> s/b/B/g; * White space: In the parsing language whitespace is completely insignificant, even within quoted arguments (?). In sed, this is not the case. For example in pp one can write ------ "text" { print; } ,,,, That is, the opening brace is on a different line to the test. This would not be legal syntax in sed. * Commands names: Sed uses single character "mnemonics" for its commands. For example, "p" is print, "s" is substitute, "d" is delete. In the chomski language, in contrast, commands are full English words, such as @print for print the workspace, @clear for clear the workspace, or @pop in order to pop the last element of the @stack onto the workspace. While the sed approach is useful for writing very short, terse scripts, the readability of the scripts is not good. Chomski sacrifices conciseness for readability. * The virtual machines: While both sed and chomski are in a sense based on "virtual machines" [->wikipedia], the machine for chomski is more extensive. The sed machine essentially has 2 buffers, the pattern space and the hold space. The chomski @virtual-machine, however, has a workspace, a stack, a tape, and a counter variable, amongst other things. * Regular expressions: The tests or "ranges" in sed, as well as the substitutions, are based on regular expressions. In pp, however, no regular expressions are used. The reason for this difference is that pp is designed to parse and transform a different set of patterns than sed. The patterns that pp is designed to deal with are referred to formally as "context free languages" [->wikipedia]. * Negation of tests: The negation operator ! in the "chomski" language is placed before the test to which it applies, whereas in sed the negation operator comes after the test or range. So, in "chomski" it is correct to write >> !/tree/ { ... } But in sed the correct syntax is >> /tree/! { ... } COMMANDS ** Test Structures All tests may be negated by placing the '!' character -before- the test. For example :> !// { add 'x'; } If the workspace is not empty then add the character 'x' to the workspace. Multiple "equals" tests may be used before a brace block. "text","word" { print; } --< If the workspace is either 'text' or 'word' then print the workspace. D- "equals" doc:test-is.txt.html test: :> /sometext/ {...} If the @workspace is exactly the same as the text contained within the forward-slashes then the instructions within the braces will be executed. If not, then the next instruction executed will be the first instruction after the matching closing brace. - "begins with" doc:test-begin.txt.html test: :> <sometext> {...} If the text in the workspace begins with the text contained in the angle brackets, then the code in the braces will be executed. - "ends with" doc:test-begin.txt.html test: :> (sometext) {...} If the text in the workspace ends with the text contained in the parentheses, then the code in the braces will be executed.
- "list" doc:test-list.txt.html test: :> =file.name= {...} This test checks the workspace against the lines of the text file 'file.name'. If the @workspace matches any one of the lines then the test will return true and the instructions within the braces will be executed. - "tape = workspace" doc:test-class.txt.html test: :> == {...} tests if the current element of the tape is equal to the current value of the workspace - "class" doc:test-class.txt.html test: :> [aeiuo] {...} If the text in the workspace is exactly any one of the characters contained within the square braces then the test will return true. - "end of input stream" doc:test-eof.txt.html test: :> <> {...} This test checks to see if the end of the input stream has been reached. If the input stream has no more characters to read then the test returns true and the code within the braces will be executed. All of the above tests can be "negated" by placing the "!" character -before- the test. For example, if the script contains the code :> !/sometext/ {...} then the instructions within the braces will be executed only if the @workspace buffer is -not- the text "sometext". ** Comments in scripts Comments, which will be ignored by the interpreter, may be placed in a script file by -enclosing- the comment in "#" characters. Notice that this is different from the normal unix syntax of beginning each line with a hash character. For example --> # This comment will be ignored by the interpreter Comments may span lines and must end with a hash character # /a/ { pop; } --< BACKUS-NAUR FORM AND THE PATTERN PARSER Backus-naur form is a way of expressing grammar rules for formal languages. A variation of BNF is "EBNF" which modifies the syntax of bnf slightly. Sometimes on this site I use (yet another) syntax for bnf or ebnf rules, but the idea is the same. There is a close relationship between the syntax of the 'pp' language and a bnf grammar. For example >> "word*colon*" { clear; add "command*"; push; } corresponds to the grammar rule >> command := word colon PP LANGUAGE PARSING ITSELF An important problem for the pp language is to "generate itself" from a set of bnf rules. In other words, given rules such as ----- command := word semicolon; command := word quotedtext semicolon; ,,,, then it should be possible to write a script in the language which generates output as follows ---- pop;pop; "word*semicolon*" { clear; add "command*"; push; .reparse } pop; "word*quotedtext*semicolon*" { clear; add "command*"; push; .reparse } push;push;push; ,,,, TOKEN ATTRIBUTE TRANSFORMATIONS (Note: the following is not strictly necessary, since we can already write complete translators/compilers with the script language as it is. However it may be nice to have a more expressive format like the one shown below.) After the above problem is solved it would be necessary to add a syntax for transforming the values of the tokens, which is what will actually allow the writing of compilers or translators. For example: * a more expressive compiling language built on top of the parse language ---- command := word semicolon { $0 = "" $1 " $2; }; ... The $n structure will fetch the tape-cell for the corresponding identifier in the bnf rule at the start of the brace block. REPARSE COMMAND AND PARSE LABEL This command makes the interpreter jump back to the "parse>" label. The label must occur before the .reparse command, not after.
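To make the label placement concrete, here is a minimal sketch of a script skeleton (the token names are invented for illustration and this is not code taken from asm.pp or compile.pss): the lexing statements sit before the "parse>" label, and the reductions which end in .reparse sit after it, so that .reparse can jump back to the label and re-examine the stack.

* a sketch showing where the parse> label and .reparse sit
----
# lexing statements (run for each character of input)
read;
";" { clear; add "semicolon*"; push; }
[a-z] { while [a-z]; clear; add "word*"; push; }

parse>
# reductions: .reparse jumps back to the parse> label above
pop; pop;
"word*semicolon*" { clear; add "command*"; push; .reparse }
push; push;
,,,

Because the reduction ends with .reparse, the newly created command* token is examined again in the same cycle, which is what allows several shift-reduce steps to happen for a single input character.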
The .reparse command takes no arguments and is written >> .reparse ** Multiple shift reductions The reason for using this command lies in the need to apply multiple "shift reduce" operations during one cycle of a script. See shift-reduce.txt for a more complete explanation. SHIFT REDUCTIONS ON THE STACK Imagine we have a "recogniser" pp script as follows: ----- read; ";" { clear; add "semicolon*"; push; } [a-z] { while [a-z]; clear; add "word*"; push; } parse> pop;pop; "command*command*" { clear; add 'commandset*'; push; .reparse } "word*semicolon*" { clear; add 'command*'; push; .reparse } push; push; (eof) { pop; pop; push; !"" { clear; add "incorrect syntax! \n"; print; quit; } pop; "command*", "commandset*" { clear; add "Correct syntax! \n"; print; quit; } } ,,,,, This script recognises a language which consists of a series of "commands" (which are lower case words) terminated in semicolons. At the end of the script, there should only be one token on the stack (either command* or commandset*). This script corresponds directly to the bnf grammar rules ---- word := [a-z]+; semicolon := ';'; commandset := command command; command := word semicolon; ,,,, But in the script above there is a problem: the first rule needs to be applied after the second rule. Statements like these in the parse language execute a reduction according to a grammar rule. Let us examine the state of the machine before and after a reduction of this kind (the example below comes from parsing an if-statement, where open-brace* statements* close-brace* is about to be reduced to block*): -machine- -stack: if-statement*, open-brace*, statements*, close-brace* -tape: if(a==b), {, a=1; b=2*a; ..., } -workspace: and afterwards -machine- -stack: if-statement*, block* -tape: if(a==b), { a=1; b=2*a; ... } -workspace: But it seems now that another reduction should occur, namely :> if-block --> if-statement block which can be simply implemented in the language using the statements: pop; pop; /if-statement*block*/ { clear; add "if-block*"; push; } But the question is, what happens if the statements just written come *before* the statements which were presented earlier on? The problem is that the second reduction will not occur because the script has already passed the relevant statements. This is the reason for the .reparse command. If this explanation has not clarified matters, it should be said that the best way to understand the workings of the machine is to look at a trace of its execution, which can be obtained from the Program class. SELF REFERENTIALITY The parse script language is a language which is designed to parse/compile/translate languages. This means that it can recognise, parse, compile, and translate itself. The script books/pars/compilable.c.pss is an example of this. Another interesting application of this self-referentiality is creating a new compiling system in a different target language. NEGATION OF TESTS a test such as >> "some-text" { #* commands *# } can be modified with the logic operator "!" (not) as in >> !"some-text" { #* commands *# } Note that the negation operator '!' must come before the test which it modifies, instead of afterwards as in sed.
Multiple negations are not allowed >> !!"some-text" {...} # incorrect ** Examples If the workspace is not a single digit then clear the workspace >> ![0-9] { clear; } If the workspace does not begin with the text 'apple' then print the workspace >> !B"apple" { print; } # not implemented (yet, august 2019) If the workspace does not match any of the lines of the file 'verbs.txt' then add the text 'noun' to the workspace >> !=verbs.txt= { add 'noun'; } If the end of the input stream has not been reached then push the contents of the workspace onto the stack >> !(eof) { push; } # not implemented (august 2019) If the value of the workspace is not exactly equal to the value of the current element of the tape, then exit the script immediately >> !(==) { quit; } # not implemented (august 2019) COMMENTS IN THE PARSER LANGUAGE Both single-line and multiline comments are available in the parser script language as implemented in books/pars/asm.pp and compile.pss * single and multiline examples ---- #* check if the workspace is the text "drib" *# "drib" { clear; # clear the workspace buffer } ,,,, The script books/pars/compile.pss attempts to preserve comments in the output "assembler" code to make that code more readable. COMMANDS ADD COMMAND .... The "add" command adds a piece of text to the -end- of the 'workspace' buffer. No other register or buffer within the virtual machine is affected. The command is written >> add 'text'; or >> add "text"; ** arguments for the command ** examples * Add a space after every character of the input >> add ' '; print; clear; * Add a token name to the workspace while parsing >> /style*type*/ { clear; add "command*"; push; } The second script above does a shift-reduce operation while parsing some hypothetical language. The "add" command is used to add a new token name to the 'workspace' buffer which is then pushed onto the stack (using the 'push' operation, naturally). In the above script the text added can be seen to be a token name (as opposed to some other arbitrary piece of text) because it ends in the "*" character. Actually any character could be used to delimit the token names, but "*" is the default implementation. The add command takes *one* and only one parameter, or argument. QUIT COMMAND .... This command immediately exits out of the script without processing any more script commands or input stream characters. This command is similar to the "Sed" command 'q'. INCREMENT COMMAND .... This command is simple but important. The command adds -one- to the 'tape'-pointer. The command is written >> ++; Notice that the only part of the machine state which changes is that the tape-pointer is incremented by one cell. This command allows the tape to be accessed as an array of strings. Being able to increment and decrement the tape pointer allows the script writer and the virtual machine to access any value on the tape using the 'get' and 'put' commands. It should be remembered that the 'push' and 'pop' commands also automatically increment and decrement the tape pointer, in order to keep the tape pointer and the stack in synchronization. Since the get command is the only way to retrieve values from the tape, the ++; and --; commands are necessary. The tape cannot be accessed using some kind of array index because the 'get' and 'put' commands do not have any arguments. As an example, the following script will print the string associated with the "value" token. >> pop;pop; "field*value*" { ++; get; --; print; } MINUS COMMAND .... The minus command decreases the counter variable by one.
This command takes no arguments and is written :> minus; PEEP BUFFER The peep buffer is a single character buffer which stores the next character in the input stream. When a 'read' command is performed the current value of the peep buffer is appended to the 'workspace' buffer and the next character from the input stream is placed into the peep buffer. The 'test-eof' <> checks to see if the peep buffer contains the end of input stream marker. The 'while' command reads from the input stream while the peep buffer is, or is not, some set of characters. For example :> while 'abc'; reads the input stream while the peep buffer is any one of the characters 'abc'. PLUS COMMAND .... This command increments the machine counter variable by one. It is written >> a+; The plus command takes no arguments. Its counterpart is the 'minus' command. ** See also D- 'count' : adds the value of the counter to the end of the 'workspace' buffer. - 'minus' : decrements the internal counter by one. - 'zero' : sets the internal counter to zero POP COMMAND .... NOTES The pop command is usually employed in the parsing phase of the script (not the lexing phase). DETAILS .... If the stream parser language is being used to parse and translate a language then the script writer needs to ensure that each token ends with a star character in order for the push and pop commands to work correctly. The pop command also affects the "tape" of the virtual machine in that the tape-pointer is automatically decremented by one for each pop command. This is convenient because it means that the tape pointer will be pointing to the corresponding element for the token. In other words, in the context of parsing and compiling a "formal language", the tape will be pointing to the "value" or "attribute" for the token which is currently in the workspace. PUSH COMMAND .... The push command pushes part of the workspace buffer onto the virtual machine stack. The command is written >> push; The "push" command is the inverse of the "pop" command. The command takes no arguments. The push command also automatically increments the 'tape' pointer by one. The tape pointer is incremented so that the push command can be used in conjunction with the 'put' command. ** The virtual machine and push The delimiter character (by default '*') is important here, because it is used to separate the token names as they are being pushed onto the 'stack'. That is, the machine reads the workspace up to the first "*" and pushes that portion of the workspace onto the stack, and the workspace is diminished by that text. In the diagrams above the arrow in the tape indicates the position of the tape pointer. The "push" and "pop" operations increment or decrement the tape pointer in order to keep it "synchronized" with the stack. It is important to note that after a push operation the pointer is actually pointing to one element -after- the last item on the tape. ** When the workspace is empty When the workspace contains no text then the push operation actually does nothing. This is desirable because it allows the script writer to include possibly redundant "push" operations without affecting the validity of the script. For example ---- pop; "token*" { clear; add "idea*"; push; .reparse } ,,, The script above pops the stack once onto the workspace and then checks if the workspace matches the text "token*". If so, it clears the workspace, adds the text "idea*" and then pushes the text onto the stack.
After this push operation the workspace will be empty and so the push which occurs after the final closing brace will not do anything. UNSTACK COMMAND .... Pop the entire stack as a prefix onto the workspace. This may be useful for displaying the state of the stack at the end of parsing or when an error has occurred. Currently (Aug 2019) the tape pointer is not affected by this command. STACK COMMAND .... Push the entire workspace onto the stack regardless of token delimiters. PUT COMMAND .... The put command places the entire contents of the workspace into the current tape cell, overwriting any previous value which that cell might have had. The command is written >> put; The contents of the workspace are not changed. * put the text "one" in the current tape cell >> clear; add "one"; put; ++; After a put command the workspace buffer is unchanged. This contrasts with the stack "push" command, which pushes a certain amount of text from the workspace onto the stack and deletes the same amount of text from the workspace buffer. The put command is the reverse of the 'get' command, which retrieves or gets the contents of the current item of the tape data structure. Since the tape is generally designed for storing the values or attributes of parse tokens, the put command is essentially designed to store the values of attributes. The put command can be used in conjunction with the 'increment' ++ and 'decrement' -- commands to store values in tape cells other than the current one. READ COMMAND .... The read command reads one character from the input stream and places that character in the 'peep' buffer. The character which -was- in the peep buffer is added to the -end- of the 'workspace' buffer. The read command is the fundamental mechanism by which the input stream is "tokenized", also known as "lexical analysis". The commands which also perform tokenization are "until", "while" and "whilenot". These commands perform implicit read operations. An implicit read operation may also be performed at the beginning of each "cycle" of the script interpreter. This depends upon the implementation and is an issue which I haven't resolved yet. If an implicit read is not performed then each script will need to have an explicit read command at the beginning of the script. REPLACE COMMAND .... This command replaces one piece of text with another in the workspace. I wanted to keep the machine very simple and only allow the 'workspace' to be altered with the 'clear', 'add', and 'indent' commands (and also implicitly with the 'push', 'pop', 'get', 'put' ... etc commands). But for some applications a replace command seems to be necessary. && The 'until' command ---------------------- SUMMARY until 'text'; until 'text' not 'words'; until; reads the input stream until the workspace ends with the given text or with the text of the current tape element. EXAMPLES ....
* print any text that occurs between '<' and '>' characters >> read; "<" { until ">"; print; } clear; * print text between quote characters (excluding the quotes) >> read; '"' { until '"'; clip; clop; print; } clear; * create a parse token 'quoted' from quoted text ------ read; '"' { until '"'; clip; clop; put; clear; add 'quoted*'; push; .reparse } clear; ,,, * print quoted text, ignoring 'escaped' quotes >> /"/ { until '"' not '\"'; print; } clear; The 'while' and 'whilenot' commands are similar to the until command but they depend on the value of the 'peep' virtual machine buffer (which is a single-character buffer) rather than on the contents of the 'workspace' buffer like the until command. NOTES The 'until' command will usually form part of the 'lexing' phase of a script. That is, the until command permits the script to turn text patterns into 'tokens'. While in traditional parsing tools (such as Lex and Yacc) the lexing and parsing phases are carried out by separate tools, with the 'pp' tool the two functions are combined. The until command essentially performs multiple read operations and, after each read, checks to see whether the workspace meets the criteria specified in the argument. PROPOSED IMPLEMENTATION CHANGES A variant of the until command may be implemented, without any argument. That is >> until; will read the input stream until the workspace ends with the same text as is contained in the current tape element. This form of the 'until' command would be useful for parsing, for example, a sed substitution command such as 's@this@that@g' or 's/this/that/g' where the delimiter character is whatever character follows the 's' character. WHILE COMMAND .... The 'while' command in the pp language reads the input stream while the 'peep' buffer is any one of the characters or character sets mentioned in the argument. The command takes one argument and is written in a script as: >> while 'cdef'; The argument may include character classes as well as literal characters. For example >> while [:space:]; reads the input stream while the peep buffer is a whitespace character. The read characters are appended to the 'workspace' buffer. NEGATION The while command also supports the negation operator ! and this is written as follows :> while !'text'; where text is a sequence of characters and character classes. WHILENOT COMMAND .... This is a command that was dreamed up in a rash moment. It is yet to be seen if it really is necessary. It reads characters from the input stream into the workspace -while- the 'peep' buffer is -not- a certain character. This is a "tokenizing" [->wikipedia] command and allows the input stream to be parsed up to a certain character without reading that character. An example ----- '"' { clear; whilenot '"'; print; } ,,,, The above script tokenizes a quoted text section and prints the contents to the standard output. ** See also 'while', 'until', 'virtual-machine' ZERO COMMAND .... The "zero" command sets the internal counter to zero. This counter may be used to keep track of nesting during parsing, or for other more mundane purposes such as numbering lines or instances of a particular string or pattern. * set the counter to zero whenever the workspace is "x" >> read; "x" { zero; } GRAMMAR AND SCRIPT CONSTRUCTION My knowledge of formal grammar theory is quite limited. I am more interested in practical techniques. But there is a reasonably close correlation between bnf-type grammar rules and script construction.
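As a small illustration of that correlation (the grammar and token names below are invented, and the fragment assumes a lexing phase above it which pushes item* tokens, together with a parse> label), a left-recursive pair of bnf rules typically becomes a pair of reductions:

* a bnf rule pair and the corresponding reductions (sketch only)
----
# grammar:  list := item ;
#           list := list item ;
pop; pop;
"item*" { clear; add "list*"; push; .reparse }
"list*item*" { clear; add "list*"; push; .reparse }
push; push;
,,,

The first test handles the base case of the rule and the second handles the left-recursive case; because both end with .reparse, a run of items collapses into a single list* token as the items arrive on the stack.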
COMPILATION TECHNIQUES One of the enjoyable aspects of this parsing/compiling machine is discovering interesting practical "heuristic" techniques for compiling syntactical structures, within the limitation of the machine capabilities. This section will detail some techniques. REPETITIONS .... The trick below only creates a new list token if the preceding token is not a list of the same type. * a technique for building a list token from repeated items ---- # ebnf rules: # alist := a {a} # blist := b {b} read; # terminal symbols "a","b" { add "*"; push; } !"" { put; clear; add "incorrect character '"; get; add "'"; add " at position "; cc; add "\n"; add " only a's and b's allowed. \n"; print; quit; } parse> # 1 token (with extra token) pop; "a*" { pop; !"a*".!"alist*a*" { push; } clear; add "alist*"; push; .reparse } "b*" { pop; !"b*".!"blist*b*" { push; } clear; add "blist*"; push; .reparse } push; { unstack; put; clear; add "parse stack is: "; get; print; quit; } ,,,, OFFSIDE OR INDENT PARSING .... Some languages use indentation to indicate blocks of code, or compound statements. Python is an important example. These languages are parsed using "indent" and "outdent" or "dedent" tokens. Currently (Aug 2019) the parse machine cannot parse these languages. A simple extension to the machine would be to add "marks" to tape cells as well as a "go" command that jumps to a previously set mark. This would allow parsing of offside languages. OPTIONALITY .... The parse machine cannot directly encode the idea of an optional "[...]" element in a bnf grammar. * a rule with an optional element >> r := 'a' ['b'] . In some cases we can just factor out the optional into alternation "|" >> r := 'a' | 'a' 'b' . However once we have more than 2 or 3 optional elements in a rule, this becomes impractical, for example >> r := ['a'] ['b'] ['c'] ['d'] . In order to factor out the optionality above we would end up with about 4 + (3+2+1) + (2+1) + 1 = 14 rules another approach is to encode some state in a parse token. * a strategy for parsing optionalities ---------- # parse the ebnf rule # rule := ['a'] ['b'] ['c'] ['d'] ';' . begin { add "0.rule*"; push; } read; [:space:] { clear; } "a","b","c","d",";" { add "*"; push; .reparse } !"" { add " unrecognised character."; print; quit; } parse> pop; pop; E"rule*a*" { B"0" { clear; add "1.rule*"; push; .reparse } clear; add "misplaced 'a' \n"; print; quit; } E"rule*b*" { B"0",B"1" { clear; add "2.rule*"; push; .reparse } clear; add "misplaced 'b' \n"; print; quit; } E"rule*c*" { B"0",B"1",B"2" { clear; add "3.rule*"; push; .reparse } clear; add "misplaced 'c' \n"; print; quit; } E"rule*d*" { B"0",B"1",B"2",B"3" { clear; add "4.rule*"; push; .reparse } clear; add "misplaced 'd' \n"; print; quit; } E"rule*;*" { clear; add "rule*"; push; } push; push; (eof) { pop; "rule*" { add " its a rule!"; print; quit; } } ,,,, REPETITION PARSING .... Similar to the notes above about parsing grammar rules containing optional elements, we have a difficulty when parsing elements or tokens which are enclosed in a "repetition" structure. In ebnf syntax this is usually represented with either braces "{...}" or with a kleene star "*". We can use a similar technique to the one above to parse repeated elements within a rule. The rule parsed below is equivalent to the regular expression >> /a?b*c*d?;/ So the script below acts as a recogniser for the above regular expression. I wonder if it would be possible to write a script that turns simple regular expressions into parse-scripts? 
In the code below we don't have any separate blist or clist tokens. * parsing repetitions within a grammar rule ---------- # parse the ebnf rule # rule := ['a'] {'b'} {'c'} ['d'] ';' . # equivalent regular expression: /a?b*c*d?;/ begin { add "0/rule*"; push; } read; [:space:] { clear; } "a","b","c","d",";" { add "*"; push; .reparse } !"" { add " unrecognised character."; print; quit; } parse> # ------------ # 1 token pop; # ------------ # 2 tokens pop; E"rule*a*" { B"0" { clear; add "a/rule*"; push; .reparse } clear; add "misplaced 'a' \n"; print; quit; } E"rule*b*" { B"0",B"a",B"b" { clear; add "b/rule*"; push; .reparse } unstack; add " << parse stack.\n"; add "misplaced 'b' \n"; print; quit; } E"rule*c*" { B"0",B"a",B"b",B"c" { clear; add "c/rule*"; push; .reparse } clear; add "misplaced 'c' \n"; unstack; add " << parse stack.\n"; print; quit; } E"rule*d*" { B"0",B"a",B"b",B"c" { clear; add "d/rule*"; push; .reparse } clear; add "misplaced 'd' \n"; print; quit; } E"rule*;*" { clear; add "rule*"; push; } push; push; (eof) { pop; "rule*" { add " its a rule!"; print; quit; } unstack; add " << parse stack.\n"; print; quit; } ,,,, PL ZERO .... Pl/0 is a minimalistic language created by Niklaus Wirth for teaching compiler construction. In this section, I will explore converting the pl/0 grammar into a form that can be used by the parsing machine and language. I will do this in stages, first parsing "expressions" and then "conditions" etc. Wirth's grammar is designed to be used in a recursive descent parser/compiler, so it will be interesting to see if it can be adapted for the machine, which is essentially an LR shift-reduce parser/compiler. The grammar below seems to be adequate for LR parsing, except for the expression tokens (expression, term, factor etc). Apart from expression we should be able to factor out the various [] {} and () constructs and create a machine parse script. * the pl/0 grammar in wsn/ebnf form ------- program = block "." . block = [ "const" ident "=" number {"," ident "=" number} ";"] [ "var" ident {"," ident} ";"] { "procedure" ident ";" block ";" } statement . statement = [ ident ":=" expression | "call" ident | "?" ident | "!" expression | "begin" statement {";" statement } "end" | "if" condition "then" statement | "while" condition "do" statement ]. condition = "odd" expression | expression ("="|"#"|"<"|"<="|">"|">=") expression . expression = [ "+"|"-"] term { ("+"|"-") term}. term = factor {("*"|"/") factor}. factor = ident | number | "(" expression ")". ,,, PROGRAMS AND BLOCKS IN PLZERO .... I will try to keep Wirth's grammatical structure, but factor the 2 rules and introduce new parse tokens for readability, such as "constdec*" for a constant declaration, and "vardec*" for a variable declaration. I will also try to keep Wirth's rule names for reference. * the program and block grammar rules -------- program = block "." . block = [ "const" ident "=" number {"," ident "=" number} ";"] [ "var" ident {"," ident} ";"] { "procedure" ident ";" block ";" } statement . ,,, Once this script is written we can check if it is parsing correctly, and then move on to variable declarations, procedure declarations and so forth (but not the forth language, which doesn't require a grammar). * a script just for pl/0 constant and variable declarations --------- # a constant declaration rule in Wirth's ebnf grammar # constdec = "const" ident "=" number {"," ident "=" number} ";" . # the script below factors out the {} repetition construct.
read; [:alpha:] { while [:alpha:]; # keywords in pl/0 "const","var","if","then","while","do","begin","end" { put; add "*"; push; .reparse } put; clear; add "ident*"; push; .reparse } [0-9] { while [0-9]; put; clear; add "number*"; push; .reparse } # literal tokens ",","=",";" { add "*"; push; } parse> # or parse this like varset/vardec "ident*=*number*" { clear; add "equalityset*"; push; .reparse } "equalityset*,*ident*=*number*" { clear; add "equalityset*"; push; .reparse } "const*equalityset*;*" { clear; add "constdec*"; push; .reparse } # to actually compile we might have to split this rule "var*ident*","varset*,*ident*" { clear; add "varset*"; push; .reparse } "varset*;*" { clear; add "vardec*"; push; .reparse } { pop; pop; "varset*","vardec*" { add " declaration \n"; print; clear; quit; } } ,,, PLZERO EXPRESSIONS .... * wsn/ebnf grammar for pl/0 expressions --------- expression = [ "+"|"-"] term { ("+"|"-") term}. term = factor {("*"|"/") factor}. factor = ident | number | "(" expression ")". ,,, The parse-machine cannot directly implement repetitions {} optionals [] or groupings () so the expression grammar needs to be "factored" so that only alternations remain. * a factored grammar --------- factor = ident | number | "(" expression ")". mulfactorset = "*" factor | "/" factor. mulfactorset = mulfactorset mulfactorset. term = factor | factor mulfactorset. addtermset = "+" term | "-" term. addtermset = addtermset addtermset. expression = term | term addtermset expression = "+" term | "-" term. expression = "+" term addtermset | "-" term addtermset. ,,, But this grammar has a problem for the parse-machine!! For example "term" will reduce immediately to "expression" without looking for extra factors. So this grammar is probably not suitable for an LR parser.... There are a number of extra rules to compensate for the lack of ebnf constructs. We also have to split rules that have a different number of parse tokens in each alternation. (eg factor) TRIGGER RULES .... ',' orset '{' ::= ',' test '{' ; NEGATED CLASSES .... Often when creating a language or data format, we want to be able to negate operators or tests (so that the test has the opposite effect to what it normally would). In the parse script language I use the prefixed "!" character as the negation operator. We can create a whole series of negated "classes" or tokens in the following way. So the "negation" logic is actually stored on the stack, not on the tape. This is useful because we can compile all these negated tokens in a similar way (see the new "quoteset*" implementation. * example of creating a set of negated classes --------- "!*quote*","!*class*","!*begintext*", "!*endtext*" { # save the second token (quote/class/begintext) on the # tape (but not the current tape cell which has other stuff in # it push; ++; put; --; clear; pop; clear; add "not"; # extract the saved token name from the tape ++; ++; get; --; --; # now we have a token "notquote*" / "notclass*" / "notbegintext* push; get; --; put; ++; clear; .reparse } ,,, LOOKAHEAD AND REVERSE REDUCTIONS .... Quotesets have been replaced in the current (aug 2019) implementation of compile.pss with 'ortestset' and 'andtestset'. But the compilation techniques are similar to those shown below. The current implementation of quotesets in compile.pss seems quite interesting. It waits until the stack contains a brace token "{*" until it starts reducing the quoteset list. 
* a quoteset >> 'a','b','c','d' { nop; } so the compile.pss script actually parses "'c','d' { " first, and then resolves the other quotes ('a','b'). This is good because the script can work out the jump-target for the forward true jump (the accumulator is used to keep track of the forward true jump). It also uses the brace as a lookahead, and then just pushes it back on the stack, to be used later when parsing the whole brace block. * bnf rules for parsing quotesets >> quoteset '{' := quote ',' quote '{' ; >> quoteset '{' := quote ',' quoteset '{' ; But this has 2 elements on the left-hand side. This works, but is not considered good grammar (?). QUOTESET REFINEMENTS .... quoteset should be renamed to something else, eg "testset*", because it will contain class tests, begin-tests, end-tests etc. Also we will parse it like this * some rules for parsing/compiling quotesets/testsets tokens -------- quoteset '{' := quote '{' ; quoteset '{' := begintest '{' ; quoteset '{' := endtest '{' ; .... (more test types) ... quoteset '{' := quote ',' quoteset '{' ; quoteset '{' := begintest ',' quoteset '{' ; quoteset '{' := endtest ',' quoteset '{' ; .... (more test types) ... ,,, We will also be able to get rid of various parse rules that are currently in the script compile.pss. The consequence of this parsing technique is that we will be able to include any type of test in a "testset*" sequence ... These multiple testset sequences may be used as a poor-woman's regular expression pattern matcher. * test if the workspace begins with 'http://', ends with '.txt' and * doesn't contain the letter 'x' ------- B"http://", E".txt", ![x] { add " (found text file link!) \n"; print; clear; } ,,,, It doesn't really make sense to combine a text-equals test with any other test, but the other combinations are useful. RABBIT HOPS .... The "quoteset" token syntax parses a string such as >> 'a','b','c','d' { nop; } The comma is the equivalent of the alternation operator (|) in bnf syntax. In my first attempt to parse "quoteset" tokens with the "compile.pss" script compiler, I used a rabbit hop technique. From an efficiency and compiled code size point-of-view this is a very bad way to compile the code, but it seems that it may be useful in other situations. It provides a way to generate functional code when the final jump-target is not known at "shift-reduce" time. * create rabbit hops for the true jump ------ "quote*,*quote*" { clear; add "testis "; get; # just jump over the next test add "\njumptrue 3 \n"; ++; ++; add "testis "; get; add "\n"; # add the next jumptrue when the next quote is found --; --; put; clear; add "quoteset*"; push; # always reparse/compile .reparse } # quoteset ::= quoteset , quote ; "quoteset*,*quote*" { clear; get; ++; ++; add "jumptrue 4 \n "; add "jumptrue 3 \n "; add "testis "; get; add "\n"; # add the next jumptrue when/if the next quote is found --; --; put; clear; add "quoteset*"; push; # always reparse/compile .reparse } ,,,, YACC AND LEX COMPARISON The tools "yacc" and "lex", and the very numerous clones, rewrites and implementations of those tools, are very popular in the implementation of parsers and compilers. This section discusses some of the important differences between the parse-machine and language and those tools. The pp language and machine is (deliberately) a much more limited system than a lex/yacc combination. A lex/yacc-type system often produces c-code or some other language code which is then compiled or run to implement the parser/compiler.
The pp system, on the other hand, is a "text-filter"; it simply transforms one text format into another. For this reason, it cannot perform the complex programmatic "actions" that a yacc-tool can achieve. ADVANTAGES COMPARED TO LEX YACC SYSTEM .... While clearly more limited than a lex-yacc style system, in my opinion the current machine has some advantages: * It is much simpler and should be much easier to understand. * It does not make use of shift-reduce tables. * It should be possible to implement it on much smaller systems. * Because it is a text-filter, it should be more accessible for "playing around" or experimentation. Perhaps it lacks the psychological barrier that a lex-yacc system has for a non-specialist programmer. * It may be simpler and faster to port the system to new platforms/languages. All that is required is a parse-machine "object" in the target language and a script similar to "compilable.c.pss". NAMING OF SYSTEM I don't have a good name for this virtual machine and script language yet. The executable is called 'pp' for pattern-parse, and the source file is called gh.c for no particular reason. The source files are split into .c files where each one corresponds to a particular "object" (data structure) within the machine (eg tapecell, tape, buffer (with stack and workspace)). The name should be short, easy to type, and hopefully positive and fun sounding... HISTORY OF IDEA The file object/gh.c contains detailed information about the development of this idea. EVOLUTION OF THE MACHINE AND LANGUAGE * I initially called this the "chomski" machine and language but decided to change it to something more "unixy". * The a+ and a- commands were initially called "plus" and "minus". * The "ll" and "cc" (line number and character number) registers and commands are recent, but very important, additions because they allow script error messages to pinpoint the line and character number of the error. IMPORTANT FILES EXAMPLE SCRIPTS .... The folder "eg/" contains a limited collection of example scripts to show the capabilities of the language and machine. COMPILE PSS .... This is the main script compiler. It has replaced the handcoded asm.pp file because it is easier to write and maintain. ASM PP .... This file is important because it actually implements the parsing vm scripting language. It is a text file which consists of a series of "instructions" or commands for the parsing virtual machine. These instructions include instructions which alter the registers of the virtual machine; tests, which set the flag register of the machine to true if the test returns true, or else false; and conditional and unconditional jumps which change the instruction pointer for the machine if the flag register is true. Asm.pp also contains labels (lines ending in :). These labels make it much easier to write code containing jumps (a label can be used instead of an instruction number). Because of the similarity of this format to many "assembly" languages, I refer to this as assembly language for the parse virtual machine. LEXICAL PHASE .... Like most systems which are designed to parse and compile context-free languages, Asm.pp has 2 distinct phases: a "lexing" phase and a parse/compile phase. This is shown by the separation of the unix tools "lex" and "yacc", where lex performs the lexical analysis (which consists of the recognition and creation of lexical tokens) and yacc performs parsing and compilation of tokens.
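As a small hedged sketch of those same two phases in a single parse script (the tokens and grammar here are invented and this is not the actual asm.pp code): the statements before the parse> label lex characters into tokens, and the statements after it shift-reduce those tokens.

* lexing and parsing combined in one script (sketch only)
----
# lexing phase: turn raw characters into tokens
read;
[:space:] { clear; }
[0-9] { while [0-9]; put; clear; add "number*"; push; .reparse }
"+" { put; clear; add "plus*"; push; .reparse }

parse>
# parsing phase: shift-reduce the tokens on the stack
pop; pop; pop;
"number*plus*number*" { clear; add "number*"; push; .reparse }
push; push; push;
,,,

A real translator would also combine the attribute values held on the tape (using get, add and put) before pushing the new number* token; that step is left out here to keep the sketch short.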
LEXICAL PHASE ....

Like most systems which are designed to parse and compile context-free
languages, asm.pp has 2 distinct phases: a "lexing" phase and a
parse/compile phase. This mirrors the separation of the unix tools "lex" and
"yacc", where lex performs the lexical analysis (the recognition and
creation of lexical tokens) and yacc performs the parsing and compilation of
those tokens.

In the asm.pp program, as with many scripts that can be written in the parse
language, the lexical and compilation phases are combined in the same
script. In the first phase, the program performs lexical analysis of the
input (which is an uncompiled script in the parse script language) and
converts certain patterns into "tokens". In this system, a "token" is just a
string (text) terminated with an asterisk (*) character. Asm.pp constructs
these tokens by using the "add" machine instruction, which appends text to
the workspace buffer. Next, the parse token is "pushed" onto the "stack". I
have quoted these words because the stack buffer is implemented as a string
(char *) buffer, and "pushing" and "popping" the stack really just involves
moving the workspace pointer back and forth between asterisk characters. I
used this implementation because I thought that it would be fast and simple
to implement in c. It also means that we don't have to worry about how much
memory is allocated for the stack buffer or for each of its items: as long
as there is enough memory allocated for the workspace buffer (which is
actually just the end of the stack buffer), there will be enough room for
the stack.

The lexical phase of asm.pp also involves preserving the "attribute" of the
parse token. For example, if we have some quoted text such as 'hannah', then
our parse token may be "quoted.text*" and the attribute is the actual text
between the quotes, 'hannah'. The attribute is preserved on the machine's
tape data structure, which is an array of string cells (with no fixed size)
in which data can be inserted at any point by using the tape pointer.

PARSING PHASE ....

The parsing phase of the asm.pp compiler involves recognising and
shift-reducing the token sequences that are on the machine "stack". These
tokens are just strings post-delimited with the '*' character. Because the
tokens are text, they are popped onto the workspace buffer and then
manipulated using the workspace text commands.
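Because both the stack and the tokens are plain text, a shift-reduce step
can be pictured as nothing more than string surgery on the end of the stack
buffer. The following c fragment is only a sketch of that idea (the function
names are invented here, and this is not the code in gh.c): it pushes three
tokens and then reduces "quote*,*quote*" at the top of the stack to
"quoteset*", as in the rabbit-hop rules above.

* pushing tokens and shift-reducing on a string stack (illustrative c)
---------
  #include <stdio.h>
  #include <string.h>

  /* the stack is one char buffer of '*'-terminated tokens,
     eg "quote*,*quote*"                                              */
  static void push_token(char *stack, const char *token) {
      strcat(stack, token);
      strcat(stack, "*");
  }

  /* if the stack ends with 'pattern', chop it off and push 'replacement' */
  static int reduce(char *stack, const char *pattern, const char *replacement) {
      size_t sl = strlen(stack), pl = strlen(pattern);
      if (sl < pl || strcmp(stack + sl - pl, pattern) != 0)
          return 0;                  /* no match at the top of the stack */
      stack[sl - pl] = '\0';         /* "pop" the matched tokens         */
      push_token(stack, replacement);
      return 1;
  }

  int main(void) {
      char stack[256] = "";
      push_token(stack, "quote");    /* stack is now "quote*"            */
      push_token(stack, ",");        /* stack is now "quote*,*"          */
      push_token(stack, "quote");    /* stack is now "quote*,*quote*"    */
      reduce(stack, "quote*,*quote*", "quoteset");
      printf("%s\n", stack);         /* prints "quoteset*"               */
      return 0;
  }
,,,

In the real machine the workspace sits at the end of this same buffer, as
described above, and the script-level push, pop and test commands do the
pointer movement for you.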
TEST.COMMANDS.PSS ....

The file test.commands.pss has a set of examples of correct and incorrect
syntax, as determined by the asm.pp compiler. It also contains comments
about the use of the syntax elements.

VIM AND PP

Since I usually edit with vim (I don't have a copy of "sam" or "acme"), here
are some techniques for using vim with the pp tool. The vim mappings and
commands below are useful for checking that pp "one-liners", and pp scripts
and script fragments that are within a text document, actually compile and
run. Perhaps this is a clumsy way of achieving "literate programming"
(Knuth).

The first mapping below can only run the script with a static input "abc",
which is not very useful, but it at least tests whether the script compiles
properly. The compiled script will be saved in "sav.pp". The multiline
snippet is embedded in "---" and ",,," tags, each on a line of its own.

* create a vim mapping to run a script embedded in a text document
 >> map ,pp :?^ *---?+1,/^ *,,,/-1w! test.pss \| !pp -f test.pss -i "abc"

* create a vim mapping to execute the current line as a pp "one-liner"
 >> map ,pl :.w !sed 's/^ *>>//' \| bash

* create a vim command to execute the current line as a pp "one-liner"
 >> command! Ppl .w !sed 's/^ *>>//' | bash

The mappings and commands above are for putting in the vimrc file. To create
them within the editor, prepend a ":" to each mapping or command, eg

 >> :command! Ppl .w !sed 's/^ *>>//' | bash

* create a vim command to run a script embedded in a text document,
* with the input provided as an argument to the vim command
 >> com! -nargs=1 Pp ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pp -f test.pss -i "<args>"

This is useful since scripts always need some input.

* the same as above, but interactive. NOT working!!
 >> com! -nargs=1 Ppi ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pp -If test.pss -i "<args>"

STATUS

25 august 2019

Great progress has been made. compile.pss has all sorts of nice new syntax,
such as negated text-equals tests, eg !"abc" { ... }. Almost all tests can
be negated. There is now an AND concatenation operator (.). Begin blocks and
begin-tests in quotesets are in progress. compile.pss has replaced
asm.handcode.pp for compiling scripts.

HISTORY

See object/gh.c for detailed development history.

2019

For a number of years I have been working on a project to write a virtual
machine for pattern parsing. The source code is located at
~/sf/htdocs/books/pars/gh.c or https://bumble.sf.net/books/pars/gh.c

This virtual machine can be (and is) used to implement a script language for
parsing and compiling some context-free languages. (The implementation is in
'asm.pp'.) The project is now at a stage where useful scripts can be written
in the parse script language.

The purpose of the virtual machine is to be able to parse and transform
patterns which cannot normally be dealt with through "regular expressions",
i.e. patterns which are not "regular languages". Possibly the simplest
example of one of these would be palindromes (eg "aba", "hannah", "anna").
The machine also allows a script language to describe patterns and
transformations, and this language has similarities to sed and to awk. In
fact the whole idea was inspired by sed and its limitations for context-free
languages.

DESIGN PHILOSOPHY FOR THE MACHINE

When designing the parse machine, I wanted to make its capabilities as
limited as possible while still being able to properly parse and translate
context-free languages. Related to this idea was the aim to make the machine
implementable in the smallest possible way. Also, I deliberately excluded
the use of regular expressions, so that the script writer would not be
tempted to try to "parse" context-free patterns with them. The general
design of the syntax and command-line usage is inspired by various unix
tools, such as sed/grep/awk/perl.

IDEAS

Palindromes are also interesting because they can be parsed with the
simplest possible recursive descent parser.

* pseudo code
---------
  parsePalindrome (text) {
    if text.length <= 1 { return true }
    n = text.firstCharacter
    m = text.lastCharacter
    if n <> m { return false }
    newText = text - first and last characters
    return parsePalindrome(newText)
  }
,,,

But the "pp" machine does not use recursive descent parsing. In fact, the gh
machine was written because recursive descent parsing seemed aesthetically
unpleasing. The machine is still not finished, but it is getting there.

The actual virtual machine has been written in c, along with an interpreting
loop that allows the machine to be inspected and tested, as well as programs
to be loaded and saved. The machine uses a type of "assembly language" for
saving machine instructions into a text file. There are more comprehensive
comments in the c source code file gh.c.

DOCUMENT HISTORY

4 September 2019
  Adding some ideas about parsing optional elements.

23 August 2019
  Trying to organise this document.