==: The Parsing Virtual Machine and Script Language

AN OVERVIEW

This booklet is about the pattern-parsing virtual machine and script
language "pep". The executable file is /books/pars/pep and is compiled
from the c source code "pep.c" /books/pars/object/pep.c

The virtual machine and language allow simple "LR" bottom-up
shift-reduce parsers to be implemented in a limited script language
with a syntax which is very similar to the "sed" unix stream editor.

[[ img/pp.interactive.screenshot.png "Pep in interactive mode" >> ]]

As far as I am aware, this is a new approach to parsing context-free
languages, and judging by the scripts and tests written so far, it has
great potential. The script language is deliberately limited in a
number of ways (it does not have regular expressions, for example),
but it seems to be an interesting tool for learning about compiler
techniques.

DOCUMENTATION

This file "pars-book.txt" is the principal documentation about the
pattern-parser machine and language. This should also be available as
html at http://bumble.sf.net/books/pars/pars-book.html

There is also a lot of documentation in the "pep.c" file (which
implements the virtual machine) as well as in the "compile.pss" file,
which converts scripts into machine assembler programs which can then
be loaded by the pep tool. Also, most example scripts in the
/books/pars/eg/ folder have notes at the beginning of the script.

COMMAND LINE USAGE

* specify script and input on the command line
------
 pep -e "read; print; print; clear;" -i "abcXYZ"
 # prints "aabbccXXYYZZ"
,,,,

* load a script from file and start an interactive session to view/debug
>> pep -If scriptfile somefile.txt

* load an "assembly" file into the machine's program and view/debug
>> pep -Ia asmfile somefile.pss

* convert a script to compilable java code
>> pep -f translate.java.pss script

ONE LINE EXAMPLES

The pep tool may have useful applications in unix "one-liners", but
its main power is in the implementation of simple context-free
languages and compilers.

* remove multiple consecutive instances of any character
>> read; !(==) { put; print; } clear;

* double space a text file
>> pep -e "r;'\n'{a'\n';}t;d;" /usr/share/dict/words
>> pep -e "r;'\n'{t;}t;d;" /usr/share/dict/words   # better

* the same as above with long command names
>> pep -e "read; '\n' { add '\n'; } print; clear;" someFile

* print a string in reverse order
>> read; get; put; clear; (eof) { get; print; }

* convert tabs to 2 spaces
>> read; [\t]{d;a '  ';} t;d;

* print all dictionary words which end with "ess"
>> pep -e 'r; ![:space:] { whilenot [:space:]; [a-z].E"ess"{add "\n";t; }} d;' /usr/share/dict/words

* convert all whitespace (eg [ \r\n\t\f]) to dots
>> read; [:space:] {d;a'.';} t;d;
>> read; [:space:] {clear; add '.';} print; clear;
>> read; [ \r\n\t\f] {clear; add '.';} print; clear;

* double every instance of vowels
>> pep -e "read; [aeiou] { put; get; } print;clear;" -i "a tree"

* only print text within single quotes
>> read; "'" { until "'"; print; } clear;

* remove multiple consecutive instances of the character "a"
>> read; print; "a" { while [a]; } clear;

* number each line of the input
>> read; "\n" { lines; add " "; } print; clear;
>> read; "\n" { lines; a " "; } t; d;   # the same, with short commands

* print the number of lines in the input stream
>> read; (eof) { add "lines="; lines; add "\n"; print; }

* the same using the accumulator
>> read; "\n" { a+; } clear; (eof) { add "lines: "; count; print; }

* delete leading whitespace (spaces, tabs) from the start of each line
>> read; print; "\n" { while [:space:]; } clear;

* print only lines containing "fox"
>> r; E'\n',(eof) {put; replace "fox" ""; !(==) { swap; print; } clear; }

* print only lines in the input containing "fox" OR "dog"
----
r; E'\n',(eof) {
  put; replace "dog" ""; replace "fox" "";
  !(==) { swap; print; } clear;
}
,,,

* delete whitespace from the input stream
>> r; ![:space:] {print;} d;

* insert 5 blank spaces at the beginning of each line (make a page offset)
>> r; "\n" { add '     '; } print; clear;

* print only the first ten lines of the input stream
>> read; print; clear; lines; "10" {quit;} clear;

* delete trailing whitespace (spaces, tabs) from the end of each line
----------
read;
[ \t] {
  while [ \t\r];
  read;
  E"\n" { clear; add "\n"; }
}
print; clear;
,,,

DOWNLOAD

A ".tar.gz" archive file of the c source code can be downloaded from
the https://sourceforge.net/projects/bumble/ folder.

COMPILING SOURCE CODE

Hopefully, in the future, I will package this using "autotools" and
"automake" to allow the creation of packages for apt, brew etc. The
main folder, source file and booklet are called pars/,
pars/object/pep.c and pars-book.txt

* obtain, extract and compile the "pep" source code
-------
 # extract the tar ball
 tar xvzf pep.tar.gz
 cd pars/object
 # compile the pep tool using the object files in the "object" folder.
 make pep
 cd ..
 # create a symlink if you like into the bin folder.
 ln -s "$PWD/pep" /usr/local/bin/pep
,,,

At the moment (June 2021) the pep executable looks for the file
'asm.pp' in the same folder where it is run, or in the folder contained
in the environment variable $ASMPP. I have not yet written "make
install" in the Makefile in the pars/object/ folder to copy ./pep and
asm.pp to /usr/local/bin and /usr/local/etc

A very basic "Makefile" is available in the /books/pars/object/ folder.

* build the pep tool
>> cd pars/object/; make pep

* manually rebuild the libmachine.a library file
-----
 cd pars/object
 # rebuild all object files
 gcc -c *.c
 # not all of these files are really necessary.
 ar rcs libmachine.a buffer.o charclass.o colours.o command.o \
   exitcode.o instruction.o labeltable.o machine.o machine.interp.o \
   machine.methods.o parameter.o program.o tape.o tapecell.o
,,,,

The libmachine.a file can be used when compiling executables from c
source code generated by the script "translate.c.pss".
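As a concrete illustration, a generated c file might be compiled and
linked against libmachine.a roughly as follows. This is only a sketch:
the file name "myscript" is hypothetical, and the exact gcc invocation
(library path etc) may differ on your system.

------
 # translate a pep script to c, then compile and link it
 pep -f translate.c.pss myscript.pss > myscript.c
 gcc -o myscript myscript.c -L pars/object -lmachine
,,,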
TRANSFORMATIONS AND COMPILATIONS

The "pep" virtual machine and language is designed to transform one
textual (data) format into another, or to compile/"transpile" one
context-free language into another. Examples might be:

* transform a csv (comma-separated-values) file into a json data format.
* check the syntax of a JSON text data file, or convert it to another format.
* convert markdown text into html or LaTeX.
* convert from "infix" arithmetic notation to "postfix" notation.
* compile a simple computer language into an assembly language.
* properly indent a computer language source code file.

DEBUGGING

The pep virtual machine is more complicated than the "sed" (the unix
stream editor) virtual machine (sed only has 2 string registers, the
"work-space" and the "hold-space"), so you may find yourself needing
to debug a script. There are a number of ways to do this.

* see how a particular script is compiled to "assembler" format
>> pep -f compile.pss script

The compiled script will be printed to stdout and saved in sav.pp

* load a script and view/execute/step through it interactively
>> pep -If someScript input.txt

* interactively view how some script is being compiled by "asm.pp"
>> pep -Ia asm.pp someScript
>> pep -a asm.pp someScript

Now you can step through the compiled program "asm.pp" and watch as it
parses and compiles "someScript". Generally, use "rr" to run the whole
script, and "rrw text" to run the script until the workspace is some
particular text. This helps to narrow down where the asm.pp compiler
is not parsing the input script correctly.

Once in an interactive "pep" session, there are many commands to run
and debug a script. For example:
--
 n    - execute the next instruction in the program
 m    - view the state of the machine (stack/workspace/registers/tape/program)
 rrw  - run the script until the workspace is exactly some text
 rre  - run the script until the workspace ends with something
 rr   - run the whole script from the current instruction
 M.r  - reset the virtual machine and input stream (but not the compiled program)
,,

TECHNIQUES FOR DEBUGGING ....

A very useful technique for debugging is to display the stack during
the parsing phase, as below.

* visualise the stack token reductions with line/character numbers
------
parse>
  lines; add ":"; chars; add " ";
  print; clear;
  add "\n"; unstack; print; clip; stack;
,,,

COMMON BUGS ....

* make sure that you are pushing as many times as there are tokens
>> add "noun*verb*noun*"; push; push;   # << error, 3 tokens, 2 pushes

* make sure there is at least one read command in the script
>> "."{ clear; } print; clear;   # << error: no read in script

SCRIPT EXAMPLES

* remove whitespace only at the beginning of the input stream
>> begin { while [:space:]; clear; } read; print; clear;

* print alphanumeric words in the input stream, one per line
----------------------
read;
[:alpha:] {
  while [:alpha:];
  add "\n"; print; clear;
}
clear;
,,,

* print assignments in the form abc:333 or val = 66
----------------------
r;
":","=" { add "*"; push; }
[:alpha:] {
  while [:alpha:];
  put; clear; add "id*"; push; .reparse
}
[0-9] {
  while [0-9];
  put; clear; add "num*"; push; .reparse
}
!"" {d;}
parse>
pop; pop; pop;
"id*:*num*", "id*=*num*" {
  clear; ++; ++; get;
  add " assigned to '";
  --; --; get; add "'\n";
  print; clear;
}
push; push; push;
,,,

* print the number of alphabetical words in the input stream
---------------------
read;
[:alpha:] { while [:alpha:]; a+; }
clear;
(eof) {
  add "Words in file: "; count; add "\n";
  print; clear;
}
,,,

* print words beginning with "www." as html links
------
read;
![:space:] {
  whilenot [:space:];
  B"www." {
    put; clear;
    add "<a href='"; get; add "'>"; get; add "</a>";
    add "\n"; print; a+;
  }
}
clear;
(eof) {
  add "Http urls in file: "; count; add "\n";
  print; clear;
}
,,,
{ put; clear; add ""; get; add ""; add "\n"; print; a+; } } clear; (eof) { add "Http urls in file: "; count; add "\n"; print; clear; } ,,, MACHINE DESCRIPTION The parsing virtual machine consists of a number of parts or registers which I will describe in the following subsections. MACHINE ELEMENTS D- a stack: which can contain "parse tokens" if the language is used for parsing or any other text data. - a workspace buffer: This is where all "text change" operations within the machine are carried out. It is similar in concept to a register within a "cpu" or to the "Sed" stream editor pattern space. The workspace is affected by various commands, such as @clear, @add, @indent, @get, @push, @pop etc - a tape: which is an array of text data which is synchronized with the machine stack using a tape pointer. The tape is manipulated with the @get and @put commands - a tape-pointer or the current tape cell: This variable determines the current tape element which will be used by @get and @put commands. The tape pointer is incremented with the ++ command and decremented with the -- command. - a peep character: this character is not directly manipulable, but it constitutes a very simple "look ahead" mechanism and is used by the "while" and "whilenot" commands - a counter : This counter or "accumulator" is an integer variable which can be incremented with the command @plus ,decremented with the command @minus ,and set to zero with the command @zero . WORKSPACE BUFFER .... The workspace buffer is the heart of the virtual machine and is analogous to an "accumulator" register in a cpu chip. All incoming and outgoing text data is processed through the workspace. All machine instructions either affect or are affected by the state of the workspace. The workspace buffer is a buffer within the virtual machine of the stream parsing language. In this buffer all of the text transformation processed take place. For example, the commands clear, add, indent, newline all affect the text in the workspace buffer. The workspace buffer is analogeous to a processor register in a non-virtual cpu. In order to manipulate a value, it is generally necessary to first load that value into a cpu register. In the same way, in order to manipulate some text in the pep machine, it is necessary to first load that text into the workspace buffer. This can be achieve with the @get, @pop, @read commands. STACK .... * machine instructions relating to the stack >> push, pop The stack is used to store and access the parse tokens that are constructed by the virtual machine while parsing input. ** parse tokens Within the virtual machine of the parse script language the "stack" structure is designed to hold and contain the "parse tokens" parse tokens can be any string ended by the delimiter eg: >> add "set*"; DELIMITER REGISTER .... The delimiter register determines what character will be used for delimiting parse tokens on the stack when using the "push" and "pop" commands. This can be set with the "delim" command. The default parse-token delimiter is the '*' asterix character. * set the parse token delimiter to '/' ------- begin { delim '/'; } read; "(",")" { put; d; add "bracket/"; push; } ,,, The delimiter character will often set in a 'begin' block. I normally use the "*" character, but "/" or ";" might be good options. Apart from the parse-token delimiter character, any character (including spaces) may be used in parse tokens. 
FLAG REGISTER ....

The flag register affects the operation of the conditional jump
instructions "jumpfalse" and "jumptrue", and it is affected by the
test instructions such as "testis", "testtape", "testeof" etc. It is
analogous to a "flags" register in a cpu, but (currently) it only
contains one boolean (true/false) value.

* machine assembler instructions relating to the flag register
>> testis, testbegins, testends, testtape, testeof
>> jumptrue, jumpfalse

The script writer does not read or write the machine flag register
directly. It is set automatically by the testing instructions.

ACCUMULATOR REGISTER ....

The machine contains 1 integer accumulator register that can be
incremented (with "a+"), decremented (with "a-") and set to zero (with
"zero"). This register is useful for counting occurrences of
miscellaneous elements that occur during parsing and translating.

An example of the use of the accumulator register is given during the
parsing of "quotesets" in old versions of the "compile.pss" script.
The accumulator in this case keeps track of the target for true-jumps.

* count how many "x" characters occur in the input stream
>> r; 'x' {a+;} d; (eof) { add ' # of Xs == '; count; print; }

PEEP REGISTER ....

The peep buffer is a single character buffer which stores the next
character in the input stream. When a 'read' command is performed, the
current value of the peep buffer is appended to the 'workspace' buffer
and the next character from the input stream is placed into the peep
buffer.

The "end-of-stream" tests (eof) (EOF) check to see if the peep buffer
contains the end of input stream marker. The 'while' command reads
from the input stream while the peep buffer is, or is not, some set of
characters. For example
>> while [abc];
reads the input stream while the peep buffer is any one of the
characters 'abc'.

TAPE ....

* machine instructions with an effect on the tape
>> get, put, ++, --, pop, push, mark, go

The tape is an array of string cells (with memory allocated
dynamically), each of which can be read or written using the workspace
buffer. The tape cell array also includes a tape cell pointer; the
cell it points to is usually called the "current cell".

CURRENT CELL OF TAPE ARRAY ....

* instructions affecting the current cell of the tape
>> ++, --, push, pop, mark, go

The current cell of the tape is a very important mechanism for storing
attributes of parse tokens, and then manipulating those attributes.
The pep virtual machine has the ability to "compile" or translate one
text format into another. (I call this compilation because the target
text format may be an assembly language - for either a real or a
virtual machine.)

In the parsing phase of a script, the attributes of different parse
tokens are accessed from the tape, combined and manipulated in the
workspace buffer, and then stored again in the current cell of the
tape. This means that a script which is transforming some input stream
may finish with the entire transformed input in the 1st cell of the
tape structure. The script can then print that cell to stdout (with
"get; print;", which allows further processing by other tools in the
pipe chain), or else write the contents of the cell to the file
'sav.pp' (with "get; write;").

The current cell is also affected by the stack machine commands "pop"
and "push". The push command increments the tape pointer (current
cell) by one, and the pop command decrements the tape pointer by one.
This is a simple but powerful mechanism that allows the tape pointer
to stay in sync with the stack. After a "push" or "pop" command the
tape pointer will be pointing at the correct tape cell for that item
on the stack.

In some cases, it may be easiest to see how these mechanisms work by
running the machine engine in interactive mode (pep -If script input)
and stepping through a script or executing commands at the prompt.

The tape pointer is also incremented and decremented by the ++ and --
commands; these commands are mainly used during the compilation phase
to access and combine attributes in order to transform the input into
the desired output.
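The following fragment sketches this attribute manipulation. It
assumes two "num*" tokens are already on the stack, with their digits
stored in the corresponding tape cells (the "pair*" token name is just
an invented example):

----
pop; pop;
"num*num*" {
  clear;
  get;              # attribute of the first token
  add " and ";
  ++; get; --;      # append the attribute of the second token
  put;              # store the combined text in the first cell
  clear; add "pair*"; push;
}
push; push;
,,,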
SYNTAX OF THE SCRIPT LANGUAGE

The script language, which is implemented in the file
books/pars/compile.pss, has a syntax very similar to the "sed" stream
editor. Unlike sed, it also allows long names for commands (eg "clear"
instead of "d", "add" instead of "a"). Each command has a long and a
short form. All commands must be terminated with a semicolon, except
for the following:
>> .reparse .restart parse>

White-space is not significant in the syntax of the parse-script
language, except within ' and " quote characters and square brackets [].

* random whitespace
-----
read; !
   [a-e]{t;}
d;
,,,,

Braces "{" and "}" are used to define blocks of commands (as in sed,
awk and c).

LANGUAGE FEATURES ....

The script language (and its syntax) is implemented in the file
compile.pss. Some commands, such as ".reparse" and ".restart", affect
the flow of the program, but not the virtual machine.

* tests on the workspace buffer, followed by a block of commands
>> [a-z] { print; clear; }

Most scripts start with "read;" or "r;" (which is the abbreviated
equivalent). This reads one character from the input stream. Whereas
sed and awk are line oriented (they process the input stream one line
at a time), pep is character orientated (the input stream is processed
one character at a time).

As with sed and awk, pep scripts have an implicit loop. When the
interpreter reaches the end of the script, it jumps back to the first
command (usually "read") and continues looping until the input stream
is finished.

CHARACTER CLASSES ....

Character classes are written [:space:] [:alnum:] etc. Currently
(November 2019), the c language implementation of the parse machine
and language uses plain ctype.h character classes as a way of grouping
characters. These classes are important in a unicode setting because
they allow specifying types of characters in a locale-neutral way. The
"while" and "whilenot" commands can use character classes as their
argument.

In the future, scripts may be able to define new character classes in
the following way.

* define new named character classes
----
begin {
  class "brackets" [{}()];
  class "idchar" [a-z],[.$];
}
# now use the new character class
[:brackets:] { while [:brackets:]; clear; }
print;
,,,

* check if the workspace is only alphanumeric characters
>> r; ![:alnum:] { add " not alpha-numeric! \n"; print; } clear;

* read the input stream while the peep register is whitespace
>> while [:space:];

QUOTES ....

Both single and double quotes may be used in scripts.
>> r; '"' { add "<< a double quote!\n"; print; } d;

If using the negation operator ! in an "in-line" script, then enclose
the whole thing in single quotes.

* use single quotes in one-liners to avoid special char problems
>> pep -f compile.pss -i '![a-z],"a","b"{nop;}'

The until command already understands escaped characters, so it will
read past an escaped delimiter (eg \").

* single quotes
>> r; 'a' { add 'A'; print; } d;
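As a further small sketch of the two quote styles: each style can
quote the other kind of quote character.

* detect both quote characters in the input
>> r; "'" { add " << a single quote"; } '"' { add " << a double quote"; } t; d;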
COMMENTS ....

Both single line comments (lines starting with "#") and multiline
comments (hash+asterisk .... asterisk+hash) are available. Multiline
comments are useful for disabling blocks of code during script
development, as well as for long comments at the beginning of a
script.

BEGIN BLOCKS ....

Like awk, the parse script language allows 'begin' blocks. These
blocks are only executed once, whereas the rest of the script is
executed in a loop for every character in the input stream.

* an example begin block and script
-----
begin {
  add "Starting script ...\n";
  print; clear;
  # set the token delimiter to /
  delim "/";
}
read; print; clear;
,,,

* create a basic html document
--------
begin { add "<html>\n"; print; clear; }
read;
replace ">" "&gt;"; replace "<" "&lt;";
print; clear;
(eof) { add "\n</html>\n"; print; clear;}
,,,,

CONDITIONS AND TESTS ....

CLASS TESTS ....

The class test checks whether the workspace buffer matches any one of
the characters or character classes listed between the square braces.
A class test is written
>> [character-class] { }

There are 3 forms of the character class test:
 - a list of characters, eg: [abcxyz.,;]
 - a range of characters, eg: [A-M]
 - a named character class, eg: [:space:]

* delete all vowels from the input stream using a list class test
>> read; ![aeiou] { print; } clear;

All characters in the workspace must match the given class, so that a
class test is equivalent to the regular expression "^[abcd]+$"

As in the previous example, class tests, like all other types of
tests, can be negated with a prefixed "!" character. Double and
multiple negation, such as "!!" or "!!!", is a syntax error (since it
doesn't have any purpose).

* only print certain sequences
>> r; [abc-,] { while [abc-,]; print; } clear;
!E".txt" { add "<< non-text url"; print; clear; } * workspace begins with # and only contains digits and # >> B"#".[#0123456789] { } "AND" logic can also be achieved by nesting brace blocks. This may have advantages. * and logic with nested braces ---- B"/" { E".txt" { ... } } ,,,, NEGATED TESTS .... The "!" not operator is used for negating tests. * examples of negated tests >> !B"abc", !E"xyz", ![a-z], !"abc" ... STRUCTURE OF A SCRIPT LEXICAL PHASE .... Like most systems which are designed to parse and compile context-free languages, parse scripts normally have 2 distinct phases: A "lexing" phase and a parse/compile/translate phase. This is shown by the separation of the unix tools "lex" and "yacc" where lex performs the lexical analysis (which consists of the recognition and creation of lexical tokens), and yacc performs parsing and compilation of tokens. In the compile.pss program, as with many scripts that can be written in the parse language, the lexical and compilation phases are combined into the same script. In the first phase, the program performs lexical analysis of the input (which is an uncompiled script in the parse script language), and converts certain patterns into "tokens". In this system, a "token" is just a string (text) terminated in an asterix (*) character. Asm.pp constructs these tokens by using the "add" machine instruction, which appends text to the workspace buffer. Next the parse token is "pushed" onto the "stack". I have quoted these words because the stack buffer is implemented as a string (char *) buffer and "pushing" and "popping" the stack really just involves moving the workspace pointer back and forth between asterix characters. I used this implementation because I thought that it would be fast and simple to implement in c. It also means that we dont have to worry about how much memory is allocated for the stack buffer or each of its items. As long as there is enough memory allocated for the workspace buffer (which is actually just the end of the stack buffer) there will be enough room for the stack. The lexical phase of asm.pp also involves preserving the "attribute" of the parse token. For example if we have some text such as "hannah" then our parse token may be "quoted.text*" and the attribute is the actual text between the quotes 'hannah'. The attribute is preserved on the machine tape data structure, which is array of string cells (with no fixed size), in which data can be inserted at any point by using the tape pointer. PARSING PHASE .... The parsing phase of the asm.pp compiler involves recognising and shift-reducing the token sequences that are on the machine "stack". These tokens are just strings post-delimited with the '*' character. Because the tokens are text, they popped onto the workspace buffer and then manipulated using the workspace text commands. COMMANDS * all commands have a single letter variants for the commands. eg: p, pop, P, push, etc COMMAND SUMMARY .... All commands have an abbreviated (one letter) form as well. * ++ increments the tape pointer by one (see 'increment' ) * --: decrements the tape pointer by one. (see "decrement;") * mark "text" adds a marker to the current tape cell (used with 'go') * go "text" sets the current tape cell to the marked cell * add "text" (or add 'text') adds text to the end of the workspace * .reparse jumps back or forward to the parse> label. This is used to ensure that all shift reductions take place. 
COMMAND SUMMARY ....

All commands have an abbreviated (one letter) form as well.

* ++ increments the tape pointer by one (see 'increment')
* -- decrements the tape pointer by one (see 'decrement')
* mark "text" adds a marker to the current tape cell (used with 'go')
* go "text" sets the current tape cell to the marked cell
* add "text" (or add 'text') adds text to the end of the workspace
* .reparse jumps back or forward to the parse> label. This is used to
  ensure that all shift reductions take place.
* clip removes the last character from the workspace
* clop removes the first character from the workspace
* quit exits the script without reading anything more from standard
  input (like the sed command 'q')
* clear sets the workspace to a zero length string. Equivalent to the
  sed command
  >> s/^.*$//;
* put puts the contents of the workspace into the current item of the
  tape (as indicated by the tape-pointer)
* get gets the current item of the tape and adds it to the *end* of
  the workspace with *no* separator character
* swap swaps the contents of the current tape cell and the workspace
  buffer.
* count appends the integer counter to the *end* of the workspace
* a+ increments the accumulator variable/register in the virtual
  machine by 1.
* a- decrements the counter variable by one
* zero sets the counter to zero
* lines (or 'll') appends the line number register to the workspace
  buffer.
* nolines sets the automatic line counter to zero.
* chars (or 'cc') appends the character number register to the
  workspace buffer
* nochars sets the automatic character counter to zero.
* print prints the contents of the workspace buffer to the standard
  output stream (stdout). Equivalent to the sed command 'p'
* pop pops one token from the stack and adds it to the -beginning- of
  the workspace
* push pushes one token from the workspace onto the stack; it reads up
  to the first star "*" character in the @workspace buffer and pushes
  that section of the buffer onto the stack.
* unstack pops the entire stack as a prefix onto the workspace buffer
* stack pushes the entire workspace (regardless of any token
  delimiters in the workspace) onto the stack.
* replace replaces a string in the workspace with another string.
* read reads one more character from stdin.
* state prints the current state of the virtual machine to the
  standard output stream. Maybe useful for debugging; I may remove
  this command.
* until 'text' reads characters from stdin until the workspace ends
  with the given text (reading past escaped delimiters). This may be
  used for capturing quoted strings etc.
* cap converts the workspace to 'capital case' (first upper, then all
  lower)
* lower converts all characters in the workspace to lowercase
* upper converts all characters in the workspace to upper case.
* while class; reads characters from the input stream while the peep
  character is in the given class
* whilenot class; reads characters from the input stream while the
  peep register is *not* the given character or class
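As a small sketch of the case-conversion commands just listed, the
following script capitalises each word of the input using "cap":

------
read;
[:space:] { print; clear; }
whilenot [:space:];
cap; print; clear;
,,,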
ADD ....

The "add" command appends some text to the end of the 'workspace'
buffer. No other register or buffer within the virtual machine is
affected. The command is written (with either quote style):
>> add 'text';
>> add "text";

* add a quote after the ':' character
>> r; [\:] { add "\""; } print; clear;   # non java fix!!
>> r; [:] { add "\""; } print; clear;    # java version

But it shouldn't be necessary to escape ':'

The quoted argument may span more than one line. For example
------
begin {
  add '
   A multiline argument
   for "add".
  ';
}
read; print; clear;
,,,

It is possible to use escaped characters such as \n \r \t \f or \" in
the quoted argument.

* add a newline to the workspace after a full stop
>> r; E"." { add "\n"; } print; clear;

* add a space after every character of the input
----
read; add ' '; print; clear;
# or the same using abbreviated commands
# r; a ' ';t;d;
,,,

Add is used to create new tokens and to modify the token attributes.

* create and push (shift) a token using the "add" command
--------
"style*type*" {
  clear; add "command*"; push;
}
,,,,

The script above does a shift-reduce operation while parsing some
hypothetical language. The "add" command is used to add a new token
name to the 'workspace' buffer, which is then pushed onto the stack
(using the 'push' operation, naturally). In the above script, the
added text can be seen to be a token name (as opposed to some other
arbitrary piece of text) because it ends in the "*" character.
Actually any character could be used to delimit the token names, but
"*" is the default.

The add command takes -one- and only one parameter, or argument.

REPARSE COMMAND ....

This command makes the interpreter jump to the parse> label. The
.reparse command takes no arguments and is *not* terminated with a
semi-colon. It is written ".reparse".

The .reparse command is important for ensuring that all
shift-reductions occur at a particular phase of a script. If there is
no 'parse>' label in the script, then it is an error to use the
".reparse" command.

CLIP COMMAND ....

This command removes one character from the end of the 'workspace'
buffer and sends it into the void. It deletes it. The character is
removed from the -end- of the workspace, and so it represents the last
character which would have been added to the workspace by a previous
'read' operation.

The command is useful, for example, when parsing quoted text, used in
conjunction with the 'until' command, as in the following script,
which only prints text which is contained within double quote
characters.

* parse and print quoted text
>> read; '"' { clear; until '"'; clip; print; } clear;
>> read; "\"" { clear; until "\""; clip; print; } clear;

If the above script receives the text 'this "big" and "small" things'
as its input, then the output will be 'bigsmall'. That is, only that
which is within double quotes will be printed.

The script above can be translated into plain english as follows:

a- If the workspace is a " character then
 - clear the workspace
 - read the input stream until the workspace -ends- in "
 - remove the last character from the workspace (the ")
 - print the workspace to the console
 - end if

The script prints the contents of the quoted text without the quote
characters, because the 'clear' and 'clip' commands got rid of them.

CLOP COMMAND ....

The "clop" command removes one character from the front of the
workspace buffer. The clop command is the counterpart of the 'clip'
command.

* print only quoted words without the quotes
>> r; '"' { until '"'; clip; clop; print; } clear;

COUNT COMMAND ....

The count command adds the value of the counter variable to the end of
the workspace buffer. For example, if the counter variable is 12 and
the workspace contains the text 'line:', then after the count command
the workspace will contain the text
>> line:12

* count the number of dots in the input
>> read; "." { a+; } clear; (eof) { count; print; }

The count command only affects the 'workspace' buffer in the virtual
machine.

QUIT COMMAND ....

This command immediately exits the script without processing any more
script commands or input stream characters. This command is similar to
the "sed" command 'q'.
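As a small sketch of "quit" in use, the following one-liner prints the
input only up to (and including) the first newline, then stops:

* print the first line of the input and stop
>> read; E"\n" { print; quit; }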
DECREMENT COMMAND ....

The decrement command reduces the tape pointer by one, and thereby
moves the current 'tape' element one to the 'left'. The decrement
command is written "--" or "<"

If the current tape element is the first element of the tape, then the
tape pointer is not changed and the command has no effect.

** See also 'increment', 'put', 'get'

INCREMENT COMMAND ....

This command is simple but important. The command adds -one- to the
'tape'-pointer. The command is written
>> ++ or >

Notice that the only part of the machine state which changes is that
the 'tape-pointer' (the arrow in the 'tape' structure) is incremented
by one cell.

This command allows the tape to be accessed as a 'tape'. This is
tautological but true. Being able to increment and 'decrement' the
tape pointer allows the script writer and the virtual machine to
access any value on the tape using the 'get' and 'put' commands.

It should be remembered that the 'pop' command also automatically
decrements the tape pointer, in order to keep the tape pointer and the
stack in synchronization.

Since the get command is the only way to retrieve values from the
tape, the ++; and --; commands are necessary; the tape cannot be
accessed using some kind of array index, because the get and 'put'
commands do not take any arguments.

As an example, the following script fragment will print the string
associated with the "value" token.
>> pop; pop; "field*value*" { ++; get; --; print; }

MARK COMMAND ....

The mark command adds a text "tag" to the current tapecell. This
allows the tapecell to be accessed later in the script with the "go"
command. The mark and go commands should allow "offside" or indent
parsing (such as for the python language).

* use mark and go to use the 1st tape cell as a buffer
------
begin { mark "topcell"; ++; }
read;
[:space:] { d; }
whilenot [:space:];
put;
# create a list of urls in the 1st tapecell
B"http:" {
  mark "here"; go "topcell";
  add " "; get; put;
  go "here";
}
clear; add "word*"; push;
(eof) { go "topcell"; get; print; quit; }
,,,,

See the script pars/eg/markdown.toc.pss for an example of using the
"mark" and "go" commands to create a table of contents for a document
from markdown-style underlined headings.

GO COMMAND ....

The "go" command sets the current tape cell pointer to a cell which
has previously been marked with a "mark" command (using a text tag).
The go/mark commands may be useful for offside parsing, as well as
for assembling, for example, a table of contents for a document while
parsing the document structure.

* basic usage
>> read; mark "a"; ++; go "a"; put; clear;

MINUS COUNTER COMMAND ....

The minus command decreases the counter variable by one. This command
takes no arguments and is written
>> a-;

PLUS COUNTER COMMAND ....

This command increments the machine counter variable by one. It is
written
>> a+;

The plus command takes no arguments. Its counterpart is the 'minus'
command.

ZERO COMMAND ....

The "zero" command sets the internal counter to zero. This counter may
be used to keep track of nesting during parsing processes, or for
other mundane purposes such as numbering lines or instances of a
particular string or pattern.

* reset the counter whenever an "x" is read
>> read; "x" { zero; }
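A small sketch using the counter to track nesting depth ("zero", "a+"
and "a-" together):

------
begin { zero; }
read;
"(" { a+; }
")" { a-; }
clear;
(eof) { add "unclosed brackets: "; count; add "\n"; print; }
,,,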
The "chars" or "cc" command appends the value of the current character counter to the workspace register * show an error message with a character number ---- read; [@#$%^&] { put; clear; add "illegal character ("; get; ") "; add "at character number "; chars; add ".\n"; print; clear; } ,,,, I originally named this command "cc" but prefer the longer form "chars" (which is currently implemented as an alias in the script compiler pars/compile.pss) LINES OR LL COMMAND .... The "lines" or "ll" command appends to the workspace the current value of the line counter register. This is very useful when writing compilers in order to produce an error message with a line number when there is a syntax error in the input stream. UNSTACK COMMAND .... Pop the entire stack as a prefix onto the workspace. This may be useful for displaying the state of the stack at the end of parsing or when an error has occurred. Currently (Aug 2019) the tape pointer is not affected by this command. STACK COMMAND .... Push the entire workspace onto the stack regardless of token delimiters. PUSH COMMAND .... The "push" command pushes one token from (the beginning of) the workspace onto the stack. POP COMMAND .... The pop command pops the last item from the virtual machine stack and places its contents, without modification, at the -beginning- of the workspace buffer, and decrements the tape pointer. If the stack is empty, then the pop; command does nothing (and the tape pointer is unchanged). * pop the stack into the workspace. >> pop; * apply a 2 token grammar rule (a shift-reduction). >> pop;pop; "word*colon*" { clear; add 'command*'; push; } The pop command is usually employed in the parsing phase of the script (not the lexing phase); that is, after the "parse>" label. The "pop;" command is almost the inverse machine operation of the "push;" command, but it is important to realise that a command sequence of "pop;push;" does not always equivalent to "nop;" (no-operation). If the stream parser language is being used to parse and translate a language then the script writer needs to ensure that each token ends with a delimiter character (by default "*") in order for the push and pop commands to work correctly. The pop command also affects the "tape" of the virtual machine in that the tape-pointer is automatically decremented by one once for each pop command. This is convenient because it means that the tape pointer will be pointing to the corresponding element for the token. In other words in the context of parsing and compiling a "formal language" the tape will be pointing to the "value" or "attribute" for the the token which is currently in the workspace. PRINT COMMAND .... The print command prints the contents of the workspace to the standard output stream (the console, normally). This is analogous to the the sed 'p' command (but its abbreviated form is 't' not 'p' because 'p' means "pop". ** Examples * Print each character in the input stream twice: >> print; print; clear; * Replace all 'a' chars with 'A's; >> "a" { clear; add "A"; } print; clear; * Only print text within double quotes: >> '"' { until '"'; print; } clear; ** Details The print command does not take any arguments or parameters. The print command is basically the way in which the parse-language communicates with the outside world and the way in which it generates an output stream. The print command does not change the state of the pep virtual machine in any way. Unlike sed, the parse-language does not do any "default" printing. 
PRINT COMMAND ....

The print command prints the contents of the workspace to the standard
output stream (the console, normally). This is analogous to the sed
'p' command (but its abbreviated form is 't', not 'p', because 'p'
means "pop").

** Examples

* print each character in the input stream twice
>> print; print; clear;

* replace all 'a' chars with 'A's
>> "a" { clear; add "A"; } print; clear;

* only print text within double quotes
>> '"' { until '"'; print; } clear;

** Details

The print command does not take any arguments or parameters. The print
command is basically the way in which the parse-language communicates
with the outside world, and the way in which it generates an output
stream. The print command does not change the state of the pep virtual
machine in any way.

Unlike sed, the parse-language does not do any "default" printing.
That is, if the print command is not explicitly specified, the script
will not print anything, and will silently exit as if it had no
purpose. This should be compared with the default behaviour of sed,
which will print each line of the input stream unless the script
writer specifies otherwise (using the -n switch).

In practice, the print command is not used as extensively as in sed,
because if an input stream is successfully parsed by a given script,
then only -one- print statement will be necessary, at the end of the
input stream. However, the print command can be used to output
progress or error messages, among other things.

GET COMMAND ....

This command obtains the value in the current 'tape' cell and adds it
(appends it) to the end of the 'workspace' buffer. The "get" command
only affects the 'workspace' buffer of the virtual machine.

* if the workspace has the "noun" token, get its value and print it
>> "noun*" { clear; get; print; clear; }

PUT COMMAND ....

The put command places the entire contents of the workspace into the
current tape cell, overwriting any previous value which that cell
might have had. The command is written
>> put;

* put the text "one" into the current tape cell and the next one
>> clear; add "one"; put; ++; put;

The put command only affects the current tape cell of the virtual
machine. After a put command the workspace buffer is -unchanged-. This
contrasts with the machine stack 'push' command, which pushes a
certain amount of text (one token) from the workspace onto the stack
and deletes the same amount of text from the workspace buffer.

The put command is the counterpart of the 'get' command, which
retrieves the contents of the current item of the tape data structure.
Since the tape is generally designed for storing the values or
attributes of parse tokens, the put command is essentially designed to
store values of attributes. However, the put command overwrites the
contents of the current tape cell, whereas the "get" command appends
the contents of the current tape cell to the workspace.

The put command can be used in conjunction with the 'increment' ++ and
'decrement' -- commands to store values in tape cells other than the
current tape item.

SWAP COMMAND ....

Syntax: swap;  Abbreviation: 'x'

Swaps the contents of the current tape cell with the workspace buffer.

* prepend the current tape cell to the workspace buffer
>> swap; get;
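The "swap; get;" idiom above is worth tracing step by step (a sketch;
assume the workspace holds "world" and the current cell holds
"hello "):

------
# workspace: "world"         cell: "hello "
swap;
# workspace: "hello "        cell: "world"
get;
# workspace: "hello world"   cell: "world"
,,,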
READ COMMAND ....

The read command reads one character from the input stream and places
that character in the 'peep' buffer. The character which -was- in the
peep buffer is added to the -end- of the 'workspace' buffer.

The read command is the fundamental mechanism by which the input
stream is "tokenized" (also known as "lexical analysis"). The other
commands which perform tokenization are "until", "while" and
"whilenot"; these commands perform implicit read operations.

There is no implicit read command at the beginning of a script (unlike
"sed"), so all scripts will probably need at least one read command.

REPLACE COMMAND ....

This command replaces one piece of text with another in the workspace.
The replace command only replaces plain text, not regular expressions.
It is useful for indenting blocks of text during formatting
operations.

* replace the letter 'a' with 'A' in the workspace buffer
>> replace "a" "A";

The replace command is often used for indenting generated code.

* indent a block of text
>> clear; get; replace "\n" "\n  "; put; clear;

Replace can also be used, in conjunction with the "(==)" tape test, to
check whether the workspace contains a particular character.

* check if the workspace contains an 'x'
-----
# fragment
put; replace 'x' '';
(==) { clear; add "no 'x'"; print; clear; }
,,,,

UNTIL COMMAND ....

* example
>> until 'text';

Reads the input stream until the workspace ends with the given text.

* print any text that occurs between '<' and '>' characters
>> r; "<" { until ">"; print; } clear;

* print only text between quote characters (excluding the quotes)
>> r; '"' { until '"'; clip; clop; print; } clear;

* create a parse token 'quoted' from quoted text
>> r; '"' { until '"'; clip; clop; put; add 'quoted*'; push; } clear;

* print quoted text, reading past escaped quotes (\")
>> r; '"' { until '"'; print; } clear;

The 'while' and 'whilenot' commands are similar to the until command,
but they depend on the value of the 'peep' virtual machine buffer
(which is a single-character buffer) rather than on the contents of
the 'workspace' buffer, like the until command.

** notes

The 'until' command usually forms part of the 'lexing' phase of a
script. That is, the until command permits the script to turn text
patterns into 'tokens'. While in traditional parsing tools (such as
lex and yacc) the lexing and parsing phases are carried out by
separate tools, with the 'pep' tool the two functions are combined.

The until command essentially performs multiple read operations, and
after each read it checks whether the workspace ends with the text
specified in the argument.

WHILENOT COMMAND ....

Reads characters from the input stream into the workspace -while- the
'peep' buffer is -not- a certain character. This is a "tokenizing"
command and allows the input stream to be parsed up to a certain
character without reading that character. The whilenot command does
not exit if it reaches the end of the input-stream (unlike 'read').

* print one word per line
>> r; [:space:] { d; } whilenot [:space:]; add "\n"; print; clear;

(there seems to be a bug in pep with the whilenot "x" quoted-argument
syntax)

* whilenot with a single-character class argument
>> r; whilenot [z]; add "\n"; print; clear;

* another way to print one word per line
>> r; [ ] { while [ ]; clear; add "\n"; } print; clear;

The advantage of the first example is that it allows the script to
tokenise the input stream into words.

WHILE COMMAND ....

The 'while' command in the pattern-parse language reads the input
stream while the 'peep' buffer is any one of the characters or
character sets mentioned in the argument. The command is written
>> while [cdef];

The command takes one argument. This argument may include character
classes as well as literal characters. For example,
>> while [:space:];
reads the input stream while the peep buffer is a whitespace
character. The read characters are appended to the 'workspace' buffer.

The while command cannot take a quoted argument ("xxx"). Negation for
the while command is currently supported using the "whilenot" command.
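As a final peep-register sketch: because "whilenot" stops -before- the
terminating character, it can grab a whole line without consuming the
newline (which is then read on the next cycle).

* mark the end of each line of the input
>> r; whilenot [\n]; add " <<"; print; clear;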
ENDSWITH TEST ....

The 'ends with' test checks whether the workspace ends with a given
string. This test is written
>> E"xyz" { ... }

The script language also contains a structure to perform a simple
equality test based on the content of the workspace, and to execute
commands depending on the result of that test. An example of the
syntax is
>> "ocean" { add " blue"; print; }

In the script above, if the workspace buffer is the text "ocean", then
the commands within the braces are executed, and if not, then not.

The test structure is a simple string equivalence test; there are -no-
regular expressions, and the workspace buffer must be -exactly- the
text which is written between the quote characters, or else the test
will fail (return false) and the commands within the braces will not
be executed.

This command is clearly influenced by the "sed" http://sed.sf.net
stream editor, which has a virtually identical syntax except for some
key elements: in sed, regular expressions are supported, and the first
opening brace must be on the same line as the test structure.

There is also another test structure in the script language which
checks whether the workspace buffer -begins- with the given text; the
syntax looks like this
>> B"ocean" { add ' blue'; print; }

USING TESTS IN THE PEP TOOL

TEST EXAMPLES ....

* if the workspace is not empty, add a dot to the end of the workspace
>> !"" { add '.'; }

* if the end of the input stream is reached, print the message "end of file"
>> (eof) { add "end of file"; print; }

* if the workspace begins with 't', trim a character from the end
>> B"t" { clip; }

TAPE TEST ....

The tape test determines whether the current element of the tape
structure is the same as the workspace buffer. The tape test is
written
>> (==)

This test was included originally in order to parse the sed structure
>> s@old@new@g or s/old/new/g or s%old%new%g

In other words, in sed, any character can be used to delimit the text
in a substitute command.

STACK STRUCTURE IN THE VIRTUAL MACHINE

The 'virtual'-machine of the pep language contains a stack structure
which is primarily intended to hold the parse tokens during the
parsing and transformation of a text pattern or language. However, the
stack can hold any other string data. Each element of the stack
structure is a string buffer of unlimited size.

The stack is manipulated using the pop and push commands. When a value
is popped off the stack, that value is appended to the -front- of the
workspace buffer. If the stack is empty, then the pop command has no
effect.

TAPE IN THE PEP MACHINE

The tape structure in the virtual machine is an unbounded array of
elements. Each of these elements is a string buffer of unbounded size.
The elements of the tape structure may be accessed using the
@increment, @decrement, @get and @put commands.

TAPE AND THE STACK ....

The tape structure and the @stack structure in the virtual machine are
designed to be used in tandem, and several mechanisms have been
provided to enable this. For example, when a "pop" operation is
performed, the @tape-pointer is automatically decremented, and when a
@push operation is performed, the tape pointer is automatically
incremented.

Since the parsing language and machine have been designed to carry out
parsing and transformation operations on text streams, the tape and
stack are intended to hold the values and tokens of the parsing
process.
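Recalling the tape test above, here is a fragment sketching how it
helps with sed's arbitrary substitute delimiter. It assumes the
delimiter character seen after the "s" has already been stored in the
current tape cell, and that the workspace holds the character just
read (the token name "delim*" is just an example):

----
# fragment: is this character the same delimiter we saw earlier?
(==) { clear; add "delim*"; push; }
,,,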
For example >> "word*colon*" { clear; add 'command*'; push } corresponds to the grammar rule >> command := word colon SELF REFERENTIALITY The parse script language is a language which is designed to parse/compile/translate languages. This means that it can recognise/parse/compile, and translate itself. The script books/pars/translate.c.pss is an example of this. Another interesting application of this self-referentiality is creating a new compiling system in a different target language. REFLECTION SELF HOSTING AND SELF PARSING The script "compile.pss" is a parse-script which implements the parse-script language. This was achieved by first writing a hand-coded "assembler" program for the machine (contained in the file "asm.pp"). Once a working asm.pp program implemented a basic syntax for the language, the compile.pss script was written. This makes it possible to maintain and add new syntax to the language using the language itself. A new "asm.pp" file is generated by running >> pep -f compile.pss compile.pss >> asm.new.pp; cp asm.new.pp asm.pp Finally it is necessary to comment out the 2 "print" commands near the end of the "asm.pp" file which are labelled ":remove:" The command above runs the script compile.pss and also uses compile.pss as its text input stream. In this sense the system is "self-hosting" and "self-parsing". It is also a good idea to preserve the old copy of asm.pp in case there are errors in the new compiler. SED THE STREAM EDITOR The concept of "cycles" is drawn directly from the sed language or tool (sed is a unix utility). In the sed language each statement in a sed script is executed once for each -line- in a given input stream. In other words there is a kind of implicit "loop" which goes around the sed script. This loop in some fictional programming language might look like: * pseudo ----- while more input lines do sed script loop ,,, In the current parse-language the cycles are executed for each -character- in the input stream (as opposed to line). SHIFTING AND REDUCING There is one complicating factor which is the concept of multiple shift-reduces during the "shift-reduce parsing one cycle or the interpreter. This concept has already been treated within the @flag command documentation. Another tricky concept is grammar rule precedence, in other words, which grammar rule shift-reduction should be applied first or with greater predence. In terms of any concrete application the order of the script statements determines precedence. SHIFT REDUCTIONS ON THE STACK .... * see the following script fragment ---- pop;pop; "command*command*", "commandset*command*" { clear; add 'commandset*'; push; } "word*colon*" { clear; add 'command*'; push; } push; push; ,,, This script corresponds directly to the (e)bnf grammar rules --- commandset := command , command; commandset := commandset , command; command := word , colon; ,,,, But in the script above there is a problem; that the first rule needs to be applied after the second rule. But it seems now that another reduction should occur namely >> if-block --> if-statement block which can be simply implemented in the language using the statements: ----- pop; pop; "if-statements*block*" { clear; add "if-block*"; push; } ,,,, But the crucial question is, what happens if the statements just written come before the statements which were presented earlier on? The problem is that the second reduction will not occur because the script has already passed the relevant statements. 
PARSE TOKENS

LITERAL TOKENS ....

One trick in the parse script language is to use a "terminal"
character as its own parse-token. This simplifies the lexing phase of
the script. The procedure is just to read the terminal symbol (one or
more characters) and then add the token delimiter character to the
end.

* using literal tokens with the default token delimiter '*'
-----
read;
"{","}","(",")","=" {
  put; add "*"; push; .reparse
}
,,,

PURPOSE OF THE LANGUAGE AND VIRTUAL MACHINE

The language is designed to allow the simple creation of parsers and
translators, without the necessity of becoming involved in the
complexities of something like lex or yacc. Since the interpreter for
the language is based on a virtual machine, the language is platform
independent and has a level of abstraction. The language combines the
features of a tokenizer, a parser and a translator.

AVAILABLE IMPLEMENTATIONS

http://bumble.sourceforge.net/books/pars/object/

This folder contains a working implementation in the c language. The
code can be compiled with the bash functions in the file
"helpers.pars.sh"

COMPARISON BETWEEN PEP AND SED

The pep tool was largely inspired by the "sed" stream editor, both by
its capabilities and by its limitations. Sed is a program designed to
find and replace patterns in text files. The patterns which sed
replaces are "regular expressions". Pep has many similarities with
sed, both in the syntax of its scripts and also in the underlying
concepts. To better understand pep, it may be useful to analyse these
similarities and differences.

SIMILARITIES ....

* The workspace: Both sed and the pep machine have a 'workspace'
  buffer (which in sed is called the "pattern space"). This workspace
  is the area where manipulations of the text input stream are carried
  out.

* The script cycle: The languages are based on an implicit cycle. That
  is to say that each command in a sed or a pep script is executed
  once for each line (sed) or once for each character (pep).

* Syntax: The syntax of sed and pep are similar. Statement blocks are
  surrounded by curly braces {} and statements are terminated with a
  semi-colon. Also, the sed idea of "tests" based on the contents of
  the "pattern space" (or @workspace) is used in the pep language.

* Text streams: Both sed and pep are text stream based utilities, like
  many other unix tools. This means that both sed and pep consume an
  input text stream and produce an output text stream. These streams
  are directed to the programs using "pipes" | in a console window, on
  both unix and windows systems. For example, for sed we could write
  in a console window
  >> echo abcabcabc | sed "s/b/B/g"
  and the output would be
  >> aBcaBcaBc

  However, the pep implementation in c cannot receive input from
  stdin, because stdin is used for receiving interactive commands in
  debug mode. Nevertheless, once a script has been translated to
  another language (eg: python, ruby, java, go, tcl) using the
  translation scripts in the tr/ folder, the translated script can be
  used in a pipe. For example, we can translate a simple script into
  ruby and then run it as a "filter" program in a pipe:
  ----
  pep -f tr/translate.ruby.pss -i "r;t;t;d;" > test.rb
  chmod +x test.rb; echo "abcd" | ./test.rb
  ,,,

  In the case of "pep", the command line could be written
  >> pep -e "r; 'b' {d; add 'B';} print;d;" -i "abcabcabc"
  and the output, once again, will be
  >> aBcaBcaBc
* Lines vs characters: Sed (like AWK) is a "line" based text
  manipulation tool (text is processed one line at a time), whereas
  pep is character based (text is normally processed one character at
  a time). This means that all the instructions in a sed script are
  executed once for each line of the input text stream, but in the
  case of pep, the instructions are executed once for each character
  of the input text stream.

* Strict syntax: The syntax of the pep language is stricter than that
  of sed. For example, in a pep script all commands must end with a
  semi-colon (except .reparse, parse> and .restart, which affect
  program flow) and all statements after a test must be enclosed in
  curly braces. In many versions of sed, it is not always necessary
  to terminate commands with a semi-colon. For example, both of the
  following are valid sed statements.

 >> s/b/B/g
 >> s/b/B/g;

* White space and formatting in scripts: Pep is more flexible with
  the placement of whitespace in scripts. For example, in pep one can
  write

------
"text"
{
  print;
}
,,,,

  That is, the opening brace is on a different line to the test. This
  would not be legal syntax in most versions of sed.

* Command names: Sed uses single character "mnemonics" for its
  commands. For example, "p" is print, "s" is substitute, "d" is
  delete. In the pep language, in contrast, commands also have a long
  name, such as "print" (which prints the workspace), "clear" (to
  clear or delete the workspace), or "pop" (to pop the last element
  off the stack onto the beginning of the workspace). While the sed
  approach is useful for writing very short, terse scripts, the
  readability of those scripts is not good. Pep allows for improved
  readability, as well as terseness if required.

* The virtual machines: While both sed and pep are based on simple
  "virtual machines" (which consist of string registers and commands
  to manipulate those registers), the pep machine is more extensive.
  The sed machine essentially has 2 buffers: the "pattern space" and
  the "hold space". The pep virtual machine, however, has a workspace
  buffer, a stack, a tape array, several counting registers and
  others.

* Regular expressions: The tests or "ranges" in sed, as well as the
  substitutions, are based on regular expressions. In pep, however,
  no regular expressions are used. The reason for this difference is
  that pep is designed to parse and transform a different set of
  patterns than sed. The patterns that pep is designed to deal with
  are referred to formally as "context free languages".

* Negation of tests: The negation operator ! in the "pep" language is
  placed before the test to which it applies, whereas in sed the
  negation operator comes after the test or range. So, in pep it is
  correct to write

 >> !"tree" { ... }

  But in sed the correct syntax is

 >> /tree/! { ... }

* Implicit read and print: Sed implicitly reads one line of the input
  stream for each cycle of the script. Pep does not do this, so most
  scripts need an explicit "read" command at the beginning of the
  script. For example

* a sed script with an implicit line read
 >> s/a/A/g;

* pep has no implicit character read or print
 >> r; replace "a" "A"; print; clear;

BACKUS-NAUR FORM AND THE PATTERN PARSER

Backus-naur form is a way of expressing grammar rules for formal
languages. A variation of BNF is "EBNF", which modifies slightly the
syntax of bnf. There are many versions of bnf and ebnf. There is a
close relationship between the syntax of the 'pep' language and a bnf
grammar.
For example:

 >> "word*colon*" { clear; add "command*"; push; }

corresponds to the backus-naur form grammar rule

 >> command := word colon

PEP LANGUAGE PARSING ITSELF

One interesting challenge for the pep language is to "generate
itself" from a set of bnf rules. In other words, given rules such as
-----
command := word semicolon;
command := word quotedtext semicolon;
,,,,

it should be possible to write a script in the language which
generates output as follows

----
pop;pop;
"word*semicolon*" {
  clear; add "command*"; push; .reparse
}
pop;
"word*quotedtext*semicolon*" {
  clear; add "command*"; push; .reparse
}
push;push;push;
,,,,

When I first thought of a virtual machine for parsing languages, I
thought that it would be interesting and important to build a more
expressive language "on top of" the pep commands. However, that now
seems less important.

TOKEN ATTRIBUTE TRANSFORMATIONS

We can write complete translators/compilers with the script language.
However, it may be nice to have a more expressive format, like the
one shown below.

* a more expressive compiling language built on top of the parse language
----
command := word semicolon { $0 = "" $1 " $2; };
...
,,,

The $n structure will fetch the tape-cell for the corresponding
identifier in the bnf rule at the start of the brace block. This
would be "compiled" to pep syntax as

----
pop;pop;
"word*semicolon*" {
  clear;
  # assembling: $0 = "" $1 " $2;
  add ""; get; add ""; ++; get; --; put; clear;
  # resolve new token
  add "command*"; push; .reparse
}
push;push;
,,,,

SHIFT REDUCTIONS ON THE STACK

Imagine we have a "recogniser" pep script as follows:
-----
read;
'"' { until '"'; clear; add "quote*"; push; }
";" { clear; add "semicolon*"; push; }
[a-z] { while [a-z]; clear; add "word*"; push; }
[:space:] { clear; }

parse>
unstack; add "\n"; print; clip; stack;

pop;pop;
"command*command*","commandset*command*" {
  clear; add 'commandset*'; push; .reparse
}
"word*semicolon*" {
  clear; add 'command*'; push; .reparse
}
pop;
"word*quote*semicolon*" {
  clear; add 'command*'; push; .reparse
}
push; push; push;

(eof) {
  pop; pop;
  "command*", "commandset*" {
    clear; add "Correct syntax! \n"; print; quit;
  }
  clear; add "incorrect syntax! \n"; print; quit;
}
,,,,,

This script recognises a simple language which consists of a series
of "commands" (which are lower case words) terminated with
semicolons. The commands can have an optional quoted argument. At the
end of the input, there should only be one token on the stack (either
command* or commandset*). This script corresponds reasonably directly
to the ebnf rules

----
word := [a-z]+;
semicolon := ';';
commandset := command command;
command := word semicolon;
,,,,

But in the script above there is a problem: the first rule needs to
be applied after the second rule. Each brace block above executes a
reduction according to one of the grammar rules. As an illustration,
the table below sketches the state of the machine just before a
reduction (here, the "if-block" reduction shown earlier):

==:: virtual machine
.. stack, if-statement*, open-brace*, statements*, close-brace*
.. tape, if(a==b), {, a=1; b=2*a; ..., }
.. workspace, ..,

NEGATION OF TESTS

A test such as

 >> "some-text" { #* commands *# }

can be modified with the logic operator "!" (not), as in

 >> !"some-text" { #* commands *# }

Note that the negation operator '!' must come before the test which
it modifies, instead of afterwards as in sed. Multiple negations are
not allowed

 >> !!"some-text" {...}   # incorrect

NEGATION EXAMPLES ....
* print only numbers in the input
 >> r; ![0-9] { clip; add " "; print; clear; }

* print only words not beginning with 'a'
 >> r; E" ",E"\n",(eof) { !B"a" { print; } clear; }

If the end of the input stream has not been reached, then push the
contents of the workspace onto the stack

 >> !(eof) { push; }

If the value of the workspace is not exactly equal to the value of
the current element of the tape, then exit the script immediately

 >> !(==) { quit; }

COMMENTS IN THE PARSER LANGUAGE

Both single line and multiline comments are available in the parser
script language, as implemented in books/pars/asm.pp and compile.pss

* single and multiline examples
----
#* check if the workspace
   is the text "drib" *#
"drib" {
  clear;    # clear the workspace buffer
}
,,,,

The script books/pars/compile.pss attempts to preserve comments in
the output "assembler" code, to make that code more readable.

GRAMMAR AND SCRIPT CONSTRUCTION

My knowledge of formal grammar theory is quite limited. I am more
interested in practical techniques. But there is a reasonably close
correlation between bnf-type grammar rules and script construction.
The right-hand-side of a (E)BNF grammar rule is represented by the
quoted text before a brace block, and the left-hand-side correlates
to the new token pushed onto the stack.

* the rule "nounphrase ::= article noun ;" in a parse script
;" in a parse script >> "article*noun*" { clear; add "nounphrase*"; push; } TRICKS This section contains tips about how to perform specific tasks within the limitations of the parse machine (which does not have regular expressions, nor any kind of arithmetic). See the example eg/plzero.pss for an example of reducing high token rules before low token rules, in order to resolve precedence issues. * check if accumulator is equal to 4 ---- read; a+; put; clear; count; "4" { clear; add "4th char is '"; get; "'\n"; print; clear; } ,,, * print only if number 3 digits or greater ------- # check if the input matches the regex /[0-9]{3,}/ r; (eof) { [0-9] { put; clip; clip; clip; !"" { clear; get; print; } } } ,,, * print the length of each word in input ---- read; ![:space:] { nochars; whilenot [:space:]; add " ("; chars; add ") "; print; clear; } # ignore whitespace !"" { clear; } ,,,, * another way to print the length of each word in input ---- read; E" ",E"\n",(eof) { add "("; chars; add ") "; print; clear; nochars; } ,,,, The script below uses a trick of using the replace command with the "tape equals workspace" test (==) to check if the workspace contains a particular string. * print only lines that contain the text 'puma' ---- whilenot [\n]; put; replace "puma" ""; !(==) { clear; get; print; } (eof) { quit; } ,,,, MULTIPLEXING TOKEN SEQUENCES .... Sometimes it is useful to have a long list of token sequences before a brace block. One way to reduce this list is as follows * using nested tests to reduce token sequence lists ---- pop; pop; B"aa*","bb*","cc*" { E"xx*","yy*","zz*" { # process tokens here. nop; } } # equivalent long token sequence list "aa*xx*","aa*yy*","aa*zz*","bb*xx*","bb*yy*","bb*zz*" "cc*xx*","cc*yy*","cc*zz*" { nop; } ,,, NOTES FOR A REGEX PARSER .... * parse a regex between / and / ------- #* tokens for the regex parser class: [^-a\][bc1-5+*()] spec: the list and ranges in [] classes char: one character *# begin { while [:space:]; clear; } read; !"/" { clear; add "error"; print; quit; } # special characters for regex can be literal tokens [-/\]()+*$^.?] { "*" { put; clear; add "star*"; push; .reparse } add '*'; push; .reparse } # the start of class tests [^ ... ] "[" { read; # empty class [] is an error "]" { clear; add "Empty class test [] at char "; chars; print; quit; } # negated class test "^" { clear; add "[neg*"; push; .reparse } # not negated clear; add "[*char*"; push; push; .reparse } # just get the next char after the escape char "\\ " { (eof) { clear; add "error!"; print; quit; } clear; read; [ntfr] { "n" { clear; add "\n"; } "t" { clear; add "\t"; } "f" { clear; add "\f"; } "r" { clear; add "\r"; } put; clear; add "char*"; push; } } !"" { put; add "char*"; push; .reparse } parse> pop; pop; "char*star*" { clear; add "pattern*"; .reparse } "char*char*" { clear; add "pattern*char*"; push; push; .reparse } "[*]*" { clear; add "error!"; print; quit; } # 3 tokens pop; # parsing class tests eg: [ab0-9A-C^&*] or [^-abc] # so the only special characters in classes are []^(negation) # - indicates a range, except when it is the 1st char in the brackets # # sequences like: [*char*char* [*-*char* B"[*",B"[neg*" { # in brackets, a-b is a range !E"]*".!E"-*" { # manip attributes clear; add "[*spec*char*"; push; push; push; .reparse } } # token sequences, eg: spec/char/? 
RECOGNISERS AND CHECKERS ....

A recogniser is a parser that only determines whether a given string
is a valid "word" in the given language. We can extend a recogniser
into an error checker for a given string, so that it determines at
what point (character or line number) in the string the error occurs.
The error-checker can also give a probable reason for the error (such
as a missing or excessive syntactic element). This is much more
practically useful than a recogniser.

* examples with error messages ...
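One simple pattern, used by several scripts in this booklet, is to
print the line and character number when an unrecognised character is
seen. A sketch (the lexing tests for the valid tokens of the language
are assumed to come first):

* turning a recogniser into a rudimentary error checker
----
read;
# ... tests for the valid tokens of the language go here ...
!"" {
  put; clear;
  add "unrecognised character '"; get;
  add "' at line "; lines;
  add ", character "; chars; add "\n";
  print; quit;
}
,,,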
EMPTY START TOKEN ....

When the start symbol is an array of another token, it may often
simplify parsing to create an empty start token in a "begin" block

 >> ebnf:  text = word*

* using an empty start token
-----
begin { add "text*"; push; }
read;
# ignore whitespace
[:space:] { while [:space:]; clear; }
!"" { whilenot [:space:]; put; clear; add "word*"; push; }

parse>
pop; pop;
"text*word*" { clear; add "text*"; push; .reparse }
(eof) {
  # check for the start symbol 'text*' here
}
push; push;
,,,

Without the empty "text*" token we would have to write

 >> "text*word*","word*word*" { }

This is not such a great disadvantage, but it does lead to
inefficient compiled code, because the "word*word*" token sequence
only occurs once when running the script (at the beginning of the
input stream).

* compiled code for "text*word*","word*word*" { ... }
-------
,,,,

In other circumstances, there are further advantages to the empty
start symbol. See the pars/eg/history.pss script for an example.

END OF STREAM TOKEN ....

This is a technique analogous to the "empty start symbol". In many
cases it may simplify parsing to create a "dummy" end token when the
end-of-stream is reached. This token should be created immediately
after the parse> label

* example of use of an "end" token to parse dates in text
-------
# tokens:  day month year word
# rules:
#   date = day month year
#   date = day month word
#   date = day month end
read;
![:space:] {
  whilenot [:space:]; put; clear;
  [0-9] {
    # matches regex: /[0-9][0-9]*/
    clip; clip;
    !"" { }
    add "word*"; push;
  }
}

parse>
(eof) {
  # create the dummy "end" token
  add "end*"; push;
}
,,,,

PALINDROMES ....

Palindromes are an interesting exercise for the machine, because they
may be the simplest context-free language. The script below is
working, but also prints single letters as palindromes. See the note
in the script for a solution.

* print only words that are palindromes
--------
read;
# the code in this block builds 2 buffers. One with
# the original word, and the other with the word in reverse.
# Later, the code checks whether the 2 buffers contain the
# same text (a palindrome).
![:space:] {
  # save the current character
  ++; ++; put; --; --;
  get; put; clear;
  # restore the current character
  ++; ++; get; --; --;
  ++; swap; get; put; clear; --;
}
# check for palindromes when a space or eof is found
[:space:],(eof) {
  # clear white space
  [:space:] { while [:space:]; clear; }
  # check if the previous word was a palindrome
  get; ++;
  # if the word is the same as its reverse, and not empty,
  # then it is a palindrome.
  (==) {
    # make sure that the palindrome has > 2 characters
    clip; clip;
    !"" { clear; get; add "\n"; print; }
  }
  # clear the workspace and the 1st two cells
  clear; put; --; put;
}
,,,,

LINE BY LINE TOKENISATION ....

See the example below and adapt it.

* a simple line tokenisation example
--------
read;
[\n] { clear; }
whilenot [\n]; put; clear; add "line*"; push;

parse>
pop; pop;
"line*line*", "lines*line*" {
  clear; get; add "\n"; ++; get; --; put; clear;
  add "lines*"; push; .reparse
}
push; push;
(eof) {
  pop;
  "lines*" { clear; get; print; }
}
,,,,

* remove all lines that contain a particular text
----
until "
,,,

WORD BY WORD TOKENISATION ....

A common task is to treat the input stream as a series of space
delimited words.

* a simple word tokenisation example, print one word per line
--------
read;
[:space:] { clear; }
whilenot [:space:]; put; clear; add "word*"; push;

parse>
pop; pop;
"word*word*", "words*word*" {
  clear; get; add "\n"; ++; get; --; put; clear;
  add "words*"; push; .reparse
}
push; push;
(eof) {
  pop;
  "words*" { clear; get; print; }
}
,,,,

REPETITIONS ....

The parse machine cannot directly encode rules which contain the ebnf
repetition construct {}. The trick below only creates a new list
token if the preceding token is not a list of the same type.

* a technique for building a list token from repeated items
----
# ebnf rules:
#   alist := a {a}
#   blist := b {b}
read;
# terminal symbols
"a","b" { add "*"; push; }
!"" {
  put; clear;
  add "incorrect character '"; get; add "'";
  add " at position "; chars; add "\n";
  add " only a's and b's allowed. \n";
  print; quit;
}

parse>
# 1 token (with extra token)
pop;
"a*" {
  pop;
  !"a*".!"alist*a*" { push; }
  clear; add "alist*"; push; .reparse
}
"b*" {
  pop;
  !"b*".!"blist*b*" { push; }
  clear; add "blist*"; push; .reparse
}
push;
(eof) {
  unstack; put; clear;
  add "parse stack is: "; get; print; quit;
}
,,,,

PLZERO CONSTANT DECLARATIONS ....

An example of parsing a repeated element occurs in pl/0 constant
declarations.
The rule, in Wirth's ebnf syntax, is:

 >> constdec = "const" ident "=" number {"," ident "=" number} ";"

Examples of valid constant declarations are
----
const g = 300, h=2, height = 200;
const width=3;
,,,

It is necessary to factor the ebnf rule to remove the repetition
element (indicated by the braces {}). In the script below, the
repeated element is factored into the "equality" and "equalityset"
parse tokens. The script below may seem very verbose compared to the
ebnf rule, but it lexes and parses the input stream, recognises
keywords and punctuation, and provides error messages.

* recognise pl/0 constant declarations
--------
begin {
  add ' recognising pl/0 constant decs in the form:
   "const g = 300, h=2, height = 200;"
   "const width=3;" \n';
  print; clear;
}
read;
[:alpha:] {
  while [:alpha:];
  # keywords in pl/0
  "const","var","if","then","while","do","begin","end" {
    put; add "*"; push; .reparse
  }
  put; clear; add "ident*"; push; .reparse
}
[0-9] {
  while [0-9]; put; clear; add "number*"; push; .reparse
}
# literal tokens
",", "=", ";" { add "*"; push; }
# ignore whitespace
[:space:] { clear; }
!"" {
  add " << invalid character at position "; chars; add ".\n";
  print; quit;
}

parse>
pop; pop; pop;
"ident*=*number*" {
  clear; add "equality*"; push; .reparse
}
"equality*,*equality*","equalityset*,*equality*" {
  clear; add "equalityset*"; push; .reparse
}
"const*equality*;*", "const*equalityset*;*" {
  clear; add "constdec*"; push;
}
push; push; push;

(eof) {
  pop; pop;
  "constdec*" {
    clear; add " Valid PL/0 constant declaration!\n"; print; quit;
  }
  push; push;
  add " Invalid PL/0 constant declaration!\n";
  add " Parse stack: "; print; clear; unstack; print; quit;
}
,,,,

OFFSIDE OR INDENT PARSING ....

Some languages use indentation to indicate blocks of code, or
compound statements. Python is an important example. These languages
are parsed using "indent" and "outdent" (or "dedent") tokens. The
mark/go commands should allow parsing of indented languages.

The code below is completely untested and incomplete. The idea is to
issue "outdent" or "indent" tokens by comparing the current leading
space to a previous space token, but the code below is a mess. The
tricky thing is that one space* token can give rise to multiple
"outdent*" tokens, eg
----
if g==x:
   while g<100:
      g++
g:=0;
,,,,

* a basic indent parsing procedure
--------
# incomplete!!
read;
begin { mark "b"; add ""; ++; }
[\n] {
  clear; while [ ]; put;
  mark "here"; go "b";
  # indentation is equal, so do nothing
  (==) { clear; go "here"; .reparse }
  add " ";
  (==) { clear; add "indent*"; push; go "here"; .reparse }
  clip; clip; clip; clip;
  (==) { clear; add "outdent*"; push; go "here"; .reparse }
  put; clear; add "lspace*"; push; mark "b";
}

parse>
,,,,

OPTIONALITY ....

The parse machine cannot directly encode the idea of an optional
"[...]" element in a bnf grammar.

* a rule with an optional element
 >> r := 'a' ['b'] .

In some cases we can just factor the optional element into an
alternation "|"

 >> r := 'a' | 'a' 'b' .

However, once we have more than 2 or 3 optional elements in a rule,
this becomes impractical. For example

 >> r := ['a'] ['b'] ['c'] ['d'] .

In order to factor out the optionality above, we would end up with a
large number of rules, which would make the parse script very
verbose. Another approach is to encode some state into a parse token.

* a strategy for parsing rules containing optional elements
----------
# parse the ebnf rule
#   rule := ['a'] ['b'] ['c'] ['d'] ';' .
begin { add "0.rule*"; push; } read; [:space:] { clear; } "a","b","c","d",";" { add "*"; push; .reparse } !"" { add " unrecognised character."; print; quit; } parse> pop; pop; E"rule*a*" { B"0" { clear; add "1.rule*"; push; .reparse } clear; add "misplaced 'a' \n"; print; quit; } E"rule*b*" { B"0",B"1" { clear; add "2.rule*"; push; .reparse } clear; add "misplaced 'b' \n"; print; quit; } E"rule*c*" { B"0",B"1",B"2" { clear; add "3.rule*"; push; .reparse } clear; add "misplaced 'c' \n"; print; quit; } E"rule*d*" { B"0",B"1",B"2",B"3" { clear; add "4.rule*"; push; .reparse } clear; add "misplaced 'd' \n"; print; quit; } E"rule*;*" { clear; add "rule*"; push; } push; push; (eof) { pop; "rule*" { add " its a rule!"; print; quit; } } ,,,, REPETITION PARSING .... Similar to the notes above about parsing grammar rules containing optional elements, we have a difficulty when parsing elements or tokens which are enclosed in a "repetition" structure. In ebnf syntax this is usually represented with either braces "{...}" or with a kleene star "*". We can use a similar technique to the one above to parse repeated elements within a rule. The rule parsed below is equivalent to the regular expression >> /a?b*c*d?;/ So the script below acts as a recogniser for the above regular expression. I wonder if it would be possible to write a script that turns simple regular expressions into pep scripts? In the code below we don't have any separate "blist" or "clist" tokens. The code below appears very verbose for a simple task. * parsing repetitions within a grammar rule ---------- # parse the ebnf rule # rule := ['a'] {'b'} {'c'} ['d'] ';' . # equivalent regular expression: /a?b*c*d?;/ begin { add "0/rule*"; push; } read; [:space:] { clear; } "a","b","c","d",";" { add "*"; push; .reparse } !"" { add " unrecognised character."; print; quit; } parse> # ------------ # 2 tokens pop; pop; E"rule*a*" { B"0" { clear; add "a/rule*"; push; .reparse } clear; add "misplaced 'a' \n"; print; quit; } E"rule*b*" { B"0",B"a",B"b" { clear; add "b/rule*"; push; .reparse } unstack; add " << parse stack.\n"; add "misplaced 'b' \n"; print; quit; } E"rule*c*" { B"0",B"a",B"b",B"c" { clear; add "c/rule*"; push; .reparse } clear; add "misplaced 'c' \n"; unstack; add " << parse stack.\n"; print; quit; } E"rule*d*" { B"0",B"a",B"b",B"c" { clear; add "d/rule*"; push; .reparse } clear; add "misplaced 'd' \n"; print; quit; } E"rule*;*" { clear; add "rule*"; push; } push; push; (eof) { pop; "rule*" { clear; add "text is in regular language /a?b*c*d?;/ \n"; print; quit; } push; add "text is not in regular language /a?b*c*d?;/ \n"; add "parse stack was:"; print; clear; unstack; print; quit; } ,,,, PL ZERO .... Pl/0 is a minimalistic language created by Niklaus Wirth, for teaching compiler construction. In this section, I will explore converting the pl/0 grammar into a form that can be used by the parsing machine and language. I will do this in stages, first parsing "expressions" and then "conditions" etc. Wirths grammar is designed to be used in a recursive descent parser/compiler, so it will be interesting to see if it can be adapted for the machine which is essentially a LR shift-reduce parser/compiler. The grammar below seems to be adequate for LR parsing, except for the expression tokens (expression, term, factor etc). Apart from expression we should be able to factor out the various [] {} and () constructs and create a machine parse script. * the pl/0 grammar in wsn/ebnf form ------- program = block "." . 
block = [ "const" ident "=" number {"," ident "=" number} ";"]
        [ "var" ident {"," ident} ";"]
        { "procedure" ident ";" block ";" } statement .
statement = [ ident ":=" expression | "call" ident |
        "?" ident | "!" expression |
        "begin" statement {";" statement } "end" |
        "if" condition "then" statement |
        "while" condition "do" statement ].
condition = "odd" expression |
        expression ("="|"#"|"<"|"<="|">"|">=") expression .
expression = [ "+"|"-"] term { ("+"|"-") term}.
term = factor {("*"|"/") factor}.
factor = ident | number | "(" expression ")".
,,,

PROGRAMS AND BLOCKS IN PLZERO ....

I will try to keep Wirth's grammatical structure, but factor the 2
rules and introduce new parse tokens for readability, such as
"constdec*" for a constant declaration and "vardec*" for a variable
declaration. I will also try to keep Wirth's rule names for
reference.

* the program and block grammar rules
--------
program = block "." .
block = [ "const" ident "=" number {"," ident "=" number} ";"]
        [ "var" ident {"," ident} ";"]
        { "procedure" ident ";" block ";" } statement .
,,,

Once this script is written, we can check that it is parsing
correctly, and then move on to variable declarations, procedure
declarations and so forth (but not the forth language, which doesn't
require a grammar).

PLZERO EXPRESSIONS ....

* wsn/ebnf grammar for pl/0 expressions
---------
expression = [ "+"|"-"] term { ("+"|"-") term}.
term = factor {("*"|"/") factor}.
factor = ident | number | "(" expression ")".
,,,

TRIGGER RULES ....

 >> ',' orset '{' ::= ',' test '{' ;

NEGATED CLASSES ....

Often, when creating a language or data format, we want to be able to
negate operators or tests (so that the test has the opposite effect
to the one it normally has). In the parse script language I use the
prefixed "!" character as the negation operator. We can create a
whole series of negated "classes" or tokens in the following way. The
"negation" logic is actually stored on the stack, not on the tape.
This is useful because we can then compile all these negated tokens
in a similar way.

* example of creating a set of negated classes
---------
# the ! operator is used as a literal token
"!*quote*","!*class*","!*begintext*",
"!*endtext*","!*eof*","!*tapetest*" {
  replace "!*" "not";
  push;
  # now transfer the token value into the correct tape cell
  get; --; put; ++; clear;
  .reparse
}
,,,

LOOKAHEAD AND REVERSE REDUCTIONS ....

The so-called "quotesets" have been replaced in the current (aug
2019) implementation of compile.pss with 'ortestset' and
'andtestset', but the compilation techniques are similar to those
shown below.

The old implementation of the "quotesets" token in old versions of
compile.pss seems quite interesting. It waits until the stack
contains a brace token "{*" before it starts reducing the quoteset
list.

* a "quoteset", or a set of tests with OR logic
 >> 'a','b','c','d' { nop; }

So the compile.pss script actually parses "'c','d' { " first, and
then resolves the other quotes ('a','b'). This is good because the
script can work out the jump-target for the forward true jump (the
accumulator is used to keep track of the forward true jump). It also
uses the brace as a lookahead, and then just pushes it back on the
stack, to be used later when parsing the whole brace block.

* bnf rules for parsing quotesets
 >> quoteset '{' := quote ',' quote '{' ;
 >> quoteset '{' := quote ',' quoteset '{' ;

But this has 2 elements on the left-hand side. This works, but is not
considered good grammar (?)
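Here is a loose sketch of the shape of those reductions. This is not
the actual compile.pss code; it ignores the jump-target bookkeeping
done with the accumulator, and it assumes the comma and brace are
literal tokens ",*" and "{*":

* sketch: reducing a quoteset when the '{' lookahead appears
----
pop; pop; pop; pop;
"quote*,*quote*{*" {
  # reduce the two quotes to a quoteset, and push the brace
  # token back onto the stack, to be used later
  clear; add "quoteset*{*"; push; push; .reparse
}
"quote*,*quoteset*{*" {
  clear; add "quoteset*{*"; push; push; .reparse
}
push; push; push; push;
,,,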
Multiple testset sequences may be used as a poor person's regular
expression pattern matcher.

* print words beginning with 'z', ending with '.txt' and
* not containing the letter 'o'
-------
r;
E" ",E"\n",(eof) {
  !(eof) { clip; }
  B"z".E".txt".![o] { add " (yes!) \n"; print; }
  clear;
}
,,,,

It doesn't really make sense to combine a text-equals test with any
other test, but the other combinations are useful. The "set" token
syntax parses a string such as

 >> 'a','b','c','d' { nop; }

The comma is the equivalent of the alternation operator (|) in bnf
syntax.

RABBIT HOPS ....

In my first attempt to parse "quoteset" tokens with the "compile.pss"
script compiler, I used a "rabbit hop" technique. From the point of
view of efficiency and compiled code size this is a very bad way to
compile the code, but it may be useful in other situations. It
provides a way to generate functional code when the final jump-target
is not known at "shift-reduce" time.

* create "rabbit hops" for the true jump
------
"quote*,*quote*" {
  clear;
  add "testis "; get;
  # just jump over the next test if the true flag is set
  add "\njumptrue 3 \n";
  ++; ++;
  add "testis "; get; add "\n";
  # add the next jumptrue when the next quote is found
  --; --; put; clear;
  add "quoteset*"; push;
  # always reparse/compile
  .reparse
}
# quoteset ::= quoteset , quote ;
"quoteset*,*quote*" {
  clear;
  get;
  ++; ++;
  add "jumptrue 4 \n ";
  add "jumptrue 3 \n ";
  add "testis "; get; add "\n";
  # add the next jumptrue when/if the next quote is found
  --; --; put; clear;
  add "quoteset*"; push;
  # always reparse/compile
  .reparse
}
,,,,

ASSEMBLY FORMAT AND FILES

The implementation of the pep script language uses an intermediary
"assembly" phase when loading scripts. The file asm.pp is responsible
for converting a script into this assembly (text) format, and asm.pp
is itself an "assembly-format" file. I refer to this format as
"assembly" or "assembler" because it is similar to other assembly
languages: it has one instruction per line. These files consist of
"instructions" for the virtual machine, along with "parameters",
jumps, tests and jump labels (labels make writing assembly files much
easier, since line numbers do not have to be used).

So the asm.pp file actually implements the pep script language. The
proof is in the pudding: the implementation of the pep script
language shows that the pep system (and the pep virtual machine) is
capable of implementing code languages and data languages (or at
least simple ones).

Normally, it will not be necessary to write any assembler code, since
the script language is much more readable. However, it is useful for
debugging scripts to view the assembly listing as it is loaded into
the machine (the -I switch allows this).

The assembler file syntax is similar to other machine assemblers: 1
command per line; leading space is insignificant. Labels are
permitted and end in a ":" character.

* an example compilation of a basic script
-------
# pep script source
#   read; "abc" { nop; }

# 'assembly' equivalent of the above script
start:
  read
  testis "abc"
  jumpfalse block.end.21
  nop
block.end.21:
  jump start
,,,

* another example showing begin-block compilation
---------
# pep script source
#   begin { whilenot [:space:]; clear; }
#   read; [:space:] { d; }

# compilation:
  whilenot [:space:]
  clear
start:
  read
  testclass [:space:]
  jumpfalse block.end.60
  clear
block.end.60:
  jump start
,,,,

COMPARISON WITH OTHER COMPILER COMPILERS

As far as I am aware, all other compiler-compiler systems take some
kind of grammar as input, and produce source code as output.
The produced source code acts as a "recogniser" for strings which
conform to the given grammar.

YACC AND LEX COMPARISON ....

The tools "yacc" and "lex", and the very numerous clones, rewrites
and reimplementations of those tools, are very popular for the
implementation of parsers and compilers. This section discusses some
of the important differences between the pep machine and language and
those tools.

The pep language and machine is (deliberately) a much more limited
system than a "lex/yacc" style combination. A lex/yacc-type system
often produces "c" language code (or code in some other language)
which is then compiled and run to implement the parser/compiler. The
pep system, on the other hand, is a "text stream filter"; it simply
transforms one text format into another. For this reason, it cannot
perform the complex programmatic "actions" that tools such as
lex/yacc, bison or antlr can achieve.

While clearly more limited than a lex-yacc style system, in my
opinion the current machine has some advantages:

* It may be simpler, and therefore should be easier to understand.

* It does not make use of shift-reduce tables.

* It should be possible to implement it on computer environments with
  modest resources (data/code memory).

* Because it is a text-filter, it should be more accessible for
  "playing around" or experimentation. Perhaps it lacks the
  psychological barrier that a lex-yacc system has for a
  non-specialist programmer.

* Using translation scripts, we can translate any pep script into a
  number of other languages, such as python, ruby, java, go and plain
  c. These translation scripts (eg tr/translate.python.pss) can also
  translate themselves, thus creating a complete stand-alone pep
  system in the target language.

* It is relatively easy to create a pep translation script for a new
  language: a class/object/data-structure is created representing the
  pep virtual machine, and then an existing translation script is
  adapted for the new language.

STATUS

As of June 2022, the interpreter and debugger written in c (i.e.
/books/pars/object/pep.c) works well. This implementation is not
unicode-aware and has a "tape" array of fixed size, but these
problems are somewhat obviated by the existence of the translation
scripts in the tr/ folder.

A number of interesting and/or useful examples have been written in
the "pep" script language; they are in the "eg/" folder. Several
translation scripts (for the languages java, go, ruby, python, tcl
and c) have been written and are largely bug-free. These scripts can
be tested with the "pep.tt" helper function in the helpers.pars.sh
bash script. I would like to finish the translation scripts for
swift, c++, javascript and rust.

NAMING OF THE PEP SYSTEM

The executable is called 'pep', standing for "Parse Engine for
Patterns". The folder is called /books/pars/ and the source file is
called 'pep.c', for no particularly good reason. Pep scripts are
given a ".pss" file extension, and files in the "assembler" format
have a ".pp" file extension. The source files are split into .c files
where each one corresponds to a particular "object" (data structure)
within the machine (eg tapecell, tape, buffer with stack and
workspace).

'pep' is not an "evocative" name (unlike, for example, "lisp"), but
it fits with standard short unix tool naming. Another possible name
for the system could be "nom", which is a slight reference to "noam"
and also an indo-european (?) root for "name".
root for "name" LIMITATIONS AND BUGS - The main interpreter 'pep' (source /books/pars/object/pep.c) is written using plain c byte characters. This seemed a big limitation, but the scripts translate.xxx.pss may be a simple way to accomodate unicode characters without rewriting the code in pep.c - loadScript() does not look for the "asm.pp" in the PPASM folder, which means that all scripts have to be run from the 'pars' folder. This is a bug. - some segmentation faults may remain in pep.c - the whilenot command may not be well implemented in pep.c - the pep tool cannot receive the input-stream from stdin. This is very un-unix-like but is unavoidable because the "pep" executable allows interactive debugging. The solution is to separate pep into 2 tools, one which contains a debugger and the other which dedicate "stdin" to the input. But I dont think it is worthwhile to do this work until pep can deal directly with wide characters (eg wchar) IDEAS . a simple language which can generate xcode swift and android java for writing apps. A json-like layout language to replace android xml layouts. . parsing regular expressions shouldn't be that difficult rules: [0-9abcd]n*a+ . A vim command to compile and run a fragment with translate.java.pss . A script to turn a bash history file with comments into a python or perl array of objects (so that we can easily eliminate duplicated commands). And eliminate simple commands immediately with pep . An indent parser like this tokens: space newline word words leading.space = nl space .... CANDIDATES FOR NEW COMMANDS OR SYNTAX FOR PEP The commands and new syntax below have not been implemented but might solve a range of problems. Here are some possible future changes to the machine. * abbreviations for character classes eg [:S:] for [:space:] (already in translate.java.pss and some other "transpilers" but not gh.c) *- replacetape command: would allow unique lists to be constructed (ie replace in workspace text in the current tape cell with a constant string.) * a "length" command that sets the accumulator to the length of the current workspace?? * some accumulator based tests might be good. eg: :: >n { } # check if accumulator greater than n :: } # check if accumulator less than n :: setchars # set accumulator = character counter :: setlines # set accumulator = line counter * create a java/javascript/python/ruby/forth version of the machine * I may separate the error checking code which is currently in compile.pss into a separate script error.pss . This will allow the same code to be used in other scripts such as translate.java.pss * add a new command "untiltape" which has no arguments, which reads the input stream until the workspace ends with the text contained in the current cell of the tape. eg: untiltape; One application of this command would be parsing gnu sed syntax, where the pattern delimiter is what ever character follows the "s" for example: >> s/a*b/c/ >> s@a*b@c@ >> s#a*b#c# * a new command "replacetape" which replaces text in the workspace with the contents of the current tape cell. eg: replace all newlines in the workspace with current cell contents >> replacetape "\n"; * remove "bail" the command. Instead allow the "quit" command to return an exit code. HISTORY OF IDEA The file /books/pars/object/gh.c contains detailed information about the development of this idea. 
HISTORY OF IDEA

The file /books/pars/object/gh.c contains detailed information about
the development of this idea.

DESIGN PHILOSOPHY FOR THE MACHINE

When designing the parse machine, I wanted to make its capabilities
as limited as possible, while still being able to properly parse and
translate "most" context-free languages, and some context-sensitive
languages. Related to this idea was the aim of making the machine
implementable in the smallest possible way. Also, I deliberately
excluded the use of regular expressions, so that the script writer
would not be tempted to try to "parse" context-free patterns with
them. The general design of the syntax and command-line usage is
inspired by some old unix tools, such as sed, grep and awk.

REGULAR EXPRESSIONS OR LACK THEREOF ....

As oft-repeated in this document, the parse machine and language does
not support regular expressions. This may seem a strange decision,
considering that all existing "lexers" (tools that perform the lexing
phase of compilation) support regexes (as far as I know). I omitted
regular expressions from the machine so that the machine could be
implemented in a minimal size, and also so that it would run quickly.
I am still hopeful that it is possible to implement the machine on
embedded architectures with very limited resources.

EVOLUTION OF THE MACHINE AND LANGUAGE

* The a+ and a- commands were initially called "plus" and "minus"

* The "lines" and "chars" (line number and character number)
  registers and commands are recent, but very important, additions,
  because they allow script error messages to pinpoint the line and
  character number of the error.

* The "mark" and "go" commands are also new additions, and were at
  first added to try to allow "indent" parsing (also called "offside"
  parsing, such as is used in the Python language)

* june 2021 - nochars and nolines were added to object/pep.c
  (object/machine.interp.c), although they had been in
  pars/tr/translate.java.pss for a while, along with upper, lower,
  and cap (capital case)

IMPORTANT FILES AND FOLDERS

This section describes some of the key files and folders within the
parse-machine implementation at
http://bumble.sourceforge.net/books/pars/

EXAMPLE SCRIPTS ....

The folder /books/pars/eg/ contains a set of scripts which
demonstrate the utility of the parse-script language and machine.
Here is a description of some of these scripts.

- mark.html.pss
  This converts a particular plain text document format into html.
  An example of this format is the current file 'pars-book.txt'

- exp.tolisp.pss
  formats simple arithmetic expressions, of the form "a+b*c+(d/e)",
  into a lisp-style syntax.

- history.pss
  parses a bash history file which may contain comments for a
  particular command, as well as a timestamp (the comment may appear
  either before or after the timestamp)

- json.parse.pss
  parses and checks json data (but currently only recognises integer
  numbers).

- ....

COMPILE DOT PSS ....

This is the script compiler, and also the compiler compiler. It has
replaced the handcoded /books/pars/asm.pp file, because it is easier
to write and maintain.

ASM DOT PP ....

This file implements the "pep" scripting language. It is a text file
which consists of a series of "instructions" or commands for the pep
virtual machine. These instructions include: instructions which alter
the registers of the virtual machine; tests, which set the flag
register of the machine to true if the test succeeds, or else to
false; and conditional and unconditional jumps, which change the
instruction pointer of the machine if the flag register is true.
'Asm.pp' also contains labels (lines ending in ":").
These labels make it much easier to write code containing jumps (a
label can be used instead of an instruction number). Because of the
similarity of this format to many "assembly" languages, I refer to
this as assembly language for the pep virtual machine.

"asm.pp" is now generated from /books/pars/compile.pss with

 >> pep -f compile.pss compile.pss > asm.new.pp; cp asm.new.pp asm.pp;

(and then deleting the final print statement at the end of asm.pp).
This is a good example of the utility of scripts compiling
themselves. In fact, all the "translate.xxx.pss" scripts could be
used in this way. For example:

 >> pep -f translate.java.pss translate.java.pss > Machine.java

This creates a java source file which, when compiled with "javac", is
able to compile scripts into java.

VIM AND PEP

I usually edit with the "vim" text editor (although "sam" or "acme"
might be worthwhile alternatives). Here are some techniques for using
vim with the pep tool. The vim mappings and commands below are useful
for checking that pep "one-liners", and pep scripts or script
fragments contained within a text document, actually compile and
run. This may be a way of approximating Knuth's "literate
programming" idea. The multiline snippets are contained in a plain
text document within "---" and ",,," tags, which are both on an
otherwise empty line.

* create a vim command to run a script embedded in a text document
* with input provided as an argument to the vim command
 >> com! -nargs=1 Ppm ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pep -f test.pss -i "<args>"

* create a vim command to compile an embedded script to "assembly" format
 >> com! Ppcc ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pep -f compile.pss test.pss

(The assembly compilation will be printed to stdout)

* compile a one line script to assembly format and save as test.asm
 >> com! -nargs=1 Pplcc .w !sed 's/^ *>>//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pep -f compile.pss test.pss > test.asm

* run a one line script embedded in a text document, input stream as arg
 >> com! -nargs=1 Ppl .w !sed 's/^ *>>//' > test.pss; ./pep -f test.pss -i "<args>"

Given a one line script such as the following

 >> read; "'" { until "'"; print; } clear;

typing ":Ppl one'two'three" within the "vim" text editor, with the
cursor positioned on the same line (the line beginning with ">>"),
will execute the script with the given text as input. There will be
quoting problems if the input contains " characters.

* run a multiline script embedded in a text document
* with the input given as an argument
 >> com! -nargs=1 Ppm ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /home/rowantree/sf/htdocs/books/pars/pep -f test.pss -i "<args>"

* run a multiline script embedded in a text document
* with the file pars-book.txt as the input stream
 >> com! Ppf ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pep -f test.pss pars-book.txt

* run a one line script embedded in a text document
* with the file "pars-book.txt" as the input stream.
 >> com! Ppll .w !sed 's/^ *>>//' > test.pss; /Users/baobab/sf/htdocs/books/pars/pep -f test.pss pars-book.txt

The mapping below can only run the script with a static input "abc",
which is not very useful, but at least it tests whether the script
compiles properly. The compiled script will be saved in "sav.pp"

* create a vim mapping to run a script embedded in a text document
 >> map ,pp :?^ *---?+1,/^ *,,,/-1w! test.pss \| !pp -f test.pss -i "abc"

* create a vim mapping to execute the current line as a bash "one-liner"
 >> map ,pl :.w !sed 's/^ *>>//' \| bash

* create a vim command to execute the current line as a pep "one-liner"
 >> command! Ppl .w !sed 's/^ *>>//' | bash

The mappings and commands above are for putting in the vimrc file. To
create them within the editor, prepend a ":" to each mapping or
command.

 >> :command! Ppl .w !sed 's/^ *>>//' | bash

CONVERT AND RUN WITH JAVA ....

The vim commands below work because 'translate.java.pss' and 'pep'
and pars-book.txt (this document) are all in the same folder. The
paths below would have to be adjusted if that were not the case. The
commands below are very useful for testing the soundness of the
'translate.java.pss' script.

* convert to java and run a multiline script embedded in a text document
* with the input given as an argument
 >> com! -nargs=1 Ppmj ?^ *---?+1,/^ *,,,/-1w !sed 's/^//' > test.pss; echo "[translating to java and compiling]"; ./pep -f translate.java.pss test.pss > Machine.java; javac Machine.java; echo "[running code]"; echo "<args>" | java Machine

* convert to java a script embedded in a text document, input stream as arg
 >> com! -nargs=1 Pplj .w !sed 's/^ *>>//' > test.pss; echo "[translating to java and compiling]"; ./pep -f translate.java.pss test.pss > Machine.java; javac Machine.java; echo "[running code]"; echo "<args>" | java Machine

HISTORY

See /books/pars/object/pep.c for the detailed development history of
the script interpreter (written in c).

22 june 2022
  continued work on translator scripts (perl, js) and on examples

13 march 2020
  made "chars" and "lines" aliases for cc and ll in compile.pss

2 November 2019
  Need to write tr/translate.c.pss to create executable code. This
  can be based on translate.java.pss . Also need to write
  translate.php.pss so that scripts can easily be run on a
  web-server. Also translate.python.pss, since python is an important
  modern language. translate.swift.pss translate.ruby.pss
  translate.forth.pss

  Need to fix mark.html.pss to produce acceptable html output from
  this booklet file. Also need to write mark.latex.pss, based on
  mark.html.pss, so that I can create a decent pdf booklet. Then I
  need to print the booklet with some images and send it to people
  who may be interested in this.

27 september 2019
  translate.javascript.pss is nearing completion... it seems to be
  able to compile many scripts to javascript.

25 august 2019
  Great progress has been made. compile.pss has all sorts of nice new
  syntax, like negated text= tests !"abc"{ ... } Almost all tests can
  now be negated. There is now an AND concatenation operator (.),
  begin blocks, and begintests in ortestsets. compile.pss has
  replaced asm.handcode.pp for compiling scripts.

2019
  For a number of years I have been working on a project to write a
  virtual machine for pattern parsing. The code is located at
  https://bumble.sf.net/books/pars/ and is used to implement a script
  language for parsing and compiling some context-free languages.
  (The implementation is in 'asm.pp') The project is now at a stage
  where useful scripts can be written in the parse-script language.

  The purpose of the virtual machine is to be able to parse and
  transform patterns which cannot normally be dealt with through
  "regular expressions". I.e. patterns which are not "regular
  languages". Possibly the simplest example of one of these would be
  palindromes (eg "aba", "hannah", "anna"). The machine also allows a
  script language to describe patterns and transformations, and this
  language has similarities to sed and to awk.
In fact, the whole idea was inspired by sed and its limitations when
dealing with context free languages.

PALINDROMES

Palindromes are also interesting because they can be parsed with the
simplest possible recursive descent parser.

* pseudo code
---------
parsePalindrome (text) {
  # base case: empty and one-letter strings are palindromes
  if text.length <= 1 { return true }
  n = text.firstCharacter
  m = text.lastCharacter
  if n <> m { return false }
  newText = text minus its first and last characters
  return parsePalindrome(newText)
}
,,,

But the "pep" machine does not use recursive descent parsing. In
fact, the pep machine was written because recursive descent parsing
seemed aesthetically unpleasing.

DOCUMENT HISTORY

22 July 2021

13 march 2020
  revisiting eg/mark.html.pss in order to format this booklet into
  html, LaTeX and pdf for printing. Also, some revisions of the
  booklet.

1 november 2019
  Revising this book file and attempting to make the examples work
  and be more useful.

13 sept 2019
  some editing.

4 September 2019
  Adding some ideas about parsing optional elements.

23 August 2019
  trying to organise this document.