#*

 tr/translate.c.pss

   This is a parse-script which translates parse-scripts into c
   code, using the 'pep' tool. The script creates a standalone
   compilable c program. The virtual machine and engine are
   implemented in plain c at http://bumble.sf.net/books/pars/pep.c.
   This implements a script language with a syntax reminiscent of
   sed and awk (much simpler than awk, but more complex than sed).

STATUS

   july 2022
     testing with pep.trtest c is mainly working 1st and 2nd gen

NOTES

   Or use goto for restart, reparse

   We use labelled loops and break/continue to implement the
   parse> label and .reparse .restart commands. Breaks are also
   used to implement the quit and bail commands.

TODO

   Parse [...] tests into ranges (a-z), lists (abcd) and classes
   (:alnum:) and then call the appropriate c function (not the
   general function workspaceInClassType)

   Convert the parsing code to a method which takes an input stream
   as a parameter. This way the same parser/compiler can be used
   with a string/file/stdin etc and can also be used by other
   classes/objects.

SEE ALSO

   At http://bumble.sf.net/books/pars/

   tr/translate.tcl.pss
     A very similar script for compiling scripts into tcl

   translate.py.pss
     A script translator for python.

   compile.pss
     compiles a script into an "assembly" format that can be loaded
     and run on the parse-machine with the -a switch. This performs
     the same function as "asm.pp"

TESTING

   Use the pep.tt function in helpers.pars.sh to extensively test
   1st and 2nd generation. This uses the test input in tr.test.txt

   Things to test: .restart .reparse before and after parse>
   mark/go. Multiline add.
   try eg/natural.language.pss

   Not working below because [-] doesn't parse well.

   * try
   ----
     pep -f translate.c.pss eg/mark.latex.pss > eg/c/mark.latex.c
     gcc mark.latex.c; chmod a+x a.out
     cat pars-book.txt | ./a.out
   ,,,,

GOTCHAS

   I was trying to run
   >> pep -e "r;a'\\';print;d;" -i "abc"

   and I kept getting an unterminated quote message, which I thought
   I had fixed in machine.interp.c (until code).
   But the problem was actually the bash shell, which resolves \\ to \
   in double quotes, but not single quotes!

BUGS

   When translating eg/mark.latex.pss into c and running on
   pars-book.txt, code blocks are not being recognised (i.e. between
   ---- and ,,,, ). This is caused by [-] { ... } not translating
   properly, or a bug in the c function "workspaceInClass"

   Segmentation fault when the tape gets too big, as would be expected.

   Still getting "malloc" error with
     pep.cff lines.with.pss lines.with.pss

   The c translation doesn't work with eg/lines.with.pss
   There is a reference to "machine->tapePointer" which is incorrect.

   "nottapetest" was wrong

   This test [\]abc] crashes the c translator because c won't accept
   \] as an escape sequence.

   "Unescape" won't work because the function expects a parameter,
   not a char. See escapeChar in machine.methods.c for the solution
   to that.

   Doing pep.cf eg/multiline produces nothing! no output. mysterious
   bug. After stepping through with -I switch it started to work!

   problems with while/whilenot, probably need different code for
   [a-z] and [[:alpha:]] style tests, no?

   Are multiline strings allowed in replace and other commands? or
   only in "add"

   The parse label parse> just after the begin block, or after all
   commands crashes the script. This bug probably exists in all the
   translation scripts.

   It's a bit strange to talk about a multicharacter string being
   "escaped" (eg when calling 'until') but this is allowed in the
   pep engine.

   add "\{"; will generate an "illegal escape character" error when
   trying to compile the generated c code. I need to consider what
   to do in this situation (eg escape \ to \\ ?)

   Check "go/mark" code. what happens if the mark is not found??
   The script should exit with an error if the mark is not found.
   Need a "goToMark()" function.

SOLVED BUGS

   unstack goes into an eternal loop, just like tr.tcl.pss did as well.

   found and fixed a bug in java whilenot/while. The code exits if
   the character is not found, which is not correct.
   The "delimiter" character was hardcoded in push.

   Solved an "until" bug where the java code did not read at least
   one character.

HISTORY

   19 jul 2022
     Revising. The way that [] is parsed is not good and doesn't
     work with [-]{...} for example. It needs to be rewritten.

   20 aug 2021
     1st and 2nd gen working. continuing to debug, wrote escapeChar
     to make escape command work and recompiled libmachine.

   18 july 2021
     more debugging of while/whilenot. eg/natural.language.pss
     appears to translate, compile and run.

   17 july 2021
     rewriting the while/whilenot code for classes, much more
     efficient now. But need to write some error checking.

   14 july 2021
     checked the 'until' code in the methods file, update to the
     same as machine.parse.c (in exec)
     wrote some helper scripts in helpers.pars.sh which translate
     scripts into c, compile them into eg/clang/, and run them with
     input. Some very simple scripts are compiling and running.
     The bash function peplib compiles the library archive required
     to compile the standalone executable.

   10 july 2021
     Began to adapt from the java translator

*#

  read;

  #--------------
  # in general, just ignore space
  [:space:] {
    # reset char counter each line, so that character counter is
    # relative to the current line. This is helpful for syntax error
    # messages.
    [\n] { nochars; }
    clear; !(eof) { .restart } .reparse
  }

  #---------------
  # We can elide all these single character tests, because
  # the stack token is just the character itself with a *
  # Braces {} are used for blocks of commands, ',' and '.' for concatenating
  # tests with OR or AND logic. 'B' and 'E' for begin and end
  # tests, '!' is used for negation, ';' is used to terminate a
  # command.
  "{", "}", ";", ",", ".", "!", "B", "E" {
    put; add "*"; push; .reparse
  }

  #---------------
  # format: "text"
  "\"" {
    # save the start line number (for error messages) in case
    # there is no terminating quote character.
clear; add "line "; lines; add " (character "; chars; add ") "; put; clear; add '"'; until '"'; !E'"' { clear; add 'Unterminated quote character (") starting at '; get; add ' !\n'; print; quit; } put; clear; add "quote*"; push; .reparse } #--------------- # format: 'text', single quotes are converted to double quotes # but we must escape embedded double quotes. "'" { # save the start line number (for error messages) in case # there is no terminating quote character. clear; add "line "; lines; add " (character "; chars; add ") "; put; clear; until "'"; !E"'" { clear; add "Unterminated quote (') starting at "; get; add '!\n'; print; quit; } clip; escape '"'; # unescape isnt implemented in machine.methods.c hence this hack replace "\\'" "'"; put; clear; add "\""; get; add "\""; put; clear; add "quote*"; push; .reparse } #--------------- # formats: [:space:] [a-z] [abcd] [:alpha:] etc # should class tests really be multiline??! "[" { # save the start line number (for error messages) in case # there is no terminating bracket character. clear; add "line "; lines; add " (character "; chars; add ") "; put; clear; add "["; until "]"; "[]" { clear; add "pep script error at line "; lines; add " (character "; chars; add "): \n"; add " empty character class [] \n"; print; quit; } !E"]" { clear; add "Unterminated class text ([...]) starting at "; get; add " class text can be used in tests or with the 'while' and 'whilenot' commands. For example: [:alpha:] { while [:alpha:]; print; clear; } "; print; quit; } # need to escape quotes so they dont interfere with the # enclosing quotes. escape '"'; # the caret is not a negation operator in pep scripts # but the c code doesnt use regexs so should need to escape # it. 
#replace "^" "\\\\^"; # save the class on the tape put; clop; clop; !B"-" { # not a range class, eg [a-z] so need to escape '-' chars clear; get; #replace '-' '\\-'; put; } B"-" { # a range class, eg [a-z], check if it is correct clip; clip; !"-" { clear; add "Error in pep script at line "; lines; add " (character "; chars; add "): \n"; add " Incorrect character range class "; get; add " For example: [a-g] # correct [f-gh] # error! \n"; print; clear; quit; } } clear; get; # restore class text B"[:".!E":]" { clear; add "malformed character class starting at "; get; add '!\n'; print; quit; } B"[:".!"[:]" { clip; clip; clop; clop; # use c type functions in c # Also, abbreviations (not implemented in gh.c yet.) "alnum","N" { clear; add ":alnum"; } "alpha","A" { clear; add ":alpha"; } "ascii","I" { clear; add ":ascii"; } "blank","B" { clear; add ":blank"; } "cntrl","C" { clear; add ":cntrl"; } "digit","D" { clear; add ":digit"; } "graph","G" { clear; add ":graph"; } "lower","L" { clear; add ":lower"; } "print","P" { clear; add ":print"; } "punct","T" { clear; add ":punct"; } "space","S" { clear; add ":space"; } "upper","U" { clear; add ":upper"; } "xdigit","X" { clear; add ":xdigit"; } !B":" { put; clear; add "[error] Pep script syntax error near line "; lines; add " (character "; chars; add "): \n"; add "Unknown character class '"; get; add "'\n"; print; clear; quit; } # the workspaceInClassType function in machine.methods.c # can handle classes ranges and lists put; clear; add "["; get; add ":]"; } #* alnum - alphanumeric like [0-9a-zA-Z] alpha - alphabetic like [a-zA-Z] blank - blank chars, space and tab cntrl - control chars, ascii 000 to 037 and 177 (del) digit - digits 0-9 graph - graphical chars same as :alnum: and :punct: lower - lower case letters [a-z] print - printable chars ie :graph: + space punct - punctuation ie !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. 
     space - all whitespace, eg \n\r\t vert tab, space, \f
     upper - upper case letters [A-Z]
     xdigit - hexadecimal digit ie [0-9a-fA-F]
    *#

    put; clear;
    # (must match the whole string, not just one character)
    #add '"'; get; add '"'; put;
    clear; add "class*"; push;
    .reparse
  }

  #---------------
  # formats: (eof) (EOF) (==) etc.
  "(" {
    clear; until ")"; clip; put;
    "eof","EOF" { clear; add "eof*"; push; .reparse }
    "==" { clear; add "tapetest*"; push; .reparse }
    add " << unknown test near line "; lines;
    add " of script.\n";
    add " bracket () tests are \n";
    add " (eof) test if end of stream reached. \n";
    add " (==) test if workspace is same as current tape cell \n";
    print; clear; quit;
  }

  #---------------
  # multiline and single line comments, eg #... and #* ... *#
  "#" {
    clear; read;
    "\n" { clear; .reparse }
    # checking for multiline comments of the form "#* \n\n\n *#"
    # these are just ignored at the moment (deleted)
    "*" {
      # save the line number for possible error message later
      clear; lines; put; clear;
      until "*#";
      E"*#" {
        # convert to /* ... */ c multiline comment
        clip; clip; put; clear;
        add "/*"; get; add "*/";
        # create a "comment" parse token
        put; clear;
        # comment-out this line to remove multiline comments from the
        # compiled c.
        # add "comment*"; push;
        .reparse
      }
      # make an unterminated multiline comment an error
      # to ease debugging of scripts.
      clear; add "unterminated multiline comment #* ... *# \n";
      add "starting at line number "; get; add "\n";
      print; clear; quit;
    }
    # single line comments. some will get lost.
    put; clear; add "//"; get; until "\n"; clip;
    put; clear; add "comment*"; push; .reparse
  }

  #----------------------------------
  # parse command words (and abbreviations)
  # legal characters for keywords (commands)
  ![abcdefghijklmnopqrstuvwxyzBEKGPRUWS+-<>0^] {
    # error message about a misplaced character
    put; clear;
    add "!! Misplaced character '"; get;
    add "' in script near line "; lines;
    add " (character "; chars; add ") \n";
    print; clear; quit;
  }

  # my testclass implementation cannot handle complex lists
  # eg [a-z+-] this is why I have to write out the whole alphabet
  while [abcdefghijklmnopqrstuvwxyzBEOFKGPRUWS+-<>0^];

  #----------------------------------
  # KEYWORDS
  # here we can test for all the keywords (command words) and their
  # abbreviated one letter versions (eg: clip k, clop K etc). Then
  # we can print an error message and abort if the word is not a
  # legal keyword for the parse-edit language

  # make ll an alias for "lines" and cc an alias for chars
  "ll" { clear; add "lines"; }
  "cc" { clear; add "chars"; }

  # one letter command abbreviations
  "a" { clear; add "add"; }
  "k" { clear; add "clip"; }
  "K" { clear; add "clop"; }
  "D" { clear; add "replace"; }
  "d" { clear; add "clear"; }
  "t" { clear; add "print"; }
  "p" { clear; add "pop"; }
  "P" { clear; add "push"; }
  "u" { clear; add "unstack"; }
  "U" { clear; add "stack"; }
  "G" { clear; add "put"; }
  "g" { clear; add "get"; }
  "x" { clear; add "swap"; }
  ">" { clear; add "++"; }
  "<" { clear; add "--"; }
  "m" { clear; add "mark"; }
  "M" { clear; add "go"; }
  "r" { clear; add "read"; }
  "R" { clear; add "until"; }
  "w" { clear; add "while"; }
  "W" { clear; add "whilenot"; }
  "n" { clear; add "count"; }
  "+" { clear; add "a+"; }
  "-" { clear; add "a-"; }
  "0" { clear; add "zero"; }
  "c" { clear; add "chars"; }
  "l" { clear; add "lines"; }
  "^" { clear; add "escape"; }
  "v" { clear; add "unescape"; }
  "z" { clear; add "delim"; }
  "S" { clear; add "state"; }
  "q" { clear; add "quit"; }
  "s" { clear; add "write"; }
  "o" { clear; add "nop"; }
  "rs" { clear; add "restart"; }
  "rp" { clear; add "reparse"; }

  # some extra syntax for testeof and testtape
  "<eof>","<EOF>" { put; clear; add "eof*"; push; .reparse }
  "<==>" { put; clear; add "tapetest*"; push; .reparse }

  "jump","jumptrue","jumpfalse",
  "testis","testclass","testbegins","testends",
  "testeof","testtape" {
    put; clear; add "The 
instruction '"; get; add "' near line "; lines; add " (character "; chars; add ")\n"; add "can be used in pep assembly code but not scripts. \n"; print; clear; quit; } # show information if these "deprecated" commands are used "Q","bail" { put; clear; add "The instruction '"; get; add "' near line "; lines; add " (character "; chars; add ")\n"; add "is no longer part of the pep language (july 2020). \n"; add "use 'quit' instead of 'bail', and use 'unstack; print;' \n"; add "instead of 'state'. \n"; print; clear; quit; } "add","clip","clop","replace","upper","lower","cap","clear","print", "pop","push","unstack","stack","put","get","swap", "++","--","mark","go","read","until","while","whilenot", "count","a+","a-","zero","chars","lines","nochars","nolines", "escape","unescape","delim","quit","state", "write","nop","reparse","restart" { put; clear; add "word*"; push; .reparse } #------------ # the .reparse command and "parse label" is a simple way to # make sure that all shift-reductions occur. It should be used inside # a block test, so as not to create an infinite loop. There is # a "goto" in c but we will use labelled loops to # implement .reparse/parse> anyway "parse>" { clear; count; !"0" { clear; add "script error:\n"; add " extra parse> label at line "; lines; add ".\n"; print; quit; } clear; add "// parse>"; put; clear; add "parse>*"; push; # use accumulator to indicate after parse> label a+; .reparse } # -------------------- # implement "begin-blocks", which are only executed # once, at the beginning of the script (similar to awk's BEGIN {} rules) "begin" { put; add "*"; push; .reparse } put; clear; add "[pep syntax error] unknown command '"; get; add "'\n"; add " near line "; lines; add " (char "; chars; add ")"; add " of source file or input. \n"; print; clear; quit; # ---------------------------------- # PARSING PHASE: # Below is the parse/compile phase of the script. Here we pop tokens off the # stack and check for sequences of tokens eg "word*semicolon*". 
If we find a # valid series of tokens, we "shift-reduce" or "resolve" the token series eg # word*semicolon* --> command* # # At the same time, we manipulate (transform) the attributes on the tape, as # required. # # parse block parse> #------------------------------------- # 2 tokens #------------------------------------- pop; pop; # All of the patterns below are currently errors, but may not # be in the future if we expand the syntax of the parse # language. Also consider: # begintext* endtext* quoteset* notclass*, !* ,* ;* B* E* # It is nice to trap the errors here because we can emit some # (hopefully not very cryptic) error messages with a line number. # Otherwise the script writer has to debug with # pep -a asm.pp -I scriptfile # "word*word*","word*}*","word*begintext*","word*endtext*", "word*!*", "word*,*","quote*word*", "quote*class*", "quote*state*", "quote*}*", "quote*begintext*", "quote*endtext*", "class*word*", "class*quote*", "class*class*", "class*state*", "class*}*", "class*begintext*", "class*endtext*", "class*!*", "notclass*word*", "notclass*quote*", "notclass*class*", "notclass*state*", "notclass*}*" { add " (Token stack) \nValue: \n"; get; add "\nValue: \n"; ++; get; --; add "\n"; add "Error near line "; lines; add " (char "; chars; add ")"; add " of pep script (missing semicolon?) \n"; print; clear; quit; } "{*;*", ";*;*", "}*;*" { push; push; add "Error near line "; lines; add " (char "; chars; add ")"; add " of pep script: misplaced semi-colon? ; \n"; print; clear; quit; } ",*{*" { push; push; add "Error near line "; lines; add " (char "; chars; add ")"; add " of script: extra comma in list? \n"; print; clear; quit; } "command*;*","commandset*;*" { push; push; add "Error near line "; lines; add " (char "; chars; add ")"; add " of script: extra semi-colon? \n"; print; clear; quit; } "!*!*" { push; push; add "error near line "; lines; add " (char "; chars; add ")"; add " of script: \n double negation '!!' 
is not implemented \n"; add " and probably won't be, because what would be the point? \n"; print; clear; quit; } "!*{*","!*;*" { push; push; add "error near line "; lines; add " (char "; chars; add ")"; add " of script: misplaced negation operator (!)? \n"; add " The negation operator precedes tests, for example: \n"; add " !B'abc'{ ... } or !(eof),!'abc'{ ... } \n"; print; clear; quit; } ",*command*" { push; push; add "error near line "; lines; add " (char "; chars; add ")"; add " of script: misplaced comma? \n"; print; clear; quit; } "!*command*" { push; push; add "error near line "; lines; add " (at char "; chars; add ") \n"; add " The negation operator (!) cannot precede a command \n"; print; clear; quit; } ";*{*", "command*{*", "commandset*{*" { push; push; add "error near line "; lines; add " (char "; chars; add ")"; add " of script: no test for brace block? \n"; print; clear; quit; } "{*}*" { push; push; add "error near line "; lines; add " of script: empty braces {}. \n"; print; clear; quit; } "B*class*","E*class*" { push; push; add "error near line "; lines; add " of script:\n classes ([a-z], [:space:] etc). \n"; add " cannot use the 'begin' or 'end' modifiers (B/E) \n"; print; clear; quit; } "comment*{*" { push; push; add "error near line "; lines; add " of script: comments cannot occur between \n"; add " a test and a brace ({). \n"; print; clear; quit; } "}*command*" { push; push; add "error near line "; lines; add " of script: extra closing brace '}' ?. \n"; print; clear; quit; } #* E"begin*".!"begin*" { push; push; add "error near line "; lines; add " of script: Begin blocks must precede code \n"; print; clear; quit; } *# #------------ # The .restart command jumps to the first instruction after the # begin block (if there is a begin block), or the first instruction # of the script. 
".*word*" { clear; ++; get; --; "restart" { clear; add "continue;"; # not required because we have a "goto" in c # continue works both before and after the parse> label # "0" { clear; add "continue script;"; } # "1" { clear; add "break lex;"; } put; clear; add "command*"; push; .reparse } "reparse" { clear; count; # check accumulator to see if we are in the "lex" block # or the "parse" block and adjust the .reparse compilation # accordingly. "0" { clear; add "goto parse;"; } "1" { clear; add "goto parse;"; } put; clear; add "command*"; push; .reparse } push; push; add "error near line "; lines; add " (char "; chars; add ")"; add " of script: \n"; add " misplaced dot '.' (use for AND logic or in .reparse/.restart \n"; print; clear; quit; } #--------------------------------- # Compiling comments so as to transfer them to the c output "comment*command*","command*comment*","commandset*comment*" { clear; get; add "\n"; ++; get; --; put; clear; add "command*"; push; .reparse } "comment*comment*" { clear; get; add "\n"; ++; get; --; put; clear; add "comment*"; push; .reparse } # ----------------------- # negated tokens. # # This is a new more elegant way to negate a whole set of # tests (tokens) where the negation logic is stored on the # stack, not in the current tape cell. We just add "not" to # the stack token. # eg: ![:alpha:] ![a-z] ![abcd] !"abc" !B"abc" !E"xyz" # This format is used to indicate a negative test for # a brace block. eg: ![aeiou] { add "< not a vowel"; print; clear; } "!*quote*","!*class*","!*begintext*", "!*endtext*", "!*eof*","!*tapetest*" { # a simplification: store the token name "quote*/class*/..." # in the tape cell corresponding to the "!*" token. replace "!*" "not"; push; # this was a bug?? a missing ++; ?? # now get the token-value get; --; put; ++; clear; .reparse } #----------------------------------------- # format: E"text" or E'text' # This format is used to indicate a "workspace-ends-with" text before # a brace block. 
"E*quote*" { clear; add "endtext*"; push; get; '""' { # empty argument is an error clear; add "pep script error near line "; lines; add " (character "; chars; add "): \n"; add ' empty argument for end-test (E"") \n'; print; quit; } --; put; ++; clear; .reparse } #----------------------------------------- # format: B"sometext" or B'sometext' # A 'B' preceding some quoted text is used to indicate a # 'workspace-begins-with' test, before a brace block. "B*quote*" { clear; add "begintext*"; push; get; '""' { # empty argument is an error clear; add "pep script error near line "; lines; add " (character "; chars; add "): \n"; add ' empty argument for begin-test (B"") \n'; print; quit; } --; put; ++; clear; .reparse } #-------------------------------------------- # ebnf: command := word, ';' ; # formats: "pop; push; clear; print; " etc # all commands need to end with a semi-colon except for # .reparse and .restart # "word*;*" { clear; # check if command requires parameter get; "add", "until", "while", "whilenot", "mark", "go", "escape", "unescape", "delim", "replace" { put; clear; add "'"; get; add "'"; add " << command needs an argument, on line "; lines; add " of script.\n"; print; clear; quit; } "clip" { clear; add "/* clip */ \n"; add "if (*mm->buffer.workspace != 0) \n"; add " { mm->buffer.workspace[strlen(mm->buffer.workspace)-1] = '\\0'; }"; put; } "clop" { clear; add "clop(mm);"; put; } "clear" { clear; add "mm->buffer.workspace[0] = '\\0'; /* clear */"; put; } "upper" { clear; add "char *s = mm->buffer.workspace; /* upper */\n"; add "while (*s) { *s = toupper((unsigned char) *s); s++; } "; put; } "lower" { clear; add "char *s = mm->buffer.workspace; /* lower */ \n"; add "while (*s) { *s = tolower((unsigned char) *s); s++; } "; put; } "cap" { clear; add "char *s = mm->buffer.workspace; /* cap */ \n"; add "if (*s) { *s = toupper((unsigned char) *s); s++; } \n"; add "while (*s) { *s = tolower((unsigned char) *s); s++; } "; put; } "print" { clear; add 'printf("%s", 
mm->buffer.workspace); /* print */'; put; }
    # this is using colours at the moment, not necessary.
    "state" { clear; add 'state(mm); /* state */'; put; }
    "pop" { clear; add "pop(mm);"; put; }
    "push" { clear; add "push(mm);"; put; }
    "unstack" { clear; add "while (pop(mm)) {} /* unstack */"; put; }
    "stack" { clear; add "while (push(mm)) {} /* stack */"; put; }
    "put" { clear; add "put(mm);"; put; }
    "get" { clear; add "get(mm);"; put; }
    "swap" { clear; add "swap(mm);"; put; }
    "++" { clear; add "increment(mm); /* ++ */ "; put; }
    "--" {
      clear;
      add "if (mm->tape.currentCell > 0) mm->tape.currentCell--; /* -- */";
      put;
    }
    "read" {
      clear;
      add "if (mm->peep == EOF) { break; } else { readChar(mm); } /* read */";
      put;
    }
    "count" { clear; add "count(mm);"; put; }
    "a+" { clear; add "mm->accumulator++; /* a+ */"; put; }
    "a-" { clear; add "mm->accumulator--; /* a- */"; put; }
    "zero" { clear; add "mm->accumulator = 0; /* zero */"; put; }
    "cc","chars" { clear; add "chars(mm);"; put; }
    "ll","lines" { clear; add "lines(mm);"; put; }
    "nochars" { clear; add "mm->charsRead = 0; /* nochars */"; put; }
    "nolines" { clear; add "mm->linesRead = 0; /* nolines */"; put; }
    # use a labelled loop to quit script?
    "quit" { clear; add "exit(0);"; put; }
    "write" {
      #clear; add "mm.writeToFile();"; put;
      clear;
      # the mode argument of fopen must be a c string, hence "w"
      add 'FILE * f = fopen("sav.pp", "w");\n';
      add 'fprintf(f, "%s", mm->buffer.workspace); /* write */';
      add "fclose(f);";
      put;
    }
    # just eliminate since it does nothing.
    "nop" { clear; add "/* nop: eliminated */"; put; }
    clear; add "command*"; push; .reparse
  }

  #-----------------------------------------
  # ebnf: commandset := command , command ;
  "command*command*", "commandset*command*" {
    clear;
    add "commandset*"; push;
    # format the tape attributes.
Add the next command on a newline --; get; add "\n"; ++; get; --; put; ++; clear; .reparse } #------------------- # here we begin to parse "test*" and "ortestset*" and "andtestset*" # #------------------- # eg: B"abc" {} or E"xyz" {} # transform and markup the different test types "begintext*,*","endtext*,*","quote*,*","class*,*", "eof*,*","tapetest*,*", "begintext*.*","endtext*.*","quote*.*","class*.*", "eof*.*","tapetest*.*", "begintext*{*","endtext*{*","quote*{*","class*{*", "eof*{*","tapetest*{*" { B"begin" { clear; # startswith in c # if(strncmp(a, b, strlen(b)) == 0) return 1; add "strncmp(mm->buffer.workspace, "; get; add ", strlen("; get; add ")) == 0"; } B"end" { clear; add "endsWith(mm->buffer.workspace, "; get; } B"quote" { clear; add "0 == strcmp(mm->buffer.workspace, "; get; } # probably could make this faster by simplifying the # workspaceInClassType func, just pass a fn pointer.... B"class" { # classes dont have quotes around them. clear; add 'workspaceInClassType(mm, "'; get; add '"'; } # clear the tapecell for testeof and testtape because # they take no arguments. B"eof" { clear; add "mm->peep == EOF"; } B"tapetest" { clear; # mm->tape.cells[mm->tape.currentCell].text add "strcmp(mm->buffer.workspace, \n"; add " mm->tape.cells[mm->tape.currentCell].text) == 0"; # add mm->tape[mm->tapePointer]) == 0"; } !B"mm->peep".!B"str" { add ")"; } put; #* # maybe we could ellide the not tests by doing here B"not" { clear; add "!"; get; put; } *# clear; add "test*"; push; # the trick below pushes the right token back on the stack. 
# eg either .* or ,* or "{*" get; add "*"; push; .reparse } #------------------- # negated tests # eg: !B"xyz {} !(eof) {} !(==) {} # !E"xyz" {} # !"abc" {} # ![a-z] {} "notbegintext*,*","notendtext*,*","notquote*,*","notclass*,*", "noteof*,*","nottapetest*,*", "notbegintext*.*","notendtext*.*","notquote*.*","notclass*.*", "noteof*.*","nottapetest*.*", "notbegintext*{*","notendtext*{*","notquote*{*","notclass*{*", "noteof*{*","nottapetest*{*" { B"notbegin" { clear; # startswith in c # if(strncmp(a, b, strlen(b)) == 0) return 1; add "strncmp(mm->buffer.workspace, "; get; add ", strlen("; get; add ")) != 0"; } B"notend" { clear; add "!endsWith(mm->buffer.workspace, "; get; } B"notquote" { clear; add "0 != strcmp(mm->buffer.workspace, "; get; } B"notclass" { clear; add '!workspaceInClassType(mm, "'; get; add '"'; } # clear the tapecell for testeof and testtape because # they take no arguments. B"noteof" { clear; add "mm->peep != EOF"; } B"nottapetest" { clear; # check this logic! add "strcmp(mm->buffer.workspace, \n"; add " mm->tape.cells[mm->tape.currentCell].text) != 0"; #add "strcmp(mm->buffer.workspace, mm->tape[mm->tapePointer]) == 0"; } !B"mm->peep".!B"str" { add ")"; } put; clear; add "test*"; push; # the trick below pushes the right token back on the stack. get; add "*"; push; .reparse } #------------------- # 3 tokens #------------------- pop; #----------------------------- # some 3 token errors!!! # not a comprehensive list of 3 token errors "{*quote*;*","{*begintext*;*","{*endtext*;*","{*class*;*", "commandset*quote*;*", "command*quote*;*" { push; push; push; add "[pep error]\n invalid syntax near line "; lines; add " (char "; chars; add ")"; add " of script (misplaced semicolon?) \n"; print; clear; quit; } # to simplify subsequent tests, transmogrify a single command # to a commandset (multiple commands). "{*command*}*" { clear; add "{*commandset*}*"; push; push; push; .reparse } # errors! 
mixing AND and OR concatenation ",*andtestset*{*", ".*ortestset*{*" { # push the tokens back to make debugging easier push; push; push; add " error: mixing AND (.) and OR (,) concatenation in \n"; add " in pep script near line "; lines; add " (character "; chars; add ") \n"; add ' For example: B".".!E"/".[abcd./] { print; } # Correct! B".".!E"/",[abcd./] { print; } # Error! \n'; print; clear; quit; } # arrange the parse> label loops. This is simple in c # because we have a goto statement (eof) { "commandset*parse>*commandset*","command*parse>*commandset*", "commandset*parse>*command*","command*parse>*command*" { clear; # dont have to indent both code blocks # add " "; get; replace "\n" "\n "; put; clear; ++; ++; # add " "; get; replace "\n" "\n "; put; clear; --; --; # dont need a lex block, because of goto #add "lex:\n"; get; #add "\n}\n"; ++; ++; # indent code block # add " "; get; replace "\n" "\n "; put; clear; add "\nparse: \n"; get; --; --; put; clear; add "commandset*"; push; .reparse } } #-------------------------------------------- # ebnf: command := keyword , quoted-text , ";" ; # format: add "text"; "word*quote*;*" { clear; get; "replace" { # error add ": command requires 2 parameters, not 1 \n"; add "near line "; lines; add " of script. \n"; print; clear; quit; } # check whether argument is single character, otherwise # throw an error "delim","escape","unescape","while","whilenot" { # This is trickier than I thought it would be. clear; ++; get; --; # check that arg not empty, (but an empty quote is ok # for the second arg of 'replace' '""' { clear; add "[pep error] near line:char "; lines; add ":"; chars; add " \n"; add "The command '"; get; add '\' cannot have an empty argument ("") \n'; print; quit; } # quoted text has the quotes still around it. 
# also handle escape characters like \n \r etc clip; clop; clop; clop; # B "\\" { clip; } clip; !"" { clear; add "Pep script error near line "; lines; add " (character "; chars; add "): \n"; add " command '"; get; add "' takes only a single character argument. \n"; print; quit; } clear; get; } "mark" { clear; add "strcpy(mm->tape.cells[mm->tape.currentCell].mark, "; ++; get; --; add "); /* mark */"; put; clear; add "command*"; push; .reparse } "go" { clear; ++; get; --; # remove quotes from around the mark clip; clop; put; clear; add "/* go */ \n"; add "int found = 0;\n"; add "for (int nn = 0; nn < mm->tape.capacity; nn++) { \n"; add " if (strcmp(mm->tape.cells[nn].mark, \""; get; add "\") == 0) { \n"; add " mm->tape.currentCell = nn; found = 1; break; \n"; add " }\n"; add "}"; add "if (!found) {\n"; add ' printf("badmark \''; get; add "'!\");\n"; add " exit(1);\n"; add "}"; put; clear; add "command*"; push; .reparse } "delim" { clear; # remove the quotes from around the delimiter and escape ' # because c uses single quotes for chars ++; get; clip; clop; "'" { clear; add "\\'"; } put; clear; # only the first character of the delimiter argument is used. add "mm->delimiter = '"; get; --; add "'; /* delim */ "; put; clear; add "command*"; push; .reparse } "add" { clear; add "add(mm, "; ++; get; --; # handle multiline text, check! replace "\n" '"); \nadd(mm, "\\n'; add "); "; put; clear; add "command*"; push; .reparse } # what is the meaning of while/whilenot with a quote argument?? "while","whilenot" { clear; add "[error] sorry the c translator does not \n"; add " accept a quoted text argument for the '"; get; add "'\n"; add " command. 
In anycase, it would not be very useful.\n"; add " try while [a-n]; or while [:space:]; or while [aeiou]; \n"; add " (At line "; lines; add ")\n"; print; quit; } "until" { clear; add "until(mm, "; ++; get; --; # error until cannot have empty argument 'until(mm, ""' { clear; add "Pep script error near line "; lines; add " (character "; chars; add "): \n"; add " empty argument for 'until' \n"; add " For example: until '.txt'; until \">\"; # correct until ''; until \"\"; # errors! \n"; print; quit; } # handle multiline argument replace "\n" "\\n"; add ');'; put; clear; add "command*"; push; .reparse } # But really, can't the "replace" command just be used # instead of escape/unescape?? This seems a flaw in the # machine design. Unescape wont work yet. "escape","unescape" { put; clear; # remove double quotes from argument (to replace with '') # and escape ' because its going in single quotes ++; get; clip; clop; escape "'"; put; clear; --; get; add "Char(mm, '"; ++; get; --; add "');"; put; clear; add "command*"; push; .reparse } # error, superfluous argument add ": command does not take an argument \n"; add "near line "; lines; add " of script. \n"; print; clear; quit; } #---------------------------------- # format: "while [:alpha:] ;" or whilenot [a-z] ; "word*class*;*" { clear; get; # what is the meaning of peep with a quote argument?? # with some tricks I think I can ellide "whilenot" here # as well. eg: store "!" or "" in cell, then get it to # negate the logic! "while","whilenot" { # a trick to negate tests replace "while" ""; replace "not" "!"; put; clear; # 3 different cases: [a-z] [acx.] [:alpha:] ++; get; --; # check if [a-z] range B"[".E"]" { clip; clip; clop; clop; "-" { clear; ++; get; # a trick: turn [a-z] into 'a') && ('z' then insert # in code replace "[" "'"; replace "]" "'"; replace "-" "') && ('"; put; clear; add "while ("; # here we get the c negation operator "!" 
          # which was earlier stored in the cell
          --; get; ++;
          add "((mm->peep >= "; get; --;
          add " >= mm->peep)) && readc(mm)) {} /* while */";
          put; clear;
          add "command*"; push; .reparse
        }
        # the char class names and function names are the same
        # luckily.
        "alnum","alpha","blank","cntrl","digit","graph",
        "lower","print","punct","space","upper","xdigit" {
          ++; put; --; clear;
          add "while (";
          # insert negation operator, if any
          get; ++; add "is"; get; --;
          add "(mm->peep) && readc(mm)) {} /* while */";
          put; clear;
          add "command*"; push; .reparse
        }
        # bug: \x will crash this because hex digits are
        # expected by the compiler after it
        clear; ++; get;
        replace "[" '"'; replace "]" '"';
        put; clear;
        # insert negation operator, if any.
        add "while ("; --; get; ++;
        add "(strchr("; get; --;
        add ", mm->peep) != NULL) && readc(mm)) {} /* while */";
        put; clear;
        add "command*"; push; .reparse
        #if (!readc(mm)) return;
      }
      put; clear;
      add "[error] strange char class "; get; add "!!";
      print; quit;
      #add "command*"; push; .reparse
    }
    # error
    add " < command cannot have a class argument \n";
    add "line "; lines; add ": error in script \n";
    print; clear; quit;
  }

  # -------------------------------
  # 4 tokens
  # -------------------------------
  pop;

  #-------------------------------------
  # bnf: command := replace , quote , quote , ";" ;
  # example: replace "and" "AND" ;
  "word*quote*quote*;*" {
    clear; get;
    "replace" {
      #---------------------------
      # a command plus 2 arguments, eg replace "this" "that"
      # requires a helper function (in buffer.c).
      clear;
      add "replace(mm, "; ++; get; add ", "; ++; get;
      add "); /* replace */";
      --; --; put; clear;
      add "command*"; push; .reparse
    }
    add "[error] pep script error on line "; lines;
    add " (character "; chars; add "): \n";
    add " command does not take 2 quoted arguments. \n";
    print; quit;
  }

  #-------------------------------------
  # format: begin { #* commands *# }
  # "begin" blocks are executed only once (they
  # are assembled before the "start:" label).
  # They must come before all other commands.
  #
  "begin*{*command*}*", "begin*{*commandset*}*" {
    clear; ++; ++; get; --; --; put; clear;
    add "beginblock*"; push; .reparse
  }

  # -------------
  # parses and compiles concatenated tests
  # eg: 'a',B'b',E'c',[def],[:space:],[g-k] { ...
  # these 2 tests should be all that is necessary
  "test*,*ortestset*{*", "test*,*test*{*" {
    clear; get; add " || "; ++; ++; get; --; --; put; clear;
    add "ortestset*{*"; push; push; .reparse
  }

  # don't mix AND and OR concatenations
  # -------------
  # AND logic
  # parses and compiles concatenated AND tests
  # eg: 'a'.B'b'.E'c'.[def].[:space:].[g-k] { ...
  # it is possible to elide this block with the negated block
  # for compactness but maybe readability is not as good.
  # negated tests can be chained with non negated tests.
  # eg: B'http' . !E'.txt' { ... }
  "test*.*andtestset*{*", "test*.*test*{*" {
    clear; get; add " && "; ++; ++; get; --; --; put; clear;
    add "andtestset*{*"; push; push; .reparse
  }

  #-------------------------------------
  # we should not have to check for the {*command*}* pattern
  # because that has already been transformed to {*commandset*}*
  "test*{*commandset*}*",
  "andtestset*{*commandset*}*",
  "ortestset*{*commandset*}*" {
    clear;
    # indent the generated c code for readability
    ++; ++; add " "; get; replace "\n" "\n "; put; --; --; clear;
    add "if ("; get; add ") {\n"; ++; ++; get; add "\n}"; --; --;
    put; clear;
    add "command*"; push;
    # always reparse/compile
    .reparse
  }

  # -------------
  # multi-token end-of-stream errors
  # not a comprehensive list of errors...
  (eof) {
    E"begintext*",E"endtext*",E"test*",E"ortestset*",E"andtestset*" {
      add " Error near end of script at line "; lines;
      add ". Test with no brace block? \n";
      print; clear; quit;
    }
    E"quote*",E"class*",E"word*"{
      put; clear;
      add "Error at end of pep script near line "; lines;
      add ": missing semi-colon? 
\n";
      add "Parse stack: "; get; add "\n";
      print; clear; quit;
    }
    E"{*", E"}*", E";*", E",*", E".*", E"!*", E"B*", E"E*" {
      put; clear;
      add "Error: misplaced terminal character at end of script! (line ";
      lines; add "). \n";
      add "Parse stack: "; get; add "\n";
      print; clear; quit;
    }
  }

  # put the 4 (or less) tokens back on the stack
  push; push; push; push;

  (eof) {
    print; clear;
    # create the virtual machine object code and save it
    # somewhere on the tape.
    add '
 /* c code generated by "tr/translate.c.pss" */
 /* note: this c engine cannot handle unicode! */
 #include #include #include #include
 #include "colours.h"
 #include "tapecell.h"
 #include "tape.h"
 #include "buffer.h"
 #include "charclass.h"
 #include "command.h"
 #include "parameter.h"
 #include "instruction.h"
 #include "labeltable.h"
 #include "program.h"
 #include "machine.h"
 #include "exitcode.h"
 #include "machine.methods.h"

 int main() {
 struct Machine machine;
 struct Machine * mm = &machine;
 newMachine(mm, stdin, 100, 10);\n';
    # save the code in the current tape cell
    put; clear;

    #---------------------
    # check if the script correctly parsed (there should only
    # be one token on the stack, namely "commandset*" or "command*").
    pop; pop;
    "commandset*", "command*" {
      clear;
      # indent generated code (6 spaces) for readability.
      add " "; get; replace "\n" "\n "; put; clear;
      # restore the c preamble from the tape
      ++; get; --;
      add ' script: while (!mm->peep != EOF) {\n'; get;
      add "\n }"; add "\n}\n";
      # put a copy of the final compilation into the tapecell
      # so it can be inspected interactively.
      put; print; clear; quit;
    }
    "beginblock*commandset*", "beginblock*command*" {
      clear;
      # indent begin block code
      add " "; get; replace "\n" "\n "; put; clear;
      # indent main code for readability.
      ++; add " "; get; replace "\n" "\n "; put; clear; --;
      # get c preamble from tape
      ++; ++; get; --; --; get; add "\n"; ++;
      # a labelled loop for "quit" (but quit can just exit?)
      add " script: \n";
      add " while (!mm->peep != EOF) {\n"; get;
      add "\n }"; add "\n}\n";
      # put a copy of the final compilation into the tapecell
      # for interactive debugging.
      put; print; clear; quit;
    }
    push; push;
    # try to explain some more errors
    unstack;
    B"parse>" {
      put; clear;
      add "[error] pep syntax error:\n";
      add " The parse> label cannot be the 1st item \n";
      add " of a script \n";
      print; quit;
    }
    put; clear; clear;
    add "[error] After compiling with 'tr/translate.c.pss' (at EOF): \n ";
    print; clear; unstack; put; clear;
    add "Parse stack: "; get; add "\n";
    add " * debug script \n ";
    add " >> pep -If script -i 'some input' \n ";
    add " * debug compilation. \n ";
    add " >> pep -Ia asm.pp script \n ";
    print; clear; quit;
  }
  # not eof
  # there is an implicit .restart command here (jump start)
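
#*

SKETCH: goToMark()

  The BUGS section at the top of this file says that a "goToMark()"
  function is needed, and the "go" block above currently writes the
  mark search inline into the generated c code. The block below is
  only a rough sketch of what such a helper might look like. It is not
  part of the pep engine: "struct Cell", "struct Tape", MARK_SIZE and
  TAPE_CAPACITY are simplified, hypothetical stand-ins for whatever
  tape.h and tapecell.h really define, and the return convention
  (1 if found, 0 if not) is an assumption.

```c
#include <assert.h>
#include <string.h>

/* hypothetical sizes and stand-in types, simplified for this sketch;
   the real definitions live in the engine headers */
#define MARK_SIZE 64
#define TAPE_CAPACITY 100

struct Cell { char mark[MARK_SIZE]; };

struct Tape {
  struct Cell cells[TAPE_CAPACITY];
  int currentCell;
  int capacity;
};

/* move currentCell to the first cell whose mark equals "mark".
   returns 1 if the mark was found, or 0 if it was not found
   (in which case currentCell is left unchanged) */
int goToMark(struct Tape * tape, const char * mark) {
  assert(tape != NULL && mark != NULL);
  for (int nn = 0; nn < tape->capacity; nn++) {
    if (strcmp(tape->cells[nn].mark, mark) == 0) {
      tape->currentCell = nn;
      return 1;
    }
  }
  return 0;
}
```

  With a helper along these lines the "go" block would only need to
  emit one call and one error check per "go" command, and the
  generated code would no longer declare a "found" variable each time.
  The sketch has no main() function because it is meant to be called
  from the generated program.

*#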