#*

 tr/translate.go.pss

   This is a parse-script which translates parse-scripts into "go" (golang)
   code, using the 'pep' tool. The script creates a standalone go program.
   The virtual machine and engine are implemented in plain c at
   http://bumble.sf.net/books/pars/pep.c. This implements a script language
   with a syntax reminiscent of sed and awk (simpler than awk, but more
   complex than sed).

STATUS

  18 feb 2025
    failing some tests now (maybe new tests) eg unescape, tricky
    characters.

  28 aug 2021
    1st and 2nd gen working in pep.tt go (tr/tr.test.txt)

NOTES

  Really I should remove the syntax from the script language because this
  can be expressed with . This will reduce complexity. Also, in the case
  we should translate as "for mm.peep=='x' { read }

  Maybe should use go's unicode.IsPunct('.') functions instead of
  regular expressions, that should be faster. Also that solves our
  multi-return value problem.

  strings.ContainsAny("Hello World", ",|or") for [xyz]

  * So for "while" command it will be
  >> while strings.ContainsAny(mm.peep, "abc") { read }

  * for tests eg [abc] {...} it will be
  >> if strings.ContainsAny(mm.work, "abc") { }

  * write function isType() with function argument
  ---
    type fn func(rune) bool   // eg unicode.IsLetter('x')
    func isType(f fn, s string) bool {
      // loop through each char in s
    }
    // call
    if isType(unicode.IsLetter, mm.work)
    while isType(unicode.IsLetter, mm.peep) read
  ,,,

  unicode.IsLetter('x') for [:alpha:]

  * see if all chars in workspace are in range
  ---
    f := func(r rune) bool { return r < 'A' || r > 'z' }
    if strings.IndexFunc(mm.work, f) != -1 {
      fmt.Println("Found special char")
    }
  ,,,

  * or pass the function straight in?
    if strings.IndexFunc(mm.work,
         func(r rune) bool { return r<'A'||r>'z'}) != -1 {
      fmt.Println("Found special char")
    }

    if square, _ := squareAndCube(n); square > m {

  this is tricky, because can't have anonymous value from

  * regexp syntax in go.
  >> match, _ := regexp.MatchString("p([a-z]+)ch", "peach")
  >> fmt.Println(match)

  In other translation scripts, we use labelled loops and
  break/continue to implement the parse> label and .reparse .restart
  commands. Breaks could be used to implement the quit command but
  aren't. Does go support labelled loops? Yes.

  We can use "run once" loops eg " for true do ... break; end ".
  An example is in the translate.tcl.pss script.

SEE ALSO

  At http://bumble.sf.net/books/pars/

  tr/translate.java.pss, tr/translate.py.pss, tr/translate.rb.pss
    very similar scripts for compiling scripts into java, python,
    ruby and more

  compile.pss
    compiles a script into an "assembly" format that can be loaded
    and run on the parse-machine with the -a switch. This performs
    the same function as "asm.pp"

TESTING

  Comprehensive testing can be done with
  >> pep.tt go

  A simple "state" command may be useful for debugging these
  translation scripts and the corresponding machines.

  test begin blocks. parse> .reparse .restart

  Try 2nd generation
  ---
    pep -f tr/translate.go.pss tr/translate.go.pss > eg/go/translate.go.go
    echo "r;[a-d]{t;}t;d;" | eg/go/translate.go.go > test.go
    echo "abxy" | ./test.go
    # and the output is "aabbxy"
  ,,,

  So the script translates itself into go, then the new go translator
  translates another script into go.

  * use a helper script to test begin blocks, stack delimiter, and pushing
  >> pep.gos 'begin { delim "/";} r; add "/";push; state; d;'
  >> pep.gos 'begin { delim "/";} r; add "/";push; state; d;' "abcd"

  * a simple test procedure
  ---------
    pep -f translate.go.pss -i "r;t;t;d;" > test.go
    go build test.go
    echo "abc" | ./test
    # should print 'aabbcc'
  ,,,

  * use the bash helper functions to test (from helpers.pars.sh)
  >> pep.gof eg/json.check.pss '{"here":2}'

  The line above compiles the script to go in the folder
  pars/eg/go/json.check.pss and runs it with the input.

  check multiline text with 'add' and 'until'

  * one comprehensive test is to run the script on itself
  >> pep -f translate.go.pss translate.go.pss > eg/go/translate.go.go
  >> cd eg/go/; go build translate.go.go
  >> echo "r;t;t;d;" | eg/go/translate.go

WATCH OUT FOR

  Treatment of regexes is different (for while, whilenot etc).
  Eg in ruby [[:space:]] is unicode aware but \s is not.

  Make sure .reparse and .restart work before and after the
  parse> label.

  Make sure escaping and multiline arguments work.

BUGS

  isInList logic is incorrect.

  unescapechar must be fixed, looking at java code.

  will reparse or restart work in a begin block?

  parse> label just after begin block or after all code.

  multiline add not working?

  mark code may not be correct

SOLVED BUGS TO WATCH FOR

  * the line below was throwing an error, problem was in compile.pss
  >> add '", "\\'; get; add '")'; --; put; clear;

  Java needs a double escape \\\\ before some chars, but ruby doesnt
  languages no.

  escape needs to use the machine escape char.

  Found and fixed a bug in java whilenot/while. The code exits if the
  character is not found, which is not correct.

  Found and fixed a bug in the (==) code ie in java (stringa == stringb)
  doesn't work.

  "until" bug where the code did not read at least one character.

  Read must exit if at end of stream, but while/whilenot/until, no.

TASKS

HISTORY

  17 June 2022
    Trying to make the tape dynamically growable, which is
    necessary for scripts like eg/palindrome.pss

  28 aug 2021
    fixing class tests and while class code, using helper functions
    isInClass etc

  26 aug 2021
    continuing to debug. need to convert class regex syntax.
    Using "pep.tt go" to find errors.

  15 july 2021
    continued the work of syntax conversion, but scripts are not
    yet compiling with 'go build test.go' etc. I made some helper
    scripts in helpers.pars.sh for testing.
*#

  read;

  #--------------
  [:space:] { clear; .reparse }

  #---------------
  # We can elide all these single character tests, because
  # the stack token is just the character itself with a *
  # Braces {} are used for blocks of commands, ',' and '.' for concatenating
  # tests with OR or AND logic. 'B' and 'E' for begin and end
  # tests, '!' is used for negation, ';' is used to terminate a
  # command.
  "{", "}", ";", ",", ".", "!", "B", "E" {
    put; add "*"; push; .reparse
  }

  #---------------
  # format: "text"
  "\"" {
    # save the start line number (for error messages) in case
    # there is no terminating quote character.
    clear; add "line "; lines; add " (character "; chars; add ") "; put;
    clear; add '"'; until '"';
    !E'"' {
      clear;
      add 'Unterminated quote character (") starting at ';
      get; add ' !\n'; print; quit;
    }
    put; clear; add "quote*"; push; .reparse
  }

  #---------------
  # format: 'text', single quotes are converted to double quotes
  # but we must escape embedded double quotes.
  "'" {
    # save the start line number (for error messages) in case
    # there is no terminating quote character.
    clear; add "line "; lines; add " (character "; chars; add ") "; put;
    clear; until "'";
    !E"'" {
      clear; add "Unterminated quote (') starting at ";
      get; add '!\n'; print; quit;
    }
    # empty quotes '' may be legal, for example as the second arg
    # to replace.
    clip; escape '"'; unescape "'"; put; clear;
    add "\""; get; add "\""; put;
    clear; add "quote*"; push; .reparse
  }

  #---------------
  # formats: [:space:] [a-z] [abcd] [:alpha:] etc
  # should class tests really be multiline??!
  "[" {
    # save the start line number (for error messages) in case
    # there is no terminating bracket character.
    clear; add "line "; lines; add " (character "; chars; add ") "; put;
    clear; add "["; until "]";
    "[]" {
      clear;
      add "pep script error at line "; lines;
      add " (character "; chars; add "): \n";
      add " empty character class [] \n";
      print; quit;
    }
    !E"]" {
      clear; add "Unterminated class text ([...]) starting at ";
      get;
      add "
      class text can be used in tests or with the 'while' and
      'whilenot' commands. For example:
        [:alpha:] { while [:alpha:]; print; clear; }
      ";
      print; quit;
    }
    # need to escape quotes?
    escape '"';
    # the caret is not a negation operator in pep char classes
    # but dont have to escape caret because not using regexs
    # replace "^" "\\^";

    # save the class on the tape
    put;
    clop; clop;
    !B"-" {
      # not a range class, eg [a-z] but dont need to escape '-' chars
      # because not using regexs
      #clear; get; replace '-' '\\-'; put;
      nop;
    }
    B"-" {
      # a range class, eg [a-z], check if it is correct
      clip; clip;
      !"-" {
        clear; add "Error in pep script at line "; lines;
        add " (character "; chars; add "): \n";
        add " Incorrect character range class "; get;
        add "
   For example:
     [a-g]  # correct
     [f-gh] # error! \n";
        print; clear; quit;
      }
      # correct format, eg: [a-z] now translate to a
      # format that can be used by a go function
      clear; get; clip; clop; put;
      clear; add "'"; get; add "'";
      # but if the range contains '-' this is a bug!
      replace '-' "','";
      # now='a','z'
      put; clear; add "isInRange("; get; put;
      clear; add "class*"; push; .reparse
    }
    clear; get;
    # restore class text
    B"[:".!E":]" {
      clear; add "malformed character class starting at ";
      get; add '!\n'; print; quit;
    }
    # class in the form [:digit:]
    B"[:".!"[:]" {
      clip; clip; clop; clop;
      # unicode posix character classes
      # Also, abbreviations (not implemented in pep.c yet.)
      # classes like [[:alpha:]] are only ascii in golang, but
      # see also unicode.IsLower('x');
      # fix!
      "alnum","N" { clear; add "isInClass(unicode.IsLetter"; }
      #"alpha","A" { clear; add "[[:alpha:]]"; }
      "alpha","A" { clear; add "isInClass(unicode.IsLetter"; }
      # check!
      # non-standard posix class 'word' and ascii
      # check!
      "ascii","I" { clear; add "isInRange(rune(0), rune(unicode.MaxASCII) "; }
      "word","W" { clear; add "isInClass(unicode.IsLetter"; }
      # fix!
      "blank","B" { clear; add "isInClass(unicode.IsSpace"; }
      "cntrl","C" { clear; add 'isInClass(unicode.IsControl'; }
      "digit","D" { clear; add "isInClass(unicode.IsDigit"; }
      "graph","G" { clear; add 'isInClass(unicode.IsGraphic'; }
      "lower","L" { clear; add 'isInClass(unicode.IsLower'; }
      "print","P" { clear; add "isInClass(unicode.IsPrint"; }
      "punct","T" { clear; add 'isInClass(unicode.IsPunct'; }
      "space","S" { clear; add "isInClass(unicode.IsSpace"; }
      "upper","U" { clear; add 'isInClass(unicode.IsUpper'; }
      "xdigit","X" { clear; add 'isInList("0123456789abcdefABCDEF"'; }
      !B"isIn".!B"[" {
        put; clear; add "pep script error at line "; lines;
        add " (character "; chars; add "): \n";
        add "Unknown character class '"; get; add "'\n";
        print; clear; quit;
      }
      put; clear; add "class*"; push; .reparse
    }
    #*
      alnum - alphanumeric like [0-9a-zA-Z]
      alpha - alphabetic like [a-zA-Z]
      blank - blank chars, space and tab
      cntrl - control chars, ascii 000 to 037 and 177 (del)
      digit - digits 0-9
      graph - graphical chars same as :alnum: and :punct:
      lower - lower case letters [a-z]
      print - printable chars ie :graph: + space
      punct - punctuation ie !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.
      space - all whitespace, eg \n\r\t vert tab, space, \f
      upper - upper case letters [A-Z]
      xdigit - hexadecimal digit ie [0-9a-fA-F]
    *#

    # must be a list eg [abcdefg]
    clear; get; clip; clop; unescape "]"; put;
    clear; add '"'; get; add '"'; put;
    clear; add "isInList("; get; put;
    clear; add "class*"; push; .reparse
  }

  #---------------
  # formats: (eof) (EOF) (==) etc.
  "(" {
    clear; until ")"; clip; put;
    "eof","EOF" { clear; add "eof*"; push; .reparse }
    "==" { clear; add "tapetest*"; push; .reparse }
    add " << unknown test near line "; lines;
    add " of script.\n";
    add " bracket () tests are \n";
    add " (eof) test if end of stream reached. \n";
    add " (==) test if workspace is same as current tape cell \n";
    print; clear; quit;
  }

  #---------------
  # multiline and single line comments, eg #... and #* ... *#
  "#" {
    clear; read;
    "\n" { clear; .reparse }
    # checking for multiline comments of the form "#* \n\n\n *#"
    # these are just ignored at the moment (deleted)
    "*" {
      # save the line number for possible error message later
      clear; lines; put; clear; until "*#";
      E"*#" {
        # convert to go comments (/*...*/ and //)
        # or just one multiline
        clip; clip; replace "\n" "\n//"; put; clear;
        # create a "comment" parse token
        # comment-out this line to remove multiline comments from the
        # translated golang code
        # add "comment*"; push;
        .reparse
      }
      # make an unterminated multiline comment an error
      # to ease debugging of scripts.
      clear; add "unterminated multiline comment #* ... *# \n";
      add "starting at line number "; get; add "\n";
      print; clear; quit;
    }
    # single line comments. some will get lost.
    put; clear; add "//"; get; until "\n"; clip; put; clear;
    # comment out the line below to remove single line comments
    # from the output
    add "comment*"; push; .reparse
  }

  #----------------------------------
  # parse command words (and abbreviations)
  # legal characters for keywords (commands)
  ![abcdefghijklmnopqrstuvwxyzBEKGPRUWS+-<>0^] {
    # error message about a misplaced character
    put; clear;
    add "!! Misplaced character '"; get;
    add "' in script near line "; lines;
    add " (character "; chars; add ") \n";
    print; clear; quit;
  }

  # my testclass implementation cannot handle complex lists
  # eg [a-z+-] this is why I have to write out the whole alphabet
  while [abcdefghijklmnopqrstuvwxyzBEOFKGPRUWS+-<>0^];

  #----------------------------------
  # KEYWORDS
  # here we can test for all the keywords (command words) and their
  # abbreviated one letter versions (eg: clip k, clop K etc).
  # Then we can print an error message and abort if the word is not a
  # legal keyword for the parse-edit language

  # make ll an alias for "lines" and cc an alias for chars
  "ll" { clear; add "lines"; }
  "cc" { clear; add "chars"; }
  # one letter command abbreviations
  "a" { clear; add "add"; }
  "k" { clear; add "clip"; }
  "K" { clear; add "clop"; }
  "D" { clear; add "replace"; }
  "d" { clear; add "clear"; }
  "t" { clear; add "print"; }
  "p" { clear; add "pop"; }
  "P" { clear; add "push"; }
  "u" { clear; add "unstack"; }
  "U" { clear; add "stack"; }
  "G" { clear; add "put"; }
  "g" { clear; add "get"; }
  "x" { clear; add "swap"; }
  ">" { clear; add "++"; }
  "<" { clear; add "--"; }
  "m" { clear; add "mark"; }
  "M" { clear; add "go"; }
  "r" { clear; add "read"; }
  "R" { clear; add "until"; }
  "w" { clear; add "while"; }
  "W" { clear; add "whilenot"; }
  "n" { clear; add "count"; }
  "+" { clear; add "a+"; }
  "-" { clear; add "a-"; }
  "0" { clear; add "zero"; }
  "c" { clear; add "chars"; }
  "l" { clear; add "lines"; }
  "^" { clear; add "escape"; }
  "v" { clear; add "unescape"; }
  "z" { clear; add "delim"; }
  "S" { clear; add "state"; }
  "q" { clear; add "quit"; }
  "s" { clear; add "write"; }
  "o" { clear; add "nop"; }
  "rs" { clear; add "restart"; }
  "rp" { clear; add "reparse"; }

  # some extra syntax for testeof and testtape
  "","" { put; clear; add "eof*"; push; .reparse }
  "<==>" { put; clear; add "tapetest*"; push; .reparse }

  "jump","jumptrue","jumpfalse",
  "testis","testclass","testbegins","testends",
  "testeof","testtape" {
    put; clear;
    add "The instruction '"; get; add "' near line "; lines;
    add " (character "; chars; add ")\n";
    add "can be used in pep assembly code but not scripts. \n";
    print; clear; quit;
  }

  # show information if these "deprecated" commands are used
  "Q","bail" {
    put; clear;
    add "The instruction '"; get; add "' near line "; lines;
    add " (character "; chars; add ")\n";
    add "is no longer part of the pep language. \n";
    add "use 'quit' instead of 'bail' \n";
    print; clear; quit;
  }

  "add","clip","clop","replace","upper","lower","cap","clear","print","state",
  "pop","push","unstack","stack","put","get","swap",
  "++","--","mark","go","read","until","while","whilenot",
  "count","a+","a-","zero","chars","lines","nochars","nolines",
  "escape","unescape","delim","quit",
  "write","nop","reparse","restart" {
    put; clear; add "word*"; push; .reparse
  }

  #------------
  # the .reparse command and "parse label" is a simple way to
  # make sure that all shift-reductions occur. It should be used inside
  # a block test, so as not to create an infinite loop. Go's goto
  # cannot jump over variable declarations or into blocks, so we
  # use run-once loops to implement .reparse/parse>
  "parse>" {
    clear; count;
    !"0" {
      clear; add "script error:\n";
      add " extra parse> label at line "; lines; add ".\n";
      print; quit;
    }
    clear; add "# parse> parse label"; put;
    clear; add "parse>*"; push;
    # use accumulator to indicate after parse> label
    a+; .reparse
  }

  # --------------------
  # implement "begin-blocks", which are only executed
  # once, at the beginning of the script (similar to awk's BEGIN {} rules)
  "begin" { put; add "*"; push; .reparse }

  add " << unknown command on line "; lines;
  add " (char "; chars; add ")";
  add " of source file. \n";
  print; clear; quit;

# ----------------------------------
# PARSING PHASE:
# Below is the parse/compile phase of the script. Here we pop tokens off the
# stack and check for sequences of tokens eg "word*semicolon*". If we find a
# valid series of tokens, we "shift-reduce" or "resolve" the token series eg
# word*semicolon* --> command*
#
# At the same time, we manipulate (transform) the attributes on the tape, as
# required.
#
parse>

#-------------------------------------
# 2 tokens
#-------------------------------------
  pop; pop;

  # All of the patterns below are currently errors, but may not
  # be in the future if we expand the syntax of the parse
  # language.
  # Also consider:
  # begintext* endtext* quoteset* notclass*, !* ,* ;* B* E*
  # It is nice to trap the errors here because we can emit some
  # (hopefully not very cryptic) error messages with a line number.
  # Otherwise the script writer has to debug with
  # pep -a asm.pp -I scriptfile
  #
  "word*word*","word*}*","word*begintext*","word*endtext*",
  "word*!*", "word*,*","quote*word*", "quote*class*",
  "quote*state*", "quote*}*", "quote*begintext*",
  "quote*endtext*", "class*word*", "class*quote*", "class*class*",
  "class*state*", "class*}*", "class*begintext*", "class*endtext*",
  "class*!*", "notclass*word*", "notclass*quote*", "notclass*class*",
  "notclass*state*", "notclass*}*" {
    add " (Token stack) \nValue: \n"; get;
    add "\nValue: \n"; ++; get; --; add "\n";
    add "Error near line "; lines;
    add " (char "; chars; add ")";
    add " of pep script (missing semicolon?) \n";
    print; clear; quit;
  }

  "{*;*", ";*;*", "}*;*" {
    push; push;
    add "Error near line "; lines;
    add " (char "; chars; add ")";
    add " of pep script: misplaced semi-colon? ; \n";
    print; clear; quit;
  }

  ",*{*" {
    push; push;
    add "Error near line "; lines;
    add " (char "; chars; add ")";
    add " of script: extra comma in list? \n";
    print; clear; quit;
  }

  "command*;*","commandset*;*" {
    push; push;
    add "Error near line "; lines;
    add " (char "; chars; add ")";
    add " of script: extra semi-colon? \n";
    print; clear; quit;
  }

  "!*!*" {
    push; push;
    add "error near line "; lines;
    add " (char "; chars; add ")";
    add " of script: \n double negation '!!' is not implemented \n";
    add " and probably won't be, because what would be the point? \n";
    print; clear; quit;
  }

  "!*{*","!*;*" {
    push; push;
    add "error near line "; lines;
    add " (char "; chars; add ")";
    add " of script: misplaced negation operator (!)? \n";
    add " The negation operator precedes tests, for example: \n";
    add " !B'abc'{ ... } or !(eof),!'abc'{ ... } \n";
    print; clear; quit;
  }

  ",*command*" {
    push; push;
    add "error near line "; lines;
    add " (char "; chars; add ")";
    add " of script: misplaced comma? \n";
    print; clear; quit;
  }

  "!*command*" {
    push; push;
    add "error near line "; lines;
    add " (at char "; chars; add ") \n";
    add " The negation operator (!) cannot precede a command \n";
    print; clear; quit;
  }

  ";*{*", "command*{*", "commandset*{*" {
    push; push;
    add "error near line "; lines;
    add " (char "; chars; add ")";
    add " of script: no test for brace block? \n";
    print; clear; quit;
  }

  "{*}*" {
    push; push;
    add "error near line "; lines;
    add " of script: empty braces {}. \n";
    print; clear; quit;
  }

  "B*class*","E*class*" {
    push; push;
    add "error near line "; lines;
    add " of script:\n classes ([a-z], [:space:] etc). \n";
    add " cannot use the 'begin' or 'end' modifiers (B/E) \n";
    print; clear; quit;
  }

  "comment*{*" {
    push; push;
    add "error near line "; lines;
    add " of script: comments cannot occur between \n";
    add " a test and a brace ({). \n";
    print; clear; quit;
  }

  "}*command*" {
    push; push;
    add "error near line "; lines;
    add " of script: extra closing brace '}' ?. \n";
    print; clear; quit;
  }

  #*
  E"begin*".!"begin*" {
    push; push;
    add "error near line "; lines;
    add " of script: Begin blocks must precede code \n";
    print; clear; quit;
  }
  *#

  #------------
  # The .restart command jumps to the first instruction after the
  # begin block (if there is a begin block), or the first instruction
  # of the script.
  ".*word*" {
    clear; ++; get; --;
    "restart" {
      clear; count;
      # this is the opposite of .reparse, using run-once loops
      # cant do next before label, infinite loop
      # need to set flag variable.
      # I think go has labelled loops
      # before the parse> label
      "0" { clear; add "restart = true; continue // restart"; }
      "1" { clear; add "break"; }   # after the parse> label
      put; clear;
      add "command*"; push; .reparse
    }
    "reparse" {
      clear; count;
      # check accumulator to see if we are in the "lex" block
      # or the "parse" block and adjust the .reparse compilation
      # accordingly.
      "0" { clear; add "break"; }
      "1" { clear; add "continue"; }
      put; clear;
      add "command*"; push; .reparse
    }
    push; push;
    add "error near line "; lines;
    add " (char "; chars; add ")";
    add " of script: \n";
    add " misplaced dot '.' (use for AND logic or in .reparse/.restart \n";
    print; clear; quit;
  }

  #---------------------------------
  # Compiling comments so as to transfer them to the go code
  "comment*command*","command*comment*","commandset*comment*" {
    clear; get; add "\n"; ++; get; --; put; clear;
    add "command*"; push; .reparse
  }

  "comment*comment*" {
    clear; get; add "\n"; ++; get; --; put; clear;
    add "comment*"; push; .reparse
  }

  # -----------------------
  # negated tokens.
  #
  # This is a new more elegant way to negate a whole set of
  # tests (tokens) where the negation logic is stored on the
  # stack, not in the current tape cell. We just add "not" to
  # the stack token.
  # eg: ![:alpha:] ![a-z] ![abcd] !"abc" !B"abc" !E"xyz"
  # This format is used to indicate a negative test for
  # a brace block. eg: ![aeiou] { add "< not a vowel"; print; clear; }
  "!*quote*","!*class*","!*begintext*", "!*endtext*",
  "!*eof*","!*tapetest*" {
    # a simplification: store the token name "quote*/class*/..."
    # in the tape cell corresponding to the "!*" token.
    replace "!*" "not"; push;
    # this was a bug?? a missing ++; ??
    # now get the token-value
    get; --; put; ++; clear; .reparse
  }

  #-----------------------------------------
  # format: E"text" or E'text'
  # This format is used to indicate a "workspace-ends-with" text before
  # a brace block.
  "E*quote*" {
    clear; add "endtext*"; push; get;
    '""' {
      # empty argument is an error
      clear; add "pep script error near line "; lines;
      add " (character "; chars; add "): \n";
      add ' empty argument for end-test (E"") \n';
      print; quit;
    }
    --; put; ++; clear; .reparse
  }

  #-----------------------------------------
  # format: B"sometext" or B'sometext'
  # A 'B' preceding some quoted text is used to indicate a
  # 'workspace-begins-with' test, before a brace block.
  "B*quote*" {
    clear; add "begintext*"; push; get;
    '""' {
      # empty argument is an error
      clear; add "pep script error near line "; lines;
      add " (character "; chars; add "): \n";
      add ' empty argument for begin-test (B"") \n';
      print; quit;
    }
    --; put; ++; clear; .reparse
  }

  #--------------------------------------------
  # ebnf: command := word, ';' ;
  # formats: "pop; push; clear; print; " etc
  # all commands need to end with a semi-colon except for
  # .reparse and .restart
  #
  "word*;*" {
    clear;
    # check if command requires parameter
    get;
    "add","while","whilenot","mark",
    "escape","unescape","delim","replace" {
      put; clear; add "'"; get; add "'";
      add " << command needs an argument, on line "; lines;
      add " of script.\n";
      print; clear; quit;
    }
    # new until; syntax
    "until" {
      clear; add "mm.until(mm.tape[mm.cell]) /* until (tape) */"; put;
    }
    # new go; syntax (go to mark named in current tape cell)
    "go" {
      clear; add "mm.goToMark(mm.tape[mm.cell]) /* go (tape) */"; put;
    }
    "clip" { clear; add "mm.clip()"; put; }
    "clop" { clear; add "mm.clop()"; put; }
    "clear" { clear; add 'mm.work = "" // clear'; put; }
    "upper" {
      clear; add "mm.work = strings.ToUpper(mm.work) /* upper */"; put;
    }
    "lower" {
      clear; add "mm.work = strings.ToLower(mm.work) /* lower */"; put;
    }
    "cap" {
      clear;
      add "mm.work = strings.Title(strings.ToLower(mm.work)) // capital";
      put;
    }
    "print" { clear; add 'fmt.Printf("%s", mm.work) // print'; put; }
    "state" { clear; add 'mm.printState() // state'; put; }
    "pop" { clear; add "mm.pop();"; put; }
    "push" { clear; add "mm.push();"; put; }
    "unstack" { clear; add "for mm.pop() {} /* unstack */ "; put; }
    "stack" { clear; add "for mm.push() {} /* stack */"; put; }
    "put" { clear; add "mm.tape[mm.cell] = mm.work /* put */"; put; }
    "get" { clear; add "mm.work += mm.tape[mm.cell] /* get */"; put; }
    "swap" {
      clear;
      add "mm.work, mm.tape[mm.cell] = mm.tape[mm.cell], mm.work /* swap */";
      put;
    }
    "++" { clear; add "mm.increment() /* ++ */ \n"; put; }
    "--" { clear; add "if mm.cell > 0 { mm.cell-- } /* -- */"; put; }
    "read" { clear; add "mm.read() /* read */"; put; }
    "count" {
      clear; add "mm.work += strconv.Itoa(mm.counter) /* count */ "; put;
    }
    "a+" { clear; add "mm.counter++ /* a+ */"; put; }
    "a-" { clear; add "mm.counter-- /* a- */"; put; }
    "zero" { clear; add "mm.counter = 0 /* zero */"; put; }
    "chars" {
      clear; add "mm.work += strconv.Itoa(mm.charsRead) /* chars */"; put;
    }
    "lines" {
      clear; add "mm.work += strconv.Itoa(mm.linesRead) /* lines */"; put;
    }
    "nochars" { clear; add "mm.charsRead = 0 /* nochars */"; put; }
    "nolines" { clear; add "mm.linesRead = 0 /* nolines */"; put; }
    # use a labelled loop to quit script.
    "quit" { clear; add "os.Exit(0)"; put; }
    # inline this?
    "write" {
      clear;
      # go syntax
      add "/* write */\n";
      add 'f, err := os.Create("sav.pp")\n';
      add "if err != nil { panic(err) }\n";
      add "defer f.Close()\n";
      add '_, err = f.WriteString(mm.work)\n';
      add "if err != nil { panic(err) }\n";
      add "f.Sync()";
      put;
    }
    "nop" { clear; add "/* nop eliminated */"; put; }
    clear; add "command*"; push; .reparse
  }

  #-----------------------------------------
  # ebnf: commandset := command , command ;
  "command*command*", "commandset*command*" {
    clear; add "commandset*"; push;
    # format the tape attributes.
    # Add the next command on a newline
    --; get; add "\n"; ++; get; --; put; ++; clear; .reparse
  }

  #-------------------
  # here we begin to parse "test*" and "ortestset*" and "andtestset*"
  #

  #-------------------
  # eg: B"abc" {} or E"xyz" {}
  # transform and markup the different test types
  "begintext*,*","endtext*,*","quote*,*","class*,*",
  "eof*,*","tapetest*,*",
  "begintext*.*","endtext*.*","quote*.*","class*.*",
  "eof*.*","tapetest*.*",
  "begintext*{*","endtext*{*","quote*{*","class*{*",
  "eof*{*","tapetest*{*" {
    B"begin" { clear; add "strings.HasPrefix(mm.work, "; get; add ")"; }
    B"end" { clear; add "strings.HasSuffix(mm.work, "; get; add ")"; }
    B"quote" { clear; add "mm.work == "; get; }
    B"class" {
      # go condition syntax
      # use helper function isInClass
      clear; get; add ", mm.work)";
    }
    # clear the tapecell for testeof and testtape because
    # they take no arguments.
    B"eof" { clear; put; add "mm.eof"; }
    B"tapetest" { clear; put; add "mm.work == mm.tape[mm.cell]"; }
    put;
    #*
    # maybe we could elide the not tests by doing here
    B"not" { clear; add "!"; get; put; }
    *#
    clear; add "test*"; push;
    # the trick below pushes the right token back on the stack.
    get; add "*"; push; .reparse
  }

  #-------------------
  # negated tests
  # eg: !B"xyz" {} !(eof) {} !(==) {}
  # !E"xyz" {}
  # !"abc" {}
  # ![a-z] {}
  "notbegintext*,*","notendtext*,*","notquote*,*","notclass*,*",
  "noteof*,*","nottapetest*,*",
  "notbegintext*.*","notendtext*.*","notquote*.*","notclass*.*",
  "noteof*.*","nottapetest*.*",
  "notbegintext*{*","notendtext*{*","notquote*{*","notclass*{*",
  "noteof*{*","nottapetest*{*" {
    B"notbegin" { clear; add "!strings.HasPrefix(mm.work,"; get; add ")"; }
    B"notend" { clear; add "!strings.HasSuffix(mm.work,"; get; add ")"; }
    B"notquote" { clear; add "mm.work != "; get; }
    B"notclass" {
      # produces !isInClass(.. or !isInList(.. or !isInRange(..
      clear; add "!"; get; add ", mm.work)";
    }
    # clear the tapecell for testeof and testtape because
    # they take no arguments.
    B"noteof" { clear; put; add "!mm.eof"; }
    B"nottapetest" { clear; put; add "mm.work != mm.tape[mm.cell]"; }
    put; clear; add "test*"; push;
    # the trick below pushes the right token back on the stack.
    get; add "*"; push; .reparse
  }

#-------------------
# 3 tokens
#-------------------
  pop;

  #-----------------------------
  # some 3 token errors!!!
  # not a comprehensive list
  "{*quote*;*","{*begintext*;*","{*endtext*;*","{*class*;*",
  "commandset*quote*;*", "command*quote*;*" {
    push; push; push;
    add "[pep error]\n invalid syntax near line "; lines;
    add " (char "; chars; add ")";
    add " of script (misplaced semicolon?) \n";
    print; clear; quit;
  }

  # to simplify subsequent tests, transmogrify a single command
  # to a commandset (multiple commands).
  "{*command*}*" {
    clear; add "{*commandset*}*"; push; push; push; .reparse
  }

  # errors! mixing AND and OR concatenation
  ",*andtestset*{*", ".*ortestset*{*" {
    # push the tokens back to make debugging easier
    push; push; push;
    add " error: mixing AND (.) and OR (,) concatenation in \n";
    add " in pep script near line "; lines;
    add " (character "; chars; add ") \n";
    add '
 For example:
   B".".!E"/".[abcd./] { print; }  # Correct!
   B".".!E"/",[abcd./] { print; }  # Error! \n';
    print; clear; quit;
  }

  #--------------------------------------------
  # ebnf: command := keyword , quoted-text , ";" ;
  # format: add "text";
  "word*quote*;*" {
    clear; get;
    "replace" {
      # error
      add "< command requires 2 parameters, not 1 \n";
      add "near line "; lines;
      add " of script. \n";
      print; clear; quit;
    }
    # disable "while " syntax since it is not necessary
    "while", "whilenot" {
      add "[error] while/whilenot should not have quoted \n";
      add "single character argument. Use eg: while [x] instead\n";
      add "near line "; lines;
      add " of script. \n";
      print; clear; quit;
    }
    # check whether argument is single character, otherwise
    # throw an error. Also, convert to single quotes for go
    # which is
    "delim", "escape", "unescape" {
      # This is trickier than I thought it would be.
      clear; ++; get;
      # check that arg not empty, (but an empty quote is ok
      # for the second arg of 'replace'
      '""' {
        clear; add "[pep error] near line "; lines;
        add " (or char "; chars; add "): \n";
        add " command '"; --; get; ++; add "' ";
        add 'cannot have an empty argument ("") \n';
        print; quit;
      }
      # quoted text has the quotes still around it.
      # also handle escape characters like \n \r etc
      # Also, unicode escape sequences like \u0x2222
      clip; clop; clip;
      !"".!B"\\" {
        clear; add "[pep error] Pep script error near line "; lines;
        add " (character "; chars; add "): \n";
        add " command '"; get;
        add "' takes only a single character argument. \n";
        print; quit;
      }
      B"\\" {
        clip;
        !"" {
          clear; add "[pep error] Pep script error near line "; lines;
          add " (character "; chars; add "): \n";
          add " command '"; --; get;
          add "' takes only a single character argument or \n";
          add " and escaped single char eg: \n \t \f etc";
          print; quit;
        }
      }
      # replace double quotes with single for argument
      clear; get; escape "'"; unescape '"'; clip; clop; put;
      clear; add "'"; get; add "'"; put;
      # re-get the command name
      --; clear; get;
    }
    "mark" {
      clear; add "mm.marks[mm.cell] = "; ++; get; --;
      add " /* mark */"; put;
      clear; add "command*"; push; .reparse
    }
    "go" {
      clear; add 'mm.goToMark('; ++; get; --;
      add ') /* go to mark */\n'; put;
      clear; add "command*"; push; .reparse
    }
    "delim" {
      clear;
      # the delimiter should be a single character, no?
      add "mm.delimiter = "; ++; get; --;
      add " /* delim */ "; put;
      clear; add "command*"; push; .reparse
    }
    "add" {
      clear; add "mm.work += "; ++; get; --;
      # handle multiline text check this!
\\n or \n replace "\n" '"\nmm.work += "\\n'; put; clear; add "command*"; push; .reparse } # not used now "while" { clear; add "/* while */\n"; add "for mm.peep == "; ++; get; --; add " {\n"; add " if mm.eof { break }\n mm.read()\n"; add "}"; put; clear; add "command*"; push; .reparse } # not used now "whilenot" { clear; add "/* whilenot */\n"; add "for mm.peep != "; ++; get; --; add " {\n"; add " if mm.eof { break }\n mm.read()\n}"; put; clear; add "command*"; push; .reparse } "until" { clear; add "mm.until("; ++; get; --; # error until cannot have empty argument 'mm.until(""' { clear; add "Pep script error near line "; lines; add " (character "; chars; add "): \n"; add " empty argument for 'until' \n"; add " For example: until '.txt'; until \">\"; # correct until ''; until \"\"; # errors! \n"; print; quit; } # handle multiline argument replace "\n" "\\n"; add ');'; put; clear; add "command*"; push; .reparse } "escape" { clear; ++; # argument still has quotes around it # it should be a single character since this has been previously # checked. add 'mm.work = strings.Replace(mm.work, string('; get; add '), string(mm.escape)+string('; get; add '), -1)'; --; put; clear; add "command*"; push; .reparse } # replace \n with n for example (only 1 character) "unescape" { clear; ++; # use the machine escape char add 'mm.work = strings.Replace(mm.work, string(mm.escape)+string('; get; add '), string('; get; add '), -1)'; --; put; clear; add "command*"; push; .reparse } # error, superfluous argument add ": command does not take an argument \n"; add "near line "; lines; add " of script. 
\n"; print; clear;
    #state
    quit;
  }

  #----------------------------------
  # format: "while [:alpha:] ;" or whilenot [a-z] ;
  "word*class*;*" {
    clear; get;
    "while" {
      clear;
      add "/* while */\n";
      add "for "; ++; get; --;
      add ", string(mm.peep)) {\n";
      add " if mm.eof { break }\n mm.read()\n}";
      put; clear;
      add "command*"; push; .reparse
    }
    "whilenot" {
      clear;
      add "/* whilenot */\n";
      add "for !"; ++; get; --;
      add ", string(mm.peep)) {\n";
      add " if mm.eof { break; }\n";
      add " mm.read()\n}";
      put; clear;
      add "command*"; push; .reparse
    }
    # error
    add " < command cannot have a class argument \n";
    add "line "; lines; add ": error in script \n";
    print; clear; quit;
  }

  # arrange the parse> label loops
  (eof) {
    "commandset*parse>*commandset*","command*parse>*commandset*",
    "commandset*parse>*command*","command*parse>*command*" {
      clear;
      # indent both code blocks
      add " "; get; replace "\n" "\n ";
      # go has labelled loops, but complains if the label
      # is not used. So we have to use the flag technique
      # to make .restart work before, after, or without the parse> label
      replace "continue // restart" "break // restart";
      put; clear;
      ++; ++;
      add " "; get; replace "\n" "\n "; put;
      clear; --; --;
      # add a block so that .reparse works before the parse> label.
      # it appears that go has labelled loops
      add "\n/* lex block */\n";
      add "for true { \n"; get;
      add "\n break \n}\n";
      ++; ++;
      add "if restart { restart = false; continue; }";
      # indent code block
      # add " "; get; replace "\n" "\n "; put; clear;
      # using flag technique
      add "\n// parse block \n";
      add "for true {\n"; get;
      add "\n break \n} // parse\n";
      --; --; put; clear;
      add "commandset*"; push; .reparse
    }
  }

  # -------------------------------
  # 4 tokens
  # -------------------------------
  pop;

  #-------------------------------------
  # bnf: command := replace , quote , quote , ";" ;
  # example: replace "and" "AND" ;
  "word*quote*quote*;*" {
    clear; get;
    # check!
    # go replace syntax
    # not used here
    # match1, err := regexp.MatchString("geeks", str)
    "replace" {
      #---------------------------
      # a command plus 2 arguments, eg replace "this" "that"
      clear;
      add "/* replace */\n";
      # add 'if mm.work != "" { \n';
      add "mm.work = strings.Replace(mm.work, ";
      ++; get; add ", ";
      ++; get; add ", -1)\n";
      --; --; put; clear;
      add "command*"; push; .reparse
    }
    add "Pep script error on line "; lines;
    add " (character "; chars; add "): \n";
    add " command does not take 2 quoted arguments. \n";
    print; quit;
  }

  #-------------------------------------
  # format: begin { #* commands *# }
  # "begin" blocks are only executed once (they
  # are assembled before the "start:" label). They must come before
  # all other commands.
  #
  "begin*{*command*}*", "begin*{*commandset*}*" {
    clear;
    ++; ++; get; --; --; put; clear;
    add "beginblock*"; push; .reparse
  }

  # -------------
  # parses and compiles concatenated tests
  # eg: 'a',B'b',E'c',[def],[:space:],[g-k] { ...
  # these 2 tests should be all that is necessary
  "test*,*ortestset*{*", "test*,*test*{*" {
    clear; get; add " || "; ++; ++; get; --; --; put; clear;
    add "ortestset*{*"; push; push;
    .reparse
  }
  # don't mix AND and OR concatenations

  # -------------
  # AND logic
  # parses and compiles concatenated AND tests
  # eg: 'a',B'b',E'c',[def],[:space:],[g-k] { ...
  # it is possible to elide this block with the negated block
  # for compactness but maybe readability is not as good.
  # negated tests can be chained with non negated tests.
  # eg: B'http' . !E'.txt' { ...
  # }
  "test*.*andtestset*{*", "test*.*test*{*" {
    clear; get; add " && "; ++; ++; get; --; --; put; clear;
    add "andtestset*{*"; push; push;
    .reparse
  }

  #-------------------------------------
  # we should not have to check for the {*command*}* pattern
  # because that has already been transformed to {*commandset*}*
  "test*{*commandset*}*", "andtestset*{*commandset*}*",
  "ortestset*{*commandset*}*" {
    clear;
    # indent the code for readability
    ++; ++; add " "; get; replace "\n" "\n "; put; --; --;
    clear; add "if ("; get; add ") {\n";
    ++; ++; get;
    # block end required
    add "\n}";
    --; --; put; clear;
    add "command*"; push;
    # always reparse/compile
    .reparse
  }

  # -------------
  # multi-token end-of-stream errors
  # not a comprehensive list of errors...
  (eof) {
    E"begintext*",E"endtext*",E"test*",E"ortestset*",E"andtestset*" {
      add " Error near end of script at line "; lines;
      add ". Test with no brace block? \n"; print; clear; quit;
    }
    E"quote*",E"class*",E"word*"{
      put; clear;
      add "Error at end of pep script near line "; lines;
      add ": missing semi-colon? \n";
      add "Parse stack: "; get; add "\n"; print; clear; quit;
    }
    E"{*", E"}*", E";*", E",*", E".*", E"!*", E"B*", E"E*" {
      put; clear;
      add "Error: misplaced terminal character at end of script! (line ";
      lines; add "). \n";
      add "Parse stack: "; get; add "\n"; print; clear; quit;
    }
  }

  # put the 4 (or fewer) tokens back on the stack
  push; push; push; push;

  (eof) {
    print; clear;
    # create the virtual machine object code and save it
    # somewhere on the tape.
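#*
NOTE on unescapeChar in the machine code emitted below. That method is
marked "bug. fix.", presumably because strings.Replace also strips an
escape character that is itself escaped. What follows is only a sketch
of an escape-aware alternative and has not been run through pep.tt.
The free function unescapeRune and its signature are hypothetical and
are not part of the generated machine.

```go
package main

import "fmt"

// unescapeRune removes one escape rune before each occurrence of
// target, unless that escape rune is itself escaped.
// (hypothetical helper, for illustration only)
func unescapeRune(work string, escape rune, target rune) string {
	out := make([]rune, 0, len(work))
	escaped := false // true when the previous rune was an active escape
	for _, r := range work {
		if escaped {
			if r == target {
				out = out[:len(out)-1] // drop the escape before target
			}
			out = append(out, r)
			escaped = false
			continue
		}
		if r == escape {
			escaped = true
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	// an escaped quote loses its backslash
	fmt.Println(unescapeRune(`say \"hi\"`, '\\', '"'))
	// an escaped backslash followed by a quote is left alone
	fmt.Println(unescapeRune(`a\\"b`, '\\', '"'))
}
```

Working on runes rather than bytes keeps multi-byte characters intact,
in the same way as the clip and clop methods of the machine.
*#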
    add '
 // code generated by "translate.go.pss" a pep script
 // http://bumble.sf.net/books/pars/tr/

 // with an import alias such as s "strings", s.HasPrefix can be used
 // instead of strings.HasPrefix
 package main
 import (
   "fmt"
   "bufio"
   "strings"
   "strconv"
   "unicode"
   "io"
   "os"
   "unicode/utf8"
 )

 // an alias for Println for brevity
 var pr = fmt.Println

 /* a machine for parsing */
 type machine struct {
   SIZE int   // how many elements in stack/tape/marks
   eof bool
   charsRead int
   linesRead int
   escape rune
   delimiter rune
   counter int
   work string
   stack []string
   cell int
   tape []string
   marks []string
   peep rune
   reader *bufio.Reader
 }

 // there is no special init for structures
 func newMachine(size int) *machine {
   mm := machine{SIZE: size}
   mm.eof = false      // end of stream reached?
   mm.charsRead = 0    // how many chars already read
   mm.linesRead = 1    // how many lines already read
   mm.escape = \'\\\\\'
   mm.delimiter = \'*\'   // push/pop delimiter (default "*")
   mm.counter = 0      // a counter for anything
   mm.work = ""        // the workspace
   mm.stack = make([]string, 0, mm.SIZE)   // stack for parse tokens
   mm.cell = 0                             // current tape cell
   // slices not arrays
   mm.tape = make([]string, mm.SIZE, mm.SIZE)   // a list of attributes for tokens
   mm.marks = make([]string, mm.SIZE, mm.SIZE)  // marked tape cells
   // or dont initialise peep until "parse()" calls "setInput()"
   // check! this is not so simple
   mm.reader = bufio.NewReader(os.Stdin)
   var err error
   mm.peep, _, err = mm.reader.ReadRune()
   if err == io.EOF {
     mm.eof = true
   } else if err != nil {
     fmt.Fprintln(os.Stderr, "error:", err)
     os.Exit(1)
   }
   return &mm
 }

 // method syntax.
 //   func (v * vertex) abs() float64 { ... }
 // multiline strings are ok ?

 func (mm *machine) setInput(newInput string) {
   print("to be implemented")
 }

 // read one utf8 character from the input stream and
 // update the machine.
 func (mm *machine) read() {
   var err error
   if mm.eof { os.Exit(0) }
   mm.charsRead += 1
   // increment lines
   if mm.peep == \'\\n\' { mm.linesRead += 1 }
   mm.work += string(mm.peep)
   // check!
   mm.peep, _, err = mm.reader.ReadRune()
   if err == io.EOF {
     mm.eof = true
   } else if err != nil {
     fmt.Fprintln(os.Stderr, "error:", err)
     os.Exit(1)
   }
 }

 // remove escape character: trivial method ?
 // check the python code for this, and the c code in machine.interp.c
 // bug. fix.
 func (mm *machine) unescapeChar(c string) {
   // if mm.work = "" { return }
   mm.work = strings.Replace(mm.work, "\\\\"+c, c, -1)
 }

 /* Perl code. Also allows multiple escape chars
    eg: unescape "+-xyz";
   # this walks the string and determines if the given char
   # is already escaped or not
   # eg "ab\cab\\cab\c"
   # allow multiple chars for escape/unescape
   sub unescapeChar {
     my $self = shift;    # the machine
     my $chars = shift;   # list of chars to escape
     my $cc = "";
     my $result = "";
     my $isEscaped = $false;
     foreach $cc (split(//,$self->{"work"})) {
       if (($isEscaped == $false) && ($cc eq $self->{"escape"})) {
         $isEscaped = $true;
       } else { $isEscaped = $false; }
       # remove the last escape character (usually backslash)
       # this allows multiple chars for escaping
       if (($isEscaped == $true) && (index($chars, $cc) != -1)) {
         $result =~ s/.$//s;
       }
       $result .= $cc;
     }
     $self->{"work"} = $result;
   }
 */

 // add escape character : trivial
 func (mm *machine) escapeChar(c string) {
   mm.work = strings.Replace(mm.work, c, "\\\\"+c, -1)
 }

 /** a helper function to count trailing escapes */
 func (mm *machine) countEscapes(suffix string) int {
   count := 0
   ss := ""
   if strings.HasSuffix(mm.work, suffix) {
     ss = strings.TrimSuffix(mm.work, suffix)
   }
   for (strings.HasSuffix(ss, string(mm.escape))) {
     ss = strings.TrimSuffix(ss, string(mm.escape))
     count++
   }
   return count
 }

 // reads the input stream until the workspace ends with the
 // given character or text, ignoring escaped characters
 func (mm *machine) until(suffix string) {
   if mm.eof { return; }
   // read at least one character
   mm.read()
   for true {
     if mm.eof { return; }
     // we need to count the mm.Escape chars preceding suffix
     // if odd, keep reading, if even, stop
     if strings.HasSuffix(mm.work,
         suffix) {
       if (mm.countEscapes(suffix) % 2 == 0) { return }
     }
     mm.read()
   }
 }

 /* increment the tape pointer (command ++) and grow the
    tape and marks arrays if necessary */
 func (mm *machine) increment() {
   mm.cell++
   if mm.cell >= len(mm.tape) {
     mm.tape = append(mm.tape, "")
     mm.marks = append(mm.marks, "")
     mm.SIZE++
   }
 }

 /* pop the last token from the stack into the workspace */
 func (mm *machine) pop() bool {
   if len(mm.stack) == 0 { return false }
   // no, get last element of stack
   // a[len(a)-1]
   mm.work = mm.stack[len(mm.stack)-1] + mm.work
   // a = a[:len(a)-1]
   mm.stack = mm.stack[:len(mm.stack)-1]
   if mm.cell > 0 { mm.cell -= 1 }
   return true
 }

 // push the first token from the workspace to the stack
 func (mm *machine) push() bool {
   // dont increment the tape pointer on an empty push
   if mm.work == "" { return false }
   // push first token, or else whole string if no delimiter
   aa := strings.SplitN(mm.work, string(mm.delimiter), 2)
   if len(aa) == 1 {
     mm.stack = append(mm.stack, mm.work)
     mm.work = ""
   } else {
     mm.stack = append(mm.stack, aa[0]+string(mm.delimiter))
     mm.work = aa[1]
   }
   mm.increment()
   return true
 }

 // print a short summary of the machine state (used for debugging);
 // only the first few tape cells are shown
 func (mm *machine) printState() {
   fmt.Printf("Stack %v Work[%s] Peep[%c] \\n", mm.stack, mm.work, mm.peep)
   fmt.Printf("Acc:%v Esc:%c Delim:%c Chars:%v", mm.counter, mm.escape, mm.delimiter, mm.charsRead)
   fmt.Printf(" Lines:%v Cell:%v EOF:%v \\n", mm.linesRead, mm.cell, mm.eof)
   for ii, vv := range mm.tape {
     fmt.Printf("%v [%s] \\n", ii, vv)
     if ii > 4 { return; }
   }
 }

 func (mm *machine) goToMark(mark string) {
   markFound := false
   for ii := range mm.marks {
     if mm.marks[ii] == mark {
       mm.cell = ii; markFound = true; break
     }
   }
   if markFound == false {
     fmt.Printf("badmark \'%s\'", mark)
     os.Exit(1)
   }
 }

 // this is where the actual parsing/compiling code should go
 // so that it can be used by other go classes/objects. Also
 // should have a stream argument.
 func (mm *machine) parse(s string) {
 }

 /* adapt for clop and clip */
 func trimLastChar(s string) string {
   r, size := utf8.DecodeLastRuneInString(s)
   if r == utf8.RuneError && (size == 0 || size == 1) {
     size = 0
   }
   return s[:len(s)-size]
 }

 func (mm *machine) clip() {
   cc, _ := utf8.DecodeLastRuneInString(mm.work)
   mm.work = strings.TrimSuffix(mm.work, string(cc))
 }

 func (mm *machine) clop() {
   _, size := utf8.DecodeRuneInString(mm.work)
   mm.work = mm.work[size:]
 }

 type fn func(rune) bool   // eg unicode.IsLetter(\'x\')

 /* check whether the string s only contains runes of type
    determined by the typeFn function */
 func isInClass(typeFn fn, s string) bool {
   if s == "" { return false; }
   for _, rr := range s {
     //if !unicode.IsLetter(rr) {
     if !typeFn(rr) { return false }
   }
   return true
 }

 /* range in format \'a,z\' */
 func isInRange(start rune, end rune, s string) bool {
   if s == "" { return false; }
   for _, rr := range s {
     if (rr < start) || (rr > end) { return false }
   }
   return true
 }

 /* list of runes (unicode chars ) */
 func isInList(list string, s string) bool {
   // bug! logic incorrect. should be "onlyContainsAny"
   return strings.ContainsAny(s, list)
 }

 func main() {
   // This size needs to be big for some applications. Eg
   // calculating big palindromes. Really
   // it should be dynamically allocated.
   var size = 30000
   var mm = newMachine(size);
   var restart = false;
   // the go compiler complains when modules are imported but
   // not used, also if vars are not used.
   if restart {}; unicode.IsDigit(\'0\'); strconv.Itoa(0);
 ';

    # save the code in the current tape cell
    put; clear;

    #---------------------
    # check if the script correctly parsed (there should only
    # be one token on the stack, namely "commandset*" or "command*").
    pop; pop;
    "commandset*", "command*" {
      clear;
      # indent generated code for readability.
      add " "; get; replace "\n" "\n "; put; clear;
      # restore the go preamble from the tape
      ++; get; --;
      #add 'script: \n';
      add 'for !mm.eof { \n'; get;
      add "\n }\n";
      add "}\n";
      add "\n\n// end of generated 'go' code\n";
      # put a copy of the final compilation into the tapecell
      # so it can be inspected interactively.
      put; print; clear; quit;
    }
    "beginblock*commandset*", "beginblock*command*" {
      clear;
      # indentation not needed here
      #add ""; get;
      #replace "\n" "\n"; put; clear;
      # indent main code for readability.
      ++; add " "; get; replace "\n" "\n "; put; clear; --;
      # get go preamble (Machine object definition) from tape
      ++; ++; get; --; --;
      get; add "\n"; ++;
      # a labelled loop for "quit" (but quit can just exit?)
      #add "script: \n";
      add "for !mm.eof { \n"; get;
      # end block marker required in 'go'
      add "\n }\n";
      add "}\n";
      add "\n\n// end of generated golang code\n";
      # put a copy of the final compilation into the tapecell
      # for interactive debugging.
      put; print; clear; quit;
    }
    push; push;
    # try to explain some more errors
    unstack;
    B"parse>" {
      put; clear;
      add "[error] pep syntax error:\n";
      add " The parse> label cannot be the 1st item \n";
      add " of a script \n";
      print; quit;
    }
    put; clear;
    add "[error] After compiling with 'translate.go.pss' (at EOF): \n ";
    add " parse error in input script. \n ";
    print; clear;
    unstack; put; clear;
    add "Parse stack: "; get; add "\n";
    add " * debug script ";
    add " >> pep -If script -i 'some input' \n ";
    add " * debug compilation. \n ";
    add " >> pep -Ia asm.pp script \n ";
    print; clear;
    quit;
  } # not eof

  # there is an implicit .restart command here (jump start)
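#*
NOTE on isInList in the generated machine. Its own comment says that the
logic is incorrect and should be "onlyContainsAny": strings.ContainsAny
is true as soon as one rune of s is in the list, while isInClass and
isInRange require every rune to match. A minimal sketch of the stricter
test follows. The name onlyContainsAny comes from that comment, but the
body is an assumption and has not been run through pep.tt.

```go
package main

import (
	"fmt"
	"strings"
)

// onlyContainsAny reports whether s is non-empty and every rune of s
// occurs in list (same empty-string rule as isInClass and isInRange).
func onlyContainsAny(list string, s string) bool {
	if s == "" {
		return false
	}
	for _, rr := range s {
		if !strings.ContainsRune(list, rr) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(onlyContainsAny("abc", "abca")) // every rune is in the list
	fmt.Println(onlyContainsAny("abc", "abx"))  // x is not in the list
}
```

For a single rune, as in the generated "while" loops that pass
string(mm.peep), both versions give the same answer, so only the
class tests on the whole workspace would change behaviour.
*#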