#*

 mark.html.pss

OVERVIEW

  note: check for E"aaa" E"bbb" in compile.pss and throw an error

  This script explores the possibilities of transforming text
  documents in a kind of markdown format into other formats. The
  script parses the document as a hierarchy of elements (in a
  "bottom-up" fashion) rather than just applying regular expressions
  to patterns.

  The trick in writing the grammar for this kind of transformation
  is not to have too many token types, to reduce the number of brace
  blocks and grammar rules required.

MARKDOWNISH DOCUMENT FORMAT

  This section documents (yet another) markdown-style format which I
  personally use. I don't claim this document format is superior to
  any other markdown-style format; it's just that I like it and have
  used it for a long time. No numbers are allowed in section
  headings, basically because the machine doesn't have any regular
  expression matching.

  * An example of the type of document is this file:
  ----
   && document title

   UPPERCASE WORDS
     1st Level Heading
   UPPERCASE WITH FOUR DOTS
     2nd Level Heading
   ** Two Stars
     3rd Level Heading

   * code lines begin with >>
   >>
   Links begin with http:// or https:// or just /
   code blocks are enclosed in ---- ,,, on their own lines
   lines beginning with a star are for emphasis or as a
   description of a following code line (a recipe).

USES

  I tried to make a unix man page from an asciidoc document with a2x
  and it made me go via xsltproc and various other bits of ridiculous
  cruft. What's more, it converted from asciidoc to xml and then to a
  man page and took about 30 seconds for a tiny document. So maybe
  this script can do better than that.

IDEAS

  Use "mark" and "go" to build a table of contents from the headings
  in the first tape cell.

  implemented a "starline*" token.
  also: "nl/starline/nl/codeline/nl/"

  Maybe this is feasible, eg resolve:

  implement images with the same format used by booktolatex.cgi

    emptyline nl lines nl emptyline -> paragraph
    emptyline nl text nl emptyline -> paragraph
    starline codeblock -> titlecodeblock ;

  could also parse quoted-text.

TESTING

  * convert a text document to html and print to stdout
  >> pep -f eg/mark.html.pss pars-book.txt

BUGS

HISTORY

  1 july 2020
    Need to totally rethink and rewrite. Deleting all except
    tokenisation, then build up script with one structure at a time.
    Eliminate all unnecessary tokens. Made progress by incrementally
    adding structures. Added multiline quotes """ ... """ which can be
    used in images [[ ... ]] etc. Made links, and images.

  17 june 2020
    New ideas. "----" doesn't have to start the line but is a word.
    Don't do line by line parsing (except for headings, codelines,
    starlines). Get rid of newline tokens as soon as possible, eg:
    ----
    "nl*text*","nl*word*","nl*file*","nl*link*",
    "nl*heading*","nl*subheading*",
    "nl*codeline*","nl*codeblock*",
    "nl*starline*","nl*[[*" {
      clop; clop; clop; push;
      # workspace should be clear now.
      # transfer value
      add "\n"; get; --; put; ++;
      .reparse
    }
    ,,,,

    Use transmogrification in images [[ ]] to safely get rid of nl*
    newline tokens, eg:
    -----
    "[[*file*", "[[*link*" {
      # turn 'file*' into 'image.file*' and 'link*' into 'image.link*'
      replace "[[*" "[[*image."; push; push; .reparse
    }
    ,,,,

    Now we can safely get rid of some newline tokens in images
    (because newlines are not significant), and also use the new
    tokens to transmogrify captions "..." and location indicators
    >> and << eg
    ----
    "image.file*quoted*","image.link*quoted*" {
      # change quoted into caption
      push; clear; add "caption*"; push; .reparse
    }
    "image.file*nl*","image.link*nl*","caption*nl*" {
      clip; clip; clip; push; .reparse
    }
    ,,,,

  16 june 2020
    Would also like to implement lists.
    In fact the whole "line by line" parsing below is dodgy because it
    interferes with structures which can be multiline, such as images.
    So I will remove the line* token as well.

    Revising again. I think in order to simplify, we can remove the
    "space*" token. All words will be separated by only one space.
    Also make "word*" just "text*" and "uword*" into "utext*".
    Also, need to change [[ >> and ]] parsing (parse char by char, not
    as a word). Rename this to "mark.html.space.pss" and remove space
    tokens. Also, need a better way to get rid of tokens: eg
    ------
    parse>
      pop; pop;
      # check that at least 2 tokens, that last is >> and
      # first is not newline. A ">>" is only significant if it starts the
      # line, so the block below just turns >> into a text* token if it
      # doesn't start the line. Can do the same with *
      # but --- doesn't have to start the line nor does [[ image marker
      # no!!! because >> and << are also the image float indicators
      !">>*".E">>*".!B"nl*" {
        clear; get; add " "; ++; get; put; clear;
        add "text*"; push; .reparse
      }
      !"star*".E"star*".!B"nl*" {
        clear; get; add " "; ++; get; put; clear;
        add "text*"; push; .reparse
      }
    ,,,,

  15 june 2020
    Revising this to remove unnecessary newline "nl*" tokens and to
    try to simplify the logic. Also, will try to methodically view
    different text parsing. We can try, for example
    >> pp -f eg/mark.html.pss -i '"link text" www.google.com'
    as a way to test structures of text and how it is
    parsed/transcribed.

  24 Feb 2020
    Starting to make an image marker eg: [[/images/screenshot.png >>]
    This needs to start the line it is on.

    Revisiting this and doing more work to see if I can markup a
    starline*codeline* token sequence as a table. I don't think that
    all the nl* newline tokens are really necessary, mainly the ones
    that precede other tokens on the stack. eg nl*starline* seems
    unnecessary. We could reduce this to just starline*.
    This kind of parsing and translating seems much more feasible to
    me now, especially making use of the pp -I interactive debugger.
    After all, a big complex sed script is just as confusing for the
    uninitiated.

  14 sept 2019
    Implemented starline for emphasis, but it has problems.

  9 september 2019
    I am still not convinced that this is practical. It may be better
    just to use regular expressions. Doing more work on this. I will
    not try to parse sections and subsections. I will just subsume
    headings into lines and output html. Very basic html output is
    working.

  26 august 2019
    A bit more work. This does not seem easy to do, mainly because of
    newline problems, and also lots of different token types that
    need to be resolved into text. eg link, uword, word, mixword
    quoted text, utext, uword, ...

  23 august 2019
    Started this script. Made quite a bit of progress. It is
    necessary to write a lot of rules, but the coding is
    straightforward and it seems easy to debug. We can adapt this
    script to output different formats. I realised that I would like
    syntax like this (now implemented)

    * combine begin and ends tests into quotesets.
    >> B"http", B"www.", E".txt", E".c" { ... }

*#

  read;

  [\n] {
    put; clear; count;
    # check counter as flag. If set, then don't generate newline
    # tokens.
    "0" { clear; add "nl*"; push; .reparse }
  }

  [\r] { clear; .restart }

  # space includes \n\r so we can't use the [:space:] class
  [ \t] { while [ \t]; clear; .reparse }

  # can't really use ' because then we can't write "can't" for example
  '"' {
    # check for multiline syntax """
    while ["];
    !'"' { put; clear; add "word*"; push; .reparse }
    whilenot ["\n];
    # check for multiple """ for multiline quotes
    (eof) { put; clear; add "text*"; push; .reparse }
    read;
    # one double quote on line.
    [\n] { put; clear; add "text*"; push; .reparse }
    # closing double quote.
    put; clear; add "quoted*"; push; .reparse
  }

  # [[ ]] >> << are parsed as words (space delimited)
  # everything else is a word
  # all the logic in the word* block could just be here.
  !"" {
    whilenot [:space:]; put; clear;
    add "word*"; push; .reparse
  }

# end of the lexing phase of the script
# start of the parse/compile/translate phase

parse>

  # The parse/compile/translate/transform phase involves
  # recognising series of tokens on the stack and "reducing" them
  # according to the required bnf grammar rules.

  #*
  A list of token types:
    codeline text word quoted file >> << [[ ]] link nl
  *#

  #-----------------
  # 1 token
  pop;

  #(eof).!"end*" {
  #}

  "word*" {
    clear; get;
    # no numbers in headings!
    #[A-Z]{ clear; add "uword*"; push; .reparse }
    # the subheading marker
    #"...." { clear; add "4dots*"; push; .reparse }
    # emphasis or explanation line marker
    #"*" { clear; add "star*"; push; .reparse }
    # image markers
    "[[" { add "*"; push; .reparse }
    "]]" { add "*"; push; .reparse }
    # the code line marker, and float right marker
    ">>" {
      # convert to html entities
      clear; add "&gt;&gt; "; put; clear;
      add ">>*"; push; .reparse
    }
    # the float left marker
    "<<" {
      clear; add "&lt;&lt; "; put; clear;
      add "<<*"; push; .reparse
    }
    # multiline quotes
    '"""' {
      clear; until '"""';
      !E'"""' { put; clear; add "text*"; push; .reparse }
      clip; clip; clip; put; clear;
      add "quoted*"; push; .reparse
    }
    # multiline codeblocks start with --- on a newline
    B"---".[-] {
      clear; pop;
      "nl*" {
        clear; until ',,,';
        !E',,,' { put; clear; add "text*"; push; .reparse }
        clip; clip; clip; put; clear;
        # discard extra ,,,,
        while [,]; clear;
        add "codeline*"; push; .reparse
      }
      push; add "word*"; push; .reparse
    }
    # starline starts with a star
    '*' {
      clear; add "⊗ "; put; clear; pop;
      "nl*" {
        clear;
        # clear leading whitespace
        while [ \t]; clear;
        add ""; whilenot [\n]; add ""; put; clear;
        add "emline*"; push; .reparse
      }
      push; add "word*"; push; .reparse
    }
    # the code block begin marker.
    # can't read straight to end marker
    #B"---".[-] { clear; put; add "---*"; push; .reparse }

    B"http://",B"https://",B"www.",B"ftp://",B"sftp://" {
      clear; add "link*"; push; .reparse
    }
    B"/" {
      E"/",E".c",E".txt",E".html",E".pss",E".pp",E".js",E".java",
      E".tcl",E".py",E".pl",E".jpeg",E".jpg",E".png" {
        clear; add "file*"; push; .reparse
      }
    }
    clear; add "word*";
    # leave the wordtoken on the workspace.
  }

  # get rid of insignificant tokens at the end of the document
  "[[*","<<*",">>*","quoted*" { (eof) { clear; add "word*"; } }

  # resolve links at the end of the document
  "link*" {
    (eof) {
      clear; add ""; get; add ""; put; clear;
      add "text*"; push; .reparse
    }
  }

  # resolve file links at the end of the document
  "file*" {
    (eof) {
      clear; add ""; get; add ""; put; clear;
      add "text*"; push; .reparse
    }
  }

  #-----------------
  # 2 tokens
  pop;

  # eliminate insignificant newlines and elide words
  "nl*word*","nl*text*",
  "emline*text*","emline*word*",
  "word*word*","text*word*","text*text*","word*text*",
  "quoted*text*", "quoted*word*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  # elide as text insignificant "]]" image end tokens
  "word*]]*","text*]]*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  # elide multiple newlines
  "nl*nl*" {
    clear; get; ++; get; --; add "
\n"; put; clear;
    add "nl*"; push; .reparse
  }

  # codelines. nl*>>* should not occur in image markup
  "nl*>>*" {
    clear;
    # clear leading whitespace
    while [ \t]; clear; whilenot [\n]; put; clear;
    add "codeline*"; push; .reparse
  }

  # eliminate insignificant newlines at end of document
  "word*nl*","text*nl*" {
    (eof) {
      clear; get; add " "; ++; get; --; put; clear;
      add "text*"; push; .reparse
    }
  }

  # mark this up as a "recipe".
  # sample:
  #   * description
  #   >> sh code.to.exec
  "emline*codeline*" {
    clear;
    add "\n\n"; add "\n
"; get; add "
"; ++; get; 
    add "
\n"; --; put; clear;
    add "text*"; push; .reparse
  }

  "word*codeline*","text*codeline*","quoted*codeline*" {
    clear; get; add " "; add "\n
"; ++; get; 
    add "
\n"; --; put; clear;
    add "text*"; push; .reparse
  }

  # a line of code at the start of the document
  "codeline*" {
    clear; add "
"; get; 
    add "
\n"; put; clear;
    add "text*"; push; .reparse
  }

  # sample: tree www.abc.org (also at the start of document)
  "word*link*","text*link*","nl*link*" {
    clear; get; add " "; add ""; get; --; add ""; put; clear;
    add "text*"; push; .reparse
  }

  # link at the start of document (only 1 token)
  "link*" {
    clear; add ""; get; add ""; put; clear;
    add "text*"; push; .reparse
  }

  # sample: condor /file.txt
  "word*file*","text*file*","nl*file*" {
    clear; get; add " "; add ""; get; --; add ""; put; clear;
    add "text*"; push; .reparse
  }

  # file link at start of document
  "file*" {
    clear; add ""; get; add ""; put; clear;
    add "text*"; push; .reparse
  }

  "quoted*file*","quoted*link*" {
    clear;
    # remove quotes from quoted text
    get; clip; clop; put; clear;
    add ""; get; add ""; put; clear;
    add "text*"; push; .reparse
  }

  # get rid of irrelevant ">>" tokens (ie not in image, nor at
  # start of code line).
  # image format: [[ /file.txt "caption" >> ]]
  E">>*"{
    !B"nl*".!B"quoted*".!B"file*".!B"link*" {
      clear; get; add " "; ++; get; --; put; clear;
      add "text*"; push; .reparse
    }
  }

  # elide insignificant "<<" tokens (ie not in image markup)
  B"<<*".!E"]]*" {
    replace "<<*" "word*"; push; push; .reparse
  }

  # eliminate newlines in image markup
  "[[*nl*" {
    clear; get; ++; get; --; put; clear;
    add "[[*"; push; .reparse
  }

  "nl*]]*" {
    clear; get; ++; get; --; put; clear;
    add "]]*"; push; .reparse
  }

  # get rid of insignificant "[[" image start tokens
  # image format: [[ /file.txt "caption" >> ]]
  B"[[*".!"[[*" {
    !E"file*".!E"link*" {
      clear; get; add " "; ++; get; --; put; clear;
      add "text*"; push; .reparse
    }
  }

  #----------------------
  # 3 tokens
  pop;

  # eliminate newlines within image markup
  # this is important because nl*>>* is considered the
  # start of a "codeline".
  "[[*file*nl*","[[*link*nl*","link*quoted*nl*","file*quoted*nl*" {
    clip; clip; clip; push; push; .reparse
  }

  # simple image format: [[ /path/file.jpg ]]
  "[[*file*]]*","[[*link*]]*" {
    clear; ++; add "\n"; --; put; clear;
    add "text*"; push; .reparse
  }

  # incorrect image format: [[ /path/file.jpg word
  # just becomes text. I probably should hyperlink the links
  # but won't for now.
  "[[*file*word*","[[*link*word*",
  "[[*file*text*","[[*link*text*" {
    clear; get; add " "; ++; get; add " "; ++; get; --; --; put; clear;
    add "text*"; push; .reparse
  }

  #----------------------
  # 4 tokens
  pop;

  # image format with caption: [[ /path/file.jpg "caption" ]]
  "[[*file*quoted*]]*","[[*link*quoted*]]*" {
    clear; add "\n
"; ++; get; add "
\n"; --; --; put; clear;
    add "text*"; push; .reparse
  }

  # image format with float: [[ /path/file.jpg >> ]]
  "[[*file*>>*]]*","[[*link*>>*]]*" {
    clear; add "\n"; --; put; clear;
    add "text*"; push; .reparse
  }

  # image format with float: [[ /path/file.jpg << ]]
  "[[*file*<<*]]*","[[*link*<<*]]*" {
    clear; add "\n"; --; put; clear;
    add "text*"; push; .reparse
  }

  #----------------------
  # 5 tokens
  pop;

  # image format with caption and float: [[ /path/file.jpg "caption" >> ]]
  "[[*file*quoted*>>*]]*","[[*link*quoted*>>*]]*" {
    clear; add "\n
"; ++; get; add "
\n"; --; --; put; clear;
    add "text*"; push; .reparse
  }

  # image format with caption and float: [[ /path/file.jpg "caption" << ]]
  "[[*file*quoted*<<*]]*","[[*link*quoted*<<*]]*" {
    clear; add "\n
"; ++; get; add "
\n"; --; --; put; clear;
    add "text*"; push; .reparse
  }

  push; push; push; push; push;

  (eof) {
    add "\n \n"; print; clear;
    # workspace should be empty
    !"" {
      put; clear; add "\n"; print;
    }
    add "\n"; print; clear;
    pop; pop;
    "word*","text*","link*","file*","quoted*","emline*","nl*" {
      clear; add "\n"; get; add "\n\n";
      add "\n"; print;
    }
  }