#*
  
ABOUT

  This is another version of a script to transform a "plain" (minimal markup)
  text document into simple html and css. This is a rewrite of mark.html.pss
  but so far only has unordered lists. I intend to only add the structures
  which I use frequently. This script is grammatically simpler than mark.html.pss
  and feels easier to maintain and extend. This script has been very
  successful and is used for rendering the [html] at www.nomlang.org


GRAMMAR TOKENS

  space* whitespace and newlines
  word* one space delimited word
  text* any text including html tags as added
  quoted* text between double quotes "like this" but one line only.
  url* anything starting with http:// https:// www. etc
  file* a filename
  item* an item of a list
  endlist* reverse shift reductions of lists

TODO

  put a "title=" attribute in links to give a hint for things like oed:
  lookup.

  think about "quoted". or "quoted", (a quote followed directly by a dot
  or comma) so that these parse properly.

  lists, but especially definition lists.

DONE 

  unordered lists. eg 
   - one
   - two end
     with blank line

  images, sort of (need to make a width for the <figure> tag) 

OTHER FILES

  books/pars/www/blog.sh
    A bash script which contains a set of functions which use this 
    script to manage blog websites.
  books/pars/eg/make.html.header.pss
    Generates the html header and banner for the page (contains css).

NOTES

  The development of this proceeded very quickly. In a few hours I
  had significant syntax implemented. This was much faster than
  mark.html.pss and mark.latex.pss

  This is part of an effort to create a pep/nom based blog with
  rss feed (pars/www/blog.sh) as well as the shrob.org blog
  and the makethespoon.org site.

  
  It would be nice to have an "until 'ab','cd','ef'" syntax
  so that we could parse one line quotes etc. Eg
    >> until '"','\n';
  We can do
    >> whilenot ["\n]
  but that has its own problems.
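
  A rough model, in Python rather than nom, of what the proposed
  multi-delimiter "until" might do: read from the current position up to
  and including whichever delimiter occurs first. The name until_any and
  its return shape are hypothetical; nom has no such command at present.

```python
def until_any(text, start, delims):
    # Hypothetical helper modelling "until 'ab','cd','ef'" on a plain string.
    # Returns (chunk, next_position). The chunk includes the delimiter that
    # stopped the read, in the same way that the single-delimiter "until"
    # in this script leaves the delimiter on the end of the workspace.
    hits = [(text.find(d, start), d) for d in delims]
    hits = [(i, d) for i, d in hits if i != -1]
    if not hits:
        # no delimiter found: read to the end of the input
        return text[start:], len(text)
    i, d = min(hits, key=lambda h: h[0])
    end = i + len(d)
    return text[start:end], end


# an unterminated quote stops at the newline instead of eating the document
chunk, pos = until_any('a "quote that\nnever closes', 3, ['"', '\n'])
print(repr(chunk), pos)
```

  With the newline as a second delimiter, a missing closing quote costs
  at most one line of text.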

MARKUP FORMAT

  See pars/eg/text.tohtml.format.txt for detailed info about 
  the minimal markup format that this script recognises and 
  formats.

  "caption" <image.file.jpg> as an easy image format.
  and <image.file.jpg> as an image with no caption.
  
  Unordered lists only ('-' as first word of each item, ended by a blank line).

  *word* or *some words* emphasis/italic
  # first level heading
  ## 2nd level heading
  ### 3rd level heading
  
  hyperlinks:
    "linktext" http://url.org
    "more text" file://url.org
    "some word" local.file.html 
    "java file" local.file.java
    and other formats too

  >> single code line starting with '>>'
  --- multiline code block ending with ,,, 
  
  Many other formats and code available.
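
  As an illustration of the hyperlink format above, here is a small Python
  model (not the nom code itself) of how "linktext" followed by a url ends
  up as an html anchor. The regular expression and the name render_links
  are assumptions made for this sketch, and only http:// and https:// are
  handled; the script itself knows many more schemas.

```python
import re

# Hypothetical pattern: one-line quoted text, some blanks, then a url.
LINK = re.compile(r'"([^"\n]+)"[ \t]+(https?://\S+)')


def render_links(line):
    # the quoted text becomes the visible link text, the url the href
    return LINK.sub(r"<a href='\2'>\1</a>", line)


print(render_links('see "the nom site" http://www.nomlang.org today'))
```

  Quoted text that is not followed by a url is left alone by this sketch.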

STATUS

  18 mar 2025
    marking up lots of formats, see /eg/text.tohtml.format.html
    Script seems to be working well with links, headings and emphasis.
    Unordered lists seem to work.
    Still missing some lists and all capital heading lines.

HISTORY

  18 march 2025
    Doing some crazy reverse reductions with lists. There is no real
    start-token for an unordered list (just the '-' word starting a
    line, but that is the item* token). So have to reduce the list 
    when I get to the end of it.
  6 march 2025 
    Added rosetta:// schema for rosettacode.org problems.
  24 feb 2025 
    added an urbandict:// schema for word lookups.
    reformed the way oed: and urbandict: links are rendered, by rendering
    them in the parsing phase, not the lexing phase
  21 feb 2025
    added a nomsyn:// url schema for writing about nom syntax
  20 feb 2025
    really struggled with the images. Had to change the width format
    to multiples of 5em, because variable length fields were too 
    hard to extract without regexes
  13 feb 2025
    added attributes to images eg <:o:3:<<:imagename.ext> the 
    parsing is working but have to fix the css and maybe also
    the <figure> <img> tag interaction.
  11 feb 2025
    added forced line breaks >> and horizontal rules
  9 feb 2025
    Added some html curly quotes to quoted text which is not a 
    link. Added ordinal number superscripts in english (which is 
    a bit silly really). Working on images.
  4 Feb 2025
    Began this script, created some useful syntax

*#

 read;
 # newlines and empty lines
 [\n] {
   clear; add "\n"; put;
   # no words on previous line, so this is a blank line
   clear; count; "0" {
     clear; 
     # check here for end of list?
     pop;pop; 
     "item*text*" {
       push; push; add "</li>\n</ul>\n"; put; 
       clear; add "endlist*"; push; .reparse
     }
     push;push; 
     add "<p>\n"; put; clear;
   }
   # set accumulator == 0 so that we can count words 
   # per line (and know which is the first word)
   clear; zero; nochars;
   add "space*"; push; .reparse
 }

 # parse space, but maybe [:space:] would be better.
 [ \t] {
   while [ \t]; 
   clear; add " "; put; clear;
   add "space*"; push; .reparse
 }

 # ignore other types of space
 [:space:] {
   clear; .restart
 }

 # everything else is a word
 ![:space:] {
   # read word and increment word counter
   whilenot [:space:]; a+;
   
   # here parse image files in format <imfile.jpg> before we
   # change > < to entities.
   # --------------
   #B"<" {
   #  E".jpg>",E".jpeg>",E".png>",E".gif>" {
   #    clip; clop; put; clear;
   #    add "imagefile*"; push; .reparse
   #  }
   #}

   # here we build an <img> html tag from minimal and optional markup
   # attributes are <:corners:float:width:filename.ext>
   # this code is quite tricky. See also 
   #   pars/eg/imagetext.tohtml.pss 
   B"<".E">".!"<>" {
     E".png>",E".jpg>",E".jpeg>",E".bmp>",E".gif>" { 
       # an example image text format may be 
       # <:o:4:>>:/image/name.gif> or <name.gif> 
       # The order of the attributes is important but the attributes 
       # are optional eg: <:<<:r:20pt:name.jpg> wont work because the 
       # float attribute '<<' comes before the rounded corner attribute 'r'
       clip; clop; put; clear;
       add "<img style='"; swap; 
       # we use swap to juggle the built html and the original
       # minimal markup text.
       # :o: (or :O:) is the circle image (avatar) indicator,

       # allow the first colon to be missing
       B":O:",B":o:",B"O:",B"o:" {
         swap; add "border-radius:50%;";
         swap; B":" { clop; } clop;
       }
       # small rounded corners on the image
       B":r:",B"r:" {
         swap; add "border-radius:5%;";
         swap; B":" { clop; } clop;
       }
       # large rounded corners 
       B":R:",B"R:" {
         swap; add "border-radius:15%;";
         swap; B":" { clop; } clop;
       }

       # width spec multiple of 5em, allow missing 1st colon
       B":1:",B":2:",B":3:",B":4:",B":5:",
       B"1:",B"2:",B"3:",B"4:",B"5:" {
         B":" { clop; }
         B"1:" { swap; add "width:5em;"; }
         B"2:" { swap; add "width:10em;"; }
         B"3:" { swap; add "width:15em;"; }
         B"4:" { swap; add "width:20em;"; }
         B"5:" { swap; add "width:25em;"; }
         swap; clop;
       }
       # add a default width
       swap; !E"em;" { add "width:10em;"; }
       # finish off the style attribute
       add "' "; swap;
       # the float right indicator, it needs to come after :o:
       B":>>:",B">>:" {
         swap; add "class='float-right' "; 
         swap; B":" { clop; } clop; clop;
       }
       # float left 
       B":<<:",B"<<:" {
         swap; add "class='float-left' "; 
         swap; B":" { clop; } clop; clop;
       }
       # centre indicator  
       B":cc:" {
         swap; add "class='center' "; 
         swap; clop; clop; clop;
       }

       B":" { clop; }
       # build the html image src= attribute. 
       swap; add " src='"; get; add "'/>"; 
       swap; clear;

       clear;
       add "imagefile*"; push; .reparse
     }
   }

   # make < and > html entities because they will wreck our page
   # but not if is >> as 1st word
   !">>" { replace ">" "&gt;"; replace "<" "&lt;"; }
   # some curly quotes, why not? A half hearted attempt for english


   # insert some apostrophes
   "doesnt","isnt","cant","arent","couldnt","didnt","hasnt",
   "havent","shouldnt","mustnt","wasnt" {
     replace "nt" "n't";
   }
   "lets","thats","whats" {
     replace "ts" "t's";
   }

   "I'm","you're","he's","she's","it's","we're","they're","aren't",
   "can't","couldn't","didn't","doesn't","hadn't","hasn't","haven't",
   "isn't","mightn't","mustn't","oughtn't","shouldn't","wasn't",
   "weren't","won't","wouldn't","I've","you've","he's","she's","it's",
   "we've","they've","I'd","you'd","he'd","she'd","it'd","we'd",
   "they'd","I'll","you'll","he'll","she'll","it'll","we'll","they'll",
   "there's","that's","what's","who's","where's","when's","why's",
   "how's" {
     replace "'" "&rsquo;";
   }

   # some common typos for apostrophe contractions in english
   "wouldnt","shouldnt","wont","dont","Im","theyre","wasnt","werent",
   "arent","cant","didnt","doesnt","havent","hasnt","isnt","couldnt" {
     replace "Im" "I&rsquo;m"; replace "nt" "n&rsquo;t"; 
   }
   # now do t's english typos or fast typing. I know, very english centric
   # but I write in english, so there
   "thats","whats","its" {
     replace "ts" "t&rsquo;s"; 
   }

   put;
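   # ordinals in english, very perfunctory but sort of fun. 
   # eg: 1st, 2nd, 303rd
   # The tape cell still holds the whole word (see 'put' above), so each
   # block ends with "clear; get;" to restore the word if the check fails.
   [0123456789stndrdth] {
     E"1st" {
       # check matches [0-9]*1st
       clip; clip; clip; "",[0-9] {
         clear; get; clip; clip;
         add "<sup>st</sup>"; put;
       }
       clear; get;
     } 
     E"2nd" {
       # check matches [0-9]*2nd
       clip; clip; clip; "",[0-9] {
         clear; get; clip; clip;
         add "<sup>nd</sup>"; put;
       }
       clear; get;
     }
     E"3rd" {
       # check matches [0-9]*3rd
       clip; clip; clip; "",[0-9] {
         clear; get; clip; clip;
         add "<sup>rd</sup>"; put;
       }
       clear; get;
     }
     E"th" {
       # check matches [0-9]*[04-9]th, otherwise leave the word alone
       clip; clip; !"".!E"1".!E"2".!E"3".[0-9] {
         add "<sup>th</sup>"; put;
       }
       clear; get;
     }
   }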

   put; clear; count;
   # deal with ">>" when not first word
   !"1" { 
     clear; get; 
     ">>" { clear; add "&gt;&gt;"; put; }
     clear; count;
   }
   # check if this is the first word on the line
   # because several markup elements (as in markdown) need to be
   # the 1st word to be significant.
   "1" {
     clear; get;

     # a one line comment, just ignored at the moment.
     "#:" { 
       clear; whilenot [\n]; clear; .reparse
     }

     "-" { 
       clear; add "</li>\n<li>"; put; clear;
       add "item*"; push; .reparse
     }

     # asterisk as first word on line marks the description of 
     # a code line or block which follows (like a caption)
     # format this later in 2 token parsing. 
     # starlines are used as captions for code and also citations
     # for quotations.
     "*" { 
       clear; whilenot [\n]; 
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear;
       add "starline*"; push; .reparse
     }
     # document or page title  
     "&&" { 
       clear; whilenot [\n]; 
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear;
       add "<!-- ------------ page title -------------------- -->\n";
       add "<h1 class='page-title'>"; get; add "</h1>\n"; put; clear;
       add "text*"; push; .reparse
     }
     # markdown style headings. I would prefer to use one # as 
     # a comment.
     "#" { 
       clear; whilenot [\n]; 
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear;
       add "<!-- ------------------------------- -->\n";
       add "<h1>"; get; add "</h1>\n"; put; clear;
       add "text*"; push; .reparse
     }
     # headings to capital case
     "##" { 
       clear; whilenot [\n]; cap;
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear;
       add "<!-- ------------------------------- -->\n";
       add "<h2>"; get; add "</h2>\n"; put; clear;
       add "text*"; push; .reparse
     }
     "###" { 
       clear; whilenot [\n]; cap;
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear;
       add "<!-- ------------------------------- -->\n";
       add "<h3>"; get; add "</h3>\n"; put; clear;
       add "text*"; push; .reparse
     }

     # one line of code etc
     ">>" { 
       clear; whilenot [\n]; 
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear;
       add "<pre class='codeline'>\n"; get; 
       add "\n</pre>\n"; put; clear;
       add "codeline*"; push; .reparse
     }

     # horizontal rules >--------  (> is already &gt;)
     B"&gt;---" {
       # ensure matches regex ">[-]{3,}"
       clop;clop;clop;clop; [-] {
         clear; add "<hr/>\n"; put; clear;
         add "text*"; push; .reparse
       }
     }

     # codeblocks begin with --- or ---- etc
     B"---".[-] {
       clear; until ",,,"; clip; clip; clip;
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear; add "<pre class='codeblock'>\n"; get; add "</pre>\n"; 
       put; clear; while [,]; clear;
       add "codeblock*"; push; .reparse
     }

     # multiline quotes, start and end with 3 quotes """. Starting """ 
     # must be first on line. The only problem is that they can chew up the 
     # whole doc. This may be rendered with a big curly quote at the 
     # beginning. If this is preceded by a star line, then that is the 
     # author of the quotation. The html <blockquote> will be added later
     # during 2 grammar token parsing.
     B'"""' {
       clop; clop; clop; until '"""'; clip; clip; clip;
       replace ">" "&gt;"; replace "<" "&lt;";
       put; clear; 
       add "blockquote*"; push; .reparse
     }

   }
   clear; get;

   # force a line break with '>>' (but not first word on line), 
   # could be a way to imitate lists
   "&gt;&gt;" {
     # clear after 'put' so that only the "text*" token is pushed
     clear; add "<br/>\n"; put; clear;
     add "text*"; push; .reparse
   }

   # just some chess stuff, why not
   "[chess:king]","[chess:queen]","[chess:rook]" {
     clip; clop; replace "chess:" "";
     replace "king" "♚"; replace "queen" "♛"; replace "rook" "♜";
     put; clear;
     add "<big><abbr class='chess-piece' \n";
     add "      title='chess'>"; get; add "</abbr></big>";
     put; clear; add "text*"; push; .reparse 
   }

   # todo: some "etymological word": for example 
   # [heuristic] -- insert the greek derivation and the explanation
   #  eg: from 'today' that which we learn each day
   # [idiot] from greek 'private' 
   #  these words would be good as "footnotes" maybe 

   "[cryptic]","[heuristic]","[idiot]","[epistemology]" {
     clip; clop; put; clear;
     add "<abbr class='etymology' \n";
     add "      title='##'>"; get; add "</abbr>";
     swap;
     "cryptic" { swap; replace "##" "hidden: from latin 'cryptus'"; }
     "heuristic" { swap; replace "##" "cotidian: from greek 'day/today'"; }
     "epistemology" { swap; replace "##" "from greek 'knowledge'"; }
     "idiot" { swap; replace "##" "private: from greek 'idiot'"; }
     put; clear; add "text*"; push; .reparse 
   }

   # sort of tech terms that aren't acronyms...
   #  do <stdin> <stdout> ?
   "[stdin]","[stdout]","[malloc]","[realloc]" {
     clip; clop; put; clear;
     add "\n";
     add "<abbr class='tek-acronym' \n";
     add "      title='##'>"; get; add "</abbr>";
     swap; 
     # put < > around these terms 
     "stdin" { swap; replace "##" "standard input stream"; }
     "stdout" { swap; replace "##" "standard output stream"; }
     "malloc" { swap; replace "##" "c memory allocation torture"; }
     "realloc" { swap; replace "##" "more c memory torture"; }
     put; clear; add "text*"; push; .reparse 
   }

   # some explanatory "titles" (tooltips) for acronyms ?
   # a less verbose way, also do <stdin> <stdout> ?
   "[html]","[csv]","[pep]","[ast]","[bnf]","[ebnf]","[xbnf]",
   "[nom]","[json]","[xml]","[http]","[gnu]",
   "[rfc]", "[faq]","[man]","[awk]","[sed]","[grep]",
   "[groff]","[eqn]","[lisp]","[latex]",
   "[forth]","[unix]","[linux]","[minix]","[vim]","[java]","[c++]","[python]",
   "[lua]","[wren]","[logo]","[go]","[dart]","[rust]","[tcl]",
   "[antlr]","[gcc]","[tcc]","[utf8]","[utf16]","[unicode]",
   "[eof]","[bash]","[markdown]" {
     clip; clop; 
     upper; put; clear;
     add "\n";
     add "<abbr class='tek-acronym' \n";
     add "      title='##'>"; get; add "</abbr>";
     # a silly attempt at the LATEX logo
     replace "LATEX" "L<sup><small>A</small></sup>T<sub>E</sub>X";
     swap; 
     "HTML" { swap; replace "##" "Hyper-text Markup Language"; }
     "CSV" { swap; replace "##" "Comma Separated Values"; }
     "PEP" { swap; replace "##" "Parsing Engine for Patterns"; }
     "AST" { swap; replace "##" "Abstract Syntax Tree"; }
     "BNF" { swap; replace "##" "Backus-Naur Form"; }
     "EBNF" { swap; replace "##" "Extended Backus-Naur Form"; }
     "XBNF" { swap; replace "##" "Any random BNF format"; }
     "NOM" { swap; replace "##" "Nom Parsing Language"; }
     "JSON" { swap; replace "##" "JavaScript Object Notation"; }
     "XML" { swap; replace "##" "Extensible Markup Language"; }
     "HTTP" { swap; replace "##" "Hypertext Transfer Protocol"; }
     "GNU" { swap; replace "##" "Gnu is not Unix, silly acronym."; }
     "RFC" { swap; replace "##" "Request For Comments"; }
     "FAQ" { swap; replace "##" "Frequently Asked Questions"; }
     "MAN" { swap; replace "##" "Unix Manual (Doc) Pages"; }
     "AWK" { swap; replace "##" "AWK Programming language"; }
     "SED" { swap; replace "##" "Text Stream Editor"; }
     "GREP" { swap; replace "##" "Search Text Files: g/regex/p"; }
     "GROFF" { swap; replace "##" "Old unix typesetting system"; }
     "EQN" { swap; replace "##" "Old unix formula typesetting system"; }
     "LISP" { swap; replace "##" "List Processing Language"; }
     "LATEX" { 
       # a silly attempt at the LATEX logo
       replace "A" "<sup><small>A</small></sup>";
       replace "E" "<sub>E</sub>";
       swap; replace "##" "The LaTeX text processing system"; 
     }
     "FORTH" { swap; replace "##" "The Incomparable Forth 'Language'"; }
     "UNIX" { swap; replace "##" "The Unix Operating System"; }
     "LINUX" { swap; replace "##" "The successor to minix"; }
     "MINIX" { swap; replace "##" "The Minix Minimal Unix Operating System"; }
     "VIM" { swap; replace "##" "Vi Improved Text Editor"; }
     "JAVA" { swap; replace "##" "Java Programming Language"; }
     "C++" { swap; replace "##" "Object-Oriented C"; }
     "LUA" { swap; replace "##" "An embeddable script language"; }
     "WREN" { swap; replace "##" "R. Nystrom's language"; }
     "PYTHON" { swap; replace "##" "A strangely popular indent language"; }
     "LOGO" { swap; replace "##" "The turtle drawing language"; }
     "GO" { swap; replace "##" "Google's C Language Replacement"; }
     "RUST" { swap; replace "##" "The Rust System Language (c-ish)"; }
     "DART" { swap; replace "##" "Google's Application Language"; }
     "TCL" { swap; replace "##" "Tool Command Language"; }
     "ANTLR" { swap; replace "##" "Another Tool for Language Recognition"; }
     "GCC" { swap; replace "##" "The Gnu C Compiler"; }
     "TCC" { swap; replace "##" "Bellard's Tiny C Compiler"; }
     "UTF8" { swap; replace "##" "Unicode Transformation Format, 8-bit"; }
     "UTF16" { swap; replace "##" "Unicode Transformation Format, 16-bit"; }
     "UNICODE" { swap; replace "##" "The Universal Language Code"; }
     "EOF" { swap; replace "##" "End-Of-File (input-stream)"; }
     "BASH" { swap; replace "##" "Unix [B]ourne [A]gain [Sh]ell"; }
     "MARKDOWN" { swap; replace "##" "non-distracting text documents"; }

     put; clear; add "text*"; push; .reparse 
   }

   # urls, we need to add html formatting later because of the
   # "text" http://dada.org syntax There are a lot of "fake" schemas 
   # here for convenience.
   B"rosetta:",B"urbandict:",B"oed:",B"wp:",B"nom:",B"nomsyn:",
   B"nomsf:",B"pep:",B"http://",B"https://",B"nntp://",B"file://",B"www." {
     !"rosetta:".!"urbandict:".!"oed:".!"wp:".!"nom:".
     !"nomsyn:".!"nomsf:".!"pep:".!"http://".!"https://".
     !"nntp://".!"file://".!"www." {
       B"file://" {
         replace "file://" ""; put; 
       }
       # make the fake schema wp:// or wp: into wikipedia links. The text
       # after wp:// should just be the wikipedia page name.

       # It would be better to parse this in the E"url*".!"url*" block so that 
       # we can make a nice visible link text for the wikipedia page.
       # ie. do the same as the nom:// fake url
       # ie. do the same as the nom:// fake url
       B"wp:" {
         clop; clop; clop; B"//" { clop; clop; }
         # I dont like writing underscores
         replace "." "_"; put; clear;
         add "https://en.wikipedia.org/wiki/"; get; put; 
       }

       # schema for oed eg oed:// with search
       # oxford english dictionary
       B"oed:" {
         # allow trailing dot or comma
         E".",E"," { clip; } 
         !B"oed://" { replace "oed:" "oed://"; }
         put;
       }

       #add "https://www.oed.com/search/dictionary/?scope=Entries&q=";

       # schema for the urban dictionary, just because it can be fun,
       # and anyway, nom is a language thing, and we like language.
       # this should be parsed in quoted*url* etc
       B"urbandict:" {
         # allow trailing dot or comma
         E".",E"," { clip; } 
         !B"urbandict://" { replace "urbandict:" "urbandict://"; }
         put;  
       }

       # rosettacode.org problems
       B"rosetta:" {
         # allow trailing dot or comma
         E".",E"," { clip; } 
         !B"rosetta://" { replace "rosetta:" "rosetta://"; }
         replace "." "_"; put;  
       }

       # add "https://www.urbandictionary.com/define.php?term=";
       # this is just a convenience so I dont have to type out the url
       # to the pep/nom sourceforge site everytime
       B"nomsf:",B"nomsf://" {
         E".",E"," { clip; }  # allow trailing ./,
         replace "nomsf:" ""; B"//" { clop; clop; }
         put; clear; add "https://bumble.sf.net/books/pars/";
         get; put;
       }

       # convenience schema, this time for nom language commands 
       # eg: push pop get put
       B"nom:" {
         # allow trailing dot or comma
         E".",E"," { clip; } 
         # add the url later, much easier.
         !B"nom://" { replace "nom:" "nom://"; }
         put;  
       }
       # another convenience schema, nom syntax documentation
       # eg: blocks, tests, parselabel 
       B"nomsyn:" {
         # add the url later, much easier.
         !B"nomsyn://" { replace "nomsyn:" "nomsyn://"; }
         put;  
       }

       # pep virtual machine structure eg: stack, tape, peep 
       B"pep:" {
         E".",E"," { clip; }  # allow trailing ./,
         !B"pep://" { replace "pep:" "pep://"; }
         put; clear; 
       }

       # add a schema to www. urls
       B"www." { clear; add "http://"; get; put; }
       clear; add "url*"; push; .reparse
     }
   }

   # a fake uri schema syntax eg google:"pratt parsers"
   # --> https://www.google.com/search?q=pratt+parsers
   # this is separate to the code above because it has to read ahead
   # in the input stream
   B"google:",B"google://" {
      replace "google://" "";
      replace "google:" ""; 
      # read until next " or newline
      B'"' {
        clop; whilenot [\n"]; 
        #replace ">" "&gt;"; replace "<" "&lt;";
        replace " " "+"; put; clear;
        add "https://www.google.com/search?q="; get;
        put; clear;
        !(eof) { read; [\n] { zero; nochars; } }
        clear;
        add "url*"; push; .reparse
      }
   }

   # local files with no schema, imagefile tokens have already been parsed
   E".h",E".c",E".a",E".txt",E".doc",E".py",E".rb",E".rs",E".java",E".class",
   E".tcl",E".tk",E".sw",E".js",E".go",E".pp",E".pss",E".cpp",E".pl",
   E".html",E".pdf",E".tex",E".sh",E".css",E".out",E".log",
   E".png",E".jpg",E".jpeg",E".bmp",
   E".mp3",E".wav",E".aux",
   E".tar",E".gz",E"/" {
     # not very elegant all this. maybe an ee test would be good 
     # (ends with but not equal to) or change the delim to . and push
     # The tests are joined with '.' (and) because an 'or' of negations
     # is always true.
     !".h".!".c".!".a".!".txt".!".doc".!".py".!".rb".!".rs".!".java".!".class".
     !".tcl".!".tk".!".sw".!".js".!".go".!".pp".!".pss".!".cpp".!".pl".
     !".html".!".pdf".!".tex".!".sh".!".css".!".out".!".log".
     !".png".!".jpg".!".jpeg".!".bmp".
     !".mp3".!".wav".!".aux".
     !".tar".!".gz".!"/" {
       !B"http://".!B"https://".!B"nntp://".!B"file://".!B"www." {
         clear; add "file*"; push; .reparse
       }
     }
   }
   # quoted text between "and and", maximum one line 
   B'"'.!'"'.!E'"' {
     clop; whilenot [\n"]; 
     replace ">" "&gt;"; replace "<" "&lt;";
     put; clear;
     # The code below is not great, but is required because we
     # dont have "until 'ab','cd','ef'" syntax. ie multiple end delimiters
     # all this is to prevent multiline quotes (which could eat up the 
     # whole document.
     !(eof) { read; [\n] { zero; nochars; } }
     clear;
     add "quoted*"; push; .reparse
   }

   # single quoted word, multiline quotes (blockquotes) may begin with
   # """
   B'"'.!'"'.!'""'.!'"""'.E'"' {
     clip; clop; 
     replace ">" "&gt;"; replace "<" "&lt;";
     put; clear;
     add "quoted*"; push; .reparse
   }

   # single bold emphasised word eg: **strong**
   B'**'.!'**'.!'****'.!"***".E'**' {
     clip; clip; clop; clop;
     replace ">" "&gt;"; replace "<" "&lt;";
     put; clear;
     add "<strong><em>"; get; add "</em></strong>\n"; put; clear;
     add "text*"; push; .reparse
   }

   # bold emphasised text between **double asterisks**
   # single line maximum, multiple words
   B"**" {
     clop; clop; whilenot [\n*]; 
     # find the next * if it's there. This is clumsy code because we
     # can't say "until '**','\n';" which would be better.
     # Actually this code accepts ** text* with only one terminating 
     # asterisk, but it's not important. It's a text format...

     replace ">" "&gt;"; replace "<" "&lt;";
     put; clear;
     add "<strong><em>"; get; add "</em></strong>\n"; put; clear;
     # If there is some emphasised text immediately on the next line
     # this will not be good, but we aren't flying an aeroplane.
     while [*]; clear;
     add "text*"; push; .reparse
   }

   # emphasised italic text between *two asterisks*
   # single line maximum, multiple words
   B"*".!"*".!E"*" {
     clop; whilenot [\n*]; 
     replace ">" "&gt;"; replace "<" "&lt;";
     put; clear;
     add "<em>"; get; add "</em>"; put; clear;
     # could i just use "while [*];" here?
     !(eof) { read; [\n] { zero; nochars; } }
     clear;
     add "text*"; push; .reparse
   }

   # single emphasised word, no special grammar token needed.
   B'*'.!'*'.!'**'.E'*' {
     clip; clop; 
     replace ">" "&gt;"; replace "<" "&lt;";
     put; clear;
     add "<em>"; get; add "</em>"; put; clear;
     add "text*"; push; .reparse
   }
   clear; add "word*"; push;

 }

 !"" {
   clear;
   # just delete weird characters, we don't need them.
   # but probably should investigate further
   #*
   add "! An unexpected character '"; get; add "'";
   add "  in text input was encountered at \n";
   add "  line "; lines; add " char "; chars; add "\n";
   add "  Check the 'lexical parsing' phase of the script \n";
   add "    pars/eg/text.tohtml.pss ";
   add "  This is the section of the script above the parse> label \n";
   print; quit;
   *#
 }

parse>

 # for debugging, print the line, char and stack as an html comment.
   #add "<!-- line "; lines; add " char "; chars; add ": "; print; clear; 
   #unstack; print; stack; add " -->\n"; print; clear;

 # -----------------
 # 2 tokens parse reductions
 pop; pop;

 # a list at the end of the document, with no blankline to
 # terminate it.
 (eof) {
   "item*text*" {
     push; push; add "</li>\n</ul>\n"; put; 
     clear; add "endlist*"; push; .reparse
   }
 }

 # starline*codeline* or starline*codeblock* is significant
 "starline*space*" {
    # dont really need this space
    clear; get; ++; get; --; put; clear;
    add "starline*"; push; .reparse
 }

 # This is another use for starline, as the "citation" or author
 # for a multiline quote ("""..."""). This has to go above here because
 # starlines are about to disappear
 E"blockquote*".!"blockquote*" {
   B"starline*" { 
     clear; 
     add "<blockquote class='quotation'>\n"; ++; get; --; 
     add "<cite>"; get; add "</cite>\n";
     add "</blockquote>\n"; 
     put; clear; add "text*"; push; .reparse
   }
   # blockquote with no citation, treat the unknown 1st token as 
   # text.
   clear; get;
   add "\n<blockquote class='quotation'>\n"; ++; get; --; 
   add "</blockquote>\n";
   put; clear; add "text*"; push; .reparse
 }

 # a caption followed by some code
 B"starline*".!"starline*" {
   E"codeline*",E"codeblock*" { 
     clear; 
     add "<figure class='code-with-caption'>\n";
     add "<figcaption class='code-caption'>\n";
     get; add "</figcaption>\n"; ++; get; --; add "</figure>";
     put; clear; add "text*"; push; .reparse
   }
   # format star-lines, then reduce to text (token no longer needed)
   replace "starline*" "text*"; push; push; 
   # state;
   --; --; add "<em class='starline'>\n"; get; add "\n</em>\n"; put; clear;
   # dont need to transfer attribute
   ++; ++; .reparse
 }

 # format and reduce image files, but the <img> tag has already
 # been built above so we can just add the figure and caption if 
 # required
 E"imagefile*".!"imagefile*" {
    # "caption text" <image.ext> syntax
   B"quoted*" { 
     # the problem here is that <figure> needs to set the width 
     # and alignment not the image, but the <img> tag has already
     # been built. Maybe we can just accept that captioned images
     # are not going to look much good.
     clear; ++; get; 
     # check if image is floating left or right
     # a hack to get around no "contains" test eg C"float-right"
     replace "float-right" "";   
     !(==) {
       clear; add "\n<figure class='float-right'>\n  ";
     }
     (==) {
       clear; add "\n<figure class='float-left'>\n  ";
     }
     get; --;
     add "\n  <figcaption class='image-caption'>\n  "; get;  
     add "\n  </figcaption>"; 
     add "\n</figure>\n";
     put; clear; add "text*"; push; .reparse
   }
   # image with no caption 
   clear; get; ++; get; --; add "\n";
   put; clear; add "text*"; push; .reparse
 }

 # format and reduce urls
 E"url*" {
   # "link text" http://abc syntax
   B"quoted*" { 
     clear; ++; get; --;
     # deal with nom schema, nom: has been normalised to nom://
     # nom command filenames in format nom.<command>.txt

     B"oed://" {
       replace "oed://" ""; ++; put; clear;
       add "<a href='https://www.oed.com/search/dictionary/?scope=Entries&q=";
       get; --; add "' title='oxford english dictionary'>"; get; add "</a>";
       put; clear; add "text*"; push; .reparse
     }

     B"urbandict://" {
       replace "urbandict://" ""; ++; put; clear;
       add "<a href='https://www.urbandictionary.com/define.php?term=";
       get; --; add "' title='urban dictionary'>"; get; add "</a>";
       put; clear; add "text*"; push; .reparse
     }

     # the rosettacode.org problems 
     # eg https://rosettacode.org/wiki/Balanced_brackets
     B"rosetta://" {
       replace "rosetta://" ""; ++; put; clear;
       add "<a href='https://rosettacode.org/wiki/";
       get; --; add "' title='rosetta-code problem'>"; get; add "</a>";
       put; clear; add "text*"; push; .reparse
     }

     B"nom://" {
       replace "nom://" ""; ++; put; clear;
       #todo put a title='nom stack command' here
       add "<a href='http://nomlang.org/doc/commands/nom."; 
       get; --; add ".html'>"; get; add "</a>";
       put; clear; add "text*"; push; .reparse
     }

     B"nomsyn://" {
       replace "nomsyn://" ""; ++; put; clear;
       add "<a href='http://nomlang.org/doc/syntax/nom.syntax."; 
       get; --; add ".html'>"; get; add "</a>";
       put; clear; add "text*"; push; .reparse
     }

     # pep machine filenames in format pep.<part>.txt
     B"pep://" {
       replace "pep://" ""; ++; put; clear;
       add "<a href='http://nomlang.org/doc/machine/pep."; 
       get; --; add ".html'>"; get; add "</a>";
       put; clear; add "text*"; push; .reparse
     }
     # other "quote" url:// formats
     clear; 
     add "<a href='"; ++; get; --; add "'>"; get; add "</a>";
     put; clear; add "text*"; push; .reparse
   }
   # plain url link, add html link to text
   clear; ++; get; --;

   # make a nice link to the OED
   B"oed://" {
     replace "oed://" ""; 
     ++; put; --; clear;
     get; add "<a href='https://www.oed.com/search/dictionary/?scope=Entries&q=";
     ++; get; add "' title='oxford english dictionary'>"; get; add "</a>"; 
     --; put; clear; add "text*"; push; .reparse
   }

   # todo: what about terms with multiple words?
   B"urbandict://" {
     replace "urbandict://" ""; 
     ++; put; --; clear;
     get; add "<a href='https://www.urbandictionary.com/define.php?term=";
     ++; get; add "' title='urban dictionary'>"; get; add "</a>"; 
     --; put; clear; add "text*"; push; .reparse
   }

   B"rosetta://" {
     replace "rosetta://" ""; 
     ++; put; --; clear;
     get; add "<a href='https://rosettacode.org/wiki/";
     ++; get; add "' title='rosetta-code problem'>"; get; add "</a>"; 
     --; put; clear; add "text*"; push; .reparse
   }

   B"nom://" {
     # is nom://-- valid? yes, and also nom://minusminus 
     # mark this up as <code> because a command name is code.
     replace "nom://" ""; 
     ++; put; --; clear; get; 
     add "<code class='nom-command'>";
     add "<a href='http://nomlang.org/doc/commands/nom."; 
     ++; get; add ".html'>"; 
     # allow nom://++ and nom://a+ syntax etc
     replace "++.html" "plusplus.html";
     replace "--.html" "minusminus.html";
     replace "a+.html" "aplus.html";
     replace "a-.html" "aminus.html";
     get; add "</a></code>"; 
     # allow nom://plusplus syntax etc (make visible link correct)
     replace "html'>plusplus" "html'>++";
     replace "html'>minusminus" "html'>--";
     replace "html'>aplus" "html'>a+";
     replace "html'>aminus" "html'>a-";
     replace "html'>reparse" "html'>.reparse";
     replace "html'>restart" "html'>.restart";
     --; put; clear; add "text*"; push; .reparse
   }
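
   # A worked example of the nom:// rule above. This is a sketch traced
   # by hand from the replace commands, not captured output, and it
   # assumes the url* attribute arrives here unchanged:
   #   nom://add       links to nom.add.html and shows "add"
   #   nom://++        links to nom.plusplus.html and shows "++"
   #   nom://plusplus  links to nom.plusplus.html and shows "++"
   # so for nom://++ the html appended to the preceding text should be
   # roughly (all on one line)
   #   <code class='nom-command'>
   #   <a href='http://nomlang.org/doc/commands/nom.plusplus.html'>++</a>
   #   </code>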

   B"nomsyn://" {
     replace "nomsyn://" ""; 
     ++; put; --; clear;
     get; add "<a href='http://nomlang.org/doc/syntax/nom.syntax."; 
     ++; get; add ".html'>"; 
     # allow nomsyn://reparse> syntax etc, but > has already been
     # made into &gt; for html.
     replace "parse&gt;.html" "parselabel.html";
     replace "class.html" "classes.html";
     get; add "</a>"; 
     # allow nomsyn://parselabel syntax etc (make visible link correct)
     replace "html'>parselabel" "html'>parse&gt;";
     --; put; clear; add "text*"; push; .reparse
   }

   B"pep://" {
     replace "pep://" ""; ++; put; --; clear;
     get; add "<a href='http://nomlang.org/doc/machine/pep."; 
     ++; get; add ".html'>"; get; add "</a>"; --;
     put; clear; add "text*"; push; .reparse
   }

   clear; 
   get; add "<a href='"; ++; get; add "'>";
   swap;
   # remove the https:// etc from the visible link because
   # they look ugly in the text.
   # strip the whole scheme prefix so that "http" inside a host name 
   # or path is left alone
   replace "https://" ""; replace "http://" ""; replace "nntp://" "";
   replace "://" "";
   swap; get; --; add "</a>";
   put; clear; add "text*"; push; .reparse
 }
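
  # A worked example of the plain url rule above (traced by hand as a
  # sketch, not captured output): the input word
  #   https://www.nomlang.org
  # keeps its full address in the href but loses the scheme in the
  # visible text, giving roughly
  #   <a href='https://www.nomlang.org'>www.nomlang.org</a>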

 # "text" file.txt syntax to be linked
 "quoted*file*" {
    clear; 
    add "<a href='"; ++; get; --; add "'>"; get; add "</a>";
    put; clear; add "text*"; push; .reparse
 }

 # reduce file* grammar tokens separately so we can html format them
 B"file*".!"file*" {
   replace "file*" "text*"; push; push; 
   --; --; add "<code>"; get; add "</code>"; put; ++; ++; clear;
   .reparse
 }

  # absorb a trailing space into quoted*, because the sequences
  # quoted*url* and quoted*file* are significant (they make links)
 "quoted*space*" {
    clear; 
    get; ++; get; --; put; clear;
    add "quoted*"; push; .reparse
 }

 # reduce "quoted" separately so we can add some html curly quotes
 # the !"quoted*" clause is supposed to ensure 2 tokens (this should only
 # really be a problem if the "quoted" is the first word of the document)
 B"quoted*".!"quoted*" {
   !E"url*".!E"file*" {
     clear;
      # The quoted attribute may have a space (or many?) at the end,
      # so that space needs to go after the curly quotes.
      # add some html curly quotes and get saved space
     add "&ldquo;"; get; add "&rdquo;"; 
      # remove the space just before the last quote (which is added
      # during "space*" reductions).
     replace " &rdquo;" "&rdquo;"; 
     # add a space to separate from next word.
     add " ";
     ++; get; --; put; clear;
     add "text*"; push; .reparse
   }
 }
 
 # tokens to reduce to text
 # codeline, codeblock, word, text, space, quoted,
 #B"word*",B"text*",B"space*",B"quoted*",B"codeline*",B"codeblock*" {
 B"word*",B"text*",B"space*",B"codeline*",B"codeblock*" {
   # need to conserve quoted at end
   E"word*",E"text*",E"space*",E"codeline*",E"codeblock*" {
     # check that there really are 2 tokens (not one)
     push; !"" {
       pop;
       clear; get; ++; get; --; put; clear;
       add "text*"; push; .reparse
     }
     pop;
   }
 }
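
  # A hand trace of the 2 token check above, as a sketch. It assumes
  # that push moves the first token of the workspace onto the stack and
  # that pop moves it back again.
  #   workspace "text*"      : push leaves "" so the inner block is
  #                            skipped and the final pop restores "text*"
  #   workspace "text*word*" : push leaves "word*" so the inner block
  #                            runs, pop restores both tokens and they
  #                            are joined into one text* token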

 # -------------------
  # 3 token reductions
 pop;

  # this is a crazy reverse reduction: the list is reduced from its end
 "item*text*endlist*" {
   clear; get; ++; get; ++; get; --; --; put;
   clear; add "endlist*"; push; .reparse
 }

 # have reduced the whole list (in reverse) so just make text 
 # there could be 1,2, or 3 tokens here. need to get the endlist attribute
 # and add a start <ul> tag
 E"endlist*".!B"item*text*" {
    # more succinct way to get the last token when there is a variable
    # number of tokens. A clever trick.
    push; !"" { push; !"" { push; } } pop;
    clear; add "<!---- list ----->\n";
    add "<ul>"; get; replace "<ul></li>" "<ul>\n"; put; 
    clear; add "text*"; push; .reparse 
 }
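
  # A hand trace of the "last token" idiom above, as a sketch. It
  # assumes that push on an empty workspace does nothing and that pop
  # brings back only the top token of the stack.
  #   workspace          stack            step
  #   "text*endlist*"    (empty)          at the start
  #   "endlist*"         text*            after the 1st push
  #   ""                 text* endlist*   after the 2nd push
  #   "endlist*"         text*            after the pop
  # Whether 1, 2 or 3 tokens were present, the workspace should end up
  # holding only endlist* with the tape pointer at its attribute.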

 # sometimes we might get text*text*something* (eg in lists)
 B"text*text*".!"text*text*" {
   replace "text*text*" "text*"; push; push;
   --; --; get; ++; get; --; put; clear;
   # transfer unknown token attribute
   ++; ++; get; --; put; clear;
   # realign tape pointer
   ++; .reparse
 }

 push; push; push;

 (eof) {
   #*
   add "<!-- final stack: "; print; clear;
   unstack; add " -->\n"; print; clear;
   stack;
   add "<!-- html rendered by Nom script (www.nomlang.org): -->\n";
   add "<!--   bumble.sf.net/books/pars/eg/text.tohtml.pss -->\n";
   add "<!--   pep -f eg/text.tohtml.pss file.txt -->\n";
   add "<!-- see eg/text.tohtml.format.txt for text format -->\n";
   *#
   print; clear;
   # print the rendered html
   pop; clear; get; print; quit;
   # The html header and footer are made in pars/www/blog.sh
 }