#*

###  mark.latex.pss

ABOUT 

  Parses a plain-text document format and produces [latex] output. 
  This is an older version. The quality of the [latex] output may not
  be very good.

  The script file:///eg/text.tohtml.pss may be a simpler and better 
  *plain-text* formatter (although it doesn't produce latex).

  This script attempts to parse a "markdown"-like syntax (this script header is
  an example of the document format) and to transform it into LaTeX source
  code. The script runs on the "pep" parsing machine, which is implemented at
  http://bumble.sf.net/books/pars/ . 

  There are other ways of achieving this which may seem better or more
  straightforward. The [antlr] parsing system can be used to write grammars
  that parse markdown-like structures, or plain regular expressions can be
  used. But this is a good exercise for the pep/nom machine, and it can
  recognise more complex structures than plain regular expressions can.

STATUS

  Producing nice readable output with pdflatex. Includes most syntax
  except FAQs and datelists.

  Would like to include date lists with descriptions.
  eg 
  24 july 2022
    did something
  26 jul 2022
    more stuff

  The script is now generating compilable tex output with:
  >> pep -f eg/mark.latex.pss pars-book.txt > test.tex
  >> pdflatex test.tex; pdflatex test.tex
 
  This produces a pdf file "test.pdf" and the formatting includes
  lists.

  A work in progress. The document parses as text, and most structures are
  recognised (not images). This file contains an interesting way of resolving
  or removing parse tokens when they are no longer significant: it uses a
  negative test, such as:

  ----
    # codeblocks with no caption (description) 
    !"codeblock*".E"codeblock*".!B"emline*" {
      clear; get; add " "; ++; get; --; put; clear;
      add "text*"; push; .reparse
    }
  ,,,,

TOKEN LIST

  * tokens currently used by this script
  >> --- >> 4dots codeblock codeline emline nl text uutext uuword word
   bl, ulist, olist, dlist, dash, 
  
    [[* (for images) ??

  We don't actually need heading* and subheading* tokens because they get
  transpiled (into LaTeX) as soon as they are seen in the document.
  Also, we don't need link/file/quoted/star tokens, although I could allow
  them to exist for a brief moment.

  For lists need: - dash, bl, list
  eg:  o/- -> olist, 
  olist, text, nl, dash -> list

  star* has been eliminated by parsing immediately, and >> and ---
  could also be eliminated. Probably need to add bl* (blank line),
  dash*, and ulist/olist/dlist for unordered lists, ordered lists
  and definition lists.

  Useful grammar analysis.

  * get a unique list of tokens used during parsing
  >> pep -f eg/mark.latex.simple.pss pars-book.txt | sed '/%% ---/q;' | sed 's/^[^:]*: *//;s/\* *$//' | tr '*' '\n' | sort | uniq

TODO

  make images work. make datelists work. 
  adapt this script to translate to markdown. translate to 
  nroff man format. translate to html.

BUGS

  Strange pep/nom segmentation fault with multiline comment.
  Special chars like \n in a d/- definition list will halt pdflatex.
  Need to escape them.

NOTES

 Using a technique to make image-related tokens and then make them
 disappear by changing tokens to "word*"

 Lists might have a title or caption.

 Could use d/- for description lists where definition occurs on the 
 same line, and D/- for lists where definition starts on next line.

 Use pdfpages to create a bindable booklet on A4, without sticking pages
 together; this works but the font is currently too small, and the margins too
 big.

 A signature is how many pages (not sheets) go into a "folio". Each folio
 gets sewn through its centre onto the book spine. The signature must be a
 multiple of 4.

  see also psnup
  >> www.ctan.org/pkg/pdfpages

  * create a bindable landscape pdf with a4 paper using pdfpages
  --------
    %%Signatures with pdfpages
    \documentclass[a4paper]{article}
    \usepackage{pdfpages}
    \begin{document}
    \includepdf[pages=-,signature=8,landscape]{book.pdf}
    \end{document}
  ,,,,

  0/- pages=- use all pages
    - signature is the number of logical pages per signature; divide it by 4
      to get the number of physical A4 sheets (signature=8 means 2 sheets
      per folio). This seems to work well.

  Need to escape | in code lines because it terminates the 
  \verb|...| command that code lines are wrapped in.
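
  One possible fix, sketched here and not tested: before wrapping the
  line in \verb|...|, rewrite each "|" so that it closes the current
  \verb, prints the bar with "+" as a temporary delimiter, and opens a
  new \verb. The rule below is the existing code-line rule with one
  added "replace"; the choice of "+" as the second delimiter is an
  assumption of this sketch.

  * a sketch of the code-line rule with "|" handled
  ---
    "bl*>>*","nl*>>*" { 
      # ignore leading space.
      clear; while [ \t\f]; clear;
      whilenot [\n]; 
      # a | b  becomes  a |\verb+|+\verb| b
      replace "|" "|\\verb+|+\\verb|";
      put; clear;
      add " \\verb| "; get;
      add " |\n"; put; clear;
      add "codeline*"; push; .reparse
    }
  ,,,,

  The "+" only ever delimits the single bar, so a code line which
  itself contains "+" should not be affected.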

  This script is probably a great deal more complex than some 
  equivalent regular expression type renderer (for a format such 
  as markdown). And when it goes wrong, it has to be carefully
  debugged, thinking about how the rules interact with each other.
  Also, normally you have to watch the token stack as it reduces
  in order to find out what is going wrong.

  But apart from these problems it has great advantages: Once the 
  grammar is robust and permissive, it can be easily modified to 
  output different formats such as html or markdown.
  Also, it can be translated into scripting and compilable languages
  using the pep/nom scripts in the tr/ folder: languages such
  as "go","java","c","ruby","python" and maybe "tcl".
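
  As a rough illustration of changing the output format, the two
  ordered-list rules could emit html instead of LaTeX by changing only
  the strings they add. This is an untested sketch: the tags chosen are
  assumptions, and the character escaping in the word* block would also
  need to produce html entities instead of LaTeX escapes.

  * a sketch of the ordered-list rules emitting html
  ---
    "olist*text*dash*" {
      clear;
      get; add "\n <li>"; ++; get; --; add "</li>"; put; clear;
      add "olist*"; push; .reparse
    }

    "olist*text*bl*" {
      clear; 
      add "\n <ol>"; get;
      add "\n <li>"; ++; get; --; add "</li>";
      add "\n </ol>\n\n"; 
      put; clear;
      # insert the blankline attribute
      add "\n\n"; ++; put; --; clear;
      add "text*bl*"; push; push; .reparse
    }
  ,,,,

  The token names and the tape movements are the same as in the LaTeX
  rules; only the added strings differ.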

  Add images, datelists, inline code?
  Convert this grammar to generate html/markdown etc

  Need to tidy up description lists.

  Nested lists may just work out of the box! But a different
  list terminator (not blankline) would be handy.

  What about inline code?
  This script needs to parse *any* text successfully! Even text
  that is not in any particular format.

  May need to add "quoted" to handle quoted text, but not really
  necessary at the moment.
  
  Using o/- O/- u/- U/- d/- D/- to start ordered/unordered/description
  lists!

IMAGE FORMAT
  
  * examples of the image format
  ---
    [[ f.png >> 80% "caption" ]]
    [[ f.png >> 
      80% 
    ]]
  ,,,

DOCUMENT FORMAT
  
  This is a document format I have used in many code and 
  booklet files. It is a kind of markdown, with even less markup than
  markdown. It would be useful to try to transform it to markdown.
  
 * A Description of the format

D/- UPPER CASE LINE
      is formatted as a top level heading. It can include "quotes"
      but no other characters apart from [A-Z]
  - UPPER CASE LINE ....
      is a subheading if it ends with 4 dots
  - asterisks
      before and after a word eg *this* emphasises the word.
  - ">>" 
      starting a line is a "code" line 
  - --- and ,,,,
      starting their lines and surrounding code, is a code block
  - star
      a line starting with a star "*"
      describes the code line or block which follows
  - urls and filenames should get a special format.

  d/- d/- : makes a description list (which is good for a 
      type of glossary)
    - term : the description term can end with a colon ":"
      or with a newline
    - machine
        or with a newline
    - end the whole list with a blank line
    - it may be possible to nest lists (I haven't tried),
      but probably not until I include a different terminating
      token (not a blank line)
    
  o/- or O/- or 0/- makes an ordered list
    - each item gets a number automatically. 
    - This type of list also ends with a blank line
    - *u/-* or *U/-* makes an unordered or "bullet" list
    - empty dashes in the middle of the list may crash it
      at the moment, who knows.


  Key names are rendered as keys.
  eg [return] is rendered as a keyname.
  urls are turned into links. Filenames are made into fixed-pitch
  font.
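
  The rules above can be combined as in the following sample. It is
  only an illustration: the text and the "make all" command are
  invented, and the sample has not been run through this script.

  * a small sample document using the format described above
  ---
    INSTALLING THE TOOL

    BUILDING FROM SOURCE ....

    This paragraph has an *emphasised* word and the [return] key.
    See http://bumble.sf.net/books/pars/ for the parsing machine.

    * compile the program
    >> make all

    u/- the first bullet item
      - the second bullet item

  ,,,,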

LATEX 

  Need to escape all these chars in latex
  >> & % $ # _ { } ~ ^ \
  The first 7 take a leading backslash, eg \&
  The last 3 become
    \textasciitilde, \textasciicircum, and \textbackslash

  \begin{verbatim}...\end{verbatim}

  Look at books/format-book/booktocgi.sed for lots of latex tricks and tips

HISTORY 

   25 aug 2022
     added images. added quotes, width and positioning. Parsing seems to
     work well. Wrote mark.format.txt to document this format and 
     test the "mark" scripts.

   9 July 2022
     ordered and unordered and description lists appear to work well.
     Made emphasis lines (starting with star) appear on a line by 
     themselves. This means they can be used as a simple list.
     Also, introduced a bl* blankline token which will be used with
     lists (to terminate a list). Made the parser more permissive.
   8 July 2022
     Removed the star* token and parsed immediately in the word block.
     But didn't remove >>* because nl* tokens cause problems.
   7 july 2022
     A lot of progress. Had the idea to use the accumulator to 
     count words on a line and so be able to tell what is the 1st word.
     This could allow us to dispense with a number of tokens, and simplify
     the grammar.

     Output now seems acceptable apart from lists not working.
     Revisiting this to try to make a nice pep/nom book. Can't use 
     a table in a figure environment. This version mark.latex.simple.pss
     is actually better than mark.latex.pss

   17 June 2021
     Adding emphasis words such as *this* . Need to do lists with - * and
     maybe dates, images with floating and resizing. Better code listings

   14 June 2021
     Looked at how to escape chars. Done. We also escape chars that 
     are in star lines and maybe codelines, but the \verbatim and 
     \lstlisting words seem to take care of special characters in
     those blocks.

    I should try
      -----
        ">>*star*" --> "starline*"
        "starline*word*" --> "starline*"
        "starline*uuword*" --> "starline*"
      ,,,,

    This is in contrast to the technique of immediately reading to the 
    end of the line with "until" or "whilenot"

   3 june
    more work. proceeding well.

   2 june 2021
    Much progress, a basic latex doc has been formed. Now to refine
    and introduce new tokens. Have done sections, subsections, and 
    codeblocks. Found an elegant way to eliminate insignificant tokens
    using a "negative" logic.

    Continuing to write this. The unstack;print;stack trick is 
    very useful. Parsing and printing text with just upper case headings.
    This seems a promising incremental way to parse.

   1 June 2021
    Try the unstack trick at the parse label to debug token
    reductions. I fixed the "stack" command in object/machine.interp.c
    so that it updates the tape pointer.

  23 April 2021
    Continuing to work on the script and look for ways to 
    simply debug it. I have come to the idea that ".reparse"
    should also print the stack if a particular switch is set.
    This would allow the stack reductions to be easily followed.

  21 April 2021
    Script begun. This is a second attempt.

vvvvvvv
*#
  
  begin {
    # create a dummy newline so that doc structures work even
    # on the first line of the file/stream.
    add "nl*"; push;
  }

  read;

  ![:space:] {
    # count words per line with the accumulator
    a+;
    whilenot [:space:]; put;
    
    # image structure delimiters
    "[[","]]" { put; add "*"; push; .reparse }
    # image position indicators; is the default centre?
    ">>>","<<<","ccc" { put; clear; add "float*"; push; .reparse }
    # quotes for image captions. I will use """ to delimit 
    # image captions, but not multiline. Because a run-away multiline
    # quote will eat up the whole document. Or use word parsing here

    B'"""' { 
      put; clop; clop; clop;
      E'"""'.!'"""' { clear; add "quote*"; push; .reparse } 
      clear; get;
      whilenot ["\n]; 
      !(eof) { read; } !(eof) { read; } !(eof) { read; }
      !E'"""' { put; clear; add "word*"; push; .reparse } 
      put; clear; add "quote*"; push; .reparse
    }

    # widths for images in format eg 20%
    E"%".!"%".[0123456789%] {
      put; clear; add "width*"; push; .reparse
    }
    # image width in point format
    E"pt".!"pt".[0123456789pt] {
      put; clear; add "width*"; push; .reparse
    }
    E"cm".!"cm".[0123456789cm] {
      put; clear; add "width*"; push; .reparse
    }
    E"mm".!"mm".[0123456789mm] {
      put; clear; add "width*"; push; .reparse
    }
    E"em".!"em".[0123456789em] {
      put; clear; add "width*"; push; .reparse
    }

    # create an image file token for images. 
    #E".png",E".jpg",E".jpeg",E".bmp",E".gif" {

    # these are the formats that pdflatex can handle
    E".png",E".jpg",E".jpeg",E".eps",E".pdf" {
      clear; add "imfile*"; push; .reparse 
    }

    # date and datelist tokens
    [0-9] {
      put; clear; add "number*"; push; .reparse
    }
    # case insensitive month names
    lower;
    "jan","january","feb","february","mar","march" {
      put; clear; add "month*"; push; .reparse
    }
    clear; add "word*"; push; .reparse 
  }

  # keep leading space in newline token?
  [\n] { 
    # set accumulator == 0 so that we can count words 
    # per line (and know which is the first word)
    zero; nochars;
    while [ ]; put; clear; add "nl*"; 
    push; .reparse
  }
  [\r\t ] { clear; !(eof){.restart} }

parse>

  # for debugging, add % as a latex comment.
   add "%%> line "; lines; add " char "; chars; add ": "; print; clear; 
   unstack; print; stack; add "\n"; print; clear;

  # -------------
  # 1 token
  pop;

  "nl*" { nop; }

  # here we classify words into other tokens
  # we can use accumulator with a+ a- to determine if current
  # word is the first word of the line, or even count number of 
  # words per line. This should simplify grammar items such as
  # nl/---  and nl/star/ etc
  # another advantage, is that we can dispense with tokens such as 
  # ---, >> etc and not have to get rid of them later.
  "word*" {
    clear; get; 
    # no numbers in headings!
    [A-Z] { clear; add "uuword*"; push; .reparse }

    # at least three --- on a newline marks a code block start
    # use 'count;' here to simplify. The token --- probably doesn't
    # need to exist.
    B"---".[-] { clear; add "---*"; push; .reparse }

    ">>" { add "*"; push; .reparse }
    # >> is used for code lines and for image position indicators 
    # only make a dash token if it is first word on the line
    #*
     no this wont work, it may result in an infinite loop
     because ">>" := float  float := word  word := float
    ">>" { 
      put; clear; count;
      # >> on a newline marks a code line start
      "1" { clear; add ">>*"; push; .reparse }
      # if not first word then is image position 
      clear; add "float*"; push; .reparse
      #clear; get;
    }
    *#

    # subheading marker
    B"....".[.] { clear; add "4dots*"; push; .reparse }

    # dash is used for lists 
    # only make a dash token if it is first word on the line
    "-" { 
      clear; count;
      "1" { clear; add "dash*"; push; .reparse }
      clear; get;
    }

    # ordered list start token 
    # only make token if it is first word on the line
    "o/-","O/-","0/-" { 
      clear; count;
      "1" { clear; put; add "olist*"; push; .reparse }
      clear; get;
    }

    # unordered list start token 
    "u/-","U/-" { 
      clear; count;
      "1" { clear; put; add "ulist*"; push; .reparse }
      clear; get;
    }

    # definition/description list start token 
    # need to parse a bit differently because of the desc
    "d/-","D/-" { 
      clear; count;
      "1" { 
        clear;
        # read the description term here. Special chars still need to be
        # escaped (\verb can't go in here); unescaped ones will halt pdflatex.
        add "\n \\item["; whilenot [\n:]; add "]"; put;
        clear; add "dlist*"; push; .reparse
      }
      clear; get;
    }

    # star on newline marks emphasis, list or code description 
    # probably dont need star token.
    "*" { 
      # check that * is 1st 'word' on line using accumulator
      clear; count; 
      !"1" { clear; add "*"; }
      "1" {
        clear; while [ \t\f]; clear;
        whilenot [\n]; cap; put; clear;
        # this is a trick, because we want special LaTeX chars to
        # be escaped. So, will add \\emph{} after next replace code. 
        add "::EMPH::"; get; put;
        #add "emline*"; push; .reparse
      }
    }

    # need to escape % # } \ and others
    # & % $ # _ { } ~ ^ \   
    # \textasciitilde, \textasciicircum, and \textbackslash
    replace '\\' "\\textbackslash ";
    replace "&" "\\&";
    replace "%" "\\%";
    replace "$" "\\$";
    replace "#" "\\#";
    replace "_" "\\_";
    replace "{" "\\{";
    replace "}" "\\}";
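    # the empty group {} ends the control word, so a following letter
    # (as in a~b) is not read as part of the macro name
    replace "~" "\\textasciitilde{}";
    replace "^" "\\textasciicircum{}";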
    replace ">" "\\textgreater ";
    replace "<" "\\textless ";
    replace "LaTeX" "\\LaTeX{}";

    # now make the emphasis line token, after special chars have 
    # been escaped.
    B"::EMPH::" { 
      replace "::EMPH::" " \\emph{"; add "}";
      put; clear;
      add "emline*"; push; .reparse
    }

    # If a previous test has matched, then the workspace should
    # be clear, and so none of the following will match.

    # graphical key representations
    B"[".E"]" {
      replace "[esc]" "\\Esc";
      replace "[enter]" "\\Enter";
      replace "[return]" "\\Enter";
      replace "[insert]" "\\Ins";
      replace "[shift]" "\\Shift";
      replace "[delete]" "\\Del";
      replace "[home]" "\\Home";
      # keys defined by 'keystroke' package, can make new ones.
      # \Enter \Del  \Ins    \Esc   \Shift  \Ctrl  \Home
      # \End   \PgUp \PgDown \PrtSc \Scroll \Break
    }

    #replace '\\n' "\\textbackslash n";
    #replace '\\f' "\\textbackslash f";
    #replace '\\r' "\\textbackslash r";
    #replace '\\t' "\\textbackslash t";

    put;     
    
    # urls, not so important for LaTeX (and pdf) output.
    # Don't really need a token because we can render immediately
    # Could maybe render them as footnotes
    B"http://",B"https://",B"www.",B"ftp://",B"sftp://" { 
      # clear; add "url*"; push; .reparse
      # render as fixed pitch font
      clear; add "\\url{"; get; add "}"; put; clear; 
    }

    # format acronyms as a small capital font, case insensitive
    lower;
    "antlr","pdf","json","ebnf","bnf","dns","html" {
      clear; add "\\textsc{\\textbf{"; get; add "}}"; put; clear;
    }
    # restore the mixed-case version of the input word
    !"" { clear; get; }

    # filenames, could be elided with quoted filenames
    "parse>","print","pop","push","get","put",".reparse",".restart", "add",
    "sed","awk","grep","pep","nom","less","stdin","stdout","bash",
    "lex","yacc","flex","bison","lalr","gnu",
    E".h",E".c",E".a",E".txt",E".doc",E".py",E".rb",E".rs",E".java",E".class",
    E".tcl",E".tk",E".sw",E".js",E".go",E".pp",E".pss",E".cpp",E".pl",
    E".html",E".pdf",E".tex",E".sh",E".css",E".out",E".log",
    E".png",E".jpg",E".jpeg",E".bmp",
    E".mp3",E".wav",E".aux",
    E".tar",E".gz",E"/" {
      clear; add "\\texttt{"; get; add "}"; put; clear;
    }

    # mark up language names
    "python","java","ruby","perl","tcl","rust","swift","markdown",
    "c","c++" {
      clear; add "\\textit{\\texttt{"; get; add "}}"; put; clear;
    }

    # paths and directories ? 
    B"../".!"../" {
      clear; add "\\texttt{"; get; add "}"; put; clear;
    }

    B'"'.E'"'.!'""'.!'"' {
      # filenames in quotes
      clip; clop; put;
      # quoted uppercase words in headings
      [A-Z] {
        # add LaTeX curly quotes to the heading word
        clear; add "``"; get; add "''"; put; clear;
        add "uuword*"; push; .reparse 
      }

      # markup language names
      "python","java","ruby","perl","tcl","rust","swift","markdown",
      "c","c++","forth" {
        clear; add "``\\textit{\\texttt{"; get; add "}}''"; put; clear;
      }

      # markup filenames and some unix and pep/nom names as fixed-pitch
      # font. 
      "pep",
      "parse>","print","pop","push","get","put",".reparse",".restart", "add",
      "sed","awk","grep","pep","nom","less","stdin","stdout","bash",
      E".h",E".c",E".a",E".txt",E".doc",E".py",E".rb",E".rs",E".java",E".class",
      E".tcl",E".tk",E".sw",E".js",E".go",E".pp",E".pss",E".cpp",E".pl",
      E".html",E".pdf",E".tex",E".sh",E".css",E".out",E".log",
      E".png",E".jpg",E".jpeg",E".bmp",
      E".mp3",E".wav",E".aux",
      E".tar",E".gz",E";"
        { clear; add "``\\texttt{"; get; add "}''"; put; clear; }
      # everything else in quotes (but only words without spaces!)
      !"" { clear; add "``\\textit{"; get; add "}''"; put; clear; }
    }

    # filenames 
    # crude pattern checking.
    B"/".!"/" {
      clip; E"." { clear; add "\\texttt{"; get; add "}"; put; clear; }
      clip; E"." { clear; add "\\texttt{"; get; add "}"; put; clear; }
      clip; E"." { clear; add "\\texttt{"; get; add "}"; put; clear; }
    }

    # emphasis is *likethis* (only words, not phrases) 
    B"*".E"*".!"**" {
      clip; clop; put; clear; 
      add "\\textbf{\\emph{"; get; add "}}"; put; clear;
    }

    # && starting a line marks the document title 

    # the document 'title' after && or first heading, & has already 
    # been escaped
    "\\&\\&" { 
      clear; count; 
      "1" {
        clear; while [ \t\f]; clear;
        whilenot [\n]; put; clear;
        add "\\centerline{\\Large \\bf "; get;
        add "} \\medskip \n"; put; clear;
      }
    }

   # A quote, starting the line
   "quote:" { 
      clear; count; 
      "1" {
        # \begin{center}
        #    {\huge \`\`}\textit{$quote}{\huge ''}
        #    \textsc{$quoteauthor}
        #  \end{center}
        clear; while [ \t\f]; clear;
        whilenot [\n]; put; clear;
        add "\\begin{center}{\\huge ``}\\textit{"; get;
        add "}{\\huge ''}\\end{center} \n"; put; clear;
      }
    }

    clear; add "word*";
  }

  pop;
  # -------------
  # 2 tokens

  #----------------
  # dates for datelists
  # dates begin on a newline and each date begins a list item.

  # vanish numbers if not first on line or preceded by month*
  E"number*".!"number*".!B"nl*".!B"bl*".!B"month*" {
    replace "number*" "word*"; push; push; .reparse
  }
  B"number*".!"number*".!E"nl*".!E"bl*".!E"month*" {
    replace "number*" "word*"; push; push; .reparse
  }

  # vanish months if not between day/year or first on line
  # this should allow eg "aug 2022" and "30 aug 2022"
  E"month*".!"month*".!B"nl*".!B"bl*".!B"number*" {
    replace "month*" "word*"; push; push; .reparse
  }
  B"month*".!"month*".!E"number*" {
    replace "month*" "word*"; push; push; .reparse
  }

  #--------------------
  # images 
  # standard format is [[*imfile*quote*width*float*]]*
  # A width is "50%" or "200pt"; float is left/right/center 
  # imfile is a image file name. quote/width/float are optional
  # tokens. The order of tokens is mandatory

  # remove newline and blank line tokens when parsing
  # images. But this is tricky, because we want to preserve
  # them otherwise.

  # remove nl/bl tokens in image formats  
  "[[*nl*","[[*bl*","imfile*nl*","imfile*bl*",
  "quote*nl*","quote*bl*","width*nl*","width*bl*",
  "float*nl*","float*bl*" { 
    push; clear; .reparse
  }
  
  # vanish [[ if not followed by imfile
  B"[[*".!"[[*".!E"imfile*" {
    replace "[[*" "word*"; push; push; .reparse
  }

  # vanish ]] 
  E"]]*".!"]]*".
  !B"imfile*".!B"float*".!B"quote*".!B"width*" {
    replace "]]*" "word*"; push; push; .reparse
  }

  # vanish imfiles 
  B"imfile*".!"imfile*".!E"float*".!E"quote*".!E"width*".!E"]]*" {
    replace "imfile*" "word*"; push; push; .reparse
  }
  E"imfile*".!B"[[*" {
    replace "imfile*" "word*"; push; push; .reparse
  }

  # vanish quotes
  B"quote*".!"quote*".!E"float*".!E"width*".!E"]]*" {
    replace "quote*" "word*"; push; push; .reparse
  }
  E"quote*".!"quote*".!B"imfile*" {
    replace "quote*" "word*"; push; push; .reparse
  }

  # vanish widths
  B"width*".!"width*".!E"float*".!E"]]*" {
    replace "width*" "word*"; push; push; .reparse
  }
  E"width*".!"width*".!B"quote*".!B"imfile*" {
    replace "width*" "word*"; push; push; .reparse
  }

  # vanish floats
  B"float*".!"float*".!E"]]*" {
    replace "float*" "word*"; push; push; .reparse
  }
  E"float*".!"float*".!B"width*".!B"quote*".!B"imfile*" {
    replace "float*" "word*"; push; push; .reparse
  }

  # Add missing attributes here. This is a technique for 
  # providing "optionality" in pep/nom scripts
  "width*]]*" {
    clear; add "width*float*]]*"; 
    push; push; push; 
    # also add an appropriate attribute for a center float
    --; --; get; ++; put; 
    clear; --; put; ++; ++;
    .reparse
  }
  "quote*]]*","quote*float*" {
    replace "quote*" "quote*width*"; push; push; push;
    # now transfer the attributes and add null quote
    --; --; get; ++; put; 
    # or add an appropriate width
    clear; --; put; ++; ++;
    .reparse
  }
  "imfile*]]*","imfile*width*","imfile*float*" {
    replace "imfile*" "imfile*quote*";
    push; push; push; # ws should be clear
    # now transfer the attributes and add null quote
    --; --; get; ++; put; 
    # or put a null quote here.
    clear; --; put; ++; ++;
    .reparse
  }

  # End image token manipulation

  # elide text
  "text*text*","word*text*",
  "word*word*","text*word*",
  "word*uuword*","text*uuword*","uutext*word*","uuword*word*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  # tokenlist:
  # --- >> 4dots codeblock codeline emline nl text uutext uuword word
  # codeblock,
  # remove pesky newline tokens, 4dots handled elsewhere
  # not really working
  #*
  "nl*text*","nl*word*","nl*emline*","nl*codeline*",
  "nl*codeblock*" {
    # delete nl token
    clop; clop; clop; push; clear;
    # ignore newline
    get; --; put; clear;
    .reparse
  }
  *#

  "nl*text*","nl*word*", "bl*text*","bl*word*" {
    clear; get; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  "nl*dash*" {
    clear; get; ++; get; --; put; clear;
    add "dash*"; push; .reparse
  }

  "nl*emline*","bl*emline*" {
    clear; ++; get; --; put; clear;
    add "emline*"; push; .reparse
  }

  # We are using a dummy nl* token at the start of the doc, so the 
  # codeblock* codeline* etc tokens are not able to be the first token
  # of the document. So we can remove the !"codeblock*". clause.

  # multiline codeblocks with no caption 
  E"codeblock*".!B"emline*" {
    clear; get; 
    add "\n\n \\begin{tabular}{l}\n  ";
    ++; get; --; 
    add " \\end{tabular} \n";
    put; clear;
    add "text*"; push; .reparse
  }

  # single line code with no caption 
  E"codeline*".!B"emline*" {
    clear; get; 
    add "\n\n \\begin{tabular}{l}\n  ";
    ++; get; --; 
    add " \\end{tabular} \n";
    put; clear;
    add "text*"; push; .reparse
  }

  # eliminate emline* tokens (not followed by codeblock/line)
  # the logic is slightly different because emline* is significant before
  # other tokens, not after.
  # also, consider emline*text*nl*
  B"emline*".!E"nl*".!E"codeline*".!E"codeblock*" {
    replace "emline*" "text*"; push; push; 
    # make emline display on its own line, even when not
    # followed by codeline/codeblock. LaTeX will treat a blank line 
    # as a paragraph break, but \newline or \\ could be used.
    --; --; add "\n\n"; get; add "\n\n"; put; clear;
    .reparse
  }

  # remove insignificant 4dots* tokens, 
  # 4 dots (....) marks a subheading and always comes at the end of 
  # all capitals line. Just replacing the 4dots token with a text
  # token is safer and more logical.
  E"4dots*".!B"uutext*".!B"uuword*" {
    replace "4dots*" "text*"; push; push; .reparse
  }

  # remove insignificant ---* tokens
  E"---*".!B"nl*".!B"bl*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  # remove insignificant >>* tokens
  # let's assume that codelines can't start a document? Or let's
  # generate a dummy nl* token at the start of the document to 
  # make parsing easier.
  # !">>*".E">>*".!B"nl*" {
  E">>*".!B"nl*".!B"bl*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  # elide upper case text 
  "uuword*uuword*","uutext*uuword*" {
    clear; get; add " "; ++; get; --; put; 
    clear; add "uutext*"; push; .reparse
  }

  # a blank line token for terminating lists etc 
  # bl/bl should not happen really
  "nl*nl*","bl*nl*","bl*bl*" {
    clear; get; ++; get; --; put; clear;
    add "bl*"; push; .reparse
  }

  # code line (starts with >>) 
  "bl*>>*","nl*>>*" { 
    # ignore leading space.
    clear; while [ \t\f]; clear;
    # escape | so it doesn't terminate the \verb command.
    # but how to do it? or use lstlisting
    whilenot [\n]; put; clear;
    add " \\verb| "; get;
    add " |\n"; put; clear;
    add "codeline*"; push; .reparse
  }

  # code block marker 
  "bl*---*","nl*---*" { 
    clear; until ",,,"; clip; clip; clip;
    # remove excessive indentation.
    replace "\n   " "\n";
    put; while [,]; clear;
    add "\n \\begin{lstlisting}[breaklines]"; get;
    add "\n \\end{lstlisting} \n"; put; clear;
    add "codeblock*"; push; .reparse
  }

  # a code block with its preceding description
  "emline*codeblock*" {
    clear; 
    add "\n\n \\begin{tabular}{l}\n  ";
    get; add " \\\\ "; ++; get; --; 
    add " \\end{tabular} \n";
    put; clear;
    add "text*"; push; .reparse
  }

  # a code line with its preceding description
  # add some tabular LaTeX markup here.
  "emline*codeline*" {
    clear; 
    add "\n\n \\begin{tabular}{l}\n  ";
    get; add " \\\\ \n";
    ++; get; --; 
    add " \\end{tabular} \n";
    #add " \\end{figure}";
    put; clear;
    add "text*"; push; .reparse
  }

  # probably indicates an empty - at the end of a list
  # add a dummy text token
  "olist*bl*","ulist*bl*","dlist*bl*" {
    push; clear; add "empty"; put; 
    clear; add "\n\n"; ++; put; --;
    clear; add "text*bl*"; push; push; .reparse
  }

  # or use this to terminate the list, and so allow nested lists
  "olist*dash*","ulist*dash*","dlist*dash*" {
    push; clear; add "empty"; put; 
    clear; add "text*dash*"; push; push; .reparse
  }

  pop;
  # -------------
  # 3 tokens
  "olist*word*dash*","ulist*word*dash*","dlist*word*dash*",
  "olist*word*bl*","ulist*word*bl*","dlist*word*bl*" {
    replace "word*" "text*"; 
    # or dont reparse
    # push; push; push; .reparse
  }

  # eliminate dashes that are not part of a list
  # eg: ulist*dash* olist*text*dash* dlist*word*dash*
  # the logic is tricky: how do we know there are really 3 tokens 
  # here, and not 2? This is the problem with negative tests.
  # It doesn't matter because we are not altering attributes here.
  E"dash*" {
    !B"ulist*text*".!B"olist*text*".!B"dlist*text*" {
      replace "dash*" "text*"; push; push; push; .reparse
    }
  }

  "olist*text*dash*" {
    clear;
    get; add "\n \\item "; ++; get; --; put; clear;
    add "olist*"; push; .reparse
  }

  # could be elided, but for readability, no
  "ulist*text*dash*" {
    clear;
    get; add "\n \\item "; ++; get; --; put; clear;
    add "ulist*"; push; .reparse
  }

  # 
  "dlist*text*dash*" {
    clear;
    # already have \item start
    get; add " "; ++; get; --; 
    # also, put a \verbatim in [] because text is not escaped??
    add "\n \\item["; whilenot [\n:]; add "] ";
    put; clear;
    add "dlist*"; push; .reparse
  }

  # finish off the ordered list, also could finish it off with 
  # ulist*dash* ??
  "olist*text*bl*" {
    clear; 
    add "\n \\begin{enumerate}\n"; get;
    add "\n \\item "; ++; get; --; 
    add "\n \\end{enumerate}\n\n"; 
    put; clear;
    # insert the blankline attribute
    add "\n\n"; ++; put; --; clear;
    add "text*bl*"; push; push; .reparse
  }

  # finish off the unordered list
  "ulist*text*bl*" {
    clear; 
    add "\n \\begin{itemize}\n"; get;
    add "\n \\item "; ++; get; --; 
    add "\n \\end{itemize}\n\n"; 
    put; clear; 
    # insert the blankline attribute
    add "\n\n"; ++; put; --; clear;
    add "text*bl*"; push; push; .reparse
  }

  # finish off the description list
  "dlist*text*bl*" {
    # or check here if it is D/- or d/- for nextline style
    # or use \hfill \\ on each item which also works
    clear; 
    add "\n \\begin{description}[style=nextline]\n"; get;
    add "\n "; ++; get; --; 
    add "\n \\end{description}\n\n"; 
    put; clear; 
    # insert the blankline attribute
    add "\n\n"; ++; put; --; clear;
    add "text*bl*"; push; push; .reparse
  }

  # top level headings: all upper case on the line in the source document.
  # we don't need a "heading" token because we don't parse the document as a 
  # hierarchy, we just render things as we find them in the stream.
  "nl*uutext*nl*","nl*uuword*nl*",
  "bl*uutext*nl*","bl*uuword*nl*" {
    clear; 
    # Check that heading is at least 4 chars
    ++; get; --; clip; clip; clip; 
    "" { 
      add "nl*text*nl*"; push; push; push; .reparse
    }
    clear;
    # make headings capital case
    ++; get; 
    # capitalise even 1st word in latex curly quotes
    # add "<<heading\n"; print; replace "<<heading\n" "";
    B"``" { clop; clop; }
    cap; put; replace "''" "";
    # add open curly quotes if there before.
    !(==) {
      clear; add "``"; get;
    }
    put; --; clear; 
    get; # newline
    add '\\section{'; ++; get; --; add "}"; put; 
    clear;
    # transfer nl value
    ++; ++; get; --; put; clear; --;
    add "text*nl*"; push; push; .reparse
  }

  # simple reductions 
  "nl*text*nl*","nl*word*nl*", "bl*text*nl*","bl*word*nl*",
  "text*text*nl*","emline*text*nl*" {
    clear; get; ++; get; --; put; clear;
    ++; ++; get; --; put; --; clear; # transfer newline value
    add "text*nl*"; push; push; .reparse
  }

  pop;
  # -------------
  # 4 tokens

  # subheadings: an upper case line followed by a 4dots* token
  "nl*uutext*4dots*nl*","nl*uuword*4dots*nl*",
  "bl*uutext*4dots*nl*","bl*uuword*4dots*nl*" {
    clear; 

    # Check that sub heading text is at least 4 chars ?
    # yes but need to transfer 4dots and nl
    # ++; get; --; clip; clip; clip; 
    # "" { add "nl*text*nl*"; push; push; push; .reparse }

    clear;
    # make subheadings capital case
    ++; get; 
    # capitalise even 1st word in latex curly quotes
    B"``" { clop; clop; }
    cap; put; replace "''" "";
    # add open curly quotes if there before.
    !(==) {
      clear; add "``"; get;
    }
    put; --; clear; 
    get; # newline
    add '\\subsection{'; ++; get; --; add "}"; put; clear;
    # transfer nl value, really? just add "\\n" no? 
    ++; ++; ++; get; --; --; put; clear; --;
    add "text*nl*"; push; push; .reparse
  }

  pop;

  #------------------
  # 5 tokens

  # resolve dates
  "bl*number*month*number*nl*", 
  "nl*number*month*number*nl*" {
    clear; ++; get;
    # make sure 1st number is a valid day number
    "0","00","000","0000" { 
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    clip; clip;
    # >2 digits, not day number
    !"" {
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    clear; get;
    # single digits 3-9 are valid day numbers too
    !B"0".!B"1".!B"2".!"3".!"4".!"5".!"6".!"7".!"8".!"9".!"30".!"31" {
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    # is valid day number (01-31 or 1-31)
    clear; 
    # now check the year number
    ++; ++; get; --; --; 
    clip; clip; 
    # fewer than 3 digits, or a leading zero, is not allowed for a year
    B"0","" {
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    clip; clip;
    # >4 digits, not a year
    !"" {
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    # now assemble date value
    get; ++; get; ++; get; --; --; --; put;
    clear; add "date*"; push; .reparse
  }
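
  # A worked example of the date checks above. This is a sketch which
  # assumes that the number* attribute holds only digits, that "july"
  # is recognised as a month* token elsewhere in the script, and that
  # each date sits on a line of its own.
  #   24 july 2022    accepted, reduced to date*
  #   0 july 2022     rejected, the day is zero
  #   123 july 2022   rejected, the day has more than 2 digits
  #   45 july 2022    rejected, the day is greater than 31
  #   24 july 22      rejected, the year has fewer than 3 digits
  #   24 july 0999    rejected, the year has a leading zero
  #   24 july 20221   rejected, the year has more than 4 digits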

  pop;
  # -------------
  # 6 tokens

  # all images have been standardised to this format (all 
  # optional tokens have been added but may be empty).
  # for latex, image formats are jpeg, png or pdf. Others need
  # to be converted. Image names can't have dots in them???
  "[[*imfile*quote*width*float*]]*" {
    # need to translate widths floats etc to latex here because
    # they may revert to word* if out of context.
    clear; 
    # latex can't handle dots in file names so need 
    # to put braces around the name part eg: {name.is}.png
    # this is a hack. Or try the grffile package
    add "{"; ++; get;
    replace ".png" "}.png"; 
    replace ".jpg" "}.jpg"; 
    replace ".jpeg" "}.jpeg"; 
    replace ".pdf" "}.pdf"; 
    replace ".eps" "}.eps"; 
    put; clear;

    # get quote, if any, and remove """
    ++; get; clip; clip; clip; clop; clop; clop; 
    put; clear;
    # get width attribute
    ++; get; 
    # turn percentage into decimal
    E"%","" { 
      clip;  
      !"".!"100" { put; clear; add "0."; get; }
      "" { add "0.60"; } # default width 60%
      "100" { clear; add "1.00"; }
      add "\\textwidth";
    }
    put; clear; 
    # translate floats into LaTeX, default is centre
    ++; get; "" { add "c"; }
    # unknown positioning spec
    !"ccc".!"<<<".!">>>" { clear; add "c"; }
    "ccc" { clear; add "c"; } # centre
    "<<<" { clear; add "l"; } # left
    ">>>" { clear; add "r"; } # right
    put; clear;
    add "\n\\begin{wrapfigure}{"; 
    # position attribute
    get; add "}{"; 
    # width attribute
    --; get; add "}\n";
    add "\\includegraphics[width="; 
    # width attribute again
    get;
    --; --; add "]{"; 
    # image file name
    get; --; add "}"; 
    add "\n\\centering\n";
    # \caption* (from the caption package) omits the "Figure N:" label
    add "\\caption*{";
    # get the quote attribute
    ++; ++; get; --; --; add "}\n\\end{wrapfigure}";
    put; clear;
    add "text*"; push; .reparse
  }
 
  # example: 75% page width
  #  \includegraphics[width=0.75\textwidth]{image/test.jpg}
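  #
  # A fuller worked example of the image rule above (a sketch: the
  # attribute values are hypothetical). With imfile "image/test.jpg",
  # the quote """A test""", width "75%" and float ">>>", the resulting
  # text* attribute is
  #  \begin{wrapfigure}{r}{0.75\textwidth}
  #  \includegraphics[width=0.75\textwidth]{{image/test}.jpg}
  #  \centering
  #  \caption*{A test}
  #  \end{wrapfigure}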

  push; push; push; push; push; push;

  (eof) {
    # or use 'unstack' but does it adjust the tape pointer?
    pop; pop; pop; pop; pop; pop;

    # "nl*word*","nl*text*" have already been dealt with.

    # we would like "permissive" parsing, because this is just
    # a document format, not code, so will just check for starting
    # text token
    #"text*nl*","text*bl","text*" {

    B"text*",B"word*" {
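      # save the token parse stack in the next tape cell. The header
      # line built here is discarded by the clear below, so the stack
      # is only shown at the end of the document.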
      ++; put; clear; 
      add "%% Document parse-stack is: "; get; add "\n"; --;
      clear; 
      # make a valid LaTeX document
      add "
  %% <start-document>
  %% -------------------------------------------
  %%  latex generated by: mark.latex.pss 
  %%   from source file : 
  %%                  on: 
  %% -------------------------------------------

  \\documentclass[a4paper,12pt]{article}
  \\usepackage[margin=4pt,noheadfoot]{geometry}
  \\usepackage{color}                   %% to use colours, use 'xcolor' for more
  \\usepackage{multicol}                %% for multiple columns
  \\usepackage{keystroke}               %% for keyboard key images
  \\usepackage[toc]{multitoc}           %% for multi column table of contents
  \\usepackage{tocloft}                 %% to customize the table of contents
  \\setcounter{tocdepth}{2}             %% only display 2 levels in the contents
  \\setlength{\\cftbeforesecskip}{0cm}   %% make the toc more compact
  \\usepackage{listings}                %% for nice code listings
  \\usepackage{caption}                 %% 
  \\lstset{
    captionpos=t,
    language=bash,
    basicstyle=\\ttfamily,           %% fixed pitch font
    xleftmargin=0pt,                %% margin on the left outside the frames
    framexleftmargin=0pt,
    framexrightmargin=0pt,
    framexbottommargin=5pt,
    framextopmargin=5pt,
    breaklines=true,                %% break long code lines
    breakatwhitespace=false,        %% break long code lines anywhere
    breakindent=10pt,               %% reduce the indent from 20pt to 10
    postbreak=\\mbox{{\\color{blue}\\small$\\Rightarrow$\\space}},  %% mark with arrow
    showstringspaces=false,            %% don't show spaces within strings
    framerule=2pt,                     %% thickness of the frames
    frame=top,frame=bottom,
    rulecolor=\\color{lightgrey}, 
    % frame=l
    % define special comment delimiters '##(' and ')'
    % moredelim=[s][\\color{grey}\\itshape\\footnotesize\\ttfamily]{~(}{)},
  }   %% source code settings
  \\usepackage{graphicx}                %% to include images
  \\usepackage{fancybox}                %% boxes with rounded corners
  \\usepackage{wrapfig}                 %% flow text around tables, images
  \\usepackage{tabularx}                %% change width of tables
  \\usepackage[table]{xcolor}           %% alternate row colour tables
  \\usepackage{booktabs}                %% for heavier rules in tables
  \\usepackage[small,compact]{titlesec} %% sections more compact, less space
  \\usepackage{enumitem}                %% more compact and better lists
  \\setlist{noitemsep}                  %% reduce list item spacing
  \\usepackage{hyperref}     %% make urls into hyperlinks
  \\hypersetup{              %% add pdftex if only pdf output is required
     colorlinks=false,       %% set up the colours for the hyperlinks
     linkcolor=black,        %% internal document links black
     urlcolor=black,        %% url links black
     frenchlinks=true,
     bookmarks=true, pdfpagemode=UseOutlines}

  \\geometry{ left=1.0in,right=1.0in,top=1.0in,bottom=1.0in }
  %% define some colours to use
  \\definecolor{lightgrey}{gray}{0.70}
  \\definecolor{grey}{gray}{0.30}

  %% titlesec: create framed section headings
  %% \\titleformat{\\section}[frame]{\\normalfont}
  %%   {\\filleft \\footnotesize \\enspace Section \\thesection\\enspace\\enspace}
  %%   {3pt} {\\bfseries\\itshape\\filright}

  \\title{The Pep/nom parsing language and machine}
  \\author{m.j.bishop}
  \\date{\\today}
  \\setlength{\\parindent}{0pt}
  %% \\setlength{\\parskip}{1ex}

  %% label lists with stars
  \\renewcommand{\\labelitemi}{$\\star$}

  \\parindent=0pt
  \\parskip=6pt
  \\begin{document}

  ";

      get; 
      add "\n\\end{document} \n";
      add "\n\n %% Document parsed as text*!\n"; 
      # show parse-stack at end of doc as well
      ++; add " %% Document parse-stack is: "; get; add "\n"; --;
      print; quit;
    }

    stack; 
    add "Document parsed unusually!\n";
    add "Stack at line "; lines; add " char "; chars; add ": "; print; clear; 
    unstack; print; stack; add "\n"; print; clear;
    quit;

  }