#*

  mark.latex.simple.pss

  This version does not currently recognise lists.

  This script attempts to parse a "markdown"-like syntax (this script
  header is an example of the document format) and to transform it into
  LaTeX source code. The script runs on the "pep" parsing machine, which
  is implemented at http://bumble.sf.net/books/pars/ .

  There are other ways of achieving this, which may seem better or more
  straightforward. The ANTLR parsing system can be used to write grammars
  that can parse markdown-like structures, or plain regular expressions
  can be used. But this is a good exercise for the parse machine, and
  more complex structures can be recognised than with plain regular
  expressions.

  Required tokens are, at least:
    >>* ---* codeline* codeblock* link* file* quoted* emline* uutext*
    uuword* nl* word* text* [[* (for images)

  We don't actually need heading* and subheading* tokens because they
  get transpiled (into LaTeX) as soon as they are seen in the document.
  Also, we don't need link/file/quoted/star.

STATUS

  Producing nice readable output with pdflatex. Would like to include
  date lists with descriptions, eg:
    24 july 2022
      did something
    26 jul 2022
      more stuff

  The script is now generating compilable tex output with:
  >> pep -f eg/mark.latex.pss pars-book.txt > test.tex
  >> pdflatex test.tex; pdflatex test.tex

  This produces a pdf file "test.pdf" and the formatting includes lists.

  A work in progress. The document parses as text, but not all
  structures are recognised.

  This file contains an interesting way of resolving or removing parse
  tokens when they are no longer significant: it uses a negative
  approach such as...

  ----
    # codeblocks with no caption (description)
    !"codeblock*".E"codeblock*".!B"emline*" {
      clear; get; add " "; ++; get; --; put; clear;
      add "text*"; push; .reparse
    }
  ,,,,

  Another interesting idea: use the accumulator as a state marker when
  parsing in codeblocks*. So if acc>0 then don't create uuword or other
  tokens.
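
  The "negative" rule in the code block above can be modelled outside
  the pep machine. The Python sketch below is only an illustration: the
  function name resolve_codeblock and the two parallel lists (standing
  in for the pep stack and tape) are assumptions, not part of pep. It
  shows the idea that a codeblock* preceded by anything other than
  emline* has no caption, so both tokens are folded into one text*.

```python
def resolve_codeblock(tokens, attrs):
    """Merge an uncaptioned codeblock* with the token before it.

    tokens and attrs are parallel lists standing in for the pep stack
    and tape (a simplification, not the real machine).
    Returns True when a reduction was made.
    """
    if len(tokens) < 2:
        return False
    before, last = tokens[-2], tokens[-1]
    # Negative test: any token except emline* in front of codeblock*.
    if last == "codeblock*" and before != "emline*":
        merged = attrs[-2] + " " + attrs[-1]
        tokens[-2:] = ["text*"]
        attrs[-2:] = [merged]
        return True
    return False
```

  The point of the negative test is that one rule covers every
  preceding token type, instead of one rule per type.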

TOKEN LIST

  * tokens currently used by this script
  >> --- >> 4dots codeblock codeline emline nl text uutext uuword word

  For lists need:
   - dash, bl, list
  eg:
    o/- -> olist,
    olist, text, nl, dash -> list
    list, text, bl ->

  But why have bl???? just used nl nl because nl/nl is reduced
  immediately.

  star* has been eliminated by parsing immediately, and >> and --- could
  also be eliminated.

  Probably need to add bl*=blankline dash* and ulist/olist/dlist
  for unordered lists, ordered lists and definition lists.

  Useful grammar analysis.

  * get a unique list of tokens used during parsing
  >> pep -f eg/mark.latex.simple.pss pars-book.txt | sed '/%% ---/q;' | sed 's/^[^:]*: *//;s/\* *$//' | tr '*' '\n' | sort | uniq

BUGS

  Strange pep/nom segmentation fault with multiline comment.

NOTES

  Add images, datelists, inline code?

  Convert this grammar to generate html/markdown etc.

  Need to tidy up description lists.

  Nested lists may just work out of the box! But a different list
  terminator (not blankline) would be handy. What about inline code?

  This script needs to parse *any* text successfully! Even text that
  is not in any particular format.

  May need to add "quoted" to handle quoted text, but not really
  necessary at the moment.

  Using o/- O/- u/- U/- d/- D/- to start ordered/unordered/description
  lists!

DOCUMENT FORMAT

  This is a document format I have used in many code and booklet
  files. It is a kind of markdown, with even less markup than markdown.
  It would be useful to try to transform it to markdown.

  A Description of the format
   - ALL UPPER CASE LINE
       Is a heading
       or UPPER CASE WITH "QUOTES"
   D/- ALL UPPER CASE LINE WITH FOUR DOTS ....
       is a subheading
   - asterisks before and after a word eg *this* emphasises the word.
   - >> a line starting with 2 > is a code line
   - a block surrounded by --- and ,,,, on their own lines is a
     code block
   - * a line starting with a star describes the code line or block
     which follows
   - urls and filenames should get a special format.
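
  The line-level rules listed above can be sketched as a small
  classifier. This is a rough Python illustration, not how the pep
  script does it: the function name classify_line and the label
  strings are made up for the example, and it assumes a star line is
  a star followed by a space. It ignores lists, emphasis, urls and
  filenames.

```python
def classify_line(line):
    """Return a rough label for one line of the document format."""
    s = line.strip()
    if not s:
        return "blank"
    if s.startswith(">>"):
        return "codeline"
    # Three or more dashes alone on a line open a code block.
    if len(s) >= 3 and set(s) == {"-"}:
        return "codeblock-start"
    if s == ",,,,":
        return "codeblock-end"
    # Assumption: a star line is "* " plus text, so *word* is not one.
    if s.startswith("* "):
        return "starline"
    letters = [c for c in s if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "subheading" if s.endswith("....") else "heading"
    return "text"
```

  Checking the fixed markers before the upper-case test keeps a code
  line such as ">> LS" from being read as a heading.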

  d/-
   d/- : makes a description list
   - term : end with a colon
   - machine
       or with a newline
   - end the whole list with a blank line

  o/- or O/- or 0/- makes an ordered
   - list
   - end with a blank line
   - u/- makes an unordered list.

  Key names are rendered as keys, eg [return] is rendered as a keyname.
  urls are turned into links. Filenames are set in a fixed-pitch font.

LATEX

  Need to escape all these chars in latex
  >> & % $ # _ { } ~ ^ \

  The first 7 add a backslash, eg \&
  The last 3 become \textasciitilde, \textasciicircum, and \textbackslash

  \begin{verbatim}...\end{verbatim}

  Look at books/format-book/booktocgi.sed for lots of latex tricks and
  tips.

HISTORY

  9 July 2022
    Ordered, unordered and description lists appear to work well.
    Made emphasis lines (starting with star) appear on a line by
    themselves. This means they can be used as a simple list.
    Also, introduced a bl* blankline token which will be used with
    lists (to terminate a list). Made the parser more permissive.

  8 July 2022
    Removed the star* token and parsed immediately in the word block.
    But didn't remove >>* because nl* tokens cause problems.

  7 july 2022
    A lot of progress. Had the idea to use the accumulator to count
    words on a line and so be able to tell which is the 1st word. This
    could make it possible to dispense with a number of tokens, and
    simplify the grammar. Output now seems acceptable apart from lists
    not working.

    Revisiting this to try to make a nice pep/nom book. Can't use a
    table in a figure environment. This version mark.latex.simple.pss
    is actually better than mark.latex.pss

  17 June 2021
    Adding emphasis words such as *this* . Need to do lists with - *
    and maybe dates, images with floating and resizing. Better code
    listings.

  14 June 2021
    Looked at how to escape chars. Done. We also escape chars that are
    in star lines and maybe codelines, but the \verbatim and
    \lstlisting words seem to take care of special characters in those
    blocks.
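
    The escaping rule noted under LATEX can be sketched in Python as
    below. This is an illustration under assumptions, not the code the
    pep script uses: the names LATEX_ESCAPES and escape_latex are
    invented here, and the trailing {} after the three \text...
    commands is an addition so that they do not swallow a following
    space.

```python
# The first 7 characters take a backslash; the last 3 use commands.
LATEX_ESCAPES = {
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "\\": r"\textbackslash{}",
}


def escape_latex(text):
    """Escape LaTeX special characters in a single pass."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)
```

    Working one character at a time in a single pass avoids escaping
    the backslashes and braces that the replacements themselves
    introduce.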

    I should try
    -----
      ">>*star*" --> "starline*"
      "starline*word*" --> "starline*"
      "starline*uuword*" --> "starline*"
    ,,,,
    This is in contrast to the technique of immediately reading to the
    end of the line with "until" or "whilenot".

  3 june
    More work. Proceeding well.

  2 june 2021
    Much progress, a basic latex doc has been formed. Now to refine and
    introduce new tokens. Have done sections, subsections, and
    codeblocks. Found an elegant way to eliminate insignificant tokens
    using a "negative" logic.

    Continuing to write this. The unstack;print;stack trick is very
    useful. Parsing and printing text with just upper case headings.
    This seems a promising incremental way to parse.

  1 June 2021
    Try the unstack trick at the parse label to debug token reductions.
    I fixed the "stack" command in object/machine.interp.c so that it
    updates the tape pointer.

  23 April 2021
    Continuing to work on the script and look for ways to debug it
    simply. I have come to the idea that ".reparse" should also print
    the stack if a particular switch is set. This would allow the stack
    reductions to be easily followed.

  21 April 2021
    Script begun. This is a second attempt.

vvvvvvv