OCX extension for in-page sections

It's sometimes useful to divide the content presented on a single page into semantically meaningful sections, and to be able to describe these sections individually. HTML 5 gives us tags for the semantically meaningful sections, e.g. <chapter>, <article>, <header>, <main>, <footer>, <section>, <nav> (for navigation) and the general purpose <div>. These can all be given an id attribute so that they can be identified in metadata using a fragment identifier. For example, <chapter id="c1"> can be identified with the fragment URI #c1.

In this extension the content to be put into a section is delimited by the codes ~~X~~ and ~~/X~~, where X is one of C, A, H, M, F, S, N, D, being the initial of the type of sectioning element into which to place the content. Any text after that letter is used as the identifier for the new element, so ~~C c1~~ becomes <chapter id="c1">.
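As a rough illustration of the marker syntax, the sketch below parses a start marker into an element name and a raw identifier. The names SECTION_TYPES, START_MARKER and parse_start_marker are hypothetical, and the letter-to-element table is an assumption read off the list of letters above rather than taken from the extension's code.

```python
import re

# Hypothetical lookup table, read off the list of letters in the text.
SECTION_TYPES = {
    'C': 'chapter', 'A': 'article', 'H': 'header', 'M': 'main',
    'F': 'footer', 'S': 'section', 'N': 'nav', 'D': 'div',
}
# ~~X~~ or ~~X some-id~~ : one letter, then optional identifier text.
START_MARKER = re.compile(r'~~([CAHMFSND])([^~]*)~~')

def parse_start_marker(text):
    """Return (element_name, identifier) for a start marker, else None."""
    match = START_MARKER.fullmatch(text.strip())
    if match is None:
        return None
    return SECTION_TYPES[match.group(1)], match.group(2).strip()

print(parse_start_marker('~~C c1~~'))    # ('chapter', 'c1')
print(parse_start_marker('~~S~~'))       # ('section', '')
print(parse_start_marker('plain text'))  # None
```

An empty identifier simply means the new element gets no id attribute.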

Python markdown treeprocessors

Taking the content between some textual delimiters and moving it into a new element entails rearranging the document tree that is generated. Simply changing ~~S~~[...]~~/S~~ into <section>[...]</section> before processing would mean that whatever is in the [...] would not get processed from markdown into HTML. The rearrangement can instead be done using a treeprocessor that runs after the markdown has been processed into HTML. At this stage of processing the document is in a Python ElementTree and the ~~X~~ markers from the markdown have been processed into HTML as <p>~~X~~</p>. So the aim is to change the element tree

    |--p: ~~C lesson1~~
    |--p: ...
    |--p: ~~H~~
    |--h1: ...
    |--p: ...
    |--p: ~~/H~~
    |--p: ~~S activity1~~
    |--p: ...
    |--p: ...
    |--p: ~~/S~~
    |--p: ~~/C~~

to

    |--chapter{'id': 'lesson1'}
        |--p: ...
        |--header
            |--h1: ...
            |--p: ...
        |--section{'id': 'activity1'}
            |--p: ...
            |--p: ...

Note that the sections introduced by the markdown tags may be nested. Remember too that the HTML may already contain sectioning elements before this processing starts, so you might find those p: ~~S~~ nodes on some branch coming off the root rather than directly under it.

Python markdown treeprocessors extend the class Treeprocessor and have a run method that takes the root of the element tree to be processed. To register and define a new treeprocessor in a named extension:

from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor

class OCXMetadata(Extension):
    def extendMarkdown(self, md):
        md.registerExtension(self)
        md.treeprocessors.register(OCXTreeProcessor(md), 'ocxsection', 29)

class OCXTreeProcessor(Treeprocessor):
    def run(self, root):
        # do stuff
        return

def makeExtension(**kwargs):
    # allows calling of extension by string which is not dot-noted
    return OCXMetadata(**kwargs)

The setup.py for the package:

from setuptools import setup
setup(
    name='ocxmd',
    version='0.1',
    py_modules=['ocxmd'],
    install_requires=['markdown>=3.0','PyYAML>=3.13'],
    author="Phil Barker",
    author_email="phil.barker@pjjk.co.uk",
    url="https://github.com/philbarker/ocxmd",
    description="A python markdown extension to insert semantic sectioning elements into the HTML created by MkDocs.",
    license="Apache2"
)

Parsing the element tree

The strategy is to rebuild the element tree by running through all the nodes, and recursively through the children of those nodes, replacing any p element that indicates the start of a sectioning element (e.g. ~~S~~) with a new section into which subsequent nodes are moved until an element indicating the end of the section is reached.

Two parameters are sent to the function that does this shuffling:

- node: the element in the original eTree whose children are processed;
- new_ancestors: a stack of ancestors in the eTree being created. The most recent ancestor, i.e. the top of the stack, is where nodes will go when they are processed.

The node being processed won't be in the eTree created if it marks the beginning or end of a section.

A list of the nodes in the eTree at the start is made; new nodes are added to the end of the eTree as they are processed, and the original nodes are removed. Fortunately the list made at the beginning of the iteration is a separate copy that does not change as the eTree is modified, so this removing and adding of nodes while processing does stop once all the nodes that were there at the start have been processed.
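The point about iterating over a copy can be checked in isolation. This is a minimal sketch using only the standard library's ElementTree; the tiny three-paragraph tree is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.Element('div')
for label in ('a', 'b', 'c'):
    ET.SubElement(root, 'p').text = label

visited = []
for child in list(root):      # list(root) copies the child references once
    visited.append(child.text)
    root.remove(child)        # mutate the live tree...
    root.append(child)        # ...by moving the node to the end

print(visited)                          # ['a', 'b', 'c']
print([child.text for child in root])   # ['a', 'b', 'c']
```

Each original child is visited exactly once, even though every step moves a node to the end of the tree being walked.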

Each node that marks the beginning of a section is deleted and a new section node is added to the node in the eTree that is top of the ancestor stack. This new section node then becomes the top of the ancestor stack.

Each node that marks the end of a section is deleted and the most recent ancestor is popped from the top of the ancestor stack.

Each node that does not mark the beginning or end of a section is added to the node that is top of the ancestor stack and passed to the function so that its children can be processed.

Once all the children of a node have been processed the most recent ancestor is popped off the top of the ancestors stack, so that on return from recursion we have the appropriate ancestor.

Other code deals with determining what type of sectioning element to create, and what the identifier should be.

The id attributes generated are going to be used as fragment identifiers in URIs, so there had better not be any characters in there that are not URI friendly. I took a very conservative line on this and only allowed characters from the set A-Za-z0-9!$-()+ (yes, I know this is bad for internationalization, but it'll do for now). I checked for any character not in this set during the tree processing of the start tag, and removed it.
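A minimal sketch of that clean-up, assuming the allowed set is read as the literal characters listed (letters, digits, and ! $ - ( ) +). The names BAD_FRAGMENT_CHARS and clean_fragment are hypothetical. One thing to watch when writing such a pattern: inside a character class a hyphen between two characters denotes a range, so $-( would admit %, & and ' while rejecting the hyphen itself. Placing the hyphen last keeps it literal.

```python
import re

# Hyphen placed last so it is a literal, not a range operator.
BAD_FRAGMENT_CHARS = re.compile(r'[^A-Za-z0-9!$()+-]')

def clean_fragment(raw):
    """Strip every character outside the conservative allowed set."""
    return BAD_FRAGMENT_CHARS.sub('', raw)

print(clean_fragment(' lesson 1'))    # lesson1
print(clean_fragment('activity-1'))   # activity-1
print(clean_fragment('50%&done'))     # 50done
```

Spaces are outside the set, so the space separating the tag letter from the identifier is removed along with everything else.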

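import re
import xml.etree.ElementTree as ET
from markdown.treeprocessors import Treeprocessor

class OCXTreeProcessor(Treeprocessor):
  START_RE = re.compile(r'~~([SCHFNDA])([^~]*)~~')
  END_RE = re.compile(r'~~/([SCHFNDA])~~')
  # really stingy in what's allowed; the hyphen goes last so that it is a
  # literal character rather than the range $-(
  BAD_URI_FRAG_CHARS = r'[^A-Za-z0-9!$()+-]'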

  def run(self, root):
      ancestors = [root]
      self.section(root, ancestors)
      self.md.tree_diagram = ''

  def section(self, node, new_ancestors):
      for child in list(node):
      # list(node) is a copy, so we run through the *original* nodes in the
      # eTree removing them when done and adding the new processed nodes to
      # the end
          if child.text :
              start_match = self.START_RE.match(child.text)
          else :
              start_match = None
          if start_match:
          # we have a node that indicates the start of a section
              # there is nothing to keep in such a node
              node.remove(child)
              # determine which sectioning elmt
              if 'S' == start_match.group(1):
                  newsect_type = 'section'
              elif 'C' == start_match.group(1):
                  newsect_type = 'chapter'
              elif 'A' == start_match.group(1):
                  newsect_type = 'article'
              elif 'H' == start_match.group(1):
                  newsect_type = 'header'
              elif 'F' == start_match.group(1):
                  newsect_type = 'footer'
              elif 'N' == start_match.group(1):
                  newsect_type = 'nav'
              elif 'D' == start_match.group(1):
                  newsect_type = 'div'
              else :
                  newsect_type = 'div'
              # find id attribute of new section, if any
              if start_match.group(2):
                  # make sure id has no bad characters in it
                  i = re.sub(self.BAD_URI_FRAG_CHARS,'',start_match.group(2))
                  attr = {"id": i}
              else:
                  attr = {}
              # create new section
              newsect = ET.SubElement(new_ancestors[-1], newsect_type, attr)
              # this new section will be the new parent until we get to end marker
              new_ancestors.append(newsect)
          elif (child.text and self.END_RE.match(child.text.upper())):
              # we have reached an end of section marker
              # nothing to keep in such a node
              node.remove(child)
              # revert to using previous new_ancestor as new parent
              new_ancestors.pop()
          else :
              node.remove(child) # remove from original place in tree
              new_ancestors[-1].append(child) # append to the latest new ancestor
              new_ancestors.append(child)
              self.section(child, new_ancestors) # recurse through nodes children
      new_ancestors.pop()
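To see the effect on the example tree from earlier, here is a compact standalone sketch of the same shuffling, written as a plain function over the standard library's ElementTree so that it runs without Python-Markdown. The function name nest_sections, the SECTION_TYPES table and the placeholder paragraph texts are hypothetical; the patterns follow the listing above, with the hyphen moved to the end of the character class so that it is treated as a literal.

```python
import re
import xml.etree.ElementTree as ET

START_RE = re.compile(r'~~([SCHFNDA])([^~]*)~~')
END_RE = re.compile(r'~~/([SCHFNDA])~~')
SECTION_TYPES = {'S': 'section', 'C': 'chapter', 'A': 'article',
                 'H': 'header', 'F': 'footer', 'N': 'nav', 'D': 'div'}

def nest_sections(node, new_ancestors):
    for child in list(node):          # iterate over a copy of the children
        text = child.text or ''
        start = START_RE.match(text)
        node.remove(child)            # every node leaves its original place
        if start:
            # start marker: open a new section under the current ancestor
            ident = re.sub(r'[^A-Za-z0-9!$()+-]', '', start.group(2))
            attr = {'id': ident} if ident else {}
            new_ancestors.append(ET.SubElement(
                new_ancestors[-1], SECTION_TYPES[start.group(1)], attr))
        elif END_RE.match(text):
            # end marker: go back to the previous ancestor
            new_ancestors.pop()
        else:
            # ordinary node: move it under the current ancestor and recurse
            new_ancestors[-1].append(child)
            new_ancestors.append(child)
            nest_sections(child, new_ancestors)
    new_ancestors.pop()

root = ET.Element('div')
for tag, text in [('p', '~~C lesson1~~'), ('p', 'intro'), ('p', '~~H~~'),
                  ('h1', 'Title'), ('p', '~~/H~~'), ('p', '~~S activity1~~'),
                  ('p', 'task'), ('p', '~~/S~~'), ('p', '~~/C~~')]:
    ET.SubElement(root, tag).text = text

nest_sections(root, [root])
print(ET.tostring(root, encoding='unicode'))
```

The printed tree has a single chapter with id lesson1 under the root, containing the intro paragraph, a header holding the h1, and a section with id activity1.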

Making sure the element tree is clean

As written above, the sectioning tag ~~X~~ is assumed to be the only content in a paragraph element. For this to be the case the tag needs to be entered on a line by itself, with a blank line or other block tag above and below it. This has great potential to cause errors, e.g.

 ~~S~~
 # Heading

is OK, but

 ~~S~~
 Just some text

will be missed.

We can fix this with a preprocessor that looks through all the lines, copying them to a new list of lines to be returned. When it gets to a line that is a start or end sectioning tag it (i) checks that the previous line is blank, adding a blank line if not, and (ii) sets a flag so that the next line is checked too, with a blank line added before it if it is not blank. At the same time we can also make sure that all the section tags are upper case.

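import re
from markdown.preprocessors import Preprocessor

# marker patterns, the same as those used when processing the tree
START_SECTION = r'~~([SCHFNDA])([^~]*)~~'
END_SECTION = r'~~/([SCHFNDA])~~'

class OCXSectionPreprocessor(Preprocessor):
    """Clean up the input, checking for start and end tags that don't have a
       blank line before or after, and for use of lower case letter in tags"""
    START_RE = re.compile(START_SECTION, re.IGNORECASE)
    END_RE = re.compile(END_SECTION, re.IGNORECASE)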

    def run(self, lines):
        after_tag = False # used to indicate we are looking at line after tag
        new_lines = []
        for line in lines:
            if after_tag : # we are at a line after a start or end tag
                if '' != line: # make sure it is followed by a blank
                    new_lines.append('')
                after_tag = False # reset flag
            match = self.START_RE.match(line)
            if match: # we have a start tag,
                # make sure it is upper case
                after_tag = True
                # change only the tag letter, not the same letter in the id
                line = line[:match.start(1)] + match.group(1).upper() + line[match.end(1):]
                if (new_lines and '' != new_lines[-1]):
                # make sure line before is blank
                    new_lines.append('')
            match = self.END_RE.match(line)
            if (match): # we have an end tag,
                # make sure it is upper case
                after_tag = True
                # change only the tag letter itself
                line = line[:match.start(1)] + match.group(1).upper() + line[match.end(1):]
                if (new_lines and '' != new_lines[-1]):
                # make sure line before is blank
                    new_lines.append('')

            new_lines.append(line)
        return new_lines
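The behaviour can be tried without Python-Markdown using a standalone sketch of the same idea. The function pad_section_tags and the combined pattern TAG_RE are hypothetical simplifications: a single pattern covers both start and end tags, and whitespace-only lines are treated as blank.

```python
import re

# One pattern for both ~~X id~~ and ~~/X~~, matched case-insensitively.
TAG_RE = re.compile(r'~~/?([SCHFNDA])[^~]*~~', re.IGNORECASE)

def pad_section_tags(lines):
    """Put a blank line around each section tag and upper-case its letter."""
    new_lines = []
    after_tag = False
    for line in lines:
        if after_tag and line.strip():
            new_lines.append('')      # blank line after the previous tag
        after_tag = False
        match = TAG_RE.match(line)
        if match:
            after_tag = True
            # upper-case only the tag letter, leaving the identifier alone
            line = (line[:match.start(1)] + match.group(1).upper()
                    + line[match.end(1):])
            if new_lines and new_lines[-1].strip():
                new_lines.append('')  # blank line before this tag
        new_lines.append(line)
    return new_lines

padded = pad_section_tags(['intro', '~~s lesson1~~', 'Just some text', '~~/s~~'])
print(padded)
# ['intro', '', '~~S lesson1~~', '', 'Just some text', '', '~~/S~~']
```

Note that the identifier lesson1 keeps its lower case s; only the letter naming the section type is changed.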

Making sure nothing bad gets in the URLs

class OCXSectionTreeProcessor(Treeprocessor):
    START_RE = re.compile(START_SECTION)
    END_RE = re.compile(END_SECTION)
    # really stingy in what's allowed; hyphen last so it is a literal, not a range
    BAD_URI_FRAG_CHARS = r'[^A-Za-z0-9!$()+-]'
...