accessible semantics for python code¤
pre-formatted html representations of code ignore the semantics of the source language. sighted readers consume preformatted text splattered with some color, while screen reader users, in browse or focus mode, hear a stream of unstructured text, despite the fact that programming languages have richer semantics.
in this document, we consider a semantically meaningful representation of python code (this is not a general approach) that aligns the structure of the accessibility object model closer to the semantics of the python programming language. some flexible changes we'll propose are:
- function and class definition blocks are landmarks
- function and class definitions are headings
- top-level expressions/comments are grouped
together, these changes give us an accessibility object model for code that provides landmarks, headings, and other aria semantics.
import ast, html, inspect, io, itertools, tokenize
import pandas
import pygments, pygments.formatters, pygments.lexers
from functools import partial
from typing import *
from IPython.display import *
synthesizing multiple representations¤
pygments is the primary way we display syntax highlighted code; it is a language agnostic tool. to achieve our goals of a semantically meaningful code structure, we need more detail than pygments' lexical analysis provides. our semantic solution merges three streams of tokenized python source (each is sampled below):
1. the pygments tokens provide html classes with reusable style sheets
2. the ast module provides nesting information for expressions
3. tokenize is used to discover comments in the python code, because the ast module ignores them
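for illustration, here is a hedged sketch of what each stream yields for a tiny two line source; the exact token values depend on your pygments and python versions:
import ast, io, tokenize
import pygments, pygments.lexers
src = "def f():\n    return 1  # why\n"
lexer = pygments.lexers.get_lexer_by_name("python")
# 1. pygments: (token type, text) pairs that map to html classes
print(list(pygments.lex(src, lexer))[:4])
# 2. ast: nested nodes carrying line numbers
print(ast.dump(ast.parse(src)))
# 3. tokenize: the only stream that surfaces the comment
print([t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
       if t.type == tokenize.COMMENT])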
capturing structure from the ast and tokenize modules¤
we use the ast module to capture the block nature of the python source code. we synthesize the block line numbers with the tokenize tokens to capture comments, which the ast module ignores.
the line number and tag attributes are yielded for the regions of interest (e.g. expression and comment blocks), sorted by line number.
def get_sorted_regions(source: str) -> Iterator[tuple[int, dict]]:
    nodes: ast.AST = ast.parse(source)
    nested = []
    # merge the ast and tokenize streams, ordering start markers before end markers
    for i, s in sorted(
        itertools.chain(get_limits_from_ast(nodes), get_comments_from_tokenize(source)),
        key=lambda x: (x[0], not bool(x[1])),
    ):
        if s is not None:
            if isinstance(s, str):
                # a function or class name opens a labelled landmark region
                nested.append(s)
                dots = ".".join(filter(bool, nested))
                yield i, dict(id=dots, role="region", **{"aria-label": dots})
            elif isinstance(s, ast.AST):
                # a top level expression or statement opens a group
                yield i, dict(id=F"{type(s).__name__}-L{s.lineno}", role="group", **{
                    "aria-label": F"{type(s).__name__} Line {s.lineno}"})
                nested.append(None)
            elif isinstance(s, list):
                # a run of comment tokens opens a labelled block
                yield i, dict(**{"aria-label": F"Comment Line {s[0].start[0]}"})
                nested.append(None)
        else:
            # an end marker closes the most recently opened region
            nested and nested.pop()
            yield i, None
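as a quick sanity check (illustrative only; it assumes the two helper streams defined below have been run), a two line function should yield a start marker and an end marker:
for line, attrs in get_sorted_regions("def f():\n    return 1\n"):
    print(line, attrs)
# expect something like:
# 0 {'id': 'f', 'role': 'region', 'aria-label': 'f'}
# 1 None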
capturing the nesting structure of the semantics with the ast module¤
the ast module allows us to capture the line numbers encapsulating expressions, functions, and classes. to add structure to the semantics:
1. the top level expressions and statements in the module are grouped
2. all functions and classes are grouped
def skip(n, x=None): yield from (y for i, y in enumerate(x) if i > n)

def get_last_line(nodes):
    # the largest end_lineno among the walked descendants
    last = -1
    for x in skip(1, ast.walk(nodes)):
        last = max(getattr(x, "end_lineno", -1), last)
    return last

def get_end(node):
    # an end marker for the region opened at node.lineno
    if node.lineno == node.end_lineno:
        return node.lineno-1, None
    if node.end_lineno == get_last_line(node):
        return node.end_lineno-1, None
    return node.end_lineno-2, None

def get_limits_from_ast(nodes: ast.AST) -> Iterator[tuple[int, dict]]:
    # 1. group every top level expression and statement
    for node in nodes.body:
        if isinstance(node, (ast.ClassDef, ast.AsyncFunctionDef, ast.FunctionDef)):
            yield node.lineno-1, node.name
        else:
            yield node.lineno-1, node
        yield get_end(node)
    # 2. group every nested function and class
    for node in itertools.chain(*map(partial(skip, 1), map(ast.walk, nodes.body))):
        if isinstance(node, (ast.ClassDef, ast.AsyncFunctionDef, ast.FunctionDef)):
            yield node.lineno-1, node.name
            yield get_end(node)
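a brief illustration (hedged; the exact shapes may vary by python version): a module with a nested function produces start markers for both definitions, each followed by its end marker.
tree = ast.parse("def outer():\n    def inner():\n        pass\n")
print(list(get_limits_from_ast(tree)))
# expect something like [(0, 'outer'), (2, None), (1, 'inner'), (2, None)]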
capturing comments with tokenize¤
comments are effectively paragraphs in code. they should be more readable and specifically demarcated as non-code.
def get_comments_from_tokenize(source: str) -> Iterator[tuple[int, dict]]:
    last = []
    for token in tokenize.tokenize(io.BytesIO(source.encode()).readline):
        if token.type == tokenize.COMMENT:
            # a gap between comment lines closes the current comment block
            if last and (last[-1].start[0] + 1) < token.start[0]:
                yield last[0].start[0]-1, list(last)
                yield last[-1].end[0]-1, None
                last.clear()
            # only collect whole line comments
            if token.line.lstrip().startswith("#"):
                last.append(token)
    if last:
        # flush a trailing comment block at the end of the stream
        yield last[0].start[0]-1, list(last)
        yield last[-1].end[0]-1, None
        last.clear()
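for instance (illustrative; run after defining the function above), a two line comment block is reported with a start and an end marker:
demo = "# a note\n# that continues\n\nx = 1\n"
print([(i, s if s is None else len(s)) for i, s in get_comments_from_tokenize(demo)])
# expect something like [(0, 2), (1, None)]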
we extract docstrings so that we can use them to describe landmarks or links.
custom pygments formatter¤
pygments drives the translation of source code to html. our custom renderer merges the pygments, ast, and tokenize streams together. the outcome is a semantically meaningful representation of the source code.
class Html(pygments.formatters.HtmlFormatter):
    def _format_lines(self, tokens):
        for j, (i, line) in enumerate(super()._format_lines(tokens), 1):
            # wrap this line with any regions that start or end here
            while self.regions and j > self.regions[0][0]:
                m, n = self.regions.pop(0)
                if n is None:
                    line += "</span>"  # close the open region
                else:
                    attrs = " ".join(F'{k}="{v}"' for k, v in n.items())
                    line = F'<span {attrs}>' + line
            yield i, line

    def format(self, tokensource, outfile):
        tokensource = list(tokensource)
        self.regions = list(get_sorted_regions("".join(y for _, y in tokensource)))
        return super().format(tokensource, outfile)
more semantics by post-processing the html¤
out of convenience, we add a post-processing step to modify the highlighted html. these changes make:
- function and class definitions headings
- function and class declarations links
the headings and links will now be exposed redundantly in screen reader navigation; both heading and link shortcuts will find them.
def post_highlight(html):
    soup = __import__("bs4").BeautifulSoup(html, features="html.parser")
    # a name token immediately after a keyword token is a function or class definition
    for name in soup.select(".k+.nf,.k+.nc,.k+.fm"):
        id = name.parent.attrs.get("id") or ""
        if id:
            # wrap the name in a self referencing link that acts as a heading
            a = soup.new_tag("a")
            a.attrs.update(href="#"+id, role="heading", **{"aria-level": id.count(".")+2})
            a.extend(list(name.children))  # listed first so the move doesn't skip children
            name.clear()
            name.append(a)
    return soup.prettify()
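a quick, hedged example of the whole chain so far (assuming the cells above have run): a one line function should come back wrapped in a region whose name is a level 2 heading link.
snippet = pygments.highlight("def f(): ...", pygments.lexers.get_lexer_by_name("python"), Html())
print(post_highlight(snippet))
# the name f should appear as <a href="#f" role="heading" aria-level="2">f</a>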
the modified highlighter in action¤
our sample source is arbitrarily taken to be the source of the tokenize module. if this notebook is live then you can change the source. we include accessible pygments themes extended from eric bailey's accessible theme palettes.
sample = inspect.getsource(tokenize)
formatter = Html(style="github-light-high-contrast")
page = post_highlight(pygments.highlight(sample, pygments.lexers.get_lexer_by_name("python"), formatter))
HTML(F"""<details><summary>expand this to see the highlighted code </summary>{page}</details>""")
a page/document of code¤
all of this can be combined into a complete page that treats code as an accessible document. we can even include a table of contents for navigation.
def get_toc(body):
    import mistune, textwrap
    soup = __import__("bs4").BeautifulSoup(body, features="html.parser")
    toc = ""
    # build a nested markdown list from the injected headings
    for x in soup.select("[role=heading][aria-level]"):
        toc += " " * int(x.attrs.get("aria-level"))
        toc += F"* [`{x.string}`](#{x.string})\n"
    return mistune.markdown(textwrap.dedent(toc))
capture the css styles.
style = formatter.get_style_defs(); HTML(F"<style>{style}</style>")
all = F"""<head><meta content="dark light" name="color-scheme"/>
<style>{Html(style="github-dark-high-contrast").get_style_defs()}</style>
</head><body><main><header>
<details><summary>Table of contents</summary><nav>{get_toc(page)}</nav></details></header>{page}</main>"""
HTML(F'<iframe height="600" srcdoc="{html.escape(all)}" width="100%"></iframe>')
flexible configuration¤
the implementation of semantic structure for html should be flexible. some settings that make sense for pure code documents would not apply to notebooks. for example, cell landmarks would be preferred over expression-level landmarks. a rough configuration of the code semantics would abide by the schema below.
from pydantic import BaseModel, Field

class Settings(BaseModel):
    all_expressions_are_grouped: bool = Field(
        True, description="each top level expression or statement is grouped"
    )
    functions_and_classes_are_headings: bool = Field(
        True, description="functions and classes in code blocks are treated as headings"
    )
    functions_and_classes_are_landmarks: bool = Field(
        True, description="functions and classes in code blocks are treated as landmarks"
    )
    functions_and_classes_are_labelled: bool = Field(
        True, description="functions and classes in code blocks are given aria labels"
    )

JSON(Settings.schema(), root="Semantic code settings")
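as a hypothetical usage, a notebook oriented configuration might turn off expression level groups; this is an illustration, nothing below is wired into the formatter yet.
notebook_settings = Settings(all_expressions_are_grouped=False)  # prefer cell landmarks instead
print(notebook_settings.dict())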
conclusions¤
- all of this is hand wavy bullshit because i'm the only disabled person to have tested it, and i'm not an experienced screen reader user.
- some structure is better, too much structure would be bad.
- reading code is a more practical literacy than writing code. assistive technology users should have an easier time reading code.
- for long code documents, line numbers are challenging to navigate with a screen reader. better semantics can improve code navigation.
- the visual structure of the accessibility object model is more navigable.
- audibly, this is better for me when testing notebooks with a screen reader. without added structure, code cells with more than 5 lines of code are garbled and unstructured; more structure and interactive elements can improve the comprehension of code.