unravel directories of notebooks with dask
this is another pass at using dask to load notebooks, with the ultimate intent to search them. in searching-notebooks, i first approached this task with some keen pandas skills that were not so kind in dask land. this document takes another pass, using clearer expressions to ravel a bunch of notebooks into a dask.dataframe.

taking care to load notebooks as dask.dataframes offers the power to apply direct queries, export to parquet, export to sqlite, export to duckdb, arrow..
import pandas, orjson, dask.bag, dask.dataframe; from pathlib import Path
from toolz.curried import *
XXX = __name__ == "__main__" and "__file__" not in locals()  # true only in interactive runs, gating the demos below
if you know the shape then define it
dask truly prefers explicit dtypes while pandas is more flexible. meta holds our shape information for the cells, outputs, and displays.
class meta:
    O = "object"
    ANY = None, O

    # explicit column names and dtypes for each level of the notebook document
    NB = [("cells", O), ("metadata", O), ("nbformat", int), ("nbformat_minor", int)]
    CELL = [
        ("cell_type", str), ("execution_count", int), ("id", str),
        ("metadata", O), ("outputs", O), ("source", str), ("cell_ct", int),]
    OUTPUT = [
        ("data", O), ("metadata", O), ("ename", str), ("evalue", str),
        ("text", str), ("execution_count", int), ("output_type", str), ("output_ct", int)]
    DISPLAY = [("type", str), ("value", str)]

    # empty template rows we copy then fill while unpacking the raw json
    new_nb = pandas.Series(index=list(map(first, NB)), dtype="O")
    new_cell = pandas.Series(index=list(map(first, CELL)), dtype="O")
    new_output = pandas.Series(index=list(map(first, OUTPUT)), dtype="O")
    new_display = pandas.Series(index=list(map(first, DISPLAY)), dtype="O")
def enumerate_list(x, key="cell_ct"):
    # tag each entry with its position so provenance survives explode()
    return [{key: i, **y} for i, y in enumerate(x)]

def get_series(data, key="text", new=meta.new_output):
    # join a multiline field, then fill a fresh template row; keys missing
    # from the template index are dropped by Series.update
    if key in data:
        data[key] = "".join(data[key])
    s = new.copy()
    return s.update(data) or s
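a quick check of the helper, off dask, shows the coercion at work. the stream payload below is made up for illustration; note that keys outside meta.OUTPUT, like the stream name, are silently dropped.

XXX and get_series({"output_type": "stream", "name": "stdout", "text": ["hello", " world"]})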
off to the races as we load some data from our local files.
WHERE = Path("oct")
the files we include start as, and remain, our index. in prior iterations there were a few set-index operations, but we don't want to be opening files to do this because that is costly. we'll store other metadata on the dataframe as we unpack the notebook shapes.
def get_files(WHERE=WHERE):
    return dask.bag.from_sequence(
        dict(file=str(x)) for x in WHERE.glob("*.ipynb")
    ).to_dataframe().set_index("file")
XXX and (files := get_files())
contents loads our files into a dataframe containing real cell contents. each row is a file.
def get_contents_from_files(files):
    return files.index.to_series().apply(
        compose_left(Path, Path.read_text, orjson.loads, partial(
            get_series, new=meta.new_nb)), meta=meta.NB)
XXX and (contents := get_contents_from_files(files))
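before exploding anything we can already ask cheap questions of the raw contents. a minimal sketch, counting how many cells each file carries:

XXX and contents.cells.apply(len, meta=("cells", int))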
the cells are built by exploding the rows of the contents.
def get_cells_from_contents(contents):
    cells = contents.cells
    cells = cells.apply(enumerate_list, meta=meta.ANY)
    return cells.explode().apply(get_series, key="source", new=meta.new_cell, meta=meta.CELL)
if XXX:
    cells = get_cells_from_contents(contents)
    meta_cells = cells["metadata cell_ct".split()]; cells.pop("metadata"); display(cells)
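with the cells flattened we get our first taste of searching. a sketch, where "import" is an arbitrary query string:

XXX and cells[cells.source.str.contains("import", na=False)].source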
next we deal with outputs that include display_data, stdout, and stderr.
def get_outputs_from_cells(cells):
    outputs = cells["outputs cell_ct".split()].dropna(subset="outputs")
    outputs.outputs = outputs.outputs.apply(enumerate_list, key="output_ct", meta=meta.ANY)
    outputs = outputs.explode("outputs").dropna(subset="outputs")
    return dask.dataframe.concat([
        outputs.pop("outputs").apply(get_series, key="text", new=meta.new_output, meta=meta.OUTPUT),
        outputs
    ], axis=1)
if XXX:
    outputs = get_outputs_from_cells(cells)
    meta_display = outputs["metadata cell_ct output_ct".split()]; outputs.pop("metadata"); display(outputs)
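the same frame answers questions about failures. a small sketch, filtering down to the error outputs:

XXX and outputs[outputs.output_type == "error"]["ename evalue cell_ct".split()]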
separating the different standard out/error displays from the rich display data. there is probably more work to do managing the different types of outputs from the different reprs.
def get_display_data_from_outputs(outputs):
    display_data = outputs["data execution_count output_type cell_ct output_ct".split()].dropna(subset="data")
    display_data["data"] = display_data["data"].apply(compose_left(dict.items, list), meta=meta.ANY)
    display_data = display_data.explode("data").dropna(subset="data")
    return dask.dataframe.concat([
        display_data.pop("data").apply(
            compose_left(
                partial(zip, meta.new_display.index), dict,
                partial(get_series, key=None, new=meta.new_display)
            ), meta=meta.DISPLAY), display_data], axis=1)
XXX and (display_data := get_display_data_from_outputs(outputs)).compute()
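this is the payoff for the searching intent: display_data lets us grep every rich repr at once. a sketch, stringifying first since mime values can be lists or dicts, with "dask" as an arbitrary search term:

XXX and display_data[display_data.value.astype(str).str.contains("dask", na=False)].compute()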
where to go from here
- extend to other files. the notebook format is a hypermedia document format.
- save to different formats. initially we think about parquet, sketched after the samples below, while in theory from this dataframe we could go further and imagine it being the seed for documentation.
XXX and display(*(x.sample(frac=.1).compute().sample(5) for x in (cells, outputs, display_data)))
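a hedged sketch of the parquet hand-off. the object columns are stringified so pyarrow can serialize them, and "display_data.parquet" is a hypothetical output path:

XXX and display_data.astype(str).to_parquet("display_data.parquet")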
from dataclasses import dataclass, field

@dataclass
class Contents:
    dir: Path = field(default_factory=Path.cwd)
    contents: dask.dataframe.DataFrame = None

    def __post_init__(self):
        # build each frame from the last, all still lazy
        self.contents = get_contents_from_files(get_files(self.dir))
        self.cells = get_cells_from_contents(self.contents)
        self.outputs = get_outputs_from_cells(self.cells)
        self.display_data = get_display_data_from_outputs(self.outputs)
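a usage sketch: point Contents at a directory and the frames hang off the instance, lazy until computed.

if XXX:
    docs = Contents(WHERE)
    display(docs.display_data)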