# unravel directories of notebooks with `dask`

this is another pass at using `dask` to load notebooks, with the ultimate intent to search them.
in [searching-notebooks](oct/2022-10-05-dask-search.ipynb), i first approached this task with some
keen `pandas` skills that were not so kind in `dask` land.
this document takes another pass, using clearer expressions to ravel a bunch
of notebooks into a `dask.dataframe`.

taking care to load notebooks as `dask.dataframe`s offers the power to apply
direct queries and to export to parquet, sqlite, duckdb, arrow, and so on.

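    import json, jsonpointer, orjson, pandas
    import dask.bag, dask.dataframe  # dask.bag is needed by get_files
    from pathlib import Path
    from toolz.curried import *
    # XXX is true only in an interactive session such as a notebook, false when run as a script or imported
    XXX = __name__ == "__main__" and "__file__" not in locals()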

## if you know the shape then define it

`dask` builds its work lazily, so it truly prefers explicit dtypes up front, while `pandas` is more flexible and can infer them from the data.
`meta` holds our shape information for the cells, outputs, and displays.

    class meta: 
        """column names and dtypes for each level of the notebook format."""
        O = "object"
        ANY = None, O  # (name, dtype) meta for an unnamed object series
        NB = [("cells", O), ("metadata", O), ("nbformat", int), ("nbformat_minor", int)]
        CELL = [
            ("cell_type", str), ("execution_count", int), ("id", str),
            ("metadata", O), ("outputs", O), ("source", str), ("cell_ct", int),]
        OUTPUT = [
            ("data", O), ("metadata", O), ("ename", str), ("evalue", str),
            ("text", str), ("execution_count", int), ("output_type", str), ("output_ct", int)]
        DISPLAY = [("type", str), ("value", str)]
        # empty template series, one per shape, that get_series copies and fills
        new_nb = pandas.Series(index=map(first, NB), dtype="O")
        new_cell = pandas.Series(index=map(first, CELL), dtype="O")
        new_output = pandas.Series(index=map(first, OUTPUT), dtype="O")
        new_display = pandas.Series(index=map(first, DISPLAY), dtype="O")

    def enumerate_list(x, key="cell_ct"):
        """add the position of each dict in the list under `key`."""
        return [{key: i, **y} for i, y in enumerate(x)]

    def get_series(data, key="text", new=meta.new_output):
        """copy the template `new` and fill it from `data`; keys outside the template are dropped."""
        if key in data:
            data[key] = "".join(data[key])  # the notebook format may store multiline strings as lists of lines
        s = new.copy()
        return s.update(data) or s  # Series.update returns None, so this returns s
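
a small sketch of what `get_series` does to one cell, assuming only `pandas`. the `fill_template` name and the sample cell are made up for illustration, and unlike `get_series` this version copies the input dict instead of changing it in place.

```python
import pandas

# an empty template with a fixed index, like meta.new_cell but shorter
template = pandas.Series(index=["cell_type", "source", "cell_ct"], dtype="O")


def fill_template(template, data, key="source"):
    """hypothetical stand-in for get_series: copy the template and fill it from a dict."""
    data = dict(data)  # avoid mutating the caller's dict
    if key in data:
        data[key] = "".join(data[key])  # join a list of lines into one string
    s = template.copy()
    s.update(data)  # aligns on the index, so unknown keys are ignored
    return s


row = fill_template(
    template,
    {"cell_type": "code", "source": ["a = 1\n", "a"], "extra": 1},
)
```

every row ends up with the same index no matter which keys a cell carries, which is what lets the declared `meta` stay true for every notebook.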

off to the races as we load some data from our local files.

    WHERE = Path("oct")

the file paths we include start as our index and remain our index. in prior iterations there were a few `set_index` operations, but we don't want to open files just to build an index because that is costly. we'll store other metadata on the dataframe as we unpack the notebook shapes.

    def get_files(WHERE=WHERE):
        return dask.bag.from_sequence(
            dict(file=str(x)) for x in WHERE.glob("*.ipynb")
        ).to_dataframe().set_index("file")
    XXX and (files := get_files())

Dask DataFrame Structure:

| npartitions=20 |
|---|
| oct/2022-10-05-dask-search.ipynb |
| oct/2022-10-06-github-open-source-stats.ipynb |
| ... |
| oct/2022-11-21-1.ipynb |
| oct/test_nbconvert_html5.ipynb |

Dask Name: sort_index, 8 graph layers
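
a sketch of the cheap part of this step using only the standard library: the index comes from file names alone, so no notebook is opened. the `list_notebooks` helper and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_notebooks(where):
    """return notebook paths as strings, sorted, without opening any file."""
    return sorted(str(p) for p in Path(where).glob("*.ipynb"))


# a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.ipynb", "a.ipynb", "notes.md"):
        Path(tmp, name).write_text("{}")
    names = [Path(p).name for p in list_notebooks(tmp)]
```

only the two `.ipynb` files are picked up, and sorting them mirrors the sorted index that `set_index` produces.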

`contents` loads our files into a dataframe containing the real cell contents.
each row is a file.

    def get_contents_from_files(files):
        # read each path in the index, parse the json, and fit it to the notebook template
        return files.index.to_series().apply(
            compose_left(Path, Path.read_text, orjson.loads, partial(
                get_series, new=meta.new_nb)), meta=meta.NB)
    XXX and (contents := get_contents_from_files(files))

Dask DataFrame Structure:

| npartitions=20 | cells | metadata | nbformat | nbformat_minor |
|---|---|---|---|---|
| oct/2022-10-05-dask-search.ipynb | object | object | int64 | int64 |
| oct/2022-10-06-github-open-source-stats.ipynb | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| oct/2022-11-21-1.ipynb | ... | ... | ... | ... |
| oct/test_nbconvert_html5.ipynb | ... | ... | ... | ... |

Dask Name: apply, 11 graph layers
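
a sketch of what a single row holds, using the standard library `json` in place of `orjson`. the notebook below is a made-up minimal document; its four top-level keys are the ones listed in `meta.NB`.

```python
import json
import tempfile
from pathlib import Path

# a made-up, minimal notebook document
notebook = {
    "cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# hello"]}],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp, "example.ipynb")
    path.write_text(json.dumps(notebook))
    # the same read-then-parse steps as the composed function, spelled out
    loaded = json.loads(path.read_text())

top_level = sorted(loaded)
```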

the cells are built by exploding the `cells` column of the contents, so each row becomes one cell.

    def get_cells_from_contents(contents):
        cells = contents.cells
        cells = cells.apply(enumerate_list, meta=meta.ANY)
        return cells.explode().apply(get_series, key="source", new=meta.new_cell, meta=meta.CELL)
    if XXX:
        cells = get_cells_from_contents(contents)    
        meta_cells = cells["metadata cell_ct".split()]; cells.pop("metadata"); display(cells)

Dask DataFrame Structure:

| npartitions=20 | cell_type | execution_count | id | outputs | source | cell_ct |
|---|---|---|---|---|---|---|
| oct/2022-10-05-dask-search.ipynb | object | int64 | object | object | object | int64 |
| oct/2022-10-06-github-open-source-stats.ipynb | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| oct/2022-11-21-1.ipynb | ... | ... | ... | ... | ... | ... |
| oct/test_nbconvert_html5.ipynb | ... | ... | ... | ... | ... | ... |

Dask Name: drop_by_shallow_copy, 16 graph layers
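
a plain python sketch of the enumerate-then-explode idea, with no dataframe involved. the two notebooks are made up; `enumerate_list` is copied from above so the block runs on its own.

```python
def enumerate_list(x, key="cell_ct"):
    """add the position of each dict in the list under `key`."""
    return [{key: i, **y} for i, y in enumerate(x)]


# made-up notebooks: file name -> list of cells
notebooks = {
    "a.ipynb": [
        {"cell_type": "markdown", "source": "# title"},
        {"cell_type": "code", "source": "1 + 1"},
    ],
    "b.ipynb": [{"cell_type": "code", "source": "print('hi')"}],
}

# "explode": one row per cell, with the file repeated for each of its cells
rows = [
    (file, cell)
    for file, cells in notebooks.items()
    for cell in enumerate_list(cells)
]
index = [(file, cell["cell_ct"]) for file, cell in rows]
```

the file stays the index and `cell_ct` records the position, so a cell can be traced back to its notebook without a second index.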

now we deal with outputs that include display_data, stdout, and stderr.

    def get_outputs_from_cells(cells):
        outputs = cells["outputs cell_ct".split()].dropna(subset="outputs")
        outputs.outputs = outputs.outputs.apply(enumerate_list, key="output_ct", meta=meta.ANY)
        outputs = outputs.explode("outputs").dropna(subset="outputs")
        return dask.dataframe.concat([
            outputs.pop("outputs").apply(get_series, key="text", new=meta.new_output, meta=meta.OUTPUT),
            outputs
        ], axis=1)
    if XXX:
        outputs = get_outputs_from_cells(cells)
        meta_display = outputs["metadata cell_ct output_ct".split()]; outputs.pop("metadata"); display(outputs)

Dask DataFrame Structure:

| npartitions=20 | data | ename | evalue | text | execution_count | output_type | output_ct | cell_ct |
|---|---|---|---|---|---|---|---|---|
| oct/2022-10-05-dask-search.ipynb | object | object | object | object | int64 | object | int64 | int64 |
| oct/2022-10-06-github-open-source-stats.ipynb | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| oct/2022-11-21-1.ipynb | ... | ... | ... | ... | ... | ... | ... | ... |
| oct/test_nbconvert_html5.ipynb | ... | ... | ... | ... | ... | ... | ... | ... |

Dask Name: drop_by_shallow_copy, 2 graph layers

next we separate the standard out/error displays from the rich display data. there is probably more to do for managing the different types of outputs from the different reprs.

    def get_display_data_from_outputs(outputs):
        display_data = outputs["data execution_count output_type cell_ct output_ct".split()].dropna(subset="data")
        display_data["data"] = display_data["data"].apply(compose_left(dict.items, list), meta=meta.ANY)
        display_data = display_data.explode("data").dropna(subset="data")
        return dask.dataframe.concat([
            display_data.pop("data").apply(
                compose_left(
                    partial(zip, meta.new_display.index), dict, 
                    partial(get_series, key=None, new=meta.new_display)
                ), meta=meta.DISPLAY), display_data], axis=1)
    XXX and (display_data := get_display_data_from_outputs(outputs)).compute()

| file | type | value | execution_count | output_type | cell_ct | output_ct |
|---|---|---|---|---|---|---|
| oct/2022-10-05-dask-search.ipynb | text/plain | `["the expected cell keys are ['attachments', '...` | 3.0 | execute_result | 7 | 0 |
| oct/2022-10-05-dask-search.ipynb | text/html | `[<div><strong>Dask DataFrame Structure:</stron...` | NaN | display_data | 15 | 1 |
| oct/2022-10-05-dask-search.ipynb | text/plain | `[Dask DataFrame Structure:\n, at...` | NaN | display_data | 15 | 1 |
| oct/2022-10-05-dask-search.ipynb | text/html | `[<div>\n, <style scoped>\n, .dataframe tbo...` | NaN | display_data | 17 | 0 |
| oct/2022-10-05-dask-search.ipynb | text/plain | `[path .....` | NaN | display_data | 17 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| oct/2022-11-21-1.ipynb | application/vnd.jupyter.widget-view+json | `{'model_id': 'cd4f96b3078c4adf851d419b9c8e7885...` | NaN | display_data | 4 | 1 |
| oct/2022-11-21-1.ipynb | text/plain | `[HTML(value='<pre><code>importlib._bootstrap_e...` | NaN | display_data | 4 | 1 |
| oct/test_nbconvert_html5.ipynb | text/plain | `[([], 1)]` | 4.0 | execute_result | 5 | 0 |
| oct/test_nbconvert_html5.ipynb | text/plain | `[([], 16)]` | 5.0 | execute_result | 7 | 0 |
| oct/test_nbconvert_html5.ipynb | text/plain | `[([], 5)]` | 6.0 | execute_result | 9 | 0 |

173 rows × 6 columns
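
a plain python sketch of how one output's `data` bundle turns into `type`/`value` rows like the ones above. the output dict is made up; the `zip` with the column names mirrors the `partial(zip, meta.new_display.index)` step.

```python
# a made-up display_data output with two representations of the same object
output = {
    "output_type": "display_data",
    "data": {"text/plain": "<Figure>", "text/html": "<b>hi</b>"},
}

DISPLAY = ("type", "value")  # the column names from meta.DISPLAY

# dict.items gives (mimetype, value) pairs; zipping with the names labels each pair
displays = [dict(zip(DISPLAY, item)) for item in output["data"].items()]
```

each mimetype becomes its own row, which is why one output can appear several times in the display table.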

## where to go from here

* extend to other files. the notebook format is a hypermedia document format.
* save to different formats. initially we think about parquet, while in theory
we could go further from this dataframe and imagine it being the seed for documentation.
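
as one possible direction, here is a sketch of the sqlite export mentioned at the top, using only the standard library `sqlite3`. the rows, the `cells` table name, and the search term are made up for illustration; in practice the rows would come from the computed cells dataframe.

```python
import sqlite3

# made-up rows shaped like (file, cell_ct, cell_type, source)
rows = [
    ("a.ipynb", 0, "markdown", "# searching notebooks"),
    ("a.ipynb", 1, "code", "import dask.dataframe"),
    ("b.ipynb", 0, "code", "print('hello')"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cells (file TEXT, cell_ct INTEGER, cell_type TEXT, source TEXT)")
con.executemany("INSERT INTO cells VALUES (?, ?, ?, ?)", rows)

# a direct query: which cells mention dask?
hits = con.execute(
    "SELECT file, cell_ct FROM cells WHERE source LIKE ? ORDER BY file, cell_ct",
    ("%dask%",),
).fetchall()
con.close()
```

once the cells live in a table, searching is a single query, which is the end goal of raveling the notebooks in the first place.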

    XXX and display(*(x.sample(frac=.1).compute().sample(5) for x in (cells, outputs, display_data)))

| file | cell_type | execution_count | id | outputs | source | cell_ct |
|---|---|---|---|---|---|---|
| oct/2022-10-29-metadata-formatter.ipynb | markdown | NaN | 50d9dcf9-18f4-4eb7-9a9a-cc3e14b3b2e2 | NaN | `#### dataframes` | 16 |
| oct/2022-10-06-github-open-source-stats.ipynb | code | 197.0 | af8301f0-8a28-4350-87fa-a728867ccf67 | `[]` | `%reload_ext doit` | 6 |
| oct/2022-10-21-markdown-future.ipynb | code | 5.0 | 21fa2ad9-23f9-4868-9cb7-3bb87b69c542 | `[{'data': {'text/markdown': ['# tangle (code) ...` | `# tangle (code) and weave (display)\n\n kni...` | 7 |
| oct/2022-10-19-mobius-text.ipynb | code | NaN | 55228766-71c2-48f0-819f-afab637a4867 | `[]` |  | 18 |
| oct/2022-10-27-axe-core-playwright-python.ipynb | code | 17.0 | 6e7cb333-6cc3-4c73-b256-382bcb3e9c2c | `[]` | `async def injectAxe(page): \n await page.ev...` | 9 |

| file | data | ename | evalue | text | execution_count | output_type | output_ct | cell_ct |
|---|---|---|---|---|---|---|---|---|
| oct/2022-10-21-markdown-future.ipynb | `{'text/markdown': ['i'm not saying importing m...` | NaN | NaN | NaN | NaN | display_data | 0 | 17 |
| oct/2022-11-17--Copy1.ipynb | `{'application/vnd.jupyter.widget-view+json': {...` | NaN | NaN | NaN | NaN | display_data | 0 | 6 |
| oct/2022-10-21-markdown-future.ipynb | `` {'text/markdown': ['```mermaid ', 'flowchart L... `` | NaN | NaN | NaN | NaN | display_data | 1 | 16 |
| oct/2022-10-29-.ipynb | NaN | AttributeError | `'dict' object has no attribute 'breaks'` | NaN | NaN | error | 1 | 12 |
| oct/2022-10-21-pidgy-displays.ipynb | `{'application/vnd.jupyter.widget-view+json': {...` | NaN | NaN | NaN | NaN | display_data | 0 | 17 |

| file | type | value | execution_count | output_type | cell_ct | output_ct |
|---|---|---|---|---|---|---|
| oct/2022-10-05-dask-search.ipynb | text/html | `[<div>\n, <style scoped>\n, .dataframe tbo...` | NaN | display_data | 19 | 0 |
| oct/2022-11-21-1.ipynb | text/plain | `[HTML(value='<pre><code>__</code></pre>\n')]` | NaN | display_data | 2 | 1 |
| oct/2022-10-29-.ipynb | application/vnd.jupyter.widget-view+json | `{'model_id': 'fa05bd5e43cc4785834b4b87ec154fbe...` | NaN | display_data | 11 | 0 |
| oct/2022-11-17--Copy1.ipynb | text/plain | `[HTML(value='<pre><code>notebooks = pandas.con...` | NaN | display_data | 5 | 0 |
| oct/2022-10-21-markdown-future.ipynb | text/markdown | `[## literate computing with literary machines ...` | NaN | display_data | 5 | 0 |

    from dataclasses import dataclass, field
    @dataclass
    class Contents:
        dir: Path = field(default_factory=Path.cwd)
        contents: dask.dataframe.DataFrame = None
        
        def __post_init__(self):
            # build each frame from the previous one: files, contents, cells, outputs, displays
            self.contents = get_contents_from_files(get_files(self.dir))
            self.cells = get_cells_from_contents(self.contents)
            self.outputs = get_outputs_from_cells(self.cells)
            self.display_data = get_display_data_from_outputs(self.outputs)
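
a standard library sketch of the same pattern: a dataclass that takes a directory and derives everything else in `__post_init__`. the `Notebooks` class and its `files` attribute are made up for illustration and stand in for the dask frames.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Notebooks:
    dir: Path = field(default_factory=Path.cwd)
    # init=False keeps the derived attribute out of the constructor
    files: list = field(init=False, default_factory=list)

    def __post_init__(self):
        # derived from dir right after the generated __init__ runs
        self.files = sorted(str(p) for p in Path(self.dir).glob("*.ipynb"))


# a throwaway directory with one made-up notebook
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.ipynb").write_text("{}")
    found = [Path(p).name for p in Notebooks(tmp).files]
```

declaring derived attributes with `init=False` is one way to keep them visible in the class definition while still computing them from the directory.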
