index
execution_count
cell_type
toolbar
started_at
completed_at
source
loc
metadata
outputs
1
unexecuted
In
[
]
markdown
# unravel directories of notebooks with `dask`
metadata
1
2
unexecuted
In
[
]
markdown
this is another pass at using `dask` to load notebooks with the ultimate intent to search them.
in [searching-notebooks](oct/2022-10-05-dask-search.ipynb), i first approached this task with some
keen `pandas` skills that were not so kind in `dask` land.
this document takes another pass, using clearer expressions to ravel a bunch
of notebooks into a `dask.dataframe`.
taking care to load notebooks as `dask.dataframe`s offers the power to apply
direct queries and to export to parquet, sqlite, duckdb, arrow..
metadata
8
3
executed
In
[
1
]
code
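import pandas, json, jsonpointer, orjson, dask.bag, dask.dataframe; from pathlib import Path
from toolz.curried import *
# XXX is true only when this runs as an interactive notebook, not as an imported module or a script
XXX = __name__ == "__main__" and "__file__" not in locals()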
metadata
3
0 outputs.
Out
[
1
]
4
unexecuted
In
[
]
markdown
## if you know the shape then define it
`dask` builds its work lazily, so it truly prefers explicit dtypes up front, while `pandas` is more flexible and can infer them from the data.
`meta` holds our shape information for the notebooks, cells, outputs, and displays.
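
a minimal sketch of the same idea in plain python, with no `dask` or `pandas`: a template holding every expected key is filled from whatever a cell actually carries, so every row ends up with the same shape. `CELL_KEYS` and `fill_template` are hypothetical names for illustration, not part of this notebook's code.

```python
# hypothetical names; a stdlib-only sketch of the "define the shape up front" idea
CELL_KEYS = ["cell_type", "execution_count", "id", "metadata", "outputs", "source"]

def fill_template(data, keys=CELL_KEYS):
    """return a dict with exactly `keys`, using None for anything missing."""
    row = dict.fromkeys(keys)  # the empty template, like an all-NaN series
    row.update((k, v) for k, v in data.items() if k in row)  # unknown keys are dropped
    return row

markdown_cell = {"cell_type": "markdown", "source": "# title", "metadata": {}}
row = fill_template(markdown_cell)
print(list(row))  # the same columns for every kind of cell
```

a markdown cell has no `outputs` or `execution_count`, yet its row still has those columns, which is what lets every partition agree on one schema.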
metadata
4
5
executed
In
[
2
]
code
class meta:
    # (column, dtype) pairs describing each level of the notebook format
    O = "object"
    ANY = None, O
    NB = [("cells", O), ("metadata", O), ("nbformat", int), ("nbformat_minor", int)]
    CELL = [
        ("cell_type", str), ("execution_count", int), ("id", str),
        ("metadata", O), ("outputs", O), ("source", str), ("cell_ct", int),]
    OUTPUT = [
        ("data", O), ("metadata", O), ("ename", str), ("evalue", str),
        ("text", str), ("execution_count", int), ("output_type", str), ("output_ct", int)]
    DISPLAY = [("type", str), ("value", str)]
    # empty series used as templates so every row carries the same keys
    new_nb = pandas.Series(index=map(first, NB), dtype="O")
    new_cell = pandas.Series(index=map(first, CELL), dtype="O")
    new_output = pandas.Series(index=map(first, OUTPUT), dtype="O")
    new_display = pandas.Series(index=map(first, DISPLAY), dtype="O")
# number each item in a list of dicts so its position survives the explode
def enumerate_list(x, key="cell_ct"): return [{key: i, **y} for i, y in enumerate(x)]
def get_series(data, key="text", new=meta.new_output):
    # join multi-line text, then fill a copy of the template from the data
    if key in data:
        data[key] = "".join(data[key])
    s = new.copy()
    return s.update(data) or s
metadata
23
0 outputs.
Out
[
2
]
6
unexecuted
In
[
]
markdown
off to the races as we load some data from our local files.
metadata
1
7
executed
In
[
3
]
code
metadata
1
0 outputs.
Out
[
3
]
8
unexecuted
In
[
]
markdown
the files we include start out as our index and remain our index. in prior iterations there were a few `set_index` operations, but we don't want to open files just to do that because it is costly. we'll store other metadata on the dataframe as we unpack the notebook shapes.
metadata
1
9
executed
In
[
4
]
code
def get_files(WHERE=WHERE):
    return dask.bag.from_sequence(
        dict(file=str(x)) for x in WHERE.glob("*.ipynb")
    ).to_dataframe().set_index("file")
XXX and (files := get_files())
metadata
6
1 outputs.
Out
[
4
]
Dask DataFrame Structure:
npartitions=20
oct/2022-10-05-dask-search.ipynb
oct/2022-10-06-github-open-source-stats.ipynb
...
oct/2022-11-21-1.ipynb
oct/test_nbconvert_html5.ipynb
Dask Name: sort_index, 8 graph layers
10
unexecuted
In
[
]
markdown
`contents` loads our files into a dataframe containing the real cell contents.
each row is a file.
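
for intuition, here is a rough stdlib-only sketch of what one row of `contents` holds: the parsed top-level keys of a single notebook file. `read_notebook` is a hypothetical helper and `json` stands in for `orjson`; the temporary notebook exists only to make the example runnable.

```python
import json
import tempfile
from pathlib import Path

def read_notebook(path):
    """parse one .ipynb file and keep only the top-level notebook keys."""
    nb = json.loads(Path(path).read_text())
    return {k: nb.get(k) for k in ("cells", "metadata", "nbformat", "nbformat_minor")}

with tempfile.TemporaryDirectory() as tmp:
    # write a tiny notebook so the sketch has something to load
    path = Path(tmp, "example.ipynb")
    path.write_text(json.dumps({
        "cells": [{"cell_type": "markdown", "metadata": {}, "source": ["hello"]}],
        "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
    }))
    # one entry per file, keyed by file name like the dataframe index
    contents_by_file = {p.name: read_notebook(p) for p in Path(tmp).glob("*.ipynb")}

print(contents_by_file["example.ipynb"]["nbformat"])
```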
metadata
2
11
executed
In
[
5
]
code
def get_contents_from_files(files):
    return files.index.to_series().apply(
        compose_left(Path, Path.read_text, orjson.loads, partial(
            get_series, new=meta.new_nb)), meta=meta.NB)
XXX and (contents := get_contents_from_files(files))
metadata
5
1 outputs.
Out
[
5
]
Dask DataFrame Structure:
cells
metadata
nbformat
nbformat_minor
npartitions=20
oct/2022-10-05-dask-search.ipynb
object
object
int64
int64
oct/2022-10-06-github-open-source-stats.ipynb
...
...
...
...
...
...
...
...
...
oct/2022-11-21-1.ipynb
...
...
...
...
oct/test_nbconvert_html5.ipynb
...
...
...
...
Dask Name: apply, 11 graph layers
12
unexecuted
In
[
]
markdown
the `cells` are built by exploding the `cells` column of `contents`, so each row becomes a single cell.
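
in plain python the explode step amounts to a nested loop: number each cell within its notebook, then emit one row per cell that still remembers its file. this is a sketch over made-up data, and `explode_cells` is a hypothetical name.

```python
def explode_cells(notebooks):
    """yield one flat row per cell from a {file: notebook dict} mapping."""
    for file, nb in notebooks.items():
        for cell_ct, cell in enumerate(nb["cells"]):
            # source may be a string or a list of lines; join handles both
            yield {"file": file, "cell_ct": cell_ct, **cell, "source": "".join(cell["source"])}

notebooks = {
    "a.ipynb": {"cells": [
        {"cell_type": "markdown", "source": ["# title\n", "text"]},
        {"cell_type": "code", "source": "1 + 1"},
    ]},
    "b.ipynb": {"cells": [{"cell_type": "code", "source": ["print('hi')"]}]},
}
rows = list(explode_cells(notebooks))
print(len(rows))  # → 3
```

the file name repeats on every row, which mirrors how the dataframe keeps the file as its index after the explode.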
metadata
1
13
executed
In
[
6
]
code
def get_cells_from_contents(contents):
    cells = contents.cells
    cells = cells.apply(enumerate_list, meta=meta.ANY)
    return cells.explode().apply(get_series, key="source", new=meta.new_cell, meta=meta.CELL)
if XXX:
    cells = get_cells_from_contents(contents)
    meta_cells = cells["metadata cell_ct".split()]; cells.pop("metadata"); display(cells)
metadata
7
1 outputs.
Out
[
6
]
Dask DataFrame Structure:
cell_type
execution_count
id
outputs
source
cell_ct
npartitions=20
oct/2022-10-05-dask-search.ipynb
object
int64
object
object
object
int64
oct/2022-10-06-github-open-source-stats.ipynb
...
...
...
...
...
...
...
...
...
...
...
...
...
oct/2022-11-21-1.ipynb
...
...
...
...
...
...
oct/test_nbconvert_html5.ipynb
...
...
...
...
...
...
Dask Name: drop_by_shallow_copy, 16 graph layers
14
unexecuted
In
[
]
markdown
now we deal with outputs, which include display_data, stdout, and stderr.
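
as a reminder of the shapes involved: in nbformat 4, `stream` outputs carry `name` (stdout or stderr) and `text`, `display_data` and `execute_result` outputs carry a `data` mime bundle, and `error` outputs carry `ename`, `evalue`, and `traceback`. the sketch below groups one hand-written cell's outputs by `output_type`; the sample cell is invented for illustration.

```python
from collections import defaultdict

cell = {"cell_type": "code", "outputs": [
    {"output_type": "stream", "name": "stdout", "text": ["hello\n"]},
    {"output_type": "stream", "name": "stderr", "text": ["warning\n"]},
    {"output_type": "display_data", "data": {"text/plain": ["<Figure>"]}, "metadata": {}},
    {"output_type": "error", "ename": "ValueError", "evalue": "bad", "traceback": []},
]}

by_type = defaultdict(list)
for output_ct, output in enumerate(cell["outputs"]):
    # keep the position so an output can be traced back to its place in the cell
    by_type[output["output_type"]].append({"output_ct": output_ct, **output})

print(sorted(by_type))  # → ['display_data', 'error', 'stream']
```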
metadata
1
15
executed
In
[
7
]
code
def get_outputs_from_cells(cells):
    outputs = cells["outputs cell_ct".split()].dropna(subset="outputs")
    outputs.outputs = outputs.outputs.apply(enumerate_list, key="output_ct", meta=meta.ANY)
    outputs = outputs.explode("outputs").dropna(subset="outputs")
    return dask.dataframe.concat([
        outputs.pop("outputs").apply(get_series, key="text", new=meta.new_output, meta=meta.OUTPUT),
        outputs
    ], axis=1)
if XXX:
    outputs = get_outputs_from_cells(cells)
    meta_display = outputs["metadata cell_ct output_ct".split()]; outputs.pop("metadata"); display(outputs)
metadata
11
1 outputs.
Out
[
7
]
Dask DataFrame Structure:
data
ename
evalue
text
execution_count
output_type
output_ct
cell_ct
npartitions=20
oct/2022-10-05-dask-search.ipynb
object
object
object
object
int64
object
int64
int64
oct/2022-10-06-github-open-source-stats.ipynb
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
oct/2022-11-21-1.ipynb
...
...
...
...
...
...
...
...
oct/test_nbconvert_html5.ipynb
...
...
...
...
...
...
...
...
Dask Name: drop_by_shallow_copy, 2 graph layers
16
unexecuted
In
[
]
markdown
here we separate the standard out/error streams from the rich display data. there is probably more to do for managing the different types of outputs from the different reprs.
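
a small sketch of that separation without dataframes: stream outputs keep their joined `text`, while each mime type in a rich output's `data` bundle becomes its own `(type, value)` row. `split_outputs` and the sample outputs are hypothetical.

```python
def split_outputs(outputs):
    """return (streams, displays): joined stream text and one row per mime type."""
    streams, displays = [], []
    for output_ct, out in enumerate(outputs):
        if out["output_type"] == "stream":
            streams.append({"output_ct": output_ct, "name": out["name"], "text": "".join(out["text"])})
        elif "data" in out:  # display_data and execute_result carry a mime bundle
            for mime, value in out["data"].items():
                displays.append({"output_ct": output_ct, "type": mime, "value": value})
    return streams, displays

streams, displays = split_outputs([
    {"output_type": "stream", "name": "stdout", "text": ["a\n", "b\n"]},
    {"output_type": "execute_result", "execution_count": 1, "metadata": {},
     "data": {"text/plain": ["<b>"], "text/html": ["<b>bold</b>"]}},
])
print([d["type"] for d in displays])  # → ['text/plain', 'text/html']
```

one rich output can fan out into several rows, one per repr, which is why `output_ct` is carried along.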
metadata
1
17
executed
In
[
8
]
code
def get_display_data_from_outputs(outputs):
    display_data = outputs["data execution_count output_type cell_ct output_ct".split()].dropna(subset="data")
    display_data["data"] = display_data["data"].apply(compose_left(dict.items, list), meta=meta.ANY)
    display_data = display_data.explode("data").dropna(subset="data")
    return dask.dataframe.concat([
        display_data.pop("data").apply(
            compose_left(
                partial(zip, meta.new_display.index), dict,
                partial(get_series, key=None, new=meta.new_display)
            ), meta=meta.DISPLAY), display_data], axis=1)
XXX and (display_data := get_display_data_from_outputs(outputs)).compute()
metadata
11
1 outputs.
Out
[
8
]
type
value
execution_count
output_type
cell_ct
output_ct
file
oct/2022-10-05-dask-search.ipynb
text/plain
["the expected cell keys are ['attachments', '...
3.0
execute_result
7
0
oct/2022-10-05-dask-search.ipynb
text/html
[<div><strong>Dask DataFrame Structure:</stron...
NaN
display_data
15
1
oct/2022-10-05-dask-search.ipynb
text/plain
[Dask DataFrame Structure:\n, at...
NaN
display_data
15
1
oct/2022-10-05-dask-search.ipynb
text/html
[<div>\n, <style scoped>\n, .dataframe tbo...
NaN
display_data
17
0
oct/2022-10-05-dask-search.ipynb
text/plain
[path .....
NaN
display_data
17
0
...
...
...
...
...
...
...
oct/2022-11-21-1.ipynb
application/vnd.jupyter.widget-view+json
{'model_id': 'cd4f96b3078c4adf851d419b9c8e7885...
NaN
display_data
4
1
oct/2022-11-21-1.ipynb
text/plain
[HTML(value='<pre><code>importlib._bootstrap_e...
NaN
display_data
4
1
oct/test_nbconvert_html5.ipynb
text/plain
[([], 1)]
4.0
execute_result
5
0
oct/test_nbconvert_html5.ipynb
text/plain
[([], 16)]
5.0
execute_result
7
0
oct/test_nbconvert_html5.ipynb
text/plain
[([], 5)]
6.0
execute_result
9
0
173 rows × 6 columns
18
unexecuted
In
[
]
markdown
## where to go from here
* extend to other files. the notebook format is a hypermedia document format.
* save to different formats. initially we think about parquet, while in theory from this dataframe
we could go further and imagine it being the seed for documentation.
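
as one hedged illustration of the second bullet, here is a stdlib sketch that writes cell rows to sqlite and searches them with `LIKE`. it swaps in `sqlite3` for parquet so it runs without extra dependencies; the table name, columns, and sample rows are assumptions, and in practice the rows would come from the computed `cells` dataframe.

```python
import sqlite3

rows = [  # made-up rows shaped like the cells dataframe
    ("a.ipynb", 0, "markdown", "# searching notebooks"),
    ("a.ipynb", 1, "code", "import dask.dataframe"),
    ("b.ipynb", 0, "code", "import pandas"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cells (file TEXT, cell_ct INTEGER, cell_type TEXT, source TEXT)")
con.executemany("INSERT INTO cells VALUES (?, ?, ?, ?)", rows)

# a direct query: which cells mention dask?
hits = con.execute(
    "SELECT file, cell_ct FROM cells WHERE source LIKE ? ORDER BY file, cell_ct", ("%dask%",)
).fetchall()
print(hits)  # → [('a.ipynb', 1)]
con.close()
```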
metadata
5
19
executed
In
[
9
]
code
XXX and display(*(x.sample(frac=.1).compute().sample(5) for x in (cells, outputs, display_data)))
metadata
1
3 outputs.
Out
[
9
]
cell_type
execution_count
id
outputs
source
cell_ct
file
oct/2022-10-29-metadata-formatter.ipynb
markdown
NaN
50d9dcf9-18f4-4eb7-9a9a-cc3e14b3b2e2
NaN
#### dataframes
16
oct/2022-10-06-github-open-source-stats.ipynb
code
197.0
af8301f0-8a28-4350-87fa-a728867ccf67
[]
%reload_ext doit
6
oct/2022-10-21-markdown-future.ipynb
code
5.0
21fa2ad9-23f9-4868-9cb7-3bb87b69c542
[{'data': {'text/markdown': ['# tangle (code) ...
# tangle (code) and weave (display)\n\n kni...
7
oct/2022-10-19-mobius-text.ipynb
code
NaN
55228766-71c2-48f0-819f-afab637a4867
[]
18
oct/2022-10-27-axe-core-playwright-python.ipynb
code
17.0
6e7cb333-6cc3-4c73-b256-382bcb3e9c2c
[]
async def injectAxe(page): \n await page.ev...
9
data
ename
evalue
text
execution_count
output_type
output_ct
cell_ct
file
oct/2022-10-21-markdown-future.ipynb
{'text/markdown': ['i'm not saying importing m...
NaN
NaN
NaN
NaN
display_data
0
17
oct/2022-11-17--Copy1.ipynb
{'application/vnd.jupyter.widget-view+json': {...
NaN
NaN
NaN
NaN
display_data
0
6
oct/2022-10-21-markdown-future.ipynb
{'text/markdown': ['```mermaid
', 'flowchart L...
NaN
NaN
NaN
NaN
display_data
1
16
oct/2022-10-29-.ipynb
NaN
AttributeError
'dict' object has no attribute 'breaks'
NaN
NaN
error
1
12
oct/2022-10-21-pidgy-displays.ipynb
{'application/vnd.jupyter.widget-view+json': {...
NaN
NaN
NaN
NaN
display_data
0
17
type
value
execution_count
output_type
cell_ct
output_ct
file
oct/2022-10-05-dask-search.ipynb
text/html
[<div>\n, <style scoped>\n, .dataframe tbo...
NaN
display_data
19
0
oct/2022-11-21-1.ipynb
text/plain
[HTML(value='<pre><code>__</code></pre>\n')]
NaN
display_data
2
1
oct/2022-10-29-.ipynb
application/vnd.jupyter.widget-view+json
{'model_id': 'fa05bd5e43cc4785834b4b87ec154fbe...
NaN
display_data
11
0
oct/2022-11-17--Copy1.ipynb
text/plain
[HTML(value='<pre><code>notebooks = pandas.con...
NaN
display_data
5
0
oct/2022-10-21-markdown-future.ipynb
text/markdown
[## literate computing with literary machines ...
NaN
display_data
5
0
20
executed
In
[
10
]
code
from dataclasses import dataclass, field
@dataclass
class Contents:
    dir: Path = field(default_factory=Path.cwd)
    contents: dask.dataframe.DataFrame = None
    def __post_init__(self):
        self.contents = get_contents_from_files(get_files(self.dir))
        self.cells = get_cells_from_contents(self.contents)
        self.outputs = get_outputs_from_cells(self.cells)
        # get_display_data_from_outputs is a module-level function, not a method
        self.display_data = get_display_data_from_outputs(self.outputs)
metadata
11
0 outputs.
Out
[
10
]