
searching notebooks¤

sorry haters, notebooks are here to stay. their growth and adoption mean they'll present new problems, and one forthcoming challenge is the ability to search notebooks across space and time. in this notebook, we build tooling to search notebooks and think about the questions we might ask of our notebooks.

Searching notebooks as structured data.

What questions would you ask? today? a year from today? a lifetime from today?

notebook schema¤

one of the reasons we can search notebooks is their consistent structure, defined by the nbformat SCHEMA. the schema provides both a description of the document format and type information about the notebook data.

    import nbformat.v4, jsonref, IPython.display as SOME
    # _get_schema_json is a private nbformat helper that returns the raw v4 notebook json schema
    COMPACT = nbformat.validator._get_schema_json(nbformat.v4)

the COMPACT schema should be expanded to allow easier access to its components. if we don't do that, we have to rely on the implicit structure of the schema document and chase json references by hand.

    SCHEMA = jsonref.JsonRef.replace_refs(COMPACT)
    SOME.JSON(SCHEMA, root=SCHEMA["description"]);

for this demonstration we are going to avoid anything dealing with the top-level metadata. our goal is to explore the contents of cells and think about the questions we may ask of the cell sources and outputs.

cell schema¤

below we extract the expected cell keys, CELLS, from the SCHEMA

    # "nid" is our own per-notebook cell index; it isn't part of the schema itself
    CELL, CELLS = SCHEMA["properties"]["cells"], {"nid"}
    for s in CELL["items"]["oneOf"]:
        CELLS.update(s.get("properties", {}))
    CELLS = sorted(CELLS)
    CELLS_META_EXPLICIT = dict(execution_count="float64", nid=int, cell_type="category")
    CELLS_META = tuple((k, CELLS_META_EXPLICIT.get(k, "object")) for k in CELLS)
    F"the expected cell keys are {CELLS}"
"the expected cell keys are ['attachments', 'cell_type', 'execution_count', 'id', 'metadata', 'nid', 'outputs', 'source']"

loading our notebook data.¤

we're going to use dask to accelerate our efforts. dask helps us look across many files quickly, and it lets us speak dataframes natively.

    import dask.dataframe, pandas, jsonref, json; from dask import delayed; from pathlib import Path

our dataframe is going to be constructed from a bunch of parallel file reads. each file is passed through get_cell to return a pandas.DataFrame.

    def get_cell(path):
        with open(path) as file:
            if str(path).endswith((".ipynb",)):
                cells = json.load(file)["cells"]
            elif str(path).endswith((".md",)):
                # treat an entire markdown file as a single markdown cell
                cells = [dict(cell_type="markdown", source="".join(file))]
            else:
                cells = []
        df = pandas.DataFrame(cells)
        df.index.name = "nid"
        df = df.reset_index("nid")

        if "source" not in df:
            df = pandas.DataFrame(columns=CELLS)
        else:
            if "execution_count" in df:
                df.execution_count = df.execution_count.fillna(-1) # -1 is outside the valid schema, but we don't validate here!
            df.source = df.source.apply("".join)
        df.index = [path]*len(df)
        df.index.name = "path"

        # guarantee every expected cell key exists so the partitions align
        for k, _ in CELLS_META:
            if k not in df.columns:
                df[k] = None
            df[k] = df[k].astype("O")
        return df[CELLS]
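
as a quick sanity check, get_cell can be pointed at a single file; the glob below just grabs whichever notebook happens to sit in the working directory, if any:

    for _path in Path.cwd().glob("*.ipynb"):
        SOME.display(get_cell(_path).head())  # one row per cell, indexed by the file path
        break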

get_cells loads and tidies the cells, then separates out the outputs and metadata

    def get_delayeds(dir, recursive=False):
        dir = Path(dir)
        files = (recursive and dir.rglob or dir.glob)("*.ipynb")
        return dask.dataframe.from_delayed(
            list(map(delayed(get_cell), files))
        )
    def get_cells(dir=None, recursive=False):
        return get_delayeds(dir or Path.cwd(), recursive).pipe(
            lambda df: (df, df.pop("outputs"), df.pop("metadata"))
        )
    L = "__file__" not in locals()
    print(L)
    if L: cells, outputs, metadata = get_cells("../.."); display(cells)
True
Dask DataFrame Structure:
              attachments cell_type execution_count      id     nid  source
npartitions=8
                   object    object          object  object  object  object
                      ...       ...             ...     ...     ...     ...
...                   ...       ...             ...     ...     ...     ...
                      ...       ...             ...     ...     ...     ...
                      ...       ...             ...     ...     ...     ...
Dask Name: drop_by_shallow_copy, 32 tasks
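
before asking anything clever, a trivial first query: how many of each cell type do we have across these notebooks? a minimal sketch against the cells dataframe above:

    if L: cells.cell_type.value_counts().compute().pipe(display)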

find cells with imports in them

    if L: cells[cells.source.str.match(r"\s*import\s+.*")].compute().T.pipe(display)
path            | ../../2022-06-28-.ipynb                 | ../../2022-06-24-.ipynb              | ../../2022-03-06-schemata-scratch.ipynb | ../../2022-03-06-schemata-scratch.ipynb
attachments     | None                                    | None                                 | None                                    | None
cell_type       | code                                    | code                                 | code                                    | code
execution_count | 9.0                                     | 2.0                                  | 7.0                                     | 6.0
id              | c744325a-8f73-4428-b381-c1f4ee5fdb06    | ba46c5ef-78e6-48b2-bee3-cd6be3606fe5 | ab6ff11f-a7d1-4e59-9fdd-cac7313a7cf4    | 5d3d2316-b2b0-4f47-ad0d-9b677b1f7e6a
nid             | 6                                       | 1                                    | 52                                      | 126
source          | \n import graphviz\n hommage = graph... | import functools, abc                | import sys                              | import urllib

find some urls?

    if L: cells.source.str.extract(r"(https?://\S+)").dropna().compute().T.pipe(display)
path                                    | 0
../../2022-06-28-.ipynb                 | https://raw.githubusercontent.com/SchemaStore/...
../../2022-06-28-.ipynb                 | https://joss.theoj.org/papers/in/Jupyter%20Not...
../../2022-06-28-.ipynb                 | https://raw.githubusercontent.com/SchemaStore/...
../../2022-06-28-.ipynb                 | https://raw.githubusercontent.com/jupyter/nbfo...
../../2022-06-28-.ipynb                 | https://c.tenor.com/JHjG5vxW9zIAAAAd/missy-ell...
../../2022-04-12-.ipynb                 | https://github.com/jupyterlab/lumino",
../../2022-04-12-.ipynb                 | https://github.com/jupyterlab/jupyterlab@master",
../../2022-03-06-schemata-scratch.ipynb | https://json-schema.org/draft/next/meta/valida...
../../2022-03-06-schemata-scratch.ipynb | https://json-schema.org/draft/2020-12/schema']
../../2022-03-06-schemata-scratch.ipynb | https://json-schema.org/draft/2020-12/schema"]
../../2022-03-06-schemata-scratch.ipynb | https://test.json-schema.org/dynamic-resolutio...
../../2022-03-06-schemata-scratch.ipynb | https://avatars.githubusercontent.com/u/423627...
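
the outputs frame we popped off earlier can be searched too. a sketch, assuming each code cell still carries its raw list of output dicts, that counts error outputs per notebook:

    if L: outputs.apply(
        lambda out: sum(o.get("output_type") == "error" for o in out) if isinstance(out, list) else 0,
        meta=("outputs", "int64"),
    ).compute().groupby(level="path").sum().pipe(display)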

break¤

some open questions: what storage should we use? how does this integrate with the contents manager? what queries will we want?

working on this notebook revealed an issue with importnb's json parser that needs some care.

this document is code and can be used with the statement from tonyfast import search
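
a hypothetical usage sketch of that import: the module path comes from the line above, but the idea that get_cells is exposed unchanged on it (via importnb loading the notebook as a module) is an assumption.

    from tonyfast import search                                                       # notebook loaded as a module
    cells, outputs, metadata = search.get_cells("path/to/notebooks", recursive=True)  # hypothetical directory
    cells[cells.cell_type == "markdown"].source.compute()                             # all the markdown prose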