searching notebooks¤
sorry haters. notebooks are here to stay. their growth and adoption mean that they'll present new problems. one forthcoming challenge is the ability to search notebooks across space and time. in this notebook, we build tooling to search notebooks and think about the questions we might ask of them.
Searching notebooks as structured data.
What questions would you ask? today? a year from today? a lifetime from today?
notebook schema¤
one of the reasons we can search notebooks is their consistent structure defined by the nbformat SCHEMA. the schema provides both a description of the document format and type information about the notebook data.
import nbformat.v4, jsonref, IPython.display as SOME
COMPACT = nbformat.validator._get_schema_json(nbformat.v4)
the COMPACT schema should be expanded to allow easier access to its components. if we don't, we have to rely on the implicit structure of the schema document.
SCHEMA = jsonref.JsonRef.replace_refs(COMPACT)
SOME.JSON(SCHEMA, root=SCHEMA["description"]);
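as a quick aside (not part of the original tooling), we can peek at the top level properties of the expanded schema.
# the top-level notebook properties defined by the nbformat v4 schema;
# we expect to see cells, metadata, nbformat, and nbformat_minor
sorted(SCHEMA["properties"])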
for this demonstration we are going to avoid anything dealing with the top level metadata. our goal is to explore the contents of cells and think about the questions we may ask of the cell sources and outputs.
cell schema¤
below we extract the expected CELL keys from the SCHEMA.
CELL, CELLS = SCHEMA["properties"]["cells"], {"nid"}
for s in CELL["items"]["oneOf"]:
    CELLS.update(s.get("properties", ""))
CELLS = sorted(CELLS)
CELLS_META_EXPLICIT = dict(execution_count="float64", nid=int, cell_type="category")
CELLS_META = tuple((k, CELLS_META_EXPLICIT.get(k, "object")) for k in CELLS)
F"the expected cell keys are {CELLS}"
loading our notebook data.¤
we're going to use dask to accelerate our efforts. dask will help us look across many files quickly, and it speaks dataframes natively.
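before wiring up real files, here is a toy sketch of the pattern we'll use (the fake_file function and its fake cells are our assumptions, not part of the pipeline): wrap a plain function in delayed, then stitch the lazy results into one dataframe.
import dask.dataframe, pandas
from dask import delayed

def fake_file(i):
    # a stand-in for reading one notebook; returns a tiny cell-level frame
    return pandas.DataFrame(dict(nid=[0], cell_type=["markdown"], source=[F"cell from file {i}"]))

dask.dataframe.from_delayed([delayed(fake_file)(i) for i in range(3)]).compute()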
import dask.dataframe, pandas, jsonref, json; from dask import delayed; from pathlib import Path
our dataframe is going to be constructed from a bunch of parallel file reads. each file is passed through get_cell to return a pandas.DataFrame.
def get_cell(path):
    with open(path) as file:
        if str(path).endswith((".ipynb",)):
            cells = json.load(file)["cells"]
        elif str(path).endswith((".md",)):
            # treat an entire markdown file as a single markdown cell
            cells = [dict(cell_type="markdown", source="".join(file))]
    df = pandas.DataFrame(cells)
    df.index.name = "nid"
    df = df.reset_index("nid")
    if "source" not in df:
        df = pandas.DataFrame(columns=CELLS)
    else:
        if "execution_count" in df:
            df.execution_count = df.execution_count.fillna(-1)  # -1 is outside the valid schema, but we don't validate here!
        df.source = df.source.apply("".join)
    df.index = [path]*len(df)
    df.index.name = "path"
    for k, _ in CELLS_META:
        if k not in df.columns:
            df[k] = None
        df[k] = df[k].astype("O")
    return df[CELLS]
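a quick spot check of get_cell on a single file; the path is an assumption, so it is guarded and only runs when a notebook is sitting in the working directory.
# try get_cell on the first notebook found nearby, if any
example = next(Path.cwd().glob("*.ipynb"), None)
if example is not None:
    display(get_cell(example)[["cell_type", "source"]].head())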
get_cells loads, tidies, and separates the cells, outputs, and metadata.
def get_delayeds(dir, recursive=False):
    dir = Path(dir)
    files = (recursive and dir.rglob or dir.glob)("*.ipynb")
    return dask.dataframe.from_delayed(
        list(map(delayed(get_cell), files))
    )
def get_cells(dir=None, recursive=False):
    return get_delayeds(dir or Path.cwd(), recursive).pipe(
        lambda df: (df, df.pop("outputs"), df.pop("metadata"))
    )
L = "__file__" not in locals()
print(L)
if L: cells, outputs, metadata = get_cells("../.."); display(cells)
find cells with imports in them
if L: cells[cells.source.str.match(r"\s*import\s+.*")].compute().T.pipe(display)
find some urls?
if L: cells.source.str.extract(r"(https?://\S+)").dropna().compute().T.pipe(display)
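another question we might ask (our addition, same pattern as above): how are the cell types distributed across the corpus?
if L: cells.cell_type.value_counts().compute().pipe(display)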
break¤
what storage? integrate with the contents manager? what queries?
working on this notebook revealed an issue with importnb's json parser that needs some care.
this document is code and can be used with the statement from tonyfast import search
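a hedged sketch of that reuse, assuming the tonyfast package is installed and exposes this notebook as the search module.
try:
    from tonyfast import search
except ImportError:
    search = None  # the packaging is an assumption; skip quietly if it isn't there
else:
    df, out, meta = search.get_cells(".", recursive=True)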