index execution_count cell_type toolbar started_at completed_at source loc metadata outputs

markdown

# use `whoosh` to search cells/articles on disk


markdown

https://whoosh.readthedocs.io/en/latest/

    !pip install whoosh


In 1

code

2023-01-25T02:23:35.068198+00:00

2023-01-25T02:23:35.093125+00:00

    import whoosh.fields, whoosh.index, whoosh.qparser, whoosh.writing
    import pathlib, shutil


In 2

code

2023-01-25T02:23:35.094280+00:00

2023-01-25T02:23:36.405214+00:00

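    import nest_asyncio
    from tonyfast import nbframe
    # allow nested event loops, since the notebook kernel already runs one
    nest_asyncio.apply()
    # find files under the parent directory and load them as documents
    self = nbframe.Documents(nbframe.Finder(dir="..")).load()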


markdown

## initialize the search index


In 3

code

2023-01-25T02:23:36.406185+00:00

2023-01-25T02:23:36.409622+00:00

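    INDEX = pathlib.Path("search_index")
    INDEX.mkdir(exist_ok=True)

    # `source` is the searchable cell text; `path` is stored so hits can report where they came from
    schema = whoosh.fields.Schema(source=whoosh.fields.TEXT, path=whoosh.fields.ID(stored=True))
    # create_in starts a new, empty index in the directory and returns it
    index = whoosh.index.create_in(INDEX, schema)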


In 4

code

2023-01-25T02:23:36.410758+00:00

2023-01-25T02:23:36.871399+00:00

    from tonyfast import nbframe
    self = nbframe.Documents(nbframe.Finder(dir="..")).load()


markdown

`self.articles` is a dask dataframe with one row per cell, containing notebooks and other files cast to the notebook schema. the lazy dask structure and the first rows as a computed dataframe are shown below.
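
to make "cast to the notebook schema" concrete, here is a rough sketch of flattening documents into one row per cell using only the standard library. it is not how `nbframe` does it; `cells_to_rows` and `text_to_notebook` are hypothetical names, and wrapping a plain file as a single markdown cell is an assumption about what such a cast could look like.

```python
import json


def cells_to_rows(path, notebook):
    """Flatten one parsed notebook into a list of per-cell rows."""
    rows = []
    for cell_ct, cell in enumerate(notebook.get("cells", [])):
        rows.append(
            {
                "path": path,
                "cell_ct": cell_ct,  # position of the cell within its file
                "cell_type": cell["cell_type"],
                "execution_count": cell.get("execution_count"),
                "source": cell["source"],
                "outputs": cell.get("outputs"),  # None for markdown cells
            }
        )
    return rows


def text_to_notebook(text):
    """Hypothetical cast: wrap a plain file as a notebook with one markdown cell."""
    cell = {"cell_type": "markdown", "metadata": {}, "source": text}
    return {"cells": [cell], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}


raw = json.dumps(
    {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# title"]},
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "outputs": [],
                "source": ["print(1)"],
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
)

rows = cells_to_rows("demo.ipynb", json.loads(raw))
rows += cells_to_rows("what.md", text_to_notebook("some prose"))
```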


In 5

code

2023-01-25T02:23:36.872419+00:00

2023-01-25T02:23:36.897872+00:00

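    # show the lazy dask structure, then the first 10 rows drawn from the first 5 partitions
    display(self.articles, self.articles.head(10, npartitions=5))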


Dask DataFrame Structure:

	cell_type	execution_count	id	metadata	outputs	source	cell_ct	attachments
npartitions=85
../2023-01-19-.ipynb	object	int64	object	object	object	object	int64	object
../2023-01-19-pidgy-afforndances.ipynb	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...
../xxiii/vendor/tree-sitter-python/test/highlight/pattern_matching.py	...	...	...	...	...	...	...	...
../xxiii/what.md	...	...	...	...	...	...	...	...

Dask Name: apply, 1 graph layer

	cell_type	execution_count	id	metadata	outputs	source	cell_ct	attachments
path
../2023-01-19-.ipynb	code	None	ad5f3630-daac-4b8b-95b0-22f27ea47af2	{}	[]		0	None
../2023-01-19-pidgy-afforndances.ipynb	code	1.0	409a2348-866f-4127-a25b-7fc0adcac5fc	{}	[{'ename': 'SyntaxError', 'evalue': 'invalid s...	when i program in `pidgy`\n\n* `sys.modules` a...	0	None
../2023-01-19-pidgy-afforndances.ipynb	code	1.0	45c4aa5d-e53d-4a02-82b2-8f872906ceba	{}	[{'data': {'text/markdown': ' %reload_ext p...	%reload_ext pidgy\n from toolz.curried ...	1	None
../2023-01-19-pidgy-afforndances.ipynb	markdown	NaN	87b3dd7c-3b9a-406d-b39c-547c69a938f7	{}	None	<iframe src="http://127.0.0.1:8787/status"...	2	None
../2023-01-19-pidgy-afforndances.ipynb	markdown	NaN	d3d234cd-7d46-44b2-b26f-a80e524764ec	{}	None	# start the contents finder	3	None
../2023-01-19-pidgy-afforndances.ipynb	code	1.0	3563e7c1-6f4f-4774-84c9-f2db766238cb	{}	[{'data': {'text/html': '<div> <div style=...	\n %reload_ext pidgy\n import nbfram...	4	None
../2023-01-19-pidgy-afforndances.ipynb	code	11.0	4b4f8e4c-d192-4dd8-818e-b728d4b6673e	{}	[{'data': {'text/html': '<div> <style scoped> ...	result	5	None
../2023-01-19-pidgy-afforndances.ipynb	code	21.0	1fba636a-14c1-4011-8af1-5d54452f7e36	{}	[{'data': {'text/markdown': ' pretty neat that...	{{asyncio.sleep(1) or ""}} pretty neat that we...	6	None
../2023-01-19-pidgy-afforndances.ipynb	code	18.0	efdfc300-da33-40d5-b1ea-8636e8e1a001	{}	[{'data': {'text/markdown': ' docs= 2', 'te...	docs= 2	7	None
../2023-01-19-pidgy-afforndances.ipynb	markdown	NaN	bad77dd2-294b-4f6a-a520-4ff25be27d41	{}	None	load and persist the data	8	None

In 6

code

2023-01-25T02:23:36.899159+00:00

2023-01-25T02:23:36.905474+00:00

    def get_article_path(s):
        # s.name is the row's index label (the file path); cell_ct is the cell's position in that file
        return str(s.name) + "#/cells/" + str(s.cell_ct)
    self.articles["path"] = self.articles.apply(get_article_path, meta=("path", "O"), axis=1)
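
the `#/cells/<n>` suffix reads like a pointer into the notebook document, so a search hit can be traced back to a single cell. a small sketch of the reverse step, assuming paths shaped exactly like the ones built above; `split_article_path` and `get_cell` are hypothetical helpers.

```python
def split_article_path(path):
    """Split 'file#/cells/N' into the file name and the integer cell position."""
    file, _, pointer = path.rpartition("#")
    prefix, _, cell_ct = pointer.rpartition("/")
    if prefix != "/cells":
        raise ValueError(f"not a cell path: {path!r}")
    return file, int(cell_ct)


def get_cell(notebook, path):
    """Look up the cell that `path` points at inside an already parsed notebook."""
    _, cell_ct = split_article_path(path)
    return notebook["cells"][cell_ct]


notebook = {
    "cells": [
        {"cell_type": "markdown", "source": "# start the contents finder"},
        {"cell_type": "code", "source": "result"},
    ]
}
file, cell_ct = split_article_path("../demo.ipynb#/cells/1")
cell = get_cell(notebook, "../demo.ipynb#/cells/1")
```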


In 7

code

2023-01-25T02:23:36.907156+00:00

2023-01-25T02:23:37.638898+00:00

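    def write_documents(df):
        # add every row of one file's cells to the index as a document
        with whoosh.writing.AsyncWriter(index) as w:
            for _, x in df.iterrows():
                w.add_document(**x)
    # join list-valued sources into strings, then write one group of cells per file
    self.articles[["source", "path"]].applymap("".join).groupby(self.articles.index).apply(write_documents, meta=("none", int)).compute()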


Series([], Name: none, dtype: int64)
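
in the notebook JSON format a cell's `source` may be stored either as one string or as a list of lines, which is presumably why the cell above maps `"".join` over the columns before indexing. a minimal sketch with plain dictionaries and made-up records shows that the same call handles both shapes; `as_text` is a hypothetical name.

```python
def as_text(value):
    """Join a list of lines into one string; a plain string comes back unchanged."""
    return "".join(value)


records = [
    {"source": ["import pathlib\n", "import shutil"], "path": "a.ipynb#/cells/0"},
    {"source": "# a heading", "path": "a.ipynb#/cells/1"},
]

# each dictionary now holds only strings, ready to pass as add_document(**document)
documents = [
    {key: as_text(value) for key, value in record.items()} for record in records
]
```

joining a string iterates over its characters and puts them back together, so the already flat values, including `path`, pass through untouched.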

markdown

## querying the documents


In 8

code

2023-01-25T02:23:37.639714+00:00

2023-01-25T02:23:37.642199+00:00

    # parse query strings against the "source" field of the schema
    query = whoosh.qparser.QueryParser("source", schema)


In 9

code

2023-01-25T02:23:37.643158+00:00

2023-01-25T02:23:37.661337+00:00

    with index.searcher() as search:
        print(search.search(query.parse("literate computing")))


<Top 0 Results for And([Term('source', 'literate'), Term('source', 'computing')]) runtime=9.207999937643763e-05>