index
execution_count
cell_type
toolbar
started_at
completed_at
source
loc
metadata
outputs
1
unexecuted
In
[
]
markdown
# full text search for notebooks and files using duckdb
`duckdb` is great for moderately sized data,
so it may be a good fit for searching notebooks.
i know `pandas`, so we are going to use `pandas` to load our data.
1. read the files
2. load their contents in the `nbformat` schema
3. create the table in an in-memory duckdb
4. add a full text search index on the columns
5. search the source
metadata
11
2
executed
In
[
1
]
code
import pandas, duckdb, functools
metadata
1
0 outputs.
Out
[
1
]
3
unexecuted
In
[
]
markdown
### `search` is our database goal
the use of `search` is demonstrated at the end of the document
metadata
3
4
executed
In
[
2
]
code
def search(q) -> pandas.DataFrame:
    return (get_db().execute(F"""
SELECT * FROM
(
    SELECT *, fts_main_cells.match_bm25(path, '{q}', fields:='source') AS score FROM cells
)
WHERE score IS NOT NULL
ORDER BY score DESC;
""")).df()
metadata
9
0 outputs.
Out
[
2
]
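the `score` column in `search` comes from BM25 ranking. to make that number less opaque, here is a plain-Python sketch of the Okapi BM25 formula. it is not DuckDB's implementation: the whitespace tokenizer, the `bm25_scores` name, and the `+ 1` inside the idf logarithm are assumptions made for illustration, and DuckDB additionally stems words and drops stopwords. the `k=1.2` and `b=0.75` defaults follow the values documented for `match_bm25`.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k=1.2, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    terms = query.lower().split()
    # document frequency: in how many documents each query term appears
    df = {t: sum(t in toks for toks in tokenized) for t in terms}
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for t in terms:
            tf = counts[t]
            if tf == 0:
                continue
            idf = math.log((len(docs) - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # term frequency saturates with k and is normalized by document length via b
            score += idf * tf * (k + 1) / (tf + k * (1 - b + b * len(toks) / avg_len))
        # use None for "no match", mirroring the NULL scores the query filters out
        scores.append(score if score else None)
    return scores


cells = ["import pandas", "import duckdb", "pandas pandas dataframe search"]
scores = bm25_scores("pandas", cells)
```

cells that never mention the query get `None`, which is the role `WHERE score IS NOT NULL` plays in the SQL above.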
5
unexecuted
In
[
]
markdown
https://duckdb.org/docs/extensions/full_text_search
metadata
1
6
executed
In
[
3
]
code
@functools.lru_cache  # cache the connection so get_db behaves like a singleton
def get_db() -> duckdb.DuckDBPyConnection:
    con = duckdb.connect()
    # CREATE TABLE ... AS SELECT already copies every row of the `sources` dataframe;
    # a further INSERT would store each cell twice and duplicate the search hits
    con.execute("CREATE TABLE cells AS SELECT * FROM sources")
    con.execute("""PRAGMA create_fts_index('cells', 'path', 'source');""")
    return con
metadata
7
0 outputs.
Out
[
3
]
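a small sketch of why `functools.lru_cache` on a zero-argument function acts like a singleton. `get_resource` and the dict it returns are hypothetical stand-ins for `get_db` and its connection; nothing here touches duckdb.

```python
import functools


@functools.lru_cache  # the bare decorator form needs Python 3.8+
def get_resource() -> dict:
    # stand-in for an expensive setup step such as opening a connection
    return {"name": "hypothetical connection"}


first = get_resource()
second = get_resource()
same_object = first is second  # the body ran once; both names point at one object

get_resource.cache_clear()  # drop the cached value to force a rebuild
rebuilt = get_resource()
```

one consequence for this notebook: the cached function reads its data on the first call only, so if the underlying dataframe changes, something like `cache_clear()` is needed before the table is rebuilt.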
7
unexecuted
In
[
]
markdown
reshape the cells into a flat table that duckdb can use. we drop the metadata, attachments and outputs columns.
metadata
1
8
executed
In
[
4
]
code
def get_fts_sources(cells):
    sources = cells.drop(columns=["metadata", "attachments", "outputs"])
    sources.source = sources.source.str.join("")
    sources = sources.set_index(sources.index.map(compose_left(map(str), "#/cells/".join)).rename("path")).reset_index()
    sources.execution_count = sources.execution_count.fillna(-1)
    return sources
metadata
6
0 outputs.
Out
[
4
]
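the `path` column built in `get_fts_sources` is the file path and the cell position joined with `#/cells/`. a plain-Python sketch of that one step, without `toolz` or `pandas`; the `cell_path` helper and the example file names are made up for illustration.

```python
from pathlib import PurePosixPath


def cell_path(key, sep="#/cells/"):
    """Join a (file, cell index) pair into one searchable identifier."""
    return sep.join(map(str, key))


keys = [(PurePosixPath("notes/a.ipynb"), 0), (PurePosixPath("notes/a.ipynb"), 3)]
paths = [cell_path(key) for key in keys]

# notebook sources are often stored as a list of lines; joining on "" restores the text
source = "".join(["import pandas\n", "import duckdb"])
```

the result reads like a JSON pointer into the notebook, which is the form the search results at the end of the document show.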
9
unexecuted
In
[
]
markdown
### load all the documents in as cells
metadata
1
10
executed
In
[
5
]
code
def get_cells(docs):
    return (
        docs["cells"].apply(
            compose_left(enumerate, list)
        ).explode().apply(pandas.Series)
        .rename(columns={0: "cell_ct", 1: "cell"})
        .set_index("cell_ct", append=True)["cell"]
        .apply(pandas.Series)
    )
metadata
9
0 outputs.
Out
[
5
]
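the chained `pandas` calls in `get_cells` can be hard to read, so here is a plain-Python sketch of the shape they aim for: one row per cell, keyed by file and cell position, with one column per cell field. the `docs` dictionary and its file names are invented sample data, and this is an analog of the reshaping rather than the notebook's own method.

```python
docs = {
    "a.ipynb": {"cells": [
        {"cell_type": "markdown", "source": "# title"},
        {"cell_type": "code", "source": "1 + 1", "execution_count": 1},
    ]},
    "b.md": {"cells": [{"cell_type": "markdown", "source": "notes"}]},
}

# "explode": one entry per cell, indexed by (file, position of the cell in the file)
rows = {
    (file, cell_ct): cell
    for file, doc in docs.items()
    for cell_ct, cell in enumerate(doc["cells"])
}

# "apply(pandas.Series)": the union of cell keys becomes columns, missing values become None
columns = sorted({key for cell in rows.values() for key in cell})
table = {index: [cell.get(col) for col in columns] for index, cell in rows.items()}
```

in the `pandas` version the missing values are `NaN`, which is why `execution_count` is later filled with `-1`.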
11
unexecuted
In
[
]
markdown
`get_files` creates our first dataframe: one row per file found by `iter_files`, with the file suffix as a column.
metadata
1
12
executed
In
[
6
]
code
def get_files(dir) -> pandas.DataFrame:
    files = pandas.DataFrame(index=pandas.Index(iter_files(dir), name="file"))
    return files.assign(suffix=files.index.map(operator.attrgetter("suffix")))
metadata
3
0 outputs.
Out
[
6
]
13
unexecuted
In
[
]
markdown
`get_markdown_file` reads a markdown file as a markdown notebook cell.
metadata
1
14
executed
In
[
7
]
code
def get_markdown_file(md):
    import nbformat
    return nbformat.v4.new_notebook(cells=[nbformat.v4.new_markdown_cell(md)])
metadata
3
0 outputs.
Out
[
7
]
15
executed
In
[
8
]
code
def get_docs(files: pandas.DataFrame) -> pandas.DataFrame:
    files = files.assign(text=files.index.map(pathlib.Path.read_text))
    return pandas.concat([
        files[files.suffix.eq(".ipynb")].text.apply(compose_left(orjson.loads, pandas.Series)),
        files[files.suffix.eq(".md")].text.apply(compose_left(get_markdown_file, pandas.Series)),
    ])
metadata
6
0 outputs.
Out
[
8
]
16
executed
In
[
9
]
code
def get_cells_frame(dir): return get_cells(get_docs(get_files(dir)))
metadata
1
0 outputs.
Out
[
9
]
17
unexecuted
In
[
]
markdown
`iter_files` finds files matching an include pattern, and not matching an exclude pattern
metadata
1
18
executed
In
[
10
]
code
def iter_files(dir=None, exclude=".nox\n.ipynb_checkpoints\n", include="*.md\n*.ipynb"):
    import pathspec
    exclude_spec = pathspec.PathSpec.from_lines(pathspec.GitIgnorePattern, exclude.splitlines())
    include_spec = pathspec.PathSpec.from_lines(pathspec.GitIgnorePattern, include.splitlines())
    dir = pathlib.Path(dir or pathlib.Path.cwd())
    for f in dir.iterdir():
        if f.is_dir():
            if not exclude_spec.match_file(f):
                # pass the patterns down so custom include/exclude values apply in subdirectories
                yield from iter_files(f, exclude, include)
        if f.is_file():
            if include_spec.match_file(f):
                if not exclude_spec.match_file(f):
                    yield f
metadata
13
0 outputs.
Out
[
10
]
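to show the include/exclude walk from `iter_files` in isolation, here is a standard-library sketch that swaps `pathspec` for `fnmatch`. that swap is a simplification: `fnmatch` matches names against shell-style globs and does not implement `.gitignore` semantics. the `iter_matching` name and the temporary files are assumptions made for the demo.

```python
import fnmatch
import pathlib
import tempfile


def iter_matching(root, include=("*.md", "*.ipynb"), exclude=(".nox", ".ipynb_checkpoints")):
    """Yield files under root whose name matches include, skipping excluded names."""
    for entry in sorted(pathlib.Path(root).iterdir()):
        if any(fnmatch.fnmatch(entry.name, pattern) for pattern in exclude):
            continue  # prune excluded files and whole excluded directories
        if entry.is_dir():
            # recurse with the same patterns so they apply at every depth
            yield from iter_matching(entry, include, exclude)
        elif any(fnmatch.fnmatch(entry.name, pattern) for pattern in include):
            yield entry


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "posts").mkdir()
    (root / ".ipynb_checkpoints").mkdir()
    (root / "posts" / "a.ipynb").write_text("{}")
    (root / "posts" / "notes.txt").write_text("skip me")
    (root / ".ipynb_checkpoints" / "a-checkpoint.ipynb").write_text("{}")
    (root / "readme.md").write_text("# hi")
    found = [p.relative_to(root).as_posix() for p in iter_matching(root)]
```

the checkpoint copy is skipped because its directory is pruned before the walk descends into it, and `notes.txt` is skipped because it matches no include pattern.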
19
unexecuted
In
[
]
markdown
`iter_files` uses a pattern i like, where `pathspec` defines which files are included and excluded.
include/exclude logic can be confusing. adopting the `.gitignore` convention lets us rely on familiar semantics and point to someone else's docs.
metadata
2
20
unexecuted
In
[
]
markdown
## using our search function
metadata
1
21
executed
In
[
11
]
code
import pathspec, dataclasses, orjson, pathlib; from toolz.curried import *
metadata
1
0 outputs.
Out
[
11
]
22
unexecuted
In
[
]
markdown
initialize the `pandas.DataFrame` so `duckdb` can use it. our table in this work is `cells`
metadata
1
23
unexecuted
In
[
]
markdown
### initialize the `duckdb` tables from pandas
https://duckdb.org/docs/guides/python/import_pandas.html
metadata
3
24
executed
In
[
12
]
code
if (I := "__file__" not in locals()):
    sources = get_fts_sources(get_cells_frame(".."))
    display(get_db().execute("DESCRIBE cells").df())
metadata
3
1 outputs.
Out
[
12
]
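|   | column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|---|
| 0 | path | VARCHAR | YES | NaN | NaN | NaN |
| 1 | cell_type | VARCHAR | YES | NaN | NaN | NaN |
| 2 | id | VARCHAR | YES | NaN | NaN | NaN |
| 3 | source | VARCHAR | YES | NaN | NaN | NaN |
| 4 | execution_count | DOUBLE | YES | NaN | NaN | NaN |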
25
unexecuted
In
[
]
markdown
metadata
1
26
executed
In
[
13
]
code
I and display(search("pandas").head())
metadata
1
1 outputs.
Out
[
13
]
|   | path | cell_type | id | source | execution_count | score |
|---|---|---|---|---|---|---|
| 0 | ../xxii/oct/2022-10-29-metadata-formatter.ipyn... | code | e3a8b43e-aaeb-4b7a-9ccb-4485ae0689a0 | if ACTIVE:\n import pandas\n ... | 8.0 | 1.718594 |
| 1 | ../xxii/oct/2022-10-29-metadata-formatter.ipyn... | code | e3a8b43e-aaeb-4b7a-9ccb-4485ae0689a0 | if ACTIVE:\n import pandas\n ... | 8.0 | 1.718594 |
| 2 | ../xxiii/2023-01-02-accessible-dataframes-basi... | code | 401913ff-534f-4659-aec5-0784b1f1f34c | (df := pandas.DataFrame(\n columns=... | 2.0 | 1.679535 |
| 3 | ../xxiii/2023-01-11-accessible-dataframes-comp... | code | 401913ff-534f-4659-aec5-0784b1f1f34c | (df := pandas.DataFrame(\n columns=... | 2.0 | 1.679535 |
| 4 | ../xxiii/2023-01-02-accessible-dataframes-basi... | code | 401913ff-534f-4659-aec5-0784b1f1f34c | (df := pandas.DataFrame(\n columns=... | 2.0 | 1.679535 |
27
executed
In
[
14
]
code
I and display(search("toolz").head(4))
metadata
1
1 outputs.
Out
[
14
]
|   | path | cell_type | id | source | execution_count | score |
|---|---|---|---|---|---|---|
| 0 | ../xxii/oct/colormap-dataframes/2021-10-11-col... | code | 391cab50-209e-4843-8bea-0405f6734e6f | import pandas, numpy, toolz.curried as toolz | 1.0 | 3.776350 |
| 1 | ../xxii/oct/colormap-dataframes/2021-10-11-col... | code | 391cab50-209e-4843-8bea-0405f6734e6f | import pandas, numpy, toolz.curried as toolz | 1.0 | 3.776350 |
| 2 | ../xxiii/2023-01-11-duckdb-search.ipynb#/cells/20 | code | d3a6ca2a-7b1d-4f0a-a86c-9345913468c0 | import pathspec, dataclasses, orjson, path... | -1.0 | 3.340617 |
| 3 | ../xxiii/2023-01-11-duckdb-search.ipynb#/cells/26 | code | 589d3fb6-a4bb-434f-a06e-05d57fe57f09 | I and display(search("toolz").head(4)) | -1.0 | 3.340617 |