
exploring many pyproject.toml configs¤

i composed the following query using github's graphql explorer because it has completion, which helps in the composition. i also referred to GitHub GraphQL - Get files in a repository for some ideas about how to compose my query.

  • this work is my first time interacting with graphql for data analysis. i really preferred the GraphiQL experience that provides completion; otherwise i would have been totally lost.
  • it uses requests to retrieve the results, and requests_cache for caching.
  • at the end, we look at some dataframes built from the responses.
    from typing import *; from toolz.curried import *; import operator, pandas, requests, tomli
    pyproject_query = """
    {
      search(type: REPOSITORY, query: "install in:readme language:python stars:>500", first:100 %s) {
        pageInfo {
          hasNextPage endCursor
        }
        edges {
          node {
            ... on Repository {
              url
              stargazerCount
              object(expression: "HEAD:pyproject.toml") {
                ... on Blob {
                  text
                }
              }
            }
          }
        }
      }
    }"""

the graphql query retrieves the pyproject.toml from a bunch of python projects. the initial goal of this query is to discover python projects and retrieve their pyproject.toml files for comparison.

we are looking for pyproject.toml files, which outline strict metadata specifications. it would be cool to get a high-level view of the python conventions popular projects are using.

> i'd love suggestions on a better query that finds more repositories with pyproject.toml files.

paginating the requests to get a bunch of data¤

get_one_page makes a POST to the github graphql endpoint - https://api.github.com/graphql

    def get_one_page(query: str, prior: requests.Response=None, fill: str = "") -> requests.Response: 
        if prior and prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]:
            fill = """, after: "%s" """ % prior.json()["data"]["search"]["pageInfo"]["endCursor"]
        return requests.post("https://api.github.com/graphql", json=dict(query=query % fill), **header)
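
a quick sketch of how the %s placeholder gets filled: the first call leaves it empty, and a follow-up call splices in the endCursor from the prior response. this assumes header, imported in the boilerplate below, is already in scope.

    # hypothetical usage - not run here
    first = get_one_page(pyproject_query)                 # query % ""
    second = get_one_page(pyproject_query, prior=first)   # query % ', after: "<endCursor>" '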

get_pages yields multiple responses, paginating while the search reports more nodes to export.

    def get_pages(query: str, prior=None, max=15):
        for i in range(max):
            prior = get_one_page(query, prior=prior)
            yield prior
            if prior.status_code != 200: break
            if not prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]: break

gather a few pages into a list of responses

    def gather(query: str, max: int=2): return list(get_pages(query, max=max))

analyze some actual data¤

> boilerplate to begin the analysis

    __import__("requests_cache").install_cache(allowable_methods=['GET', 'POST'])
    from info import header # this has some api info 
    pandas.options.display.max_colwidth = None
    Ø = __name__ == "__main__" and "__file__" not in locals()
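
info.py is not included in this post; a minimal sketch of what it might hold, assuming header is just the keyword arguments splatted into requests.post above. the token is a placeholder, not a real value.

    # info.py - a hypothetical stand-in for the module imported above;
    # the github graphql api expects a personal access token in the Authorization header
    header = dict(headers={"Authorization": "bearer <your-github-token>"})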

transform the responses into a big pandas dataframe of configs

tidy_responses transforms our query responses into a single dataframe.

    def tidy_responses(responses: list[requests.Response]) -> pandas.Series:
        return pipe(responses, map(
            compose_left(operator.methodcaller("json"), get("data"), get("search"), get("edges"), pandas.DataFrame)
        ), partial(pandas.concat, axis=1)).stack()
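
the toolz pipeline is terse; here is the same reshaping written as a plain loop, for comparison.

    # an explicit version of tidy_responses: one dataframe of edges per response,
    # concatenated column-wise, then stacked into a single series of node dicts
    def tidy_responses_explicit(responses: list[requests.Response]) -> pandas.Series:
        frames = [pandas.DataFrame(r.json()["data"]["search"]["edges"]) for r in responses]
        return pandas.concat(frames, axis=1).stack()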

tidy_configs further shapes the data down to the pyproject.toml data

    def tidy_configs(df: pandas.Series) -> pandas.DataFrame:
        return df.apply(pandas.Series).dropna(subset="object")\
        .set_index("url")["object"].apply(pandas.Series)["text"].apply(tomli.loads).apply(pandas.Series)
    if Ø:
        configs = tidy_configs(df := tidy_responses(responses := gather(pyproject_query, max=15)))
        print(F"""we made {len(responses)} requests returning information about a {len(df)} repositories.
    we retrieved {len(configs)} pyproject configs from this scrape.""")
we made 10 requests returning information about 1000 repositories.
we retrieved 234 pyproject configs from this scrape.
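
before tallying anything, a quick peek (not part of the run above) at which top-level tables came back; pep 518 and pep 621 suggest build-system, project, and tool will be the usual suspects.

    if Ø:
        # not run above - list the top-level pyproject tables we scraped
        print(sorted(configs.columns))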

inspecting the build backend¤

    if Ø:
        builds = configs["build-system"].dropna().apply(pandas.Series)
        print(F"""{len(builds)} projects define a build backends, their specific frequencies are:""")
        display(builds["build-backend"].dropna().value_counts().to_frame("build-backend").T)
173 projects define a build backend; their frequencies are:
|  | setuptools.build_meta | poetry.core.masonry.api | hatchling.build | flit_core.buildapi | poetry.masonry.api | pdm.pep517.api | mesonpy | poetry_dynamic_versioning.backend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| build-backend | 88 | 25 | 15 | 8 | 4 | 2 | 1 | 1 |
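
a possible follow-up that isn't run here: the [build-system] table also carries a requires list per pep 518, so the same frame can rank build requirements.

    if Ø:
        # hypothetical follow-up: count the packages listed in build-system.requires;
        # explode turns each requires list into one row per requirement specifier
        display(builds["requires"].dropna().explode().value_counts().head(10).to_frame("requires").T)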

inspecting the tools¤

the different tool frequencies

    if Ø:
        ranks = configs["tool"].dropna().apply(list).apply(pandas.Series).stack().value_counts()
        display(ranks[(top := ranks>4)].to_frame("top").T,  ranks[~top].to_frame("rest").T)
|  | black | isort | pytest | mypy | coverage | poetry | setuptools_scm | hatch | setuptools | pylint | towncrier | pyright | cibuildwheel | usort | flit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| top | 123 | 85 | 67 | 42 | 34 | 32 | 21 | 15 | 14 | 14 | 11 | 10 | 6 | 5 | 5 |

|  | nbqa | flake8 | tox | ruff | pycln | pydocstyle | autoflake | interrogate | tbump | codespell | ... | versioningit | poe | distutils | setuptools-git-versioning | ufmt | hooky | bandit | versioneer | mutmut | typeshed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rest | 4 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

1 rows × 39 columns
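
the per-tool tables are themselves dicts, so the same trick drills one level deeper; a hypothetical example pulling [tool.black] and counting its line-length setting.

    if Ø:
        # hypothetical drill-down: expand the [tool.black] tables and tally line-length
        black = configs["tool"].dropna().apply(lambda t: t.get("black")).dropna().apply(pandas.Series)
        display(black["line-length"].dropna().value_counts().to_frame("line-length").T)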

fin¤

i think having knowledge at this scope of projects helps in making decisions about what to do with your own. if black is the zeitgeist, why are you stalling?