
exploring many pyproject.toml configs¤

i composed the following query using github's graphql explorer because it has completion, which helps in the composition. i also referred to GitHub GraphQL - Get files in a repository for some ideas about how to compose my query.

  • this work is my first time interacting with graphql for data analysis. i really preferred the GraphiQL experience that provides completion; otherwise i would have been totally lost.
  • it uses requests to retrieve the results, and requests_cache for caching.
  • at the end, we look at some dataframes built from the responses.
    from typing import *; from toolz.curried import *; import operator, pandas, requests, tomli
    pyproject_query = """
    {
      search(type: REPOSITORY, query: "install in:readme language:python stars:>500", first:100 %s) {
        pageInfo {
          hasNextPage endCursor
        }
        edges {
          node {
            ... on Repository {
              url
              stargazerCount
              object(expression: "HEAD:pyproject.toml") {
                ... on Blob {
                  text
                }
              }
            }
          }
        }
      }
    }"""

the graphql query retrieves the pyproject.toml from a bunch of python projects. the initial goal of this query is to discover python projects and retrieve their pyproject.toml files for comparison.

we are looking for pyproject.toml files, which outline strict metadata specifications. it would be cool to get a high-level view of the python conventions popular projects are using.

> i'd love suggestions on a better query that finds more repositories with pyproject.toml files.

paginating the requests to get a bunch of data¤

get_one_page makes a POST to the github graphql endpoint - https://api.github.com/graphql

    def get_one_page(query: str, prior: requests.Response=None, fill: str = "") -> requests.Response: 
        if prior and prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]:
            fill = """, after: "%s" """ % prior.json()["data"]["search"]["pageInfo"]["endCursor"]
        return requests.post("https://api.github.com/graphql", json=dict(query=query % fill), **header)
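
a quick sketch of how the %s placeholder gets filled: the first call leaves it empty, and a follow-up call splices in the endCursor from the prior response. this assumes header, imported in the boilerplate below, is already in scope.

    # hypothetical usage - not run here
    first = get_one_page(pyproject_query)                 # query % ""
    second = get_one_page(pyproject_query, prior=first)   # query % ', after: "<endCursor>" '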

get_pages yields multiple responses, paginating while the search reports more nodes to export.

    def get_pages(query: str, prior=None, max=15):
        for i in range(max):
            prior = get_one_page(query, prior=prior)
            yield prior
            if prior.status_code != 200: break
            if not prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]: break

gather a few pages into a list of responses

    def gather(query: str, max: int=2): return list(get_pages(query, max=max))

analyze some actual data¤

> boilerplate to begin the analysis

    __import__("requests_cache").install_cache(allowable_methods=['GET', 'POST'])
    from info import header # this has some api info 
    pandas.options.display.max_colwidth = None
    Ø = __name__ == "__main__" and "__file__" not in locals()
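
info.py is not included in this post; a minimal sketch of what it might hold, assuming header is just the keyword arguments splatted into requests.post above. the token is a placeholder, not a real value.

    # info.py - a hypothetical stand-in for the module imported above;
    # the github graphql api expects a personal access token in the Authorization header
    header = dict(headers={"Authorization": "bearer <your-github-token>"})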

transform the responses into a big pandas dataframe of configs

tidy_responses transforms our query responses into a single dataframe.

    def tidy_responses(responses: list[requests.Response]) -> pandas.Series:
        return pipe(responses, map(
            compose_left(operator.methodcaller("json"), get("data"), get("search"), get("edges"), pandas.DataFrame)
        ), partial(pandas.concat, axis=1)).stack()
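
the toolz pipeline is terse; here is the same reshaping written as a plain loop, for comparison.

    # an explicit version of tidy_responses: one dataframe of edges per response,
    # concatenated column-wise, then stacked into a single series of node dicts
    def tidy_responses_explicit(responses: list[requests.Response]) -> pandas.Series:
        frames = [pandas.DataFrame(r.json()["data"]["search"]["edges"]) for r in responses]
        return pandas.concat(frames, axis=1).stack()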

tidy_configs further shapes the data down to the pyproject.toml data

    def tidy_configs(df: pandas.Series) -> pandas.DataFrame:
        return df.apply(pandas.Series).dropna(subset="object")\
        .set_index("url")["object"].apply(pandas.Series)["text"].apply(tomli.loads).apply(pandas.Series)
    if Ø:
        configs = tidy_configs(df := tidy_responses(responses := gather(pyproject_query, max=15)))
        print(F"""we made {len(responses)} requests returning information about a {len(df)} repositories.
    we retrieved {len(configs)} pyproject configs from this scrape.""")
we made 10 requests returning information about 1000 repositories.
we retrieved 234 pyproject configs from this scrape.
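
before tallying anything, a quick peek (not part of the run above) at which top-level tables came back; pep 518 and pep 621 suggest build-system, project, and tool will be the usual suspects.

    if Ø:
        # not run above - list the top-level pyproject tables we scraped
        print(sorted(configs.columns))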

inspecting the build backend¤

    if Ø:
        builds = configs["build-system"].dropna().apply(pandas.Series)
        print(F"""{len(builds)} projects define a build backends, their specific frequencies are:""")
        display(builds["build-backend"].dropna().value_counts().to_frame("build-backend").T)
173 projects define a build backend; their frequencies are:
|  | setuptools.build_meta | poetry.core.masonry.api | hatchling.build | flit_core.buildapi | poetry.masonry.api | pdm.pep517.api | mesonpy | poetry_dynamic_versioning.backend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| build-backend | 88 | 25 | 15 | 8 | 4 | 2 | 1 | 1 |
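
a possible follow-up that isn't run here: the [build-system] table also carries a requires list per pep 518, so the same frame can rank build requirements.

    if Ø:
        # hypothetical follow-up: count the packages listed in build-system.requires;
        # explode turns each requires list into one row per requirement specifier
        display(builds["requires"].dropna().explode().value_counts().head(10).to_frame("requires").T)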

inspecting the tools¤

the different tool frequencies

    if Ø:
        ranks = configs["tool"].dropna().apply(list).apply(pandas.Series).stack().value_counts()
        display(ranks[(top := ranks>4)].to_frame("top").T,  ranks[~top].to_frame("rest").T)
|  | black | isort | pytest | mypy | coverage | poetry | setuptools_scm | hatch | setuptools | pylint | towncrier | pyright | cibuildwheel | usort | flit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| top | 123 | 85 | 67 | 42 | 34 | 32 | 21 | 15 | 14 | 14 | 11 | 10 | 6 | 5 | 5 |

|  | nbqa | flake8 | tox | ruff | pycln | pydocstyle | autoflake | interrogate | tbump | codespell | ... | versioningit | poe | distutils | setuptools-git-versioning | ufmt | hooky | bandit | versioneer | mutmut | typeshed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rest | 4 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

1 rows × 39 columns
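
the per-tool tables are themselves dicts, so the same trick drills one level deeper; a hypothetical example pulling [tool.black] and counting its line-length setting.

    if Ø:
        # hypothetical drill-down: expand the [tool.black] tables and tally line-length
        black = configs["tool"].dropna().apply(lambda t: t.get("black")).dropna().apply(pandas.Series)
        display(black["line-length"].dropna().value_counts().to_frame("line-length").T)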

fin¤

i think having knowledge at this scope of projects helps in making decisions about what to do with your own. if black is the zeitgeist, why are you stalling?