exploring many pyproject.toml configs¤
i composed the following query using github's graphql explorer because its completion support helps with composition. i also referred to GitHub GraphQL - Get files in a repository for ideas about how to structure the query.
- this work is my first time using graphql for data analysis. i really appreciated the GraphiQL experience that provides completion; otherwise i would have been totally lost.
- it uses requests to retrieve the results, and requests_cache for caching.
- at the end, we start looking at some dataframes built from our requests.
from typing import *; from toolz.curried import *; import operator, pandas, requests, tomli
pyproject_query = """
{
  search(type: REPOSITORY, query: "install in:readme language:python stars:>500", first:100 %s) {
    pageInfo {
      hasNextPage endCursor
    }
    edges {
      node {
        ... on Repository {
          url
          stargazerCount
          object(expression:"HEAD:pyproject.toml") {
            ... on Blob {
              text
            }
          }
        }
      }
    }
  }
}"""
the graphql query above retrieves the pyproject.toml from a bunch of python projects.
the goal is to discover popular python projects and retrieve their pyproject.toml files for comparison.
we focus on pyproject.toml because it follows a strict metadata specification, which makes the configs easy to compare.
it would be cool to get a high level view of the python conventions popular projects are using.
> i'd love suggestions on a better query that finds more repositories with pyproject.toml files.
paginating the requests to get a bunch of data¤
get_one_page makes a POST to the github graphql endpoint - https://api.github.com/graphql
def get_one_page(query: str, prior: requests.Response = None, fill: str = "") -> requests.Response:
    # if the prior page reports more results, continue the search after its end cursor
    if prior and prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]:
        fill = """, after: "%s" """ % prior.json()["data"]["search"]["pageInfo"]["endCursor"]
    return requests.post("https://api.github.com/graphql", json=dict(query=query % fill), **header)
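get_one_page splats **header into requests.post; that dict comes from a local info module imported in the boilerplate below, which this post doesn't show. a minimal sketch of what it might export, assuming authentication with a github personal access token:

# hypothetical contents of info.py - the real module is not shown in this post.
# header is splatted into requests.post, so it holds keyword arguments,
# here a headers mapping carrying a github personal access token.
header = dict(headers={"Authorization": "bearer <your-personal-access-token>"})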
get_pages yields multiple responses while the search results are paginated.
def get_pages(query: str, prior=None, max=15):
    # request successive pages, stopping on an error status or when pagination ends
    for i in range(max):
        prior = get_one_page(query, prior=prior)
        yield prior
        if prior.status_code != 200: break
        if not prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]: break
gather a few pages into a list of responses
def gather(query: str, max: int=2): return list(get_pages(query, max=max))
analyze some actual data¤
> boilerplate to begin the analysis
__import__("requests_cache").install_cache(allowable_methods=['GET', 'POST'])  # cache POSTs too, so reruns don't hit the api
from info import header  # this has some api info
pandas.options.display.max_colwidth = None
Ø = __name__ == "__main__" and "__file__" not in locals()  # true when running interactively, not when imported or run as a script
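before committing to the full scrape, a quick smoke test of one page is cheap thanks to the cache; the errors check below is just the standard graphql response shape, which the helpers above don't inspect:

if Ø:
    smoke = get_one_page(pyproject_query)
    payload = smoke.json()
    # a graphql response can carry an "errors" list even alongside a 200 status code
    print(smoke.status_code, "errors" in payload, len((payload.get("data") or {}).get("search", {}).get("edges", [])))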
transform the responses into a big pandas dataframe of configs
tidy_responses transforms our query responses into a single dataframe.
def tidy_responses(responses: list[requests.Response]) -> pandas.DataFrame:
    # each response becomes a dataframe of edges; concatenate them and stack into one series of nodes
    return pipe(responses, map(
        compose_left(operator.methodcaller("json"), get("data"), get("search"), get("edges"), pandas.DataFrame)
    ), partial(pandas.concat, axis=1)).stack()
tidy_configs further narrows the data down to the parsed pyproject.toml contents
def tidy_configs(df: pandas.DataFrame) -> pandas.DataFrame:
    # expand the nodes, keep repositories that actually have a pyproject.toml, and parse the toml text
    return df.apply(pandas.Series).dropna(subset="object")\
        .set_index("url")["object"].apply(pandas.Series)["text"].apply(tomli.loads).apply(pandas.Series)
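to see the shape these two helpers expect without hitting the api, here is a toy run on a stubbed response; the class, url, and toml text are made-up fixtures that mirror the query payload above:

class StubResponse:
    # stands in for requests.Response; only the json method is needed by tidy_responses
    def __init__(self, payload): self.payload = payload
    def json(self): return self.payload

stub = StubResponse({"data": {"search": {"pageInfo": {"hasNextPage": False, "endCursor": None}, "edges": [
    {"node": {"url": "https://github.com/example/project", "stargazerCount": 1234,
              "object": {"text": "[build-system]\nrequires = ['hatchling']\nbuild-backend = 'hatchling.build'"}}}]}}})
tidy_configs(tidy_responses([stub]))  # one row indexed by url, with a build-system column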
if Ø:
    configs = tidy_configs(df := tidy_responses(responses := gather(pyproject_query, max=15)))
    print(F"""we made {len(responses)} requests returning information about {len(df)} repositories.
we retrieved {len(configs)} pyproject configs from this scrape.""")
inspecting the build backend¤
if Ø:
    builds = configs["build-system"].dropna().apply(pandas.Series)
    print(F"""{len(builds)} projects define a build system; the specific build backend frequencies are:""")
    display(builds["build-backend"].dropna().value_counts().to_frame("build-backend").T)
inspecting the tools¤
the frequencies of the different [tool] table entries
if Ø:
    ranks = configs["tool"].dropna().apply(list).apply(pandas.Series).stack().value_counts()
    display(ranks[(top := ranks>4)].to_frame("top").T, ranks[~top].to_frame("rest").T)
fin¤
i think having knowledge at this scope of projects helps in making decisions about your own.
if black is the zeitgeist, why are you stalling?