exploring many pyproject.toml configs
i composed the following query using github's graphql explorer because it has completion which helps in the composition. i also referred to GitHub GraphQL - Get files in a repository for some ideas about how to compose my query.
- this work is my first time interacting with graphql for data analysis. i really preferred the GraphiQL experience that provides completion, otherwise i would have been totally lost.
- it uses `requests` to retrieve the results, and `requests_cache` for caching.
- at the end, we start looking at some dataframes from our requests.
from typing import *; from toolz.curried import *; import operator, pandas, requests, tomli; from functools import partial
pyproject_query = """
{
  search(type: REPOSITORY, query: "install in:readme language:python stars:>500", first:100 %s) {
    pageInfo {
      hasNextPage endCursor
    }
    edges {
      node {
        ... on Repository {
          url
          stargazerCount
          object(expression:"HEAD:pyproject.toml") {
            ... on Blob {
              text
            }
          }
        }
      }
    }
  }
}"""
the graphql query retrieves the `pyproject.toml` from a bunch of python projects. the initial goal of this query is to discover python projects and retrieve their `pyproject.toml` files for comparison. we are looking for `pyproject.toml` files, which outline strict metadata specifications. it would be cool to get a high level view of the python conventions popular projects are using.
> i'd love suggestions on a better query that finds more repositories with `pyproject.toml` files.
paginating the requests to get a bunch of data
`get_one_page` makes a POST request to the github graphql endpoint - https://api.github.com/graphql
def get_one_page(query: str, prior: requests.Response=None, fill: str = "") -> requests.Response:
if prior and prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]:
fill = """, after: "%s" """ % prior.json()["data"]["search"]["pageInfo"]["endCursor"]
return requests.post("https://api.github.com/graphql", json=dict(query=query % fill), **header)
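as a usage sketch, two pages could be fetched by hand like this; it assumes `header` (imported in the boilerplate section below) carries a valid github token.
first_page = get_one_page(pyproject_query)                      # first 100 repositories
page_info = first_page.json()["data"]["search"]["pageInfo"]
if page_info["hasNextPage"]:
    # passing the prior response fills the `after:` cursor into the query
    second_page = get_one_page(pyproject_query, prior=first_page)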
`get_pages` yields multiple responses, following the pagination cursor while there are more nodes to export.
def get_pages(query: str, prior=None, max=15):
for i in range(max):
prior = get_one_page(query, prior=prior)
yield prior
if prior.status_code != 200: break
if not prior.json()["data"]["search"]["pageInfo"]["hasNextPage"]: break
`gather` collects a few pages into a `list` of responses.
def gather(query: str, max: int=2): return list(get_pages(query, max=max))
analyze some actual data
> boilerplate to begin the analysis
__import__("requests_cache").install_cache(allowable_methods=['GET', 'POST'])
from info import header # this has some api info
pandas.options.display.max_colwidth = None
Ø = __name__ == "__main__" and "__file__" not in locals()
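the `header` imported from `info` isn't shown in this post; presumably it looks something like the dictionary below (hypothetical, with a placeholder token), since it is splatted into `requests.post` as keyword arguments.
header = dict(headers={"Authorization": "bearer <your github token>"})   # hypothetical info.py contents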
transform the responses into a big `pandas` dataframe of configs
`tidy_responses` transforms our query responses into a single dataframe.
def tidy_responses(responses: list[requests.Response]) -> pandas.DataFrame:
return pipe(responses, map(
compose_left(operator.methodcaller("json"), get("data"), get("search"), get("edges"), pandas.DataFrame)
), partial(pandas.concat, axis=1)).stack()
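to make the nested access concrete, each response's json has roughly the shape below; the values are illustrative placeholders, not real data.
example_payload = {   # illustrative only, mirrors the fields requested in pyproject_query
    "data": {"search": {
        "pageInfo": {"hasNextPage": True, "endCursor": "an opaque cursor"},
        "edges": [{"node": {
            "url": "https://github.com/some/repo",
            "stargazerCount": 1234,
            "object": {"text": "[build-system]\nrequires = ['setuptools']\n"},
        }}],
    }},
}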
`tidy_configs` further shapes the data down to the `pyproject.toml` contents.
def tidy_configs(df: pandas.DataFrame) -> pandas.DataFrame:
return df.apply(pandas.Series).dropna(subset="object")\
.set_index("url")["object"].apply(pandas.Series)["text"].apply(tomli.loads).apply(pandas.Series)
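as a standalone sketch of the reshaping idea: `tomli.loads` turns each config into a dict whose top-level tables become the dataframe columns.
sample = "[build-system]\nbuild-backend = 'setuptools.build_meta'\n\n[tool.black]\nline-length = 88\n"
pandas.Series(tomli.loads(sample)).to_frame().T   # one row with columns: build-system, tool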
if Ø:
configs = tidy_configs(df := tidy_responses(responses := gather(pyproject_query, max=15)))
print(F"""we made {len(responses)} requests returning information about a {len(df)} repositories.
we retrieved {len(configs)} pyproject configs from this scrape.""")
inspecting the build backend
if Ø:
builds = configs["build-system"].dropna().apply(pandas.Series)
print(F"""{len(builds)} projects define a build backends, their specific frequencies are:""")
display(builds["build-backend"].dropna().value_counts().to_frame("build-backend").T)
inspecting the tools
the different tool frequencies
if Ø:
ranks = configs["tool"].dropna().apply(list).apply(pandas.Series).stack().value_counts()
display(ranks[(top := ranks>4)].to_frame("top").T, ranks[~top].to_frame("rest").T)
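for reference on the `apply(list)` step above, listing a row's `[tool]` table just gives the tool names that get counted; a toy sketch:
list({"black": {"line-length": 88}, "isort": {}})   # -> ['black', 'isort']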
fin
i think having knowledge at this scope of projects helps in making decisions about what to do with your own. if `black` is the zeitgeist, why are you stalling?