Search Non-Text Files in a GitHub Repository
This tutorial shows how to combine the GitHub Search API with Mixpeek's Index API to search the contents of non-text files in a repository.
GitHub is Your Product's Source of Truth
When developing open source software (OSS), it's common to include product screenshots, demo videos, and even documents in your repositories. As the product evolves, it becomes challenging to keep these assets in sync with the current version.
When a repo's assets show older versions of the product, hundreds or thousands of users can end up confused.
In this tutorial, we'll use the Mixpeek API to index an example GitHub repository, discover stale screenshots, and flag them for updating.
Find Screenshots
First, let's write a script that finds all PNG pictures in the top level of a local clone of the repository and builds their raw URLs on the main branch:
import os

def list_of_pictures(username, repository, branch):
    # Set the directory where the GitHub repository is located
    # This example assumes the repository is cloned into a directory named "repo"
    directory = "repo"
    # Base URL under which GitHub serves the raw files of the given branch
    full_repo_path = "https://raw.githubusercontent.com/{}/{}/{}/".format(username, repository, branch)
    pictures = []
    # Loop through all the files and directories in the repository directory
    for item in os.listdir(directory):
        # Check if the item is a file
        if os.path.isfile(os.path.join(directory, item)):
            # Check if the file has a .png extension
            if item.endswith(".png"):
                # Append the full raw URL of the picture to the list
                pictures.append(full_repo_path + item)
    return pictures
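The function above only looks at the top level of a local clone. As an alternative sketch that needs no clone, GitHub's REST API can return a recursive file listing for a branch, which can then be filtered by extension. The helper names `picture_urls_from_tree` and `fetch_tree` and the extension list are assumptions made for illustration and are not part of the original script. Unauthenticated requests are rate-limited, and very large trees may come back truncated.

```python
import json
import urllib.parse
import urllib.request

# Assumed set of image extensions; adjust to match your repository
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def picture_urls_from_tree(tree, username, repository, branch,
                           extensions=IMAGE_EXTENSIONS):
    """Build raw.githubusercontent.com URLs for image files in a tree listing."""
    base = "https://raw.githubusercontent.com/{}/{}/{}/".format(
        username, repository, branch)
    urls = []
    for entry in tree:
        # "blob" entries are files; "tree" entries are directories
        if entry.get("type") != "blob":
            continue
        path = entry.get("path", "")
        if path.lower().endswith(extensions):
            # Percent-encode spaces and similar characters, keeping "/" intact
            urls.append(base + urllib.parse.quote(path))
    return urls


def fetch_tree(username, repository, branch):
    """Fetch the recursive file listing of a branch from the GitHub REST API."""
    url = "https://api.github.com/repos/{}/{}/git/trees/{}?recursive=1".format(
        username, repository, branch)
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("tree", [])
```

A call such as `picture_urls_from_tree(fetch_tree("mixpeek", "use-cases", "main"), "mixpeek", "use-cases", "main")` would then produce a list of URLs in the same shape as the one returned by `list_of_pictures`, including pictures in subdirectories.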
Index and Search
Next we find all pictures in the main branch and index them:
from mixpeek import Mixpeek

# Replace "MIXPEEK_API_KEY" with your own API key
mix = Mixpeek(mixpeek_key="MIXPEEK_API_KEY")

pictures = list_of_pictures(
    username="mixpeek",
    repository="use-cases",
    branch="main"
)
mix.index(pictures)
Now we can search for text from a button that has recently changed in our product.
mix.search("sync now")
[
    {
        "file_url": "https://raw.githubusercontent.com/mixpeek/use-cases/main/screenshot-1.png",
        "file_id": "63738f90829faf6a25053f61",
        "importance": "100%"
    },
    {
        "file_url": "https://raw.githubusercontent.com/mixpeek/use-cases/main/screenshot-2.png",
        "file_id": "63738f90829faf6a25053f62",
        "importance": "98%"
    },
    {
        "file_url": "https://raw.githubusercontent.com/mixpeek/use-cases/main/screenshot-3.png",
        "file_id": "63738f90829faf6a25053f63",
        "importance": "90%"
    }
]
Now we can update each of these product screenshots. 😅 We've just saved hours of time and rescued our prized users from confusion.
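To turn the search results into a to-do list, the matches can be filtered by their importance score. The following is a minimal sketch that assumes the results are a list of dictionaries shaped like the sample output above, with `importance` given as a percentage string. The function name `flag_stale` and the 95% threshold are hypothetical choices, not part of Mixpeek's API.

```python
def flag_stale(results, min_importance=95.0):
    """Return the URLs of results whose importance meets the threshold."""
    flagged = []
    for result in results:
        # Scores are assumed to be strings such as "98%", as in the sample output
        score = float(result["importance"].rstrip("%"))
        if score >= min_importance:
            flagged.append(result["file_url"])
    return flagged


# Sample data copied from the search output shown earlier
results = [
    {
        "file_url": "https://raw.githubusercontent.com/mixpeek/use-cases/main/screenshot-1.png",
        "file_id": "63738f90829faf6a25053f61",
        "importance": "100%",
    },
    {
        "file_url": "https://raw.githubusercontent.com/mixpeek/use-cases/main/screenshot-2.png",
        "file_id": "63738f90829faf6a25053f62",
        "importance": "98%",
    },
    {
        "file_url": "https://raw.githubusercontent.com/mixpeek/use-cases/main/screenshot-3.png",
        "file_id": "63738f90829faf6a25053f63",
        "importance": "90%",
    },
]

for url in flag_stale(results):
    print("Needs updating:", url)
```

A lower threshold catches more candidates at the cost of more false positives, so the cutoff is worth tuning against a few screenshots you already know are stale.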
Other Use Cases
- Word Count: Analyze how often certain words appear across your repo.
- Animations & Video: Repos often include GIFs, and these can be searched and flagged for updating too.
- Security: Ensure nobody is including API keys or secrets in product screenshots.
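As a rough illustration of the security use case, the sketch below scans text that has already been extracted from screenshots for strings that look like credentials. How the extracted text is obtained is not covered here, so the input is assumed to be a plain dictionary mapping file URLs to text. The function name `find_secrets` is hypothetical, and the two patterns (AWS access key IDs and GitHub personal access tokens) are only examples; a real scan would need a broader rule set.

```python
import re

# Example patterns only; extend this with the credential formats you care about
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def find_secrets(extracted_text):
    """Return (file_url, pattern_name) pairs for text that matches a pattern.

    extracted_text is assumed to map each file URL to the text found in it.
    """
    findings = []
    for file_url, text in extracted_text.items():
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                findings.append((file_url, name))
    return findings
```

Any file that appears in the result is a candidate for replacing the screenshot and rotating the exposed credential.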