Search Non-Text Files in a GitHub Repository

This tutorial shows how to combine the GitHub Search API with Mixpeek's Index API to search the contents of files in a repository.

Photo by Yancy Min / Unsplash

GitHub is Your Product's Source of Truth

When developing open source software (OSS), it's common to include product screenshots, demo videos, and even documents in your repositories. As the product evolves, it becomes challenging to keep these assets in sync with the current version.

When a repo's assets reference older versions, users see an interface that no longer matches the product, and in a popular project that can mean hundreds or thousands of confused users.

In this tutorial, we'll use the Mixpeek API to index an example GitHub repository, discover stale screenshots, and flag them for updating.

Find Screenshots

First, let's write a script that finds all PNG pictures in a local checkout of a repository's main branch:

import os

def list_of_pictures(username, repository, branch):
  # Set the directory where the GitHub repository is located
  # This example assumes the repository is in a directory named "repo"
  directory = "repo"
  full_repo_path = "{}/{}/{}/".format(username, repository, branch)
  pictures = []

  # Loop through all the files and directories at the top level of the repository directory
  for item in os.listdir(directory):
    # Check if the item is a file
    if os.path.isfile(os.path.join(directory, item)):
      # Check if the file has a .png extension
      if item.endswith(".png"):
        # Append the full repository path to the list
        pictures.append(full_repo_path + item)

  return pictures
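The function above only inspects the top level of the checkout and only matches .png files. As a minimal sketch of a recursive variant, assuming a local clone and that the files are publicly served from raw.githubusercontent.com, something like the following could be used. The function name list_picture_urls, the local_dir parameter, and the extension set are illustrative and not part of the original script.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to what your repository contains.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}

def list_picture_urls(username, repository, branch, local_dir):
    """Walk a local clone and return raw GitHub URLs for every image file."""
    root = Path(local_dir)
    base = "https://raw.githubusercontent.com/{}/{}/{}/".format(
        username, repository, branch
    )
    urls = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip Git's internal metadata directory
        if ".git" in rel.parts:
            continue
        # Compare extensions case-insensitively so "shot.PNG" is also found
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            urls.append(base + rel.as_posix())
    return urls
```

This sketch does not URL-encode paths, so file names containing spaces or other special characters would need extra handling before being passed to an indexing service.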

Next we find all pictures in the main branch and index them:

from mixpeek import Mixpeek

mix = Mixpeek(mixpeek_key="MIXPEEK_API_KEY")

pictures = list_of_pictures(


Now we can search for text from a button that has recently changed in our product:

"sync now")

[
  {
    "file_url": "",
    "file_id": "63738f90829faf6a25053f61",
    "importance": "100%"
  },
  {
    "file_url": "",
    "file_id": "63738f90829faf6a25053f62",
    "importance": "98%"
  },
  {
    "file_url": "",
    "file_id": "63738f90829faf6a25053f63",
    "importance": "90%"
  }
]
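Once results like these come back, it can help to rank them and keep only the strongest matches before opening each screenshot. The sketch below assumes the results are available as a list of dictionaries with the file_id and importance fields shown above; the helper names and the 95% threshold are hypothetical choices, not part of the Mixpeek API.

```python
def parse_importance(value):
    """Convert a percentage string such as "98%" to a float (98.0)."""
    return float(value.strip().rstrip("%"))

def files_to_review(results, threshold=95.0):
    """Return file_ids whose importance meets the threshold, highest first."""
    scored = [(parse_importance(r["importance"]), r["file_id"]) for r in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [file_id for score, file_id in scored if score >= threshold]

# Sample data copied from the search output above
results = [
    {"file_url": "", "file_id": "63738f90829faf6a25053f61", "importance": "100%"},
    {"file_url": "", "file_id": "63738f90829faf6a25053f62", "importance": "98%"},
    {"file_url": "", "file_id": "63738f90829faf6a25053f63", "importance": "90%"},
]

print(files_to_review(results))
```

With the default threshold, the first two files are kept and the 90% match is dropped; lowering the threshold widens the review list.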

Now we can update each of these product screenshots. 😅 We've just saved hours of time and rescued our prized users from confusion.

Other Use Cases

  • Word Count: Analyze how often certain words appear across your repo.
  • Animations & Video: Repos often include GIFs and videos, and these can be searched for updating too.
  • Security: Ensure nobody is including API keys or secrets in the product screenshots.
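To illustrate the security use case, here is a rough sketch that scans text already extracted from screenshots for strings shaped like credentials. It assumes the extracted text is available as a dictionary keyed by file name; how that text is obtained is outside this sketch. The function name find_secrets and the two patterns (the commonly documented shapes of AWS access key IDs and GitHub personal access tokens) are illustrative and far from exhaustive.

```python
import re

# Assumed patterns for two common credential formats; extend as needed.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GitHub personal access token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}

def find_secrets(extracted_text_by_file):
    """Return (file_name, pattern_label) pairs for every suspicious match."""
    findings = []
    for file_name, text in extracted_text_by_file.items():
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                findings.append((file_name, label))
    return findings

# Hypothetical extracted text; the key is the example ID from AWS documentation
sample = {
    "settings.png": "Access key: AKIAIOSFODNN7EXAMPLE",
    "dashboard.png": "Click sync now to refresh",
}
print(find_secrets(sample))
```

A pattern match is only a hint that a screenshot deserves a manual look; it does not prove that a live credential was exposed.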