I recently enrolled in Harvard's prestigious CS50: Introduction to Computer Science course, and after several weeks I was completely lost. I had to re-learn the dreaded bubble sort.

The course is hosted on edX, which unfortunately doesn't have a search bar to find the exact moment "bubble sort" was mentioned.

An intelligent file store would let students like me search the contents of every enrolled curriculum. Super useful, right? So I got cracking to figure out what a darn bubble sort is, straight from the professor's mouth.

💡
Not only is this tedious, it's also extremely computationally expensive. There is a better way.

If you're more of a visual learner, here's a video walkthrough:

Split the Lecture into Chunks

Now, we could skip this step and just chunk the audio exports, but saving the original video chunks gives us the flexibility to do advanced OCR and object detection later on.
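If the lecture only exists as a video, the audio track can be exported first. Here's a quick sketch using moviepy; the filenames are just placeholders:

from moviepy.editor import VideoFileClip

# Export the lecture's audio track so it can be chunked below
VideoFileClip("lecture_03.mp4").audio.write_audiofile("lecture_03.mp3")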

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

# Load the audio exported from the lecture video
sound = AudioSegment.from_file("lecture_03.mp3")

# Detect the non-silent ranges (in milliseconds) so each chunk keeps its timestamps
ranges = detect_nonsilent(
    sound,
    min_silence_len=500,
    silence_thresh=-16,
    seek_step=1
)

def to_timestamp(ms):
    # Format milliseconds as HH-MM-SS-mmm for use in filenames
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}-{minutes:02d}-{seconds:02d}-{millis:03d}"

# Save each chunk as a separate audio file with the timestamp range in the
# filename, keeping 500 ms of silence on either side of the chunk
for start_ms, end_ms in ranges:
    chunk = sound[max(0, start_ms - 500):end_ms + 500]
    chunk.export(f"audio_{to_timestamp(start_ms)}_{to_timestamp(end_ms)}.mp3", format="mp3")

Transcribe the Audio of each Chunk

Iterate over the directory of audio chunks and transcribe each one, saving the result to a text file.

import glob
from moviepy.editor import AudioFileClip

for filepath in sorted(glob.glob("audio_*.mp3")):
    # Open the audio chunk using moviepy
    audio = AudioFileClip(filepath)
    # Transcribe the audio using your chosen transcription service
    transcript = transcribe_audio_file(audio)
    # Save the transcript alongside the audio chunk
    open(filepath.replace(".mp3", ".txt"), "w").write(transcript)
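The transcribe_audio_file helper above is a stand-in for whatever speech-to-text service you prefer. Here's a minimal sketch of what it could look like using the open-source Whisper library; this is just one option, not necessarily what was used for the demo:

import whisper

# Load a small Whisper model once, then reuse it for every chunk
stt_model = whisper.load_model("base")

def transcribe_audio_file(clip):
    # Whisper works directly from a file path, so we transcribe the
    # underlying file rather than the moviepy clip object itself
    result = stt_model.transcribe(clip.filename)
    return result["text"]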
                    

Generate and Store Vector Embeddings

We need to convert each transcript into its contextual vector representation.

💡
Vector embeddings are a complicated topic. Read more about them here.

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Generate vector embeddings from the transcribed audio
vector = model.encode(transcript)

Do the Same for the Syllabus and Slides

Using off-the-shelf OCR and text-extraction tooling, we pull out the contents and convert them into vector representations as well.

from tika import parser

# Extract the syllabus content via OCR
parsed = parser.from_file('syllabus.pdf')

# Generate vector embeddings from the PDF syllabus
vector = model.encode(parsed['content'])

Store in Vector Search Engine

We will be using MongoDB's kNN Vector Search to store and retrieve the vector representations.

# Create an embedding field for each transcription object
for transcript in transcripts:
    # Convert the transcript to an embedding, then to a plain list
    embedding = model.encode(transcript).tolist()
    vector_collection.insert_one({
        "text": transcript,
        "embedding": embedding
    })
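For the kNN operator to work, the collection needs an Atlas Search index that maps the embedding field as a knnVector. Here's a sketch of that index definition (shown as a Python dict for reference; in practice it's created in the Atlas UI or via the API), assuming the 384-dimensional vectors produced by all-MiniLM-L6-v2:

# Atlas Search index definition for the `embedding` field
index_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "embedding": {
                "type": "knnVector",
                "dimensions": 384,   # all-MiniLM-L6-v2 outputs 384-dim vectors
                "similarity": "cosine"
            }
        }
    }
}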

There are many vector search engines out there; review this comparison to make your own selection.

Search Across Everything

Running our query, "bubble sort", through the same encoder generates a vector, which we supply to MongoDB's kNN operator.

query = "bubble sort"                      
vector_query = model.encode(query).tolist()

pipeline = [
    {
        "$search": {
            "knnBeta": {
                "vector": vector_query,
                "path": "embedding",
                "k": 10
            }
        }
    }
]                      
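Running the pipeline is a plain aggregation against the collection that holds the embeddings. A quick sketch, assuming the vector_collection and document shape from the earlier snippet:

# Execute the kNN search and inspect the matching transcript chunks
results = vector_collection.aggregate(pipeline)

for doc in results:
    # Each document stores the original transcript text alongside its embedding
    print(doc["text"][:120])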
                    

All this Messiness in a Single API

Point the Mixpeek agent at your S3 bucket and let us handle all the work. Then just search like you would on Google.

from mixpeek import Mixpeek

mix = Mixpeek(
    api_key="mixpeek_api_key",
    access_key="aws_access_key",
    secret_key="aws_secret_key",
    region="region"
)

mix.index_bucket("mixpeek-public-demo")

Searching for "bubble sort" against the indexed bucket, notice how the response includes matches across both the lecture video and the slides:

[
    {
        "file_url": "s3://bucket/lecture_03_video.mp4",
        "file_id": "63738f90829faf6a25053f64",
        "importance": "100%",
        "meta": {
            "timestamp": 5785
        } 
    },
    {
        "file_url": "s3://bucket/lecture_03_slides.pdf",
        "file_id": "63738fa2829faf6a25053f65",
        "importance": "98%",
        "meta": {
            "slide_number": 69
        } 
    }    
]

Voilà: "bubble sort" is mentioned in Lecture 03 at timestamp 1:36:37, as well as on slide 69 of the Lecture 03 deck. Now I can spend more time studying and less time hunting for content.

We can even get all the timestamps where similar topics are mentioned:

mix.tooling(results, get_similar=True)

Other Education Use Cases

  • Multimedia: Creating multimedia learning materials that allow students to search for information using text, voice, and images.
  • Lessons: Developing interactive learning games that help students learn and retain information.
  • Tutoring: Creating virtual tutoring and mentoring programs that help students find answers to their questions in real-time.
  • Personalization: Developing personalized learning experiences tailored to individual students' stages of learning.
About the author
Ethan Steininger

Former GTM Lead of MongoDB's NLP platform, Atlas Search. Occasionally off the grid in his self-converted camper van.
