Automated File Type Identification in an S3 Bucket with Python

This tutorial builds a Python script that identifies the file type of every object in an Amazon S3 bucket using the boto3 and magic libraries.


Introduction

In this tutorial, we will build a simple Python script that identifies the file type of every object in an Amazon S3 bucket. We will use the boto3 library to interact with the S3 API, and the magic library to identify file types from the objects' contents rather than from their names or extensions.

Prerequisites

Before we begin, make sure that you have the following prerequisites:

  • An AWS account and credentials with access to an S3 bucket.
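  • The boto3 and python-magic libraries installed on your system. You can install them with pip install boto3 python-magic. The python-magic package is imported as magic and depends on the libmagic system library, which must be installed separately (for example, through your operating system's package manager).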

Step 1: Connect to the S3 bucket

The first step is to connect to the S3 bucket that you want to scan. You can do this by creating a boto3 client for the S3 service and calling the list_objects method:

import boto3

# Connect to the S3 service
s3 = boto3.client('s3')

# List the objects in the bucket
objects = s3.list_objects(Bucket='my-bucket')

This returns a response dictionary. Its Contents key holds a list of dictionaries, each representing a single object in the bucket with metadata such as its key (file name), size, and last modified date. A single call returns at most 1,000 objects, and the Contents key is absent when the bucket is empty.
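Because a single list_objects call returns at most 1,000 objects, larger buckets need pagination. The following is a minimal sketch of one way to do that with boto3's paginator for list_objects_v2; the helper name iter_object_keys is an illustrative assumption, not part of the original script. The client is passed in as a parameter so the helper can be exercised without AWS access.

```python
def iter_object_keys(s3_client, bucket):
    """Yield every object key in `bucket`, following pagination.

    `s3_client` is assumed to be a boto3 S3 client (boto3.client('s3')).
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is missing from a page that holds no objects
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   for key in iter_object_keys(boto3.client("s3"), "my-bucket"):
#       print(key)
```

Writing the helper as a generator means keys are produced page by page, so the full listing never has to sit in memory at once.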

Step 2: Iterate over the objects and identify their file types

Next, we will iterate over the list of objects and use the magic library to identify the file type of each one. The magic library is a wrapper for the libmagic library, which identifies file types from their contents rather than from their names.
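To make "based on their contents" concrete, here is a toy sketch of the underlying idea: many formats begin with a fixed byte signature (a "magic number"). This is only an illustration, not how libmagic is implemented; its real database is far larger and also inspects data beyond the first bytes. The names SIGNATURES and guess_type are hypothetical.

```python
# A handful of well-known file signatures and the MIME types they indicate.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]


def guess_type(data: bytes) -> str:
    """Return a MIME type for `data`, or a generic fallback if nothing matches."""
    for signature, mime_type in SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return "application/octet-stream"
```

For example, guess_type(b"%PDF-1.7") returns "application/pdf" no matter what the file is called, which is why content-based detection still works when an extension is missing or misleading.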

To use the magic library, create a magic.Magic object and call its from_buffer method on a file's contents. Passing mime=True makes it return a MIME type (such as image/png) instead of a longer textual description:

import magic

# Create a Magic object
mime = magic.Magic(mime=True)

# Iterate over the objects in the bucket ('Contents' is absent if the bucket is empty)
for obj in objects.get('Contents', []):
    # Get the object's key (file name) and contents
    key = obj['Key']
    data = s3.get_object(Bucket='my-bucket', Key=key)['Body'].read()

    # Identify the file type
    file_type = mime.from_buffer(data)
    print(f'{key}: {file_type}')

This prints the key and detected MIME type of each object in the bucket. Note that it downloads every object in full, which can be slow and costly for large objects.
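Since libmagic generally only needs the start of a file, one possible refinement is to request just the first bytes of each object with an HTTP Range header. The sketch below assumes a 2,048-byte prefix is enough (the python-magic documentation recommends at least that much) and that the object's size is taken from the listing; the function name sniff_mime_type and its parameters are illustrative, not part of the original script.

```python
def sniff_mime_type(s3_client, detector, bucket, key, size, head_bytes=2048):
    """Return the MIME type of one object, downloading only its first bytes.

    `s3_client` is assumed to be a boto3 S3 client, `detector` a
    magic.Magic(mime=True) instance, and `size` the object's 'Size'
    value from the bucket listing.
    """
    if size == 0:
        # A ranged GET on an empty object is typically rejected by S3,
        # so skip the download and report no type.
        return None
    response = s3_client.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes=0-{head_bytes - 1}",  # HTTP byte ranges are inclusive
    )
    head = response["Body"].read()
    return detector.from_buffer(head)


# Usage (requires boto3, python-magic, and AWS credentials):
#   import boto3, magic
#   s3 = boto3.client("s3")
#   mime = magic.Magic(mime=True)
#   for obj in s3.list_objects(Bucket="my-bucket").get("Contents", []):
#       print(obj["Key"], sniff_mime_type(s3, mime, "my-bucket", obj["Key"], obj["Size"]))
```

The trade-off is that a short prefix may occasionally be less precise than the full contents, so head_bytes is left configurable.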

Step 3: Store the file types in a dictionary (optional)

To keep the results instead of only printing them, initialize an empty dictionary before the loop and add each file type to it as you iterate over the objects:

# Initialize an empty dictionary
file_types = {}

# Iterate over the objects in the bucket ('Contents' is absent if the bucket is empty)
for obj in objects.get('Contents', []):
    # Get the object's key (file name) and contents
    key = obj['Key']
    data = s3.get_object(Bucket='my-bucket', Key=key)['Body'].read()

    # Identify the file type
    file_type = mime.from_buffer(data)

    # Add the file type to the dictionary