Traindex | Multipart Upload for Large Files using Pre-Signed URLs

It’s mind-blowing how fast data is growing. It is now possible to collect raw data with a frequency of more than a million requests per second. Storage is quicker and cheaper. It is normal to store data practically forever, even if it is rarely accessed.

Users of Traindex can upload large data files to create a semantic search index. This article will explain how we implemented the multipart upload feature that allows Traindex users to upload large files.

Problems and their Solutions

We wanted to allow users of Traindex to upload large files, typically 1-2 TB, to Amazon S3 in minimum time and with appropriate access controls.

In this article, I will discuss how to set up pre-signed URLs for the secure upload of files. This allows us to grant temporary access to objects in AWS S3 buckets without needing permission.

So how do you go from a 5GB limit to a 5TB limit in uploading to AWS S3? Using multipart uploads, AWS S3 allows users to upload files partitioned into 10,000 parts. The size of each part may vary from 5MB to 5GB.

The table below shows the upload service limits for S3.

Capture

Apart from the size limitations, it is better to keep S3 buckets private and only grant public access when required. We wanted to give the client access to an object without changing the bucket ACL, creating roles, or creating a user on our account. We ended up using S3 pre-signed URLs.

What will you learn?

For a standard multipart upload to work with pre-signed URLs, we need to:

Initiate a multipart upload
Create pre-signed URLs for each part
Upload the parts of the object
Complete multipart upload

Prerequisites

You have to make sure that you have configured your command-line environment not to require the credentials at the time of operations. Steps 1, 2, and 4 stated above are server-side stages. They will need an AWS access keyID and secret key ID. Step 3 is a client-side operation for which the pre-signed URLs are being set up, and hence no credentials will be needed.

If you have not configured your environment to perform server-side operations, then you must complete it first by following these steps:

Download AWS-CLI from this link according to your OS and install it. To configure your AWS-CLI, you need to use the command aws configure and provide the details it requires, as shown below.

$ aws configure

AWS Access Key ID [None]: EXAMPLEFODNN7EXAMPLE
AWS Secret Access Key [None]: eXaMPlEtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: xx-xxxx-x
Default output format [None]: json

Implementation

1. Initiate a Multipart Upload

At this stage, we request AWS S3 to initiate a multipart upload. In response, we will get the UploadId, which will associate each part to the object they are creating.

import boto3

s3 = boto3.client('s3')

bucket = "[XYZ]"
key = "[ABC.pqr]"

response = s3.create_multipart_upload(
    Bucket=bucket, 
    Key=key
)

upload_id = response['UploadId']

Executing this chunk of code after setting up the bucket name and key, we get the UploadID for the file we want to upload. After setting up the bucket name and key, we get the UploadID for the file that needs to be uploaded. It will later be required to combine all parts.

2. Create pre-signed URLs for each part

The parts can now be uploaded via a PUT request. As explained earlier, we are using a pre-signed URL to provide a secure way to upload and grant access to an object without changing the bucket ACL, creating roles, or providing a user on your account. The permitted user can generate the URL for each part of the file and access the S3. The following line of code can generate it:

signed_url = s3.generate_presigned_url(
    ClientMethod ='upload_part',
    Params = {
       'Bucket': bucket,
       'Key': key, 
       'UploadId': upload_id, 
       'PartNumber': part_no
    }
)

As described above, this particular step is a server-side stage and hence demands a preconfigured AWS environment. The pre-signed URLs for each of the parts can now be handed over to the client. They can simply upload the individual parts without direct access to the S3. It means that the service provider does not have to worry about the ACL and change in permission anymore.

3. Upload the parts of the object

This step is the only client-side stage of the process. The default pre-signed URL expiration time is 15 minutes, while the one who is generating it can change the value. Usually, it is kept as minimal as possible for security reasons.

The client can read the part of the object, i.e., file_data, and request to upload the chunk of the data concerning the part number. It is essential to use the pre-signed URLs in sequence as the part number, and the data chunks must be in sequence; otherwise, the object might break, and the upload ends up with a corrupted file. For that reason, a dictionary, i.e., parts, must be managed to store the unique identifier, i.e., eTag of every part concerning the part number. A dictionary must be a manager to keep the unique identifier or eTag of every part of the number.

response = requests.put(signed_url, data=file_data)

etag = response.headers['ETag']  

parts.append({'ETag': etag, 'PartNumber': part_no})

As far as the size of data is concerned, each chunk can be declared into bytes or calculated by dividing the object’s total size by the no. of parts. Look at the example code below:

max_size = 5 * 1024 * 1024    # Approach 1: Assign the size  

max_size = object_size/no_of_parts    # Approach 2: Calculate the size 
  
with open(fileLocation) as f:
	file_data = f.read(max_size)

4. Complete Multipart Upload

Before this step, check the data’s chunks and the details uploaded to the bucket. Now, we need to merge all the partial files into one. The dictionary parts (about which we discussed in step 3) will be passed as an argument to keep the chunks with their part numbers and eTags to avoid the object from corrupting.

You can refer to the code below to complete the multipart uploading process.

response = s3.complete_multipart_upload(
    Bucket = bucket,
    Key = key,
    MultipartUpload = {'Parts': parts},
    UploadId= upload_id
)

5. Additional step

To avoid any extra charges and cleanup, your S3 bucket and the S3 module stop the multipart upload on request. In case anything seems suspicious and one wants to abort the process, they can use the following code:

response = s3.abort_multipart_upload(
    Bucket = bucket,
    Key = key,
    UploadId = upload_id
)

In this article, we discussed the process of implementing the process of multipart uploading in a secure way pre-signed URLs. The suggested solution is to make a CLI tool to upload large files which saves time and resources and provides flexibility to the users. It is a cheap and efficient solution for users who need to do this frequently.

Multipart Upload for Large Files using Pre-Signed URLs - AWS