Mar 31, 2024
Client-side mechanisms for loading data onto AWS Snowball Edge – Addressing Disconnected Scenarios with AWS Snow Family

With the exception of AWS DataSync, file loading is a push operation from your data loader workstation. Thus, you will need to use an appropriate client application to communicate with the target you have selected.

Performance tip – batching

Regardless of the client-side mechanism you use to copy data, there is a certain amount of per-file overhead incurred by operations such as encryption. This is why copying a thousand 1 KB files is slower than copying one 1,000 KB file. If the data you are loading consists of many small files spread across many subdirectories, you will probably save time by batching them up into one large archive with utilities such as zip, gzip, or tar. This is true even if you obtain zero compression by doing so.
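The batching step can be scripted. The following is a minimal sketch using Python's standard tarfile module; the function name bundle_directory and the paths passed to it are hypothetical, and it writes an uncompressed archive because the goal is fewer files rather than a smaller payload.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir: str, archive_path: str) -> int:
    """Pack every regular file under source_dir into one uncompressed tar.

    Returns the number of files added. archive_path should live outside
    source_dir, otherwise the archive would try to include itself.
    """
    source = Path(source_dir)
    count = 0
    # Mode "w" writes a plain tar with no compression: the aim is to pay
    # the per-file overhead once, not to shrink the data.
    with tarfile.open(archive_path, "w") as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir so the tree can be
                # restored under any directory after extraction.
                archive.add(path, arcname=path.relative_to(source).as_posix())
                count += 1
    return count
```

Calling bundle_directory("/data/export", "/staging/export.tar") would then leave a single file to copy instead of the whole tree. Swapping the mode to "w:gz" adds gzip compression if the data is compressible.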

AWS OpsHub for Snow Family

The simplest option is the drag-and-drop interface in the AWS OpsHub application. Customers who prefer a GUI typically have it installed already, since they use it to unlock the device and change its configuration. This option also requires no special target configuration.

Convenient as it is, this method is also quite slow, with a maximum speed of around 0.3 Gbps, as you can see from Figure 4.8:

Figure 4.8 – Uploading files via drag and drop in AWS OpsHub
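To put 0.3 Gbps in perspective, here is a rough back-of-the-envelope calculation. The 10 TB dataset size is an illustrative assumption, not a figure from the text, and the function name transfer_hours is hypothetical. It assumes decimal units and sustained throughput with no per-file overhead, so real transfers would likely take longer.

```python
def transfer_hours(size_terabytes: float, throughput_gbps: float) -> float:
    """Estimate hours needed to move size_terabytes at throughput_gbps.

    Uses decimal units: 1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/second.
    """
    if throughput_gbps <= 0:
        raise ValueError("throughput_gbps must be positive")
    bits = size_terabytes * 10**12 * 8
    seconds = bits / (throughput_gbps * 10**9)
    return seconds / 3600


if __name__ == "__main__":
    # 10 TB is an assumed example size; 0.3 Gbps is the OpsHub ceiling
    # quoted above.
    print(f"{transfer_hours(10, 0.3):.1f} hours")
```

At 0.3 Gbps, the assumed 10 TB works out to roughly 74 hours, or about three days of continuous copying.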

NFS client

When using the NFS endpoint, your data loader workstation must have an NFS client installed. This is usually installed by default on macOS or Linux. While Windows does offer an NFS client, it is not installed by default, and the performance tends to be lower.

AWS CLI

The AWS CLI should already be installed on your data loader workstation. It can target the S3 endpoint running locally on the AWS Snowball Edge device. Using the aws s3 sync command, you can perform bulk data transfers the same way you would with an S3 bucket in an AWS Region.
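As a sketch of what this looks like when scripted, the helper below assembles an aws s3 sync invocation that points at a local endpoint through the CLI's --endpoint-url option. The function names, the bucket name, the profile name, and the endpoint address are all placeholders; substitute the address and port your own device reports.

```python
import subprocess
from typing import List, Optional


def build_sync_command(
    source_dir: str,
    bucket: str,
    endpoint_url: str,
    profile: Optional[str] = None,
) -> List[str]:
    """Return the argv list for syncing source_dir into s3://bucket."""
    cmd = [
        "aws", "s3", "sync",
        source_dir,
        f"s3://{bucket}",
        # Redirect the CLI from the regional S3 service to the device.
        "--endpoint-url", endpoint_url,
    ]
    if profile:
        # Named profile holding the credentials obtained from the device.
        cmd += ["--profile", profile]
    return cmd


def run_sync(source_dir: str, bucket: str, endpoint_url: str,
             profile: Optional[str] = None) -> None:
    """Run the sync and raise CalledProcessError if the CLI fails."""
    subprocess.run(
        build_sync_command(source_dir, bucket, endpoint_url, profile),
        check=True,
    )
```

For example, run_sync("/data/export", "my-bucket", "https://192.0.2.10:8443", profile="snowball") would run the sync against a device at that (made-up) address. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.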

s5cmd

The AWS CLI is a general-purpose utility written in Python. It wasn’t explicitly designed to maximize file transfer speed, so it usually can’t push data as fast as the S3 endpoint can receive it. Fortunately, s5cmd can. It is an open source project available on GitHub, written in Go, that focuses on maximum parallelization. The more CPU cores your data loader has, the faster it can move data. However, given that most laptops and even desktops don’t have 128 cores and 25 Gbps interfaces, this option tends to be used when the loader itself is a server in the customer’s data center.
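A comparable sketch for s5cmd follows. It relies on the --endpoint-url and --numworkers global options and the cp subcommand as described in the project's documentation; confirm them against s5cmd --help for the version you install. The function name, bucket, endpoint address, and worker count are placeholders chosen for illustration.

```python
import subprocess
from typing import List


def build_s5cmd_copy(
    source_dir: str,
    bucket: str,
    endpoint_url: str,
    workers: int,
) -> List[str]:
    """Return the argv list for a parallel s5cmd copy of source_dir."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [
        "s5cmd",
        # Global options come before the subcommand.
        "--endpoint-url", endpoint_url,
        "--numworkers", str(workers),
        "cp",
        # s5cmd expands the wildcard itself, so no shell is needed.
        source_dir.rstrip("/") + "/*",
        f"s3://{bucket}/",
    ]


def run_s5cmd_copy(source_dir: str, bucket: str, endpoint_url: str,
                   workers: int) -> None:
    """Run the copy and raise CalledProcessError if s5cmd fails."""
    subprocess.run(
        build_s5cmd_copy(source_dir, bucket, endpoint_url, workers),
        check=True,
    )
```

A call such as run_s5cmd_copy("/data/export", "my-bucket", "http://192.0.2.10:8080", workers=64) would start the transfer. The worker count is worth tuning against the core count and network interface of the loader, since that is where this tool gains its advantage.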
