Protocol (client) selection

The toolkit supports two S3 clients:
- hadoop-aws (s3a protocol)
- stocator (cos protocol)
Select one of the two S3 clients by specifying the appropriate protocol in the objectStorageURI parameter. The URI must be in cos://bucket/ or s3a://bucket/ format, where bucket is the name of the bucket you created. When the s3a protocol is specified, the toolkit uses the hadoop-aws client. When the cos protocol is specified, the toolkit uses the stocator client.
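The mapping from URI protocol to client can be sketched in a few lines of Python. This is only an illustration of the rule described above, not toolkit code: the function name client_for_uri and the dictionary CLIENT_BY_PROTOCOL are hypothetical, and the toolkit performs its own validation.

```python
from urllib.parse import urlparse

# Protocol -> client mapping as described in the text above.
# The names used here are illustrative and not part of the toolkit.
CLIENT_BY_PROTOCOL = {
    "s3a": "hadoop-aws",
    "cos": "stocator",
}


def client_for_uri(object_storage_uri: str) -> str:
    """Return the S3 client implied by an objectStorageURI value."""
    parsed = urlparse(object_storage_uri)
    if parsed.scheme not in CLIENT_BY_PROTOCOL:
        raise ValueError(
            f"unsupported protocol {parsed.scheme!r}; expected cos:// or s3a://"
        )
    if not parsed.netloc:
        # The bucket name is the authority part of the URI.
        raise ValueError("the URI must contain a bucket name")
    return CLIENT_BY_PROTOCOL[parsed.scheme]
```

For example, client_for_uri("cos://mybucket/") yields the stocator client, and client_for_uri("s3a://mybucket/") yields the hadoop-aws client.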
Note: The S3ObjectStorageSink, S3ObjectStorageSource and S3ObjectStorageScan operators have a protocol parameter for the client selection and a bucket parameter to specify the name of the bucket.
For the ObjectStorageScan and ObjectStorageSource operators, the choice of S3 client makes no significant difference, and you can select either one. The ObjectStorageSink operator behaves differently depending on the selected client:
- Select the s3a protocol when your application creates large objects one after another. Large objects are uploaded in multiple parts in parallel.
- Select the cos protocol when your application creates many objects within a narrow time frame. Multiple threads upload in parallel, with each thread uploading one entire object.
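The sink guidance above can be written down as a small decision helper. This is a minimal sketch under stated assumptions: the SinkWorkload class and suggest_sink_protocol function are hypothetical names used for illustration, and the two boolean flags are a simplification of the workload descriptions in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SinkWorkload:
    # Hypothetical description of what the application writes.
    large_objects_one_after_another: bool
    many_objects_in_narrow_time_frame: bool


def suggest_sink_protocol(workload: SinkWorkload) -> Optional[str]:
    """Return "s3a" or "cos" following the guidance above, or None when
    the guidance does not single out one client."""
    large = workload.large_objects_one_after_another
    many = workload.many_objects_in_narrow_time_frame
    if large and not many:
        return "s3a"  # multipart upload, parts sent in parallel
    if many and not large:
        return "cos"  # one whole object per upload thread
    return None
```

When the helper returns None, neither workload pattern dominates, and other factors such as the buffering mechanism can guide the choice.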
When writing objects in partitioned Parquet format, both clients work similarly. You may still prefer one client over the other because of their different buffering mechanisms:
- hadoop-aws (s3a protocol): supports buffering in memory or on disk (parameter s3aFastUploadBuffer). Default: memory buffer prior to upload
- stocator (cos protocol): buffers on local disk prior to upload