AWS Storage

Reference

Study Notes @ http://danielcreager.com/AWS Storage.html
Slide Deck @ http://slides.com/dcreager/aws-storage

Amazon Simple Storage Service (S3)

  1. Definition
    1. S3 is an Object Store
    2. Secure, durable, and highly-scalable cloud storage
    3. Optimized for reads and intentionally lightweight
    4. Accessible from anywhere on the web
    5. One of the AWS Foundational Services
  2. Features
    1. Storage Classes
      1. General Purpose
      2. Infrequent Access
      3. Archival
    2. Lifecycle Policies
      1. Automatically migrate to the most appropriate Storage class
    3. Rich set of Access Controls
    4. Replication is Automatic:
      1. Multiple devices
      2. Multiple facilities within a region
      3. Across availability zones
    5. Scalability - Automatically partitions buckets supporting:
      1. High request rates
      2. Simultaneous access
      3. Multiple concurrent users

Amazon Glacier

  1. Definition
    1. S3 optimized for long-term backup and archival.
    2. 3-5 hour retrieval time
    3. Dual Product Offering
      1. An S3 Storage Class
      2. Archival Storage Service

Background

  1. Types of Storage
    1. Block - Storage Device Level
      1. Organizes data into numbered, fixed size blocks
      2. Storage Area Network (SAN)
      3. Common protocols
        1. iSCSI
        2. Fibre Channel
    2. File - Server and Operating System Level
      1. Organizes data into named hierarchy of folders and files
      2. Network Attached Storage (NAS)
      3. Common Protocols
        1. CIFS
        2. NFS
    3. Object
      1. Independent of Servers, Operating Systems
      2. Accessed over a network
      3. The native interface for S3 is a ReST API.

S3 Basics

  1. S3 Object Characteristics
    1. Each S3 object contains BOTH data and metadata
    2. Each S3 object is uniquely identified by: <bucket><key>[<versionId>]
      1. The key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes.
    3. Size range is 0 bytes up to 5 terabytes
    4. Operations (GET, PUT) are on whole objects
    5. Data
      1. S3 treats all objects as a stream of bytes.
      2. S3 is completely format agnostic
    6. Metadata
      1. A set of name/value pairs
      2. System metadata with object characteristics.
      3. Optional User metadata
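
The key length limit and the bucket/key/version identifier described above can be sketched in a few lines. This is a minimal illustration, not SDK code: the helper names `is_valid_key` and `object_id` are hypothetical, and rejecting an empty key is an assumption the notes do not state.

```python
from typing import Optional, Tuple

MAX_KEY_BYTES = 1024  # limit on the UTF-8 encoded key, per the notes above


def is_valid_key(key: str) -> bool:
    """Return True if the key's UTF-8 encoding is 1 to 1024 bytes long.

    Assumption: an empty key is rejected (not stated in the notes).
    """
    return 0 < len(key.encode("utf-8")) <= MAX_KEY_BYTES


def object_id(bucket: str, key: str, version_id: Optional[str] = None) -> Tuple[str, ...]:
    """Identify an object by bucket and key, plus a version ID when one is given."""
    if version_id is None:
        return (bucket, key)
    return (bucket, key, version_id)
```

Note that the limit counts bytes, not characters: a key made of two-byte characters such as "é" reaches the limit after 512 characters.
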
  2. Bucket Characteristics
    1. A bucket is a container (web folder) for objects (files) stored in S3.
    2. Each account may define 100 buckets, by default.
    3. Buckets are created and stored within specific AWS regions
    4. Buckets are the top-level, global namespace in S3
      1. Must be globally unique across all AWS
      2. Naming Conventions
        1. Must be between 3 and 63 chars long.
        2. Contain only: lower-case characters, numbers, periods, and dashes.
        3. Additional restrictions apply
        4. Best Practice:
          1. Include your domain name
          2. Conform to DNS naming conventions
            1. Object Key and Metadata
            2. Domain Name System
            3. DNS Naming Conventions
    5. Can hold an unlimited number of objects
    6. A simple flat folder with no hierarchy
      Note: For your convenience, the Amazon S3 console and the Prefix and Delimiter feature allow you to navigate within an Amazon S3 bucket as if there were a folder hierarchy.
      However, remember that a bucket is a single flat namespace of keys with no structure.
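
The naming rules listed above can be expressed as a single pattern. A rough sketch only: `is_valid_bucket_name` is a hypothetical helper, and the start/end rule is an assumption standing in for one of the "additional restrictions" the notes mention without listing. Global uniqueness cannot be checked locally.

```python
import re

# 3 to 63 characters; lowercase letters, digits, periods, and dashes only.
# Assumption: the name must also begin and end with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Check a candidate bucket name against the rules stated in the notes."""
    return _BUCKET_NAME.fullmatch(name) is not None
```
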
  3. Accessing S3 Objects
    1. Operations
      1. Intentionally simple
      2. Based on a ReST implementation of CRUD operations
        1. Bucket Operations
          1. Create
          2. Delete
          3. List
        2. Object Operations
          1. Write
          2. Read
          3. Delete
          4. Note: the absence of an Update. Why?
    2. Direct Interface
      1. Representational State Transfer (ReST)
      2. Create, Read, Update, Delete (CRUD) operations mapped to HTTP methods
        Ref: POST Object
        1. Create -> HTTP PUT (or POST to accommodate use of HTML forms)
        2. Read -> HTTP GET
        3. Update -> HTTP POST (or PUT)
        4. Delete -> HTTP DELETE
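
The CRUD-to-HTTP mapping above can be written down as a small lookup table. This is only an illustration of the mapping in these notes: `describe_request` is a hypothetical helper, and the `/bucket/key` path form is assumed for readability.

```python
# CRUD operation -> HTTP methods, first entry being the primary one in the notes.
CRUD_TO_HTTP = {
    "create": ("PUT", "POST"),   # POST accommodates HTML forms
    "read": ("GET",),
    "update": ("POST", "PUT"),   # an update rewrites the whole object
    "delete": ("DELETE",),
}


def describe_request(operation: str, bucket: str, key: str) -> str:
    """Return a readable request line such as 'GET /my-bucket/a.txt'."""
    method = CRUD_TO_HTTP[operation.lower()][0]
    return f"{method} /{bucket}/{key}"
```
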
    3. High Level Interfaces
      1. AWS Software Development Kit (SDK)
      2. Wrapper Libraries
      3. AWS Command Line Interface (CLI)
      4. AWS Management Console
  4. Durability and Availability
    1. Durability
      1. Will my data still be there?
      2. S3 is 99.999999999% Durable
    2. Availability
      1. Can I access my data?
      2. S3 is 99.99% available
    3. Reduced Redundancy Storage (RRS)
      1. Reduced Cost Alternative
      2. RRS is 99.99% durable
    4. Best Practice
      1. Protect against user mistakes
        1. Versioning
        2. Cross-Region Replication
        3. MFA Delete
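
To make the percentages above concrete, the arithmetic can be worked through directly. A back-of-the-envelope sketch, assuming a 365-day year and treating the published figures as simple annual probabilities; the function names are hypothetical.

```python
from fractions import Fraction


def annual_downtime_minutes(availability_percent: str) -> float:
    """Expected unavailable minutes per 365-day year at the given availability."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return float(unavailable * 365 * 24 * 60)


def expected_annual_losses(durability_percent: str, object_count: int) -> float:
    """Expected number of objects lost per year at the given durability."""
    loss_rate = 1 - Fraction(durability_percent) / 100
    return float(loss_rate * object_count)
```

With these assumptions, 99.99% availability allows about 52.56 minutes of unavailability per year. Storing 10,000 objects at 99.999999999% durability gives an expected loss of 0.0000001 objects per year (one object every 10 million years), while the same objects at RRS's 99.99% durability give an expected loss of one object per year.
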
  5. Data Consistency
    1. S3 is an Eventually Consistent system
    2. Immediately after an update, a read may return stale data. This is applicable to:
      1. PUTs to existing Objects
      2. Object Deletes
    3. Updates are Atomic - Partial updates cannot occur
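
The stale-read behavior described above can be illustrated with a toy model. This is not how S3 is implemented; it is a deliberately simplified sketch in which a hypothetical `ToyEventuallyConsistentStore` keeps a primary copy and a replica that only catches up when `replicate()` is called.

```python
class ToyEventuallyConsistentStore:
    """Toy model: writes land on a primary copy and reach the replica later."""

    def __init__(self):
        self._primary = {}
        self._replica = {}
        self._pending = []  # replication log of (key, value or None for delete)

    def put(self, key, value):
        # The whole object is replaced in one step, so no partial update exists.
        self._primary[key] = value
        self._pending.append((key, value))

    def delete(self, key):
        self._primary.pop(key, None)
        self._pending.append((key, None))

    def read_from_replica(self, key):
        """May return stale data until replicate() has run."""
        return self._replica.get(key)

    def replicate(self):
        """Apply all pending changes to the replica, in order."""
        for key, value in self._pending:
            if value is None:
                self._replica.pop(key, None)
            else:
                self._replica[key] = value
        self._pending.clear()
```

In this model, a read issued after an overwrite or a delete, but before `replicate()`, still returns the previous value. The reader sees either the old object or the new one, never a mixture.
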
  6. Access Control
    1. S3 is secure by default. Initially, only the creator has access.
    2. Coarse-grained access control:
      1. S3 ACLs
      2. READ, WRITE, or FULL_CONTROL
      3. Bucket or Object level
      4. Best Use Cases
        1. Enabling Bucket Logging
        2. Hosting a static website
    3. Fine-grained access controls:
      1. S3 Bucket Policies
        1. Recommended access control mechanism
        2. Similar to IAM policies
        3. Access Control over who, from where, and when
      2. AWS IAM
      3. Query String Authentication
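
A bucket policy that controls "who, from where, and when" might look like the following sketch. The policy grammar (`Version`, `Statement`, `Principal`, `Condition`) and the `aws:SourceIp` and `aws:CurrentTime` condition keys are standard IAM policy elements; the function name and the example values a caller would pass (account ARN, CIDR range, timestamp) are hypothetical.

```python
import json


def read_only_policy(bucket: str, principal_arn: str, source_cidr: str, not_after: str) -> dict:
    """Build a bucket policy document allowing GetObject under three conditions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ExampleReadOnly",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},            # who
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "IpAddress": {"aws:SourceIp": source_cidr},       # from where
                    "DateLessThan": {"aws:CurrentTime": not_after},   # when
                },
            }
        ],
    }


def to_json(policy: dict) -> str:
    """Serialize the policy the way it would be attached to a bucket."""
    return json.dumps(policy, indent=2)
```
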
  7. Static Website Hosting
    1. Very common use case
    2. Every S3 Object has a URL
    3. Configure the bucket
      1. Create a bucket with the same name as the desired website hostname.
      2. Upload the static files to the bucket.
      3. Make all the files public (world readable).
      4. Enable static website hosting for the bucket.
        This includes specifying an Index document and an Error document.
      5. The website will now be available at the S3 website URL:
        <bucket-name>.s3-website-<AWS-region>.amazonaws.com
      6. Create a friendly DNS name in your own domain for the website using a DNS CNAME, or an Amazon Route 53 alias that resolves to the Amazon S3 website URL.
      7. The website will now be available at your website domain name.
    4. Upload the static content
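
The website URL pattern and the index/error document settings from the steps above can be sketched as follows. The endpoint format is the one given in these notes; the helper name `website_endpoint` and the document names `index.html` and `error.html` are illustrative assumptions.

```python
def website_endpoint(bucket: str, region: str) -> str:
    """Form the S3 website hostname using the pattern given in the notes."""
    return f"{bucket}.s3-website-{region}.amazonaws.com"


# Website settings for the bucket: an index document and an error document.
# The file names are hypothetical placeholders.
WEBSITE_CONFIGURATION = {
    "IndexDocument": {"Suffix": "index.html"},
    "ErrorDocument": {"Key": "error.html"},
}
```

For a bucket named `www.example.com` in `us-east-1`, the helper returns `www.example.com.s3-website-us-east-1.amazonaws.com`, which is the target a DNS CNAME or Route 53 alias would point to.
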

Advanced Features

  1. Prefixes and Delimiters
    1. While Amazon S3 uses a flat structure in a bucket, it supports the use of prefix and delimiter parameters when listing key names. This emulates a file and folder hierarchy within the flat object key namespace of a bucket. For example:
      logs/2016/January/server42.log
      logs/2016/February/server42.log
      logs/2016/March/server42.log
    2. Supporting products
      1. REST API
      2. Wrapper SDKs
      3. AWS CLI
      4. AWS Management Console
    3. Amazon S3 is not really a file system.
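
How a prefix and delimiter turn a flat list of keys into something that looks like folders can be emulated locally. This is a simplified sketch of the listing behavior, not the service's code; `list_keys` is a hypothetical function that returns the matching keys plus the rolled-up "common prefixes" that play the role of subfolders.

```python
def list_keys(keys, prefix="", delimiter=""):
    """Emulate listing a flat key namespace with a prefix and a delimiter.

    Returns (contents, common_prefixes): keys directly under the prefix, and
    the distinct prefixes ending at the next delimiter after the prefix.
    """
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            rolled_up = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled_up not in common_prefixes:
                common_prefixes.append(rolled_up)
        else:
            contents.append(key)
    return contents, common_prefixes
```

Listing the three log keys above with prefix `logs/2016/` and delimiter `/` yields no direct keys and three common prefixes (`logs/2016/February/`, `logs/2016/January/`, `logs/2016/March/`), which is what a console would render as three folders.
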
  2. Storage Classes
    1. Amazon S3 offers a range of storage classes suitable for various use cases.
    2. Amazon S3 Standard:
      1. High durability
      2. High availability
      3. Low latency
      4. High performance object storage
    3. Amazon S3 Standard — Infrequent Access (Standard-IA)
      1. Different Availability profile from Standard
      2. Designed for long-lived, less frequently accessed data
      3. Lower per GB-month storage cost than Standard
      4. Minimums and Costs
        1. Object size (128KB)
        2. Duration (30 days)
        3. Per-GB retrieval costs
    4. Amazon S3 Reduced Redundancy Storage (RRS) offers slightly lower durability (4 nines) than Standard or Standard-IA at a reduced cost.
    5. Amazon Glacier storage class
      1. Data that does not require real-time access
      2. Retrieval time of several (3-5) hours is suitable.
      3. Note: restore creates a copy in Amazon S3 RRS and original remains in Amazon Glacier
      4. Retrieval of up to 5% of the data is free each month
    6. Amazon Glacier is also a standalone storage service
      1. Separate API and some unique characteristics.
  3. Object Lifecycle Management
    1. Equivalent to automated storage tiering
    2. Attached to the Bucket
    3. Contents may be filtered by name prefixes
    4. Reduce storage costs by automatically transitioning data from one storage class to another. For Example:
      1. Store backup data initially in Amazon S3 Standard.
      2. After 30 days, transition to Amazon Standard-IA.
      3. After 90 days, transition to Amazon Glacier.
      4. After 3 years, delete.
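
The backup example above maps naturally onto a lifecycle rule. The dictionary below follows the rule shape used by the S3 lifecycle configuration API (`Rules`, `Filter`, `Transitions`, `Expiration`); the rule ID and the `backups/` prefix are hypothetical, and `storage_class_at` is an illustrative helper that models the outcome rather than anything the service exposes.

```python
LIFECYCLE = {
    "Rules": [
        {
            "ID": "backup-tiering",            # hypothetical rule name
            "Filter": {"Prefix": "backups/"},  # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 3 * 365},   # delete after 3 years
        }
    ]
}


def storage_class_at(age_days: int, rule: dict):
    """Return the storage class an object would have at a given age, or None once expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current
```
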
  4. Encryption
    1. In Flight
      1. Use the Amazon S3 SSL API endpoints to encrypt data
    2. At Rest
      1. S3 encrypts data at the object level as it writes and decrypts on read
        1. S3's SSE uses the 256-bit Advanced Encryption Standard (AES)
      2. Use Client-Side Encryption before sending it to Amazon S3
  5. SSE-S3 (AWS-Managed Keys)
    1. AWS handles the key management and key protection
    2. Every object is encrypted with a unique key.
    3. The actual object key itself is then further encrypted by a separate master key.
  6. SSE-KMS (AWS KMS Keys)
    1. Amazon handles your key management and protection for Amazon S3
    2. You manage the keys
    3. Additional Benefits
      1. There are separate permissions for using the master key
      2. Auditing is provided by AWS
      3. Allows you to view any failed attempts to access data
  7. SSE-C (Customer-Provided Keys)
    1. Maintain your own encryption keys
    2. AWS will encrypt/decrypt your objects
    3. You maintain full control of the keys
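
At the REST level, the three server-side options above are selected with request headers on the PUT. The sketch below builds those headers using only the standard library; the header names are the documented `x-amz-server-side-encryption*` headers, while the function names and any key ID passed in are hypothetical.

```python
import base64
import hashlib


def sse_s3_headers() -> dict:
    """SSE-S3: ask S3 to encrypt with keys it manages."""
    return {"x-amz-server-side-encryption": "AES256"}


def sse_kms_headers(kms_key_id: str) -> dict:
    """SSE-KMS: ask S3 to encrypt under a key held in AWS KMS."""
    return {
        "x-amz-server-side-encryption": "aws:kms",
        "x-amz-server-side-encryption-aws-kms-key-id": kms_key_id,
    }


def sse_c_headers(key: bytes) -> dict:
    """SSE-C: supply your own 256-bit key with the request."""
    if len(key) != 32:
        raise ValueError("SSE-C requires a 32-byte (256-bit) key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode("ascii"),
        # The MD5 digest lets the service check the key arrived intact.
        "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(
            hashlib.md5(key).digest()
        ).decode("ascii"),
    }
```

With SSE-C, the same key headers must accompany every later GET of the object, since the key is supplied by the caller on each request.
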
  8. Client-Side Encryption
    1. Encrypting data before sending it
  9. Versioning
    1. Protection against accidental or malicious deletion
    2. Preserve, retrieve, and restore every version of every object
    3. Restore objects to their original state simply by referencing the version ID
    4. Turned on at the bucket level.
    5. Once enabled, versioning cannot be removed from a bucket; it can only be suspended.
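
Why versioning protects against accidental deletion can be seen in a toy model. This is an in-memory illustration only, with a hypothetical `ToyVersionedBucket` class and made-up version IDs (`v1`, `v2`, ...): a plain delete adds a delete marker instead of erasing data, so earlier versions stay retrievable by ID.

```python
import itertools

_DELETE_MARKER = object()  # sentinel standing in for a delete marker


class ToyVersionedBucket:
    """Toy model of a versioning-enabled bucket."""

    def __init__(self):
        self._versions = {}            # key -> list of (version_id, data)
        self._ids = itertools.count(1)

    def _append(self, key, data):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def put(self, key, data):
        """Store a new version and return its version ID."""
        return self._append(key, data)

    def delete(self, key):
        """A plain delete only adds a delete marker; older versions remain."""
        return self._append(key, _DELETE_MARKER)

    def get(self, key, version_id=None):
        """Return the latest version, or a specific one when version_id is given."""
        versions = self._versions.get(key, [])
        if version_id is None:
            if not versions or versions[-1][1] is _DELETE_MARKER:
                raise KeyError(key)
            return versions[-1][1]
        for vid, data in versions:
            if vid == version_id and data is not _DELETE_MARKER:
                return data
        raise KeyError((key, version_id))
```
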
  10. Multi-Factor Authentication (MFA) Delete
    1. On top of bucket versioning.
    2. Requires additional authentication to permanently delete an object
    3. Requires an authentication code (a temporary, one-time password) generated by a hardware or virtual Multi-Factor Authentication (MFA) device.
    4. Note: MFA Delete can only be enabled by the root account.
  11. Pre-Signed URLs
    1. Object owner can create
    2. Valid only for the specified duration
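
The idea behind a pre-signed URL, a signature plus an expiry that the service can verify, can be shown with a deliberately simplified scheme. This is NOT the AWS signing algorithm; it is a toy HMAC construction with hypothetical `presign` and `verify` functions and made-up query parameter names, meant only to show why such a URL stops working after its expiry or if it is altered.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _sign(secret: bytes, path: str, expires_at: int) -> str:
    message = f"GET\n{path}\n{expires_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def presign(secret: bytes, path: str, expires_at: int) -> str:
    """Return path plus an expiry (Unix seconds) and a signature over both."""
    query = urlencode({"Expires": expires_at, "Signature": _sign(secret, path, expires_at)})
    return f"{path}?{query}"


def verify(secret: bytes, url: str, now: int) -> bool:
    """Accept the URL only if it is unexpired and the signature matches."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["Expires"][0])
        signature = query["Signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _sign(secret, parsed.path, expires_at)
    return now <= expires_at and hmac.compare_digest(signature, expected)
```
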
  12. Multipart Upload
    1. An API that allows uploading large objects as a set of parts with the ability to:
      1. pause
      2. resume
      3. upload objects, where the size is initially unknown
    2. When object > 100 MB, multipart upload is recommended
    3. When object > 5 GB, multipart upload is required
    4. When using the low-level APIs, the file to be uploaded must be broken into parts, which are managed by the caller.
    5. High-level APIs use multipart upload automatically
    6. Lifecycle policy to abort incomplete multipart uploads after a specified number of days
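
When using the low-level API, the caller decides how the object is cut into parts. A minimal planning sketch follows; `plan_parts` is a hypothetical helper, and the 5 MiB minimum part size and 10,000-part maximum are S3 limits that these notes do not state, so treat them as assumptions to confirm against the current documentation.

```python
MIN_PART_SIZE = 5 * 1024 ** 2  # assumed minimum for every part except the last
MAX_PARTS = 10_000             # assumed maximum number of parts per upload


def plan_parts(total_size: int, part_size: int):
    """Return a list of (part_number, first_byte, last_byte) tuples, 1-based parts."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the minimum part size")
    count = -(-total_size // part_size)  # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return [
        (n + 1, n * part_size, min((n + 1) * part_size, total_size) - 1)
        for n in range(count)
    ]
```

Each tuple describes one part to upload independently, which is what makes pausing and resuming possible: only the parts that have not yet succeeded need to be sent again.
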
  13. Range GETs
    1. Download (GET) only a portion of an object
    2. Use a Range HTTP header to specify a range of bytes of the object
    3. Useful when you have poor connectivity or download a known subset of a large object
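
The Range header uses the standard HTTP byte-range syntax, where both ends are inclusive. A small sketch with hypothetical helpers: `range_header` builds the header and `apply_range` shows which bytes such a request selects from a local byte string.

```python
from typing import Optional


def range_header(start: int, end: Optional[int] = None) -> dict:
    """Build a Range header; end is inclusive, and omitted means 'to the end'."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    value = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    return {"Range": value}


def apply_range(data: bytes, start: int, end: Optional[int] = None) -> bytes:
    """Return the bytes a server would send for the given inclusive range."""
    return data[start:] if end is None else data[start:end + 1]
```
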
  14. Cross-Region Replication
    1. Asynchronously replicate all new objects to a target bucket in another region
    2. Any metadata and ACLs associated with the object are also part of the replication.
    3. Any changes trigger a new replication to the destination bucket
    4. Versioning must be turned on for both source and destination buckets
    5. Requires an IAM policy to give Amazon S3 permission to replicate
    6. Used to reduce the latency by placing objects closer to a set of users
    7. Used to meet locality requirements
    8. Replication applies only to new objects; objects already in the bucket are not replicated
  15. Logging
    1. Track S3 requests by enabling Amazon S3 server access logs.
    2. Logging is off by default
    3. Store access logs in the same or a different bucket
  16. Event Notifications
    1. Trigger notification events based on S3 object actions. Enables:
      1. Running workflows
      2. Sending alerts
      3. Transcoding media files
      4. Processing data files
      5. Synchronizing S3 objects with other data stores
    2. Set up at the bucket level
    3. Publish notifications when:
      1. New objects are created
      2. Objects are removed (by a DELETE)
      3. S3 detects that an RRS object was lost
    4. Set up event notifications based on Object name prefixes and suffixes
    5. Notifications can be sent through:
      1. Amazon Simple Notification Service (Amazon SNS)
      2. Amazon Simple Queue Service (Amazon SQS)
      3. AWS Lambda to invoke AWS Lambda functions
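
A Lambda function receiving these notifications typically loops over the records in the event and pulls out the bucket and key. The sketch below assumes the usual S3 event record layout (`Records`, `eventName`, `s3.bucket.name`, `s3.object.key`, with the key URL-encoded); the `images/` prefix and `.jpg` suffix are hypothetical filter values, and the handler only collects matches rather than doing real work.

```python
from urllib.parse import unquote_plus


def matches(key: str, prefix: str = "", suffix: str = "") -> bool:
    """Prefix/suffix filter of the kind applied to object names."""
    return key.startswith(prefix) and key.endswith(suffix)


def handler(event, context=None):
    """Collect (bucket, key) pairs for newly created objects that match the filter."""
    created = []
    for record in event.get("Records", []):
        if not record["eventName"].startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if matches(key, prefix="images/", suffix=".jpg"):
            created.append((bucket, key))
    return created
```
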
  17. Best Practices, Patterns, and Performance
    1. Use S3 storage in hybrid IT environments and applications
    2. For example, back up on-premises data over the Internet to S3 or Glacier
    3. Use S3 as bulk "blob" storage for data, while keeping an index
    4. S3 will scale automatically to support very high request rates
    5. For request rates higher than 100 requests per second, ensure some level of random distribution of keys
    6. For GET-intensive workloads, consider using Amazon CloudFront as a caching layer for S3.
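
One common way to get the "random distribution of keys" mentioned above is to prepend a short hash of the key name. A minimal sketch, assuming a 4-character hexadecimal MD5 prefix; the prefix length and the `distributed_key` helper are illustrative choices, not a prescribed scheme.

```python
import hashlib


def distributed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash so sequential names spread out."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{key}"
```

Because the prefix is derived from the key itself, the stored name can always be recomputed from the original name. The trade-off is that listing keys in their original order by prefix is no longer possible.
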

Amazon Glacier

  1. Extremely low-cost storage service for data archiving and online backup.
  2. Designed for infrequently accessed data
  3. Retrieval time is three to five hours
  4. Common use cases:
    1. Long-term backup, archive, and storage of compliance data
    2. Usually consists of large TAR (Tape Archive) or ZIP files
  5. Designed for 99.999999999% durability
  6. Stores data on multiple devices across multiple facilities in a region.

Archives

  1. Data is stored in archives, which can contain up to 40 TB of data
  2. Unlimited number of archives
  3. Each archive is assigned a unique archive ID at the time of creation.
  4. Automatically encrypted, and immutable

Vaults

  1. Containers for archives
  2. Max: 1,000 vaults per account
  3. Control access using IAM policies or vault access policies

Vault Locks

  1. Specify controls such as Write Once Read Many (WORM) in a vault lock policy
  2. Once locked, the policy can no longer be changed.

Data Retrieval

  1. Retrieve up to 5% of your data for free each month
  2. Eliminate or minimize fees, by setting a data retrieval policy

Amazon Glacier versus Amazon S3

  1. Supports 40 TB archives versus 5 TB objects in S3
  2. Archives are identified by system-generated archive IDs, versus user-defined keys in S3
  3. Archives are automatically encrypted, whereas encryption is optional in Amazon S3

Summary

  1. Amazon S3 is the core object storage service on AWS, allowing you to store an unlimited amount of data with very high durability.
  2. Common Amazon S3 use cases include backup and archive, web content, big data analytics, static website hosting, mobile and cloud-native application hosting, and disaster recovery.
  3. Amazon S3 is integrated with many other AWS cloud services, including AWS IAM, AWS KMS, Amazon EC2, Amazon EBS, Amazon EMR, Amazon DynamoDB, Amazon Redshift, Amazon SQS, AWS Lambda, and Amazon CloudFront.
  4. Object storage differs from traditional block and file storage. Block storage manages data at a device level as addressable blocks, while file storage manages data at the operating system level as files and folders. Object storage manages data as objects that contain both data and metadata, manipulated by an API.
  5. Amazon S3 buckets are containers for objects stored in Amazon S3. Bucket names must be globally unique. Each bucket is created in a specific region, and data does not leave the region unless explicitly copied by the user.
  6. Amazon S3 objects are files stored in buckets. Objects can be up to 5TB and can contain any kind of data. Objects contain both data and metadata and are identified by keys. Each Amazon S3 object can be addressed by a unique URL formed by the web services endpoint, the bucket name, and the object key.
  7. Amazon S3 has a minimalistic API—create/delete a bucket, read/write/delete objects, list keys in a bucket —and uses a REST interface based on standard HTTP verbs—GET, PUT, POST, and DELETE. You can also use SDK wrapper libraries, the AWS CLI, and the AWS Management Console to work with Amazon S3.
  8. Amazon S3 is highly durable and highly available, designed for 11 nines of durability of objects in a given year and four nines of availability.
  9. Amazon S3 is eventually consistent, but offers read-after-write consistency for new object PUTs.
  10. Amazon S3 objects are private by default, accessible only to the owner. Objects can be marked public readable to make them accessible on the web. Controlled access may be provided to others using ACLs and AWS IAM and Amazon S3 bucket policies.
  11. Static websites can be hosted in an Amazon S3 bucket.
  12. Prefixes and delimiters may be used in key names to organize and navigate data hierarchically much like a traditional file system.
  13. Amazon S3 offers several storage classes suited to different use cases: Standard is designed for general-purpose data needing high performance and low latency. Standard-IA is for less frequently accessed data. RRS offers lower redundancy at lower cost for easily reproduced data. Amazon Glacier offers low-cost durable storage for archive and long-term backups that are rarely accessed and can accept a three- to five-hour retrieval time.
  14. Object lifecycle management policies can be used to automatically move data between storage classes based on time.
  15. Amazon S3 data can be encrypted using server-side or client-side encryption, and encryption keys can be managed with AWS KMS.
  16. Versioning and MFA Delete can be used to protect against accidental deletion.
  17. Cross-region replication can be used to automatically copy new objects from a source bucket in one region to a target bucket in another region.
  18. Pre-signed URLs grant time-limited permission to download objects and can be used to protect media and other web content from unauthorized "web scraping."
  19. Multipart upload can be used to upload large objects, and Range GETs can be used to download portions of an Amazon S3 object or Amazon Glacier archive.
  20. Server access logs can be enabled on a bucket to track requestor, object, action, and response.
  21. Amazon S3 event notifications can be used to send an Amazon SQS or Amazon SNS message or to trigger an AWS Lambda function when an object is created or deleted.
  22. Amazon Glacier can be used as a standalone service or as a storage class in Amazon S3.
  23. Amazon Glacier stores data in archives, which are contained in vaults. You can have up to 1,000 vaults, and each vault can store an unlimited number of archives.
  24. Amazon Glacier Vaults can be locked for compliance purposes.