What is AWS

  • Cloud Service Provider (Infrastructure as a Service - IaaS)
  • Storage (S3)
  • Computing (EC2)
  • Databases (RDS)
  • Networking
  • Security
  • Virtualization
  • Developer Tools
  • Analytics

Benefits of Cloud Services

  • Sharing
  • Backups
  • High Availability
  • Fault Tolerance
  • Provision as needed - Scalability
  • Pay as you go
  • Elasticity - grow or shrink as needed

Computing (EC2) - Elastic Compute Cloud

  • EC2 Instance
    • CPU
    • OS
    • Memory
    • Local Storage
    • Network Card
    • Firewall
  • Common Use - Web Hosting

Databases (RDS) - Relational Database Service

  • Web service that makes it easier to set up, operate, and scale a relational database in the cloud
  • Each DB instance runs a DB engine
  • Supports the MySQL, MariaDB, PostgreSQL, Oracle, and Microsoft SQL Server DB engines
  • Uses Network Time Protocol (NTP) to synchronize the time on DB instances

Storage (S3) - Simple Storage Service

  • Object storage built to store and retrieve any amount of data from anywhere
  • Comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements
  • Query-in-place functionality: analytics directly on your data at rest in S3 - without moving the data into a separate analytics system
  • Amazon Athena (uses Presto) is an interactive query service that makes it easy to analyze data in S3 using standard SQL
  • Amazon Macie uses machine learning to automatically discover, classify, and protect sensitive data in AWS
  • Formats: CSV, JSON, ORC, Avro, and Parquet
  • Security standards: PCI-DSS, HIPAA/HITECH, FedRAMP, EU Data Protection Directive, and FISMA

Data Formats

  • ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads
  • Avro is a data serialization system which relies on schemas
  • Parquet is a columnar storage format available to any project in the Hadoop ecosystem
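
To illustrate what "relies on schemas" means for Avro, here is a minimal sketch using only the Python standard library. Avro schemas are written in JSON; the `PageView` record, its field names, and the `conforms` helper are hypothetical and are not part of any Avro library. The type check is deliberately simplified and covers only a few primitive types.

```python
import json

# Avro schemas are JSON documents. This record schema is a made-up example.
schema_json = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "duration_seconds", "type": "double"}
  ]
}
"""
schema = json.loads(schema_json)

# Simplified mapping from a few Avro primitive type names to Python types.
PRIMITIVES = {"long": int, "string": str, "double": float, "boolean": bool}


def conforms(record: dict, schema: dict) -> bool:
    """Hypothetical helper: True if the record has exactly the schema's
    fields and each value has the Python type mapped to its Avro type."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(
        isinstance(record[f["name"]], PRIMITIVES[f["type"]]) for f in fields
    )


good = {"user_id": 42, "url": "/home", "duration_seconds": 3.5}
bad = {"user_id": "42", "url": "/home", "duration_seconds": 3.5}  # wrong type

print(conforms(good, schema))
print(conforms(bad, schema))
```

A real Avro implementation also uses the schema to encode and decode the binary data, which this sketch does not attempt.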

Presto: https://prestodb.io

  • Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data
  • It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions
  • It can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3.
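
The kind of query described above (a join plus aggregations in standard SQL) can be sketched as follows. This is only an illustration: it runs against Python's built-in `sqlite3` module as a stand-in, not against Presto or Athena, and the `customers` and `orders` tables are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in SQL engine (not Presto).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 2, 80.0), (3, 1, 50.0);
""")

# Join the two tables and aggregate order counts and totals per region.
query = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
results = conn.execute(query).fetchall()
for region, order_count, total_amount in results:
    print(region, order_count, total_amount)

conn.close()
```

The same SELECT statement uses only standard SQL constructs, so a query of this shape is what one would expect to submit to an engine such as Presto, with the tables backed by files in HDFS or S3 instead of a local database.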

Amazon Redshift - Data warehouse

  • Fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.
  • Load up the cluster and connect a BI tool
  • Amazon Redshift is based on PostgreSQL 8.0.2
  • Uses columnar storage on high-performance local disks and massively parallel query execution
  • Monitoring and Backup
  • Under $1,000 / TB / Year

BI tools supported with Redshift

  • Jaspersoft
  • MicroStrategy
  • Hitachi Pentaho
  • Tableau
  • Business Objects
  • IBM Cognos

Redshift Architecture

  • Amazon Redshift is based on PostgreSQL 8.0.2
  • PostgreSQL + MPP (massively parallel processing: scales horizontally up to 128 compute nodes) + Columnar Storage Engine + OLAP (online analytical processing)

Columnar Database

  • Optimized for reading and writing columns of data as opposed to rows of data
  • Important factor in analytic query performance
  • Drastically reduces the overall disk I/O requirements
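
The I/O saving can be sketched with a toy table held in two layouts. This is a simplified illustration, not how Redshift stores data: the table, its column names, and the "values read" counter are assumptions made for the example, with the counter standing in for disk I/O.

```python
# Toy table: the same three records stored in two layouts.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 50.0},
]

# Row layout: the fields of each record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column layout: the values of each column are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def total_amount_row_store(store):
    """Sum the amount column; every field of every record is read."""
    values_read = 0
    total = 0.0
    for record in store:
        values_read += len(record)  # the whole record comes off "disk"
        total += record[2]          # index 2 is the amount field
    return total, values_read


def total_amount_column_store(store):
    """Sum the amount column; only that column's values are read."""
    column = store["amount"]
    return sum(column), len(column)


row_total, row_reads = total_amount_row_store(row_store)
col_total, col_reads = total_amount_column_store(column_store)
print("row layout:   ", row_total, "values read:", row_reads)
print("column layout:", col_total, "values read:", col_reads)
```

Both layouts give the same total, but the column layout reads one value per record instead of all three. Analytic queries that touch a few columns of a wide table benefit in the same way.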

Amazon Elastic Block Store (EBS)

  • Provides persistent block storage volumes for use with Amazon EC2 instances

VPC - Virtual Private Cloud

  • Lets you provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define
  • You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.
  • You can use both IPv4 and IPv6 in your VPC for secure and easy access to resources and applications
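
Choosing an IP address range and carving it into subnets can be sketched with Python's standard `ipaddress` module. The `10.0.0.0/16` range, the `/24` subnet size, and the `in_vpc` helper are assumptions chosen for illustration; this only models the address arithmetic and does not call any AWS API.

```python
import ipaddress

# Hypothetical VPC address range (a private IPv4 block).
vpc_range = ipaddress.ip_network("10.0.0.0/16")

# Split the /16 into /24 subnets and keep the first three.
subnets = list(vpc_range.subnets(new_prefix=24))[:3]

for subnet in subnets:
    print(subnet, "addresses:", subnet.num_addresses)


def in_vpc(address: str) -> bool:
    """Hypothetical helper: True if the address falls inside the VPC range."""
    return ipaddress.ip_address(address) in vpc_range


print(in_vpc("10.0.1.25"))
print(in_vpc("192.168.1.1"))
```

Planning the ranges this way before creating the VPC helps avoid overlapping subnets, since each `/24` here is a distinct slice of the parent `/16`.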