
Domain 3: Analytics Services

CLF-C02 Exam Domain 3 - Part 5 | 34% of Scored Content

Learning Objectives

By the end of this section, you will be able to:

  • Understand the AWS analytics services landscape
  • Compare Amazon QuickSight, Athena, and Redshift
  • Identify use cases for different analytics services
  • Understand data ingestion and processing services

AWS Analytics Services Overview

Analytics Services Comparison

| Service | Type | Use Case | Cost Model |
| --- | --- | --- | --- |
| QuickSight | BI tool | Interactive dashboards, visualizations | Per user / per session |
| Athena | Query service | Ad-hoc SQL queries on S3 | Per TB scanned |
| Redshift | Data warehouse | Complex analytics, petabyte scale | Per node-hour |
| EMR | Big data processing | Hadoop/Spark workloads | Per instance-hour |
| Kinesis | Streaming | Real-time data streaming | Per shard-hour / data volume |
| Glue | ETL | Data catalog and ETL | Per DPU-hour |
| MSK | Streaming | Managed Apache Kafka | Per broker-hour |

Amazon QuickSight

Overview

Amazon QuickSight is a fully managed, serverless business intelligence (BI) service.

Key Characteristics

| Feature | Description |
| --- | --- |
| Serverless | No infrastructure to manage |
| Integration | Connects to RDS, Redshift, S3, Athena, and more |
| ML Insights | Anomaly detection, forecasting |
| Embedding | Embed dashboards in applications |
| SPICE | In-memory calculation engine |

SPICE (Super-fast, Parallel, In-memory Calculation Engine)

Purpose: Super-fast performance for interactive dashboards.

Capacity:

  • Standard Edition: 10 GB per user (up to 500 GB total)
  • Enterprise Edition: 10 GB per user (up to 1 TB total)

Benefits:

  • Sub-second response times
  • Handles millions of rows
  • Auto-scales capacity

QuickSight Editions

| Edition | Price | Features |
| --- | --- | --- |
| Standard | $9/user/month | Basic BI, SPICE |
| Enterprise | $18/user/month | ML insights, embedding, Active Directory integration |

Use Cases

  • Executive Dashboards: Business metrics visualization
  • Self-Service BI: Business users create own analyses
  • Embedded Analytics: Dashboards in applications
  • Anomaly Detection: ML-powered outlier detection

Amazon Athena

Overview

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.

Key Characteristics

| Feature | Description |
| --- | --- |
| Serverless | No infrastructure to provision |
| S3 queries | Query data directly in S3 |
| Standard SQL | SQL engine based on Trino/Presto |
| Pay per query | $5 per TB of data scanned |

Athena Features

Supported Formats:

  • CSV, JSON, ORC, Avro, Parquet
  • Logs (CloudTrail, VPC Flow Logs)
  • Custom formats via SerDe

Data Sources:

  • S3 (primary)
  • CloudWatch Logs
  • AWS Glue Data Catalog
  • DynamoDB (via federated query)

Cost Optimization:

  • Partition data: Reduce scanned data
  • Columnar formats: Parquet, ORC (scan only needed columns)
  • Compression: Reduce data size
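
To make the cost points above concrete, here is a minimal boto3 sketch of an ad-hoc Athena query. The database, table, partition columns, and S3 locations are illustrative placeholders; the WHERE clause on partition columns is what keeps the scanned (and billed) data small.

```python
import boto3

# Minimal sketch: run an ad-hoc Athena query from Python.
# Bucket, database, and table names below are placeholders for illustration.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT status, COUNT(*) AS requests "
        "FROM web_logs "
        "WHERE year = '2024' AND month = '06' "  # partition columns prune scanned data
        "GROUP BY status"
    ),
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

print("Query execution ID:", response["QueryExecutionId"])
```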

Athena vs Redshift

| Feature | Athena | Redshift |
| --- | --- | --- |
| Type | Serverless query service | Data warehouse |
| Setup | None | Provision clusters |
| Performance | Slower (seconds to minutes) | Faster (sub-second to seconds) |
| Concurrency | Scales automatically (serverless) | Limited by WLM configuration |
| Use case | Ad-hoc queries, occasional analytics | Complex queries, high concurrency |

Use Cases

  • Log Analysis: CloudTrail, VPC Flow Logs, ELB logs
  • Ad-hoc Analysis: Quick queries on S3 data
  • Data Discovery: Explore data before building pipeline
  • Reporting: Periodic reporting on S3 data

Amazon Redshift

Overview

Amazon Redshift is a fully managed, petabyte-scale data warehouse service.

Key Characteristics

| Feature | Description |
| --- | --- |
| Massively parallel | Queries distributed across nodes |
| Columnar storage | Optimized for analytics |
| Compression | Reduces storage needs |
| Scalable | Scale compute and storage independently |

Redshift Architecture

Leader Node:

  • Receives queries
  • Plans execution
  • Aggregates results

Compute Nodes:

  • Execute queries
  • Store data
  • Distributed across slices

Node Types:

| Type | Family | Use Case |
| --- | --- | --- |
| DC2 | Dense Compute | High performance on frequently accessed data |
| RA3 | Managed Storage | Separation of compute and storage |
| DS2 | Dense Storage (legacy) | Previous generation; superseded by RA3 |

Redshift Features

Concurrency Scaling:

  • Automatically add capacity
  • Support nearly unlimited concurrent users
  • Free for compatible workloads

Materialized Views:

  • Pre-computed results
  • Faster query performance
  • Auto-refresh options

Redshift Spectrum:

  • Query data directly in S3
  • No data loading required
  • Extends data warehouse to data lake

Distribution Styles:

| Style | Description | Use Case |
| --- | --- | --- |
| KEY | Rows distributed by the values of one column | Large tables joined on that column |
| ALL | Full copy of the table on every node | Small, slowly changing tables |
| EVEN | Round-robin distribution | Default when there is no clear distribution key |
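
As an illustration of distribution styles, here is a minimal sketch that issues DDL through the Redshift Data API (boto3). The cluster identifier, database, user, and table are placeholders; the point is that DISTSTYLE and DISTKEY are declared when the table is created.

```python
import boto3

# Minimal sketch: issue DDL that sets a distribution style via the
# Redshift Data API. Cluster, database, and user names are placeholders.
redshift_data = boto3.client("redshift-data", region_name="us-east-1")

ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that join on customer_id
SORTKEY (sale_id);
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
print("Statement ID:", response["Id"])
```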

Use Cases

  • Data Warehousing: Centralized analytics
  • Business Intelligence: Power BI, Tableau, QuickSight
  • High-performance Analytics: Complex queries
  • Data Lake Integration: Redshift Spectrum

Amazon Kinesis

Overview

Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information.

Kinesis is a family of services for ingesting and processing streaming data. In Kinesis Data Streams, data is partitioned into "shards", and each shard can ingest up to 1,000 records (1 MB) per second.

Key Characteristics:

  • Transient data store (default retention of 24 hours, configurable up to 7 days)
  • Default limit of 500 shards (can request increase to unlimited)
  • Records consist of partition key, sequence number, and data blob (up to 1 MB)
  • Synchronous replication across three AZs

Kinesis Services

| Service | Purpose | Use Case |
| --- | --- | --- |
| Kinesis Data Streams | Real-time streaming | IoT, clickstreams, logs |
| Kinesis Data Firehose | Load streaming data into data stores | S3, Redshift, OpenSearch |
| Kinesis Data Analytics | Real-time SQL on streams | Anomaly detection, filtering |
| Kinesis Video Streams | Video streaming | Camera feeds, video analysis |

Kinesis Data Streams

Purpose: Real-time processing of streaming big data.

Key Characteristics:

  • Stores data for later processing (key difference from Firehose which delivers directly)
  • Producers push data via Kinesis API, Producer Library (KPL), or Kinesis Agent
  • Consumers process data in real time (EC2 instances, Lambda)
  • Records accessible for 24 hours by default (can be extended to 7 days)

Shards:

  • Base throughput unit of Kinesis Data Streams
  • One shard provides 1 MB/sec data input and 2 MB/sec data output
  • Each shard supports up to 1000 PUT records per second
  • Stream is composed of one or more shards
  • Total capacity = sum of capacities of all shards
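
A minimal producer sketch using boto3 shows how the partition key and data blob from the record definition above map onto the API. The stream name and payload are placeholders.

```python
import json
import boto3

# Minimal sketch: a producer writing one record to a Kinesis data stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"device_id": "sensor-42", "temperature": 21.7}

response = kinesis.put_record(
    StreamName="example-clickstream",
    Data=json.dumps(event).encode("utf-8"),   # data blob, up to 1 MB
    PartitionKey=event["device_id"],          # determines the target shard
)

print("Shard:", response["ShardId"], "Sequence:", response["SequenceNumber"])
```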

Resharding:

  • Shard Split: Divide single shard into two (increases capacity and cost)
  • Shard Merge: Combine two shards into one (decreases capacity and cost)
  • Adjust number of shards to adapt to data flow changes
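
Rather than splitting and merging shards one by one, a stream can also be rescaled in a single call. A minimal sketch, assuming a stream named example-clickstream already exists:

```python
import boto3

# Minimal sketch: scale a stream's shard count in one call instead of
# issuing individual split/merge operations. Stream name is a placeholder.
kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.update_shard_count(
    StreamName="example-clickstream",
    TargetShardCount=4,               # doubles capacity if the stream had 2 shards
    ScalingType="UNIFORM_SCALING",    # evenly resizes all shards
)
```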

Pricing:

  • Shard Hour: On-demand capacity
  • Data In/Out: Per GB
  • Extended data retention (beyond 24 hours)

Security:

  • KMS master keys for encryption
  • IAM policies for access control
  • HTTPS endpoints for encryption in flight
  • VPC endpoints available

Use Cases:

  • Real-time analytics
  • IoT data ingestion
  • Log and event data collection
  • Clickstream tracking
  • Accelerated log and data feed intake

Kinesis Data Firehose

Purpose: Easiest way to load streaming data into data stores and analytics tools.

Key Characteristics:

  • Serverless (no resources to manage, no capacity planning)
  • Captures, transforms, and loads streaming data
  • Near real-time analytics with existing BI tools
  • Synchronous replication across three AZs during transport
  • No shards, fully automated

Destinations:

  • Amazon S3
  • Amazon Redshift
  • Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
  • Splunk

Features:

  • Data transformation (Lambda)
  • Data conversion (Parquet, ORC)
  • Compression, encryption
  • Batch processing
  • Can back up source data to S3 before transformation

Data Flow:

  • For S3: Delivers directly to bucket
  • For Redshift: Delivers to S3 first, then issues COPY command to Redshift
  • For OpenSearch: Delivers to the cluster, optionally backs up to S3
  • For Splunk: Delivers to Splunk, optionally backs up to S3

Record Size: Maximum 1000 KB per record (before Base64-encoding)
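
A minimal boto3 sketch of a Firehose producer, assuming a delivery stream named example-delivery-stream already exists with S3 as its destination; Firehose handles batching, optional transformation, and delivery:

```python
import json
import boto3

# Minimal sketch: push a record into a Firehose delivery stream.
# The delivery stream name and payload are placeholders.
firehose = boto3.client("firehose", region_name="us-east-1")

log_line = {"path": "/checkout", "latency_ms": 183}

firehose.put_record(
    DeliveryStreamName="example-delivery-stream",
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)
```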

Use Cases:

  • Load streaming data to S3
  • Real-time data lake ingestion
  • Log aggregation
  • ETL automation

Kinesis Data Analytics

Purpose: Easiest way to process and analyze real-time streaming data.

Key Characteristics:

  • Standard SQL queries to process Kinesis streams
  • Real-time analysis
  • Sits on top of Kinesis Data Streams and Kinesis Data Firehose and can ingest from either

Application Components:

  • Input: Streaming source for application
  • Application Code: SQL statements that process input and produce output
  • Output: In-application streams for intermediate results

Input Types:

  • Streaming data sources (continuously generated)
  • Reference data sources (static data for enrichment)

Destinations:

  • S3, Redshift, OpenSearch
  • Kinesis Data Streams

Use Cases:

  • Time-series analytics
  • Real-time dashboards
  • Real-time alerts and notifications
  • Anomaly detection

Kinesis Video Streams

Purpose: Securely stream video from connected devices to AWS.

Key Characteristics:

  • Durably stores, encrypts, and indexes video data streams
  • Easy-to-use APIs for access
  • Stores data for 24 hours by default (up to 7 days)
  • Stores data in shards
  • Encryption at rest with KMS

Shard Capacity:

  • 5 transactions per second for reads
  • Max read rate of 2 MB per second
  • 1000 records per second for writes
  • Max write of 1 MB per second

Use Cases:

  • Camera feeds
  • Video analysis
  • Machine learning on video
  • Security monitoring

Kinesis Client Library (KCL)

Purpose: Java library that helps distributed applications read and process records from a Kinesis data stream.

Key Differences:

  • KCL vs Kinesis Data Streams API:
    • Kinesis Data Streams API: Manage streams, resharding, putting/getting records
    • KCL: Abstraction specifically for processing data in consumer role

KCL Functions:

  • Connects to stream and enumerates shards
  • Coordinates shard associations with other workers
  • Instantiates record processor for every shard
  • Pulls data records and pushes to record processor
  • Checkpoints processed records
  • Balances shard-worker associations when instances change

Scaling:

  • Each shard processed by exactly one KCL worker
  • One worker can process multiple shards
  • Number of instances should not exceed number of shards
  • Progress checkpointed into DynamoDB (requires IAM access)

Use Cases:

  • Distributed stream processing
  • EC2, Elastic Beanstalk, on-premises servers

Kinesis vs SQS vs SNS

| Feature | Kinesis | SQS | SNS |
| --- | --- | --- | --- |
| Data model | Pull | Pull | Push |
| Data persistence | Up to 7 days | Deleted after consumption | Not persisted |
| Throughput | Must provision shards | No provisioning needed | No provisioning needed |
| Ordering | Per-shard ordering | FIFO queues only | No ordering (standard topics) |
| Consumers | Multiple consumers can read the same data | Each message processed by one consumer | Up to 10M+ subscribers |
| Use case | Real-time big data, ETL | Decoupling, buffering | Fan-out, notifications |

AWS Glue

Overview

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data.

Glue Components

| Component | Purpose |
| --- | --- |
| Data Catalog | Central metadata repository |
| Crawlers | Discover data and populate the catalog |
| ETL Jobs | Spark or Python scripts that transform and move data |
| Triggers | Schedule or chain ETL jobs |
| Workflows | Orchestrate multiple jobs, crawlers, and triggers |

Glue Data Catalog

Purpose: Central metadata repository for data assets.

Features:

  • Tables, schemas, partitions
  • Integration with Athena, Redshift Spectrum, EMR
  • Column-level statistics
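
As a sketch of how the catalog gets populated, the boto3 calls below create and start a crawler over an S3 prefix. The crawler name, IAM role, database, and path are placeholders.

```python
import boto3

# Minimal sketch: create and start a crawler that scans an S3 prefix and
# writes the discovered tables into the Glue Data Catalog.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="example-sales-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueServiceRole",
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/sales/"}]},
)

glue.start_crawler(Name="example-sales-crawler")
```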

Use Cases

  • Data Discovery: Catalog data across S3
  • ETL Pipelines: Transform data for analytics
  • Data Lake: Build and maintain data lake
  • Schema Evolution: Handle schema changes

Amazon EMR (Elastic MapReduce)

Overview

Amazon EMR is a managed cluster platform that simplifies running big data frameworks.

Supported Applications

| Application | Type | Use Case |
| --- | --- | --- |
| Apache Spark | In-memory processing | ETL, machine learning, graph processing |
| Hadoop MapReduce | Batch processing | Big data processing |
| Presto | Distributed SQL | Interactive queries |
| Hive | Data warehouse | SQL on Hadoop |
| HBase | NoSQL database | Real-time read/write |
| Flink | Stream processing | Real-time analytics |
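
To illustrate how EMR is often used for transient processing, here is a minimal boto3 sketch that launches a Spark cluster, runs one spark-submit step, and terminates. The release label, instance types, script location, and IAM role names are placeholders.

```python
import boto3

# Minimal sketch: launch a transient Spark cluster that terminates when its
# single step finishes. All names, sizes, and paths are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-spark-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate after the step completes
    },
    Steps=[
        {
            "Name": "spark-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```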

EMR Use Cases

  • Big Data Processing: Large-scale data processing
  • Machine Learning: Spark MLlib
  • Data Transformation: ETL at scale
  • Log Processing: Web logs, sensor data

MSK (Managed Streaming for Kafka)

Overview

Amazon MSK is a fully managed Apache Kafka service.

Key Features

| Feature | Description |
| --- | --- |
| Managed | AWS handles provisioning and patching |
| Highly available | Multi-AZ deployment |
| Compatible | Native Apache Kafka APIs |
| Secure | IAM authentication, encryption |
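
A minimal boto3 sketch of provisioning an MSK cluster; the subnet IDs, security group, Kafka version, and sizing are placeholders and would come from your own VPC and requirements:

```python
import boto3

# Minimal sketch: provision a small MSK cluster with one broker per AZ.
msk = boto3.client("kafka", region_name="us-east-1")

response = msk.create_cluster(
    ClusterName="example-events-cluster",
    KafkaVersion="3.5.1",
    NumberOfBrokerNodes=3,                      # one broker per subnet/AZ
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"],
        "SecurityGroups": ["sg-0123456789abcdef0"],
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},  # GiB per broker
    },
)
print("Cluster ARN:", response["ClusterArn"])
```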

Use Cases

  • Event Streaming: Real-time event processing
  • Data Pipelines: Stream processing
  • Log Aggregation: Centralized logging
  • Microservices: Event-driven architecture

Exam Tips - Analytics Services

High-Yield Topics

  1. QuickSight:

    • Serverless BI tool
    • SPICE = in-memory engine
    • ML insights = anomaly detection, forecasting
  2. Athena:

    • Serverless SQL queries on S3
    • Pay per TB scanned ($5/TB)
    • Use partitions, columnar formats to reduce cost
  3. Redshift:

    • Petabyte-scale data warehouse
    • Columnar storage, massively parallel
    • RA3 = managed storage (separate compute/storage)
    • Spectrum = query S3 data
  4. Kinesis:

    • Data Streams = real-time streaming, stores data (24h-7d), 1 MB/s input, 2 MB/s output per shard
    • Firehose = serverless loading to S3/Redshift/OpenSearch, no shards
    • Analytics = SQL on streams, real-time processing
    • Video = video streaming for camera feeds
    • KCL = Java library for distributed stream processing
  5. Glue:

    • Data Catalog = metadata repository
    • ETL = serverless data transformation
    • Crawlers = discover data

Additional Resources

  • DigitalCloud Training Cheat Sheets
  • Official AWS Documentation
  • AWS Analytics Resources


Next: AI/ML Services
