
Domain 3: Analytics Services

CLF-C02 Exam Domain 3 - Part 5 | 34% of Scored Content

Learning Objectives

By the end of this section, you will be able to:

  • Understand the AWS analytics services landscape
  • Compare Amazon QuickSight, Athena, and Redshift
  • Identify use cases for different analytics services
  • Understand data ingestion and processing services

AWS Analytics Services Overview

Analytics Services Comparison

| Service | Type | Use Case | Cost Model |
| --- | --- | --- | --- |
| QuickSight | BI tool | Interactive dashboards, visualizations | Per user / per session |
| Athena | Query service | Ad-hoc SQL queries on S3 | Per TB scanned |
| Redshift | Data warehouse | Complex analytics, petabyte scale | Per node-hour |
| EMR | Big data processing | Hadoop/Spark workloads | Per instance-hour |
| Kinesis | Streaming | Real-time data streaming | Per shard-hour / data volume |
| Glue | ETL | Data catalog and ETL | Per DPU-hour |
| MSK | Streaming | Managed Apache Kafka | Per broker-hour |

Amazon QuickSight

Overview

Amazon QuickSight is a fully managed, serverless business intelligence (BI) service.

Key Characteristics

| Feature | Description |
| --- | --- |
| Serverless | No infrastructure to manage |
| Integration | Connects to RDS, Redshift, S3, Athena, and more |
| ML Insights | Anomaly detection, forecasting |
| Embedding | Embed dashboards in applications |
| SPICE | In-memory calculation engine |

SPICE (Super-fast, Parallel, In-memory Calculation Engine)

Purpose: Super-fast performance for interactive dashboards.

Capacity:

  • Standard Edition: 10 GB per user (up to 500 GB total)
  • Enterprise Edition: 10 GB per user (up to 1 TB total)

Benefits:

  • Sub-second response times
  • Handles millions of rows
  • Auto-scales capacity

QuickSight Editions

| Edition | Price | Features |
| --- | --- | --- |
| Standard | $9/user/month | Basic BI, SPICE |
| Enterprise | $18/user/month | ML insights, embedding, Active Directory integration |

Use Cases

  • Executive Dashboards: Business metrics visualization
  • Self-Service BI: Business users create own analyses
  • Embedded Analytics: Dashboards in applications
  • Anomaly Detection: ML-powered outlier detection

Amazon Athena

Overview

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.

Key Characteristics

| Feature | Description |
| --- | --- |
| Serverless | No infrastructure to provision |
| S3 queries | Query data directly in S3 |
| Standard SQL | SQL engine based on Trino/Presto |
| Pay per query | $5 per TB of data scanned |

Athena Features

Supported Formats:

  • CSV, JSON, ORC, Avro, Parquet
  • Logs (CloudTrail, VPC Flow Logs)
  • Custom formats via SerDe

Data Sources:

  • S3 (primary)
  • CloudWatch Logs
  • AWS Glue Data Catalog
  • DynamoDB (via federated query)

Cost Optimization:

  • Partition data: Reduce scanned data
  • Columnar formats: Parquet, ORC (scan only needed columns)
  • Compression: Reduce data size
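
To make the cost points above concrete, here is a minimal boto3 sketch of an ad-hoc Athena query. The database, table, partition columns, and S3 locations are illustrative placeholders; the WHERE clause on partition columns is what keeps the scanned (and billed) data small.

```python
import boto3

# Minimal sketch: run an ad-hoc Athena query from Python.
# Bucket, database, and table names below are placeholders for illustration.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT status, COUNT(*) AS requests "
        "FROM web_logs "
        "WHERE year = '2024' AND month = '06' "  # partition columns prune scanned data
        "GROUP BY status"
    ),
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

print("Query execution ID:", response["QueryExecutionId"])
```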

Athena vs Redshift

| Feature | Athena | Redshift |
| --- | --- | --- |
| Type | Serverless query service | Data warehouse |
| Setup | None | Provision clusters |
| Performance | Slower (seconds to minutes) | Faster (sub-second to seconds) |
| Concurrency | Scales automatically (serverless) | Limited by WLM configuration |
| Use case | Ad-hoc queries, occasional analytics | Complex queries, high concurrency |

Use Cases

  • Log Analysis: CloudTrail, VPC Flow Logs, ELB logs
  • Ad-hoc Analysis: Quick queries on S3 data
  • Data Discovery: Explore data before building pipeline
  • Reporting: Periodic reporting on S3 data

Amazon Redshift

Overview

Amazon Redshift is a fully managed, petabyte-scale data warehouse service.

Key Characteristics

| Feature | Description |
| --- | --- |
| Massively parallel | Queries distributed across nodes |
| Columnar storage | Optimized for analytics |
| Compression | Reduces storage needs |
| Scalable | Scale compute and storage independently |

Redshift Architecture

Leader Node:

  • Receives queries
  • Plans execution
  • Aggregates results

Compute Nodes:

  • Execute queries
  • Store data
  • Distributed across slices

Node Types:

| Type | Family | Use Case |
| --- | --- | --- |
| DC2 | Dense Compute | High performance on frequently accessed data |
| RA3 | Managed Storage | Separation of compute and storage |
| DS2 | Dense Storage (legacy) | Previous generation; superseded by RA3 |

Redshift Features

Concurrency Scaling:

  • Automatically add capacity
  • Support nearly unlimited concurrent users
  • Free for compatible workloads

Materialized Views:

  • Pre-computed results
  • Faster query performance
  • Auto-refresh options

Redshift Spectrum:

  • Query data directly in S3
  • No data loading required
  • Extends data warehouse to data lake

Distribution Styles:

| Style | Description | Use Case |
| --- | --- | --- |
| KEY | Rows distributed by the values of one column | Large tables joined on that column |
| ALL | Full copy of the table on every node | Small, slowly changing tables |
| EVEN | Round-robin distribution | Default when there is no clear distribution key |
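
As an illustration of distribution styles, here is a minimal sketch that issues DDL through the Redshift Data API (boto3). The cluster identifier, database, user, and table are placeholders; the point is that DISTSTYLE and DISTKEY are declared when the table is created.

```python
import boto3

# Minimal sketch: issue DDL that sets a distribution style via the
# Redshift Data API. Cluster, database, and user names are placeholders.
redshift_data = boto3.client("redshift-data", region_name="us-east-1")

ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that join on customer_id
SORTKEY (sale_id);
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
print("Statement ID:", response["Id"])
```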

Use Cases

  • Data Warehousing: Centralized analytics
  • Business Intelligence: Power BI, Tableau, QuickSight
  • High-performance Analytics: Complex queries
  • Data Lake Integration: Redshift Spectrum

Amazon Kinesis

Overview

Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information.

Kinesis is a family of services for ingesting and processing streaming data. In Kinesis Data Streams, data is partitioned into "shards", and each shard can ingest up to 1,000 records (1 MB) per second.

Key Characteristics:

  • Transient data store (default retention of 24 hours, configurable up to 7 days)
  • Default limit of 500 shards (can request increase to unlimited)
  • Records consist of partition key, sequence number, and data blob (up to 1 MB)
  • Synchronous replication across three AZs

Kinesis Services

| Service | Purpose | Use Case |
| --- | --- | --- |
| Kinesis Data Streams | Real-time streaming | IoT, clickstreams, logs |
| Kinesis Data Firehose | Load streaming data into data stores | S3, Redshift, OpenSearch |
| Kinesis Data Analytics | Real-time SQL on streams | Anomaly detection, filtering |
| Kinesis Video Streams | Video streaming | Camera feeds, video analysis |

Kinesis Data Streams

Purpose: Real-time processing of streaming big data.

Key Characteristics:

  • Stores data for later processing (key difference from Firehose which delivers directly)
  • Producers push data via Kinesis API, Producer Library (KPL), or Kinesis Agent
  • Consumers process data in real time (EC2 instances, Lambda)
  • Records accessible for 24 hours by default (can be extended to 7 days)

Shards:

  • Base throughput unit of Kinesis Data Streams
  • One shard provides 1 MB/sec data input and 2 MB/sec data output
  • Each shard supports up to 1000 PUT records per second
  • Stream is composed of one or more shards
  • Total capacity = sum of capacities of all shards
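
A minimal producer sketch using boto3 shows how the partition key and data blob from the record definition above map onto the API. The stream name and payload are placeholders.

```python
import json
import boto3

# Minimal sketch: a producer writing one record to a Kinesis data stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"device_id": "sensor-42", "temperature": 21.7}

response = kinesis.put_record(
    StreamName="example-clickstream",
    Data=json.dumps(event).encode("utf-8"),   # data blob, up to 1 MB
    PartitionKey=event["device_id"],          # determines the target shard
)

print("Shard:", response["ShardId"], "Sequence:", response["SequenceNumber"])
```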

Resharding:

  • Shard Split: Divide single shard into two (increases capacity and cost)
  • Shard Merge: Combine two shards into one (decreases capacity and cost)
  • Adjust number of shards to adapt to data flow changes
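
Rather than splitting and merging shards one by one, a stream can also be rescaled in a single call. A minimal sketch, assuming a stream named example-clickstream already exists:

```python
import boto3

# Minimal sketch: scale a stream's shard count in one call instead of
# issuing individual split/merge operations. Stream name is a placeholder.
kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.update_shard_count(
    StreamName="example-clickstream",
    TargetShardCount=4,               # doubles capacity if the stream had 2 shards
    ScalingType="UNIFORM_SCALING",    # evenly resizes all shards
)
```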

Pricing:

  • Shard Hour: On-demand capacity
  • Data In/Out: Per GB
  • Extended data retention (beyond 24 hours)

Security:

  • KMS master keys for encryption
  • IAM policies for access control
  • HTTPS endpoints for encryption in flight
  • VPC endpoints available

Use Cases:

  • Real-time analytics
  • IoT data ingestion
  • Log and event data collection
  • Clickstream tracking
  • Accelerated log and data feed intake

Kinesis Data Firehose

Purpose: Easiest way to load streaming data into data stores and analytics tools.

Key Characteristics:

  • Serverless (no resources to manage, no capacity planning)
  • Captures, transforms, and loads streaming data
  • Near real-time analytics with existing BI tools
  • Synchronous replication across three AZs during transport
  • No shards, fully automated

Destinations:

  • Amazon S3
  • Amazon Redshift
  • Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
  • Splunk

Features:

  • Data transformation (Lambda)
  • Data conversion (Parquet, ORC)
  • Compression, encryption
  • Batch processing
  • Can back up source data to S3 before transformation

Data Flow:

  • For S3: Delivers directly to bucket
  • For Redshift: Delivers to S3 first, then issues COPY command to Redshift
  • For OpenSearch: Delivers to the cluster, optionally backs up to S3
  • For Splunk: Delivers to Splunk, optionally backs up to S3

Record Size: Maximum 1000 KB per record (before Base64-encoding)
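
A minimal boto3 sketch of a Firehose producer, assuming a delivery stream named example-delivery-stream already exists with S3 as its destination; Firehose handles batching, optional transformation, and delivery:

```python
import json
import boto3

# Minimal sketch: push a record into a Firehose delivery stream.
# The delivery stream name and payload are placeholders.
firehose = boto3.client("firehose", region_name="us-east-1")

log_line = {"path": "/checkout", "latency_ms": 183}

firehose.put_record(
    DeliveryStreamName="example-delivery-stream",
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)
```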

Use Cases:

  • Load streaming data to S3
  • Real-time data lake ingestion
  • Log aggregation
  • ETL automation

Kinesis Data Analytics

Purpose: Easiest way to process and analyze real-time streaming data.

Key Characteristics:

  • Standard SQL queries to process Kinesis streams
  • Real-time analysis
  • Sits on top of Kinesis Data Streams and Kinesis Data Firehose and can ingest from either

Application Components:

  • Input: Streaming source for application
  • Application Code: SQL statements that process input and produce output
  • Output: In-application streams for intermediate results

Input Types:

  • Streaming data sources (continuously generated)
  • Reference data sources (static data for enrichment)

Destinations:

  • S3, Redshift, OpenSearch
  • Kinesis Data Streams

Use Cases:

  • Time-series analytics
  • Real-time dashboards
  • Real-time alerts and notifications
  • Anomaly detection

Kinesis Video Streams

Purpose: Securely stream video from connected devices to AWS.

Key Characteristics:

  • Durably stores, encrypts, and indexes video data streams
  • Easy-to-use APIs for access
  • Stores data for 24 hours by default (up to 7 days)
  • Stores data in shards
  • Encryption at rest with KMS

Shard Capacity:

  • 5 transactions per second for reads
  • Max read rate of 2 MB per second
  • 1000 records per second for writes
  • Max write of 1 MB per second

Use Cases:

  • Camera feeds
  • Video analysis
  • Machine learning on video
  • Security monitoring

Kinesis Client Library (KCL)

Purpose: Java library that helps distributed applications read and process records from a Kinesis data stream.

Key Differences:

  • KCL vs Kinesis Data Streams API:
    • Kinesis Data Streams API: Manage streams, resharding, putting/getting records
    • KCL: Abstraction specifically for processing data in consumer role

KCL Functions:

  • Connects to stream and enumerates shards
  • Coordinates shard associations with other workers
  • Instantiates record processor for every shard
  • Pulls data records and pushes to record processor
  • Checkpoints processed records
  • Balances shard-worker associations when instances change

Scaling:

  • Each shard processed by exactly one KCL worker
  • One worker can process multiple shards
  • Number of instances should not exceed number of shards
  • Progress checkpointed into DynamoDB (requires IAM access)

Use Cases:

  • Distributed stream processing
  • EC2, Elastic Beanstalk, on-premises servers

Kinesis vs SQS vs SNS

| Feature | Kinesis | SQS | SNS |
| --- | --- | --- | --- |
| Data model | Pull | Pull | Push |
| Data persistence | Up to 7 days | Deleted after consumption | Not persisted |
| Throughput | Must provision shards | No provisioning needed | No provisioning needed |
| Ordering | Per-shard ordering | FIFO queues only | No ordering (standard topics) |
| Consumers | Multiple consumers can read the same data | Each message processed by one consumer | Up to 10M+ subscribers |
| Use case | Real-time big data, ETL | Decoupling, buffering | Fan-out, notifications |

AWS Glue

Overview

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data.

Glue Components

| Component | Purpose |
| --- | --- |
| Data Catalog | Central metadata repository |
| Crawlers | Discover data and populate the catalog |
| ETL Jobs | Spark or Python scripts that transform and move data |
| Triggers | Schedule or chain ETL jobs |
| Workflows | Orchestrate multiple jobs, crawlers, and triggers |

Glue Data Catalog

Purpose: Central metadata repository for data assets.

Features:

  • Tables, schemas, partitions
  • Integration with Athena, Redshift Spectrum, EMR
  • Column-level statistics
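
As a sketch of how the catalog gets populated, the boto3 calls below create and start a crawler over an S3 prefix. The crawler name, IAM role, database, and path are placeholders.

```python
import boto3

# Minimal sketch: create and start a crawler that scans an S3 prefix and
# writes the discovered tables into the Glue Data Catalog.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="example-sales-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueServiceRole",
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/sales/"}]},
)

glue.start_crawler(Name="example-sales-crawler")
```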

Use Cases

  • Data Discovery: Catalog data across S3
  • ETL Pipelines: Transform data for analytics
  • Data Lake: Build and maintain data lake
  • Schema Evolution: Handle schema changes

Amazon EMR (Elastic MapReduce)

Overview

Amazon EMR is a managed cluster platform that simplifies running big data frameworks.

Supported Applications

| Application | Type | Use Case |
| --- | --- | --- |
| Apache Spark | In-memory processing | ETL, machine learning, graph processing |
| Hadoop MapReduce | Batch processing | Big data processing |
| Presto | Distributed SQL | Interactive queries |
| Hive | Data warehouse | SQL on Hadoop |
| HBase | NoSQL database | Real-time read/write |
| Flink | Stream processing | Real-time analytics |
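
To illustrate how EMR is often used for transient processing, here is a minimal boto3 sketch that launches a Spark cluster, runs one spark-submit step, and terminates. The release label, instance types, script location, and IAM role names are placeholders.

```python
import boto3

# Minimal sketch: launch a transient Spark cluster that terminates when its
# single step finishes. All names, sizes, and paths are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-spark-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate after the step completes
    },
    Steps=[
        {
            "Name": "spark-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```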

EMR Use Cases

  • Big Data Processing: Large-scale data processing
  • Machine Learning: Spark MLlib
  • Data Transformation: ETL at scale
  • Log Processing: Web logs, sensor data

MSK (Managed Streaming for Kafka)

Overview

Amazon MSK is a fully managed Apache Kafka service.

Key Features

| Feature | Description |
| --- | --- |
| Managed | AWS handles provisioning and patching |
| Highly available | Multi-AZ deployment |
| Compatible | Native Apache Kafka APIs |
| Secure | IAM authentication, encryption |
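
A minimal boto3 sketch of provisioning an MSK cluster; the subnet IDs, security group, Kafka version, and sizing are placeholders and would come from your own VPC and requirements:

```python
import boto3

# Minimal sketch: provision a small MSK cluster with one broker per AZ.
msk = boto3.client("kafka", region_name="us-east-1")

response = msk.create_cluster(
    ClusterName="example-events-cluster",
    KafkaVersion="3.5.1",
    NumberOfBrokerNodes=3,                      # one broker per subnet/AZ
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"],
        "SecurityGroups": ["sg-0123456789abcdef0"],
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},  # GiB per broker
    },
)
print("Cluster ARN:", response["ClusterArn"])
```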

Use Cases

  • Event Streaming: Real-time event processing
  • Data Pipelines: Stream processing
  • Log Aggregation: Centralized logging
  • Microservices: Event-driven architecture

Exam Tips - Analytics Services

High-Yield Topics

  1. QuickSight:

    • Serverless BI tool
    • SPICE = in-memory engine
    • ML insights = anomaly detection, forecasting
  2. Athena:

    • Serverless SQL queries on S3
    • Pay per TB scanned ($5/TB)
    • Use partitions, columnar formats to reduce cost
  3. Redshift:

    • Petabyte-scale data warehouse
    • Columnar storage, massively parallel
    • RA3 = managed storage (separate compute/storage)
    • Spectrum = query S3 data
  4. Kinesis:

    • Data Streams = real-time streaming, stores data (24h-7d), 1 MB/s input, 2 MB/s output per shard
    • Firehose = serverless loading to S3/Redshift/OpenSearch, no shards
    • Analytics = SQL on streams, real-time processing
    • Video = video streaming for camera feeds
    • KCL = Java library for distributed stream processing
  5. Glue:

    • Data Catalog = metadata repository
    • ETL = serverless data transformation
    • Crawlers = discover data

Additional Resources

  • DigitalCloud Training Cheat Sheets
  • Official AWS Documentation
  • AWS Analytics Resources


Next: AI/ML Services
