Querying Data In Place With Amazon S3 And Analytics Tools

File Name: Querying_data_in_place_with_Amazon_S3_and_analytics_tools_STG402.pdf

Querying data in place with Amazon S3 and analytics tools
John Mallory
Storage Business Development
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved

How it works: Data lakes and analytics on AWS
Build data lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging
Enable self-service and combined analytics
• Analysts discover all data available for analysis from a single data catalog
• Use multiple analytics tools over the same data
[Diagram labels: OLTP, Devices, Amazon Kinesis Data Streams, Amazon S3, AWS Glue Data Catalog, Amazon FSx for Lustre, AI Services, Amazon Athena, Amazon EMR, Amazon Redshift, Amazon QuickSight]
Focus: Reduced Time to Business Outcomes
Processing and querying in place
User-Defined Functions: AWS Lambda
• Bring your own functions & code
• Execute without provisioning servers
Fully Managed Process & Query: AWS Glue, Amazon Athena, Amazon Redshift, Amazon SageMaker
• Catalog, transform & query data in Amazon S3
• No physical instances to manage
Focus on Agility and Extracting Data Value
Process and query data in place on Amazon S3
[Diagram labels: Amazon Athena, Amazon Redshift, Amazon SageMaker, AWS Glue, Amazon S3]
Today: All of these tools retrieve a lot of data they don’t need from Amazon S3 and then scan and filter objects
Amazon S3 Select changes the equation
Select a subset of an object’s data with a SQL expression
Amazon S3 scans and filters the object and returns matching results
Amazon S3 Select
Easy to use: Standard SQL expression
Integrated: Works just like a GET request
Open: AWS SDK, adopted for many AWS services and by many APN Partners
Amazon S3 Select usage
SELECT s.country, s.city from S3Object s where s.city = 'Seattle'
Operates within the Amazon S3 system
SQL statement operates on a per-object basis
Returns only the results that match the SQL query
Supports CSV, JSON, and Parquet formats
Integrated with Spark, Hive, and Presto on Amazon EMR
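
The usage notes above can be sketched with the AWS SDK for Python (boto3). This is a minimal sketch, not the presenter's code: it assumes a CSV object with a header row and an account where S3 Select is available. The helper names and the bucket and key in the usage note are placeholders that do not come from the slides.

```python
import csv
import io


def build_select_params(bucket, key, expression):
    """Build keyword arguments for boto3's S3 `select_object_content` call.

    Assumes the object is a CSV file whose first line is a header, so the
    SQL expression can refer to columns by name (s.city, s.country).
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def collect_records(event_stream):
    """Join the `Records` payloads of a Select event stream into one string.

    The stream also carries other events (for example `Stats` and `End`),
    which are skipped here.
    """
    chunks = [e["Records"]["Payload"] for e in event_stream if "Records" in e]
    return b"".join(chunks).decode("utf-8")


def parse_rows(text):
    """Parse the CSV text returned by S3 Select into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


def select_rows(bucket, key, expression):
    """Run the query against one object and return the matching rows."""
    import boto3  # imported lazily so the helpers above work without the SDK

    s3 = boto3.client("s3")
    response = s3.select_object_content(
        **build_select_params(bucket, key, expression)
    )
    return parse_rows(collect_records(response["Payload"]))
```

With credentials configured, a call such as `select_rows("example-bucket", "cities.csv", "SELECT s.country, s.city FROM S3Object s WHERE s.city = 'Seattle'")` would return only the matching rows, so the filtering happens inside Amazon S3 rather than in the client.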
Scan Range Selects: Up to 10x performance boost for large objects NEW!
SELECT * FROM s3object s WHERE s.Category = 'ASSAULT'
AND s.PdDistrict = 'MISSION' AND s."Date" BETWEEN '2010-12-31' AND '2012-01-01'
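
A scan range request limits a query to one byte range of an object, so a large object can be split into ranges that are queried independently. The sketch below only computes the ranges and attaches them to request parameters through the `ScanRange` argument of `select_object_content`. It assumes, as I understand the API, that `End` is an inclusive byte offset and that a record starting inside a range is processed in full. The chunk size and helper names are illustrative.

```python
# The example query from the slide; "Date" is quoted because it is a
# reserved word in the S3 Select SQL dialect.
QUERY = (
    "SELECT * FROM s3object s WHERE s.Category = 'ASSAULT' "
    "AND s.PdDistrict = 'MISSION' "
    "AND s.\"Date\" BETWEEN '2010-12-31' AND '2012-01-01'"
)


def scan_ranges(object_size, chunk_size):
    """Yield non-overlapping {'Start', 'End'} byte ranges covering the object.

    Offsets are zero-based and `End` is treated as inclusive (an assumption
    stated in the lead-in above).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1
        yield {"Start": start, "End": end}


def ranged_requests(base_params, object_size, chunk_size):
    """Return one copy of the request parameters per scan range."""
    return [
        dict(base_params, ScanRange=scan_range)
        for scan_range in scan_ranges(object_size, chunk_size)
    ]
```

Each dictionary returned by `ranged_requests` could be passed to `select_object_content` on its own, for example from a thread pool, which is presumably where the speed-up for large objects comes from.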
Amazon S3 Select: Serverless applications
[Diagram labels: Amazon S3, AWS Lambda, Amazon S3 Select, Amazon SNS]
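
One plausible reading of the diagram above is that a new object in Amazon S3 triggers an AWS Lambda function, which filters the object with S3 Select and publishes any matches to an Amazon SNS topic. That flow is an assumption, since the slide text names only the four services. The query, the topic ARN, and the function names below are hypothetical.

```python
import os
import urllib.parse

# Hypothetical query and topic; neither value comes from the slides.
QUERY = "SELECT s.country, s.city FROM S3Object s WHERE s.city = 'Seattle'"
TOPIC_ARN = os.environ.get(
    "TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:example-topic"
)


def select_text(s3, bucket, key, expression):
    """Run S3 Select on one CSV object and return the matching rows as text."""
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = [
        e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
    ]
    return b"".join(chunks).decode("utf-8")


def handle(event, s3, sns):
    """Process an S3 notification event; return how many messages were sent."""
    published = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        rows = select_text(s3, bucket, key, QUERY)
        if rows:
            # SNS limits message size, so very large results would need
            # truncation or a pointer to the data instead.
            sns.publish(TopicArn=TOPIC_ARN, Message=rows)
            published += 1
    return published


def lambda_handler(event, context):
    """Lambda entry point; creates real clients and delegates to handle()."""
    import boto3  # available in the Lambda Python runtime

    return handle(event, boto3.client("s3"), boto3.client("sns"))
```

Passing the clients into `handle` keeps the logic testable with stand-in objects, and no servers are provisioned at any step.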
Set up a catalog, ETL, and data prep with AWS Glue
• Serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark
• Pay only for the resources used for jobs
• Crawl your data sources, identify data formats, and suggest schemas and transformations
• Automates the effort in building, maintaining, and running ETL jobs
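
As a rough illustration of the crawl and ETL steps listed above, the sketch below builds the parameters for a crawler over an S3 prefix and for a job run, then passes them to the boto3 Glue client. The crawler name, role ARN, database, path, and job arguments are placeholders, and the job itself is assumed to exist already.

```python
def crawler_definition(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue `create_crawler` call.

    The crawler scans `s3_path` and writes the table definitions it infers
    into `database` in the Glue Data Catalog.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def job_run_request(job_name, arguments=None):
    """Build keyword arguments for the Glue `start_job_run` call.

    Job argument names conventionally start with "--".
    """
    request = {"JobName": job_name}
    if arguments:
        request["Arguments"] = dict(arguments)
    return request


def crawl_then_run(definition, job_request):
    """Create and start the crawler, then start an ETL job run.

    A real pipeline would wait for the crawler to finish before starting
    the job; that wait is left out of this sketch.
    """
    import boto3  # imported lazily so the builders work without the SDK

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])
    return glue.start_job_run(**job_request)["JobRunId"]
```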
Event-driven batch ingest pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
[Diagram: pipeline timeline]
New raw data arrives (< 22:00 UTC) → Crawl raw dataset → Run “optimize” job → Crawl optimized dataset → Ready for reporting → SLA deadline (02:00 UTC)
• Data arrives in Amazon S3: Start crawler
• Crawler succeeds: Start job or trigger
• Job succeeds: Start crawler
• Output: Reporting dataset
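
The pipeline above can be sketched as a single AWS Lambda function that inspects each incoming event and starts the next stage. The resource names are hypothetical. The event shapes (an S3 notification, and CloudWatch Events with the detail types "Glue Crawler State Change" and "Glue Job State Change") reflect my understanding of those services and should be checked against real events before use.

```python
# Hypothetical resource names; none of them come from the slides.
RAW_CRAWLER = "raw-dataset-crawler"
OPTIMIZE_JOB = "optimize-job"
OPTIMIZED_CRAWLER = "optimized-dataset-crawler"


def next_action(event):
    """Map an incoming event to the next pipeline step.

    Returns ("start_crawler", name), ("start_job_run", name), or None when
    the event should not trigger anything.
    """
    # Stage 1: new raw data arrives in Amazon S3, so crawl the raw dataset.
    if any(r.get("eventSource") == "aws:s3" for r in event.get("Records", [])):
        return ("start_crawler", RAW_CRAWLER)

    detail_type = event.get("detail-type")
    detail = event.get("detail", {})

    # Stage 2: the raw crawler succeeded, so run the "optimize" job.
    if (
        detail_type == "Glue Crawler State Change"
        and detail.get("state") == "Succeeded"
        and detail.get("crawlerName") == RAW_CRAWLER
    ):
        return ("start_job_run", OPTIMIZE_JOB)

    # Stage 3: the job succeeded, so crawl the optimized dataset.
    if (
        detail_type == "Glue Job State Change"
        and detail.get("state") == "SUCCEEDED"
        and detail.get("jobName") == OPTIMIZE_JOB
    ):
        return ("start_crawler", OPTIMIZED_CRAWLER)

    # Anything else, including success of the final crawler, ends the chain.
    return None


def lambda_handler(event, context):
    """Lambda entry point: start whichever Glue resource comes next."""
    action = next_action(event)
    if action is None:
        return None
    import boto3  # available in the Lambda Python runtime

    glue = boto3.client("glue")
    kind, name = action
    if kind == "start_crawler":
        glue.start_crawler(Name=name)
    else:
        glue.start_job_run(JobName=name)
    return list(action)
```

Keeping the routing in a pure function makes it easy to verify that the final crawler's success event does not restart the pipeline.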

Frequently Asked Questions

Can I run a SQL query on Amazon S3?

Amazon S3 Select and Amazon S3 Glacier Select enable customers to run structured query language (SQL) queries directly on data stored in Amazon S3 and Amazon S3 Glacier. With S3 Select, you simply store your data in S3 and query it with SQL statements that filter the contents of S3 objects, retrieving only the data that you need.

What is Amazon S3 Select and how does it work?

With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency of retrieving this data.

How can I perform in-place querying of data assets in Amazon S3?

Another way to perform in-place querying of data assets in a data lake built on Amazon S3 is to use Amazon Redshift Spectrum. Amazon Redshift is a large-scale, managed data warehouse service that supports massively parallel processing, and Redshift Spectrum extends it to run SQL queries directly against data stored in Amazon S3.

What is the best way to analyze data in Amazon S3?

One option is Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. You pay only for the queries that you run.
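
As a rough sketch of what running an Athena query looks like from code, the example below submits a SQL statement with boto3, polls until the query finishes, and fetches the result rows. The database name, output location, and helper names are placeholders, and the table is assumed to be defined already in the data catalog.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def athena_request(sql, database, output_location):
    """Build keyword arguments for Athena's `start_query_execution` call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def wait_for_query(get_state, poll_seconds=1.0, max_polls=60, sleep=time.sleep):
    """Poll `get_state()` until the query reaches a terminal state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("query did not finish within the polling budget")


def run_query(sql, database, output_location):
    """Submit a query, wait for it, and return the first page of rows."""
    import boto3  # imported lazily so the helpers above work without the SDK

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        **athena_request(sql, database, output_location)
    )["QueryExecutionId"]

    def current_state():
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        return execution["QueryExecution"]["Status"]["State"]

    state = wait_for_query(current_state)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

A call such as `run_query("SELECT city, COUNT(*) FROM cities GROUP BY city", "datalake", "s3://example-bucket/athena-results/")` would write its results to the given S3 location and return the first page of rows.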