File Name: Querying_data_in_place_with_Amazon_S3_and_analytics_tools_STG402.pdf
File Size: 959.70 KB
File Type: Application/pdf
Last Modified: 2 years
STG402
Querying data in place with Amazon S3 and analytics tools
John Mallory
Storage Business Development
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How it works: Data lakes and analytics on AWS

Build data lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats

Simplify security management
• Enforce encryption
• Define access policies
• Implement audit login

Enable self-service and combined analytics
• Analysts discover all data available for analysis from a single data catalog
• Use multiple analytics tools over the same data

[Diagram labels: OLTP, ERP, CRM, LOB, Devices, Sensors, Web, Social, Amazon S3, IAM, KMS, AWS Glue, Data Catalog, Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Kinesis Data Streams, Amazon QuickSight, AI Services, Amazon FSx for Lustre]

Processing and querying in place

Focus: Reduced Time to Business Outcomes
User-Defined Functions / Fully Managed Process & Query
Services shown: AWS Glue, Amazon Athena, Amazon Redshift, Amazon SageMaker, AWS Lambda
• Bring your own functions & code
• Catalog, transform & query data in Amazon S3
• Execute without provisioning servers
• No physical instances to manage
Focus on Agility and Extracting Data Value

Process and query data in place on Amazon S3

Services shown: Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker, AWS Glue, Amazon S3
Today: All of these tools… retrieve a lot of data they don’t need from Amazon S3 and then scan and filter objects

Amazon S3 Select changes the equation

• Select a subset of an object’s data with a SQL expression
• Amazon S3 scans and filters the object and returns matching results

Amazon S3 Select

Easy to use / Integrated / Open
• Standard SQL expression
• Works just like a GET request
• AWS SDK, adopted for many AWS services and by many APN Partners

Amazon S3 Select usage

    SELECT s.country, s.city from S3Object s where s.city = 'Seattle'

• Operates within the Amazon S3 system
• SQL statement operates on a per-object basis
• Returns SQL query filtered results
• Supports CSV, JSON, and Parquet formats
• Integrated with Spark, Hive, and Presto on Amazon EMR
• NEW! Scan Range Selects: Up to 10x performance boost for large objects

    SELECT *
    FROM s3object s
    WHERE s.Category = 'ASSAULT'
      and s.PdDistrict = 'MISSION'
      and s."Date" BETWEEN '2010-12-31' AND '2012-01-01'

Amazon S3 Select: Serverless applications

[Diagram labels: Amazon S3, Lambda Trigger, AWS Lambda, Amazon S3 Select, Amazon SNS]

Set up a catalog, ETL, and data prep with AWS Glue

• Serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark
• Pay only for the resources used for jobs
• Crawl your data sources, identify data formats, and suggest schemas and transformations
• Automates the effort in building, maintaining, and running ETL jobs

Event-driven batch ingest pipeline

Let Amazon CloudWatch Events and AWS Lambda drive the pipeline

[Diagram: pipeline timeline. Stages: New raw data arrives, Crawl raw dataset, Run “optimize” job, Crawl optimized dataset, Ready for reporting, SLA deadline. Times shown: < 22:00 UTC, 02:00 UTC. Actions: Start crawler, Start job or trigger, Start crawler, Reporting dataset ready. Events: Data arrives in Amazon S3, Crawler succeeds, Job succeeds.]
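The first step of the event-driven batch ingest pipeline (data arrives in Amazon S3, then a crawler starts) might be sketched as the Lambda function below. This is a minimal sketch, not the presenter's implementation: it assumes the function is triggered by an S3 event notification, and the crawler name `raw-dataset-crawler` is hypothetical. `start_crawler` is the boto3 AWS Glue client call; the helper names are illustrative.

```python
from typing import Any, Dict, List

# Hypothetical crawler name; replace with a crawler defined in your account.
RAW_CRAWLER_NAME = "raw-dataset-crawler"


def new_object_keys(event: Dict[str, Any]) -> List[str]:
    """Extract object keys from an S3 event notification payload."""
    return [
        record["s3"]["object"]["key"]
        for record in event.get("Records", [])
        if "s3" in record
    ]


def handle_event(event: Dict[str, Any], glue_client: Any) -> Dict[str, Any]:
    """Start the raw-data crawler when the event reports new objects."""
    keys = new_object_keys(event)
    if not keys:
        return {"started": False, "keys": []}
    # Glue rejects this call if the crawler is already running; a production
    # handler would need to catch that error. It is left out of this sketch.
    glue_client.start_crawler(Name=RAW_CRAWLER_NAME)
    return {"started": True, "keys": keys}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """AWS Lambda entry point (assumes boto3 is available in the runtime)."""
    import boto3  # imported lazily so the helpers above can be tested without it

    return handle_event(event, boto3.client("glue"))
```

Passing the Glue client into `handle_event` keeps the trigger logic testable without AWS credentials. The later stages (run the job when the crawler succeeds, crawl again when the job succeeds) would follow the same pattern with different events and calls.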
Amazon S3 Select and Amazon S3 Glacier Select enable customers to run structured query language (SQL) queries directly on data stored in Amazon S3 and Amazon S3 Glacier. With S3 Select, you store your data in Amazon S3 and use SQL statements to filter the contents of S3 objects, retrieving only the data that you need.
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency of retrieving this data.
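As an illustration of the filtering described above, the following sketch issues an S3 Select query through boto3's `select_object_content` call and collects the returned rows. It assumes the object is an uncompressed CSV file with a header row; the helper names are illustrative, and the bucket and key in the usage note are hypothetical. The optional `scan_range` argument maps to the request's `ScanRange` parameter, which restricts the query to a byte range of the object.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def build_select_request(
    bucket: str,
    key: str,
    expression: str,
    scan_range: Optional[Tuple[int, int]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the S3 SelectObjectContent call."""
    request: Dict[str, Any] = {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: uncompressed CSV whose first row holds column names.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "NONE",
        },
        "OutputSerialization": {"CSV": {}},
    }
    if scan_range is not None:
        start, end = scan_range
        request["ScanRange"] = {"Start": start, "End": end}
    return request


def collect_records(event_stream: Iterable[Dict[str, Any]]) -> str:
    """Join the Records payloads of a select response into one string."""
    chunks = [
        event["Records"]["Payload"]
        for event in event_stream
        if "Records" in event
    ]
    # Join the bytes before decoding so multi-byte characters split across
    # events are decoded correctly.
    return b"".join(chunks).decode("utf-8")


def select_rows(client: Any, bucket: str, key: str, expression: str) -> str:
    """Run the query with a boto3 S3 client and return the matching rows."""
    response = client.select_object_content(
        **build_select_request(bucket, key, expression)
    )
    return collect_records(response["Payload"])
```

With a real client, a call might look like `select_rows(boto3.client("s3"), "example-bucket", "cities.csv", "SELECT s.country, s.city FROM S3Object s WHERE s.city = 'Seattle'")`. Only the matching rows cross the network, which is the source of the cost and latency savings.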
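Another way to perform in-place querying of data assets in a data lake built on Amazon S3 is to use Amazon Redshift Spectrum. Amazon Redshift is a large-scale, managed data warehouse service that supports massively parallel processing (MPP); Redshift Spectrum lets it run SQL queries against data stored in Amazon S3 without first loading that data into the warehouse.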
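One way to set this up is to register an external schema in Amazon Redshift that points at a database in the AWS Glue Data Catalog. The sketch below builds that `CREATE EXTERNAL SCHEMA` statement and submits it through the Redshift Data API (`execute_statement` on the boto3 `redshift-data` client). The helper names are illustrative, and the schema, database, role ARN, cluster, and user values a caller supplies are placeholders. It assumes all inputs are trusted, because they are interpolated directly into the SQL text.

```python
from typing import Any


def external_schema_sql(schema: str, catalog_database: str, iam_role_arn: str) -> str:
    """Build the DDL that maps a Glue Data Catalog database into Redshift."""
    # Inputs are interpolated as-is; only pass trusted values.
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_database}' "
        f"IAM_ROLE '{iam_role_arn}'"
    )


def submit_sql(client: Any, cluster_id: str, database: str, db_user: str, sql: str) -> str:
    """Submit a statement with a boto3 'redshift-data' client; return its ID."""
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]
```

Once the external schema exists, tables in the catalog database can be queried from Redshift with ordinary `SELECT` statements while the data stays in Amazon S3.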
In this case, we need to use Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. You pay only for the queries that you run.
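A minimal sketch of running such a query from code, assuming a boto3 Athena client and an existing database and results location: `start_query_execution` and `get_query_execution` are the boto3 calls, while the function name and polling interval are illustrative choices.

```python
import time
from typing import Any, Tuple

# States after which Athena will not change the query status again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(
    client: Any,
    sql: str,
    database: str,
    output_location: str,
    poll_seconds: float = 1.0,
) -> Tuple[str, str]:
    """Start a query and poll until it finishes; return (query ID, final state)."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
```

A call such as `run_athena_query(boto3.client("athena"), "SELECT ...", "example_db", "s3://example-bucket/results/")` (hypothetical names) returns once the query reaches a terminal state; the rows can then be read from the output location or fetched with `get_query_results`.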