AWS Glue

Cloud Provider: AWS

What is AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service from Amazon Web Services (AWS). It is designed to streamline the process of preparing and combining data for analytics and data processing workflows.

AWS Glue automates the time-consuming steps of data preparation, such as discovering, categorizing, cleaning, enriching, and moving data between data stores. Its Data Catalog stores metadata about your data sources, such as table definitions, schemas, and locations. The catalog gives AWS services a unified view of those sources and simplifies managing data across different AWS services and your data warehouse.

AWS Glue generates ETL code in Python or Scala, which you can customize and store for future use. It is serverless: it automatically provisions the compute resources a job needs, so it scales with the volume of data, whether a job processes a few records or terabytes. Glue integrates with a wide range of AWS services, which makes it suitable for many kinds of ETL tasks across the AWS ecosystem.

In addition to its ETL capabilities, AWS Glue offers AWS Glue DataBrew, a visual tool for cleaning and normalizing data without writing code. DataBrew lets users combine, transform, and categorize data through a visual interface. AWS Glue handles both batch and streaming data, so it covers a range of processing needs, from big data analytics to near-real-time analytics applications.

This comprehensive service simplifies data preparation and integration, facilitating more efficient and effective data analysis and decision-making processes within organizations.

Key AWS Glue Features

ETL Capabilities

Offers both visual and code-based interfaces for data preparation and transformation, with features like crawling, schema inference, format conversion, and dataset joining.

AWS Glue Data Catalog

Acts as a central repository for metadata about data assets, facilitating discovery and query across AWS services like Athena, EMR, and Redshift Spectrum.
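
As a rough illustration of reading that metadata programmatically, the sketch below looks up a table's storage location and columns. It assumes a boto3 Glue client is passed in and uses its `get_table` call; the helper name `describe_table` and the database and table names in the usage comment are hypothetical.

```python
def describe_table(glue_client, database, table):
    """Return (location, [(column_name, column_type), ...]) for a catalog table.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    descriptor = response["Table"]["StorageDescriptor"]
    columns = [(col["Name"], col["Type"]) for col in descriptor["Columns"]]
    return descriptor.get("Location"), columns


# Usage against a real account (requires boto3 and AWS credentials;
# "sales_db" and "orders" are placeholder names):
#   import boto3
#   location, columns = describe_table(boto3.client("glue"), "sales_db", "orders")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub, without touching a real account.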

Serverless Execution

Enables running ETL jobs without provisioning infrastructure, where you pay only for the resources consumed during job execution.

Schema Registry

Manages schema versions and enforces compatibility rules as schemas evolve, helping keep data consistent between producers and consumers in streaming systems such as Apache Kafka and Amazon Kinesis Data Streams.
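
To make the idea of a compatibility rule concrete, here is a deliberately simplified sketch of a backward-compatibility check between two schema versions. It is not the Schema Registry's implementation. It assumes a schema is modeled as a dict of field name to `{"type": ..., "default": ...}`, and it applies one Avro-style rule: a new version may add fields only if they carry a default, and may not change the type of an existing field.

```python
def is_backward_compatible(old_fields, new_fields):
    """Check whether readers using `new_fields` can still read old records.

    Each argument maps a field name to a dict with a "type" key and an
    optional "default" key (a simplified, hypothetical schema model).
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            # A newly added field needs a default to fill in for old records.
            if "default" not in spec:
                return False
        elif old_fields[name]["type"] != spec["type"]:
            # Changing the type of an existing field breaks old records.
            return False
    return True


v1 = {"id": {"type": "long"}, "email": {"type": "string"}}
v2 = {**v1, "country": {"type": "string", "default": "unknown"}}
v3 = {**v1, "country": {"type": "string"}}

print(is_backward_compatible(v1, v2))  # True
print(is_backward_compatible(v1, v3))  # False
```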

Machine Learning for Data Preparation

Applies machine learning to data preparation, for example fuzzy matching to find and deduplicate related records (the FindMatches transform), alongside data quality checks and entity extraction, to make data more ready for analytics.
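
Fuzzy matching can be pictured with a small standard-library sketch. This is a stand-in that uses plain string similarity from `difflib`, not the machine learning model Glue uses; the function names, the `name` field, and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity ratio in [0, 1] after basic normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_likely_duplicates(records, key, threshold=0.85):
    """Return (index_a, index_b, score) for record pairs that look alike."""
    pairs = []
    for (i, left), (j, right) in combinations(enumerate(records), 2):
        score = similarity(left[key], right[key])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


customers = [
    {"name": "Acme Corp."},
    {"name": "ACME Corp"},
    {"name": "Globex Inc"},
]
matches = find_likely_duplicates(customers, "name")
```

Comparing every pair is quadratic in the number of records, so a sketch like this only suits small datasets.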

Flexible Scripting

Automatically generates ETL code in Python or Scala, which users can customize and reuse, streamlining the script development process.
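
Generated Glue scripts commonly rename and cast columns with a list of mapping tuples of the form (source column, source type, target column, target type). The sketch below imitates that step in plain Python so it can run outside Glue. It is an illustration under assumptions, not the Glue library: `apply_mapping` and the `CASTS` table are hypothetical, and only a few target types are handled.

```python
# Hypothetical cast table covering a few common target types.
CASTS = {"string": str, "int": int, "long": int, "double": float}


def apply_mapping(records, mappings):
    """Rename and cast fields; fields without a mapping are dropped."""
    output = []
    for record in records:
        row = {}
        for source, _source_type, target, target_type in mappings:
            value = record.get(source)
            row[target] = None if value is None else CASTS[target_type](value)
        output.append(row)
    return output


raw = [{"order_id": "42", "amount": "19.90", "note": "gift"}]
mappings = [
    ("order_id", "string", "id", "long"),
    ("amount", "string", "total", "double"),
]
print(apply_mapping(raw, mappings))  # [{'id': 42, 'total': 19.9}]
```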

Visual ETL Tool (Glue Studio)

Provides a user-friendly interface for creating, executing, and monitoring ETL jobs, making it accessible to non-developers.

Automatic Schema Recognition

Detects and understands data schemas to facilitate accurate data transformation and transfer.
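
A toy version of schema inference helps show what "detecting a schema" involves. The sketch assumes rows arrive as dicts of strings (as from a CSV file) and widens each column's type from `bigint` to `double` to `string` as values require. The type names and widening order are assumptions for illustration, not the classifier logic Glue applies.

```python
# Types ordered from narrowest to widest.
TYPE_ORDER = ["bigint", "double", "string"]


def infer_type(value):
    """Guess the narrowest type that can represent a raw string value."""
    try:
        int(value)
        return "bigint"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def infer_schema(rows):
    """Return {column: type}, widening a column when a wider value appears."""
    schema = {}
    for row in rows:
        for column, raw in row.items():
            guessed = infer_type(raw)
            current = schema.get(column)
            if current is None or TYPE_ORDER.index(guessed) > TYPE_ORDER.index(current):
                schema[column] = guessed
    return schema


rows = [
    {"id": "1", "price": "9.5", "sku": "A1"},
    {"id": "2", "price": "10", "sku": "B2"},
]
print(infer_schema(rows))  # {'id': 'bigint', 'price': 'double', 'sku': 'string'}
```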

Job Scheduling and Monitoring

Offers tools for setting up job schedules and monitoring their execution, complete with logging and notifications to track ETL process efficiency.
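
One way to monitor a run from code is to start the job and poll its state. The sketch below assumes a boto3 Glue client is passed in and uses its `start_job_run` and `get_job_run` calls; the helper name, the 30-second polling interval, and the job name in the usage comment are assumptions.

```python
import time

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue_client, job_name, arguments=None,
                     poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it reaches a terminal state.

    Returns (run_id, final_state). `sleep` is injectable so the loop can be
    exercised without real waiting.
    """
    started = glue_client.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = started["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)


# Usage against a real account (requires boto3 and AWS credentials;
# "nightly-orders-etl" is a placeholder job name):
#   import boto3
#   run_id, state = run_job_and_wait(boto3.client("glue"), "nightly-orders-etl")
```

For production use, event-driven notifications are usually preferable to a polling loop like this one.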

AWS Glue Use Cases

Data Integration and Analytics

AWS Glue streamlines the integration of data sources such as databases, data lakes, and data warehouses into a unified view for analytics, and supports BI initiatives by enabling comprehensive reporting and dashboard creation.

ETL/ELT Workflows

Supports building and running ETL (extract, transform, load) and ELT (extract, load, transform) workflows through both visual and code-based interfaces, aiding in data cleaning, normalization, and preparation.

Data Lake Management and Analytics

Glue aids in forming and analyzing data lakes on Amazon S3, working seamlessly with services like Athena, EMR, and Redshift Spectrum to analyze large datasets.

Machine Learning Preparation

Provides tools for schema inference, data quality checks, and machine learning transformations, thus readying data for machine learning applications.

Application Development Support

Simplifies data access via its data catalog, enabling applications to easily utilize analytics-ready data.

Batch and Real-time Data Processing

Supports both batch processing and real-time data integration, accommodating streaming sources like Apache Kafka and Amazon Kinesis for continuous data analysis.

Data Warehouse Modernization

Helps migrate data from legacy systems to cloud-based targets such as Amazon Redshift, improving query and analysis capabilities.

Log and Stream Analytics

Processes and transforms log data and streams from various sources for real-time analytics, aiding in monitoring, troubleshooting, and insight extraction.

Services AWS Glue integrates with

Amazon Athena

AWS Glue can catalog data that can be queried directly with Amazon Athena.

Amazon Kinesis

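AWS Glue streaming ETL jobs can read from and write to Amazon Kinesis Data Streams. Amazon Data Firehose (formerly Kinesis Data Firehose) can use table schemas stored in the Glue Data Catalog when converting record formats.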

AWS Lake Formation

AWS Glue Data Catalog is used by AWS Lake Formation to manage and secure data lakes.

Amazon Redshift

AWS Glue can load data into and extract data from Amazon Redshift data warehouses.

Amazon DynamoDB

AWS Glue can connect to and perform ETL operations on data from Amazon DynamoDB tables.

Amazon RDS

AWS Glue can connect to and perform ETL operations on data stored in Amazon RDS databases.

Amazon S3

AWS Glue can extract, transform, and load (ETL) data to and from Amazon S3 buckets.

AWS Glue pricing models

AWS Glue measures the resources consumed by ETL jobs and development endpoints in data processing units (DPUs). A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The cost is based on the number of DPUs consumed by your jobs and endpoints.

Glue Data Catalog

The Glue Data Catalog charges for storing metadata and for requests made against it. Storage is billed monthly based on the number of objects stored, such as table definitions, and requests that access, create, update, or delete those objects are billed separately. Both have a free tier.

Glue DataBrew

DataBrew is a feature of AWS Glue that allows you to clean and normalize data without writing code. Pricing is based on the number of DataBrew nodes (virtual servers) used and the amount of time the jobs run.

Glue ETL jobs

You are charged an hourly rate per DPU for the DPUs your ETL jobs use. Job duration is measured in one-second increments.
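
The arithmetic can be sketched as DPUs multiplied by billed hours multiplied by the hourly rate. In the example below, the rate of 0.44 per DPU-hour and the 60-second minimum billed duration are illustrative assumptions, not quoted prices; substitute the current figures for your Region and Glue version.

```python
def etl_job_cost(dpus, runtime_seconds, rate_per_dpu_hour, minimum_seconds=60):
    """Estimate the cost of one job run.

    `rate_per_dpu_hour` and `minimum_seconds` are caller-supplied assumptions.
    Runtime is billed per second, but never below the minimum duration.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# A 10-DPU job that runs for 15 minutes at an assumed rate of 0.44 per DPU-hour:
print(round(etl_job_cost(10, 15 * 60, 0.44), 2))  # 1.1
```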

Glue Interactive Sessions

Interactive sessions are billed at an hourly rate per DPU for the time a session is active. Each session requires a minimum of 2 DPUs.

Glue Studio

Data previews in AWS Glue Studio incur a fixed charge of 2 DPUs for 30 minutes. You also pay for the underlying AWS services you use, such as storage and databases.