AWS is one of the largest cloud providers in the world, offering a wide range of storage and compute services. AWS S3 is one of the most popular of them: it provides very high durability and availability of data while also being one of the cheapest storage options in the cloud. Because its capacity is practically unlimited, you can easily end up with terabytes or even petabytes of data in S3 buckets. Analyzing that much data by opening and reading every file manually is practically impossible.
This is where the AWS Athena service comes in. Put simply, AWS Athena is a data analysis service that lets you query the data in your S3 buckets using SQL. If you understand even the basics of SQL, you can start analyzing S3 data with AWS Athena.
Consider a short example. Suppose you have configured one of your buckets as the access log bucket for all the load balancers across multiple accounts in your organization. How would you query years of log data and extract meaningful insights from those log files? The answer is AWS Athena.
Features of AWS Athena
SQL-based tool: AWS Athena is a simple, SQL-based service. You point Athena at one of your buckets, define the schema of your data, and then run SQL queries against the data in that bucket.
Serverless: You do not have to maintain any infrastructure to run AWS Athena. Athena is serverless and automatically uses as many compute resources as your queries require.
Fast and optimized: Athena is optimized to allocate resources efficiently and return your query results as quickly as possible. It works well for both simple and complex analyses of S3 data.
Cost-effective: Athena is a pay-as-you-use service. There is no base cost for using AWS Athena; you only pay for the queries you run.
Durability and availability of the data: Since Athena relies on the data in your S3 buckets, you can rest assured that the data is highly available and durable.
File format support: Athena supports several file formats such as JSON, CSV, Avro, ORC, and more.
Security: Athena relies on security features such as IAM, bucket policies, and ACLs to control access to your data.
Athena backend: Athena uses the open-source Presto engine as its backend. Presto is a distributed SQL engine for querying and analyzing big data workloads.
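The "point at a bucket, define the schema, run SQL" workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the bucket names, database, table, and columns are hypothetical, and the code just builds the SQL text and the request parameters. Submitting them would require the boto3 Athena client and AWS credentials, which are left out here.

```python
# Hypothetical names throughout: the buckets, database, table, and columns
# are illustrative assumptions, not taken from a real account.
DATA_LOCATION = "s3://example-log-bucket/requests/"
RESULTS_LOCATION = "s3://example-athena-results/"
DATABASE = "example_logs_db"


def create_table_ddl(table: str, location: str) -> str:
    """Build DDL that defines a schema over CSV files already stored in S3."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  request_time string,\n"
        "  client_ip string,\n"
        "  status_code int\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


def status_count_query(table: str) -> str:
    """Build an ordinary SQL query that counts requests per status code."""
    return (
        "SELECT status_code, COUNT(*) AS requests\n"
        f"FROM {table}\n"
        "GROUP BY status_code\n"
        "ORDER BY requests DESC"
    )


def query_request(sql: str) -> dict:
    """Keyword arguments shaped for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": DATABASE},
        "ResultConfiguration": {"OutputLocation": RESULTS_LOCATION},
    }


ddl = create_table_ddl("requests", DATA_LOCATION)
request = query_request(status_count_query("requests"))
print(ddl)
print(request["QueryString"])
# With credentials configured, the request could be submitted with:
#   boto3.client("athena").start_query_execution(**request)
```

The data itself never moves: the DDL only records where the files live and how to interpret them, and the query is plain SQL.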
Pricing and optimization of AWS Athena
AWS Athena charges $5 per terabyte of data scanned by your queries. This price may vary slightly between AWS regions.
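To make the pricing concrete, here is a minimal sketch of the arithmetic. It assumes the $5 per terabyte rate quoted above and treats a terabyte as 1024^4 bytes. Both are parameters you can change, and the function ignores any rounding or minimum charge that the actual billing may apply.

```python
def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = 5.0,
                        bytes_per_tb: int = 1024 ** 4) -> float:
    """Estimated charge in US dollars for a query that scans `bytes_scanned`.

    Assumptions: flat per-terabyte rate, 1 TB = 1024**4 bytes, no rounding
    and no minimum charge.
    """
    return bytes_scanned / bytes_per_tb * price_per_tb


half_terabyte = 512 * 1024 ** 3  # 512 GB expressed in bytes
print(estimate_query_cost(half_terabyte))  # 2.5
```

Since the charge scales with the bytes scanned, anything that lets a query read less data lowers the bill proportionally.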
Because you pay for the data each query scans, optimizing your queries and your data lowers both cost and running time:
Efficient queries: If you are familiar with SQL, you know that there is often more than one way to get the same result from your data. Choose the query that scans less data and takes less time to run.
Data transformation: To optimize further, you can compress or partition your data, or convert it to a more compact format, so that each query reads a smaller dataset. Transforming your data this way can reduce the data scanned, and with it the cost, by up to 90%.
Joining virtual tables: Joining tables is a very important feature of SQL. While it may look like a simple operation, a join can be very expensive to execute. It is recommended to keep larger tables on the left side of the join and smaller tables on the right.
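The join-ordering recommendation can be illustrated with a small helper that lists the largest table first and the smaller ones to its right. This is a sketch rather than an Athena feature: the table names, row counts, and the shared `account_id` column are hypothetical.

```python
def build_join(table_sizes: dict[str, int], join_column: str) -> str:
    """Join all tables on `join_column`, placing the largest table leftmost."""
    if not table_sizes:
        raise ValueError("at least one table is required")
    # Sort table names by size, largest first.
    ordered = sorted(table_sizes, key=table_sizes.get, reverse=True)
    first, *rest = ordered
    lines = ["SELECT *", f"FROM {first}"]
    for name in rest:
        lines.append(
            f"JOIN {name} ON {first}.{join_column} = {name}.{join_column}"
        )
    return "\n".join(lines)


# Hypothetical tables with approximate row counts.
sizes = {"account_tags": 2_000, "requests": 900_000_000, "accounts": 50_000}
sql = build_join(sizes, "account_id")
print(sql)
```

Here `requests` ends up in the FROM clause and the two small lookup tables follow on the right, matching the ordering suggested above.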
Difference between AWS Athena and Redshift Spectrum
Redshift Spectrum is another service that can be used to run queries on AWS S3 buckets. Both Redshift Spectrum and Athena query data directly in S3 without you managing query servers for it, both can run complex queries, and both are priced at $5 per terabyte of data processed. So what is the difference?
Performance
AWS Athena uses computational resources from a shared pool provided by AWS. In contrast, Redshift Spectrum uses resources allocated according to the size of your Redshift cluster. This gives you more control over the resources used by Redshift Spectrum: if you want better performance, you can always increase the size of your Redshift cluster.
Loading the data for processing
Both services use virtual tables to run SQL queries on your data. Virtual tables are defined using the Glue Data Catalog for schema management. Athena can use the schema from the Glue Data Catalog directly, whereas with Redshift Spectrum you need to configure external tables based on the Glue Data Catalog schema.
These are the main differences to keep in mind when choosing between Redshift Spectrum and Athena. Use Redshift Spectrum if you want to query data in S3 together with data stored in the Redshift data warehouse, or if you are willing to pay more to improve the performance of your S3 queries. Athena is the better fit when all of your data lives in S3 buckets.
Difference between AWS Athena and S3 Select
S3 Select is another serverless service from AWS for querying data in S3 using SQL. The main difference between S3 Select and Athena is that S3 Select only supports SQL SELECT queries, whereas Athena supports a much broader range of SQL. Another limitation of S3 Select is that it can only run a SELECT against one object at a time. So, if you only need to pull data, or a subset of data, from a single S3 object, use S3 Select. For complex queries, operations such as JOIN, or processing the data in an entire S3 bucket, use AWS Athena.
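The guidance from the two comparisons above can be condensed into a small decision helper. It is a rough sketch of the rules of thumb given in this article, not an official selection guide, and the function and parameter names are made up for illustration.

```python
def suggest_service(object_count: int,
                    needs_join: bool = False,
                    combines_redshift_data: bool = False) -> str:
    """Suggest a query service using the rules of thumb from this article."""
    if combines_redshift_data:
        # Querying S3 together with data already in a Redshift warehouse.
        return "Redshift Spectrum"
    if object_count == 1 and not needs_join:
        # Pulling a subset of a single object with a plain SELECT.
        return "S3 Select"
    # Joins, or queries that span many objects or a whole bucket.
    return "Athena"


print(suggest_service(1))                                     # S3 Select
print(suggest_service(25_000, needs_join=True))               # Athena
print(suggest_service(25_000, combines_redshift_data=True))   # Redshift Spectrum
```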
Advantages of using AWS Athena
Athena eliminates the need to develop a complex and expensive data analysis tool for your data.
Athena is serverless, which makes it a fairly easy-to-use service. There is no infrastructure for you to maintain.
AWS has optimized Athena so that query results are often available within seconds of running a query.
Since Athena is serverless, there is no fixed charge for the service itself. You only pay for the queries you choose to run. Even if you cancel a query, you are only charged for the data processed up to that point and not for the whole query.
Athena integrates easily with other AWS services. One of the most valuable integrations is with the AWS Glue service. AWS Glue is an ETL service that can transform data into a more efficient and readable form, which can then be analyzed with AWS Athena.
Athena allows you to run multiple queries simultaneously.
Limitations of AWS Athena
Row size: The row size in a virtual AWS Athena table should not exceed 32 megabytes. In very limited cases this limit can be raised to as much as 100 megabytes for CSV and JSON files, but it is highly recommended to keep rows under 32 megabytes to avoid unexpected errors.
Hidden files: Files with names starting with an underscore (_) or a dot (.) are treated as hidden by the Athena service. You can use this as a feature to keep unwanted files from being processed.
Archived data: Athena cannot process data in the S3 Glacier or S3 Glacier Deep Archive storage classes. These classes are intended for archiving and have retrieval times ranging from minutes to hours, so it is understandable that AWS Athena cannot read data from them.
Stored procedures: Athena does not support stored procedures.
Parameterized queries: Athena engine version 1 does not support parameterized queries. They are supported in Athena engine version 2.
Unsupported statements: Statements like MERGE, UPDATE, CREATE TABLE LIKE, DESCRIBE INPUT, and DESCRIBE OUTPUT are not supported.
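The hidden-file and storage-class rules above can be checked before you point Athena at a prefix. The sketch below filters a listing of objects down to the ones Athena would be expected to read. The sample keys are hypothetical, the dictionaries mimic the `Key` and `StorageClass` fields of an S3 object listing, and the hidden-file check looks only at the final path component, which is an assumption about how the rule is applied.

```python
import posixpath

# Storage classes the article lists as unreadable by Athena.
UNREADABLE_STORAGE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def is_hidden(key: str) -> bool:
    """True if the file name (last path component) starts with '_' or '.'."""
    name = posixpath.basename(key)
    return name.startswith(("_", "."))


def readable_by_athena(obj: dict) -> bool:
    """True if the object is neither hidden nor in an archive storage class."""
    storage_class = obj.get("StorageClass", "STANDARD")
    return not is_hidden(obj["Key"]) and storage_class not in UNREADABLE_STORAGE_CLASSES


# Hypothetical listing entries.
objects = [
    {"Key": "logs/2023/part-0001.csv", "StorageClass": "STANDARD"},
    {"Key": "logs/2023/_SUCCESS", "StorageClass": "STANDARD"},
    {"Key": "logs/2023/.part-0002.csv.crc", "StorageClass": "STANDARD"},
    {"Key": "logs/2019/part-0001.csv", "StorageClass": "GLACIER"},
]

readable = [o["Key"] for o in objects if readable_by_athena(o)]
print(readable)  # ['logs/2023/part-0001.csv']
```

A check like this makes it easier to explain why a query returns fewer rows than the number of files in the bucket would suggest.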
Conclusion
In this article, we have discussed AWS Athena, the data analysis tool from AWS, along with its features, advantages, and some of its limitations. Athena is one of the most powerful tools for processing and analyzing data in S3 buckets. Even the limitations of the service are fairly simple and can be worked around if needed. You may also look at some best practices to secure AWS S3 Storage.