Mr. Editor-in-chief | April 21, 2023 | Updated April 24, 2026

AWS Certified Solutions Architect – Associate Certification Quiz 5

1. What is the main value proposition of data lakes?
  • The ability to define the data schema before ingesting and storing data.
  • The ability to ingest and store data that could be the answer for future questions when they are processed with the correct data processing mechanisms.
  • The ability to combine multiple databases together to expand their capacity and availability.
  • The ability to store user-generated data, such as data from antennas and sensors.
    Note: A data lake is a centralized repository that stores data as-is, without needing to first structure it, and supports running different types of analytics. The ingested data can later be processed and visualized for specific needs.
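The schema-on-read idea behind this answer can be illustrated with a small, hypothetical Python sketch: raw records are stored exactly as they arrive, and a schema is applied only when a specific question is asked later.

```python
import json

# Ingest raw, heterogeneous records as-is: no structure is imposed
# at write time (schema-on-read).
raw_store = [
    json.dumps({"sensor": "a1", "temp_c": 21.5}),
    json.dumps({"user": "bob", "clicked": "checkout"}),
]

def query_temperatures(store):
    """Apply a schema only at read time, for this specific question."""
    return [json.loads(r)["temp_c"] for r in store
            if "temp_c" in json.loads(r)]

print(query_temperatures(raw_store))  # [21.5]
```

The clickstream record is ignored by this query but remains stored, ready to answer a different future question.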
2. True or False: Two of the fundamental components of data lakes are data catalog and search.
  • True
  • False
    Note: Mature data lakes provide efficient data cataloging (otherwise known as indexing) and searching mechanisms to quickly discover what data is stored, and where.
3. A company sorts and structures data before entering information into a database. They also store unstructured data in another storage location. These two data locations are siloed from each other. How can the company benefit from using a data lake?
  • Data lakes mostly process data after it has been stored in the cloud or on-premises.
  • A data lake provides the most secure way to store data in the AWS Cloud.
  • With a data lake, a company can store structured and unstructured data at virtually any scale.
  • A data lake is a direct replacement of a data warehouse.
    Note: Companies can store data as-is, without needing to first structure the data. After analyzing raw data, companies can identify and act upon opportunities for business growth more quickly by attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions. For more information, see the Data Lake Characteristics and Components reading.
4. Which statements about data lakes and data warehouses are true? (Choose TWO.)
  • Data lakes use schema-on-write architectures and data warehouses use schema-on-read architectures.
  • Data lakes offer more choices in terms of the technology that is used for processing data. In contrast, data warehouses are more restricted to using Structured Query Language (SQL) as the query technology.
  • The solutions architect can combine both data lakes and data warehouses to better extract insights and turn data into information.
  • The solutions architect cannot attach data visualization tools to data warehouses.
  • Data lakes are not future-proof, which means that they must be reconfigured each time new data is ingested.
5. True or False: Data lakes integrate with analytics tools that can help companies eliminate costly and complex extract, transform, and load (ETL) processes.
  • True
  • False
    Note: The breadth and depth of analytics services on AWS make it easier to provision the appropriate resources to run whatever analysis is most appropriate for a specific need.
6. Which term indicates that a data lake lacks curation, management, cataloging, lifecycle or retention policies, and metadata?
  • Data swamp
  • Data warehouse
  • Data catalog
  • Database
    Note: Data swamp is an informal term that represents a data lake with disorganized data.
7. Which service can be used to run simple queries against data in a data lake?
  • Amazon Kinesis Agent
  • Amazon Simple Storage Service (Amazon S3)
  • Amazon Kinesis Data Firehose
  • Amazon Athena
    Note: After the dataset reaches Amazon Simple Storage Service (Amazon S3), Structured Query Language (SQL) queries can be run in Amazon Athena to gain insights on the data.
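Athena runs standard SQL against data in Amazon S3. As a local stand-in, this sketch runs a similar aggregate query with sqlite3; the table and column names are hypothetical.

```python
import sqlite3

# Local illustration of the kind of standard SQL Athena would run
# against a dataset in S3 (table and column names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (borough TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("Queens", 12.5), ("Bronx", 8.0), ("Queens", 20.0)])

query = ("SELECT borough, AVG(fare) FROM trips "
         "GROUP BY borough ORDER BY borough")
print(conn.execute(query).fetchall())  # [('Bronx', 8.0), ('Queens', 16.25)]
```

With Athena the same query text would be submitted against the S3-backed table, with no servers to manage.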
8. True or False: AWS Lake Formation is a centralized repository, such as a data lake, that stores structured and unstructured data at any scale.
  • True
  • False
    Note: AWS Lake Formation is a service that automates many of the manual steps that are needed to create data lakes. The steps include collecting, cleansing, moving, and cataloging data. The steps also include securely making that data available for deriving insights.
9. What is the structure of the AWS Glue Metadata Catalog?
  • The AWS Glue Metadata Catalog is the storage that is associated with automated database backups and any active database snapshots. It consists of the General Purpose SSD, Provisioned IOPS SSD, Throughput Optimized HDD, and Cold HDD volume types.
  • The AWS Glue Metadata Catalog consists of tables. Each table has a schema, which outlines the structure of a table, including columns, data type definitions, and more. The tables are organized into logical groups that are called databases.
  • The AWS Glue Metadata Catalog contains buckets with different types of storage options. AWS Glue Metadata Catalog stores data as objects in these buckets.
  • The AWS Glue Metadata Catalog consists of file systems or databases for any applications that require fine, granular updates and access to raw, unformatted, block-level storage.
    Note: The AWS Glue Metadata Catalog is a central repository that contains a collection of tables that are organized into databases. A table in the AWS Glue Data Catalog consists of the names of columns, data type definitions, partition information, and other metadata about a base dataset.
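The databases-contain-tables-contain-schemas structure described above can be sketched as a nested mapping; all names here are hypothetical.

```python
# Rough sketch of how the AWS Glue Data Catalog is organized:
# databases group tables, and each table holds a schema plus other
# metadata about a base dataset (all names are hypothetical).
catalog = {
    "sales_db": {                       # logical database
        "orders": {                     # table
            "columns": [
                {"name": "order_id", "type": "string"},
                {"name": "amount",   "type": "double"},
            ],
            "partition_keys": [{"name": "order_date", "type": "date"}],
            "location": "s3://example-bucket/orders/",  # base dataset
        }
    }
}

schema = [c["name"] for c in catalog["sales_db"]["orders"]["columns"]]
print(schema)  # ['order_id', 'amount']
```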
10. True or False: Customers can use Amazon API Gateway to ingest real-time data in a RESTful manner through the creation of an HTTP-based API, which acts as the front door or interface to ingestion logic or data lake storage on the backend.
  • True
  • False
    Note: Amazon API Gateway is a service that customers can use to host APIs that act as the front door or an interface to a backend. The backend could be an application running on an Amazon Elastic Compute Cloud (Amazon EC2) instance, an AWS Lambda function, or even another AWS service, such as Amazon Kinesis.
11. Which statement about AWS Lake Formation is true?
  • AWS Lake Formation registers the Amazon Simple Storage Service (Amazon S3) buckets and paths where the data lake will reside.
  • AWS Lake Formation deploys, operates, and scales clusters in the AWS Cloud.
  • AWS Lake Formation ingests, cleanses, and transforms the structured and organized data.
  • AWS Lake Formation runs big data frameworks, such as Apache Hadoop.
    Note: AWS Lake Formation makes it easier for customers to build, secure, and manage data lakes.
12. Which service is commonly used for real-time data processing when Amazon Kinesis Data Streams is used for data ingestion?
  • AWS Glue Jobs
  • Amazon Kinesis Data Analytics
  • Amazon EMR
  • Amazon Athena
    Note: Amazon Kinesis Data Analytics processes data streams and generates real-time dashboards. For more information, see the AWS Services for Analytics video.
13. Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets. A solutions architect is working for a company that currently uses Apache Hadoop on-premises for data processing jobs. The company wants to use AWS for these jobs, but they also want to continue using the same technology. Which service should the solutions architect choose for this use case?
  • AWS Lambda
  • Amazon OpenSearch Service
  • Amazon Kinesis Data Analytics
  • Amazon EMR
    Note: Amazon EMR is a managed cluster platform that simplifies running big data frameworks (such as Apache Hadoop and Apache Spark) on AWS to process and analyze large amounts of data.
14. A team of machine learning (ML) experts are working for a company. The company wants to use the data in their data lake to train an ML model that they create. The company wants the most control that they can have over this model and the environment that it is trained in. Which AWS ML approach should the team take?
  • Create an AWS Lambda function with the training logic in the handler, and run the training based on an event.
  • Use a pretrained model from an AWS service, such as Amazon Rekognition.
  • Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance by using an AWS Deep Learning Amazon Machine Image (AMI) to host the application that will train the model.
  • Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance and run Amazon SageMaker on it to train the model.
    Note: The team of ML experts will probably use EC2 instances for their compute power on AWS. They can launch an EC2 instance with an AWS Deep Learning AMI, which provides the greatest control over building and managing deep learning models and clusters.
15. A solutions architect needs to process and analyze data as it is ingested into a data lake in real time. They want to get timely insights about the streaming data. Which service should the solutions architect use for this use case?
  • Amazon API Gateway
  • Amazon EMR
  • Amazon Kinesis
  • AWS Lambda
    Note: With Amazon Kinesis, the solutions architect can collect, process, and analyze real-time, streaming data to get timely insights and react quickly to new information. For more information, see the Data Movement reading.
16. Which services can query data that is needed to build reports? (Choose TWO.)
  • Amazon Athena
  • AWS Lambda
  • Amazon Glue
  • Amazon Redshift
  • Amazon Elastic Compute Cloud
17. True or False: Amazon Kinesis processes and analyzes data as it arrives, and there is no need to wait until all data is collected before the processing can begin.
  • True
  • False
    Note: Amazon Kinesis collects, processes, and analyzes real-time data—such as video, audio, application logs, website clickstreams, and Internet of Things (IoT) telemetry data—for machine learning, analytics, and other applications. For more information, see the Diving Deep on Amazon Kinesis reading.
18. A company wants to transfer files into and out of Amazon Simple Storage Service (Amazon S3) storage by using AWS Transfer Family. Several users in the company need permissions to access a specific object storage bucket that hosts the files from AWS Transfer Family. What is the BEST way to provide the needed bucket-access permissions to these users?
  • AWS account root user
  • AWS Identity and Access Management (IAM) user
  • AWS Identity and Access Management (IAM) role
  • Access keys
    Note: An IAM user can assume a role to temporarily take on different permissions for a specific task. An IAM role does not have any credentials (password or access keys) that are associated with it. Instead of being uniquely associated with one person, a role can be assumed by anyone who needs it. For more information, see the Batch Data Ingestion with AWS Transfer Family video.
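A minimal sketch of the identity-based policy such a role might carry so that anyone who assumes it can reach the Transfer Family bucket; the bucket name and ARNs are hypothetical.

```python
# Hypothetical permissions policy attached to an IAM role. The role
# itself has no long-term credentials; these permissions apply only
# while the role is assumed.
bucket_access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-transfer-bucket",
            "arn:aws:s3:::example-transfer-bucket/*",
        ],
    }],
}

actions = bucket_access_policy["Statement"][0]["Action"]
print(actions)  # ['s3:GetObject', 's3:PutObject', 's3:ListBucket']
```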
19. What is the most common way of categorizing data in terms of structure?
  • Structured data, unstructured data, and semi-structured data
  • Ready data, not-ready data, and semi-ready data
  • The good data, the bad data, and the ugly data
  • Development data, quality assurance (QA) data, and production data
    Note: Data in structured and semi-structured formats has some consistency that makes it easier for computer systems to consume it without further modification. In contrast, unstructured data contains content that does not have a predefined data model.
20. A solutions architect is tasked with transporting data from their organization to AWS by using the AWS Snow Family. The organization wants to transport data that needs less than 8 terabytes (TB) of usable storage. Which Snow Family device should the solutions architect recommend for this use case?
  • AWS Snowcone
  • AWS Snowball
  • AWS Snowmobile
  • AWS Glue
    Note: AWS Snowcone is the most compact and portable device in the AWS Snow Family. It is designed to move up to 8 TB of data. For more information, see the Batch Data Ingestion with AWS Services video.
21. Which statement about the AWS Glue crawler is TRUE?
  • An AWS Glue crawler can scan a data store, such as an Amazon Simple Storage Service (Amazon S3) bucket, and use the data from the data store to create or update tables in the AWS Glue Data Catalog.
  • An AWS Glue crawler collects and catalogs data from databases and object storage, moves the data into a new Amazon Simple Storage Service (Amazon S3) data lake, and classifies the data by using machine learning algorithms.
  • An AWS Glue crawler runs Structured Query Language (SQL) queries to analyze data directly in Amazon Simple Storage Service (Amazon S3).
  • An AWS Glue crawler performs interactive log analytics, real-time application monitoring, website search, and more.
    Note: AWS Glue crawlers scan various data stores to automatically infer schemas and populate the AWS Glue Data Catalog with corresponding table definitions and statistics. The catalog that AWS Glue generates can be used by Amazon Athena, Amazon Redshift, Amazon Redshift Spectrum, Amazon EMR, and third-party analytics tools that use a standard Apache Hive metastore catalog.
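Conceptually, a crawler scans sample records and infers a table schema from the values it sees. This toy sketch mimics that idea; the inference rules are simplified and hypothetical.

```python
# Toy schema inference, loosely in the spirit of what a crawler does:
# scan records and infer a column type from the first value seen.
def infer_schema(records):
    schema = {}
    for record in records:
        for key, value in record.items():
            if isinstance(value, bool):      # bool before int: bool is an int subclass
                inferred = "boolean"
            elif isinstance(value, int):
                inferred = "bigint"
            elif isinstance(value, float):
                inferred = "double"
            else:
                inferred = "string"
            schema.setdefault(key, inferred)
    return schema

rows = [{"id": 1, "city": "NYC"}, {"id": 2, "city": "LA", "fare": 9.5}]
print(infer_schema(rows))  # {'id': 'bigint', 'city': 'string', 'fare': 'double'}
```

A real crawler additionally records partitions, locations, and statistics in the Data Catalog, as the note above describes.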
22. True or False: AWS future-proofs data lakes with a standardized storage solution that has capabilities to grow and scale with an organization’s needs.
  • False
  • True
    Note: It is important that data lakes can non-disruptively evolve as needed. By building data lakes on AWS, customers can evolve their business around data assets. They can then use these data assets to quickly drive more business value and competitive differentiation with virtually no limits.
23. Which statements about the Amazon Kinesis Family are true? (Choose TWO.)
  • Amazon Kinesis Data Streams stores data only in the JSON format.
  • The Amazon Kinesis Family can ingest a high volume of small bits of data that are being processed in real time.
  • By writing data consumers, customers can move data that is ingested into Amazon Kinesis Data Streams to an Amazon Simple Storage Service (Amazon S3) bucket with minimum modification.
  • Amazon Kinesis Data Analytics loads data streams into AWS databases.
  • Amazon Kinesis Data Analytics provides an option to author non-Structured Query Language (SQL) code to process and analyze streaming data.
24. A company is receiving large amounts of streaming data from mobile devices, websites, servers, and sensors. They want to run analytics on the data that they are receiving. Which service should the company use for this use case?
  • Account monitoring with AWS CloudTrail
  • Log monitoring with Amazon CloudWatch
  • Log analysis with Amazon Kinesis Family
  • Log analysis with Amazon Pinpoint
    Note: One of the most common forms of data analysis with big data is log analytics.
25. True or False: It is a best practice that companies treat the original, ingested version of data in their data lake as immutable. Any data processing that is done to the original data should be stored as a secondary copy or extra copy of the data, which will then be analyzed.
  • True
  • False
    Note: Companies can make copies of the data, but the original data that was ingested must remain untouched. With Amazon Simple Storage Service (Amazon S3) Lifecycle policies, companies can move raw data to more cost-effective storage tiers when it becomes more infrequently accessed over time.
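The lifecycle idea in the note can be sketched as a single rule that tiers immutable raw data into colder storage over time; the prefix and day thresholds here are hypothetical.

```python
# Sketch of an S3 Lifecycle rule: raw data stays immutable but moves
# to cheaper storage classes as it becomes infrequently accessed
# (prefix and day thresholds are hypothetical).
lifecycle_rule = {
    "ID": "tier-raw-data",
    "Filter": {"Prefix": "raw/"},
    "Status": "Enabled",
    "Transitions": [
        {"Days": 30,  "StorageClass": "STANDARD_IA"},
        {"Days": 180, "StorageClass": "GLACIER"},
    ],
}

print([t["StorageClass"] for t in lifecycle_rule["Transitions"]])
```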
26. Which scenario represents AWS Glue jobs as the BEST tool for the job?
  • Transform data on a schedule or on demand.
  • Analyze data in batches on schedule or on demand.
  • Analyze data in real time as data comes into the data lake.
  • Transform data in real time as data comes into the data lake.
    Note: An AWS Glue job runs extract, transform, and load (ETL) scripts that connect to your source data, process it, and then write it out to your data target. AWS Glue triggers can start jobs based on a schedule or event, or on demand.
27. A company collects and analyzes large amounts of data daily. Why should the company use a compression strategy for their processes?
  • Compressed data uses a row-based data format that works well for data optimization.
  • Compressed data slows the time to process and analyze information.
  • Compressed data increases the risk of losing valuable information.
  • By using compressed data, data-processing systems can optimize for memory and cost.
    Note: Most high performance big data technologies copy data to RAM for faster performance. In this case, if companies compress data, they can fit more content into the same memory space. By doing so, companies can save on costs if they use services that charge per usage.
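The memory-saving effect is easy to demonstrate locally: repetitive log-style data compresses dramatically, so more content fits in the same memory or storage footprint.

```python
import gzip

# Repetitive log-style data, typical of what big data systems ingest.
data = b"2023-04-21 INFO request served\n" * 1000
compressed = gzip.compress(data)

# The compressed copy is a small fraction of the original size.
print(len(data), len(compressed))
```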
28. A software developer recently uploaded data logs from their application to Amazon Simple Storage Service (Amazon S3). Who is responsible for encrypting both the data at rest in the S3 bucket and the data in transit to the S3 bucket, according to the AWS shared responsibility model?
  • AWS
  • Customer
  • Both AWS and the customer
  • Third-party security company
    Note: According to the AWS shared responsibility model, customers are responsible for the security in the cloud. AWS provides many features that customers can use to encrypt data both at rest and in transit. For more information about the correct answer, see the Security and Compliance reading.
29. Which statement about data visualization is TRUE?
  • Raw data is generally formatted to be read and used by a human eye.
  • Visualization data is always captured in a text editor.
  • If there is more data, making sense of the data will be more difficult without using visualization tools.
  • A click map is the main reason to invest into data visualization.
    Note: Data visualization takes abstract data (or data that is not easily digested or understood) and represents the data in a visual and interactive way. The goal of data visualization is to enhance understanding of the data and to derive insights from it.
30. What makes Amazon QuickSight different, compared to other traditional business intelligence (BI) tools?
  • Data encryption at every layer
  • Super-fast, Parallel, In-memory Calculation Engine (SPICE)
  • The ability to create sharable dashboards
  • The ability to visualize data
    Note: SPICE is a QuickSight feature that is engineered to rapidly perform advanced calculations and serve data.
31. What is the purpose of the Registry of Open Data on AWS?
  • Help people discover and share datasets that are available through AWS resources.
  • Provide a service that people can use to transform public datasets that are published by data providers through an API.
  • Provide a service that people can use to ingest software as a service (SaaS) application data into a data lake.
  • Help people discover and share datasets that are available outside of AWS resources.
    Note: The Registry of Open Data on AWS is a collection of publicly available datasets that users can access from AWS resources.
32. True or False: Amazon QuickSight is a cloud-scale business intelligence (BI) service that developers can use to deliver interactive visualizations and dashboards for data analysis and forecasting.
  • True
  • False
    Note: QuickSight helps organizations create interactive dashboards from different data sources.
33. What does the AWS Glue Metadata Catalog service do?
  • The AWS Glue Metadata Catalog provides a repository where a company can store, find, and access metadata, and use that metadata to query and transform the data.
  • The AWS Glue Metadata Catalog is a query service that uses standard Structured Query Language (SQL) to retrieve data.
  • The AWS Glue Metadata Catalog provides a data transformation service where a company can author and run scripts to transform data between data sources and targets.
  • The AWS Glue Metadata Catalog provides a repository where a company can store and find metadata to keep track of user permissions to data in a data lake.
    Note: The AWS Glue Metadata Catalog is the central metadata repository, and it consists of a highly scalable collection of tables that are organized into databases. For more information, see the AWS Glue Data Catalog video.
34. A solutions architect is working for a customer who wants to build a data lake on AWS to store different types of raw data. Which AWS service should the solutions architect recommend to the customer to meet their requirements?
  • AWS Glue Metadata Catalog
  • Amazon OpenSearch Service
  • Amazon EMR
  • Amazon Simple Storage Service (Amazon S3)
    Note: Amazon S3 stores data contents of any type together in buckets with virtually unlimited storage. This storage type is best suited for data lakes. For more information, see the video Amazon S3.
35. Which statement BEST describes batch data ingestion?
  • Batch data ingestion is the process of collecting and transferring large amounts of data that have already been produced and stored on premises or in the cloud.
  • By using batch data ingestion, a user can create a unified metadata repository across various services on AWS.
  • Batch data ingestion is the process of capturing gigabytes (GB) of data per second from multiple sources, such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.
  • Batch data ingestion is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning, and application development.
    Note: Batch-based data ingestion processes large amounts of data that have already been produced or are being ingested periodically. Batch ingestion works best for environments where producing data insights is not time-sensitive. For more information, see the Batch Data Ingestion with AWS Transfer Family video.
36. A company plans to explore data lakes and their components. What are reasons to invest in a data lake? (Choose TWO.)
  • Limit data movement
  • Make data available from integrated departments
  • Lower transactional costs
  • Offload capacity from databases and data warehouses
  • Increase operational overhead
37. Which statement about whether data lakes make it easier to follow the “right tool for the job” approach is TRUE?
  • No, data lakes do not make it easier to follow the “right tool for the job” approach because you are tied to a specific AWS service.
  • Yes, data lakes make it easier to follow “the right tool for the job” approach because data lakes can only handle structured data.
  • No, data lakes do not make it easier to follow the “right tool for the job” approach because data lakes can only handle structured data.
  • Yes, data lakes make it easier to follow “the right tool for the job” approach because storage can be decoupled from processing and ingestion.
    Note: In traditional data warehouse solutions, storage and compute are tightly coupled, which can make it difficult to optimize costs and data processing workflows. With Amazon Simple Storage Service (Amazon S3), users can cost-effectively store all data types in their native formats. Users can then launch as many (or as few) virtual servers as they need by using Amazon Elastic Compute Cloud (Amazon EC2) to run analytical tools. They can also use services in the AWS analytics portfolio—such as Amazon Athena, AWS Lambda, Amazon EMR, and Amazon QuickSight—to process data. For more information, see the Use the Right Tool for the Job video.
38. Which task is performed by an AWS Glue crawler?
  • Store metadata in a catalog for indexing.
  • Analyze all data in the data lake to create an Apache Hive metastore.
  • Map data from one schema to another schema.
  • Populate the AWS Glue Data Catalog with tables.
    Note: A crawler can populate the AWS Glue Data Catalog with tables. The crawler is the primary method used by most AWS Glue users. For more information, see the Using S3, Glue and Athena to Get Insights about NYC Taxi Data video in week 4.
39. Which statements about data organization and categorization in data lakes are TRUE? (Choose TWO.)
  • Users must delete the original raw data to keep their data lake organized and cataloged.
  • Amazon Simple Storage Service (Amazon S3) is mostly used for storage, and AWS Glue is mostly used for categorizing data.
  • Data lakes need to be schema-on-write. In this case, users need to transform all the data before they load it into the data lake.
  • When cataloging data, it is a best practice to organize the data according to the access pattern of the user who will access it.
  • Data lakes are not future-proof, which means that they must be reconfigured each time new data is ingested.
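One common way to organize data by access pattern is Hive-style `key=value` prefixes, so queries that filter by date only scan matching prefixes. The bucket and dataset names below are hypothetical.

```python
# Build a Hive-style partition prefix so that date-filtered queries
# only scan the matching S3 prefixes (names are hypothetical).
def partition_prefix(dataset, year, month, day):
    return (f"s3://example-data-lake/{dataset}/"
            f"year={year}/month={month:02d}/day={day:02d}/")

print(partition_prefix("clickstream", 2023, 4, 21))
# s3://example-data-lake/clickstream/year=2023/month=04/day=21/
```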
40. Which type of data has the HIGHEST probability of containing structured data?
  • Raw data from marketing research surveys
  • Data that is sitting in a relational MySQL table
  • Customer reviews on products in retailer websites
  • Video files from mobile phone photo libraries
    Note: Structured data is data that is easy for computer systems to consume in its original format, without further modification. A relational database table has schemas, primary keys, foreign keys, and associated data relationships, which means that it works well for storing highly structured data. The database engine enforces this structure, and might even reject data that does not fit the rigid schema. For more information, see the Understanding Data Structure and When to Process Data video.
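Schema-on-write enforcement can be seen locally with sqlite3: the engine accepts rows that fit the declared schema and rejects rows that violate it.

```python
import sqlite3

# A relational table enforces structure at write time (schema-on-write).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER NOT NULL, name TEXT NOT NULL)")

conn.execute("INSERT INTO customers VALUES (1, 'Ana')")      # fits the schema
try:
    conn.execute("INSERT INTO customers VALUES (2, NULL)")   # violates NOT NULL
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```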
41. Which statement about data consumption in Amazon Kinesis Data Streams is TRUE?
  • Data consumers must use an AWS SDK to correctly fetch data from Kinesis in the same order that it was ingested. However, AWS Lambda functions do not need to fetch data from Kinesis in a specific order because Lambda integrates natively with AWS services, including Kinesis.
  • If data is not consumed within 15 minutes, Kinesis will delete the data that was added to the stream. This case is true even though the data-retention window is greater than 15 minutes.
  • Data is automatically pushed to each consumer that is connected to Kinesis. Thus, consumers are notified that new data is available, even when they are not running the Kinesis SDK for data consumption.
  • If data is consumed by a consumer, that consumer can never get that same data again. This case is true even if the data is still in the stream, according to the data-retention window.
    Note: AWS Lambda integrates natively with Amazon Kinesis as a consumer to process data that is ingested through a data stream. The complexities of polling, checkpointing, and error handling are abstracted when users use this native integration. For more information, see the Data Streaming Ingestion with Kinesis Services video.
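Ordering in Kinesis hinges on the partition key: records with the same key hash to the same shard and are read back in order within that shard. This is a simplified sketch of the idea (real Kinesis maps the MD5 hash into per-shard hash key ranges rather than a modulo, and the two-shard layout is hypothetical).

```python
import hashlib

# Simplified model of Kinesis shard assignment: the MD5 hash of the
# partition key selects a shard, so records sharing a partition key
# stay in order within one shard (modulo used here for brevity).
NUM_SHARDS = 2

def shard_for(partition_key):
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return digest % NUM_SHARDS

# All records for one device land on the same shard, preserving order.
print(shard_for("device-42"), shard_for("device-7"))
```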