Broadly, it consists of four components:

1. A stream processor or message broker, to collect data and redistribute it
2. Data transformation tools (ETL, ELT, and so on), to ready data for querying
3. Query engines, to extract the business value
4. Cost-effective storage for high volumes of streaming data – file storage and object storage
Stream processors are high-capacity (>1 Gb/second), but beyond collecting data and redistributing it they perform no data transformation or task scheduling.
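For illustration, here is a minimal producer sketch, assuming Apache Kafka as the broker and a hypothetical clickstream-events topic; the broker simply appends and redistributes these records, leaving any transformation to downstream tools:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ClickstreamProducer {
  def main(args: Array[String]): Unit = {
    // Assumed broker address and topic name -- replace with your own.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    try {
      // The broker only appends and redistributes this record;
      // any transformation happens downstream.
      val record = new ProducerRecord[String, String](
        "clickstream-events", "user-123", """{"page":"/home","ts":1700000000}""")
      producer.send(record)
    } finally {
      producer.close()
    }
  }
}
```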
Note that, depending on your needs and the architecture you build, data transformation may occur directly on the data as it streams in, before it is stored in a lake or other repository, or after it has been ingested and stored.
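As a rough sketch of the first option (transforming on ingest), assuming a Kafka source, a hypothetical events topic, and a hypothetical s3a://my-bucket path, a Spark Structured Streaming job could clean records in flight and write them straight to object storage; the same transformation logic could instead run as a batch job over data that has already landed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transform-on-ingest")
      .getOrCreate()
    import spark.implicits._

    // Read raw events from the message broker (assumed Kafka source and topic).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Transform records as they stream in, before they land in the lake.
    val cleaned = raw
      .selectExpr("CAST(value AS STRING) AS json", "timestamp")
      .withColumn("event_date", to_date($"timestamp"))

    // Write the transformed stream to object storage (hypothetical bucket).
    val query = cleaned.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
      .partitionBy("event_date")
      .start()

    query.awaitTermination()
  }
}
```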
Building data transformations in Spark requires lengthy coding in Scala, as well as expertise in implementing dozens of Hadoop best practices around object storage, partitioning, and merging small files.
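As a simplified illustration of just one of those practices (not a full treatment), a small compaction job, again assuming the hypothetical s3a://my-bucket paths and an event_date partition column, could merge the many small files a streaming job produces into fewer, larger files per partition:

```scala
import org.apache.spark.sql.SparkSession

object CompactEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-files")
      .getOrCreate()

    // Read the fragmented output of the streaming job (hypothetical path).
    val events = spark.read.parquet("s3a://my-bucket/events/")

    // Shuffle by the partition key so each output partition is written as a
    // small number of larger files instead of many tiny ones.
    events
      .repartition(events.col("event_date"))
      .write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3a://my-bucket/events-compacted/")
  }
}
```

Writing the compacted copy to a separate path avoids overwriting the data the job is still reading from.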
Gaining this visibility on read, rather than trying to infer it on write, saves you much trouble down the line: as schema drift occurs (unexpected new, deleted, or changed fields), you can build ETL pipelines based on the most accurate data available.
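A sketch of what schema-on-read looks like in practice, assuming Spark and the same hypothetical Parquet path: infer and inspect the schema that is actually present in the data, including drifted fields, before building ETL on top of it.

```scala
import org.apache.spark.sql.SparkSession

object InspectSchemaOnRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-on-read")
      .getOrCreate()

    // Reconcile Parquet files written with different schemas over time,
    // so fields that drifted in later still show up (hypothetical path).
    val events = spark.read
      .option("mergeSchema", "true")
      .parquet("s3a://my-bucket/events/")

    // Inspect the schema actually present before building pipelines on it.
    events.printSchema()
  }
}
```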
Store your data in open columnar file formats, such as Apache Parquet or ORC.
Retain raw historical data in inexpensive object storage, such as Amazon S3.
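These two recommendations go together: keep the raw data untouched in object storage as the source of truth, and maintain a columnar copy for query engines to scan. A minimal sketch, assuming hypothetical s3a://my-bucket paths and JSON as the raw format:

```scala
import org.apache.spark.sql.SparkSession

object RawToColumnar {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("raw-to-columnar")
      .getOrCreate()

    // The raw JSON stays untouched in cheap object storage as the source of truth.
    val raw = spark.read.json("s3a://my-bucket/raw/events/")

    // A columnar copy (Parquet here) is what the query engines actually scan.
    raw.write
      .mode("append")
      .parquet("s3a://my-bucket/curated/events/")
  }
}
```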
Use a well-supported central metadata repository such as AWS Glue or the Hive metastore.
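To close the loop, the curated data can be registered in that shared catalog so every engine sees the same tables. A sketch assuming Spark with Hive support (on Amazon EMR the same metastore interface can be backed by the AWS Glue Data Catalog), with a hypothetical analytics database and the paths from above:

```scala
import org.apache.spark.sql.SparkSession

object RegisterTable {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() points Spark at the shared metastore.
    val spark = SparkSession.builder()
      .appName("register-events-table")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

    // Register the curated Parquet data as an external table so any engine
    // sharing the metastore (Hive, Presto, Spark SQL, and so on) can query it.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        user_id STRING,
        page STRING,
        ts BIGINT
      )
      STORED AS PARQUET
      LOCATION 's3a://my-bucket/curated/events/'
    """)
  }
}
```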