

Highlights

blue I wonder if this is true for ClickHouse or BQ

Most data warehouses promise the ability to scale to multiple PBs of data and operate on unstructured data, and they are relentlessly improving support for both higher volume and variety. It’s important to remember that data warehouses are not designed to store and process tens or hundreds of PBs, at least as they stand today. An additional consideration is cost, where, depending on your scenarios, it could be a lot cheaper to store data in your data lake as compared with the data warehouse. Additionally, while data warehouses offer support for unstructured data, their highly optimized path is to process structured data that is in a proprietary format specific to that warehouse. Although the line between data lakes and data warehouses continues to blur, it is important to keep these original value propositions in mind when picking the right architecture for your data platform. (Location 289)

blue

This high-value, structured data is then either loaded into an enterprise data warehouse for consumption or consumed directly from the data lake. (Location 501)

blue

design your data lake for the company’s future. Make your implementation choices based on what you need immediately! (Location 554)

blue

Big data may mean more information, but it also means more false information. Nassim Taleb (Location 613)

blue

This article on Future.com provides a comprehensive overview of the various components of a modern data architecture. (Location 630)

New highlights added 2023-04-03

blue

Recently, data warehouses have started supporting open data formats like Apache Iceberg, a very promising trend that directionally supports the data lakehouse architecture, (Location 781)

blue

there are two major scenarios that are common consumption patterns in an organization: Business intelligence, where data is used by BI analysts to create dashboards or work on interactive queries to answer key business problems that are well defined and work on highly structured data; and Data science and machine learning. (Location 972)

blue

In a modern data warehouse architecture, both the data lake and the data warehouse peacefully coexist, each serving a distinct purpose. (Location 981)

blue

The data lake serves as low-cost storage for a large amount of data and supports exploratory scenarios such as data science and machine learning. The data warehouse stores high-value data and powers dashboards used by the business. It is also used by BI users to query the highly structured data to gain insights about the business. (Location 987)

blue

Data lakes cost a lot less than data warehouses and can act as your long-term repository of data. (Location 1025)

blue

data science (Location 1027)

blue

There are also a few challenges with this approach. The data engineers and administrators still need to maintain two sets of infrastructures: a data lake and a data warehouse. (Location 1103)

blue

Well, if we had the option to run our BI scenarios on the data lake already, why didn’t we do this in the first place? The simple answer is because data lakes by themselves are not really structured to support BI queries, and there are various technologies that have made the lakehouse a reality. (Location 1145)

blue

there has been a healthy growth of mindshare contributing to key technologies that make the data lakehouse paradigm a reality today. Some of these technologies include Delta Lake, which originated in Databricks; Apache Iceberg, which originated in Netflix; and Apache Hudi, which originated in Uber. (Location 1197)

blue

To enable a data lakehouse architecture, you need to ensure that you leverage one of the open data technologies, such as Apache Iceberg, Delta Lake, or Apache Hudi, as well as a compute framework that understands and respects these formats. (Location 1218)
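
To make this concrete, here is a minimal sketch of pairing an open table format with a compute framework that understands it, assuming PySpark with the delta-spark package installed; the paths, dataset, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch (assumption: the delta-spark package is installed).
# Configure Spark so its SQL engine understands the Delta Lake format.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical dataset written to the data lake as a Delta table.
sales = spark.createDataFrame(
    [("2023-04-01", "US", 120.0), ("2023-04-01", "EU", 80.0)],
    ["order_date", "region", "amount"],
)
sales.write.format("delta").mode("overwrite").save("/datalake/curated/sales")

# The same open-format table can now be read by any Delta-aware engine.
spark.read.format("delta").load("/datalake/curated/sales").show()
```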

blue

The data format is key to a lakehouse architecture for the following reasons: (Location 1247)

blue

The data stored is optimized for queries, especially to support the BI use cases that largely use SQL-like queries. This optimization is crucial to support query performance that is comparable to a data warehouse. (Location 1255)

blue

They all derive from a fundamental data format, Apache Parquet, (Location 1267)

blue

each was designed with a specific purpose in mind, (Location 1291)

blue But are there performance improvements over Parquet-only or Hive tables? I would need to check the paper.

Delta Lake by Databricks is optimized for running highly performant SQL queries on the data lake, leveraging the metadata to do intelligent data skipping to read only the data required to serve the queries. (Location 1295)

blue

in the case of Delta Lake, a format developed by Databricks, the compute component—that is, their Spark engine—is optimized for operating on Delta tables and further enhances performance with caching and a Bloom filter index for effective data skipping. (Location 1310)

blue

The lakehouse provides a key advantage over the modern data warehouse by eliminating the need to have two places to store the same data. Let’s say that the data science team leveraged its new datasets, such as the weather data, and built a new dataset that correlates sales with the weather. The business analysts have this data ready to go for their deeper analysis since everyone is using the same data store, and possibly the same data formats. Similarly, if the business analysis generated a specific filtered dataset, the data scientists can start using this for their analysis. (Location 1331)

blue Doesn’t this lead to more entropy?

This completely explodes the scenarios, promoting the cross-pollination of insights between the different classes of consumers of the data platform. (Location 1341)

blue

the barriers to getting started are high, due to the tooling ecosystem and the skills and engineering complexities involved. Very much like microservices, the data lakehouse is seeing rapid innovations in this space, such that this barrier will only get lower and lower with time. (Location 1368)

blue

Data mesh architecture (Location 1438)

blue The question here is how difficult it then becomes to share data across the mesh

Architecturally, there is a shift from a monolithic implementation of a large central data warehouse or a data lake to a distributed mesh of data lakes and data warehouses that still make a single logical representation of data by sharing insights and data between them. (Location 1446)

blue

a data lake architecture comes with its own complexities from the diversity of the data and the ecosystem; adding a distributed layer increases this complexity. (Location 1501)

blue

I have worked with customers who always assumed that the data engineering team would be the sole team with access to data in the data lake and did not implement the right set of security and access controls, only to find the scenarios growing rapidly, with everyone having access to everything and causing accidental data deletes. (Location 1554)

blue

Have no fear of perfection—you will never reach it. Salvador Dalí (Location 1641)

blue Data scientist >> BI?

you need to determine who the customers of your data lake are (Location 1690)

blue

The very first step she takes is to inventory the problems across the organization, and she comes up with the list outlined in Table 3-1. (Location 1713)

blue

Based on this inventory, Alice defines the goals of her data lake implementation as follows and reviews them with her stakeholders to finalize the goals: (Must have) Support better scale and performance for existing sales and marketing dashboards, as measured by a 50% increase in query performance at the 75th percentile. (Must have) Support data science models on the data lake, as measured by a pilot engagement on product offering recommendations to the executive team. (Nice to have) Support more data science models on the data lake, as measured by the next set of scenarios on partnership identification for sales and influencer recommendation for marketing. (Location 1748)

blue

Note that cloud data warehouses like Snowflake are blurring the boundaries between data warehouse and data lake. At the point when I’m writing this book, I would personally qualify Snowflake as a data warehouse, mainly because the primary use case for Snowflake is operating on structured data. (Location 1810)

blue

upskilling is required for the tooling support and automation end to end. (Location 1825)

New highlights added 2023-04-04

blue Roll-up tables and dimensional modeling

Data in this zone is processed by performing aggregations, filtering, correlating data from different datasets, and other complex calculations that provide the key insights into solutions for business problems. (Location 2019)
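
As a hedged illustration of the kind of processing described here, the following sketch builds a simple roll-up table with PySpark; the dataset, paths, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curated-zone-sketch").getOrCreate()

# Hypothetical enriched sales data landed in an earlier zone of the lake.
sales = spark.read.parquet("/datalake/enriched/sales")

# Aggregate and filter into a roll-up table that answers a business question:
# monthly revenue and distinct customers per region, for completed orders only.
monthly_revenue = (
    sales
    .filter(F.col("order_status") == "completed")
    .groupBy(F.date_trunc("month", F.col("order_date")).alias("month"), "region")
    .agg(
        F.sum("amount").alias("revenue"),
        F.countDistinct("customer_id").alias("customers"),
    )
)

monthly_revenue.write.mode("overwrite").parquet("/datalake/curated/monthly_revenue")
```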

blue

mapping all of this together is key to ensuring data quality. (Location 2285)

blue

A useful framework for data reliability is to ensure there is a measurable metric for the five key pillars: data freshness, distribution, volume, schema, and lineage. (Location 2286)
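
To make one of these pillars measurable, here is a minimal sketch of a freshness check, assuming PySpark and a hypothetical dataset with an ingestion_time timestamp column; the six-hour threshold is illustrative.

```python
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("freshness-check-sketch").getOrCreate()

# Hypothetical dataset with an ingestion_time column written by the pipeline.
events = spark.read.parquet("/datalake/enriched/events")

# Freshness: how stale is the most recently ingested record?
# (Assumption: ingestion_time is a timestamp; Spark returns it as a naive datetime.)
latest = events.agg(F.max("ingestion_time").alias("latest_ingest")).first()["latest_ingest"]
staleness = datetime.now() - latest

# Illustrative SLA: flag the dataset if it is more than six hours behind.
if staleness > timedelta(hours=6):
    print(f"Freshness check failed: newest record is {staleness} old")
else:
    print(f"Freshness OK: newest record is {staleness} old")
```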

New highlights added 2023-04-05

blue

A large file could be chunked up to do a parallel copy, but you cannot combine multiple files into a single copy job. If you have a lot of small files to be copied, you can expect this to take a long time because the listing operation for the operator could take longer, (Location 2680)

blue

If you need to have a hybrid or multicloud solution, pay attention to the data transfers, and ideally ensure that the data transfers across these multiple environments are minimal and carefully thought through. (Location 2846)

blue

They repeated this analysis for their other datasets and ensured that the partitioning strategy met their usage patterns. (Location 3655)

blue Take a look at this

Apache Spark offers serialization with the Java as well as the Kryo libraries; Kryo offers faster, more efficient serialization compared to Java. Leveraging the Apache Spark configuration to use the Kryo serializer will provide performance optimizations, especially for networking-intensive applications where large data transfers go over the network with more complex transformations, or when using cloud data lake storage to persist the datasets. You can read more about the Kryo serializer in the Apache Spark Performance Tuning—Data serialization documentation. (Location 3681)
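
A minimal configuration sketch for switching to the Kryo serializer, assuming PySpark; the buffer setting is optional tuning.

```python
from pyspark.sql import SparkSession

# Switch Spark's serializer from the default Java serializer to Kryo.
spark = (
    SparkSession.builder
    .appName("kryo-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optional tuning: allow larger buffers for big objects sent over the network.
    .config("spark.kryoserializer.buffer.max", "256m")
    .getOrCreate()
)
```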

blue

Keeping the data relatively flat and minimizing many nested structures can ensure that it will use less memory. (Location 3691)
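
As a small illustration, the sketch below promotes nested fields to top-level columns with PySpark; the schema and values are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-sketch").getOrCreate()

# Hypothetical record with a nested address struct.
orders = spark.createDataFrame(
    [(1, ("Seattle", "WA", "98101"))],
    "order_id INT, address STRUCT<city: STRING, state: STRING, zip: STRING>",
)

# Keep the data relatively flat: promote nested fields to top-level columns.
flat_orders = orders.select(
    "order_id",
    F.col("address.city").alias("city"),
    F.col("address.state").alias("state"),
    F.col("address.zip").alias("zip"),
)
flat_orders.show()
```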

blue

Some customers I know have spent more time tuning for performance than they have authoring their Spark jobs, so plan and budget for that time in your Spark job pipelines. (Location 3715)

New highlights added 2023-04-12

blue

If you have scenarios where you need to use datasets across regions, complete all your computations in the region where the datasets originate, and transfer only the completely processed datasets. (Location 3739)

blue

With Apache Spark, customers could use a single programming model that worked for both data engineers for core data processing as well as data scientists for machine learning scenarios. (Location 3896)

blue This is quite old

The architecture pattern that supports both of these paths is referred to as a lambda architecture, (Location 3920)

blue

Apache Hive stored the data in files and folders on the object storage file system. This meant that any time data needed to be queried, the files and folders needed to be listed to find the data of interest. As the size of the data grew to a petabyte scale, the need to list files at that scale became really expensive and created performance bottlenecks for the queries. (Location 4048)

New highlights added 2023-04-14

blue

The cloud data lake is a rapidly evolving field, so your decisions need to be grounded in the problems you face in your current implementation as well as the opportunities that new innovations on the cloud data lake could bring to your organization. (Location 4396)

blue

An important aspect to remember here is that investing in the assess phase is critical to derisking your data lake design, implementation, and release. Take your time to build the prioritization and stakeholder alignment with your customers and business leaders. Although you may not know everything and things will change, an initial list of prioritized requirements and stakeholder buy-in ensure that changes are managed appropriately. (Location 4400)

blue

To your data platform team (Location 4478)

blue

How much effort and time are spent on your data operations today? (Location 4479)

blue

What is the cost of running your data operations today? If you were to lower this cost, how would that help your operations? (Location 4480)

blue

A good rule of thumb is to aim for 60%–70% accuracy and completeness and target a one- to two-year time horizon, so you have a solid-enough plan that is also adaptable to any new changes. (Location 4530)

blue

A data lake is very inexpensive in terms of operations but costs more in terms of development effort to build your solution. The development cost is high here because you will be assembling your solution with different compute and storage components, as opposed to getting a ready-to-use solution. (Location 4544)

blue

Cloud data lakes, as we saw, require relatively high skill sets; factor that in when making architecture choices. (Location 4563)

blue

did a PoC on the three cloud providers, (Location 4657)

blue Dies in waterfall

When you exit Phase 2, you have finalized the technical architecture and design, and you have a project plan ready with a final plan and scope. (Location 4721)

blue

As far as my personal experience goes, I’ve often seen the data platform be a lean organization where the demands are always higher than what the team can support at a given time. (Location 4988)

blue

You are the data experts; your customers have other areas of expertise, and data expertise should not be a requirement for them. In your investments, prioritize for end-to-end customer experiences that are seamless, and err on the side of saying no rather than offering a poorly implemented or half-baked solution. (Location 4991)