Metadata
- Authors: Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satya R Valluri, Mohamed Zait, Jacques Nadeau
- Full Title:: The Composable Data Management System Manifesto
- Category:: 🗞️Articles
- URL:: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf
- Finished date:: 2023-09-09
Highlights
The Composable Data Management System Manifesto (View Highlight)
New highlights added 2023-09-10
Time-to-market fallacy. It is also common for developers to believe that a quick prototype containing a subset of functionality decreases the time-to-market for their products. In cases where this holds true, it frequently understates the high cost of stabilizing (hardening) the software, and the long-tail of features required to turn the prototype into a real product. Ultimately, time-to-market does not simply depend on writing the code, but on stabilizing it against a real workload. This usually results in products with incomplete and inconsistent features, hard to maintain (once the engineers who wrote the prototype move to a different project), and generalized tech debt (View Highlight)
Bullshit Jobs: The Rise of Pointless Work, and What We Can Do About It
Lack of incentives. Developers are usually not compelled to write reusable components because there are few incentives to do so. From an individual developer’s perspective, it takes more effort to develop a modular system than to develop a monolith, and it is more difficult to build a business model for a data management system component than it is for an end-to-end system. In the short-term, it is usually easier for a particular group to develop their own system and internal components (local optimum), than to share with other groups, reuse, and collaborate (global optimum). (View Highlight)
we believe a state-of-art language frontend library should also provide support for the features below, which are increasingly relevant: (View Highlight)
For example, comparisons of two integers representing different concepts (UserID and DeviceID) can be statically avoided during type checking (View Highlight)
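A minimal sketch of what that check could look like, assuming a toy expression tree. `SemanticType`, `Column`, `Equals`, and `type_check` are hypothetical names invented for this note, not part of the paper or of any library. The point is only that the comparison is rejected while the expression is analyzed, before any data is read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticType:
    """A physical type plus a label for what the value means (hypothetical)."""
    physical: str  # e.g. "int64"
    concept: str   # e.g. "UserID"


@dataclass(frozen=True)
class Column:
    name: str
    type: SemanticType


@dataclass(frozen=True)
class Equals:
    """An equality comparison between two columns."""
    left: Column
    right: Column


def type_check(expr: Equals) -> None:
    """Reject comparisons across concepts; no data is touched here."""
    if expr.left.type != expr.right.type:
        raise TypeError(
            f"cannot compare {expr.left.name} ({expr.left.type.concept}) "
            f"with {expr.right.name} ({expr.right.type.concept})"
        )


# Both concepts share the same physical type but differ in meaning.
USER_ID = SemanticType("int64", "UserID")
DEVICE_ID = SemanticType("int64", "DeviceID")

user_id = Column("user_id", USER_ID)
owner_id = Column("owner_id", USER_ID)
device_id = Column("device_id", DEVICE_ID)

type_check(Equals(user_id, owner_id))  # same concept: accepted

try:
    type_check(Equals(user_id, device_id))  # different concepts: rejected
    rejected = False
    message = ""
except TypeError as err:
    rejected = True
    message = str(err)
```

Both columns are plain 64-bit integers physically, so an engine that only tracks physical types would accept the second comparison.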
Type Checked Macros (View Highlight)
dataframe-like APIs and other DSLs, offering a more programmatic way to express the same type of computation without requiring error-prone concatenations of pieces of SQL statements. (View Highlight)
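To make the contrast concrete, here is a rough sketch of a programmatic builder. The `Query` class is hypothetical and written only for this note; it is not the API of Ibis or any other library. Predicates are kept as plain strings for brevity, whereas a real frontend would presumably model them as typed expressions too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Immutable query description; each method returns a new Query."""
    table: str
    columns: tuple[str, ...] = ("*",)
    predicates: tuple[str, ...] = ()

    def select(self, *columns: str) -> "Query":
        return Query(self.table, columns, self.predicates)

    def filter(self, predicate: str) -> "Query":
        return Query(self.table, self.columns, self.predicates + (predicate,))

    def to_sql(self) -> str:
        # Clause order, separators, and parentheses are handled in one place.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql


# Queries compose step by step, and intermediate results stay reusable.
base = Query("events").filter("country = 'BR'")
recent = base.filter("ts >= '2023-01-01'").select("user_id", "ts")

base_sql = base.to_sql()
recent_sql = recent.to_sql()
```

With hand-concatenated SQL, the caller has to remember whether a `WHERE` has already been emitted and where the `AND` goes; here that bookkeeping lives in `to_sql`.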
an informal survey conducted at Meta identified at least 12 different implementations of the simple string manipulation function substr(), presenting different parameter semantics (0- vs. 1-based indices), null handling, and exception behavior [18]. (View Highlight)
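A small illustration of how two reasonable-looking implementations can disagree. Both functions below are hypothetical variants made up for this note and do not reproduce the behavior of any specific engine; they differ only along the three axes the highlight names.

```python
from typing import Optional


def substr_one_based(s: Optional[str], start: int, length: int) -> Optional[str]:
    """Variant A: 1-based start, null in gives null out, never raises."""
    if s is None:
        return None
    if start < 1:
        return ""
    return s[start - 1 : start - 1 + length]


def substr_zero_based(s: Optional[str], start: int, length: int) -> str:
    """Variant B: 0-based start, raises on null or out-of-range start."""
    if s is None:
        raise ValueError("input must not be null")
    if not 0 <= start <= len(s):
        raise IndexError(f"start {start} out of range for length {len(s)}")
    return s[start : start + length]


# Same arguments, different answers.
a = substr_one_based("velox", 1, 3)
b = substr_zero_based("velox", 1, 3)

# Same null input, different failure modes.
null_a = substr_one_based(None, 1, 3)
try:
    substr_zero_based(None, 1, 3)
    null_b_raised = False
except ValueError:
    null_b_raised = True
```

A query that moves between engines using these two variants would silently return shifted substrings, and would start failing on rows with null inputs.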
Despite sounding idealistic, a reasonably functional stack can be built today by solely leveraging open source projects like Ibis (language), Substrait (IR), Calcite (optimizer), Velox (execution), and a distributed runtime such as Spark, Ray, or a serverless architecture. (View Highlight)
if the hypothetical proposition is a better query optimizer, one could fully replace the optimizer layer by a custom implementation, as long as the APIs are maintained, and keep the remaining layers of the stack intact (View Highlight)
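A sketch of the "swap one layer, keep the rest" idea, assuming a drastically simplified stack. Everything here is hypothetical: `Plan` stands in for an IR, `frontend` and `execute` are placeholders for the language and execution layers, and the pushdown rule is a toy. None of it mirrors the actual interfaces of Substrait, Calcite, or Velox.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Plan:
    """Stand-in for an IR: an ordered tuple of operator names."""
    ops: tuple[str, ...]


class Optimizer(Protocol):
    """The only contract the rest of the stack depends on."""
    def optimize(self, plan: Plan) -> Plan: ...


class NoOpOptimizer:
    def optimize(self, plan: Plan) -> Plan:
        return plan


class FilterPushdownOptimizer:
    """Toy rule: run filters right after the scan, before other operators."""
    def optimize(self, plan: Plan) -> Plan:
        scan, *rest = plan.ops
        filters = [op for op in rest if op.startswith("filter")]
        others = [op for op in rest if not op.startswith("filter")]
        return Plan((scan, *filters, *others))


def frontend(query: str) -> Plan:
    """Placeholder language layer: ignores the text, emits a fixed plan."""
    return Plan(("scan:events", "sort:ts", "filter:country"))


def execute(plan: Plan) -> list[str]:
    """Placeholder execution layer: reports the operator order it would run."""
    return list(plan.ops)


def run(query: str, optimizer: Optimizer) -> list[str]:
    # Frontend and execution are untouched regardless of the optimizer chosen.
    return execute(optimizer.optimize(frontend(query)))


query = "SELECT * FROM events WHERE country = 'BR' ORDER BY ts"
baseline = run(query, NoOpOptimizer())
improved = run(query, FilterPushdownOptimizer())
```

Because `run` depends only on the `Optimizer` protocol, a new optimizer can be evaluated against the same frontend and executor without modifying either.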
Recently, frameworks like Ray and Dask push the computation flexibility even further and allow arbitrary functions to be executed at every worker, offering tighter integration with the Python ecosystem and targeting data science and machine learning workloads (View Highlight)
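The task-style model can be approximated locally. The sketch below uses the standard library's `concurrent.futures.ThreadPoolExecutor` as a stand-in for a pool of workers; it is not the API of Ray or Dask, and `summarize` is an arbitrary function chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def summarize(partition: list[float]) -> dict[str, float]:
    """Arbitrary user code; nothing here is a fixed relational operator."""
    return {"n": len(partition), "mean": mean(partition)}


# Each inner list plays the role of a data partition held by one worker.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Ship the same function to every worker, one task per partition.
    futures = [pool.submit(summarize, p) for p in partitions]
    # Collect results in submission order.
    results = [f.result() for f in futures]
```

The scheduler does not need to know what `summarize` does, which is what makes this model a natural fit for Python-heavy data science and machine learning code.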