
## Metadata
- Author:: [[robinlinacre.com|Robinlinacre]]
- Full Title:: Demystifying Apache Arrow
- Category:: #🗞️Articles
- URL:: https://www.robinlinacre.com/demystifying_arrow/
- Finished date:: [[2023-02-06]]
## Highlights
##### Faster CSV reading
> Data in Arrow is stored in-memory in record batches, a 2D data structure containing contiguous columns of data of equal length. A ‘[table](https://arrow.apache.org/docs/python/data.html#tables)’ can be created from these batches without requiring additional memory copying, because tables can have ‘chunked’ columns (i.e. sections of data, each part representing a contiguous chunk of memory). This design means that data can be read in parallel rather than the single-threaded approach of pandas. ([View Highlight](https://read.readwise.io/read/01grk0y07f15ek75zva3q29wrg))
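A rough sketch of those two points (the file name and column are made up, not from the article): pyarrow's CSV reader uses multiple threads, and a `Table` built from record batches keeps each batch as a separate chunk of its columns rather than copying everything into one buffer.

```python
import pyarrow as pa
import pyarrow.csv as pv

# Multi-threaded CSV read straight into an Arrow Table
# (use_threads=True is already the default; shown here for emphasis).
table = pv.read_csv("data.csv", read_options=pv.ReadOptions(use_threads=True))

# A Table assembled from record batches keeps each batch as a chunk of the
# columns, so no additional memory copying is needed.
batch_1 = pa.record_batch({"x": [1, 2, 3]})
batch_2 = pa.record_batch({"x": [4, 5, 6]})
combined = pa.Table.from_batches([batch_1, batch_2])
print(combined.column("x").num_chunks)  # 2 - the batches were not merged into one buffer
```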
##### Faster User Defined Functions (UDFs) in PySpark
> The use of Arrow almost completely eliminates the serialisation and deserialisation step, and also allows data to be processed in columnar batches, meaning more efficient vectorised algorithms can be used. ([View Highlight](https://read.readwise.io/read/01grk10edj0bf0r2q0a5pjwdfr))
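A hedged sketch of the pattern this refers to (the session setup and column names are illustrative, not from the article): with Arrow enabled, Spark transfers data to the Python workers in columnar batches, and a pandas UDF operates on whole batches rather than one row at a time.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Arrow-based transfer between the JVM and the Python workers.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000).withColumnRenamed("id", "value")

@pandas_udf("long")
def times_two(values: pd.Series) -> pd.Series:
    # Vectorised: receives a whole Arrow-backed batch as a pandas Series.
    return values * 2

df.select(times_two("value")).show(5)
```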
> Arrow provides translators that are then able to convert this into language-specific in-memory formats like the pandas dataframe ([View Highlight](https://read.readwise.io/read/01grk123gpm4hcd8ztcrkhe324))
> This problem of translation is one that Arrow aims eventually to eliminate altogether: ideally there would be a single in-memory format for dataframes rather than each tool having its own representation. ([View Highlight](https://read.readwise.io/read/01grk12cnxq9wqkhnevywkc51m))
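A minimal example of one such translator, using pyarrow's built-in conversions between the Arrow format and a pandas dataframe (the data is made up):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"name": ["a", "b"], "score": [1.0, 2.5]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow
round_tripped = table.to_pandas()  # Arrow -> pandas
print(table.schema)
```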
##### Writing parquet files
> Why not just persist the data to disk in Arrow format, and thus have a single, cross-language data format that is the same on-disk and in-memory? One of the biggest reasons is that Parquet generally produces smaller data files, which is more desirable if you are IO-bound. This will especially be the case if you are loading data from cloud storage such as AWS S3. ([View Highlight](https://read.readwise.io/read/01grk13txz67m7zyvqsyk4vr3n))
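One way to see this (file names and data are made up; actual sizes depend on the data and compression settings) is to write the same table to Parquet and to the Arrow IPC/Feather format and compare the files on disk:

```python
import os

import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({"id": list(range(n)), "flag": [i % 2 == 0 for i in range(n)]})

pq.write_table(table, "example.parquet")       # encoded + compressed columnar file
feather.write_feather(table, "example.arrow")  # Arrow memory layout written to disk

print(os.path.getsize("example.parquet"), os.path.getsize("example.arrow"))
```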
> The trade-offs for columnar data are different in-memory than on disk. ([View Highlight](https://read.readwise.io/read/01grk17a169brz9atvm0ecbcxp))
> This makes it desirable to have close cooperation in the development of the in-memory and on-disk formats, with predictable and speedy translation between the two representations — and this is what is offered by the Arrow and Parquet formats. ([View Highlight](https://read.readwise.io/read/01grk17hqeqbs0ajq9vzs9medr))
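And the translation in the other direction: a Parquet file reads straight into an Arrow Table, with no intermediate row-based representation (continuing the made-up file name from the sketch above):

```python
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")  # on-disk Parquet -> in-memory Arrow
print(table.num_rows, table.schema)
```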
> A common (cross-language) in-memory representation of dataframes ([View Highlight](https://read.readwise.io/read/01grk193eawvqaje5av0h19fgd))
> These ideas come together in the description of Arrow as an ‘API for data’. The idea is that Arrow provides a cross-language in-memory data format, and an associated query execution language that provide the building blocks for analytics libraries. Like any good API, Arrow provides a performant solution to common problems without the user needing to fully understand the implementation ([View Highlight](https://read.readwise.io/read/01grk19hmqb4pwv2cfnvyzdvnz))
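A small illustration of that "building blocks" idea: the same Arrow table can be handed to Arrow's own compute kernels without the caller knowing how its buffers are laid out (column names and values are made up):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"price": [10.0, 20.0, 5.0], "qty": [1.0, 2.0, 3.0]})

revenue = pc.multiply(table["price"], table["qty"])  # vectorised kernel over whole columns
print(pc.sum(revenue).as_py())                       # 65.0
```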
> This paves the way for a step change in data processing capabilities, with analytics libraries:
> • being parallelised by default
> • applying highly optimised calculations on dataframes
> • no longer needing to work with the constraint that the full dataset must fit into memory ([View Highlight](https://read.readwise.io/read/01grk19rxdde0b3yh4h0pjqg7g))
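A hedged sketch of those three points using pyarrow's dataset API (the directory path is hypothetical): the scan runs in parallel, the filter is pushed down as an optimised compute expression, and data is streamed in record batches so the full dataset never has to fit in memory.

```python
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/parquet_files/", format="parquet")

scanner = dataset.scanner(
    columns=["price"],
    filter=ds.field("qty") > 0,  # applied during the scan, not afterwards in Python
)

total = 0.0
for batch in scanner.to_batches():  # streamed record batches, not one big load
    total += sum(batch.column("price").to_pylist())
print(total)
```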