
## Metadata
- Author:: [[robinlinacre.com|Robinlinacre]]
- Full Title:: Demystifying Apache Arrow
- Category:: #🗞️Articles
- URL:: https://www.robinlinacre.com/demystifying_arrow/
- Finished date:: [[2023-02-06]]
## Highlights
##### Faster CSV reading
> Data in Arrow is stored in-memory in record batches, a 2D data structure containing contiguous columns of data of equal length. A ‘[table](https://arrow.apache.org/docs/python/data.html#tables)’ can be created from these batches without requiring additional memory copying, because tables can have ‘chunked’ columns (i.e. sections of data, each part representing a contiguous chunk of memory). This design means that data can be read in parallel rather than the single-threaded approach of pandas. ([View Highlight](https://read.readwise.io/read/01grk0y07f15ek75zva3q29wrg))
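A rough sketch of those two points (the file name and column are made up, not from the article): pyarrow's CSV reader uses multiple threads, and a `Table` built from record batches keeps each batch as a separate chunk of its columns rather than copying everything into one buffer.

```python
import pyarrow as pa
import pyarrow.csv as pv

# Multi-threaded CSV read straight into an Arrow Table
# (use_threads=True is already the default; shown here for emphasis).
table = pv.read_csv("data.csv", read_options=pv.ReadOptions(use_threads=True))

# A Table assembled from record batches keeps each batch as a chunk of the
# columns, so no additional memory copying is needed.
batch_1 = pa.record_batch({"x": [1, 2, 3]})
batch_2 = pa.record_batch({"x": [4, 5, 6]})
combined = pa.Table.from_batches([batch_1, batch_2])
print(combined.column("x").num_chunks)  # 2 - the batches were not merged into one buffer
```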
##### Faster User Defined Functions (UDFs) in PySpark
> The use of Arrow almost completely eliminates the serialisation and deserialisation step, and also allows data to be processed in columnar batches, meaning more efficient vectorised algorithms can be used. ([View Highlight](https://read.readwise.io/read/01grk10edj0bf0r2q0a5pjwdfr))
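A hedged sketch of the pattern this refers to (the session setup and column names are illustrative, not from the article): with Arrow enabled, Spark transfers data to the Python workers in columnar batches, and a pandas UDF operates on whole batches rather than one row at a time.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Arrow-based transfer between the JVM and the Python workers.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000).withColumnRenamed("id", "value")

@pandas_udf("long")
def times_two(values: pd.Series) -> pd.Series:
    # Vectorised: receives a whole Arrow-backed batch as a pandas Series.
    return values * 2

df.select(times_two("value")).show(5)
```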
> Arrow provides translators that are then able to convert this into language-specific in-memory formats like the pandas dataframe ([View Highlight](https://read.readwise.io/read/01grk123gpm4hcd8ztcrkhe324))
> This problem of translation is one that Arrow aims eventually to eliminate altogether: ideally there would be a single in-memory format for dataframes rather than each tool having its own representation. ([View Highlight](https://read.readwise.io/read/01grk12cnxq9wqkhnevywkc51m))
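A minimal example of one such translator, using pyarrow's built-in conversions between the Arrow format and a pandas dataframe (the data is made up):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"name": ["a", "b"], "score": [1.0, 2.5]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow
round_tripped = table.to_pandas()  # Arrow -> pandas
print(table.schema)
```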
##### Writing parquet files
> Why not just persist the data to disk in Arrow format, and thus have a single, cross-language data format that is the same on-disk and in-memory? One of the biggest reasons is that Parquet generally produces smaller data files, which is more desirable if you are IO-bound. This will especially be the case if you are loading data from cloud storage such as AWS S3. ([View Highlight](https://read.readwise.io/read/01grk13txz67m7zyvqsyk4vr3n))
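One way to see this (file names and data are made up; actual sizes depend on the data and compression settings) is to write the same table to Parquet and to the Arrow IPC/Feather format and compare the files on disk:

```python
import os

import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({"id": list(range(n)), "flag": [i % 2 == 0 for i in range(n)]})

pq.write_table(table, "example.parquet")       # encoded + compressed columnar file
feather.write_feather(table, "example.arrow")  # Arrow memory layout written to disk

print(os.path.getsize("example.parquet"), os.path.getsize("example.arrow"))
```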
> The trade-offs for columnar data are different in-memory than on disk. ([View Highlight](https://read.readwise.io/read/01grk17a169brz9atvm0ecbcxp))
> This makes it desirable to have close cooperation in the development of the in-memory and on-disk formats, with predictable and speedy translation between the two representations — and this is what is offered by the Arrow and Parquet formats. ([View Highlight](https://read.readwise.io/read/01grk17hqeqbs0ajq9vzs9medr))
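And the translation in the other direction: a Parquet file reads straight into an Arrow Table, with no intermediate row-based representation (continuing the made-up file name from the sketch above):

```python
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")  # on-disk Parquet -> in-memory Arrow
print(table.num_rows, table.schema)
```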
> A common (cross-language) in-memory representation of dataframes ([View Highlight](https://read.readwise.io/read/01grk193eawvqaje5av0h19fgd))
> These ideas come together in the description of Arrow as an ‘API for data’. The idea is that Arrow provides a cross-language in-memory data format, and an associated query execution language that provide the building blocks for analytics libraries. Like any good API, Arrow provides a performant solution to common problems without the user needing to fully understand the implementation ([View Highlight](https://read.readwise.io/read/01grk19hmqb4pwv2cfnvyzdvnz))
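A small illustration of that "building blocks" idea: the same Arrow table can be handed to Arrow's own compute kernels without the caller knowing how its buffers are laid out (column names and values are made up):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"price": [10.0, 20.0, 5.0], "qty": [1.0, 2.0, 3.0]})

revenue = pc.multiply(table["price"], table["qty"])  # vectorised kernel over whole columns
print(pc.sum(revenue).as_py())                       # 65.0
```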
> This paves the way for a step change in data processing capabilities, with analytics libraries:
> • being parallelised by default
> • applying highly optimised calculations on dataframes
> • no longer needing to work with the constraint that the full dataset must fit into memory ([View Highlight](https://read.readwise.io/read/01grk19rxdde0b3yh4h0pjqg7g))
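A hedged sketch of those three points using pyarrow's dataset API (the directory path is hypothetical): the scan runs in parallel, the filter is pushed down as an optimised compute expression, and data is streamed in record batches so the full dataset never has to fit in memory.

```python
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/parquet_files/", format="parquet")

scanner = dataset.scanner(
    columns=["price"],
    filter=ds.field("qty") > 0,  # applied during the scan, not afterwards in Python
)

total = 0.0
for batch in scanner.to_batches():  # streamed record batches, not one big load
    total += sum(batch.column("price").to_pylist())
print(total)
```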