• Result:
    • Backlog Data Engineering (notion.so)

    • Preliminary results

      • Time, in seconds, to run compute_occupation_snapshots (the inner loop of compute_availability) over all checkouts of January 2019, plotted against the number of cores (for the implementations that support multicore computation).

      • Chart: https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6011253e-c0a1-4ec4-8d15-7650be30c9a9/GetImage(4).png

      • None of the implementations is covered by unit tests, so take these metrics with a grain of salt.

      • Numba can auto-parallelize code, but that feature was not used in these tests.

      • In short,

        • Switching to NumPy shortens computation time by a factor of ~4.

        • Parallelizing with Dask on 16 cores shortens computation by a factor of ~9 to ~15.

        • The most performant solution (Numba + Dask) shortens computation by a factor of ~38.

      • With the most performant solution (Numba + Dask; see the sketch below), computing the occupation snapshots for all 2020 checkouts takes:

        • With 10 cores: 4h 55min.

        • With 15 cores: 3h 55min.
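
      • For reference, below is a minimal sketch of the Numba + Dask pattern behind these timings. The function name, the interval-counting logic, and the chunking strategy are simplified stand-ins, not the actual compute_occupation_snapshots implementation.

```python
import numpy as np
import numba
from dask import compute, delayed


@numba.njit(cache=True)  # nopython mode: accepts only NumPy arrays and scalars
def occupation_counts(starts, ends, snapshot_times):
    # Hypothetical inner loop: for each snapshot time, count checkouts whose
    # [start, end) interval covers it. @njit(parallel=True) with numba.prange
    # could auto-parallelize the outer loop, but that was not used in these tests.
    counts = np.zeros(snapshot_times.shape[0], dtype=np.int64)
    for i in range(snapshot_times.shape[0]):
        t = snapshot_times[i]
        for j in range(starts.shape[0]):
            if starts[j] <= t < ends[j]:
                counts[i] += 1
    return counts


def occupation_counts_parallel(starts, ends, snapshot_times, n_chunks=16):
    # Split the snapshot grid into chunks and let Dask run one jitted task per chunk.
    chunks = np.array_split(snapshot_times, n_chunks)
    tasks = [delayed(occupation_counts)(starts, ends, chunk) for chunk in chunks]
    return np.concatenate(compute(*tasks, scheduler="processes"))
```

      • With dask.distributed, the same delayed tasks can instead be submitted to a LocalCluster whose n_workers matches the core counts reported above.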

    • Lessons Learnt

      • A simple switch to NumPy already offers a dramatic improvement, and converting the Pandas code to NumPy was not particularly time-consuming (see the first sketch below).

      • Numba is quite restrictive about the data types it allows for full optimization (nopython mode), but that option did not offer much improvement over the NumPy solution in this case (see the second sketch below).

      • Find the branch with the experiments here.
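
      • As an illustration of the kind of switch meant above, the sketch below moves a per-snapshot Pandas filter to a single broadcast NumPy comparison. The column names (start, end) and the counting logic are hypothetical, not the project's code.

```python
import numpy as np
import pandas as pd


def snapshot_counts_pandas(checkouts: pd.DataFrame, snapshots: pd.Series) -> pd.Series:
    # Pandas version: one boolean filter per snapshot timestamp (Python-level loop).
    return pd.Series(
        [((checkouts["start"] <= t) & (checkouts["end"] > t)).sum() for t in snapshots],
        index=snapshots,
    )


def snapshot_counts_numpy(checkouts: pd.DataFrame, snapshots: pd.Series) -> pd.Series:
    # NumPy version: pull the columns out once and do a single broadcast comparison.
    starts = checkouts["start"].to_numpy()
    ends = checkouts["end"].to_numpy()
    times = snapshots.to_numpy()[:, None]          # shape (n_snapshots, 1)
    counts = ((starts <= times) & (ends > times)).sum(axis=1)
    return pd.Series(counts, index=snapshots)
```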
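
      • The type restriction mentioned above usually means converting Pandas objects to plain NumPy arrays at the boundary of the jitted function, roughly as sketched below (again with hypothetical column names and a stand-in loop).

```python
import pandas as pd
from numba import njit


@njit  # nopython mode: a DataFrame or Series argument would raise a TypingError
def count_active(starts, ends, t):
    n = 0
    for j in range(starts.shape[0]):
        if starts[j] <= t < ends[j]:
            n += 1
    return n


def count_active_from_df(checkouts: pd.DataFrame, t: pd.Timestamp) -> int:
    # Convert Pandas datetime columns to int64 nanoseconds so the jitted function
    # only ever sees plain NumPy arrays and scalars.
    starts = checkouts["start"].to_numpy().astype("int64")
    ends = checkouts["end"].to_numpy().astype("int64")
    return count_active(starts, ends, t.value)
```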

    • TODO

      • Clean up the code.

      • Can Numba/Cython speed up the loop further? Can the loop be removed entirely in favour of a vectorized approach?

      • Write complete implementation with Numba.

      • Write unit tests (see the pytest sketch at the end of this list).

      • Validate results and intermediate data (Great Expectations, Bulwark, …); see the validation sketch at the end of this list.

      • Other ideas:

        • Offload certain functions to BigQuery (extra ETL steps? queries on the fly?).

        • Use a columnar data format such as Parquet in the ETLs (see the sketch at the end of this list).
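
      • For the unit-test item, a minimal pytest sketch: compare a naive reference implementation against the optimized one on a tiny, hand-checkable input. Both functions here are stand-ins for the real implementations.

```python
import numpy as np
import pytest


def occupation_reference(starts, ends, times):
    # Naive reference: count intervals covering each snapshot time.
    return np.array([sum(s <= t < e for s, e in zip(starts, ends)) for t in times])


def occupation_vectorized(starts, ends, times):
    # Stand-in for the optimized NumPy/Numba implementation under test.
    starts, ends, times = map(np.asarray, (starts, ends, times))
    return ((starts <= times[:, None]) & (ends > times[:, None])).sum(axis=1)


@pytest.mark.parametrize("times", [[0.5, 1.5, 2.5], [0.0, 10.0]])
def test_vectorized_matches_reference(times):
    starts = [0.0, 1.0, 2.0]
    ends = [2.0, 3.0, 2.5]
    np.testing.assert_array_equal(
        occupation_vectorized(starts, ends, times),
        occupation_reference(starts, ends, times),
    )
```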
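
      • For the validation item, a rough sketch of what a check with Great Expectations' legacy pandas-dataset API looks like; the column names, bounds, and example frame are hypothetical.

```python
import great_expectations as ge
import pandas as pd

# Tiny frame standing in for the computed occupation snapshots (hypothetical columns).
snapshots_df = pd.DataFrame({
    "snapshot_time": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00"]),
    "occupied_count": [2, 1],
})

dataset = ge.from_pandas(snapshots_df)  # legacy PandasDataset wrapper
null_check = dataset.expect_column_values_to_not_be_null("snapshot_time")
range_check = dataset.expect_column_values_to_be_between("occupied_count", min_value=0)
print(null_check, range_check)  # each result reports whether the expectation passed
```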
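
      • For the columnar-format idea, switching intermediate ETL outputs to Parquet is essentially a one-line change with pandas (pyarrow or fastparquet must be installed); the file and column names below are illustrative.

```python
import pandas as pd

checkouts = pd.DataFrame({
    "checkout_id": [1, 2],
    "start": pd.to_datetime(["2019-01-01 09:00", "2019-01-01 10:30"]),
    "end": pd.to_datetime(["2019-01-01 11:00", "2019-01-01 12:00"]),
})

# Columnar, compressed, and type-preserving; reloads much faster than CSV
# and supports reading only the columns a given step actually needs.
checkouts.to_parquet("checkouts_2019_01.parquet", engine="pyarrow")
subset = pd.read_parquet("checkouts_2019_01.parquet", columns=["checkout_id", "start"])
```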