• Result:
    • Backlog Data Engineering (notion.so)

    • Preliminary results

      • Time, in seconds, to run compute_occupation_snapshots (the inner loop of compute_availability) over all checkouts of January 2019, plotted against the number of cores (for the implementations that support multicore computation).

      • Chart: https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6011253e-c0a1-4ec4-8d15-7650be30c9a9/GetImage(4).png

      • None of the implementations is covered by unit tests, so take these metrics with a grain of salt.

      • Numba can auto-parallelize code, but that feature was not used in these tests.

      • In short,

        • Switching to NumPy shortens computation time by a factor of ~4.

        • Parallelizing with Dask on 16 cores shortens computation by a factor of ~9 to ~15.

        • The most performant solution (Numba + Dask) shortens computation by a factor of ~38.

      • With the most performant solution (Numba + Dask; see the sketch below), computing the occupation snapshots for all 2020 checkouts takes:

        • With 10 cores: 4h 55min.

        • With 15 cores: 3h 55min.
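
      • For reference, below is a minimal sketch of the Numba + Dask pattern behind these timings. The function name, the interval-counting logic, and the chunking strategy are simplified stand-ins, not the actual compute_occupation_snapshots implementation.

```python
import numpy as np
import numba
from dask import compute, delayed


@numba.njit(cache=True)  # nopython mode: accepts only NumPy arrays and scalars
def occupation_counts(starts, ends, snapshot_times):
    # Hypothetical inner loop: for each snapshot time, count checkouts whose
    # [start, end) interval covers it. @njit(parallel=True) with numba.prange
    # could auto-parallelize the outer loop, but that was not used in these tests.
    counts = np.zeros(snapshot_times.shape[0], dtype=np.int64)
    for i in range(snapshot_times.shape[0]):
        t = snapshot_times[i]
        for j in range(starts.shape[0]):
            if starts[j] <= t < ends[j]:
                counts[i] += 1
    return counts


def occupation_counts_parallel(starts, ends, snapshot_times, n_chunks=16):
    # Split the snapshot grid into chunks and let Dask run one jitted task per chunk.
    chunks = np.array_split(snapshot_times, n_chunks)
    tasks = [delayed(occupation_counts)(starts, ends, chunk) for chunk in chunks]
    return np.concatenate(compute(*tasks, scheduler="processes"))
```

      • With dask.distributed, the same delayed tasks can instead be submitted to a LocalCluster whose n_workers matches the core counts reported above.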

    • Lessons Learnt

      • A simple switch to NumPy already offers a dramatic improvement, and converting the Pandas code to NumPy was not particularly time-consuming (see the first sketch below).

      • Numba is quite restrictive about the data types it allows for full optimization (nopython mode), but that option did not offer much improvement over the NumPy solution in this case (see the second sketch below).

      • Find the branch with the experiments here.
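
      • As an illustration of the kind of switch meant above, the sketch below moves a per-snapshot Pandas filter to a single broadcast NumPy comparison. The column names (start, end) and the counting logic are hypothetical, not the project's code.

```python
import numpy as np
import pandas as pd


def snapshot_counts_pandas(checkouts: pd.DataFrame, snapshots: pd.Series) -> pd.Series:
    # Pandas version: one boolean filter per snapshot timestamp (Python-level loop).
    return pd.Series(
        [((checkouts["start"] <= t) & (checkouts["end"] > t)).sum() for t in snapshots],
        index=snapshots,
    )


def snapshot_counts_numpy(checkouts: pd.DataFrame, snapshots: pd.Series) -> pd.Series:
    # NumPy version: pull the columns out once and do a single broadcast comparison.
    starts = checkouts["start"].to_numpy()
    ends = checkouts["end"].to_numpy()
    times = snapshots.to_numpy()[:, None]          # shape (n_snapshots, 1)
    counts = ((starts <= times) & (ends > times)).sum(axis=1)
    return pd.Series(counts, index=snapshots)
```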
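
      • The type restriction mentioned above usually means converting Pandas objects to plain NumPy arrays at the boundary of the jitted function, roughly as sketched below (again with hypothetical column names and a stand-in loop).

```python
import pandas as pd
from numba import njit


@njit  # nopython mode: a DataFrame or Series argument would raise a TypingError
def count_active(starts, ends, t):
    n = 0
    for j in range(starts.shape[0]):
        if starts[j] <= t < ends[j]:
            n += 1
    return n


def count_active_from_df(checkouts: pd.DataFrame, t: pd.Timestamp) -> int:
    # Convert Pandas datetime columns to int64 nanoseconds so the jitted function
    # only ever sees plain NumPy arrays and scalars.
    starts = checkouts["start"].to_numpy().astype("int64")
    ends = checkouts["end"].to_numpy().astype("int64")
    return count_active(starts, ends, t.value)
```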

    • TODO

      • Clean up the code.

      • Can Numba/Cython speed up the loop further? Can the loop be removed entirely in favour of a vectorized approach?

      • Write complete implementation with Numba.

      • Write unit tests (see the pytest sketch at the end of this list).

      • Validate results and intermediate data (Great Expectations, Bulwark, …); see the validation sketch at the end of this list.

      • Other ideas:

        • Offload certain functions to BigQuery (extra ETL steps? queries on the fly?).

        • Use a columnar data format such as Parquet in the ETLs (see the sketch at the end of this list).
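
      • For the unit-test item, a minimal pytest sketch: compare a naive reference implementation against the optimized one on a tiny, hand-checkable input. Both functions here are stand-ins for the real implementations.

```python
import numpy as np
import pytest


def occupation_reference(starts, ends, times):
    # Naive reference: count intervals covering each snapshot time.
    return np.array([sum(s <= t < e for s, e in zip(starts, ends)) for t in times])


def occupation_vectorized(starts, ends, times):
    # Stand-in for the optimized NumPy/Numba implementation under test.
    starts, ends, times = map(np.asarray, (starts, ends, times))
    return ((starts <= times[:, None]) & (ends > times[:, None])).sum(axis=1)


@pytest.mark.parametrize("times", [[0.5, 1.5, 2.5], [0.0, 10.0]])
def test_vectorized_matches_reference(times):
    starts = [0.0, 1.0, 2.0]
    ends = [2.0, 3.0, 2.5]
    np.testing.assert_array_equal(
        occupation_vectorized(starts, ends, times),
        occupation_reference(starts, ends, times),
    )
```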
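
      • For the validation item, a rough sketch of what a check with Great Expectations' legacy pandas-dataset API looks like; the column names, bounds, and example frame are hypothetical.

```python
import great_expectations as ge
import pandas as pd

# Tiny frame standing in for the computed occupation snapshots (hypothetical columns).
snapshots_df = pd.DataFrame({
    "snapshot_time": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00"]),
    "occupied_count": [2, 1],
})

dataset = ge.from_pandas(snapshots_df)  # legacy PandasDataset wrapper
null_check = dataset.expect_column_values_to_not_be_null("snapshot_time")
range_check = dataset.expect_column_values_to_be_between("occupied_count", min_value=0)
print(null_check, range_check)  # each result reports whether the expectation passed
```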
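
      • For the columnar-format idea, switching intermediate ETL outputs to Parquet is essentially a one-line change with pandas (pyarrow or fastparquet must be installed); the file and column names below are illustrative.

```python
import pandas as pd

checkouts = pd.DataFrame({
    "checkout_id": [1, 2],
    "start": pd.to_datetime(["2019-01-01 09:00", "2019-01-01 10:30"]),
    "end": pd.to_datetime(["2019-01-01 11:00", "2019-01-01 12:00"]),
})

# Columnar, compressed, and type-preserving; reloads much faster than CSV
# and supports reading only the columns a given step actually needs.
checkouts.to_parquet("checkouts_2019_01.parquet", engine="pyarrow")
subset = pd.read_parquet("checkouts_2019_01.parquet", columns=["checkout_id", "start"])
```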