Result:

Preliminary results

Time, in seconds, to run compute_occupation_snapshots (the inner loop of compute_availability) for all checkouts of January 2019, against the number of cores (for the implementations that support multicore computation).

No implementation is covered by unit tests, so take these metrics with a grain of salt.

Numba can auto-parallelize code, but that feature was not used in these tests (a short illustration follows).
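For context, this is roughly what Numba's auto-parallelization looks like: `parallel=True` plus an explicit `prange` loop. The kernel below is a made-up stand-in (counting active checkouts per snapshot time), not the project's actual inner loop:

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)  # ask Numba to parallelize the prange loop across cores
def count_active(starts, ends, snapshot_times):
    """Toy kernel: number of checkouts active at each snapshot time."""
    counts = np.zeros(snapshot_times.shape[0], dtype=np.int64)
    for i in prange(snapshot_times.shape[0]):
        t = snapshot_times[i]
        c = 0
        for j in range(starts.shape[0]):
            if starts[j] <= t and t < ends[j]:
                c += 1
        counts[i] = c
    return counts


# Tiny smoke test with epoch-second floats.
starts = np.array([0.0, 10.0, 20.0])
ends = np.array([15.0, 30.0, 25.0])
print(count_active(starts, ends, np.linspace(0.0, 30.0, 7)))
```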
In short:

- Switching to NumPy shortens computation time by a factor of ~4.
- Parallelizing with Dask on 16 cores shortens computation time by a factor of ~9 to ~15.
- The most performant solution (Numba + Dask) shortens computation time by a factor of ~38 (a sketch of this pattern follows the list).
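A rough sketch of the Numba + Dask pattern behind the last two numbers, under the assumption that checkout and return times are already epoch-second floats and that the checkouts carry a precomputed "day" column to chunk on; none of this is the project's actual code:

```python
import dask
import numpy as np
import pandas as pd
from numba import njit


@njit  # nopython-compiled inner loop over plain float64 arrays
def occupation_kernel(starts, ends, snapshot_times):
    counts = np.zeros(snapshot_times.shape[0], dtype=np.int64)
    for i in range(snapshot_times.shape[0]):
        t = snapshot_times[i]
        c = 0
        for j in range(starts.shape[0]):
            if starts[j] <= t and t < ends[j]:
                c += 1
        counts[i] = c
    return counts


def chunk_snapshots(chunk: pd.DataFrame, snapshot_times: np.ndarray) -> np.ndarray:
    # Numba cannot take DataFrames, so hand it plain NumPy arrays.
    return occupation_kernel(
        chunk["checkout_at"].to_numpy(dtype="float64"),
        chunk["return_at"].to_numpy(dtype="float64"),
        snapshot_times,
    )


def parallel_snapshots(checkouts: pd.DataFrame, snapshot_times: np.ndarray, n_workers: int) -> np.ndarray:
    # One delayed task per day of checkouts; Dask runs them on n_workers processes.
    tasks = [
        dask.delayed(chunk_snapshots)(chunk, snapshot_times)
        for _, chunk in checkouts.groupby("day")
    ]
    per_chunk = dask.compute(*tasks, scheduler="processes", num_workers=n_workers)
    # Counts are additive over disjoint sets of checkouts, so the chunks can simply be summed.
    return np.sum(per_chunk, axis=0)
```

Note that in this sketch the parallelism comes from Dask at the chunk level, so the kernel itself is compiled without `parallel=True`.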
With the most performant solution (Numba + Dask), computing the occupation snapshots for all 2020 checkouts takes:

- With 10 cores: 4 h 55 min.
- With 15 cores: 3 h 55 min.

Lessons Learnt

- A simple switch to NumPy already offers a dramatic improvement, and porting the Pandas code to NumPy was not particularly time-consuming (see the first sketch after this list).
- Numba is quite restrictive in the data types it allows for full optimization (nopython mode), but that mode did not offer much improvement over the NumPy solution in this case (see the second sketch after this list).
- Find the branch with the experiments here.
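First sketch: the kind of change the NumPy switch refers to, shown on a made-up helper rather than the project's real functions; the row-wise Pandas loop becomes one vectorized comparison over plain arrays.

```python
import numpy as np
import pandas as pd


def active_at_pandas(checkouts: pd.DataFrame, t: pd.Timestamp) -> int:
    # Before: Python-level loop over DataFrame rows (slow).
    count = 0
    for _, row in checkouts.iterrows():
        if row["checkout_at"] <= t < row["return_at"]:
            count += 1
    return count


def active_at_numpy(starts: np.ndarray, ends: np.ndarray, t: np.datetime64) -> int:
    # After: one vectorized comparison over contiguous arrays.
    return int(np.count_nonzero((starts <= t) & (t < ends)))


# Pull the columns out of Pandas once, then reuse the plain arrays for every
# snapshot time (column names are made up for the example).
checkouts = pd.DataFrame(
    {
        "checkout_at": pd.to_datetime(["2019-01-01 08:00", "2019-01-01 09:00"]),
        "return_at": pd.to_datetime(["2019-01-01 10:00", "2019-01-01 09:30"]),
    }
)
t = np.datetime64("2019-01-01T09:15")
starts = checkouts["checkout_at"].to_numpy()
ends = checkouts["return_at"].to_numpy()
print(active_at_pandas(checkouts, pd.Timestamp(t)), active_at_numpy(starts, ends, t))
```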
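Second sketch: what the type restrictions look like in practice. nopython mode accepts plain numeric NumPy arrays but not DataFrames or Python objects, so datetime columns end up converted to int64 nanoseconds before the call (again a hypothetical kernel, not the project's):

```python
import numpy as np
import pandas as pd
from numba import njit


@njit  # njit compiles in nopython mode: only a restricted set of types is accepted
def count_returned_before(ends_ns, t_ns):
    # Hypothetical kernel: how many checkouts were returned before time t,
    # working purely on int64 nanosecond timestamps.
    c = 0
    for j in range(ends_ns.shape[0]):
        if ends_ns[j] <= t_ns:
            c += 1
    return c


returns = pd.to_datetime(["2019-01-01 10:00", "2019-01-02 18:30"])
# Convert the datetime column to plain int64 nanoseconds for the jitted function.
ends_ns = returns.to_numpy(dtype="datetime64[ns]").astype(np.int64)
t_ns = np.int64(pd.Timestamp("2019-01-02").value)
print(count_returned_before(ends_ns, t_ns))
```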
TODO

- Clean up the code.
- Can Numba/Cython be used to speed up the loop? / Can we get rid of the loop and vectorize it somehow?
- Write a complete implementation with Numba.
- Write unit tests.
- Validate results and intermediate data (Great Expectations, Bulwark, …).
- Other ideas (sketches after this list):
  - Offload certain functions to BigQuery (extra ETL steps? on-the-fly queries?).
  - Use a columnar data format such as Parquet in the ETLs.
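For the BigQuery idea, offloading could mean something like the sketch below, using the google-cloud-bigquery client; the project, dataset, table and column names are placeholders:

```python
from google.cloud import bigquery

# Push a heavy aggregation into BigQuery instead of doing it in Pandas.
client = bigquery.Client()
query = """
    SELECT DATE(checkout_at) AS day, COUNT(*) AS n_checkouts
    FROM `my-project.mobility.checkouts`
    WHERE checkout_at BETWEEN '2019-01-01' AND '2019-01-31'
    GROUP BY day
    ORDER BY day
"""
daily_counts = client.query(query).to_dataframe()
print(daily_counts.head())
```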
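For the Parquet idea, the change sits mostly at the I/O boundaries of each ETL step (file and column names are placeholders; pandas needs pyarrow or fastparquet installed for this):

```python
import pandas as pd

# Write the intermediate table once in columnar Parquet instead of CSV.
checkouts = pd.read_csv("checkouts_2019.csv", parse_dates=["checkout_at", "return_at"])
checkouts.to_parquet("checkouts_2019.parquet", index=False)

# Downstream steps read back only the columns they need, which is much faster
# than re-parsing the whole CSV.
subset = pd.read_parquet("checkouts_2019.parquet", columns=["checkout_at", "return_at"])
```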