Update - We're still investigating certain workload reastarts. There seems to be certain workloads that can trigger coditions on the compute nodes which then leads to the workloads on the compute node being rescheduled.
Feb 13, 2024 - 09:39 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Jan 24, 2024 - 22:51 UTC
Identified - Following up from the earlier Incident regarding the intermittent workload restarts: We'll run an additional maintenance window after 21:00 UTC today to move workloads onto a new set of compute nodes. This action should stabilize the intermittent workload restarts we are seeing.
Jan 24, 2024 - 16:58 UTC
Resolved -
The logging system is fully operational again.
Mar 12, 07:17 UTC
Investigating -
After rebalancing the data some parts of the logging system failed and we are working on making it fully operational again.
Mar 11, 13:50 UTC
Monitoring -
We have rolled out some improvements and the logging system is currently rebalancing data.
Mar 8, 10:36 UTC
Investigating -
We have seen an increase in timeouts for log queries and are working on identifying the root cause of this.
Mar 8, 08:23 UTC