tl;dr: We’re rolling out a new data ingestion platform that addresses the root causes of the incidents we had over the last year. The new platform has way more robust monitoring and alerting, and significantly improved latency for seeing your data processed (hours down to minutes). 50% of production traffic is already on the new platform.
Hey folks, Kamil from the Eng team here. Apologies for the data loss. I’ve been working on refactoring the whole ingestion pipeline for the last few months; lack of stability and maintenance overhead are the main factors behind the refactor. You can find out what happened and how we’re addressing it below.
@czecko, since you’re asking concrete questions and you’re one of the most affected users, let me answer your questions first:
“Great, but how about informing all users about it?”
It was a silent failure caused by an unpredictable deployment condition across three systems. Our data integrity checks couldn’t have caught it. We’re adjusting those data integrity checks to be more robust and improving oncall monitoring and alerting for conditions like that. Details below.
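To give a flavour of what “more robust integrity checks” means, here’s a minimal sketch. It’s illustrative only: the names and the count-based comparison are simplified stand-ins, not our actual pipeline code.

```python
# Illustrative only: a per-batch integrity check that compares what was accepted
# at upload time with what actually landed in processed storage. A silent failure
# is exactly the case where the batch "succeeded" but the counts don't match.
from dataclasses import dataclass


@dataclass
class BatchCounts:
    batch_id: str
    uploaded: int   # sequences accepted at ingestion time
    processed: int  # sequences visible in processed storage


def check_batch_integrity(batch: BatchCounts) -> list[str]:
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    if batch.processed < batch.uploaded:
        violations.append(
            f"{batch.batch_id}: {batch.uploaded - batch.processed} sequences missing"
        )
    elif batch.processed > batch.uploaded:
        violations.append(f"{batch.batch_id}: duplicate or unexpected records")
    return violations


if __name__ == "__main__":
    # A batch that "looked fine" operationally but lost data:
    print(check_batch_integrity(BatchCounts("2023-12-20T14:00", uploaded=1200, processed=870)))
```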
“What if I upload sequences today and they too are lost?”
Oncall is now alerted on failures aggregated on a per-minute basis. The new pipeline (see below) fails hard and fast, and makes it explicit when even a single sequence doesn’t get processed in time. Alerting and monitoring are more robust, both on operational metrics and on data integrity.
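Roughly, the alerting logic looks like the sketch below. This is a simplified illustration with made-up names, not our real alerting config: the point is that failures are bucketed per minute and a single late sequence is enough to page.

```python
# Simplified sketch of per-minute failure aggregation for oncall alerting.
# Event shape, names and the paging rule are illustrative, not the real config.
from collections import Counter
from datetime import datetime


def failures_per_minute(events: list[dict]) -> Counter:
    """Bucket non-processed events by the minute they occurred in."""
    buckets: Counter = Counter()
    for event in events:
        if event["status"] != "processed":
            minute = datetime.fromisoformat(event["ts"]).strftime("%Y-%m-%d %H:%M")
            buckets[minute] += 1
    return buckets


def should_page(buckets: Counter, threshold: int = 1) -> bool:
    """Page oncall if any minute bucket hits the threshold (default: one failure)."""
    return any(count >= threshold for count in buckets.values())


if __name__ == "__main__":
    events = [
        {"ts": "2024-02-20T10:00:12", "status": "processed"},
        {"ts": "2024-02-20T10:00:45", "status": "timed_out"},  # one late sequence is enough
    ]
    print(should_page(failures_per_minute(events)))  # True
```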
“I’m interested in what you were doing at the time that half of my sequences uploaded that day were lost!”
See below.
Incident report
On 20.12.2023 we had pipeline churn caused by a bad deployment. This caused a processing delay for 6 hourly batches of sequences. Upon noticing, oncall fixed the deployment with a patch and scheduled a backfill for 75% of the data; the remaining 25% was scheduled for a backfill on the new platform (details below). Oncall checked on the backfills a few hours later and they looked correct. However, there was a silent failure for most of the data in the 75% backfill, and our data quality checks didn’t detect it (hence “silent failure”). We have a strict TTL on the data uploaded to our platform, so oncall has a limited time window (days) to act on failures like this. Fast forward two months: we can’t recover that data because we’re past the TTL.
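To make the TTL constraint concrete, here’s a rough illustration. The TTL value below is hypothetical (the real window is on the order of days), but the arithmetic is why a failure noticed two months later is unrecoverable.

```python
# Rough illustration of the recovery window; the TTL value here is hypothetical.
from datetime import date, timedelta

RAW_DATA_TTL = timedelta(days=14)  # hypothetical; the real TTL is not stated here


def backfill_deadline(upload_day: date) -> date:
    """Last day on which the originally uploaded data can still be re-processed."""
    return upload_day + RAW_DATA_TTL


if __name__ == "__main__":
    upload_day = date(2023, 12, 20)
    print("recover by:", backfill_deadline(upload_day))
    # Two months later the raw data is already gone:
    print("recoverable on 2024-02-20?", date(2024, 2, 20) <= backfill_deadline(upload_day))
```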
Moving forward
As mentioned above, on 20.12.2023 25% of the data from that day was backfilled on the new platform as a one-time event. For the last few weeks, we’ve been routing 50% of production traffic (across all users) to this new platform. The new platform not only significantly improves the latency for seeing your sequences processed (from hours to minutes), but also has more granular alerting and monitoring in place. You might’ve noticed that when you upload sequences in batches, around half of them are processed way faster than the other half.
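For the curious, the 50/50 split can be pictured like the sketch below. This is an illustration of deterministic hash-based routing, not a description of our actual router; the names and the per-batch granularity are assumptions.

```python
# Illustrative hash-based 50% rollout: each batch is deterministically routed to
# the old or the new platform. Names and per-batch granularity are assumptions.
import hashlib


def route_to_new_platform(batch_id: str, rollout_percent: int = 50) -> bool:
    """Map a batch to the new platform for roughly rollout_percent of batches."""
    digest = hashlib.sha256(batch_id.encode()).digest()
    bucket = (digest[0] << 8 | digest[1]) % 100  # stable bucket in 0..99
    return bucket < rollout_percent


if __name__ == "__main__":
    for i in range(10):
        batch = f"batch-{i}"
        print(batch, "->", "new" if route_to_new_platform(batch) else "old")
```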
We know data loss is super frustrating. Thank you for your patience while we migrate to a more stable system with better reporting, so we can detect and prevent these kinds of issues going forward.