Upload data loss on September 21 & 22

Hi folks, digging deeper on this with the eng team, it looks like we lost a substantial amount of uploads from December 20th. If you still have the original files you uploaded on that day, please go ahead and re-upload them.

We are working on modernizing this pipeline to both be more reliable and provide alerting for failures. Thank you for your patience, we know this is frustrating and are working to improve it.

After almost two months
:rofl: :rofl:.
Great, but how about informing all users about it.
What if I upload sequences today and they too are lost?

I’m interested in what you were doing at the time that half of my sequences uploaded that day were lost!

tl;dr: We’re rolling out a new data ingestion platform that addresses the root causes of the incidents from the last year. The new platform has much more robust monitoring and alerting, and significantly improved latency for seeing your data processed (hours down to minutes). 50% of production traffic is already on the new platform.

Hey folks, Kamil from the Eng team here. Apologies for the data loss. I’ve been working on refactoring the ingestion pipelines for the last few months; lack of stability and maintenance overhead are the main factors driving the refactor. You can find out what happened and how we’re addressing it below.

@czecko, since you’re asking concrete questions and you’re one of the most affected users, let me answer your questions first:

Great, but how about informing all users about it.

It was a silent failure caused by an unpredictable deployment condition across three systems, and our data integrity checks couldn’t have caught it. We’re adjusting these data integrity checks to be more robust and improving on-call monitoring and alerting for conditions like that. Details below.
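
To make “silent failure” a bit more concrete, here’s a rough sketch of the kind of count-based check we’re moving towards, where a batch only passes if the processed count matches the uploaded count. The names (`Batch`, `uploaded_count`, `processed_count`) are made up for illustration, not our actual code.

```python
from dataclasses import dataclass

# Rough illustration only: a count-based integrity check. A "silent failure"
# is a batch whose jobs finished without errors, but whose processed count
# no longer matches what was uploaded. All names here are made up.

@dataclass
class Batch:
    batch_id: str
    uploaded_count: int   # sequences accepted at upload time
    processed_count: int  # sequences actually visible after processing

def batch_is_healthy(batch: Batch, tolerance: float = 0.0) -> bool:
    """Pass only if (almost) every uploaded sequence was processed."""
    if batch.uploaded_count == 0:
        return True
    missing = batch.uploaded_count - batch.processed_count
    return missing <= batch.uploaded_count * tolerance

# A status-only check would have looked fine for a batch like this one,
# while a count-based check flags it immediately:
assert not batch_is_healthy(Batch("2023-12-20T14", 1000, 250))
```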

What if I upload sequences today and they too are lost?

On-call is now alerted on failures aggregated on a per-minute basis. The new pipeline (see below) fails hard and fast, and makes it explicit when even a single sequence didn’t get processed in time. Alerting and monitoring are more robust, both for operational metrics and for data integrity.
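
For the curious, a rough sketch of what per-minute failure aggregation looks like; again, the names (`page_oncall`, etc.) are illustrative, not our real alerting stack.

```python
from collections import Counter
from datetime import datetime

# Illustrative only: every sequence that fails processing emits an event,
# events are bucketed per minute, and any non-empty bucket pages on-call.

def minute_bucket(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%dT%H:%M")

def failures_per_minute(events: list[tuple[datetime, str]]) -> Counter:
    """Count failed sequence IDs per minute bucket."""
    return Counter(minute_bucket(ts) for ts, _sequence_id in events)

def page_oncall(bucket: str, count: int) -> None:
    print(f"ALERT: {count} sequence(s) failed processing in {bucket}")

def run_alerting(events: list[tuple[datetime, str]]) -> None:
    # Fail hard and fast: a single failed sequence in a minute is enough
    # to page, instead of being silently retried or dropped.
    for bucket, count in sorted(failures_per_minute(events).items()):
        page_oncall(bucket, count)

run_alerting([(datetime(2024, 2, 20, 12, 0, 30), "seq-123")])
```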

I’m interested in what you were doing at the time that half of my sequences uploaded that day were lost!

See below.

Incident report

On 20.12.2023 we had pipeline churn caused by a bad deployment. This delayed processing for 6 hourly batches of sequences. Upon noticing, on-call fixed the deployment with a patch and scheduled a backfill for 75% of the data; the remaining 25% was scheduled for a backfill on the new platform (details below). On-call checked the backfills a few hours into processing and they looked correct. However, there was a silent failure for most of the data in the 75% backfill, and our data quality checks didn’t detect it (hence “silent” failure). We have a strict TTL on data uploaded to our platform, so on-call has a limited time window (days) to act on failures like this. Fast forward two months, and we can’t recover that data because we’re past the TTL.
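
To spell out why two months is too late: raw uploads only live for a fixed retention window, so backfills have to be scheduled and verified before it closes. A simplified sketch of that constraint (the 7-day TTL below is an assumption for illustration, not our exact retention value):

```python
from datetime import datetime, timedelta

# Simplified illustration of the TTL constraint. The 7-day value is assumed
# for the example; the point is that the window is days, not months.
RAW_UPLOAD_TTL = timedelta(days=7)

def can_backfill(upload_time: datetime, now: datetime) -> bool:
    """A backfill is only possible while the raw upload is still retained."""
    return now - upload_time < RAW_UPLOAD_TTL

incident_day = datetime(2023, 12, 20)
assert can_backfill(incident_day, datetime(2023, 12, 22))     # days later: possible
assert not can_backfill(incident_day, datetime(2024, 2, 20))  # months later: gone
```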

Moving forward

I mentioned above that on 20.12.2023, 25% of the data from that day was backfilled on the new platform as a one-time event. For the last few weeks, we’ve been routing 50% of production traffic (across all users) to this new platform. The new platform not only significantly improves the latency to see your sequences processed (hours to minutes), but also has more granular alerting and monitoring in place. You might’ve noticed that when you upload sequences in batches, around half of them are processed much faster than the other half.
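
If you’re wondering why roughly half of a batch shows up faster, the 50/50 split is along the lines of a deterministic hash-based split on a stable key, so a given sequence always lands on the same pipeline. A simplified sketch (the routing key and mechanism here are illustrative, not the exact implementation):

```python
import hashlib

# Illustrative only: deterministic routing of uploads between the old and
# new ingestion platforms, based on a stable key such as the sequence ID.
NEW_PLATFORM_SHARE = 0.5  # current rollout percentage

def routed_to_new_platform(sequence_id: str, share: float = NEW_PLATFORM_SHARE) -> bool:
    digest = hashlib.sha256(sequence_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return bucket < share

batch = [f"seq-{i}" for i in range(10)]
fast = [s for s in batch if routed_to_new_platform(s)]      # new platform, minutes
slow = [s for s in batch if not routed_to_new_platform(s)]  # old platform, hours
print(len(fast), len(slow))  # roughly half and half on a larger batch
```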

We know data loss is super frustrating, and we thank you for your patience while we migrate to a more stable system with better reporting, so we can detect and prevent these types of issues going forward.


@knikel Thank you for your hard work. Stability and resiliency improvements are always welcome. Silent failures are really ugly. Anyway, you learn, fix, and move on.

@boris You did post on the forum rather promptly. However, maybe a blog post or a notification :e-mail: email to all users would have been a better response? Although I am also aware that mass emailing can lead to flooding the support team with hollow user inquiries, so email may not be the best communication channel for such events; a blog post would probably have been advisable. Another, perhaps the best, option would have been a message in the feed, because the feed is something most contributors read more or less on every login. Such a message in the feed would need to be sticky, however (always at the top for the duration of the event).


Great suggestions, thank you as always @GITNE - we will consider these and aim to do a better job of informing folks going forward (and of course preventing the issues to begin with).


I have 28 sequences from the same trip stuck on “Ingesting…” since December 20 with no thumbnails. Would this be the same problem?..

Hey @carryteaz, I’m not sure what your username is, but the dates seem to overlap!