[HN Gopher] Ask HN: Is there a data loss bug lurking in MS365 ba...
___________________________________________________________________
 
Ask HN: Is there a data loss bug lurking in MS365 backup solutions?
 
It sounds crazy, and maybe I'm just doing something dumb, but I've
seen a similar issue in two different MS365 backup products this
year. I can't reproduce it reliably, but _feel_ like there could be
a serious issue, even though I can't prove it.

My issue is specific to OneDrive. When I reconcile a backup set
against live data, files are missing. I've had it happen with both
Veeam Backup for MS365 and Synology Active Backup for MS365. Neither
system reports any issues when backups run. I don't know the cause
and I can't reproduce it, but it happens consistently for at least
one of my tenants and seems to get worse over time. I've seen the
issue on more than one tenant, so I don't think it's anything tenant
specific and the only (untenable) solution I've come up with is to
restart the backup from scratch.

The tenant that has the most issues is a business with about 200k
files. The business owner owns the files and everyone else has
access to a few shared folders near the top of the hierarchy. They
have about 350GB of data which ends up being about 1TB of quota
after versioning.

I originally ran into the issue with Veeam by randomly spot checking
1-2 files every once in a while and running into missing data by
random chance. That made me realize I needed to do some kind of bulk
reconciliation on a regular basis. I gave up on Veeam because they
append the OneDrive file version to all restored files and it makes
it difficult to reconcile. For example, locally restored files end
up with ' (ver 2)' or similar appended to the file name.

I switched to the Synology system because it's ideal to reconcile
since an up-to-date backup set can be shared via SMB. That makes it
possible to have an up-to-date OneDrive sync and a mapped drive to
an up-to-date backup set on the same machine. After that, it's a
matter of comparing two folders as long as care is taken to get a
consistent point-in-time for both sets of data.

The only noteworthy thing that I _think_ plays a part is how
frequently the tenant reorganizes their data. They're always
renaming and moving files and folders to keep things organized. I'd
frame it as being frequent, but not unreasonable. The reason I think
this is noteworthy is that in cases where I'm able to track the
life-cycle of missing files, they seem to "disappear" after being
impacted by a directory rename or move operation.

I can't engage with support for this particular tenant because the
data is supporting documentation for government work. I can't even
share examples or screenshots with the file names AFAIK.

I've seen people complaining about similar issues, but the
complaints are from years ago [1]. This [2] caught my eye.

> The root cause for the missing data was due to incorrect
representation of the changes from the SharePoint API side. Veeam
RnD team performed an investigation and found that sometimes the
SharePoint API mechanism of tracking changes did not track the
changes inside the Child files or Folders inside the Site`s list.

Does anyone here reconcile their OneDrive backups well enough to say
you're confident you can reliably restore your data with 100%
consistency?

Is it possible there's a change tracking bug lurking in the
SharePoint API? I don't know anything about how it works, so any
insight would be useful. For example, would the OneDrive client and
backup clients use the same change tracking? That seems unlikely
based on what I'm seeing, but, again, I have no idea how it actually
works.

1. https://forums.veeam.com/veeam-backup-for-microsoft-365-f47/...
2. https://forums.veeam.com/veeam-backup-for-microsoft-365-f47/...
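
The folder comparison described in the post (a local OneDrive sync
on one side, an SMB-mapped backup set on the other) could be sketched
roughly as below. This is only an illustration, not the poster's
actual tooling: the function names and the report keys are invented
here, and it assumes both trees are fully downloaded and captured at
about the same point in time. It compares relative paths and file
sizes only, so it would not catch same-size content corruption.

```python
import os
from pathlib import Path


def list_files(root):
    """Return {relative_posix_path: size_in_bytes} for every file under root."""
    root = Path(root)
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            files[full.relative_to(root).as_posix()] = full.stat().st_size
    return files


def reconcile(live_root, backup_root):
    """Compare two directory trees by relative path and file size.

    Hypothetical helper: the key names in the returned dict are made up
    for this sketch.
    """
    live = list_files(live_root)
    backup = list_files(backup_root)
    common = set(live) & set(backup)
    return {
        # Present in the live data but absent from the backup set.
        "missing_from_backup": sorted(set(live) - set(backup)),
        # Present only in the backup (may be legitimate, e.g. deleted files).
        "extra_in_backup": sorted(set(backup) - set(live)),
        # Same relative path on both sides, different size.
        "size_mismatch": sorted(p for p in common if live[p] != backup[p]),
    }


# Example call with placeholder paths (assumptions, adjust to your setup):
# report = reconcile(r"C:\Users\me\OneDrive - Tenant", r"Z:\backup\latest")
# print(len(report["missing_from_backup"]), "files missing from backup")
```

Files listed under "missing_from_backup" are the ones worth tracing
back to a rename or move, per the observation above.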
 
Author : ryan87
Score  : 32 points
Date   : 2023-09-06 18:08 UTC (1 hour ago)
 
| mschuster91 wrote:
| > I can't engage with support for this particular tenant because
| the data is supporting documentation for government work. I can't
| even share examples or screenshots with the file names AFAIK.
| 
| Your purchasing department should have access to their Microsoft-
| side account representative to get you access to support without
| breaching legal requirements.
 
  | jnsaff2 wrote:
  | I've been at a few companies where we had dedicated
  | account managers on MS side. One was even so big that there
  | were MS engineers sitting with our teams.
  | 
  | Questions of actual substance were rarely answered
  | satisfactorily, sometimes weeks of chasing got nowhere.
  | 
  | With this context I would not hold my breath for them to come
  | back with anything useful.
  | 
  | Also, having seen the kinds of untested garbage they deploy as
  | services, I would not be surprised at all if they silently lost
  | data. I'm not even sure they would ever notice that they had
  | lost it.
  | 
  | For anyone using their storage as backups, I recommend having
  | an inventory (with hashes) and comparing that to what's in MS
  | from time to time. At least you know you are not going crazy,
  | even if MS never trusts your proof about their incompetence.
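
The inventory-with-hashes idea above could look something like the
following sketch. Nothing here comes from the commenter: the function
names, the choice of SHA-256, and the JSON file format are all
assumptions made for illustration. It also flags files whose content
reappears under a new path, since moves and renames come up elsewhere
in this thread.

```python
import hashlib
import json
import os
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root):
    """Return {relative_posix_path: sha256_hex} for every file under root."""
    root = Path(root)
    inventory = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            inventory[full.relative_to(root).as_posix()] = sha256_of(full)
    return inventory


def save_inventory(inventory, path):
    """Persist an inventory as JSON (format chosen arbitrarily for this sketch)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(inventory, fh, indent=2, sort_keys=True)


def load_inventory(path):
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def compare_inventories(old, new):
    """Classify every path in `old` as unchanged, changed, moved, or missing."""
    # Index newly appeared paths by hash so moved/renamed files can be spotted.
    added_by_hash = {}
    for path, digest in new.items():
        if path not in old:
            added_by_hash.setdefault(digest, []).append(path)

    missing, moved, changed = [], [], []
    for path, digest in old.items():
        if path in new:
            if new[path] != digest:
                changed.append(path)
        elif digest in added_by_hash:
            # Same content now lives under one or more new paths.
            moved.append((path, sorted(added_by_hash[digest])))
        else:
            missing.append(path)
    return {
        "missing": sorted(missing),
        "moved": sorted(moved),
        "changed": sorted(changed),
    }
```

Running build_inventory() periodically and comparing against the last
saved copy at least gives you dated evidence of when a file vanished.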
 
    | jnsaff2 wrote:
    | Hell, GitHub has had a lot of outages in the last few weeks
    | (webhooks for the most part). I would not be surprised if
    | they lost GitHub data... which I guess would be detected by
    | git eventually, but hey, what's gone is gone.
 
| Borg3 wrote:
| Holy moly... Reading this made me smile. No offence. I'm super
| happy I do NOT need to deal with that Microsoft stuff. So many
| moving parts, so many places where it can silently fail. Scary.
| 
| Sadly, I cannot personally recommend anything useful to you. Stuff
| just drifted in a completely wrong direction imho. Linus showed how
| that stuff could be handled by introducing Git. I went in that
| direction myself, coding a simple DVFS repo manager to handle all
| my mutable files. But for 99% of people working with computers
| today it's too hard to handle. Heh, we are going backward I think.
 
| avannatta wrote:
| That's terrifying. I'm going to spot check our N-able backups
| more closely.
 
| prmph wrote:
| I wouldn't trust any MS cloud product to not lose my data.
| 
| A year or two ago I lost data backing up my OneDrive files to my
| local HDD. I was told my backup had succeeded, and so I deleted
| the files on OneDrive. Later, when I tried to extract some large
| files from the backup, I saw that they had been replaced with
| some text files, containing a message that the download of those
| files failed.
| 
| What the heck... Was I expected to extract the archive and go
| through all the files one-by-one (there are thousands of them) to
| check if every file was properly backed up?
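
Checking thousands of extracted files by hand is not realistic, but a
rough automated pass is possible. The sketch below is a heuristic
only: the thread does not give the actual wording of the "download
failed" placeholder, so the marker phrases, the size cutoff, and the
function name are all guesses that would need tuning against a real
placeholder file.

```python
import os
from pathlib import Path

# Hypothetical marker phrases (lowercase); the real placeholder text is unknown.
DEFAULT_MARKERS = ("failed", "could not be downloaded")


def find_suspect_placeholders(root, markers=DEFAULT_MARKERS, max_size=4096):
    """List small files that decode as text and contain a failure marker.

    Files larger than max_size bytes, or files that are not valid UTF-8,
    are assumed to be real content and are skipped.
    """
    root = Path(root)
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            if full.stat().st_size > max_size:
                continue
            try:
                text = full.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                continue  # binary data, not a text placeholder
            lowered = text.lower()
            if any(marker in lowered for marker in markers):
                suspects.append(full.relative_to(root).as_posix())
    return sorted(suspects)
```

A hit is only a candidate for manual review; a short text note that
legitimately contains the word "failed" would be flagged as well.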
 
  | ansible wrote:
  | > _later, when I tried to extract some large files from the
  | backup, I saw that they had been replaced with some text files,
  | containing a message that the download of those files failed._
  | 
  | That's a really asinine failure mechanism. At the very least,
  | it should create "filename-FAILED_TO_RESTORE" or something like
  | that.
  | 
  | Something that I was instructed to do nearly 30 years ago when
  | at Motorola was: never overwrite the original files. Standard
  | procedure was to make a directory called "BACKUP" (or
  | "RESTORE"?) and put the files in there instead. The requesting
  | person was then responsible for moving the files back to their
  | original location, so any screwups were on them, not us (IT).
 
  | ryan87 wrote:
  | > What the heck... Was I expected to extract the archive and go
  | through all the files one by one (there are thousands of them)
  | to check if every file was properly backed up?
  | 
  | There are a lot of silent pitfalls like that in my experience.
  | For example, if you use folder level encryption with Synology
  | Active Backup for MS365 it can silently mangle the file names
  | due to path length restrictions. You'll end up with files that
  | have "file name too long" as part of the file name.
  | 
  | That's why I'm trying to reconcile every file in this data set.
  | I don't trust anything without being able to personally verify
  | it at this point.
 
  | dosshell wrote:
  | Does anyone know if rclone can also fail silently when syncing
  | OneDrive? [0]
  | 
  | In my experience it doesn't, but OneDrive sometimes reports
  | file sizes that are wrong, so checking file integrity is a bit
  | more of a pain.
  | 
  | [0] https://rclone.org/docs/
 
  | bombcar wrote:
  | This is exactly what I've encountered with _all_ the cloud-
  | based file storage apps now that they've gone to "dynamic
  | download".
  | 
  | You will get failed downloads quietly.
 
  | grepfru_it wrote:
  | > I wouldn't trust any MS cloud product to not lose my data.
  | 
  | I raised issues of data loss and management prioritized
  | shipping to customers instead. I can't say I blame them; they
  | can take those risks knowing customers will accept them because
  | their big name is behind the product. That said, I heard the
  | issues were resolved, so proceed with caution.
  | 
  | Note: this was not onedrive
 
| nottheengineer wrote:
| Possibly related: A few weeks ago I was told to use the autosave
| option of MS Word. The login dialog went through, but then
| nothing happened. Classic Microsoft. I reinstalled OneDrive,
| autosave said it was on, and I went ahead with my work. That was
| Friday 3PM. Come Monday, I found that the last 2 hours of Friday
| were gone and spent about half an hour looking for them. I Ctrl+S
| about every 5 seconds so I couldn't imagine that there was no copy
| anywhere. But there wasn't one, it was just gone.
| 
| What happened was that OneDrive was stuck trying to upload my VMs
| (I had those in my _local_ documents folder, which OneDrive
| claimed ownership of during reinstallation without telling me).
| OneDrive never showed my Excel file in the list of files it was
| uploading, so it could have either tried to do the VMs first and
| never got to the Excel file or quietly errored out on that
| without ever uploading it.
| 
| In any case, I'm still dealing with the fallout of the fatal
| mistake I made: Trusting a Microsoft product to do the absolute
| fucking basics.
| 
| Maybe I'll learn this time.
 
  | robotnikman wrote:
  | I'm glad I'm no longer on the side of managing Microsoft 365
  | stuff for a company. I've seen this happen too many times, and
  | of course the user takes it out on you for it. It was so much
  | simpler and more reliable in the past just having a NAS and
  | mapping it to a drive. Even my current company still uses a NAS
  | for mission-critical stuff.
 
| still-old wrote:
| Are you sure you understand how Veeam Backup for MS365 actually
| backs up data?
| 
| They designed it to look like a differential backup model, but
| it's not. In fact the only backup you can trust as to its
| integrity is the first full one.
| https://helpcenter.veeam.com/docs/vbo365/guide/retention_pol...
| 
| "When a retention policy is applied in backup repositories with
| the Snapshot-Based Retention type, Veeam Backup for Microsoft 365
| removes versions of an item, but not an item itself. Data removal
| from backup occurs every time the restore point of an item's
| version in a backup file goes beyond the retention coverage.
| Eventually, if no more changes were made to an item, Veeam Backup
| for Microsoft 365 will remove all versions of an item except the
| latest one. The latest versions and items that were never changed
| stay in a backup repository with the Snapshot-Based Retention
| type forever."
 
  | ryan87 wrote:
  | > They designed it to look like a differential backup model
  | 
  | > Eventually, if no more changes were made to an item, Veeam
  | Backup for Microsoft 365 will remove all versions of an item
  | _except the latest one_.
  | 
  | I added the emphasis. I read that to mean I can expect the
  | current point-in-time (aka right now) to be identical to my
  | live data. I'm only trying to reconcile the most recent set of
  | data, so that retention policy shouldn't make any difference,
  | right?
 
___________________________________________________________________
(page generated 2023-09-06 20:00 UTC)