Strange errors with multi-master

We are trying to set up KeyDB as a four-node multi-master cluster spanning two datacenters for disaster recovery.

The steps we have taken:

  • replaced Redis Cluster with KeyDB
  • KeyDB is the latest version, running inside a Docker container
  • reset the configuration files to the KeyDB defaults
  • added the multi-master options to the config file (multi-master yes + active-replica yes)
  • the multi-master options come before the replicaof options; some threads had comments that the order of these is critical
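For reference, the relevant fragment of our keydb.conf looks roughly like this (hostnames and port are placeholders, not our real topology; the multi-master directives are placed before replicaof, per the ordering advice mentioned above):

```
# enable active multi-master replication; placing these before replicaof
# is reportedly important
multi-master yes
active-replica yes

# each node lists the other masters (placeholder hostnames/port)
replicaof keydb-dc1-node2 6379
replicaof keydb-dc2-node1 6379
replicaof keydb-dc2-node2 6379
```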

In our tests, the second node loads the DB fine from the first (and at that point only) node.
When the third node joins in, the trouble starts:

  • nodes bounce up and down, with no indication of this in the logs; clients do lose their connection to the cluster, though
  • thousands of the following errors appear in the logs on all nodes:

== CRITICAL == This replica is sending an error to its master: 'Invalid MVCC Tstamp' after processing the command 'KEYDB.MVCCRESTORE'

Latest backlog is: '"b4-46fd-819b-a879a588c46d\r\n$153\r\n*5\r\n$17\r\nKEYDB.MVCCRESTORE\r\n$44\r\nmyvine:STAT:C271F2F6588DD73BE0530502020A5897\r\n$20\r\n18446744073709551615\r\n$13\r\n1626350182830\r\n$20\r\n\x00\b18763:10\t\x00H\xda\x1a\xd4\xcbZ\xd0\xa1\r\n\r\n$1\r\n0\r\n$19\r\n1700941248263623573\r\n\r\n$1\r\n0\r\n$19\r\n1700941309739532331\r\n"'
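Decoding the backlog fragment by hand helps show what KeyDB is rejecting. Below is a minimal sketch of a RESP bulk-string array decoder (not KeyDB's actual parser; the key and binary payload in the sample are shortened placeholders, only the two numeric fields are the real values from the log). Notably, the timestamp argument captured above, 18446744073709551615, equals 2^64 - 1 (UINT64_MAX), which may be exactly the sentinel value KeyDB flags as an invalid MVCC tstamp.

```python
# Minimal RESP decoder for a "*N" array of "$len" bulk strings.
# Sketch only: no error handling, and not KeyDB's real parser.
def parse_resp_array(buf: bytes) -> list[bytes]:
    assert buf[:1] == b"*", "expected a RESP array"
    head_end = buf.index(b"\r\n")
    count = int(buf[1:head_end])            # number of array elements
    pos = head_end + 2
    items = []
    for _ in range(count):
        assert buf[pos:pos + 1] == b"$", "expected a bulk string"
        len_end = buf.index(b"\r\n", pos)
        length = int(buf[pos + 1:len_end])  # declared byte length
        start = len_end + 2
        items.append(buf[start:start + length])
        pos = start + length + 2            # skip payload + trailing \r\n
    return items

# Reconstructed KEYDB.MVCCRESTORE command from the backlog; key and
# payload are shortened placeholders.
cmd = (b"*5\r\n"
       b"$17\r\nKEYDB.MVCCRESTORE\r\n"
       b"$8\r\nsome:key\r\n"
       b"$20\r\n18446744073709551615\r\n"
       b"$13\r\n1626350182830\r\n"
       b"$4\r\nblob\r\n")

args = parse_resp_array(cmd)
print(args[0])       # b'KEYDB.MVCCRESTORE'
print(int(args[2]))  # 18446744073709551615 == 2**64 - 1, i.e. UINT64_MAX
```

Assuming the third argument is the MVCC timestamp (I have not verified the argument order against the KeyDB source), a UINT64_MAX value there would explain the 'Invalid MVCC Tstamp' rejection.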

Database size is about 3 GB.

In dev tests, where the database is much smaller, the same errors appeared but stopped after a couple of hours. With this real system the problem seems to persist, and in any case we cannot afford the nodes bouncing up and down, as our infrastructure now relies heavily on Redis/KeyDB.

From Google I found a related error report. However, the solution described there did not help in our case:
client-output-buffer-limit replica 0 0 0

I also tried increasing the thread count in the config file. It did not help.
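For completeness, the two mitigations tried so far map to these keydb.conf directives (the values shown are illustrative, not a recommendation):

```
# disable output-buffer limits for replica connections
# (the suggested fix that did not help)
client-output-buffer-limit replica 0 0 0

# raise the server thread count (value illustrative)
server-threads 4
```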

Any ideas whether I am missing something in the config, or doing something else wrong?

Thanks for any help!

I investigated the Redis source code vs. KeyDB a bit. It looks like Redis does not use MVCC timestamps at all.

So maybe the problem is simply this:

  • the RDB file is from Redis and has no MVCC timestamps; KeyDB streams it to the new node, which then complains about this
    => this extra error traffic saturates the network and the nodes become unreachable.

However, I do not understand why the second node did not produce the errors… or maybe they appeared and disappeared fast enough for me to miss them.

So is this a possible cause? And would the solution be:
a) ignore the missing timestamp during the initial replicaof load and just assign a new timestamp?
b) add the missing timestamp while streaming from the master?
c) add the missing timestamp when loading the DB from disk? I think this would not help me, as I cannot really stop the cluster; at least one node must be running at all times.