Consensus fault
A consensus fault (or consensus error) occurs when multiple nodes disagree on some result and can no longer make progress. This can happen due to operator error, a rare software bug, or malicious tampering.
If nodes complain about "wrong Block.Header.AppHash", this is likely a consensus fault.
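As a quick check for the message above, the following is a minimal sketch that searches a saved log file for it. The helper name `has_apphash_mismatch` is hypothetical, and how you export a node's logs to a file depends on your deployment; neither is part of the `cord` tooling.

```shell
# Hypothetical helper (not part of cord): succeeds if the given log file
# contains the AppHash mismatch message, fails otherwise.
has_apphash_mismatch() {
  # -F matches the text literally (the dots are not regex wildcards),
  # -q suppresses output so only the exit status is used.
  grep -F -q 'wrong Block.Header.AppHash' "$1"
}
```

For example, `has_apphash_mismatch /path/to/node.log && echo "possible consensus fault"`, where the log path is a placeholder for wherever your node's logs are written.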
Diagnostics
It's important to understand why a fault occurred. The cause may be obvious, such as an operator error, but if it is not, please record diagnostic information before restoring the faulty node so that the fault can be investigated further.
- On the faulty node, run:
cord test collect-blocks
- On a working node, run:
# Use the height reported as "Latest block height: xxx" when collecting on the faulty node.
cord test collect-blocks --height HEIGHT_FROM_FAULTY_NODE
Both commands output a blocks.tar file. Please share both files with Cordial Systems; they should help pinpoint where the disagreement stemmed from.
Recovery
Recovery should be done manually, and involves modifying the faulty node's state to align with the rest of the cluster.
You can do this using a snapshot from any other node.
# be sure to stop the node first
export TREASURY_HOME=...
cord backup restore --engine --no-secrets --snapshot "<snapshot_from_other_node>"
Alternative ways to recover
If you do not have a snapshot, or the snapshot is taking too long, you can also recover by overwriting:
- data/ directory.
- config/genesis.json file.
In other words, you must copy these from a working node and overwrite them on the faulty node. The working node must be stopped while you copy from it; otherwise, there may be consistency issues.
Note that there are no secrets in any of these files, but you should be careful not to copy more than this.
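The manual overwrite can be sketched roughly as follows. This is only an illustration under assumptions: the helper name `copy_consensus_state` is hypothetical, it assumes that data/ and config/genesis.json sit directly under each node's home directory, and it assumes the working node's home has already been made reachable on the faulty machine (for example, as a copy taken while that node was stopped). Moving the old data/ aside rather than deleting it is an added precaution, not a documented requirement.

```shell
# Hypothetical helper (not part of cord). Assumes data/ and
# config/genesis.json live directly under each node's home directory.
# Both nodes must be stopped before running this.
copy_consensus_state() {
  src="$1"   # home directory copied from the stopped working node
  dst="$2"   # home directory of the stopped faulty node, e.g. "$TREASURY_HOME"

  # Refuse to run if the source is missing either item.
  [ -d "$src/data" ] || { echo "missing $src/data" >&2; return 1; }
  [ -f "$src/config/genesis.json" ] || { echo "missing $src/config/genesis.json" >&2; return 1; }

  # Keep the faulty state for later investigation instead of deleting it.
  if [ -e "$dst/data" ]; then
    mv "$dst/data" "$dst/data.faulty.$(date +%s)" || return 1
  fi

  # Copy only the two items named above and nothing else.
  mkdir -p "$dst/config" || return 1
  cp -a "$src/data" "$dst/data" || return 1
  cp -a "$src/config/genesis.json" "$dst/config/genesis.json"
}
```

A call might look like `copy_consensus_state /path/to/working-node-home "$TREASURY_HOME"`, where the first path is a placeholder. Because the function names the two paths explicitly, other files in config/ are never touched.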