Recently, I had the privilege of experiencing yet another Cisco bug. After upgrading from 9.4.x to 9.6.x, my VPN control plane crashed, leaving all of my L2L VPN tunnels stuck at MM_WAIT_MSG2.
Oct 08 03:34:58 [IKEv1 DEBUG]IP = 220.127.116.11, IKE MM Initiator FSM error history (struct &0x00002aaad0819380) <state>, <event>: MM_DONE, EV_ERROR-->MM_WAIT_MSG2, EV_RETRY-->MM_WAIT_MSG2, EV_TIMEOUT-->MM_WAIT_MSG2, NullEvent-->MM_SND_MSG1, EV_SND_MSG-->MM_SND_MSG1, EV_START_TMR-->MM_SND_MSG1, EV_RESEND_MSG-->MM_WAIT_MSG2, EV_RETRY
My environment is on the more complex side, so this may not be relevant to everyone. But if you are running multiple gateways of last resort (default routes) with policy-based routing (PBR) and multiple outside interfaces, read on.
An IPsec VPN consists of two phases. Phase one is the control plane between the peers; it must be established before phase two, which is the data plane between the end hosts.
When you run VPN on multiple outside interfaces, the firewall knows which interface to use for its egress traffic from the crypto map binding, e.g.:
crypto map outside_map2 interface outside2
When phase one is initiated, PBR doesn’t come into play, since it applies only to data plane traffic (I had assumed otherwise). Instead, the firewall checks the crypto map to determine which interface to use, and as long as you have the secondary gateway of last resort configured (this is important), it knows how to exit.
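For illustration, a dual-exit setup of this kind might look roughly like the sketch below. The interface name outside2 comes from the crypto map line above; the primary interface name outside, the next-hop addresses (203.0.113.1 and 198.51.100.1, taken from documentation ranges) and the metric of 254 are placeholders and assumptions, not the real configuration.

```
! primary gateway of last resort (interface name and next hop are assumed)
route outside 0.0.0.0 0.0.0.0 203.0.113.1 1
! secondary gateway of last resort on the second outside interface
! (higher metric, so it does not replace the primary default route)
route outside2 0.0.0.0 0.0.0.0 198.51.100.1 254
! crypto map bound to the second outside interface
crypto map outside_map2 interface outside2
```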
This was a valid configuration on the 9.4.x releases, but the logic seems to have changed in 9.6.x, so be aware.
After reviewing the whole configuration and taking packet captures on the egress interfaces, it was obvious that the control plane was no longer picking the correct static route to the remote peer, which left the tunnels stuck at MM_WAIT_MSG2.
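A rough sketch of how that kind of check can be done on the ASA itself is shown below. The capture names cap_out1 and cap_out2 and the interface name outside are assumptions for illustration; the peer address is the one from the debug output above. If the IKE packets show up on the wrong interface, or on none at all, the control plane is not following the route you expect.

```
! capture IKE (UDP 500) traffic towards the peer on each outside interface
capture cap_out1 interface outside match udp any host 220.127.116.11 eq 500
capture cap_out2 interface outside2 match udp any host 220.127.116.11 eq 500
! see which interface the main mode messages actually leave on
show capture cap_out1
show capture cap_out2
! check the phase one state and the routing table
show crypto isakmp sa
show route
! remove the captures when done
no capture cap_out1
no capture cap_out2
```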
Creating a specific (/32) static route to the peer via the secondary outside interface made the control plane use the proper egress interface, and the tunnel came back up.
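As a minimal sketch, the host route would look something like this. The peer address is the one from the debug output above and outside2 is the interface from the crypto map line; the next hop 198.51.100.1 is a placeholder from a documentation range, so substitute the gateway of your secondary ISP.

```
! /32 host route to the VPN peer via the secondary outside interface
route outside2 220.127.116.11 255.255.255.255 198.51.100.1 1
```

You would need one such route per remote peer that should be reached through the secondary interface.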
This is definitely new behavior that wasn’t present in the 9.4.x releases. I’ve asked Cisco to provide an RCA for this behavior change, but I highly doubt they will come back to me with anything. If they do, I’ll update this post with the findings.
Again, this is not a very common deployment; typically you will have one outside interface with one gateway of last resort.
I thought I would share my experience in case someone ends up in the same ditch.
Additionally, I strongly recommend checking out my other post explaining the causes of MM_WAIT_MSG2 and how to remediate them.