[v3.30] cherry-pick ipv6 patches from master on release/v3.30.0#978

Open
aritrbas wants to merge 68 commits into release/v3.30.0 from
ipv6-cherry-pick-r330

Conversation

@aritrbas
Collaborator

No description provided.

hedibouattour and others added 30 commits April 22, 2026 20:41
services can have a v6 address or be dualstack
endpoints can be dualstack as well, in which case we track
them using the endpointslice object rather than the endpoints.
This patch adds the support for the v6 services
we only nat v4 to v4 or v6 to v6

add map for svc and corresponding epslices and fix run-test-v6

make run-tests
make run-tests-6
This patch fixes the behavior of CNI server state
reload on error. It addresses the two following issues.
- If the state file is corrupted and parsing errors,
we should return, not proceed as the parsed cniServerState
might contain partial data.
- When VPP reprogramming completes, we overwrite the
state file so that we remove pods that yielded errors or
for which the linux ns was removed.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch removes the nodeIP from the tap0 interface in VPP.
With this patch, for each uplink interface eth0 with IP 192.168.0.1/24
we create a corresponding tap0 set up the following way:

* In VRF:0
  * we create the af_packet interface with IP 192.168.0.1/24
  * we receive 192.168.0.1/32 locally, traffic to 192.168.0.1 without listeners
    will end up in punt
* In the punt table
  * we route 192.168.0.1/24 via tap0 192.168.0.1
* In linux
  * tap0 has the 192.168.0.1/24 address
  * tap0 will respond to ARPs as VPP has arp proxy enabled
* In a host-tap-eth0-v4 VRF
  * we place the tap0 interface
  * we give it the 169.254.0.1/32 address, overridable with CALICOVPP_TAP0_ADDR
  * we enable IP6 without setting an address
  * we add a static neighbor for 192.168.0.1 to the MAC of the linux side of the tap
* If we specify a rule in redirectToHostRules (e.g. for DNS in kind)
  * we will have the classifier entry redirect to tap0 192.168.0.1

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
IPv6 gateway traffic (DHCPv6/ICMP) fails when VPP takes over the uplink.
- Without gateway ND proxy, host NS for the default gateway is dropped by VPP
  with "neighbor solicitations for unknown targets" error due to missing /128
  target entry in the tap FIB.

Fix:
- Enable ND proxy for the gateway on the tap so the host can resolve the
  gateway via VPP.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
Configure ip6tables mangle rule to set hop limit to 2 for DHCPv6 OUTPUT
traffic from client (sport 546) to server (dport 547). This prevents VPP
from dropping DHCPv6 SOLICIT/REQUEST packets when it decrements hop-limit
by 1 during forwarding. Since clients generate SOLICIT/REQUEST with
hop-limit=1, without this rule VPP drops the packet (ip6 ttl <= 1)
with ICMP time exceeded, causing DHCPv6 lease negotiation to fail.

The rule is checked for existence before adding to prevent duplicates
since ip6tables does not auto-dedupe rules. The rule is also cleaned
up during configuration restoration.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Link-local addresses are not routable. When synchronizing Linux
routes to VPP's uplink interface, filter out link-local addresses
so that they are not added to VPP's main VRF routing table.
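As an illustration, the filter could look like the sketch below. `dropLinkLocal` and `parseAll` are hypothetical helpers, and using `net.IP.IsLinkLocalUnicast` (which matches both fe80::/10 and 169.254.0.0/16) is an assumption about how the check is done.

```go
package main

import (
	"fmt"
	"net"
)

// dropLinkLocal returns the prefixes whose address is not link-local unicast.
func dropLinkLocal(prefixes []*net.IPNet) []*net.IPNet {
	kept := make([]*net.IPNet, 0, len(prefixes))
	for _, p := range prefixes {
		if p == nil || p.IP.IsLinkLocalUnicast() {
			continue
		}
		kept = append(kept, p)
	}
	return kept
}

// parseAll is a small helper for the example; it panics on invalid input.
func parseAll(cidrs ...string) []*net.IPNet {
	out := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		out = append(out, n)
	}
	return out
}

func main() {
	for _, p := range dropLinkLocal(parseAll("fe80::/64", "2001:db8::/64", "10.0.0.0/24")) {
		fmt.Println(p)
	}
}
```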

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Capture ID_NET_NAME_* properties before VPP driver unbind and restore them
via udev rules after VPP creates host-facing tap/tun interface. This is
needed for IAID generation by DHCPv6 client in systemd-networkd to be
consistent across VPP lifecycle on the node.

Key changes:
- Repurpose BEFORE_IF_READ hook to capture udev properties before driver unbind
- Move SetInterfaceNames() before HookBeforeIfRead so interface names are available
- Store ID_NET_NAME_* values and MAC address while interface still has original driver
- Create udev rules for the interface to restore ID_NET_NAME_* values after VPP runs
- Cleanup udev rules on VPP shutdown
- BEFORE_IF_READ → capture, VPP_RUNNING → create, VPP_DONE_OK/ERRORED → cleanup
- Add EnableUdevNetNameRules config knob in CalicoVppDebugConfigType (default: true)
  - Allows disabling udev net name rules generation (if needed). When disabled, skips
    captureHostUdevProps(), createUdevNetNameRules() and removeUdevNetNameRules()

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
IPv6 ping between nodes fails with "l3 mac mismatch" error in VPP's
ethernet-input node. Packets arriving on tap0 with destination MAC
set to the infrastructure gateway's MAC are dropped.

- IPv4 (ARP Proxy): Host sends ARP request, VPP responds with its own
  tap interface MAC. All subsequent IPv4 packets use VPP's MAC as the
  destination, passing VPP's L3 MAC filter check.

- IPv6 (ND Proxy + Neighbor Advertisement): While VPP's ND proxy responds
  to Neighbor Solicitations with the tap interface MAC, the host also
  receives Neighbor Advertisement (NA) packets from the real gateway.
  These NA packets contain the Target Link-Layer Address Option (TLLAO)
  with the real gateway's MAC address. The host overwrites its neighbor
  cache with this information and sends IPv6 packets to the real gateway
  MAC instead of VPP's tap MAC.

Capture the gateway's MAC address from Linux neighbor cache before VPP
takes over the interface, then add it as a secondary MAC address on the
tap interface using VPP's existing sw_interface_add_del_mac_address API.

VPP's ethernet-input node accepts packets with either the primary MAC
or any configured secondary MAC addresses, allowing traffic to flow
regardless of which MAC address the host learned (from ND proxy or NA).

This is a control plane only fix that requires no VPP patches.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
In dual-stack or IPv6-enabled clusters, the agent can crash when it attempts
to announce or withdraw a BGP path for an IPv6 address, but the node does
not have a corresponding IPv6 address configured in HostMetadata.

Before this change, common.MakePath() returned a generic error ('no ip6
address for node'). That error was wrapped by the routing_server and
propagated back to tomb, causing the routing watcher to stop and the
main process to tear down (ending in a fatal gRPC server error).

Changes:
- Added sentinel errors ErrNoNodeIPv4 and ErrNoNodeIPv6 in common.go
- Added helper function IsMissingNodeIP() to detect these specific errors
- Updated MakePath() to return sentinel errors (including for SRv6 next-hop)
- Updated routing_server and prefix_watcher to treat missing-node-IP as a
  non-fatal condition: log a warning indicating we skip announce/withdraw,
  returning nil so tomb does not enter Dying state

This prevents the agent from crashing with a clear warning log for operators.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch changes the way the link local address is configured in VPP.
Previously we were going through the addresses configured on the
uplink in linux prior to starting VPP, extracting the link-local address and
using it as-is in VPP.

This patch makes it so that we first create the tap interface in linux,
wait for it to get a linklocal address and use the new one as a Link local
in VPP.

This addresses a conflict where linux will compute two different LL addresses
for the real uplink - prior to VPP starting - and the tuntap replacing it
- after VPP started. This is due to the fact that the computation [0] includes
idev->dev->perm_addr which is unset in tuntap but is in hardware drivers.

Another reason for this design is that VPP does not currently support multiple
LL addresses for a given interface.

[0] https://github.com/torvalds/linux/blob/master/net/ipv6/addrconf.c#L3337C12-L3337C40

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
This patch filters out the ipv6 link-local addresses and routes we copy
from the physical uplink to its tuntap replacement.

The responsibility for setting up the v6 link-local routes and
v6 link-local addresses on the new tuntap should lie with
systemd-networkd or an equivalent, so that we do not risk running into
race conditions.

This code only affects ipv6 link local routes. We do not have
special processing for the ipv4 link local routes and addresses
for the moment.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
This patch refactors the way we access the uplink routes and addresses
in vpp-manager, so that we always exclude the v6 link-local routes
from the list, and thus never wrongfully program ll routes where
we expect regular addresses.

As part of this effort, we deprecate the support for extra addresses
configuration, removing the `extraAddrCount` knob from CALICOVPP_INITIAL_CONFIG

Finally we enable the ND proxy for the newly found IPv6LinkLocal address
which was missing from the previous implementation.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
- Fix CI build failure in config_test.go: update test to use unexported
  'routes' field instead of old exported 'Routes' field
- Remove unused SetV6LinkLocal method: IPv6LinkLocal is exported and
  assigned directly in vpp_runner.go
- Unexport IsV6Cidr → isV6Cidr: only used within config package
- Fix pre-existing bug in pickNextHopIP: check addr.IP.To4() instead of
  nhAddr.To4() (nil on first iteration, always set needsV6=true)

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch moves the VPP admin up API call to be one of the very last
things vpp-manager does when configuring VPP. Importantly, it places
it after running the last hook 'VPP_RUNNING' that is tasked with applying
the new systemd-networkd configurations.

This should eliminate the possibility of race-conditions, e.g. having
dhclient issue a v6 solicit before the proper udev rules have been
put in place so that IAID stays identical with the original interface.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
systemd-networkd computes the DHCPv6 IAID from the interface's
persistent name obtained via udev properties (ID_NET_NAME_ONBOARD,
ID_NET_NAME_SLOT, ID_NET_NAME_PATH). When VPP replaces the physical
NIC with a TUN/TAP, the virtual device lacks these properties, so the
IAID falls back to a MAC-based hash — a different value from what the
physical NIC used.

The previous fix (7bc4b5c) deferred VPP-side InterfaceAdminUp to after
the VPP_RUNNING hook, but the Linux side of the tap was already UP from
the moment VPP created it. systemd-networkd detected the interface and
sent DHCPv6 SOLICITs with the wrong IAID before the hook had a chance
to install the udev rule.

Fix:
- Move udev rule installation from the VPP_RUNNING hook (after taps
  exist) to BEFORE_VPP_RUN (before VPP starts). The rule is already
  loaded when VPP creates the tap, so the kernel udev "add" event
  applies ID_NET_NAME_* properties immediately. Remove the
  udevadm trigger, which is no longer required.
- Defer configureLinuxTap() (netlink.LinkSetUp + addresses + routes)
  from configureVppUplinkInterface() to after the VPP_RUNNING hook in
  runVpp(). This guarantees networkd has been restarted and udev
  properties are in place before the host tap becomes operational.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Fix the endpointslices errors in agent
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch resets the next_dpo property
on the vxlan_tunnel on deletion to prevent
resource leakage and use after free in the
case where vxlan tunnels get frequently
added and deleted.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
VXLAN tunnels were showing UNRESOLVED FIB entries despite correct
neighbors and routes causing intermittent connectivity issues.

- pod CIDR routes incorrectly used local node IP as gateway on VXLAN
  tunnel interface - this created problematic adj-sourced FIB entries
  that interfered with tunnel destination encap DPO resolution.

- EncapVrfID field was never set in VxlanAddDelTunnelV3(), defaulting
  to 0. For secondary networks with uplinks in non-default VRFs,
  tunnel destinations had no routes in VRF0, causing UNRESOLVED encap.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Aritra Basu and others added 26 commits April 22, 2026 20:41
Commit 47563a4 changed VXLAN pod-CIDR routes from `Gw: nodeIP` to `Gw: nil`
to avoid local-IP adj-sourced side effects and UNRESOLVED encap DPO issues.

`Gw: nil` is valid for IPIP (P2P tunnel semantics), but VXLAN is non-P2P.
VXLAN tunnels in VPP lack `VNET_HW_INTERFACE_CLASS_FLAG_P2P` (unlike IPIP).
With `Gw:nil` on a non-P2P interface, VPP creates an attached/glean FIB
entry (`FIB_PATH_TYPE_ATTACHED`) and attempts NDP for each destination
directly on the VXLAN tunnel which triggers unresolved neighbor resolution
behavior on the tunnel path.

Fixed the code to use `cn.NextHop` (remote node IP / tunnel destination)
as the gateway. This creates `FIB_PATH_TYPE_ATTACHED_NEXT_HOP`, which
resolves via NDP on the uplink (tunnel is unnumbered) and does not
conflict with the encap DPO since the remote IP has no local receive route.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
On some deployments, uplink IPv6 addresses can be programmed in VPP
with host prefixes (`/128`) which means there is no connected subnet
on the interface. Neighbor discovery can still learn MAC/IP entries,
but VPP may create host/adj-fib behavior that effectively treats
off-subnet neighbors as attached, causing forwarding to become
`UNRESOLVED` for certain peers and intermittently override expected
default-route forwarding.

The failure is intermittent because it is timing-dependent: whether
VPP learns the neighbor (via NDP) before or after the default route
is installed, and on neighbor aging/re-learning cycles.

Introduced CALICOVPP_DEBUG.uplinkSubnetMask (default: 64) to force all
IPv6 uplink interface AddInterfaceAddress calls in vpp-manager to use
this mask, regardless of the source netmask from Linux interface.

This keeps neighbor/MAC learning behavior while ensuring uplink IPv6
addresses are installed with a connected-prefix mask that avoids
host-prefix adjacency edge cases.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch transforms the UplinkSubnetMask option to TranslateUplinkAddrMaskTo64
and restricts it to only apply to non linklocal v6 addresses read on the
uplink interface, and that have a /128 prefix
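A minimal sketch of the translation rule described here, assuming it is applied per address. `translateUplinkMask` and `parseAddr` are hypothetical names for illustration, not the functions used in vpp-manager.

```go
package main

import (
	"fmt"
	"net"
)

// translateUplinkMask rewrites a /128 to a /64 only for IPv6 addresses that
// are not link-local. Every other address is returned unchanged.
func translateUplinkMask(addr *net.IPNet) *net.IPNet {
	if addr == nil || addr.IP.To4() != nil || addr.IP.IsLinkLocalUnicast() {
		return addr
	}
	if ones, bits := addr.Mask.Size(); ones != 128 || bits != 128 {
		return addr
	}
	return &net.IPNet{IP: addr.IP, Mask: net.CIDRMask(64, 128)}
}

// parseAddr keeps the host address together with its mask, the way an
// interface address is usually represented. It panics on invalid input.
func parseAddr(s string) *net.IPNet {
	ip, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return &net.IPNet{IP: ip, Mask: n.Mask}
}

func main() {
	for _, s := range []string{"2001:db8::5/128", "fe80::1/128", "2001:db8::5/96"} {
		fmt.Println(s, "->", translateUplinkMask(parseAddr(s)))
	}
}
```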

With the previous implementation we were trying to program VPP with LL /64
addresses which was returning errors.

Also this patch makes it so that vpp-manager will error out if it fails
programming VPP with an address so that we notice such errors

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
If VPP is killed abruptly (e.g. due to OOM), it does not shut down
gracefully and does not restore interface bindings. As a result,
interfaces may remain bound to a DPDK driver and no longer appear
as Linux network devices.

When the expected interface is not found, attempt to rebind PCI
devices back to the kernel driver and retry the lookup once.

Because we don't die gracefully, pingCalicoVPP() won't be called
and agent will never restart. So we call it in the beginning to
kill previous stale agents if any.
Previously, any return from the VPP manager was treated as an indication
that VPP had exited or crashed, triggering configuration restoration.

However, if the VPP manager itself panicked while VPP was still running,
the deferred logic would interpret this as a VPP crash and attempt to
restore the configuration. Since VPP was actually still alive, the
restoration could hang and the original panic would never be properly
surfaced.

This change adds panic recovery in the VPP manager to prevent false
assumptions about VPP state and avoid blocking during restore when the
failure originates in the manager itself.

When the VPP manager crashes we kill VPP, wait a bit, then restore the config.
When using systemd-networkd, restarting networkd immediately after
systemd-udev-trigger can race with udev event processing.

This can cause networkd to compute DHCPv6 IAID before ID_NET_NAME_*
properties are fully restored, leading to intermittent IAID mismatch.

Add a udev settle barrier before restarting networkd:
- restart systemd-udev-trigger
- udevadm settle --timeout=5
- restart systemd-networkd

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch fixes intermittent DHCPv6/link-local punt loops caused
by a startup race in IPv6 link-local discovery.

LL discovery ran in configureVppUplinkInterface() before Linux had
brought the tap UP, so LL was intermittently missing.
Without LL /128 in punt table, punted link-local packets matched
fe80::/10 (ip6-link-local DPO), got redirected back to per-interface
LL FIB, and looped until VPP recursion guard dropped them.

- Add configureIPv6LinkLocal() and call it from runVpp() after
  configureLinuxTap() (LinkSetUp) and before InterfaceAdminUp().
- Poll for tap LL with bounded retries after tap UP; if not found,
  return an error so runVpp() terminates VPP.
- Program LL-specific state in configureIPv6LinkLocal():
  * punt-table LL /128 route
  * LL address on uplink in VPP
  * ND proxy for LL
- Remove old LL handling from the previous setup path.
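The bounded polling step could be sketched as follows. `waitForLinkLocal` and `fakeLookup` are hypothetical; in the real code the lookup would list addresses on the tap via netlink, and the retry count and delay are not specified by the commit message.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForLinkLocal polls lookup up to attempts times and returns the first
// IPv6 link-local address it sees. It returns an error once the retries are
// exhausted, so the caller can terminate VPP.
func waitForLinkLocal(lookup func() []net.IP, attempts int, delay time.Duration) (net.IP, error) {
	for i := 0; i < attempts; i++ {
		for _, ip := range lookup() {
			if ip.To4() == nil && ip.IsLinkLocalUnicast() {
				return ip, nil
			}
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return nil, fmt.Errorf("no IPv6 link-local address after %d attempts", attempts)
}

// fakeLookup simulates a tap whose address appears after readyAfter polls.
func fakeLookup(readyAfter int, addr string) func() []net.IP {
	calls := 0
	return func() []net.IP {
		calls++
		if calls <= readyAfter {
			return nil
		}
		return []net.IP{net.ParseIP(addr)}
	}
}

func main() {
	ip, err := waitForLinkLocal(fakeLookup(2, "fe80::1"), 5, 10*time.Millisecond)
	fmt.Println(ip, err)
}
```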

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch adds a static neighbor for the node's link-local
address on tap0. This simplifies things, as we know the node's
link-local address in advance and use it for other purposes, so we
save a resolution. Also, we do not yet support it changing.

Finally, without this we learn the neighbor through
gratuitous requests from linux to ND proxy and NS/ND, but as
we do not have a glean toward linux, we cannot learn the neighbor
if packets are coming from the uplink without prior linux-initiated
solicits. As an example, DHCPv6 solicit replies
need forwarding and we might have forgotten the neighbor by then.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
this was previously done in deprecated vpplink repo
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Co-authored-by: Aritra Basu <aritrbas@cisco.com>
This patch updates the solicited-node address patchset
to also program the LL address as a source for the solicited-node
MAC address and mcast group to join.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch prevents neighbor advertisements from being
punted toward the host. The expectation is that linux
will issue NS that will hit NDproxy, thus we should
consume NAs in VPP, and learn neighbors, but not transmit
the packets to linux.

This is especially useful in the case where we lose the
node IP (e.g. when dhcpv6 lease renewal fails), as forwarding
NAs destined to the nodeIP to linux with a default route
towards VPP will result in VPP learning neighbors on the tap0
instead of the uplink.

We also include an evolution in ND proxy that removes the need
to allowlist destination IPs that VPP will reply to. That way
we can have the guarantee that all next hops in linux will
be the MAC address of the Gateway, regardless of the routing
on the node.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
- remove "daemon-reload" from generic restartService()
  to avoid reloads on every network restart
- keep "daemon-reload" only for NetworkManager DNS fix
  and undo fix via reloadAndRestartService()

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Signed-off-by: Aritra Basu <aritrbas@cisco.com>
This patch will replicate the eui64 link local address we
read on the linux uplink before startup (in ifState) on the
tap replica we create after starting VPP.

We will still wait for linux to assign a new link-local address
to the tap, then add the eui64 one and remove the newly created one, so
that we do not run into a situation where we have two link-local
addresses, as VPP only supports one.

This addresses an issue with anti-MAC-spoofing mechanisms that prevent
us from changing our link-local address. This is also cleaner, as the
newly generated address will only be temporary.

Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
Reorder startup flow so HookVppRunning executes in this order:
- configureLinuxTap() (tap link set UP, addresses/routes restored)
- configureIPv6LinkLocal() (tap auto-LL replaced with physical LL)
- HookVppRunning (networkd restart happens after physical LL is set)
- InterfaceAdminUp() on tap in VPP

systemd-networkd can initialize DHCPv6 using the kernel auto-generated
tap LL before configureIPv6LinkLocal() replaces it with the physical
LL. By executing VPP_RUNNING after the LL reconciliation, networkd
restart observes the final (physical) LL and DHCPv6 uses the
correct LL source address.

Signed-off-by: Aritra Basu <aritrbas+gh@cisco.com>
avf is deprecated in vpp.
MakePathSRv6Tunnel() constructs SRPolicyNLRI without setting the Color
field, which defaults to zero. This violates RFC 9256 §2.1 which
requires Color to be "an unsigned non-zero 32-bit integer value", and
causes DT4 and DT6 SR Policies for the same endpoint to collide in
GoBGP's RIB (identical NLRI key), so only one of DT4/DT6 survives.

- Set Color to uint32(trafficType) (4 for DT4, 6 for DT6) to ensure
  each <Color, Endpoint> tuple is unique per traffic type and
  satisfies the RFC's non-zero requirement.

- Added unit tests to verify non-zero Color & DT4/DT6 NLRI uniqueness.
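The collision and its fix can be shown with a simplified model. `policyKey` is a hypothetical stand-in that keeps only the Color and Endpoint parts of the SR Policy NLRI; the real GoBGP key has more fields.

```go
package main

import "fmt"

// trafficType mirrors the values quoted in the commit message.
type trafficType uint32

const (
	dt4 trafficType = 4
	dt6 trafficType = 6
)

// policyKey is a simplified, hypothetical RIB key: <Color, Endpoint>.
type policyKey struct {
	Color    uint32
	Endpoint string
}

// keyFor derives the color from the traffic type, which keeps it non-zero
// and distinct between DT4 and DT6 for the same endpoint.
func keyFor(endpoint string, tt trafficType) policyKey {
	return policyKey{Color: uint32(tt), Endpoint: endpoint}
}

func main() {
	rib := map[policyKey]string{}
	rib[keyFor("2001:db8::1", dt4)] = "DT4 policy"
	rib[keyFor("2001:db8::1", dt6)] = "DT6 policy"
	fmt.Println("entries with per-type color:", len(rib))

	// With Color left at zero, both policies share one key and overwrite
	// each other.
	zero := map[policyKey]string{}
	zero[policyKey{Endpoint: "2001:db8::1"}] = "DT4 policy"
	zero[policyKey{Endpoint: "2001:db8::1"}] = "DT6 policy"
	fmt.Println("entries with zero color:", len(zero))
}
```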

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
CrossSubnet provider selection relies on local node CIDR containment checks
(nodeIPNet.Contains(peerNextHop)). On environments where the uplink IPv6
address is discovered as a host prefix (/128), node.Spec.BGP.IPv6Address
was advertised as /128. A /128 never contains a same-subnet peer address,
so CrossSubnet effectively behaved like Always.
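The containment problem is easy to reproduce with the standard library. `sameSubnet` below is a hypothetical helper that mimics the `nodeIPNet.Contains(peerNextHop)` check, and the addresses are documentation examples.

```go
package main

import (
	"fmt"
	"net"
)

// sameSubnet reports whether peer falls inside the prefix of nodeCIDR.
func sameSubnet(nodeCIDR, peer string) (bool, error) {
	_, ipnet, err := net.ParseCIDR(nodeCIDR)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(peer)
	if ip == nil {
		return false, fmt.Errorf("invalid peer address %q", peer)
	}
	return ipnet.Contains(ip), nil
}

func main() {
	for _, cidr := range []string{"2001:db8::5/128", "2001:db8::5/64"} {
		ok, err := sameSubnet(cidr, "2001:db8::6")
		fmt.Println(cidr, "contains peer:", ok, err)
	}
}
```

With the /128, a neighbor on the same link is classified as off-subnet, which is why every peer ended up being encapsulated.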

Moved IPv6 /128 -> /64 translation into LinuxInterfaceState address
getters instead of applying it only at VPP interface programming time.

updateCalicoNode() now advertises translated IPv6 (/64) through GetNodeIP6(),
so Felix/nodeBGPSpec receives a subnet prefix usable by CrossSubnet checks.

Added GetAddressesNoMaskTranslation() and switched Linux address add/del
and tap restore/programming paths to use raw addresses, preserving previous
Linux-side behavior.

CrossSubnet decisions now use a subnet-capable local IPv6 prefix and no
longer degrade to Always due to host-prefix advertisement.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
Signed-off-by: Aritra Basu <aritrbas+gh@cisco.com>
Enable IPv6 on pod interfaces when a pod is IPv6 enabled.
This ensures a link‑local address exists for NS.

VPP change "ip-neighbor: do not use sas to determine NS source
address" makes NS always use the interface’s link‑local address.
Calico VPP pod interfaces are unnumbered and never had IPv6
explicitly enabled, so no link‑local address existed on the pod
interface. This breaks IPv6 neighbor resolution and traffic.

Signed-off-by: Aritra Basu <aritrbas@cisco.com>
@aritrbas aritrbas self-assigned this Apr 23, 2026
@aritrbas aritrbas marked this pull request as ready for review April 30, 2026 07:23
@sknat sknat self-requested a review April 30, 2026 09:07
Collaborator

@sknat sknat left a comment


lgtm



4 participants