@phsm
Contributor

@phsm phsm commented Mar 20, 2025

Description

This PR changes the behavior of Security Groups to disable connection tracking when it is not needed.
The idea is that VMs with an "allow all" rule can open as many connections as they want without straining the host system. This change may be beneficial for VPS hosters, where the VM behavior is not under the control of the server administrator.

The list of changes:
Introduced two new ipsets: cs_notrack for IPv4 and cs_notrack6 for IPv6 that contain the VM IP addresses that do not need to be tracked.
When a security group contains a rule allowing all protocols from 0.0.0.0/0 (IPv4) or ::/0 (IPv6), then all the IPv4 and/or IPv6 addresses of the VM are added to these ipsets.
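
A minimal sketch of the "allow all" check described above. It assumes the rules are available as hypothetical (protocol, CIDR) pairs; the real script parses its own rule format, and the name `notrack_families` is illustrative only.

```python
import ipaddress

# Assumed rule shape for illustration: ("all" | "tcp" | ..., "cidr").
ALL_V4 = ipaddress.ip_network("0.0.0.0/0")
ALL_V6 = ipaddress.ip_network("::/0")


def notrack_families(rules):
    """Return the IP versions (4 and/or 6) covered by an "allow all" rule."""
    families = set()
    for protocol, cidr in rules:
        if protocol != "all":
            continue  # a protocol-specific rule still needs conntrack
        net = ipaddress.ip_network(cidr, strict=False)
        if net == ALL_V4:
            families.add(4)
        elif net == ALL_V6:
            families.add(6)
    return families
```

Under this assumption, a VM's IPv4 addresses would go into cs_notrack only when 4 is in the result, and its IPv6 addresses into cs_notrack6 only when 6 is.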

The following rules are added into iptables table raw chain PREROUTING:

iptables -t raw -A PREROUTING -m set --match-set cs_notrack dst -j NOTRACK
iptables -t raw -A PREROUTING -m set --match-set cs_notrack src -j NOTRACK
ip6tables -t raw -A PREROUTING -m set --match-set cs_notrack6 dst -j NOTRACK
ip6tables -t raw -A PREROUTING -m set --match-set cs_notrack6 src -j NOTRACK
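
The four commands above follow one pattern: one rule per (address family, direction) pair. A small sketch that generates them, assuming a hypothetical helper name `notrack_rules`; it only builds the command strings and does not execute anything.

```python
def notrack_rules():
    """Build the raw-table NOTRACK commands for both address families."""
    commands = []
    for binary, ipset_name in (("iptables", "cs_notrack"), ("ip6tables", "cs_notrack6")):
        for direction in ("dst", "src"):
            commands.append(
                f"{binary} -t raw -A PREROUTING "
                f"-m set --match-set {ipset_name} {direction} -j NOTRACK"
            )
    return commands
```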

The iptables matchers -m state --state NEW are removed as they are not needed for several reasons:

  • they block the allowed traffic if the connection is not tracked
  • the rest of the matcher is explicit enough to allow the traffic that was specified in the security group
  • conntrack lookups can be very expensive at high packet-per-second rates when the connection tracking table holds tens of millions of records

The -m state --state ESTABLISHED,RELATED rules are only placed at the end of the VM -def chain, as the last resort rule before the final decision to drop the packet. The goal is to try explicit matchers as much as possible.

The behavior of the -VM chain that contains user-defined rules was modified:

  • rules no longer return traffic; the only possible rule action is ACCEPT. If a packet doesn't match any rule, it falls back to the -def chain, where it is accepted if it belongs to an existing connection and dropped otherwise.
  • the above-mentioned -m state --state NEW matchers are removed.

Since the VM -def chain is populated with rules for each NIC, and there is no place in the code to inject a final unconditional -j DROP, I had to resort to dropping traffic that matches each VM network interface at the end of each set of interface-specific rules.
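
A sketch of the per-interface tail of the -def chain: first the last-resort conntrack check, then the interface-specific drop. The function name `nic_tail_rules` is hypothetical; the rule text follows the iptables-save listing given later in this conversation.

```python
def nic_tail_rules(chain, vif):
    """Return the closing rules for one NIC: conntrack fallback, then DROP."""
    directions = ("--physdev-in", "--physdev-out")
    prefix = "-A {chain} -m physdev {direction} {vif} --physdev-is-bridged"
    rules = []
    # Last resort before dropping: accept packets of known connections.
    for direction in directions:
        rules.append(
            prefix.format(chain=chain, direction=direction, vif=vif)
            + " -m state --state RELATED,ESTABLISHED -j ACCEPT"
        )
    # Interface-scoped drop, so a second NIC's rules can follow in the same chain.
    for direction in directions:
        rules.append(prefix.format(chain=chain, direction=direction, vif=vif) + " -j DROP")
    return rules
```

Because each DROP is scoped to one vif, traffic for another NIC of the same VM keeps traversing the chain until it reaches its own block of rules.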

A minor refactoring is done:

  • The function split_ips_by_family() now takes one or more arguments, each of which can be either a ;-separated string or any other type that can be parsed by Python's ipaddress.ip_address() function.
    The function splits ;-separated strings when it encounters them and removes empty elements and '0' literals (they indicate an empty IP address list for some reason).
    As a result, it returns a tuple containing a list of IPv4 addresses and a list of IPv6 addresses. Therefore, the function is backwards compatible with the previous behavior.
  • Some lines of code that duplicated the functionality of the updated split_ips_by_family() are removed.
  • The function add_to_ipset() uses the -! flag, which silently ignores the addition of an element that already exists in the ipset, or the removal of one that doesn't. It will still fail if the requested ipset does not exist. This change makes ipset add calls idempotent.
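
A sketch of how split_ips_by_family() might look, based only on the description above and the diff fragment quoted later in this conversation. The first loop mirrors that fragment; the second loop, the `_sketch` suffix, and the choice to let invalid input raise ValueError are assumptions, not the merged code.

```python
import ipaddress


def split_ips_by_family_sketch(*args):
    """Split mixed IP inputs into (IPv4 list, IPv6 list) of strings."""
    # First pass: flatten ;-separated strings, dropping '' and '0' placeholders.
    ips = []
    for i in args:
        if type(i) is str:
            ips += [ip for ip in i.split(';') if ip != '' and ip != '0']
        else:
            ips.append(str(i))

    # Second pass: sort by address family using the ipaddress module.
    ip4s, ip6s = [], []
    for ip in ips:
        # ip_address() raises ValueError for anything that is not a valid address.
        if ipaddress.ip_address(ip).version == 4:
            ip4s.append(ip)
        else:
            ip6s.append(ip)
    return ip4s, ip6s
```

For example, `split_ips_by_family_sketch("10.0.0.1;;0;2001:db8::1")` yields one IPv4 and one IPv6 address, with the empty element and the '0' literal discarded.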

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested:

  • starting a VM
  • changing the rules on the fly: add/remove "allow all" rule, add more specific rules such as allow a specific TCP port range.
  • migrating the VM to a host with these changes deployed
  • migrating the VM from the host with these changes deployed to a host with "vanilla" security groups script, and back
  • both ingress and egress security groups behavior is tested.

How did you try to break this feature and the system with this change?

  • Test with only egress traffic allowed (no ingress rules)
  • Test with only ingress traffic allowed (only one egress rule allowing traffic to a non-existing IP address, that makes every other egress traffic dropped)
  • Test with more-specific rules, e.g. allow specific ports, or allow only IPv6 traffic

The conntrack is disabled if the security group allows all traffic.
Also, refactored the code a little.
@boring-cyborg boring-cyborg bot added component:networking Python Warning... Python code Ahead! labels Mar 20, 2025
@codecov

codecov bot commented Mar 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.00%. Comparing base (653b973) to head (f1ff535).
⚠️ Report is 828 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10594      +/-   ##
============================================
- Coverage     16.00%   16.00%   -0.01%     
- Complexity    13104    13105       +1     
============================================
  Files          5651     5651              
  Lines        495870   495870              
  Branches      60049    60049              
============================================
- Hits          79370    79361       -9     
- Misses       407638   407652      +14     
+ Partials       8862     8857       -5     
Flag Coverage Δ
uitests 4.00% <ø> (ø)
unittests 16.84% <ø> (-0.01%) ⬇️


Contributor

@wido wido left a comment

This is really great! I don't have a test env now, but can you confirm that you have tested this in your environment and verified it works?

@phsm
Contributor Author

phsm commented Mar 21, 2025

Well, I tested it somewhat extensively: single IPv4, single IPv4+IPv6, IPv4+IPv6 plus additional IPv4 and IPv6 addresses, multiple NICs, different IPv4+IPv6 rule combinations. Looks good to me.

However, I wouldn't trust me on this entirely, it is quite hard to not make a mistake with such a complicated script.
I could have missed something. So it would be great if someone could test this extensively too.

if vm_ip:
    execute("iptables -A " + vmchain_default + " -m physdev --physdev-is-bridged --physdev-in " + vif + " -m set ! --match-set " + vmipsetName + " src -j DROP")
    execute("iptables -A " + vmchain_default + " -m physdev --physdev-is-bridged --physdev-out " + vif + " -m set ! --match-set " + vmipsetName + " dst -j DROP")
    execute("iptables -A " + vmchain_default + " -m physdev --physdev-is-bridged --physdev-in " + vif + " -m set --match-set " + vmipsetName + " src -p udp --dport 53 -j RETURN ")
Member

@weizhouapache weizhouapache Mar 21, 2025

@phsm
I think we should not change it

RETURN means rules in other chains will be checked. But they will not be checked if this is changed to ACCEPT

Member

btw: I did not look into the changes. Correct me if I am wrong.

This script is very important for public cloud providers, we have to be very careful

Contributor Author

@phsm phsm Mar 24, 2025

Thanks for scrutinizing my PR, I understand how critical that file is. So the more eyes look at it (and test), the better.
Let me show the examples of the current and proposed security group rules so it becomes more clear.

First, let's look at what the rules for a single virtual machine look like with the current security group implementation:

:FORWARD ACCEPT [0:0]
# a set of rules like that for every shared network bridge (VLAN)
# in/out traffic forwarded via this bridge always passes into an individual "per bridge" chain
-A FORWARD -o brbond0-304 -m physdev --physdev-is-bridged -j BF-brbond0-304
-A FORWARD -i brbond0-304 -m physdev --physdev-is-bridged -j BF-brbond0-304
# Everything that was not explicitly accepted or dropped in the "per bridge" chain gets returned here
# and dropped
-A FORWARD -o brbond0-304 -j DROP
-A FORWARD -i brbond0-304 -j DROP

:BF-brbond0-304 - [0:0]
# The early accept of packets belonging to established and related connections.
# The goal of my changes is to eliminate conntrack as much as possible, therefore my changes do not have this rule
-A BF-brbond0-304 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A BF-brbond0-304 -m physdev --physdev-is-in --physdev-is-bridged -j BF-brbond0-304-IN
-A BF-brbond0-304 -m physdev --physdev-is-out --physdev-is-bridged -j BF-brbond0-304-OUT
-A BF-brbond0-304 -m physdev --physdev-out bond0.304 --physdev-is-bridged -j ACCEPT

:BF-brbond0-304-IN - [0:0]
# rules for other VMs are omitted
-A BF-brbond0-304-IN -m physdev --physdev-in vnet152 --physdev-is-bridged -j i-2-104-def

:BF-brbond0-304-OUT - [0:0]
# rules for other VMs are omitted
-A BF-brbond0-304-OUT -m physdev --physdev-out vnet152 --physdev-is-bridged -j i-2-104-def


# At this point we can safely claim that the traffic concerning a single VM always reaches the VM '-def' chain.
# If anything is returned from the '-def' chain, it traverses back to the previous chains until it gets dropped in the FORWARD chain
# Except the outgoing traffic to the Internet due to an unexpected rule '-A BF-brbond0-304 -m physdev --physdev-out bond0.304 --physdev-is-bridged -j ACCEPT'


:i-2-104-def - [0:0]
-A i-2-104-def -m state --state RELATED,ESTABLISHED -j ACCEPT <- unnecessary, the same rule exists earlier
-A i-2-104-def -p udp -m physdev --physdev-in vnet152 --physdev-is-bridged -m udp --sport 68 --dport 67 -j ACCEPT
-A i-2-104-def -p udp -m physdev --physdev-out vnet152 --physdev-is-bridged -m udp --sport 67 --dport 68 -j ACCEPT
-A i-2-104-def -p udp -m physdev --physdev-in vnet152 --physdev-is-bridged -m udp --sport 67 -j DROP
-A i-2-104-def -m physdev --physdev-in vnet152 --physdev-is-bridged -m set ! --match-set i-2-104-VM src -j DROP
-A i-2-104-def -m physdev --physdev-out vnet152 --physdev-is-bridged -m set ! --match-set i-2-104-VM dst -j DROP
# This traffic in the following 2 rules gets returned into the previous chains, which I believe triggers the rule
# '-A BF-brbond0-304 -m physdev --physdev-out bond0.304 --physdev-is-bridged -j ACCEPT'
# effectively always allowing the DNS, regardless if the security group denies all the traffic or not.
-A i-2-104-def -p udp -m physdev --physdev-in vnet152 --physdev-is-bridged -m set --match-set i-2-104-VM src -m udp --dport 53 -j RETURN
-A i-2-104-def -p tcp -m physdev --physdev-in vnet152 --physdev-is-bridged -m set --match-set i-2-104-VM src -m tcp --dport 53 -j RETURN
-A i-2-104-def -m physdev --physdev-in vnet152 --physdev-is-bridged -m set --match-set i-2-104-VM src -j i-2-104-VM-eg
-A i-2-104-def -m physdev --physdev-out vnet152 --physdev-is-bridged -j i-2-104-VM

:i-2-104-VM - [0:0]
# the incoming traffic to the VM gets into this chain
# The test VM has only one rule there: accept all protocols from 0.0.0.0/0
-A i-2-104-VM -j ACCEPT
-A i-2-104-VM -j RETURN

:i-2-104-VM-eg - [0:0]
# the outgoing traffic from the VM gets into this chain
# the test VM has no rules there:
-A i-2-104-VM-eg -j ACCEPT

Now, let's look at the rules that my changes produce:

*raw
:PREROUTING ACCEPT [0:0]
# Everything in the cs_notrack ipset gets excluded from conntrack
-A PREROUTING -m set --match-set cs_notrack dst -j NOTRACK
-A PREROUTING -m set --match-set cs_notrack src -j NOTRACK

:FORWARD ACCEPT [0:0]
-A FORWARD -o brbond0-302 -m physdev --physdev-is-bridged -j BF-brbond0-302
-A FORWARD -i brbond0-302 -m physdev --physdev-is-bridged -j BF-brbond0-302
-A FORWARD -o brbond0-302 -j DROP
-A FORWARD -i brbond0-302 -j DROP

:BF-brbond0-302 - [0:0]
-A BF-brbond0-302 -m physdev --physdev-is-in --physdev-is-bridged -j BF-brbond0-302-IN
-A BF-brbond0-302 -m physdev --physdev-is-out --physdev-is-bridged -j BF-brbond0-302-OUT
-A BF-brbond0-302 -m physdev --physdev-out bond0.302 --physdev-is-bridged -j ACCEPT

:BF-brbond0-302-IN - [0:0]
-A BF-brbond0-302-IN -m physdev --physdev-in vnet4 --physdev-is-bridged -j i-2-340-def

:BF-brbond0-302-OUT - [0:0]
-A BF-brbond0-302-OUT -m physdev --physdev-out vnet4 --physdev-is-bridged -j i-2-340-def

:i-2-340-def - [0:0]
-A i-2-340-def -p udp -m physdev --physdev-in vnet4 --physdev-is-bridged -m udp --sport 68 --dport 67 -j ACCEPT
-A i-2-340-def -p udp -m physdev --physdev-out vnet4 --physdev-is-bridged -m udp --sport 67 --dport 68 -j ACCEPT
-A i-2-340-def -p udp -m physdev --physdev-in vnet4 --physdev-is-bridged -m udp --sport 67 -j DROP
-A i-2-340-def -m physdev --physdev-in vnet4 --physdev-is-bridged -m set ! --match-set i-2-340-VM src -j DROP
-A i-2-340-def -m physdev --physdev-out vnet4 --physdev-is-bridged -m set ! --match-set i-2-340-VM dst -j DROP
# In the following 2 rules: instead of passing the traffic all the way back to the BF-brbond0-302 chain to accept it, I accept it here.
-A i-2-340-def -p udp -m physdev --physdev-in vnet4 --physdev-is-bridged -m set --match-set i-2-340-VM src -m udp --dport 53 -j ACCEPT
-A i-2-340-def -p tcp -m physdev --physdev-in vnet4 --physdev-is-bridged -m set --match-set i-2-340-VM src -m tcp --dport 53 -j ACCEPT
-A i-2-340-def -m physdev --physdev-in vnet4 --physdev-is-bridged -m set --match-set i-2-340-VM src -j i-2-340-VM-eg
-A i-2-340-def -m physdev --physdev-out vnet4 --physdev-is-bridged -j i-2-340-VM
# The following 2 rules: the last resort rule before dropping the packet: check the conntrack
-A i-2-340-def -m physdev --physdev-in vnet4 --physdev-is-bridged -m state --state RELATED,ESTABLISHED -j ACCEPT
-A i-2-340-def -m physdev --physdev-out vnet4 --physdev-is-bridged -m state --state RELATED,ESTABLISHED -j ACCEPT
# The following 2 rules: drop any other traffic concerning this vNIC
# this is done to support multi-NIC, as the same set of rules will be added below for the other NIC if the VM has one
-A i-2-340-def -m physdev --physdev-in vnet4 --physdev-is-bridged -j DROP
-A i-2-340-def -m physdev --physdev-out vnet4 --physdev-is-bridged -j DROP

My intention in reworking the logic was to reduce implicit chain returns, where you do not know what will happen to a packet once it is returned.
So I made traffic flow into the VM '-def' chain and never return from it, so that it is clear what happens to a packet just by looking at the '-def' chain.
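
To make the RETURN-versus-terminal-target distinction concrete, here is a toy model of chain traversal. It is a sketch only: the chain names, match functions, and two rule sets are hypothetical and heavily simplified, and do not reproduce the real rulesets listed above.

```python
def evaluate(chains, chain, packet):
    """Walk one chain; return 'ACCEPT', 'DROP', or 'RETURN' (no verdict yet)."""
    for matches, target in chains[chain]:
        if not matches(packet):
            continue
        if target in ("ACCEPT", "DROP", "RETURN"):
            return target
        # Any other target is a jump into a user-defined chain.
        result = evaluate(chains, target, packet)
        if result != "RETURN":
            return result
        # The sub-chain returned: continue with the next rule in this chain.
    return "RETURN"  # fell off the end of the chain


def verdict(chains, packet, policy="ACCEPT"):
    """Evaluate from FORWARD; the chain policy applies if no rule decides."""
    result = evaluate(chains, "FORWARD", packet)
    return policy if result == "RETURN" else result


def always(packet):
    return True


def is_dns(packet):
    return packet.get("dport") == 53


def to_uplink(packet):
    return packet.get("out") == "uplink"


# Style 1: the VM chain hands DNS back to its parent with RETURN.
returning = {
    "FORWARD": [(always, "BF"), (always, "DROP")],
    "BF": [(always, "def"), (to_uplink, "ACCEPT")],
    "def": [(is_dns, "RETURN"), (always, "DROP")],
}
# Style 2: the VM chain decides every packet itself.
terminal = {
    "FORWARD": [(always, "BF"), (always, "DROP")],
    "BF": [(always, "def"), (to_uplink, "ACCEPT")],
    "def": [(is_dns, "ACCEPT"), (always, "DROP")],
}
```

In the `returning` model, the fate of a DNS packet depends on whichever rules follow the jump in the parent chains; in the `terminal` model, reading the `def` chain alone is enough to know the outcome.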

Member

@phsm
thanks for the reply.
as far as I remember, both ingress rules and egress rules should be checked. this is why I questioned.

Contributor Author

Update regarding the missing '-j DROP' rules in the FORWARD chain:
I double-checked, and those drop rules are present in the chain. It seems I just grep-ed wrongly while preparing the listings for the previous post.

I edited the rules listing in the previous post to contain these rules.

@weizhouapache
Member

cc @loth @kriegsmanj

@sureshanaparti sureshanaparti added this to the 4.20.2 milestone Jun 5, 2025
@weizhouapache
Member

Moving to 4.23 milestone
cc @harikrishna-patnala @DaanHoogland

@weizhouapache weizhouapache modified the milestones: 4.20.2, 4.23 Sep 8, 2025
@DaanHoogland DaanHoogland changed the base branch from 4.20 to main December 12, 2025 10:43
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✖️ debian ✔️ suse15. SL-JID 16023

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-14977)

@phsm
Contributor Author

phsm commented Dec 15, 2025

[SF] Trillian Build Failed (tid-14977)

I see the test had failed here. Is there a way for me to see the details?

@DaanHoogland
Contributor

[SF] Trillian Build Failed (tid-14977)

I see the test had failed here. Is there a way for me to see the details?

this is not a problem with the tests but a capacity problem in the backend lab. I restarted the test job.

@blueorangutan

[SF] Trillian test result (tid-14995)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 55931 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10594-t14995-kvm-ol8.zip
Smoke tests completed. 150 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Contributor

@wido @loth @kriegsmanj , can any of you test/verify this PR, or ask any of your colleagues please?

@phsm
Contributor Author

phsm commented Dec 16, 2025

@wido @loth @kriegsmanj , can any of you test/verify this PR, or ask any of your colleagues please?

The last two people you mentioned are colleagues of mine. We have been using this script in production for at least the last 6 months.

The additional testing from @wido would be appreciated indeed.

@phsm
Contributor Author

phsm commented Dec 16, 2025

For those who are willing to test:

This script can be placed on one of the nodes in /usr/share/cloudstack-common/scripts/vm/network/security_group.py, overwriting the existing one.

Key points to test:

  1. The securitygroup-enabled VM shall not be able to spoof IPv4 or IPv6 addresses.
    Try assigning another address from the same subnet to the VM (not the one that CloudStack assigned to it), and try to send some traffic from it / to it.

  2. The conntrack table shall not show entries for this VM's traffic:
    Ensure you have an Ingress security group rule that allows "ALL from 0.0.0.0/0", and "ALL from ::/0" (IPv6 is handled separately).
    For example, install Nginx on the test VM, and run "apache benchmark" against it. It will produce a lot of connections.
    During the test, running conntrack -L on the host machine shall not show tons and tons of connections to the VM IP.

  3. The conntrack is used when it's needed:
    Then, you may change the security group to allow only TCP port 80 from 0.0.0.0/0, ::/0, and remove the "allow all" rule.
    In this configuration, conntrack is required, so conntrack -L shall show these connections.
    The outgoing connections from the VM, e.g. "apt update", shall still work.
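
The conntrack checks in the steps above can be scripted. This is a sketch that counts entries mentioning a VM address in text captured from conntrack -L; the function name `count_conntrack_entries` is hypothetical and the sample lines are made up to approximate the tool's usual output format.

```python
import re


def count_conntrack_entries(conntrack_output, vm_ip):
    """Count conntrack lines that have vm_ip as a src= or dst= field."""
    # Require whitespace (or line end) after the address so that
    # 10.1.1.5 does not also match 10.1.1.50.
    pattern = re.compile(r"(?:^|\s)(?:src|dst)=" + re.escape(vm_ip) + r"(?=\s|$)")
    return sum(1 for line in conntrack_output.splitlines() if pattern.search(line))


# Made-up sample resembling `conntrack -L` output.
SAMPLE = """\
tcp      6 431999 ESTABLISHED src=203.0.113.7 dst=10.1.1.5 sport=51000 dport=80 src=10.1.1.5 dst=203.0.113.7 sport=80 dport=51000 [ASSURED] mark=0 use=1
udp      17 29 src=10.1.1.50 dst=198.51.100.53 sport=40000 dport=53 src=198.51.100.53 dst=10.1.1.50 sport=53 dport=40000 mark=0 use=1
"""
```

With the "allow all" rule in place the count for the VM address would be expected to stay at or near zero during the benchmark, and to grow once the rule is replaced by the port-80-only rule.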

if type(i) is str:
    ips += [ip for ip in i.split(';') if ip != '' and ip != '0']
else:
    ips.append(str(i))
Contributor

This is just me, but I would try to check if it's a valid IPv4/IPv6 here prior to adding it to the list. Not check the list afterwards, but that's just me.

If it's valid, add the IP object to the list

Contributor Author

I've rechecked this function.

It has 2 for loops:

The first does the initial parse and splits some of the elements. The result is stored in the intermediate ips list, which is not returned. This is the for loop you're referring to.

The second for loop does the precise sort into lists ip4s and ip6s using python ipaddress library. And those are the lists returned by the function.

Of course I could merge it into a single for loop, but back then I decided to keep it more verbose for readability.

Contributor

It's just me how I would code it. I wouldn't put a str() in the list, I would already cast it to an IP object, but that's just me.

You can leave it as-is.
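
For comparison, the alternative raised in this review thread (validate while parsing and store IP objects rather than strings) could look roughly like the sketch below. The name `split_ip_objects` is hypothetical, and this is not the code that was merged.

```python
import ipaddress


def split_ip_objects(*args):
    """Single-pass variant: return (IPv4Address list, IPv6Address list)."""
    ip4s, ip6s = [], []
    for arg in args:
        parts = arg.split(';') if isinstance(arg, str) else [arg]
        for part in parts:
            if part in ('', '0'):
                continue  # placeholders for an empty address list
            # Validation happens here: invalid input raises ValueError.
            ip = ipaddress.ip_address(part)
            (ip4s if ip.version == 4 else ip6s).append(ip)
    return ip4s, ip6s
```

The trade-off is the one discussed here: a single loop is more compact and fails early on bad input, while the two-loop form keeps parsing and classification visibly separate.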

@wido
Contributor

wido commented Dec 17, 2025

This is really great work and much needed for many situations! Now, I would love to test this, but I am currently lacking a testing environment on which I can do so easily. This will take me a couple of weeks before I get a new environment delivered on which I can test.

Looking at the code, I don't see any major problems, to be honest; it seems well thought out and tested.

@DaanHoogland
Contributor

@wido , that sounds like an LGTM from you and a “let’s trust the submitter’s tests” ..? cc @weizhouapache

should we merge as is?

@phsm
Contributor Author

phsm commented Dec 17, 2025

I certainly don't mind waiting a couple more weeks until Wido is able to test it. So, unless there's a rush, we could wait longer.

@wido
Contributor

wido commented Dec 17, 2025

@DaanHoogland I have no objections to merging it. As it looks right now, I will not have the new test environment before the end of Jan 2026 (hardware deliveries take a long time....) and that would be a long wait.

I am OK with merging this based on the code. I know this script really well and these changes are sane.

@DaanHoogland
Contributor

Whenever you feel like @wido . this will be in main and no new release will be out till half of next year (alternatively it can be rebased on 4.20 or 4.22)

@wido
Contributor

wido commented Dec 18, 2025

Whenever you feel like @wido . this will be in main and no new release will be out till half of next year (alternatively it can be rebased on 4.20 or 4.22)

I am ok with merging it. Code seems good and IF (IF!) there are issues they will be caught during other tests as well

@DaanHoogland
Contributor

Whenever you feel like @wido . this will be in main and no new release will be out till half of next year (alternatively it can be rebased on 4.20 or 4.22)

I am ok with merging it. Code seems good and IF (IF!) there are issues they will be caught during other tests as well

;) that was an invitation to merge. Any objections @weizhouapache ?

@wido
Contributor

wido commented Dec 18, 2025

If @weizhouapache agrees this can be merged

@weizhouapache
Member

@DaanHoogland @wido
looks good to me

@wido wido merged commit bb5da0e into apache:main Dec 18, 2025
24 checks passed
6 participants