Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support exposing UDP ports when using Network Load Balancer #4767

Closed
dannyrandall opened this issue Apr 17, 2023 · 15 comments
Closed

Support exposing UDP ports when using Network Load Balancer #4767

dannyrandall opened this issue Apr 17, 2023 · 15 comments
Labels
area/svc Issues about services. size/S We should be able to deliver roughly 2 small issues in a sprint. type/enhancement Issues that are improvements for existing features.

Comments

@dannyrandall
Copy link
Contributor

From #1783 (comment).

Copilot currently supports exposing TCP ports through an NLB, and it would be nice to support UDP ports. Additionally, it would be nice to specify a range of UDP ports to expose through the NLB instead of listing a range one-by-one.

@dannyrandall dannyrandall added type/enhancement Issues that are improvements for existing features. size/S We should be able to deliver roughly 2 small issues in a sprint. labels Apr 17, 2023
@h5aaimtron
Copy link

My use cases is specific to a voice chat application (WebRTC) that requires a range of UDP ports to be opened to facilitate the SFU (Selective Forwarding Unit) running. I could see this being true beyond my specific use-case as other chat systems, games, etc. could utilize port ranges.

@tjhorner
Copy link
Contributor

tjhorner commented Jun 1, 2023

I am also interested in this functionality, so I went ahead and added udp to the list of valid NLB protocols in this file and passed port: 80/udp to nlb in my service manifest.

The deployment appears to have worked:

image

However, I haven't written any tests to determine if there are side effects I should be aware of (i.e., health check fail?). But it does work in this limited scenario.

@dannyrandall
Copy link
Contributor Author

@tjhorner thanks for that!☺️ I think you're pretty much right that the health check is the only thing missing - We'd love a PR if you're interested! I think the only other things you'll have to update are:

@tjhorner
Copy link
Contributor

tjhorner commented Jun 2, 2023

Thank you for the pointers! I'm not super familiar with this codebase so they are very helpful. I'll take a crack at a PR for this sometime next week, would love to help get this out the door.

@dannyrandall
Copy link
Contributor Author

Awesome! Feel free to ask any questions here😊.

@dannyrandall dannyrandall added the area/svc Issues about services. label Jun 5, 2023
@dannyrandall
Copy link
Contributor Author

@tjhorner I'm so sorry!😅 I just realized I didn't think enough about UDP health checks - NLB doesn't support UDP health checks, which makes total sense haha. So I think all of the changes I mentioned to support UDP health checks are totally unnecessary!

@tjhorner
Copy link
Contributor

I made some progress on this today; it turns out most everything works out of the box with the one change I made, except that I also needed to update the exposed port on the ECS task to listen on the correct protocol. It was hardcoded to tcp previously:

return []ExposedPort{
{
Port: targetPort,
Protocol: "tcp",
ContainerName: targetContainer,
},
}, nil

I'll do some more manual testing then try to write some automated tests then open the PR!

@tjhorner
Copy link
Contributor

tjhorner commented Jun 11, 2023

It also looks like some additional security group rules will be required in order for UDP traffic to reach the task containers since NLB UDP listeners preserve the source IP (this was a pleasant surprise!), so the existing NLB->Task ingress rules don't work as expected.

How would you suggest this be done? Correct me if I'm wrong, but I don't see any existing parameters to take inspiration/best practice from. I think providing a CIDR block of allowed source IPs alongside the port definition would work well. Something like this:

nlb:
  port: 8080/udp
  allow_from: 0.0.0.0/0

It can alternatively be provided via a YAML patch or addon, but this doesn't seem ideal.

Since the NLB listens on the public Internet I think 0.0.0.0/0 would actually be a reasonable default here. It's safe enough since the task containers don't have a public IP, so they can't be accessed directly anyway, only through the NLB. (I just double-checked and "Auto-assign public IP" actually enabled with no way to turn it off, oops. So not sure what to do in this case 🤔)

For completeness, here are the YAML patches I'm using to add Internet access to the security group:

- op: add
  path: /Resources/NLBSecurityGroup/Properties/SecurityGroupIngress/-
  value:
    CidrIp: 0.0.0.0/0
    Description: Ingress to allow all RADIUS UDP ports from Internet
    FromPort: 1812
    ToPort: 1812
    IpProtocol: UDP

Previously I was also disabling AssignPublicIp with a YAML patch but since I don't have NAT Gateway enabled in my configuration (I am importing an existing VPC/subnets) it failed to retrieve the image etc.

@Lou1415926
Copy link
Contributor

Lou1415926 commented Jun 15, 2023

Related: #3281, #2918 (comment)

@Lou1415926
Copy link
Contributor

Hey @tjhorner 👋🏼 !! Sorry for jumping in late! I've seen your PR - thanks a bunch for the awesome contribution. However before we review your PR, here are some information that I'd like to gather from you. Then, we can work towards a design for this feature, and finally finish up the PR. Does that sound good to you 👀 ?

We removed UDP from our NLB release mostly because of the issue that you've found - source IP preservation + auto assigned public IPs which potentially exposes your tasks' public IPs.

This prompted us the unanswered question of how users like to use UDP - do you need UDP for your public-facing service (Load-Balanced Web Service) or for your internal service (Backend Service)?

From what you've said here, it looks like your UDP is for public usage. Can I have a high-level understanding of what that service do?

In addition, is UDP the only type of traffic that your service is listening to?

@tjhorner
Copy link
Contributor

Sure, sounds good.

The service is public-facing. We run a RADIUS authentication server for our B2B customers which listens on UDP port 1812, but I could also see this feature being useful for game servers, DNS, QUIC, etc. We also run on a few other UDP ports depending on environment (also for RADIUS), and the only TCP port is for the healthcheck, which isn't exposed publicly or via the load balancer.

@Lou1415926
Copy link
Contributor

Hey @tjhorner ! Sorry for getting back to you so late : ( \

I think your existing PRs already solves a part of the problems, and we can just use it as is . I have no intention of dumping all of the below ⬇️ designs for you to implement (but if you are interested, it will be 100% super appreciated ❤️!).

I am posting the proposals here to see if it sounds good to you as the user (and potentially other users that happen to stumble on this). Let me know if the design makes sense!

If this sounds good, then I can work on an implementation plan than splits the project into smaller tasks (again, at least one small task is already solved by your PR, so feel free to say "I'm done, please take it from here")

Background

As you have noticed, if your service has network.placement: 'public', then it will have AssignPublicIps enabled. This gives your tasks public IPs, enabling it to pull images from ECS, put logs into CloudWatch, etc.

This was not an issue previously, because the security group in front of the service blocks all public ingress. That is, as a random client, even if I know the tasks' public IP addresses, I won't be able connect directly to it via. the public IP.

However, for UDP users, because of "source ip preservation", the security group has to allow-list a bunch of source IPs directly. In this case, clients from these source IPs will be able to connect to the task directly via. the public IPs. This becomes a larger problem if the service is a public-facing UDP server - it has to allow list 0.0.0.0/0, which exposes the public IPs to the whole world.

Today, for services to not have public IPs, user can either:

  1. Specify network.placement: 'private'. This will put the tasks into private subnets, uncheck the AssignPublicIps option, and create a NAT Gateway so that the tasks can make egress traffic to the internet.
  2. Use YAML patch to uncheck the AssignPublicIps option, but this will prevent the task from pulling images (either from ECR or other registry) or putting logs into CloudWatch. In this case, users need VPC endpoints.

In summary, if the protocol is UDP and allowed_source_ips are configured, on the infrastructure level there are four types of configurations:

AssignPublicIPs Additional Resource Exposed via public IPs? Can pull images? Internet egress ? Note
1 ✔️ none 🔴 yes 🟢 🟢 cheap
2 ✖️ none 🟢 no 🔴 no 🔴 no cannot deploy
3 ✖️ NAT Gateway 🟢 no 🟢 🟢 extra costs; supported already via. network.placement: 'private'
4 ✖️ VPC endpoints 🟢 no 🟡 ECR images only 🔴 no cheaper than NAT

Proposals

[P.1] Specify UDP protocol as /udp

nlb:
  port: 5160/udp

I think your PR #4980 already does this!

[P.2] Specify allowed_source_ips in manifest

nlb:
  port: 5160/udp
  allowed_source_ips: [217.243.37.126/32, 217.333.37.126/32]

This will create an NLB that listens on port 5160 for UDP traffic. Traffic from the allowed source IPs will cut through the NLB and reach the fargate service.
This is similar to http.allowed_source_ips. Besides, this field will work for tcp and tls as well.

[P.2.1] allowed_source_ips is by default 0.0.0.0/0

If the field is empty or not specified, then we assume that the service is public facing and open to the world.

[P.3] Error out if network.placement is empty

By default, network.placement will be 'public' if not specified. With allowed_source_ips being 0.0.0.0/0, this will become "configuration 1" in the table, which exposes tasks' public IPs to the allow-listed source IPs.

This proposal states that when we see nlb.port: <port>/udp, and network.placement is empty, we will:

  1. Error out.
  2. Explain why this is insecure.
  3. Recommend user to either:
    1. Specify network.placement: 'private' if their service needs to connect to the internet, or their image is from non-ECR registry. This is "configuration 3" from the table.
    2. Or, explicitly specify network.placement: 'public':
      1. Ask users to create VPC endpoints as environment addons. For ECS Fargate, they need ECR endpoint and S3 gateway endpoint, and potentially CW - doc. This is the CFN template for the four VPC endpoints.
      2. Then deploy the service with network.placement: 'public'.

[P.4.1] Turn off AssignPublicIP seeing network.placement: 'public'

This proposal states that when we see nlb.port: <port>/udp, and network.placement is 'public' , we will turn of AssignPublicIP so that the tasks do not have public IP addresses. Without VPC endpoints, this is "configuration 2" from the table; if VPC endpoints are already created in the environment addons, then this is "configuration 4" from the table.

By this proposal, Copilot is "secure by default". However, the users can still force assigning public IPs by using YAML patch.

@tjhorner
Copy link
Contributor

As a user, these proposals seem great and pretty ergonomic. I appreciate the care taken to make sure everything is following best practices and is secure by default.

Yes, looks like my PR covers [P.1].

My main concern with these proposals (which turned out to be a non-issue) was that the traffic sent from the allowed IP addresses to the NLB would not reach the underlying tasks since they do not have public IP addresses (how the source IP preservation works is kind of opaque, so it wasn't obvious if this is the case or not). But I just did some testing in an environment with a Copilot-generated VPC and NAT gateway, and placed the service in private subnets with network.vpc.placement: "private" (along with the additional security group rules, etc.)

When trying to reach the service via the NLB it appears to work just fine, despite my concerns. If I'm understanding correctly, this is configuration 3 in the table you provided.

From a security standpoint, this seems like the ideal and most secure configuration for my use case, though the extra costs associated with additional Elastic IPs, NAT Gateway, VPCs, etc. on each environment is a little concerning. However, there are ways to reduce costs such as by sharing VPCs between environments so it's not too big of a deal.

As much as I would love to help out, my plate is pretty full, so I will hand it off to y'all to continue it.

How will this feature be introduced? Will all of the proposals be implemented before the feature is released, or should I expect them to be released as each is developed? Basically, I'm wondering if my PR will be merged and included in a release before the rest of the proposals are implemented. (At work, we are using my forked version of Copilot in addition to some YAML patches and addons to achieve the deployment configuration we want, so it would be nice to be able to go back to the official Copilot release sooner than later haha)

@Lou1415926
Copy link
Contributor

@tjhorner Yay thanks for the feedback and thank you for testing them out!

When trying to reach the service via the NLB it appears to work just fine, despite my concerns. If I'm understanding correctly, this is configuration 3 in the table you provided.

Yes this is configuration 3. NLB routes traffic based on tasks' private IPs within network. If you are interested, you can take a look at the NLB's target group, and you will see that the targets are the tasks private IP addresses.

How will this feature be introduced? Will all of the proposals be implemented before the feature is released, or should I expect them to be released as each is developed?

I'd like to deliver a complete feature, so I'd say we will officially announce it after all proposals are implemented. If we are missing any of the proposal, users might face a lot of caveats (security risks, a lot of churning using YAML patch...) that they will have a bad time with 😂.

However! I don't see why we can't merge your PR now. The code can be shipped along with our next release (and you can switch back to official Copilot release ASAP) whether we've completed the rest of the proposals or not. At this point, Copilot is not supporting UDP yet, but your code is already lying in the codebase. We will wait until the features are completed to announce and blog about it.

mergify bot pushed a commit that referenced this issue Jul 14, 2023
This PR adds basic UDP support for Network Load Balancers. For example, you can now do this in a `Load Balanced Web Service`:

```yaml
nlb:
  port: 1812/udp
  # Run the healthcheck against a different TCP port
  healthcheck:
    port: 8080
```

Additional listeners are also supported:

```yaml
nlb:
  # [...]
  additional_listeners:
    - port: 1912/udp
    # healthcheck omitted
```

I also updated the English documentation to reflect this change, but not the Japanese documentation.

This addresses issue #4767.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
mergify bot pushed a commit that referenced this issue Sep 18, 2023
#### Creates a security group for Network Load Balancers!

Changes are for 3 distinct purposes:
1. Redeploys security group and associated resources to now accept public traffic to NLB with specified ports/protocol when using UDP protocol
    - Moves security group from network configuration to NLB
    - Changes name of resources which need to be redeployed
4. Changes wkld custom domain lambda to no longer assume that the hosted zone and DNS associated with an alias never change
5. Marks for removal the API call to retrieve CIDR blocks for NLB security group, as the NLB is now public facing, and should accept all public traffic on the correct ports

The next step to completing this feature is to make changes to the manifest package to correctly configure and validate the usage of UDP for the NLB.


First step in #4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
mergify bot pushed a commit that referenced this issue Sep 18, 2023
Refactors the validation for exposed ports to make code more readable, preserve functionality, and include protocol validation.


Step in #4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
mergify bot pushed a commit that referenced this issue Sep 26, 2023
Fixes `copilot svc show` so that while the `NLBTargetGroup` logical ID is changed to `NetworkLoadBalancerTargetGroup`, both are acceptible IDs to enable displaying the NLB URI


Related to #4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
mergify bot pushed a commit that referenced this issue Sep 28, 2023
Refactors how `exposedPorts` retrieves list of exposed ports for template `PortMappings` and refactors how we define the service connect target port, making it more clear what fields it's exposed from in the manifest. This also fixes an issue for NLB enhancements which prevents service connect from targeting a port expecting UDP traffic.


Finishes CLI functionality for #4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
mergify bot pushed a commit that referenced this issue Oct 4, 2023
Validates that Load Balancer HealthChecks don't use invalid protocol for load balancer listeners.


For #4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
mergify bot pushed a commit that referenced this issue Oct 4, 2023
Adds a test case to an existing e2e test which adds a UDP network load balancer and tests for success with UDP traffic.
Should be merged after #5316 


e2e test for #4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
mergify bot pushed a commit that referenced this issue Oct 5, 2023
Includes docs for NLB enhancements coming with v1.31


Blogpost docs for #4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
@Lou1415926
Copy link
Contributor

Hi all! The UDP support is now released in AWS Copilot v1.31.0! For more, see the release note and the blog post 🎉🚀 🚀

KollaAdithya pushed a commit to KollaAdithya/copilot-cli that referenced this issue Oct 18, 2023
#### Creates a security group for Network Load Balancers!

Changes are for 3 distinct purposes:
1. Redeploys security group and associated resources to now accept public traffic to NLB with specified ports/protocol when using UDP protocol
    - Moves security group from network configuration to NLB
    - Changes name of resources which need to be redeployed
4. Changes wkld custom domain lambda to no longer assume that the hosted zone and DNS associated with an alias never change
5. Marks for removal the API call to retrieve CIDR blocks for NLB security group, as the NLB is now public facing, and should accept all public traffic on the correct ports

The next step to completing this feature is to make changes to the manifest package to correctly configure and validate the usage of UDP for the NLB.


First step in aws#4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
KollaAdithya pushed a commit to KollaAdithya/copilot-cli that referenced this issue Oct 18, 2023
Refactors the validation for exposed ports to make code more readable, preserve functionality, and include protocol validation.


Step in aws#4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
KollaAdithya pushed a commit to KollaAdithya/copilot-cli that referenced this issue Oct 18, 2023
Fixes `copilot svc show` so that while the `NLBTargetGroup` logical ID is changed to `NetworkLoadBalancerTargetGroup`, both are acceptible IDs to enable displaying the NLB URI


Related to aws#4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
KollaAdithya pushed a commit to KollaAdithya/copilot-cli that referenced this issue Oct 18, 2023
Refactors how `exposedPorts` retrieves list of exposed ports for template `PortMappings` and refactors how we define the service connect target port, making it more clear what fields it's exposed from in the manifest. This also fixes an issue for NLB enhancements which prevents service connect from targeting a port expecting UDP traffic.


Finishes CLI functionality for aws#4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
KollaAdithya pushed a commit to KollaAdithya/copilot-cli that referenced this issue Oct 18, 2023
Validates that Load Balancer HealthChecks don't use invalid protocol for load balancer listeners.


For aws#4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
KollaAdithya pushed a commit to KollaAdithya/copilot-cli that referenced this issue Oct 18, 2023
Adds a test case to an existing e2e test which adds a UDP network load balancer and tests for success with UDP traffic.
Should be merged after aws#5316 


e2e test for aws#4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
KollaAdithya pushed a commit to KollaAdithya/copilot-cli that referenced this issue Oct 18, 2023
Includes docs for NLB enhancements coming with v1.31


Blogpost docs for aws#4767 

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the Apache 2.0 License.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/svc Issues about services. size/S We should be able to deliver roughly 2 small issues in a sprint. type/enhancement Issues that are improvements for existing features.
Projects
None yet
Development

No branches or pull requests

4 participants