Fix race in stunnel port selection #129
I'll fix the failing tests (so far I keep running into pytest-dev/pytest#8539 due to python-3.10)... |
1bbe91b
to
bad09ff
Compare
bad09ff
to
69d41d6
Compare
@lshigupt you appear to be the primary maintainer. This PR would be very helpful in addressing issues in the kubernetes-sigs/aws-efs-csi-driver (kubernetes-sigs/aws-efs-csi-driver#695) |
I actually don't think the CSI driver issue kubernetes-sigs/aws-efs-csi-driver#695 can be fixed from within efs-utils: this would however help with other problems in the CSI driver (like deleting multiple unused volumes). |
Thanks a lot @tsmetana for the PR. I am testing it and can see that some of the tests are failing on our end. I am trying to debug them and will post comments showing where they fail. These are the tests that are failing:
|
Hello. I've tried to run the tests with python-3.7.13 and I'm not able to reproduce the failure. Do you have some more detailed logs? |
@lshigupt Hi. I wouldn't really want the issue and PR to rot completely... Is there anything I can do to help getting it moving forward? |
Hi @tsmetana, sorry for the delay. We've picked this back up and will look into it. |
@RyanStan any news? I mean... The patch is not that big and (I hope) quite understandable. |
Thanks for the PR. We will do a quick release tomorrow morning, so we will include this commit along with the other urgent fixes we have.
The fix is not permanent, since a race is still possible between closing the socket and launching stunnel, though the window is very small. We will work on a long-term fix for this.
src/mount_efs/__init__.py
Outdated
@@ -944,13 +944,13 @@ def choose_tls_port(config, options): | |||
assert len(tls_ports) == len(ports_to_try) | |||
|
|||
if "netns" not in options: | |||
tls_port = find_tls_port_in_range(ports_to_try) | |||
sock = find_tls_port_in_range(state_file_dir, fs_id, mountpoint, ports_to_try) |
nit: tls_port_sock
Renamed.
mount_filename = get_mount_specific_filename(fs_id, mountpoint, tls_port) | ||
config_file = get_stunnel_config_filename(state_file_dir, mount_filename) | ||
if os.access(config_file, os.R_OK): | ||
logging.info("configuration for port %s already exists, trying another port", tls_port) | ||
continue |
Is this necessary? If the port is already in use, the bind will fail anyway.
It is not strictly necessary, but when the port binding fails it helps to tell whether we are clashing with other processes. It makes debugging easier.
src/mount_efs/__init__.py
Outdated
@@ -1430,7 +1436,8 @@ def bootstrap_tls( | |||
state_file_dir=STATE_FILE_DIR, | |||
fallback_ip_address=None, | |||
): | |||
tls_port = choose_tls_port(config, options) | |||
sock = choose_tls_port(state_file_dir, fs_id, mountpoint, config, options) | |||
tls_port = sock.getsockname()[1] |
nit: Wrap this socket.getsockname() call in a function, so that the unit tests can use that function directly.
Since we're always interested in both the socket and the port, I changed the function to return a tuple instead.
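A minimal sketch of the tuple-returning approach described in this reply. It is not the code from the PR: the function name `bind_first_free_port` and the `host` default are assumptions made for illustration, and only standard `socket` calls are used.

```python
import socket


def bind_first_free_port(ports_to_try, host="127.0.0.1"):
    """Return (sock, port) for the first port that can be bound.

    Hypothetical helper: the socket is returned still bound, so the caller
    keeps holding the port until it closes the socket.
    """
    for port in ports_to_try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port is taken or otherwise unavailable: release and try the next.
            sock.close()
            continue
        # Report the port the socket is actually bound to.
        return sock, sock.getsockname()[1]
    raise RuntimeError("no free port found in the given range")
```

Returning both values means a unit test can stub the whole helper and hand back a fake socket together with a fixed port, without patching `getsockname()` separately.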
src/mount_efs/__init__.py
Outdated
# close the socket now, so the stunnel process can bind to the port | ||
sock.close() |
You should move the close into a finally block, so that if any step fails between creating the socket and closing it, the socket will still be closed cleanly.
Good point. Done.
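The shape of the requested change could look roughly like the sketch below. The function name `write_config_then_release` and its parameters are hypothetical, not taken from the patch; the point is only that the `finally` clause runs whether or not writing the configuration succeeds.

```python
def write_config_then_release(sock, config_path, contents):
    """Write the config while `sock` still holds the port, then always close.

    Hypothetical helper for illustration only.
    """
    try:
        with open(config_path, "w") as f:
            f.write(contents)
    finally:
        # Runs on success and on failure, so the bound port is never leaked
        # and stunnel (or a retry) can bind it afterwards.
        sock.close()
```

If writing the file raises, the exception still propagates to the caller, but the socket has already been closed by then.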
69d41d6
to
0d20d72
Compare
0d20d72
to
612d060
Compare
Could you elaborate on this, please? The important part here is the configuration file: it has to be written before the socket is closed and the stunnel process is launched. No other efs-mount can choose the same port, since it would either find the existing configuration file (which essentially also serves as a lock for the chosen port) or fail to bind the port if another efs-mount is still creating the configuration (i.e. the file is not written yet, but the other efs-mount has already chosen the port). This is why the check for the config file's existence is there, and why I wanted to log explicitly when another efs-mount is trying to use the same port. |
I think that only applies to the same file system, right? The configuration file check is keyed on the fs, the mountpoint and the TLS port, so a mount of another file system would not be covered by it. Anyway, I will merge this change, push another commit and bump the release version. We can continue our discussion here in the thread; I will add more detail to it. Thanks for the PR! |
True. So adding a separate lock file with just the port should be sufficient. Or remove the fs and mount point from the config file name if they are not used for anything else (I would have to check), or at worst parse the filename and check just the port part... |
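One way the port-only lock file idea could be sketched is with an exclusive create, which either creates the file or fails atomically if it already exists. The helper name `try_lock_port`, the `port-<n>.lock` naming and the lock directory are all assumptions for illustration; nothing like this is confirmed to exist in efs-utils.

```python
import os


def try_lock_port(lock_dir, port):
    """Atomically create <lock_dir>/port-<port>.lock; return True on success.

    Hypothetical helper: the file name depends only on the port, so mounts of
    different file systems contend for the same lock.
    """
    path = os.path.join(lock_dir, "port-%d.lock" % port)
    try:
        # O_CREAT | O_EXCL fails with FileExistsError if the file is present.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

A real implementation would also need to remove the lock file on unmount or after a crash, which this sketch does not attempt.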
@Cappuccinuo I think that by removing the config file existence check in the latest patch you actually introduced the race you described, even for the single fsid/mountpoint case: now we rely on the (uncertain) assumption that stunnel binds the socket before another efs-mount tries it out... If we checked the config file, the time between closing the socket and stunnel binding the port would not matter: the port wouldn't even be probed. |
Issue #, if available:
Issue #125
Description of changes:
To prevent the stunnel port selection race between parallel mount.efs processes, keep the probed port bound until the stunnel configuration file is written, and check whether the config file already exists before trying to bind the stunnel port to test its availability.
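The ordering described here can be sketched as follows. This is an illustration under stated assumptions rather than the patched `mount_efs` code: `config_path_for` and `reserve_port` are hypothetical names, the file naming scheme is made up, and the written contents are a placeholder.

```python
import os
import socket


def config_path_for(state_dir, fs_id, mountpoint, port):
    # Hypothetical naming scheme keyed by file system, mount point and port.
    name = "%s.%s.%d" % (fs_id, mountpoint.strip("/").replace("/", "."), port)
    return os.path.join(state_dir, "stunnel-config." + name)


def reserve_port(state_dir, fs_id, mountpoint, ports_to_try, host="127.0.0.1"):
    """Pick a port, write its config while the port is still bound, return it."""
    for port in ports_to_try:
        path = config_path_for(state_dir, fs_id, mountpoint, port)
        if os.access(path, os.R_OK):
            # A config already exists: treat the port as claimed, do not probe.
            continue
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Another process holds the port right now.
            sock.close()
            continue
        try:
            with open(path, "w") as f:
                # Placeholder contents; a real config has more settings.
                f.write("accept = %s:%d\n" % (host, port))
        finally:
            # Release the port only after the config file has been written.
            sock.close()
        return port
    raise RuntimeError("no usable port in range")
```

Because the path in this sketch includes the file system and mount point, the existence check only protects mounts that share both, which is the limitation raised in the discussion above.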