The grand Lua pull request #183
Merged
At this point, all the test/server/* tests should pass most of the time. There still seem to be some sporadic test issues that can cause occasional failures that we still need to track down, but things are basically implemented.
Related to the changes here: NREL/api-umbrella-gatekeeper#15
Instead of `*example.com`, we'll now use `.example.com` which will match both `example.com` and `*.example.com`. This ensures we get along with nginx server name parsing.
This also uncovered a problem with wildcards and our hostname normalization, which largely defeated the wildcard handling. Instead, let's only apply the host normalization to the default host, since that was the only place we really wanted it.
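A minimal sketch of the leading-dot matching semantics described above (illustrated in Python for brevity; the helper name is hypothetical, not API Umbrella's actual implementation):

```python
def host_matches(pattern, host):
    """Match hostnames the way nginx treats a leading-dot server_name:
    ".example.com" matches both "example.com" itself and any subdomain."""
    if pattern.startswith("."):
        bare = pattern[1:]
        # Bare domain, or any hostname ending in ".example.com".
        return host == bare or host.endswith(pattern)
    # Non-wildcard patterns require an exact match.
    return host == pattern

# ".example.com" covers the bare domain and all subdomains,
# but not unrelated hosts that merely end in "example.com":
host_matches(".example.com", "example.com")      # True
host_matches(".example.com", "api.example.com")  # True
host_matches(".example.com", "badexample.com")   # False
```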
This reduces the need to publish multiple API backends when only reordering a single API backend for matching purposes. This becomes more important when admin groups and scopes are in use, since it prevents one admin group from affecting another group's APIs when performing a reorder.

Previously, we ordered each API backend with a numeric counter like 1, 2, 3, 4, 5, etc. So if you wanted to move the last backend to the front, you actually ended up having to change the order for all the backends. The new approach is similar in that it uses a counter, but we leave a significant gap between each item (10,000). So instead, API backends added would by default have sort orders of 0, 10000, 20000, 30000, 40000, etc. If you wanted to shift the last backend to the front, it would get a new sort order of -10000, while leaving the rest of the backends untouched. More important than simply going negative, if you wanted to move the last backend to the second position, it would leave the rest of the items untouched by choosing a sort order in between the existing values (5,000 in this example). Subsequent moves in between the first two items would continue dividing the available integer range in half (2,500, 1,250, etc.).

With this approach of leaving gaps and then filling them in to fulfill moves, there is a theoretical point where the surrounding items will need to be shifted. Similarly, the sort order uses 32-bit signed integers, so there's also a theoretical maximum upper bound and lower bound if enough entries get added with the gaps in place. Both situations should be handled in this implementation by shifting surrounding records to make room as needed. In those cases, other records may still be changed when sorting, so this isn't a 100% perfect solution, but I think both situations should be exceedingly rare given the amount of data typically present and the number of reorders that typically happen.
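The gapped counter scheme above can be sketched as follows (Python for illustration; the function and constant names are hypothetical, not the actual implementation):

```python
GAP = 10000  # initial spacing between consecutive sort orders

def new_sort_order(prev_order, next_order):
    """Pick a sort order for an item moved between two neighbors.
    Pass None for prev_order when moving to the front, and None for
    next_order when moving to the end."""
    if prev_order is None and next_order is None:
        return 0                           # first record in the list
    if prev_order is None:
        return next_order - GAP            # moved to the very front
    if next_order is None:
        return prev_order + GAP            # moved to the very end
    return (prev_order + next_order) // 2  # split the gap in half

# With default orders 0, 10000, 20000, ...:
new_sort_order(None, 0)   # moving to the front -> -10000
new_sort_order(0, 10000)  # moving to second position -> 5000
new_sort_order(0, 5000)   # a subsequent move in between -> 2500
```

When the midpoint collides with an existing value (the gap is exhausted), the surrounding records would need to be shifted to make room, as described above.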
Implement different approach to API backend reordering to reduce reorder churn
- Roles from sub-URL settings are now appended by default to any original roles, rather than overwriting them. This seems more intuitive and less dangerous, since you're less likely to accidentally lift all role restrictions by simply adding a sub-URL setting.
- There's a new option to explicitly have sub-URL roles override the parent roles, but by default, they are appended.
- When multiple roles are defined on the API, the API key must now have all of the roles in order to access the API, rather than any single role. I don't think the previous approach of only requiring a single role was very intuitive, and I think this approach makes more sense, particularly with sub-URL roles appending to the base roles by default.
- Fix a possible crash if an API backend was set up with role requirements but API keys were not required. While this combination of settings doesn't make sense, it may have been encountered with an API backend that required a role, but then a sub-URL setting trying to disable key requirements (for CORS preflight requests, as an example).
- Remove the special handling of the "admin" role that previously granted access to all APIs. This wasn't really documented anywhere, and I don't think we have a strong use case for this kind of global admin role, so it seems safest to remove it so a user isn't accidentally granted global access.
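A rough sketch of the new role semantics described above (appending by default, explicit override, and requiring all roles), illustrated in Python with hypothetical names:

```python
def effective_roles(api_roles, sub_url_roles, override=False):
    """Combine the parent API's roles with a sub-URL setting's roles.
    By default the sub-URL roles are appended; with override=True they
    replace the parent's roles entirely."""
    if override:
        return set(sub_url_roles)
    return set(api_roles) | set(sub_url_roles)

def key_has_access(key_roles, required_roles):
    """Under the new rules, the API key must hold *all* required roles,
    not just any single one."""
    return set(required_roles) <= set(key_roles)

# Sub-URL roles append to the base roles by default:
effective_roles(["internal"], ["write"])                 # {"internal", "write"}
effective_roles(["internal"], ["write"], override=True)  # {"write"}
```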
The override roles checkbox is part of some changes to how roles get applied and inherited, made in NREL/api-umbrella-gatekeeper@94670b0.

This turned into a larger change than just adding the single override checkbox for roles, since we only want this checkbox to show up in the context of the sub-URL settings, and not the other places where those settings fields are shared (the API global settings). So we can more easily control pieces of the settings fields, they've been turned into an Ember component instead of a render call, which gives us the ability to pass extra variables around. This in turn led to a few other things:

- Turn the selectize input from a sort of strange view wrapper into an easyform input type. This should be more straightforward, and since views will be removed in Ember 2, it seems like a good move to go ahead and make.
- Tweaks to the selectize and ace editor inputs to improve usability and testability. Now the associated labels for these input types are correct, so clicking on a label lands you in the correct input field, rather than still being tied to the underlying, hidden field. This and some related changes help with the testability of these input types.
- More shifting of text in the API and settings forms to i18n while we're at it.
- A much more complete full integration test for the whole API backend and API user sections. Since we needed to shift the shared settings from a view/controller to a component, we needed a much more thorough integration suite to ensure this didn't break any piece unexpectedly.
Add option to override roles in sub-settings (and some refactoring)
Fixes and changes to how roles get applied
Related to it being possible to enter invalid error data (but valid YAML) via the web UI: #153. In these cases, we will now return the default internal server error, which at least prevents bad error responses lacking an HTTP status code altogether, which could really send some clients into a tizzy.
This ensures that the error data isn't an unexpected type at the root like a string: #153
Ensure that the error data yaml entered is the expected type (hash)
Better error handling for if error data is unexpectedly not an object.
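A sketch of the type check described above, assuming hypothetical names and a hypothetical default error payload (the real defaults may differ):

```python
# Hypothetical default returned when the configured error data is unusable.
DEFAULT_ERROR = {"status_code": 500, "message": "Internal Server Error"}

def validate_error_data(data):
    """Accept the parsed YAML error data only if it is the expected type
    (a hash/dict at the root); otherwise fall back to the default internal
    server error so a status code is always present."""
    if not isinstance(data, dict):
        # Valid YAML, but the wrong root type (e.g. a bare string).
        return DEFAULT_ERROR
    if "status_code" not in data:
        # Without a status code, the response would be malformed.
        return DEFAULT_ERROR
    return data

# A bare string is valid YAML but not a usable error definition:
validate_error_data("oops")  # falls back to DEFAULT_ERROR
```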
Rails security updates: http://weblog.rubyonrails.org/2015/6/16/Rails-3-2-22-4-1-11-and-4-2-2-have-been-released-and-more/ Related rack update: rack/rack#894 Moped security updates: mongoid/moped#377
I'm favoring the ideas of simplifying our stack down to Lua more and more, so let's go ahead and switch the CLI app out for Lua.
We have a pull-request in, so hopefully this will be integrated into the releases at some point. In the meantime, we need it for the CLI app to dump YAML properly.
We're also giving our app an explicit namespace in Lua for better separation.
Also try to improve the build process a bit, so that if things are already installed, we don't have to perform any actions.
Shift the /etc/resolv.conf and /etc/hosts parsing to the CLI, since this only needs to be done once per reload; the values are then stored in our runtime config.
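A minimal sketch of the one-time /etc/resolv.conf parse (Python for illustration; the actual CLI is Lua, and the function name is hypothetical):

```python
def parse_resolv_conf(text):
    """Extract nameserver IPs from /etc/resolv.conf-style content.
    This is the kind of parse that only needs to happen once per reload,
    with the results stored in the runtime config."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or line.startswith(";"):
            continue  # skip comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers

sample = """# generated by dhclient
nameserver 8.8.8.8
nameserver 8.8.4.4
search example.com
"""
parse_resolv_conf(sample)  # ["8.8.8.8", "8.8.4.4"]
```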
This has been replaced by the web backends concept.
Implement the "services" logic so that only certain processes start up depending on the broader service groups defined for API Umbrella.
- Track down a couple of processes that weren't in our default port range for API Umbrella services (~14000).
- Bind all the internal-only services to 127.0.0.1 instead of the public network interface.
- Make sure all of our test processes run on different ports than the default ports (so tests can be run at the same time).
- Remove a few bits of defunct config related to ports.
The CI environment is suddenly having some odd issues with the rake version under the static site. Bump some web dependencies while we're at it.
Running into rack bundler issues with the newer version. I think perhaps the various rack changes made in the newer versions of Puma aren't compatible with this older Rails 3.2 version of rack.
Right now, we'll use the standard CentOS 6 output, but we'll tailor this later for other distros.
Whoa.
For background, at the beginning of the year, we began experimenting with rewriting some of API Umbrella's core in Lua & OpenResty. That toying around has proved worthwhile, so we're moving forward with this plan. See this issue for some of the original thinking and more details: #86
The short version is that these updates should be backwards compatible (there are a few small caveats noted below), the platform is faster, and this cleans up a few pain points.
There are still a few loose ends, but it's very close, so I wanted to start a pull request where we can start to document the changes.
Code Organization
This consolidates most of our separate repositories into this single git repo. Currently we have:
Under the new repository structure we will have:
The gatekeeper, router, and web projects have been merged into the api-umbrella repo. api-umbrella-config and omnibus-api-umbrella have been rendered defunct.
There are a couple of reasons behind this shift:
To be clear, we're not aiming towards a monolithic codebase with these changes, but we want to make the codebase easier to work on, which hopefully these code organization changes will do. As a quick example of some of the pain points with the previous setup, see this recent feature. It required coordinating 3 pull requests, with some duplicate config across projects, and the router changes having failing tests until the gatekeeper changes were merged and released.
Architecture
As outlined in #86, the proposed architecture was essentially:
That didn't quite pan out like that once we got to adding the caching layer. The plan had been to use nginx's built-in caching layer, but the major pitfall with that is that nginx's caching does not work when response streaming is enabled. Since response streaming is important to us, we can't use this approach for now (but if that issue gets resolved with nginx, we can revisit).
What we ended up landing on for the architecture looks like:
So while we still have the nginx sandwich we were trying to simplify, things are still dramatically better, since we remove the multiple nodejs processes, and the nginx instances contain all the relevant configuration in the same instance (rather than having to sync config changes between the independent nodejs processes and the nginx backend router).
The switch to TrafficServer for caching was due to it performing a fair bit better than Varnish in this sandwich situation, allowing us to still achieve sub-millisecond overhead (to be clear, both Varnish and TrafficServer perform very well as caches, but TrafficServer seems to have a lower proxy overhead on a per-request basis). We've also been eyeing TrafficServer for a while for other reasons. The primary reason (streaming) was obviated by the Varnish 4 upgrade, but there are still some nice things that TrafficServer addresses: more flexible configuration for keep-alive settings, a hybrid disk and memory cache (so we can cache larger things on disk), support for HTTP/2, and TrafficServer is pursuing a Lua approach for plugins, which nicely aligns with OpenResty.
I experimented with Ledge, a Lua-based HTTP cache that can be embedded, but the tests didn't pan out too well (there were some proxy behavior issues that didn't match our expectations for a transparent proxy, and the performance wasn't as good). Those initial experiments were much earlier this year, so it's possible things have changed since then (Ledge is under pretty active development), but for now, we'll stick with a more established proxy solution.
Backwards Compatibility
By and large, this upgrade should be backwards-compatible with previous API Umbrella releases, despite the significant underlying changes. In nearly all cases, the proxying functionality should remain unchanged, and the internal admin APIs we provide are also unchanged. However, there are a couple of small areas where we do have backwards-incompatible changes. They mostly affect some more esoteric features that I don't think were actively being used by anyone, but it's worth noting:
For templates that used the {{#if}} block, the syntax has changed, since we are now just using Mustache templates for parsing. Something like {{#if headers.x-missing}}{{headers.x-missing}}{{else}}default{{/if}} in Handlebars would now become {{#headers.x-missing}}{{headers.x-missing}}{{/headers.x-missing}}{{^headers.x-missing}}default{{/headers.x-missing}}.

Analytics Logging
Gathering the log data for analytics was a particular pain point in the old stack. It had grown rather unwieldy over time, with multiple queues trying to aggregate data from the multiple points in our proxy stack (e.g., combining analytics from the nodejs processes and nginx). The overall approach is much, much simpler now that everything is handled inside OpenResty. But another big improvement is that we have a much simpler approach to queuing this data using heka, which is much faster and solves potentially unbounded memory growth with disk buffering: 18F/api.data.gov#233
Performance
Basically, things are much faster and more efficient. Our general goal based on prototypes had been to get the local time spent in the API Umbrella proxy down below 1 millisecond. I ended up stripping some of the initial over-eager optimizations I had done to simplify the initial code, so I think we might have crept up more into the 1-2ms range (depending on hardware). However, things are still definitely faster and more efficient than the old stack in any benchmark I've run. I also think there's quite a bit of room for further optimization if we take a more focused look at that again (which should be easier to do now that we have all the functionality out of the way).
Resiliency
The stack should be much better at handling unexpected failures, particularly with the database going down. We now cache more things, like valid API keys locally, so even in the event of a complete database outage, API Umbrella's proxying should be able to continue to serve recent, existing users for a while (new users, or uncached users will fail, but this at least reduces the general impact of a small database outage).
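A sketch of the fall-back-to-stale-cache idea described above (Python for illustration; the class name, TTL, and structure are hypothetical, not the actual implementation):

```python
import time

class KeyCache:
    """Minimal local cache sketch: keep recently verified API keys around
    so proxying can continue serving known users during a database outage."""

    def __init__(self, ttl=60):
        self.ttl = ttl
        self.store = {}  # key -> (value, fetched_at)

    def fetch(self, key, lookup):
        now = time.time()
        hit = self.store.get(key)
        if hit and now - hit[1] < self.ttl:
            return hit[0]  # fresh local copy, no database hit
        try:
            value = lookup(key)
        except ConnectionError:
            # Database outage: serve the stale local copy if we have one,
            # so existing users keep working; uncached users still fail.
            if hit:
                return hit[0]
            raise
        self.store[key] = (value, now)
        return value
```

Using a ttl of 0 forces every fetch through the lookup, which makes it easy to see the outage fallback: once a key has been cached, a failing lookup still returns the stale copy.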
We've also added a health check API endpoint that can be used to better determine the overall health of the API Umbrella platform. This could be used in load balancers to proactively remove servers that might be responding, but aren't actually able to service requests.
Process Management
One area that continued to cause pain in the old stack was how we were running the various processes that make up API Umbrella. API Umbrella is composed of multiple processes, but we want to offer an easy way to start and stop all of these services together. What we had before had grown a little unwieldy and could occasionally get in a funky state where things would not stop or start properly. All of that has been much simplified and improved.
We were previously using a combination of forever.js and supervisord, which was a bit of an odd mix, since they both do similar things (essentially, supervisord was being used for managing the processes, and forever.js was responsible for starting supervisord and responding to custom signals). We've switched to just using perp for all this. It's much lighter weight and more reliable, since we replace the "api-umbrella" process with the perpboot process and then trap the specific signals we need to respond to. This helps clean up our overly complex process tree and removes the need for us to run our own persistent process to respond to signals. It should lead to faster starts/stops, and it also removes some overly fancy logic we had that could lead to API Umbrella shutting down if a single component failed to start (which isn't ideal from a resiliency perspective).
Configuration Reloading
Despite improvements made, we still had some issues where sending a "reload" signal to API Umbrella wouldn't always reload specific config settings. Thanks to the process management simplifications made, this has been greatly simplified to better guarantee that a reload will always get the completely updated config.
DNS Changes & Keep-Alive Connections
To handle DNS changes with API backend servers, we currently end up reloading nginx. This works, but leads to hyper-frequent nginx reloads if you have a lot of API backends with changing IPs. These reloads should be zero-downtime, but I think this may lead to corner-case issues with keep-alive connections. I haven't proved this, but it's my current theory that this might be responsible for some ELB proxying issues we've spotted.
In the new stack, all of the DNS changes are handled inside nginx without any reloads necessary. Overall, it's much simpler (and I'm hoping might explain/fix possible keep-alive issues). In addition, there's a path towards making the code even simpler, since a future version of OpenResty will natively support managing upstream server connections (right now we rely on an nginx plugin that works, but has a few rough edges).
Packaging Improvements
The binary packages for API Umbrella are much smaller in size since we've been able to simplify our dependencies. We've also made an effort to rely more on system dependencies wherever possible, rather than packaging all of our dependencies into our own binary. In addition to a smaller binary, this also means that when there are things like OpenSSL security updates, you only have to worry about updating the copy on your system, rather than also needing to upgrade API Umbrella.
We were previously using omnibus for building our packages (which is what led to a lot of the over-eager dependency inclusion), but now we're using a simpler standalone make process combined with fpm. Docker is being used to create packages for different distros. These changes should speed up the packaging process so it will be easier for us to release packages more frequently.
We'll also be looking to publish our binaries in real Yum and Apt repos so that dependencies can be automatically fetched (this is more of an issue for apt). This should also simplify upgrading API Umbrella.
Misc
There's probably more I'm forgetting, so I'll update this if I remember anything else. There's also more we need to outline on the roadmap related to this (shifting our APIs to Lua, unifying our testing frameworks), but this is the bulk of the big revamp we want to get out of the way.
If you've made it through this long-winded list, you'll probably see that the changes really extend beyond the switch to Lua. Taking a fresh look at things more broadly has allowed us to make improvements in other areas we've identified over the years. I hate to do such a massive code dump all at once in this pull request, but that's unfortunately how things turned out. However, with the major portions of the revamp out of the way, it should make things much easier to improve and iterate on moving forward. I don't take these kinds of big changes (including a language switch) lightly, so hopefully we won't be doing anything like this again in the foreseeable future. But the benefits of the revamp have proved worthwhile, and we're excited about where this will let us go. To the future!