Proxies
StormCrawler's proxy system is built on top of the SCProxy class and the ProxyManager interface. Every proxy used in the system is represented as an SCProxy. ProxyManager implementations handle the management and delegation of their internal proxies. Each time HTTPProtocol.getProtocolOutput() is called, ProxyManager.getProxy() is invoked to retrieve a proxy for that individual request. The ProxyManager interface can be implemented in a custom class to provide custom logic for proxy management and load balancing.
The default ProxyManager implementation is SingleProxyManager. This ensures backwards compatibility with prior StormCrawler releases. To use MultiProxyManager or a custom implementation, pass its fully qualified class name via the config parameter http.proxy.manager:
http.proxy.manager: "com.digitalpebble.stormcrawler.proxy.MultiProxyManager"
StormCrawler ships with two ProxyManager implementations:
SingleProxyManager
Manages a single proxy configured through the backwards-compatible proxy fields in the configuration:
http.proxy.host
http.proxy.port
http.proxy.type
http.proxy.user (optional)
http.proxy.pass (optional)
MultiProxyManager
Manages multiple proxies passed through a TXT file. The file should contain a connection string for each proxy, including the protocol and, if needed, the authentication details. The file supports comment lines (// or #) and empty lines. The file path is passed via the config field below. The TXT file must be available to all nodes participating in the topology.
http.proxy.file
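To illustrate the file format described above, here is a minimal sketch of how such a proxy list could be read. It is not StormCrawler's own parser: the class ProxyFileSketch and its ProxyEntry holder are hypothetical names, and the connection string layout protocol://user:pass@host:port is an assumption based on the description (protocol plus optional authentication).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of StormCrawler.
class ProxyFileSketch {

    /** Hypothetical holder for one parsed entry (a stand-in, not SCProxy). */
    static final class ProxyEntry {
        final String protocol;
        final String host;
        final int port;
        final String user;     // null when the line carries no credentials
        final String password; // null when the line carries no credentials

        ProxyEntry(String protocol, String host, int port, String user, String password) {
            this.protocol = protocol;
            this.host = host;
            this.port = port;
            this.user = user;
            this.password = password;
        }
    }

    /** Parses one line, ASSUMED to look like protocol://user:pass@host:port. */
    static ProxyEntry parseLine(String line) {
        URI uri = URI.create(line.trim());
        if (uri.getScheme() == null || uri.getHost() == null || uri.getPort() < 0) {
            throw new IllegalArgumentException("Not a valid proxy connection string: " + line);
        }
        String user = null;
        String password = null;
        String userInfo = uri.getUserInfo();
        if (userInfo != null) {
            int colon = userInfo.indexOf(':');
            user = colon < 0 ? userInfo : userInfo.substring(0, colon);
            password = colon < 0 ? null : userInfo.substring(colon + 1);
        }
        return new ProxyEntry(uri.getScheme(), uri.getHost(), uri.getPort(), user, password);
    }

    /** Skips empty lines and comment lines (# or //), parses everything else. */
    static List<ProxyEntry> parse(List<String> lines) {
        List<ProxyEntry> proxies = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("//")) {
                continue;
            }
            proxies.add(parseLine(line));
        }
        return proxies;
    }
}
```

The lines could be loaded with java.nio.file.Files.readAllLines before being handed to parse. Treating a comment as a line that starts with # or // is also an assumption of this sketch.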
The MultiProxyManager load balances across proxies using one of the following schemes. The load balancing scheme can be set via the config field http.proxy.rotation; the default value is ROUND_ROBIN.
- ROUND_ROBIN
Evenly distributes load across all proxies.
- RANDOM
Randomly selects proxies using the native Java random number generator. The RNG is seeded with the current nanosecond time at instantiation.
- LEAST_USED
Selects the proxy with the lowest usage count. This is performed lazily for speed and therefore does not account for changes to usage counts that occur during the selection process. If no custom implementations are used, this should in theory behave the same as ROUND_ROBIN.
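The three schemes above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MultiProxyManager source: RotationSketch and TrackedProxy are hypothetical names, proxies are reduced to an address string plus a usage counter, and seeding with System.nanoTime() is one reading of "seeded with the nanos".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the rotation schemes; not StormCrawler code.
class RotationSketch {

    enum Rotation { ROUND_ROBIN, RANDOM, LEAST_USED }

    /** Hypothetical stand-in for a proxy with a usage counter. */
    static final class TrackedProxy {
        final String address;
        final AtomicInteger usage = new AtomicInteger();

        TrackedProxy(String address) {
            this.address = address;
        }
    }

    private final List<TrackedProxy> proxies = new ArrayList<>();
    private final Rotation rotation;
    // Assumption: the RNG is seeded once, at instantiation, with the nano time.
    private final Random random = new Random(System.nanoTime());
    private final AtomicInteger next = new AtomicInteger();

    RotationSketch(List<String> addresses, Rotation rotation) {
        if (addresses.isEmpty()) {
            throw new IllegalArgumentException("At least one proxy is required");
        }
        for (String address : addresses) {
            proxies.add(new TrackedProxy(address));
        }
        this.rotation = rotation;
    }

    TrackedProxy getProxy() {
        TrackedProxy chosen;
        switch (rotation) {
            case RANDOM:
                chosen = proxies.get(random.nextInt(proxies.size()));
                break;
            case LEAST_USED:
                // Lazy scan: counters may change while we iterate, which is accepted.
                chosen = proxies.get(0);
                for (TrackedProxy p : proxies) {
                    if (p.usage.get() < chosen.usage.get()) {
                        chosen = p;
                    }
                }
                break;
            default: // ROUND_ROBIN
                // floorMod keeps the index valid even if the counter overflows.
                chosen = proxies.get(Math.floorMod(next.getAndIncrement(), proxies.size()));
        }
        chosen.usage.incrementAndGet();
        return chosen;
    }
}
```

In this sketch, when every counter starts at zero and nothing else modifies the counters, LEAST_USED visits the proxies in the same order as ROUND_ROBIN, which matches the remark in the list above.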
The SCProxy class contains all of the information associated with a proxy connection. In addition, it tracks the total usage of the proxy and optionally tracks the location of the proxy IP. Usage information is used by the LEAST_USED load balancing scheme. The location information is currently unused but is kept so that custom implementations can select proxies by location.
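As a rough idea of what location-aware selection in a custom implementation might look like, the sketch below picks the least-used proxy within a requested country. Everything in it is hypothetical: LocationSelectionSketch and LocatedProxy are stand-ins rather than SCProxy or ProxyManager, and representing the location as a country code string is an assumption.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of location-based selection; not StormCrawler code.
class LocationSelectionSketch {

    /** Hypothetical stand-in carrying a location and a usage counter. */
    static final class LocatedProxy {
        final String address;
        final String country; // assumed representation, e.g. "US" or "DE"
        int usage;            // plain int: this sketch is not thread-safe

        LocatedProxy(String address, String country) {
            this.address = address;
            this.country = country;
        }
    }

    /**
     * Returns the least-used proxy located in the given country and counts
     * the selection as one use, or an empty Optional if none matches.
     */
    static Optional<LocatedProxy> select(List<LocatedProxy> proxies, String country) {
        LocatedProxy best = null;
        for (LocatedProxy p : proxies) {
            if (!p.country.equalsIgnoreCase(country)) {
                continue; // wrong location, skip
            }
            if (best == null || p.usage < best.usage) {
                best = p;
            }
        }
        if (best != null) {
            best.usage++;
        }
        return Optional.ofNullable(best);
    }
}
```

A real implementation would need to decide what to do when no proxy matches the requested location, for example falling back to any available proxy; this sketch simply returns an empty result.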