- Executive Summary
- Agenda: Issues to Discuss
- Background
- Phase 1 proposal
- Slaves must be pre-configured
- Use of ATTRIBUTES to indicate connectable storage "tuples" should work nicely with Marathon
- Alternate to Marathon + attributes for placement decisions:
- For Discussion: How should external storage capacity be advertised?
- Demonstration of the difficulty with reporting a shared pool's capacity at the slave level
- For Discussion: Do we even want to advertise a pool capacity, with dynamic volume creation from the pool?
- Summary of Options
- Attribute Options
- Creating a volume
- Volume lifecycle workflow
- TBD: Where to perform volume format. Options:
- Volume mount/unmount management
- Project Steps
- Phase 2+ considerations that should influence architecture
- Appendix
Pursue a Phased approach
- Start Today:
- assume an existing external storage platform has been deployed
- create volumes (if needed for platform)
- consume volumes from it
- Tomorrow:
- persistent external disk reservations
- compose a storage platform that is provisioned by Mesos and runs on Mesos slaves using direct-attached storage (DAS)
How should external capacity be advertised?
- each volume a resource advertised by a single slave?
- each volume advertised by all slaves, revocable after the first reservation?
- Should it be advertised at all in phase 1?
- Should a pool capacity be advertised, with smaller volumes "carved" from pool by reservations?
When do we mount/dismount volumes?
- Aligned with reservation create/destroy?
- Are we bothered that reservation locks volume to a slave?
- Aligned with first task dispatch / reservation destroy?
- Aligned with each task dispatch / task terminate?
- How do we handle slave / task abnormal terminations, network isolation?
When do we format a volume?
- reservation time?
- first task schedule time?
How do we police network volume mounts?
- When tasks / slave abnormally terminate or network isolation occurs:
- How do we detect orphaned mounts?
- What do we do about them if we find them, and when?
Where do we host an external storage management implementation?
- Where is abstraction interface located? Where is implementation binary hosted? Consider these (could be some or all):
- Anonymous Module
- new Storage Management Framework
- In Isolators
What/Where is "ultimate source of truth" regarding existence, mount state of external persistent volumes?
- Mesos "call outs" through abstraction layer to an external storage management implementation?
- Something else?
Is policy based mapping of Tasks to storage volumes on the long term agenda?
How do we handle storage network connection constraints? (failure domain constraints are likely a similar issue)
- Get Mesos + Docker (volume drivers) + marathon working
- Get Mesos + external storage + any framework working (this is like #1 for Frameworks that don't use Docker containers)
- Support Mesos managed storage profiles - Do not assume pre-configured storage volumes, have Mesos manage volume creation on appropriate storage
- Mesos framework for storage platform lifecycle management - Distributed software defined storage platforms, like ScaleIo, are deployed by Mesos on slave nodes.
- Mesos external storage + Docker (volume drivers)
- Slaves offer resources (including storage) to the Master
- Slaves advertise their capabilities with a combination of resources + attributes.
- The Master's Allocator module maintains a model of system wide resources.
- Resources and attributes are offered to Frameworks (by the Allocator) as vectors (tuples). For example resources available on a slave node might include cpu+mem+storage
- The Allocator tracks consumption of resources by Frameworks, and decrements subsequent resource offers while the resource is in use.
- Attributes are not considered consumed by a running task, so they are not "usage constrained" or tracked by the Allocator - they are always simply passed through in offers to Frameworks. Attributes are commonly used to guide placement by Marathon.
It might make sense for a read-only volume to be advertised as an attribute, if the storage provider can safely expose it to multiple slaves concurrently. For normal read-write volumes, some facility must track consumption by a task, so a resource is the proper advertisement, not an attribute.
A resource can be explicitly earmarked with a role in a slave's configuration.
--resources="cpus(ads):8;mem(ads):4096"
This has the effect of restricting consumption of the resource to a particular framework or usage.
This is known as a static reservation. A static reservation can only be changed by restarting the slave.
- A dynamic reservation starts as a resource defined on the slave with the default "*" (unreserved) role.
- The slave's resource vector is reported to the Mesos Master, which offers the vector to Frameworks.
- A framework responds to the offer with a "Reserve" response, which includes a role designation (see the sketch below)
- The Reserve response does not have to include all the elements of the offer vector.
- For example the offer could be (cpu+memory+storage), but the Reserve response could be just storage
Dynamic reservations will be mandatory for external persistent storage
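As a rough illustration of the Reserve workflow above, a framework accepting an offer would include a RESERVE operation whose resources carry the desired role. The JSON below is only a sketch: the resource name, value, role and principal are illustrative, and the exact field layout should be verified against the Mesos Offer::Operation protobufs.

    {
      "type": "RESERVE",
      "reserve": {
        "resources": [
          {
            "name": "disk",
            "type": "SCALAR",
            "scalar": { "value": 234500 },
            "role": "storage-framework",
            "reservation": { "principal": "storage-principal" }
          }
        ]
      }
    }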
Many services are provisioned for peak rather than average load; as a result, provisioned resources are under-utilized.
Over-subscription attempts to marshal these under-utilized resources to offer a best-effort class of service.
- To implement this, the Master's Allocator module tracks oversubscribed resources separately from regular resources
- A resource Monitor running on each slave gathers usage metrics
- Use of over-subscription is recommended only with "compressible" resources like cpu and bandwidth
Because use with storage is not recommended, we do not plan to address over-subscription related to persistent external storage. However, observe that thin-provisioned storage is similar in some ways and different in others:
- similar: you do want to monitor the actual usage of thin-provisioned storage in relation to real capacity
- different: Revoking storage is not a viable option
Conclusion: We will deem over-subscription of external storage as unsupported, and monitoring of thin provisioned external storage as something to be managed and monitored outside Mesos.
- Must pre-associate slave with storage provider(s) (e.g. ScaleIo)
- storage provider (ScaleIo) client software assumed pre-installed on slave
- Slaves report this association to Mesos Master using resources + attributes
- resources and attributes can be specified in /etc/default/mesos-slave, environment variables or command line options
RESOURCES="cpus:2;mem:822;disk:234500;ext-storage-inst1:10000000
This slave is reporting that it is pre-configured to consume storage from an external storage provider designated as inst1
- This provider has a max capacity of 10000000
TBD: this is an awkward way to report and track capacity; see the slides that follow for more. We need to discuss alternative solutions with Mesosphere.
RESOURCES="cpus:2;mem:822;disk:234500;ext-storage-inst1:10000000"
ATTRIBUTES="ext-storage-inst1:scaleio~row8~pool-hdd"
Note that the attribute "key" duplicates the storage provider instance name. For the scaleio provider, storage volumes will come from the provider pool named "pool-hdd" in the provider instance designated "row8". The pool might imply a class of storage, such as SSD.
ATTRIBUTES="scaleio:row8~pool-hdd"
You can easily constrain an application to run on slave nodes with a specific attribute value using Marathon Constraints
This would allow a task associated with a persistent volume to be restarted on any slave node that has an attribute indicating it can connect to the volume's storage pool.
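For example, a minimal Marathon app definition could pin its instances to slaves carrying the attribute shown above; the app id and sizing are placeholders, and the CLUSTER operator restricts placement to slaves whose attribute value matches:

    {
      "id": "/external-storage-task",
      "cpus": 1,
      "mem": 1024,
      "instances": 1,
      "constraints": [
        ["ext-storage-inst1", "CLUSTER", "scaleio~row8~pool-hdd"]
      ]
    }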
MESOS-177 describes a future implementation of a facility that would allow a Framework to dynamically add new slave attributes at task launch time. Assuming we build a ScaleIo Framework in a later phase, this could be used as part of a facility to automate association of slaves to ScaleIo storage pools.
If the allocator was extended to consider attributes automatically in offer composition, this could eliminate the management burden of configuration of attribute based restrictions.
- Offers to frameworks would be automatically filtered based on attributes.
Mesos currently expects slaves to provide resources, and thus implements a slave centric model for reporting available resources to the Mesos Master.
External storage does not quite match this slave-centric model.
The "straw man" proposal to report a capacity on each connectable slave has some serious flaws, but best to start with something to make the issues apparent.
Suppose the Mesos operator chooses to allow a total pool of 1,000 TB to be utilized by a Mesos Cluster.
For simplicity assume 2 slaves have connectivity to this pool.
Assume a Framework wants to consume (reserve) a 750TB external persistent volume, if it is presented with an offer.
Discussion of alternatives based on this scenario follows...
RESOURCES="...;ext-storage-inst1:1000000000"
RESOURCES="...;ext-storage-inst1:1000000000"
If each slave advertises the 1,000 TB, the aggregate is inflated by 2x, although actual offer vectors (cpu+mem+storage) will not report an impossible-to-achieve tuple.
After the Framework reserves 750 TB, the vectors from both slaves need to be reduced by 750 TB. This will require a change to Mesos?
RESOURCES="...;ext-storage-inst1:1000000000"
If only one slave reports the 1,000 TB, what happens if that slave leaves the cluster? Probably nothing good!
RESOURCES="...;ext-storage-inst1:500000000"
If both slaves advertise a pro-rata portion, e.g. 500 TB in this case, the Framework will not be able to consume a 750 TB volume, even though this should be possible.
If a third slave is added, what happens then? Everybody has to change - this isn't going to work!
For Discussion: Do we even want to advertise a pool capacity, with dynamic volume creation from the pool?
Another alternative is operator (admin) composition of specific volumes in advance, and advertisement of the resulting individual volumes
For example, an operator composes a 750 GB volume plus a couple of 500 GB volumes.
The slaves report a 750 GB resource and two 500 GB resources (a configuration sketch follows below).
Assume you compose small, medium and large volumes. Whenever the off-the-shelf volume isn't the perfect size, the task must "round up", resulting in waste.
In the example of a 750 GB and two 500 GB volumes, suppose a Framework requires an 800 GB volume and four 100 GB volumes. Even though this should fit within the total pool, the result will be two failed requests (the 800 GB request exceeds every pre-composed volume, and only three volumes exist for the four 100 GB requests) and a lot of wasted storage on the 3 successful requests.
This has a similar issue to the MB capacity pool:
- Do you report the volume on ALL slaves? OR
- just one that is arbitrarily chosen?
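For concreteness, a sketch of what advertising the pre-composed volumes as individual slave resources might look like; the resource names are invented here and MB units are assumed, and whether this line appears on all connectable slaves or just one is exactly the open question above:

RESOURCES="...;ext-storage-inst1-vol001:750000;ext-storage-inst1-vol002:500000;ext-storage-inst1-vol003:500000"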
- Operator (admin) precomposes a collection of volumes. The resource pool consists of this assortment of volumes. OR
- Operator (admin) designates a pool of external storage. For example 1,000 TB. The resource pool consists of this storage, to be "carved" into volumes at the time a Framework accepts an offer in the form of a RESERVE.
Note: The initial implementation will assume pre-deployed storage, and could even assume pre-provisioned volumes. Maybe just tell Mesos what volumes have been pre-provisioned, without implementing an operator API to do volume provisioning.
Assume storage connectivity constraints are defined as slave attributes
- Have a user declare constraints in Marathon. OR
- Have Mesos manage this automatically in the allocation logic, with no Marathon configuration required
Whether done in advance or at RESERVE time, the following specification is required (a sketch follows this list):
- a "tuple" indicating the source of the storage
- provider-type
- provider-instance
- provider-pool
- volume size in MB
- filesystem (e.g. ext4) to be used for formatting
- a mount point on the slave
- It might be better to relieve the operator of this specification by auto-generating a unique mount point
- optional: a source volume (this volume will be a snapshot)
- optional: access mode (RW, RO, default is RW)
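Pulling the items above together, a volume specification might be captured as something like the following; all field names and values here are illustrative, not a settled schema:

    {
      "provider-type": "scaleio",
      "provider-instance": "row8",
      "provider-pool": "pool-hdd",
      "size-mb": 512000,
      "filesystem": "ext4",
      "mount-point": "/mnt/ext-storage/vol-0001",
      "source-volume": null,
      "access-mode": "RW"
    }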
- Operator (admin) creates volume
- The step of formatting the volume will be done on an arbitrary slave which has been associated with the provider
- We want the format to take place at volume create time because it is time consuming and we want to avoid startup latency when tasks are dispatched. Need to determine how and where we can trigger format code.
- TBD: where to perform the volume format. Options (a sketch of the format steps follows this list):
- Embedded in an isolator?
- We don't think the isolator is invoked at Reserve time, so this may not be feasible
- Dispatch a Task to do this using Marathon
- If we do this during a Reserve, what are the expectations for timeliness? A format can take a long time. Is RESERVE a synchronous operation expecting a rapid response?
- Utilize an "out-of-Mesos" host to perform formatting
Unless the reservation process ties an external volume to a particular slave, it is best to defer mount until a task using the mount is scheduled.
If a slave becomes detached, or a task crashes, many storage providers will refuse to remount the volume to a different slave until the old mount is detached. We need to consider the best workflow for detach:
- Should we automatically detach when a task ends? Probably yes, if mount/remount cost is low - and it usually is low.
- How do we handle slave detach? Probably leave mount attached - should discuss. Does this tie into slave checkpoint and recover options?
- Where (in what module) is detach code triggered and run? Master, Framework, or Slave?
Function | Description | Comments |
---|---|---|
new-volume | Create a new volume | new-snapshot used if this is a clone |
attach-volume | Attach volume to slave instance | |
format-device | Format the attached device | |
rexray new-volume --size=10 --volumename=persistent-data
rexray attach-volume --volumeid="vol-425e94af"
Assume the volume will be attached only during format, and while an associated task runs. Thus RexRay detach will also be utilized.
A ReservedRole (string) and a DiskInfo have been added to Resource.
TaskInfo links to Resources
DiskInfo incorporates a slave mount path, a container mount path and a RW/RO flag. DiskInfo looks like a suitable place to record remote persistent volume metadata. I believe there is already a Mesos ID here, and that it is retained persistently by Mesos.
The Mesos ID might be sufficient if we maintain a "database" that maps the ID to storage-provider, provider instance and volume-id.
We'd like to avoid maintaining an external storage provider database. I propose adding these fields to the Mesos DiskInfo (sketched below):
- external provider type
- external provider instance ID
- external provider volume ID
- (optional) external provider pool ID. This is optional because providers can likely deduce this via API, if instance and volume IDs are available
- (optional) Do we want to record a file system type, or can this be deduced by examination?
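A sketch of what a DiskInfo carrying this metadata might look like, rendered as JSON. The persistence/volume fields reflect our reading of the existing DiskInfo; the "external" block and all of its field names are the proposed additions and are purely illustrative.

    {
      "persistence": { "id": "pv-0001" },
      "volume": {
        "container_path": "/data",
        "host_path": "/mnt/ext-storage/vol-0001",
        "mode": "RW"
      },
      "external": {
        "provider_type": "scaleio",
        "provider_instance": "row8",
        "provider_volume_id": "vol-425e94af",
        "provider_pool": "pool-hdd",
        "filesystem": "ext4"
      }
    }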
Determine Resource advertisement philosophy:
- advertise capacity pool and form volumes at RESERVE time OR
- form volumes in advance and advertise these volumes
Treating resource management as only a capacity issue is short-sighted
- External storage consumes network bandwidth - even if a slave has a dedicated storage network, bandwidth is a finite resource
- External storage has a finite IOPS budget per pool
If Mesos really intends to claim to manage external persistent storage, taking on bandwidth and IOPS is mandatory, though it might be deferred to a later release.
To be discussed: Could limiting mounts per slave be a "poor man's" replacement for limiting bandwidth and IOPs?
- Instead of managing bandwidth and IOPS, we could manage connections (mounts) per slave
- OR should we do this in addition to managing storage bandwidth and IOPS?
Bandwidth is associated with a specific slave, so this is not tied to a persistent storage volume reservation
Thus slaves should advertise storage bandwidth, and task dispatch should consume it.
Units? If the goal is simply to achieve balanced task distribution (i.e. avoid placements that crowd external-storage-using tasks onto the same slave), arbitrary units would be OK.
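A sketch of what such an advertisement could look like if we settle for arbitrary units; the resource name and the figure of 100 "shares" per slave are purely illustrative:

RESOURCES="...;ext-storage-bandwidth:100"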
IOPS, by contrast, is tied to the storage pool rather than to an individual slave, so it is best addressed by tying it directly into a persistent storage reservation.
If we go the direction of managing external persistent storage as an MB pool, rather than as a collection of pre-composed volumes, IOPS can also be advertised as a pool, consumed at RESERVE time by a Framework (a configuration sketch follows the example below).
- Assume pool-A offers 800TB and 400IOPS
- Assume pool-B offers 800TB and 1000IOPS
- Framework X wants a 500 TB volume with 900 IOPS. The result should be:
- Decline pool-A offer
- Accept and RESERVE from the pool-B offer, consuming 500TB and 900IOPS
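Continuing the example, a sketch of how the two pools might be advertised as capacity-plus-IOPS scalar resources; the resource names are invented and MB units are assumed. Framework X's RESERVE against pool-B would then consume 500,000,000 MB and 900 IOPS from these scalars.

RESOURCES="...;ext-storage-poolA:800000000;ext-storage-poolA-iops:400"
RESOURCES="...;ext-storage-poolB:800000000;ext-storage-poolB-iops:1000"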
If we go the way of pre-composed volumes, an IOPS evaluation (placement decision) should take place during the composition process.
Implement an abstraction for
- Volume: create/remove/mount/unmount
Goal - the abstraction works for:
- RexRay using external storage
- the Docker Volume API, for workloads that run in a Docker container
- binary that allocates a volume and formats it
- runs at volume definition time, or RESERVE time (TBD which time)
- tentative plan: use RexRay CLI
- binary that mounts a pre-existing volume on a slave for use by a containerized task
- tentative assumption: this will run in a Mesos isolator at task start time.
- plan: call out to the RexRay CLI (a sketch of this call-out follows the list)
- A query API that reports volume inventory and status (mounted/unmounted)
- The external storage management layer should be the source of truth as to volume existence and status
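A rough shell sketch of the mount binary's call-out at task start, and the corresponding cleanup at task termination; the device path, mount point and detach-volume command are assumptions to confirm against the RexRay CLI:

    # task start: attach the volume to this slave and mount it at the path recorded in DiskInfo
    rexray attach-volume --volumeid="vol-425e94af"
    mount /dev/scinia /mnt/ext-storage/vol-0001
    # task termination: unmount and detach so the volume can follow the task to another slave
    umount /mnt/ext-storage/vol-0001
    rexray detach-volume --volumeid="vol-425e94af"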
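And a sketch of what a response from the query API above might contain; no such API exists yet, so every field name here is illustrative:

    {
      "volumes": [
        {
          "provider_type": "scaleio",
          "provider_instance": "row8",
          "volume_id": "vol-425e94af",
          "size_mb": 512000,
          "filesystem": "ext4",
          "state": "mounted",
          "slave": "slave-17",
          "mount_point": "/mnt/ext-storage/vol-0001"
        }
      ]
    }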
Discuss build process if time allows
Kubernetes has handling for persistent volumes, including block devices. It may be worthwhile to examine this for ideas. /~https://github.com/kubernetes/kubernetes/blob/master/docs/design/persistent-storage.md Kubernetes' target use case is management of:
- GCE persistent disks
- NFS shares
- AWS EBS stores
The current facility assumes the storage is pre-provisioned, outside Kubernetes.