
Gossip Glomers

Solutions for the Distributed System Challenges from Fly.io and Kyle Kingsbury

Solutions

Hello world! Just echoes back what you send it.
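
A minimal sketch of the handler, assuming the maelstrom Go client from the challenge tutorial (github.com/jepsen-io/maelstrom/demo/go); the real solution may differ in the details:

package main

import (
    "encoding/json"
    "log"

    maelstrom "github.com/jepsen-io/maelstrom/demo/go"
)

func main() {
    n := maelstrom.NewNode()

    // Echo the body back, only changing the message type.
    n.Handle("echo", func(msg maelstrom.Message) error {
        var body map[string]any
        if err := json.Unmarshal(msg.Body, &body); err != nil {
            return err
        }
        body["type"] = "echo_ok"
        return n.Reply(msg, body)
    })

    if err := n.Run(); err != nil {
        log.Fatal(err)
    }
}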

solution / tests

Generates globally unique IDs on each node. Inspired by MongoDB's ObjectId.

Each ID is a concatenation of (see the sketch after this list):

  • Node ID: to avoid collisions with other nodes
  • Current time: to avoid collisions with messages sent at the same time
  • Count: to avoid collisions with messages sent at the same time and node
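
A sketch of how such an ID can be assembled; the separator and field layout here are illustrative, not necessarily what the solution uses:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

var counter atomic.Uint64

// newID concatenates the node ID (unique across nodes), the current Unix
// time in milliseconds (unique across time), and a per-node counter
// (unique across IDs generated in the same millisecond on the same node).
func newID(nodeID string) string {
    return fmt.Sprintf("%s-%d-%d", nodeID, time.Now().UnixMilli(), counter.Add(1))
}

func main() {
    fmt.Println(newID("n0"))
}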

solution / tests

A basic node that receives messages, saves them to a slice, and returns them in response to a read message.
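
A sketch of the two handlers, again assuming the maelstrom Go client; the mutex guards the slice since handlers may run concurrently:

package main

import (
    "encoding/json"
    "log"
    "sync"

    maelstrom "github.com/jepsen-io/maelstrom/demo/go"
)

func main() {
    n := maelstrom.NewNode()

    var mu sync.Mutex
    var messages []int

    // broadcast: save the message to the slice.
    n.Handle("broadcast", func(msg maelstrom.Message) error {
        var body struct {
            Message int `json:"message"`
        }
        if err := json.Unmarshal(msg.Body, &body); err != nil {
            return err
        }

        mu.Lock()
        messages = append(messages, body.Message)
        mu.Unlock()

        return n.Reply(msg, map[string]any{"type": "broadcast_ok"})
    })

    // read: return everything seen so far.
    n.Handle("read", func(msg maelstrom.Message) error {
        mu.Lock()
        defer mu.Unlock()
        return n.Reply(msg, map[string]any{"type": "read_ok", "messages": messages})
    })

    if err := n.Run(); err != nil {
        log.Fatal(err)
    }
}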

solution / tests

Same as the previous challenge, but now it broadcasts the message to all neighbors.
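
The forwarding step can be sketched as a small helper that slots into the broadcast handler; neighbors is assumed to come from the topology message, and the names are illustrative:

// forward sends a newly seen message to every neighbor except the node it
// came from, send-and-forget for now (no retries yet).
func forward(n *maelstrom.Node, neighbors []string, from string, message int) error {
    for _, neighbor := range neighbors {
        if neighbor == from {
            continue
        }
        body := map[string]any{"type": "broadcast", "message": message}
        if err := n.Send(neighbor, body); err != nil {
            return err
        }
    }
    return nil
}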

solution / tests

The node spawns a new thread for each message sent to a neighbor and backs off exponentially if the neighbor doesn't respond.
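
A sketch of that retry loop, assuming the maelstrom Go client's SyncRPC helper plus the context and time packages; the timings and names are illustrative:

// sendWithRetry delivers one message to one neighbor in its own goroutine,
// retrying with exponential backoff until the neighbor acknowledges it.
func sendWithRetry(n *maelstrom.Node, neighbor string, message int) {
    go func() {
        body := map[string]any{"type": "broadcast", "message": message}
        backoff := 100 * time.Millisecond

        for {
            ctx, cancel := context.WithTimeout(context.Background(), backoff)
            _, err := n.SyncRPC(ctx, neighbor, body)
            cancel()
            if err == nil {
                return // acknowledged, we're done
            }

            time.Sleep(backoff)
            backoff *= 2 // exponential backoff before the next attempt
        }
    }()
}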

This solution is pretty dumb and not the most efficient one; here are other ideas for how it could be done:

  • Use a map to keep track of which messages have not yet been acknowledged and retry them on subsequent messages
  • Use a separate thread with a ticker to periodically distribute the messages to neighbors

solution / tests

The node receiving the message broadcasts it to all the other nodes in the network using a send-and-forget approach with the gossip message type.

Results from a run:

All checks passed
  Messages per op: 11.789474/30
  Median latency: 84/400
  Maximum latency: 105/600

solution / tests

The previous solution also works for this challenge, but I've taken it a step further and tried to make it send as few messages per operation as possible.

It's the same idea as #3d, but it spawns a new thread that broadcasts the messages every 1.5s using a buffered channel, and it only broadcasts if there are new messages.
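
A sketch of that batching loop as it might sit inside main, assuming the maelstrom Go client and the time package; the channel size, interval, and gossip body are illustrative:

// Collect new messages on a buffered channel and gossip them to all the
// other nodes every 1.5s, staying quiet on ticks with nothing new.
newMessages := make(chan int, 1000)

go func() {
    ticker := time.NewTicker(1500 * time.Millisecond)
    defer ticker.Stop()

    var batch []int
    for {
        select {
        case m := <-newMessages:
            batch = append(batch, m)
        case <-ticker.C:
            if len(batch) == 0 {
                continue // nothing new since the last tick
            }
            for _, dest := range n.NodeIDs() {
                if dest == n.ID() {
                    continue
                }
                // Send-and-forget: errors are ignored here.
                n.Send(dest, map[string]any{"type": "gossip", "messages": batch})
            }
            batch = nil
        }
    }
}()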

Results:

All checks passed
  Messages per op: 4.728111/20
  Median latency: 791/1000
  Maximum latency: 1584/2000

solution / tests

I've made two solutions for this challenge: one that uses an OT approach and another that uses a CRDT.

The OT (Operational Transformation) solution uses the sequential KV store from Maelstrom and tries to Compare-and-Swap the value of the counter.
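
A sketch of that Compare-and-Swap loop as it might sit inside main, assuming the maelstrom Go client's KV API (NewSeqKV, ReadInt, CompareAndSwap) plus encoding/json, context, and time; error handling is simplified:

kv := maelstrom.NewSeqKV(n)

n.Handle("add", func(msg maelstrom.Message) error {
    var body struct {
        Delta int `json:"delta"`
    }
    if err := json.Unmarshal(msg.Body, &body); err != nil {
        return err
    }

    for {
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        current, err := kv.ReadInt(ctx, "counter")
        if err != nil {
            current = 0 // the key may not exist yet
        }

        err = kv.CompareAndSwap(ctx, "counter", current, current+body.Delta, true)
        cancel()
        if err == nil {
            break // the swap succeeded
        }
        // Another node won the race: re-read and try again.
    }

    return n.Reply(msg, map[string]any{"type": "add_ok"})
})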

solution / tests

The CRDT (Conflict-free Replicated Data Type) solution uses a map to store the counters of the other nodes and propagates its own counter when it receives an add message.
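
A sketch of the g-counter itself; the names are illustrative and the sync import is assumed:

var (
    mu       sync.Mutex
    counters = map[string]int{} // node ID -> that node's counter
)

// add only touches this node's own entry.
func add(nodeID string, delta int) {
    mu.Lock()
    counters[nodeID] += delta
    mu.Unlock()
}

// read sums the counters of all known nodes.
func read() int {
    mu.Lock()
    defer mu.Unlock()
    total := 0
    for _, c := range counters {
        total += c
    }
    return total
}

// merge takes the per-node maximum of a gossiped copy, keeping the
// counter grow-only and convergent.
func merge(remote map[string]int) {
    mu.Lock()
    defer mu.Unlock()
    for node, c := range remote {
        if c > counters[node] {
            counters[node] = c
        }
    }
}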

solution / tests


A simple implementation of a Kafka-style log: store messages in a map and return the messages after a given offset.
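
A sketch of that storage, with illustrative names; the offset is simply the index into the slice for a key:

var (
    mu   sync.Mutex
    logs = map[string][]int{} // key -> messages, offset = slice index
)

// send appends a message to a key's log and returns its offset.
func send(key string, message int) int {
    mu.Lock()
    defer mu.Unlock()
    logs[key] = append(logs[key], message)
    return len(logs[key]) - 1
}

// poll returns [offset, message] pairs starting at the given offset.
func poll(key string, offset int) [][2]int {
    mu.Lock()
    defer mu.Unlock()
    var out [][2]int
    for i := offset; i < len(logs[key]); i++ {
        out = append(out, [2]int{i, logs[key][i]})
    }
    return out
}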

solution / tests

Implementation using the Lin-KV store (sketched after this list):

  • Compare and Swap to get the next offset for a key
  • Store each message in a separate key
  • Last Write Wins to store the committed offsets
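
A sketch of the first point, claiming the next offset for a key with Compare-and-Swap on the Lin-KV store (kv := maelstrom.NewLinKV(n)); the key naming is illustrative:

// nextOffset reserves the next offset for a key. Producers race on the
// CAS; the loser re-reads and retries.
func nextOffset(ctx context.Context, kv *maelstrom.KV, key string) (int, error) {
    for {
        if err := ctx.Err(); err != nil {
            return 0, err
        }

        offset, err := kv.ReadInt(ctx, "offset_"+key)
        if err != nil {
            offset = -1 // nothing claimed for this key yet
        }

        // createIfNotExists covers the very first message for the key.
        if err := kv.CompareAndSwap(ctx, "offset_"+key, offset, offset+1, true); err == nil {
            return offset + 1, nil
        }
    }
}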

solution / tests

The previous solution also works for this challenge, so I tried to remove the KV completely and use a leader-follower approach.

The leader node is fixed to n0 (could be done with a consensus algorithm).

The leader handles the offsets to avoid conflicts. All the nodes exchange messages to stay up to date on new messages and committed offsets.

solution / tests

Note: the txn-rw-register workload doesn't really check for any anomalies, and it passes even when returning 0 for all keys, so I've tried to follow the definitions from Jepsen. Thanks to teivah's and mchernyakov's solutions, I could check if I was on the right track.

A simple, totally-available transactional store that keeps a single map and a lock for each read/write operation.

This one took me lots of back-and-forth because I wasn't sure how to implement the mutexes to get the right consistency model. Even though the test specifies read uncommitted, that level still doesn't allow G0/P0 anomalies.
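
A sketch of the handler as it might sit inside main, assuming the maelstrom Go client plus encoding/json and sync; each txn is a list of [op, key, value] triples, and a single mutex serializes them:

var (
    mu    sync.Mutex
    store = map[int]int{}
)

n.Handle("txn", func(msg maelstrom.Message) error {
    var body struct {
        Txn [][3]any `json:"txn"` // [op, key, value] triples
    }
    if err := json.Unmarshal(msg.Body, &body); err != nil {
        return err
    }

    mu.Lock()
    for i, op := range body.Txn {
        key := int(op[1].(float64))
        switch op[0].(string) {
        case "r":
            if v, ok := store[key]; ok {
                body.Txn[i][2] = v
            }
        case "w":
            store[key] = int(op[2].(float64))
        }
    }
    mu.Unlock()

    return n.Reply(msg, map[string]any{"type": "txn_ok", "txn": body.Txn})
})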


solution / tests

This solution uses a g-counter to keep track of the transactions, ensuring that G0/P0 anomalies won't happen and keeping the latest possible state of each key, and it replicates each write operation to all the other nodes.

In case of a network partition, the write keeps being replicated until it's acknowledged by the receiving node.
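
A sketch of how a replicated write could be merged under that scheme; the versioned type and names are illustrative:

// versioned pairs a value with the g-counter value at which it was written.
type versioned struct {
    Value   int `json:"value"`
    Counter int `json:"counter"`
}

var (
    mu    sync.Mutex
    store = map[int]versioned{}
)

// applyReplicated merges a write received from another node: the write
// with the higher counter wins, so every node converges on the last
// possible state of the key.
func applyReplicated(key int, incoming versioned) {
    mu.Lock()
    defer mu.Unlock()
    if current, ok := store[key]; !ok || incoming.Counter > current.Counter {
        store[key] = incoming
    }
}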

solution / tests


The solution follows the same idea as the previous one, with g-counters, but it replicates entire transactions instead of individual writes, and it first writes to a separate store and then merges the changes into the main store.

Like the previous solution, the txn keeps being replicated until it's acknowledged by the receiving node. On replication, it also merges the changes using a separate store.
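
A sketch of that two-step write, reusing the illustrative versioned type from the previous sketch; locking is omitted for brevity:

// applyTxn stages the transaction's writes away from the main store and
// only merges them in once the whole txn has been applied, so other
// transactions never observe a half-applied txn.
func applyTxn(store, staging map[int]versioned, writes map[int]versioned) {
    for key, v := range writes {
        staging[key] = v
    }
    for key, v := range staging {
        if current, ok := store[key]; !ok || v.Counter > current.Counter {
            store[key] = v // newer counters win, as on replication
        }
    }
}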

solution / tests


Project Structure

This repo is a Go workspace where each solution is in a separate module.

Each solution is in a folder with a main.go and a main_test.go for tests.
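
A hypothetical go.work to illustrate the setup; the actual module directories and Go version in the repo may differ:

// Hypothetical workspace file: one module per solution directory.
go 1.21

use (
    ./echo
    ./uniqueids
    ./broadcast
)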

Tests

All solutions have automated tests using custom utilities, both as a way to verify and play with the solutions and to learn how to write tests in Go.

Running

This project uses Nix for packages and Taskfile to run and build the solutions.

To run the solutions, use the run-* tasks:

task run-all
task run-uniqueids
task run-broadcast-efficient-ii # Will run and check the result output

To run the tests, use the test-* tasks:

task test-all
task test-uniqueids
task test-broadcast-efficient-ii
