Async I/O - Sockets

September 11, 2025

Alright, so we've got an event loop that can run callbacks once resources get ready. That sounds pretty far from our original goal of being able to "talk" to a remote machine, doesn't it? Bear with me. I promise this will make sense once we outline another cornerstone of BarkFS's networking stack.

What is I/O?

"IO" can mean a lot of different things - reading from disk, sending bytes over the network, or say talking to hardware through DMA. Since we're focused on distributed systems, the piece we care about most is network IO. Fundamental job of OS is to provide abstractions over hardware devices for us programmers to efficiently use them. For network devices this abstraction is sockets. Hello from 80s. A socket is basically the OS handing you a handle and saying: "use this to send bytes out". Every major operating system has this concept.

Now, sockets can work in blocking mode, where a thread waits until data shows up, or in non-blocking mode, where a syscall returns immediately whether there's data or not.
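
If you haven't seen it before, switching a socket into non-blocking mode is a single fcntl call. Here's a minimal sketch (POSIX, with error handling kept to a perror):

#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Ask the OS for a handle: IPv4, stream-oriented (TCP).
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    // Add O_NONBLOCK to the existing flags. From now on, reads and writes
    // on fd return immediately with EAGAIN/EWOULDBLOCK instead of parking
    // the calling thread.
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        perror("fcntl");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}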

You may ask: why would anyone use blocking mode then, given there's a shiny non-blocking one? Blocking mode is simpler to reason about and works fine when you only have a few connections. But once you have many connections, blocking mode wastes threads and time waiting on I/O.

We will build BarkFS's networking stack on top of non-blocking sockets, and this is where the reactor we built earlier comes into play.

Building a communication library

What we're building here is the foundation of a distributed system. Think about what Raft nodes need to do - they send messages to each other constantly. Vote requests, log replication, heartbeats. Each message needs to arrive intact, in order, without corruption.

You could use an existing RPC framework - gRPC, Thrift, Cap'n Proto. They all solve this problem. But they also bring a lot of baggage - code generation, complex build systems, opinions about serialization. For learning, it's better to build from first principles.

A communication library typically has these pieces:

  1. Transport layer: Handles raw socket I/O, connection management, error handling
  2. Framing layer: Turns byte streams into messages with boundaries
  3. Codec layer: Serialization/deserialization of your data structures
  4. RPC layer: Request/response matching, timeouts, error propagation

Today we're focused on pieces 1 and 2: getting bytes onto the wire reliably, and recovering structured messages on the other end.

Framing strategies

The gap between "stream of bytes" and "stream of messages" is called framing. Every network protocol needs to solve this. Here are the common approaches:

  • Length prefixing: Start each message with its size (4 bytes, usually). Read the size, then read exactly that many bytes. Repeat. Used by gRPC, Thrift, and most binary RPC frameworks.

  • Delimiters: Use special characters like newlines or null bytes to separate messages. HTTP does this with \r\n for headers. Simple but requires escaping if your data contains the delimiter.

  • Fixed size: Every message is exactly N bytes. Simple and efficient, but limiting. Works for fixed-format protocols like some financial trading systems.

We're going with length prefixing. It handles arbitrary binary data, it's simple to implement, and it's what you'll find in production systems. Four bytes at the start of each message saying "the next N bytes are a complete message."
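
To make that concrete, here's a minimal sketch of the encode side. The big-endian byte order (and the assumption that payloads fit in 32 bits) is for illustration, not a BarkFS requirement:

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Prepend a 4-byte big-endian length to a payload.
std::vector<uint8_t> encodeFrame(const std::string& payload) {
    uint32_t len = static_cast<uint32_t>(payload.size());
    std::vector<uint8_t> frame(4 + payload.size());
    frame[0] = static_cast<uint8_t>(len >> 24);
    frame[1] = static_cast<uint8_t>(len >> 16);
    frame[2] = static_cast<uint8_t>(len >> 8);
    frame[3] = static_cast<uint8_t>(len);
    std::memcpy(frame.data() + 4, payload.data(), payload.size());
    return frame;
}

The receiver does the reverse: read 4 bytes, reassemble the length, then keep reading until exactly that many payload bytes have arrived.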

The transport layer

Sockets are lower level than we want to deal with directly. So we introduce an abstraction: transports. A transport handles the nitty-gritty of reading and writing, buffering, error handling, and all that. Our code just says "send this message" or "give me the next message."

Here's the interface:

class ITransport {
public:
    virtual ~ITransport() = default;

    // Lifecycle
    virtual void attach(EventWatcher& ew, Fn<IOEvent&&> replay) = 0;
    virtual void bind() = 0;      // Server-side
    virtual void connect() = 0;   // Client-side
    virtual void close() = 0;

    // I/O
    virtual size_t resumeRead(Buffer& out_data, Peer& out_peer,
                              IOStatus& out_status, size_t offset,
                              size_t max_len) noexcept = 0;
    virtual void suspendRead() = 0;

    virtual size_t resumeWrite(Buffer&& data, const Peer& peer,
                               IOStatus& out_status) noexcept = 0;
    virtual void suspendWrite(const Peer& peer) = 0;
};

The "resume/suspend" pattern is interesting. Instead of "read" and "write" directly, you resume reading or writing. The transport does its work, then suspends itself until the next event. This fits nicely with the event loop model - the transport doesn't block, it just does what it can and yields control.

Layering transports

Here's where it gets fun: transports can wrap other transports. You might have:

  • TcpTransport: Talks directly to TCP sockets, handles raw byte reads/writes
  • FramedTransport: Wraps TcpTransport, adds length-prefix framing
  • (Future): TlsTransport: Wraps TcpTransport, adds encryption
  • (Future): CompressionTransport: Wraps anything, adds compression

Each layer handles one concern. The higher layers don't know or care about the details below them. It's the decorator pattern, basically.
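
To show the shape of that wrapping without dragging in the full ITransport interface, here's a toy sketch. The class names and the single send method are made up for illustration; only the structure matters:

#include <iostream>
#include <memory>
#include <string>

// A stripped-down stand-in for the transport interface.
struct Transport {
    virtual ~Transport() = default;
    virtual void send(const std::string& bytes) = 0;
};

// Bottom layer: pretend this writes raw bytes to a TCP socket.
struct TcpLike : Transport {
    void send(const std::string& bytes) override {
        std::cout << "tcp: " << bytes.size() << " raw bytes\n";
    }
};

// Decorator: adds a (fake) length prefix, then delegates to whatever is below.
struct FramedLike : Transport {
    explicit FramedLike(std::unique_ptr<Transport> inner) : inner_(std::move(inner)) {}
    void send(const std::string& bytes) override {
        inner_->send("[len=" + std::to_string(bytes.size()) + "]" + bytes);
    }
    std::unique_ptr<Transport> inner_;
};

int main() {
    FramedLike framed(std::make_unique<TcpLike>());
    framed.send("hello");  // Framing happens here, raw I/O one layer down.
}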

The reality of partial I/O

Here's something that bit me hard when I first did this: network I/O is always partial.

When you call write(fd, buffer, 1000), you might think "okay, I'm sending 1000 bytes." But the syscall might return 300. Not because there's an error - just because the TCP send buffer only had room for 300 bytes right then. The rest needs to wait.

Same deal with reads. You call read(fd, buffer, 1000) expecting 1000 bytes, but you get 10. Why? That's all that had arrived so far. More bytes are in flight, but they're not here yet. Or maybe they are here, but split across multiple TCP segments that haven't been reassembled yet.

This is fundamental to how TCP works. You can't assume atomicity. Every read and write might be partial.

So your transport layer needs to handle:

  • Partial writes: "I wanted to send 1000 bytes but only 300 went through." Buffer the rest, resume later.
  • Partial reads: "I got 10 bytes but the frame header is 4 bytes and the message is 500 bytes." Buffer what you got, wait for more.
  • EAGAIN/EWOULDBLOCK: The socket isn't ready. That's fine, suspend and wait for the event loop to notify you.

This is why the resume/suspend model exists. The transport might need multiple iterations to complete an operation.
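
Here's roughly what the write side of that loop looks like against a raw non-blocking socket, outside any transport abstraction. A sketch, not the task's code:

#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Push as much of [buf, buf + len) into a non-blocking socket as it will take.
// Returns the number of bytes actually sent; the caller buffers the rest and
// resumes when the event loop reports the fd writable again.
size_t writeSome(int fd, const char* buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = ::send(fd, buf + sent, len - sent, 0);
        if (n > 0) {
            sent += static_cast<size_t>(n);
        } else if (n < 0 && errno == EINTR) {
            continue;  // Interrupted by a signal; just retry.
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;     // Kernel send buffer is full: suspend, wait for writable.
        } else {
            break;     // Real error; the caller decides what to do with it.
        }
    }
    return sent;
}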

Buffering strategy

Each peer (connection) needs its own buffers:

struct RecvState {
    Buffer buffer_;
    size_t head_ = 0;  // Where we've read up to
    size_t tail_ = 0;  // How much data we have
};

struct SendState {
    Buffer frame_;     // The frame we're currently sending
    size_t offset_ = 0; // How much we've sent so far
};

std::unordered_map<Peer, RecvState> recv_by_peer_;
std::unordered_map<Peer, SendState> send_by_peer_;

When a read completes but we don't have a full frame yet, we keep the partial data in RecvState. When we try to write but can't send everything, we keep the rest in SendState. The event loop will call us back when the socket is ready again.
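
And the read side: decoding is just a question of how many buffered bytes sit between head_ and tail_. A sketch against the layout above; the Buffer alias and the tryDecodeFrame helper are stand-ins, not part of the task's API:

#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using Buffer = std::vector<uint8_t>;  // Stand-in for the project's Buffer type.

struct RecvState {
    Buffer buffer_;
    size_t head_ = 0;  // Where we've read up to
    size_t tail_ = 0;  // How much data we have
};

// Try to pull one complete length-prefixed frame out of the receive buffer.
// Returns nothing if a full frame hasn't accumulated yet.
std::optional<std::string> tryDecodeFrame(RecvState& rs) {
    size_t avail = rs.tail_ - rs.head_;
    if (avail < 4) return std::nullopt;  // Not even the length prefix yet.

    const uint8_t* p = rs.buffer_.data() + rs.head_;
    uint32_t len = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                   (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
    if (avail - 4 < len) return std::nullopt;  // Payload still in flight.

    std::string msg(reinterpret_cast<const char*>(p + 4), len);
    rs.head_ += 4 + len;  // Consume the frame; any leftover bytes stay buffered.
    return msg;
}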

🧠 Task

In tasks/socket/ you'll find:

  • tcp_transport.cpp - Handles raw TCP socket I/O
  • framed_transport.cpp - Adds length-prefix framing on top of TCP

Your job is to implement the missing pieces in framed_transport.cpp:

// ==== YOUR CODE: @62e6 ====
  // 
// ==== END YOUR CODE ====

Specifically:

  • Implement frame encoding/decoding with 4-byte length prefix
  • Handle partial reads (accumulate bytes until you have a complete frame)
  • Handle partial writes (buffer unsent bytes, resume when socket is ready)
  • Manage per-peer state for buffering

The TCP layer is already done. You're just adding the framing logic on top.

📦 Build & Test

Tests in socket_test.cpp cover:

  • Single message send/receive
  • Multiple concurrent connections
  • Large messages (trigger partial I/O)
  • Rapid fire small messages
  • Error cases

~/workspace$ tasklet test socket

There's also a benchmark that measures ping-pong latency:

~/workspace$ tasklet bench socket

On my machine, localhost ping-pong is around 15-30μs for small messages. Not bad considering we're going through the kernel, framing overhead, and the event loop.

What's next?

With reliable message transport in place, we can build higher-level protocols. Next up: RPC. Taking a function call and shipping it across the network, getting a response back. All the fun of distributed systems - timeouts, retries, backpressure, you name it.