Optimizing an Order Matching Engine with an FPGA

12/12/2024

DISCLAIMER: I am still a student and do not have any real-world experience programming in financial markets. I'm sure any experienced quant dev would pick up on that from my code and content, so if you see anything here that is factually incorrect, misleading, or bad practice, please let me know!

source code

Part 0: Introduction

Trading has evolved a lot. Now that we are (mostly) out of the stone age of trading pits, we enjoy a less chaotic and far more efficient avenue through which market participants can trade in a timely and fair manner. The order matching engine is easily the most significant piece of this: it determines when and how trades execute, how market participants are prioritized, and much more that is crucial to capital markets. I've made it my goal to develop a solid understanding of this piece of technology through a lot of reading, implementation, and testing.

I've developed a couple of different order matching engines in the past in Python and C++. It's not hard to understand how one works at a high level: it's essentially two collections of orders (buy orders and sell orders), and it's up to the matching engine to make sure that whenever the highest buy order's price is >= the lowest sell order's price, orders get matched. This is of course an oversimplification, but implementing a basic orderbook for your own understanding shouldn't take a huge amount of time. It starts to get complicated when you think about how an orderbook is built and applied to a real-world trading environment, with many different users concurrently interacting with it. There is a lot to consider when actually building an application like this, and if you're interested in a deep dive I highly recommend the book Building Low Latency Applications with C++ by Sourav Ghosh, which covers how to build a serious multithreaded exchange with TCP and UDP multicast connections.
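To make the crossing condition concrete, here is a minimal sketch (not the project's actual code, which queues individual orders per price level) where each side aggregates quantity per integer price:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>

// Minimal matching sketch: bids sorted descending, asks ascending.
// Keys are integer prices in cents; values are total resting quantity.
struct MiniBook {
    std::map<int, int, std::greater<int>> bids; // best (highest) bid first
    std::map<int, int> asks;                    // best (lowest) ask first

    // Match while the highest bid crosses (>=) the lowest ask.
    // Returns the total quantity filled.
    int match() {
        int filled = 0;
        while (!bids.empty() && !asks.empty() &&
               bids.begin()->first >= asks.begin()->first) {
            auto bid = bids.begin();
            auto ask = asks.begin();
            int qty = std::min(bid->second, ask->second);
            filled += qty;
            if ((bid->second -= qty) == 0) bids.erase(bid);
            if ((ask->second -= qty) == 0) asks.erase(ask);
        }
        return filled;
    }
};
```

The `std::map` here is just for brevity; as shown below, the actual project indexes price levels directly into a vector instead of walking a tree.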

For this project however, I will be honing in on specific aspects of the orderbook that I want to understand better. Specifically, I will be focusing on two things:

This means I will be abstracting away, or simply ignoring, some glaringly important parts of the exchange which, if not for the sake of targeted learning, would be foolish to leave out. These include support for multiple tickers, support for multiple users, responses to clients, client portfolios, and a lot of other more minor things.

Part 1: Project Layout

The project has two separate branches, which are just two different extensions of the same orderbook implementation, meant to explore two different concepts. These are:

Both branches log the latency of each matched order and generate a report (.rpt) file containing metrics about order latencies.

Here I will show the important parts of my code, starting with the structure of the orderbook itself:


struct Order {
    bool buy;           // true for buy, false for sell
    int price;          // integer price
    int quantity;       // quantity remaining
    // int client_id;      //id of client placing order (so if we match, we know who to tell)
};
        
class OrderBook {

    private:
        static const int SCALE_FACTOR = 100;  // 1 = 1 cent
        static const int MIN_PRICE = 1; //1 means min price of $0.01 per security
        static const int MAX_PRICE = 1000000; //1M means a max price of $10k per security
        static const int PRICE_RANGE = MAX_PRICE - MIN_PRICE + 1;

        int bestBidIndex = -1; 
        int bestAskIndex = -1; 
    

        std::vector<std::deque<Order>> bids; // array of queues for buy orders
        std::vector<std::deque<Order>> asks; // array of queues for sell orders

        ...
};
            

As you can see, we have a simple Order struct with the bare-bones attributes of an order. On top of this, we keep our bids and asks in per-price deques (more on this later, as this could be a latency bottleneck). We index this data structure by converting the order's price to cents (multiplying the dollar price by 100) and using that number of cents as the index.
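As a sketch of that indexing scheme (the constants mirror the OrderBook above; the helper names are my own, not necessarily the project's):

```cpp
#include <cassert>

// Price-to-index mapping for the array-of-deques layout. A price of $123.45
// arrives as 12345 cents; offsetting by MIN_PRICE gives the vector slot.
// This makes price-level lookup O(1) with no hashing or tree traversal.
constexpr int SCALE_FACTOR = 100;   // 1 unit = 1 cent
constexpr int MIN_PRICE = 1;        // $0.01 per security
constexpr int MAX_PRICE = 1000000;  // $10,000.00 per security
constexpr int PRICE_RANGE = MAX_PRICE - MIN_PRICE + 1;

constexpr int price_to_index(int price_in_cents) {
    return price_in_cents - MIN_PRICE;
}

constexpr int index_to_price(int index) {
    return index + MIN_PRICE;
}
```

The trade-off is memory: the bids and asks vectors each hold PRICE_RANGE (one million) deques regardless of how many price levels are actually populated, which is why bestBidIndex/bestAskIndex are tracked separately instead of scanning the whole range.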

Now, for the client and server:

Server

We initialize with our standard socket(), bind() and listen():

int Server::initialize() {

    //create socket
    server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd < 0) {
        std::cerr << "could not create socket\n";
        return 1;
    }

    //bind
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(PORT);
    //server_addr.sin_addr.s_addr = INADDR_ANY; 
    server_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
    
    if (bind(server_fd, (struct sockaddr*) &server_addr, sizeof(server_addr)) < 0) {
        std::cerr << "could not bind\n";
        close(server_fd);
        return 1;
    }

    //listen for connection
    if (listen(server_fd, 1) < 0) {
        std::cerr << "could not listen\n";
        close(server_fd);
        return 1;
    }

    std::cout << "listening on port " << PORT << "...\n";

    return 0;
}
            

Then, we wait for a client connection via the accept() function:

int Server::wait_for_client_connection() {
    //accept client connection
    socklen_t client_len = sizeof(client_addr);
    client_fd = accept(server_fd, (struct sockaddr*)&client_addr, &client_len);
    if (client_fd < 0) {
        std::cerr << "could not accept\n";
        close(server_fd);
        return 1;
    }

    std::cout << "client connected...\n";
    return 0;
}   
            

We now read from the client via the recv() function into a buffer, through which the client communicates.

ssize_t bytes_read = recv(client_fd, buffer, BUFF_SIZE, 0);
            
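The bytes in that buffer then need to be decoded into an Order. The project's actual wire format isn't shown here, so the sketch below assumes a hypothetical fixed-size binary layout (1-byte side, then price and quantity as packed 32-bit integers) purely for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical wire format: 1 byte side ('B'/'S') + 4 bytes price +
// 4 bytes quantity, host byte order, no padding. The real format may differ.
struct Order {
    bool buy;
    int price;
    int quantity;
};

constexpr std::size_t MSG_SIZE = 1 + 4 + 4;

// Decode one fixed-size message from a recv() buffer into an Order.
Order decode(const unsigned char* buf) {
    Order o{};
    o.buy = (buf[0] == 'B');
    std::int32_t price, qty;
    std::memcpy(&price, buf + 1, 4); // memcpy avoids unaligned-access UB
    std::memcpy(&qty, buf + 5, 4);
    o.price = price;
    o.quantity = qty;
    return o;
}
```

One caveat worth remembering: TCP is a byte stream, so a single recv() is not guaranteed to return exactly one message; a robust loop would accumulate bytes until at least MSG_SIZE are available before decoding.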

Client

We start by creating our socket and setting the server's port and address. Then, we connect via connect().

int Client::connect_to_server() {
    client_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (client_fd < 0) {
        std::cerr << "could not create socket\n";
        return 1;
    }

    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(server_port);
    server_addr.sin_addr.s_addr = inet_addr("127.0.0.1"); // must match the server's bind address (INADDR_ANY here would connect to 0.0.0.0)

    if (connect(client_fd, (struct sockaddr*)&server_addr, sizeof(server_addr)) < 0) {
        std::cerr << "connection failed\n";
        close(client_fd);
        return 1;
    }

    std::cout << "connected to server " << server_ip << ":" << server_port << "\n";
    return 0;
}
            

Other Important Stuff

I test the latencies of each order processing event by recording the start and end times:

auto start = std::chrono::steady_clock::now();
//order processing code
auto end = std::chrono::steady_clock::now();
long long latency = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
            

I push each latency to a vector, and at the end of the program's lifecycle, calculate and log these metrics into a report (.rpt) file:
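As a sketch of the kind of summary those latency samples support (the metric names here are my own choice, not necessarily what the report contains):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Summarize per-order latencies (nanoseconds) into min / median / p99 / max.
struct LatencyStats {
    long long min, median, p99, max;
};

LatencyStats summarize(std::vector<long long> samples) {
    std::sort(samples.begin(), samples.end());
    // Nearest-rank percentile over the sorted samples.
    auto pct = [&](double p) {
        std::size_t i = static_cast<std::size_t>(p * (samples.size() - 1));
        return samples[i];
    };
    return {samples.front(), pct(0.5), pct(0.99), samples.back()};
}
```

For latency work the tail percentiles (p99 and above) usually matter more than the mean, since a few slow outliers can dominate perceived performance.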

On average, it seems that I am unable to get under 3300ns per order. Next, we'll investigate where this time goes and how to optimize it.

Part 2: Understanding the Latency