Lighthouse Update #28

Update Summary

It has been all hands on deck since the last update, and a lot has been happening. In quick dot point form, here are the highlights:

  • Medalla - a large, public multiclient testnet was launched
  • v0.2.0 of Lighthouse was released (coinciding with the Medalla testnet)
  • Rust-gossipsub 1.1 is ready for testing
  • Key-management improvements and EF launchpad integration
  • Advanced peer management and peer scoring
  • Significant work on stability, performance and testnet improvements

The Medalla Testnet

Since the last update, the Medalla public multiclient testnet has launched. It has over 20,000 validators and at least 5 different client implementations running.

Its launch had a rough start, with a number of large stakers being offline during genesis. This led to a lower participation rate than expected. Many of these stakers soon joined, and the testnet achieved finality and has been running relatively smoothly since. We've had a chance to do a lot of client interoperability testing and performance tweaking on this testnet over the past few weeks.

Attestation Inclusion

One of the main concerns we witnessed on this testnet is the rate of attestation inclusion. Each validator must produce an attestation each epoch. These attestations get included in future blocks and, once included, the validator associated with the attestation is rewarded. Although on a Lighthouse-only network we see 100% attestation inclusion (all produced attestations get included in blocks), on Medalla some attestations were being missed (not included in a future block). This is not a straightforward problem to solve, as the process is somewhat involved, as I'll outline.

Each epoch, validators get shuffled into committees. Each validator in a committee must publish its attestation to a specific gossipsub subnet (related to its committee). Of all the validators in a committee, a pseudo-random subset (around 16) is required to collect all the attestations and aggregate them. These "aggregators" then publish the aggregate attestations onto a global "aggregate" gossipsub channel. This allows block proposers to subscribe only to the "aggregate" channel and hence only see the grouped attestations of each committee rather than all the individual attestations that occur in the gossipsub subnets. A block proposer should then select the aggregate attestations it deems most profitable and include them in its proposed block.
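For a rough idea of how that pseudo-random subset of aggregators is chosen, the sketch below mirrors the shape of the spec's aggregator-selection check: a validator hashes its signature over the slot and becomes an aggregator if the result is divisible by the committee size divided by the target of 16. The function name and the use of Rust's `DefaultHasher` are purely illustrative; the actual spec hashes a BLS slot signature with SHA-256.

```rust
use std::cmp::max;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Target number of aggregators per committee, as referenced above.
const TARGET_AGGREGATORS_PER_COMMITTEE: u64 = 16;

/// Rough sketch of aggregator selection: hash the validator's slot signature
/// and check divisibility, so on average ~16 members per committee qualify.
/// (Illustrative only; the spec hashes a BLS signature with SHA-256.)
fn is_aggregator(committee_len: u64, slot_signature: &[u8]) -> bool {
    let modulo = max(1, committee_len / TARGET_AGGREGATORS_PER_COMMITTEE);
    let mut hasher = DefaultHasher::new();
    slot_signature.hash(&mut hasher);
    hasher.finish() % modulo == 0
}
```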

Now there are many points of failure that could prevent a validator's attestation from being included in a block. Firstly, a client needs to find other peers that are also subscribed to the required subnet. If no such peers exist, the published attestation cannot propagate and will not reach any of the subsequent steps to be included in a block. If a client has found sufficient peers, it is then up to the "aggregators" of the subnet to receive the attestation from the gossipsub subnet and aggregate it. The aggregators could be different client implementations and may not agree on the attestations they have received on the subnet. Timing between clients also matters here, to ensure the attestation is published before the aggregator performs the aggregation. The aggregator must then publish on the global "aggregate" channel, and a block producer (within an epoch of the attestation being published) must see the aggregate attestation and decide to include it in its block. These last two steps could be done by any node on the network, and the challenge lies in getting all client implementations to work together harmoniously throughout the entire process.
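To make the aggregation step itself more concrete, here is a simplified sketch of what an aggregator does with the individual attestations it receives on a subnet before publishing to the "aggregate" channel. The types are stripped-down stand-ins, not Lighthouse's real attestation types (which also carry BLS signatures that would be aggregated at the commented step).

```rust
// Stripped-down stand-ins for the real attestation types.
#[derive(Clone, Debug, PartialEq)]
struct AttestationData {
    slot: u64,
    committee_index: u64,
    beacon_block_root: [u8; 32],
}

#[derive(Clone, Debug)]
struct Attestation {
    data: AttestationData,
    // One bit per committee member; `true` means that member signed.
    aggregation_bits: Vec<bool>,
}

/// Fold unaggregated attestations over identical data into one aggregate,
/// as an aggregator on a subnet would before publishing it.
fn aggregate(attestations: &[Attestation]) -> Option<Attestation> {
    let (first, rest) = attestations.split_first()?;
    let mut aggregate = first.clone();
    for att in rest.iter().filter(|a| a.data == aggregate.data) {
        for (agg_bit, bit) in aggregate
            .aggregation_bits
            .iter_mut()
            .zip(&att.aggregation_bits)
        {
            *agg_bit |= *bit;
        }
        // A real implementation would also BLS-aggregate the signatures here.
    }
    Some(aggregate)
}
```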

It is our goal (and I'm sure that of other client implementation teams) to achieve a 100% attestation inclusion rate for all our validators. But this will be a cross-client debugging and engineering effort which will likely involve a number of iterations and should continuously improve as clients progress.

In the past week we've already seen large gains from various teams in this regard. We've managed to increase our inclusion rate on low-performance nodes by reducing processing load in other parts of the client.

The Medalla Hiccup

On the 14th of August there was an issue with Cloudflare's Roughtime server which caused all Prysm nodes to exhibit a clock skew, negating all their attestations and blocks. This effectively removed all Prysm nodes from the chain, which was around 70% of the network (a detailed account can be found here). The chain lost finality due to a large number of validators being effectively offline. This is one of the primary reasons for a multi-client chain and client diversity: in events like this, the chain continues to run if a single implementation is taken offline.

Despite being a catastrophic failure (having such a large portion of the network simultaneously taken offline), this has been a very fruitful event for implementers to see how their clients handle such conditions.

For us, we saw an influx of invalid blocks and attestations flooding the gossipsub channels, which led to some processing bottlenecks in Lighthouse. We've seen bizarre memory consumption and also some interesting syncing edge cases due to the various forks from the Prysm clients. This has allowed us to identify the hot-spots which occur in adverse conditions and correct them, ultimately stabilising our client further and allowing it to handle such extreme conditions going forward.

We are in the process of completing many of these updates, and expect to have a significantly more robust and performant client once they are done.

Gossipsub 1.1

We have completed our first version of rust-gossipsub 1.1. We have integrated it into Lighthouse and run it on the Medalla testnet successfully. Gossipsub 1.1 is a more secure version of its predecessor which primarily incorporates a peer scoring mechanism, designed to mitigate a number of attacks and maintain a healthy mesh network for message propagation.
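As a rough illustration of the shape of that scoring mechanism (the struct fields, weights and names below are illustrative, not the parameters Lighthouse will ship), a peer's score is essentially a weighted sum of its per-topic behaviour, with penalties for delivering invalid messages:

```rust
// Illustrative per-topic scoring parameters and counters; not the real
// gossipsub 1.1 parameter set, just the general shape of it.
struct TopicScoreParams {
    topic_weight: f64,
    time_in_mesh_weight: f64,
    first_message_deliveries_weight: f64,
    invalid_message_deliveries_weight: f64, // negative: penalises bad messages
}

struct TopicStats {
    time_in_mesh: f64,               // capped counter
    first_message_deliveries: f64,   // decayed counter
    invalid_message_deliveries: f64, // decayed counter
}

/// A peer's score: a weighted sum of per-topic contributions, minus any
/// global behavioural penalty.
fn peer_score(topics: &[(TopicScoreParams, TopicStats)], behaviour_penalty: f64) -> f64 {
    let topic_score: f64 = topics
        .iter()
        .map(|(p, s)| {
            p.topic_weight
                * (p.time_in_mesh_weight * s.time_in_mesh
                    + p.first_message_deliveries_weight * s.first_message_deliveries
                    + p.invalid_message_deliveries_weight
                        * s.invalid_message_deliveries.powi(2))
        })
        .sum();
    topic_score - behaviour_penalty
}
```

Peers whose score drops below configured thresholds are pruned from the mesh or ignored entirely, which is why choosing sensible parameters for the Ethereum 2.0 topics matters.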

We are in the early stages of this development, however. Our plan over the next few weeks is to perform some more internal testing, attempt some large-scale simulations, and design and document a set of scoring parameters that should be applicable to the Ethereum 2.0 gossipsub channels. The end result should be a more resilient network of Ethereum 2.0 nodes for message propagation.

Stability, Performance and Peer Management

The events that have taken place on the Medalla testnet have helped us identify some key areas of improvement in Lighthouse. We have been profiling the client, searching for processing bottlenecks, excessive memory usage and overall client stability.

We have seen performance degradation when decoding/processing large numbers of blocks/attestations. These were originally being processed on the core executor (a thread pool managing basic client operation), which meant that core parts of Lighthouse would be delayed whilst block/attestation processing was underway. We have been identifying heavy processes and lifting them off the core executor into their own tasks, such that core Lighthouse components continue to run as expected even under high load. This has been shown to improve the attestation inclusion rate of some Lighthouse users.
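As a minimal sketch of the general pattern (assuming a tokio runtime; the function names below are hypothetical, not Lighthouse's actual task names), CPU-heavy work is handed to a dedicated blocking thread pool so the async executor threads stay free for core duties:

```rust
use tokio::task;

// Hand CPU-heavy work (e.g. block or attestation verification) to a blocking
// thread pool so the async executor threads are not stalled while it runs.
async fn process_block(block_bytes: Vec<u8>) -> Result<(), String> {
    let result = task::spawn_blocking(move || {
        // Stand-in for expensive decoding / signature verification.
        expensive_verification(&block_bytes)
    })
    .await
    .map_err(|e| format!("task panicked: {}", e))?;

    result
}

fn expensive_verification(_bytes: &[u8]) -> Result<(), String> {
    // Placeholder for real SSZ decoding + BLS verification.
    Ok(())
}
```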

We have also identified areas of Lighthouse that allocate more memory than required. We are actively hunting for memory leaks and unnecessary allocations, as we observed some Lighthouse nodes spike in memory usage during the Medalla hiccup.

There is also a known deadlock in the client which occurs more regularly on nodes with low CPU counts. We are actively hunting this deadlock (and have been for a while) and are narrowing it down. We should have it resolved soon.

Finally, since our last update, we have enhanced our peer management system. In addition to the scoring, we actively track current and past-known peers. When a Lighthouse node is run with default parameters, it will typically connect to 50 peers and oscillate somewhere between 50 and 55. Lighthouse, by default, targets 50 peers to connect to and has a 10% threshold which allows additional incoming connections. This allows new peers to easily join the network (we don't reject them if we've already reached 50 peers) and lets us cycle our peer pool with new peers, potentially removing non-performant or malicious peers.

If your Lighthouse node is fluctuating between 50 and 55 peers, this is the desired behaviour and it is running as expected.
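The arithmetic behind those numbers is simply the target plus the 10% allowance for incoming connections (the constant and function names below are illustrative, not Lighthouse's actual configuration fields):

```rust
// Illustrative sketch of the peer-limit arithmetic described above.
const TARGET_PEERS: usize = 50;
const EXCESS_PEER_FRACTION: f64 = 0.10;

/// Upper bound on connected peers: the target plus a 10% allowance for
/// incoming connections, i.e. 55 with the default target of 50.
fn max_peers(target: usize, excess_fraction: f64) -> usize {
    target + (target as f64 * excess_fraction).ceil() as usize
}

fn main() {
    assert_eq!(max_peers(TARGET_PEERS, EXCESS_PEER_FRACTION), 55);
    println!(
        "peer count should oscillate between {} and {}",
        TARGET_PEERS,
        max_peers(TARGET_PEERS, EXCESS_PEER_FRACTION)
    );
}
```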

Still to come

The Medalla hiccup has prompted a number of improvements to the Lighthouse code base. We are still working on some of these, and we expect a series of updates to the client within the next week to handle all the issues we've witnessed over the last few days.

We will continue working (with other client teams) to increase the attestation inclusion rate. This will be a long-running endeavour, and hopefully we can achieve a 100% inclusion rate for all our validators soon.

Lighthouse will be completing its first audit around October (this will be primarily focused on the networking components) and will undergo its second audit around the same time.

We will be working to have everything complete and working smoothly for these audits.