Lighthouse Update #22

The progress before the storm...

The short version

This update has been delayed as we were hoping to make some big announcements, but it seems they will have to come in the next update.

True to our word, we have been focusing on performance and optimisations over the last month and have made some very significant gains. To list a few improvements over the last few months:

  • Block processing times are 70% faster
  • Sync speeds have improved (we can now sync our current 4k validator testnet at around 110 blocks/second; previously this was around 10 blocks/second)
  • Attestation processing times have seen a 20% speed improvement
  • RAM usage has seen a 75% reduction
  • Disk usage has been reduced by 90%
  • Fork choice is about 1000x faster

Needless to say, the current Lighthouse is significantly leaner and meaner than any of its previous versions.

Alongside this, we have been working tirelessly to build the last remaining features required by the Eth2.0 spec. Specifically, we have implemented Noise (and tested its interoperability with go and nim), built framed snappy compression and rewritten a large portion of our networking stack to accommodate the current Eth2.0 attestation aggregation strategy. The majority of these are in testing and will very soon be ready for a milestone Lighthouse update, which will mark Lighthouse as Eth2.0 mainnet feature-complete. With this, we plan to release a long-lived interoperability testnet and will encourage all clients to join and participate, but we will save these details for the next update.

Finally, it should be mentioned that we have officially kicked off the research and development of the Lighthouse UI. There's a dedicated section on this below with further details.

The longer version

The majority of work done since the last update has been quite technical performance tuning and optimisations, most of which is specific to our implementation. However, we'll detail some of the more interesting aspects of this and some lessons learnt along the way.

Memory

We've been stress-testing Lighthouse to see how far we can push our client, determine its capabilities and limitations, and then try to exceed them. One of our tests was running a testnet with 100,000 validators. We also tried putting all 100,000 validators on a single validator client. Surprisingly, the testnet worked, for the most part. We learned that the validator client software we've developed (which you would run if you want to stake in Eth2.0, where staking requires 32 ETH per validator) can handle 100,000 validators per instance, so we're pretty sure it can handle the realistic case of running tens of validators.

Of course, such a testnet was not without its issues. One of the biggest issues we discovered was Lighthouse consuming large amounts of memory, to the extent that some of our nodes were killed due to running out of memory.

We tracked the root cause down to memory fragmentation, for the most part caused by temporary allocations during tree hashing. We removed the overwhelming majority of these heap allocations through better algorithm design and by making further use of the stack. As a result, we managed to reduce the memory usage of nodes using >8GB of memory down to ~2GB. This also reduced our block processing times and was a contributing factor to the statistics listed in the introduction.

During this process we developed a new tree hashing algorithm with a streaming API that allocates O(log(n)) memory (where n is the number of nodes in the tree), instead of the O(n/2-1) allocated by our previous implementation. Combined with Rust's smallvec crate, we are hashing trees with 256 nodes or fewer without any heap allocation. The algorithm is well-documented and can be found in our repository.
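To make the O(log(n)) idea concrete, here is a minimal sketch of a streaming Merkle hasher. It is not the Lighthouse implementation: the `StreamingMerkle` type, the 64-bit toy hash built on `DefaultHasher` and the zero-padding rule are all assumptions made for illustration. The point is only that leaves can be fed in one at a time while keeping at most one pending node per tree level, whereas the reference `naive_root` materialises whole layers.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Toy 64-bit stand-in for a real 32-byte hash such as SHA-256 (illustration only).
fn hash_pair(left: u64, right: u64) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(left);
    h.write_u64(right);
    h.finish()
}

// Root of an all-zero subtree of the given height, used to pad incomplete trees.
fn zero_hash(height: u32) -> u64 {
    (0..height).fold(0, |z, _| hash_pair(z, z))
}

// Hypothetical streaming hasher: the stack holds (height, hash) pairs with
// strictly decreasing heights, so it never grows beyond about log2(n) entries.
struct StreamingMerkle {
    stack: Vec<(u32, u64)>,
    max_pending: usize, // high-water mark of the stack, for inspection
}

impl StreamingMerkle {
    fn new() -> Self {
        Self { stack: Vec::new(), max_pending: 0 }
    }

    // Push a subtree root, merging with the top of the stack while heights match.
    fn push_node(&mut self, mut height: u32, mut hash: u64) {
        while let Some(&(top_height, top_hash)) = self.stack.last() {
            if top_height != height {
                break;
            }
            self.stack.pop();
            hash = hash_pair(top_hash, hash);
            height += 1;
        }
        self.stack.push((height, hash));
        self.max_pending = self.max_pending.max(self.stack.len());
    }

    fn write_leaf(&mut self, leaf: u64) {
        self.push_node(0, leaf);
    }

    // Pad with zero subtrees until a single root remains.
    fn finish(mut self) -> u64 {
        if self.stack.is_empty() {
            return 0;
        }
        while self.stack.len() > 1 {
            let height = self.stack.last().unwrap().0;
            self.push_node(height, zero_hash(height));
        }
        self.stack[0].1
    }
}

// Reference implementation that allocates every layer: O(n) memory.
fn naive_root(leaves: &[u64]) -> u64 {
    if leaves.is_empty() {
        return 0;
    }
    let mut layer = leaves.to_vec();
    layer.resize(leaves.len().next_power_of_two(), 0);
    while layer.len() > 1 {
        layer = layer.chunks(2).map(|p| hash_pair(p[0], p[1])).collect();
    }
    layer[0]
}
```

Feeding 1,000 leaves through `write_leaf` and calling `finish` should give the same root as `naive_root`, while the stack never holds more than ten entries. A fixed-capacity stack of that kind is what makes it plausible to avoid the heap entirely for small trees.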

Remerkleable

Michael attended ETHDenver and Stanford Blockchain Conference this month to catch up with the international Eth2 community and brainstorm ideas. One of the most promising ideas came from Proto (Diederik Loerakker) in the form of a data-sharing representation for in-memory state.

Presently, most of Lighthouse's memory usage is dominated by BeaconState objects and the majority of that is to accommodate the infrequently-changing validator registry. Proto's scheme drastically reduces the amount of memory required by allowing states to share the parts they have in common. This is achieved by storing each state as an immutable binary Merkle tree. Upon a state transition, the changes to the state can be represented by a relatively small set of Merkle-tree diffs -- the unchanged parts of the parent state are referenced by pointers, rather than copied.
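The sharing described above can be sketched with reference-counted nodes. This is a toy model, not Proto's remerkleable library or the Lighthouse prototype: the `Node` type and the `from_leaves`, `set` and `get` helpers are hypothetical, leaves are plain integers, and hashing is left out entirely. It shows only that an update rebuilds the path from the root to one leaf and points at everything else.

```rust
use std::rc::Rc;

// Immutable binary tree; subtrees are shared between versions via Rc pointers.
enum Node {
    Leaf(u64),
    Branch(Rc<Node>, Rc<Node>),
}

// Build a tree from leaves. Assumes `leaves.len()` is a non-zero power of two.
fn from_leaves(leaves: &[u64]) -> Rc<Node> {
    if leaves.len() == 1 {
        Rc::new(Node::Leaf(leaves[0]))
    } else {
        let (left, right) = leaves.split_at(leaves.len() / 2);
        Rc::new(Node::Branch(from_leaves(left), from_leaves(right)))
    }
}

// Return a new root with leaf `index` set to `value`. `depth` is the height of
// `node` (3 for 8 leaves). Only the nodes on the path to the leaf are
// allocated; every sibling subtree is reused by cloning its pointer.
fn set(node: &Rc<Node>, depth: u32, index: usize, value: u64) -> Rc<Node> {
    match &**node {
        Node::Leaf(_) => Rc::new(Node::Leaf(value)),
        Node::Branch(left, right) => {
            if (index >> (depth - 1)) & 1 == 0 {
                Rc::new(Node::Branch(set(left, depth - 1, index, value), Rc::clone(right)))
            } else {
                Rc::new(Node::Branch(Rc::clone(left), set(right, depth - 1, index, value)))
            }
        }
    }
}

// Read leaf `index` by following the same bit path from the root.
fn get(node: &Rc<Node>, depth: u32, index: usize) -> u64 {
    match &**node {
        Node::Leaf(value) => *value,
        Node::Branch(left, right) => {
            if (index >> (depth - 1)) & 1 == 0 {
                get(left, depth - 1, index)
            } else {
                get(right, depth - 1, index)
            }
        }
    }
}
```

With eight leaves, changing leaf 5 allocates four new nodes (the root, two branches and the leaf). The old root still reads the old value, and both roots point at the very same left subtree, which can be checked with `Rc::ptr_eq`.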

Michael implemented a prototype of the idea in Lighthouse, focussed on the validator registry and found that it ran about as quickly as the current approach whilst reducing memory usage. However, more engineering work is required to reach production readiness as there are some open questions about the impact of many small allocations on memory fragmentation. As described above, we have been working hard to reduce memory usage and fragmentation in other ways so this approach might have to be highly tuned to out-perform the old one. To learn more, you could check out Proto's remerkleable Python library.

Noise

The Noise framework provides a collection of cryptographic primitives that allow users to establish secure connections. libp2p has recently added a specification for Noise handshakes to function as the encryption layer in a libp2p stack.

Because Noise had not yet been specified for libp2p, Eth2 clients have been using a protocol called secio for establishing encrypted communications between peers. This, however, has always been a temporary placeholder to be replaced by Noise for mainnet. As the specification has solidified, we have added support for Noise encryption to our libp2p stack and have successfully performed interoperability testing with nim and go (although there are some updates that need to be merged upstream as a result). Lighthouse now prefers Noise as the security layer, but will fall back to secio if a client does not yet support it. For mainnet, we will be removing secio support entirely.
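The prefer-then-fall-back behaviour can be pictured as picking the first entry of an ordered preference list that the remote peer also supports. The sketch below is a simplification and not how libp2p negotiates protocols on the wire: the `Security` enum, the `PREFERENCE` list and the `negotiate` function are hypothetical names used only to illustrate the ordering.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Security {
    Noise,
    Secio,
}

// Our ordered preference: try Noise first, then fall back to secio.
// For mainnet, the plan described above would shrink this list to Noise only.
const PREFERENCE: [Security; 2] = [Security::Noise, Security::Secio];

// Pick the first of our preferred protocols that the peer also supports.
// Returns None when there is no common security protocol.
fn negotiate(peer_supports: &[Security]) -> Option<Security> {
    PREFERENCE
        .iter()
        .copied()
        .find(|candidate| peer_supports.contains(candidate))
}
```

A peer advertising both protocols ends up on Noise regardless of the order it lists them in, a secio-only peer still connects, and a peer with no protocol in common is rejected.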

Snappy Compression

A mainnet client should support compression over the network. Our (Eth2) chosen compression algorithm is snappy. There has been some debate recently about which kind of snappy algorithm to use (chunked or framed); see this PR for more details. We have pre-emptively built support for the framed version of snappy compression as outlined in the mentioned PR. This support is awaiting Lighthouse's network upgrade and so we are yet to find any solid metrics from adding the compression to our networking, but look forward to them in the near future.

Aggregation Strategy

The aggregation strategy for mainnet is to segregate attestations into gossipsub subnets. Within each subnet, a set of validators is randomly chosen to collect all of the un-aggregated attestations on that subnet, aggregate them and publish the result on a global topic to all peers.
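As a rough illustration of the two decisions involved (which subnet an attestation belongs to, and whether a given validator is one of the randomly chosen aggregators), here is a small sketch loosely modelled on the Eth2 validator spec of the period. The constant values and the committee-to-subnet mapping are assumptions that may not match the final spec, and `selection_value` stands in for the leading bytes of a hash of the validator's slot signature, which this sketch does not compute.

```rust
// Assumed values for illustration; the real constants live in the Eth2 spec.
const ATTESTATION_SUBNET_COUNT: u64 = 64;
const TARGET_AGGREGATORS_PER_COMMITTEE: u64 = 16;

// Map a committee to the gossipsub subnet carrying its un-aggregated attestations.
fn subnet_for_committee(committee_index: u64) -> u64 {
    committee_index % ATTESTATION_SUBNET_COUNT
}

// Decide whether a validator aggregates for its committee. `selection_value` is
// assumed to be derived from a hash of the validator's slot signature, so it is
// unpredictable to others yet verifiable once revealed. On average roughly
// TARGET_AGGREGATORS_PER_COMMITTEE members of a large committee pass the check.
fn is_aggregator(committee_len: u64, selection_value: u64) -> bool {
    let modulo = std::cmp::max(1, committee_len / TARGET_AGGREGATORS_PER_COMMITTEE);
    selection_value % modulo == 0
}
```

For a committee of 128 the modulus is 8, so about one validator in eight is selected. Committees with fewer than 32 members end up with a modulus of 1, in which case every member aggregates.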

Implementing this has taken us significantly longer than we were expecting. There are a number of caveats involved in handling this aggregation strategy. Timing, for example, is one. Lighthouse needs to time subscriptions to subnets carefully, to give the beacon node enough time to search for and discover new peers that might be subscribed to the subnet. Once new peers are found, the beacon node needs to connect to them, subscribe to the gossipsub subnet topic and join the gossipsub mesh of the newly connected peers. This must all be done before attestations arrive on that subnet for the relevant slot. This is all done for a single validator (which must do this every epoch), and let's not forget, we're pushing Lighthouse to support 100,000 validators on a single validator client/beacon node.
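The timing constraint boils down to working backwards from the duty slot: peer discovery, connection and subscription all have to start early enough to finish before that slot begins. The sketch below shows only that arithmetic. The slot length, the two-slot lead time and the function names are hypothetical and are not taken from Lighthouse.

```rust
use std::time::Duration;

// Assumed slot length and a hypothetical lead time; both are illustrative values.
const SECONDS_PER_SLOT: u64 = 12;
const DISCOVERY_LEAD_SLOTS: u64 = 2;

// Time since genesis at which `slot` begins.
fn slot_start(slot: u64) -> Duration {
    Duration::from_secs(slot * SECONDS_PER_SLOT)
}

// How long to wait (from `now`, measured since genesis) before starting peer
// discovery for a subnet duty at `duty_slot`.
// Returns Some(0s) if discovery should already be under way, and None if the
// duty slot has already started and the window was missed.
fn time_until_discovery(now: Duration, duty_slot: u64) -> Option<Duration> {
    if now >= slot_start(duty_slot) {
        return None;
    }
    let start = slot_start(duty_slot.saturating_sub(DISCOVERY_LEAD_SLOTS));
    Some(start.checked_sub(now).unwrap_or(Duration::from_secs(0)))
}
```

With these assumed values, a duty at slot 10 means discovery should begin 96 seconds after genesis. A real scheduler would run this for every validator's duty each epoch, which is where the cost of supporting very large validator counts comes from.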

We think most of the intricacies of this strategy have now been built, and we are in the final stages of completing the PR that introduces it. Significant internal testing will then follow, paving the way to an Eth2.0 mainnet feature-complete Lighthouse.

Lighthouse UI

We recently announced an RFP for a Lighthouse UI. The RFP has now closed, candidates have been selected and the UI is currently in the research phase. We have teamed up with Aqeel from Empire Ventures to research and design the interface, and with Flex Dapps to build it out. The research is well underway and we are actively encouraging community members who have an interest in staking in Eth2 to have their say by contributing to this Eth2 UX survey or reaching out in our discord.

We expect there to be steady progress on this in the coming weeks, so watch out for the next update to see how Lighthouse starts becoming end-user friendly.

Things still to come

Although we are nearing the end of building all the necessary Eth2 features, there are still a lot of internal bells and whistles we would like to add to Lighthouse before mainnet.

We are in the process of designing and building an internal peer management system, which will score peers and provide sophisticated logic for tracking all known peers in the system. In particular, this should add a significant level of DoS protection, as various attacks will be recorded against a peer's name, which should eventually lead to Lighthouse banning peers that are acting maliciously.
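Since the peer management system is still being designed, the following is only a guess at its simplest possible shape: each offence subtracts from a per-peer score, and a peer whose score drops to a threshold is banned. The `PeerManager` type, the `report` and `is_banned` methods, the string peer IDs and the threshold of -100 are all hypothetical.

```rust
use std::collections::HashMap;

// Hypothetical threshold: a peer whose score falls to this value or below is banned.
const BAN_THRESHOLD: i32 = -100;

#[derive(Default)]
struct PeerManager {
    scores: HashMap<String, i32>, // peer ID -> running score, starting at 0
}

impl PeerManager {
    // Record an offence against a peer. Returns true if the peer is now banned.
    fn report(&mut self, peer_id: &str, penalty: i32) -> bool {
        let score = self.scores.entry(peer_id.to_string()).or_insert(0);
        *score = score.saturating_sub(penalty);
        *score <= BAN_THRESHOLD
    }

    // Unknown peers have no recorded offences and are therefore not banned.
    fn is_banned(&self, peer_id: &str) -> bool {
        self.scores
            .get(peer_id)
            .map_or(false, |score| *score <= BAN_THRESHOLD)
    }
}
```

A real design would likely also decay scores over time and weight offences by severity. The sketch leaves both out to keep the ban rule visible.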

Stable futures! Lighthouse has been late to the party on Rust's stable futures (as we've been focusing on building out all the mainnet features). However, we are now on track to upgrade to stable futures, which in turn will bring the performance improvements that have been developed in some of our core dependencies since stable futures were released.

Finally (after a bit of a delay), we will be getting a security audit from Trail of Bits (ToB). Once we have all our features and are practically ready for mainnet, our great friends at ToB will sift through our code, helping us make sure it is fit and safe for public consumption.

For the next update, we imagine we will be announcing a new testnet, open for everyone!