Lighthouse Update #34

An update on improvements to Lighthouse aimed at increasing validator rewards and decreasing server costs.

TL;DR

Since our last update, the team has been focussed on improving the Lighthouse Beacon Chain implementation. This means analysing validator performance and resource consumption. We have made progress in two major areas:

  • Block propagation times.
  • Memory consumption.

Our latest release, v1.1.1, includes improvements to block propagation times. We recommend all users update to this latest release, even if they are not running a validator. We'll go into further detail later in this post.

Memory consumption improvements have not yet been released; however, we hope to push them into production in the coming weeks.

Separate from development, we've been interviewing lots of candidates after our recent hiring campaign. We're excited to announce that we'll have at least one new Lighthouse developer starting in March.

Block Propagation Times

Validators on Eth2 relatively frequently (sometimes multiple times per day) experience a reduction in rewards due to missed head/target votes. The beaconcha.in block explorer doesn't go into detail on the head/target aspect of attestations; however, missed head/target votes are frequently the cause of reward variations of between 0.00002 and 0.00004 ETH. This section of the post describes the concept of head/target attestations, why we think they're being missed, and what we're doing to fix it.

To understand head/target votes, we must look at the AttestationData object that validators include in their attestations. With this object, each attestation votes for three block roots (a sketch of the structure follows this list):

  • Head vote: attestation_data.beacon_block_root
    • Set to the root of the block at attestation_data.slot (i.e., the head of the chain).
  • Target vote: attestation_data.target.root
    • Set to the root of the block at the start of the current epoch (i.e., the first slot of epoch(attestation_data.slot)).
    • In the first slot of an epoch, this is equal to the head vote.
  • Source vote: attestation_data.source.root
    • Set to the root of the block at the start of the current justified epoch.
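
For reference, here is a minimal sketch of the AttestationData container as defined in the consensus specification. The plain type aliases are a simplification for illustration only; Lighthouse's actual implementation uses richer, SSZ-aware types.

// A simplified sketch of AttestationData and Checkpoint from the consensus spec.
type Slot = u64;
type Epoch = u64;
type CommitteeIndex = u64;
type Root = [u8; 32];

struct Checkpoint {
    epoch: Epoch,
    root: Root,
}

struct AttestationData {
    slot: Slot,
    index: CommitteeIndex,
    beacon_block_root: Root, // the head vote
    source: Checkpoint,      // the source vote
    target: Checkpoint,      // the target vote
}

fn main() {
    // A dummy attestation for slot 64, the first slot of epoch 2: at an epoch
    // boundary the head vote and the target vote reference the same block.
    let head_root = [0u8; 32];
    let data = AttestationData {
        slot: 64,
        index: 0,
        beacon_block_root: head_root,
        source: Checkpoint { epoch: 1, root: [1u8; 32] },
        target: Checkpoint { epoch: 2, root: head_root },
    };
    assert_eq!(data.beacon_block_root, data.target.root);
}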

For an attestation to be included in a block and therefore be eligible for rewards, the source vote must always point to a block in the canonical chain (as opposed to some non-canonical block that was forked out). However, this is not the case for the head and target votes; if these votes are invalid (i.e., they point to non-canonical blocks), the attestation can still be included in a block, but the attester won't receive the full reward (recall the 0.00002 and 0.00004 ETH differences).

A notable difference between the source vote and the head/target vote is their depth in the chain. The source vote is always more than 32 blocks/slots (several minutes) deep in the chain. On the other hand, the target vote is 0-31 blocks deep (depending on our progress through the epoch) and the head is always at the very tip of the chain (0 blocks deep). It's rare to see an invalid source vote since the depth ensures nodes have had a long time to propagate and agree upon it. However, the head and occasionally the target vote reference the latest block. The latest block is created only four seconds before the attestation and it's therefore much more challenging to ensure that all nodes have seen it, verified it and imported it.
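
To make the depth argument concrete, here is a tiny illustration (not Lighthouse code) using the mainnet constant of 32 slots per epoch.

// Rough depth of each vote, in slots, for an attestation at `slot`.
const SLOTS_PER_EPOCH: u64 = 32;

// The head vote references the block at the attestation's own slot: depth 0.
fn head_depth(_slot: u64) -> u64 {
    0
}

// The target vote references the first slot of the current epoch, so its depth
// grows from 0 to 31 as the epoch progresses.
fn target_depth(slot: u64) -> u64 {
    slot % SLOTS_PER_EPOCH
}

fn main() {
    assert_eq!(target_depth(64), 0); // first slot of an epoch: target == head
    assert_eq!(target_depth(95), 31); // last slot of an epoch
    assert_eq!(head_depth(95), 0); // the head is always at the very tip
}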

Drilling into those critical four seconds, these are the tasks Eth2 nodes must complete (a rough sketch of the resulting timing budget follows the list):

  1. Block creation (performed by a single validator, the block producer):
    • Packing the block with attestations, deposits, exits and slashings.
    • Having the block signed by the producer.
    • Computing the state transition.
    • Publishing the block on the P2P network.
  2. Block propagation (generally performed collaboratively by all nodes):
    • Receiving the block from the P2P network.
    • Performing the "anti-spam checks", aimed at filtering out invalid blocks.
    • Forwarding the block to other peers on the P2P network.
  3. Block verification (performed by each node individually):
    • Performing the entire suite of validation checks on the block (as opposed to the basic anti-spam checks).
  4. Block import (performed by each node individually):
    • Writing the block to the database.
    • Updating the internal data structures used for creating attestations (e.g., fork choice).
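
As a rough sketch of that timing budget, using the mainnet values of 12 seconds per slot and an attestation deadline one-third of the way into the slot (the function names here are ours, for illustration only):

// A minimal sketch (not Lighthouse's actual scheduler) of the time budget an
// attester has once a block arrives.
use std::time::Duration;

const SECONDS_PER_SLOT: u64 = 12;

// Attestations are due one-third of the way into the slot (4 seconds on mainnet).
fn attestation_deadline() -> Duration {
    Duration::from_secs(SECONDS_PER_SLOT / 3)
}

// Given how far into the slot the block was received (steps 1 and 2), return
// the time left for verification and import (steps 3 and 4).
fn remaining_budget(block_received_after: Duration) -> Option<Duration> {
    attestation_deadline().checked_sub(block_received_after)
}

fn main() {
    // A block arriving 4.2 seconds into the slot leaves no time at all: the
    // attester will vote without it and may miss the head/target vote.
    assert_eq!(remaining_budget(Duration::from_millis(4_200)), None);
    // A block arriving at 1.5 seconds leaves 2.5 seconds for steps 3 and 4.
    assert_eq!(
        remaining_budget(Duration::from_millis(1_500)),
        Some(Duration::from_millis(2_500))
    );
}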

Since some validators are missing their head/target votes, it's clear that these steps are not always completing within four seconds. Of course, we must determine exactly which steps are the bottleneck. To this end, my analysis shows that there are multiple occasions each day on which a block is received more than four seconds after the start of the slot. This clearly indicates that progressing through steps 1 and 2 is sometimes so slow that it leaves no chance for 3 or 4 to complete in time.

Analysis of Lighthouse nodes shows that blocks are being created in less than a second (generally less than half a second), so this points to block propagation (2) as a major factor. Further inspection of Lighthouse nodes shows that some nodes are taking up to 1/10th of a second to do the anti-spam verification of gossip blocks. Considering that blocks might need to travel through multiple nodes to traverse the network, the anti-spam times can compound.
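
To illustrate how those per-hop delays can compound, here is a back-of-the-envelope sketch; the hop count and network latency figures are assumptions chosen for illustration, not measurements.

// If every hop spends time on anti-spam checks before forwarding a block, the
// total delay grows with the number of hops the block takes across the network.
fn propagation_delay_ms(hops: u32, verification_ms: u32, network_latency_ms: u32) -> u32 {
    hops * (verification_ms + network_latency_ms)
}

fn main() {
    // With 4 hops, ~100 ms of verification per hop and ~50 ms of network
    // latency, well over half a second of the 4-second budget is already spent.
    assert_eq!(propagation_delay_ms(4, 100, 50), 600);
    // Cutting verification to ~10 ms per hop keeps the same path under 250 ms.
    assert_eq!(propagation_delay_ms(4, 10, 50), 240);
}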

After identifying that block propagation times are likely a contributing factor, we developed an optimisation which has successfully reduced our propagation times such that they are consistently less than 1/100th of a second. There are some edge-cases which may result in longer verification times (e.g., deep forks), but these cases are not appearing on mainnet or Pyrmont at this point. As always, we shared our findings and optimisations with the other client teams and inspired Prysm to adopt a similar approach. I understand that Teku and Nimbus already had similar approaches implemented.

This type of optimisation only becomes effective when a majority of the nodes on the network adopt it. I'm looking forward to observing the effect on the network (and on validator rewards) once the Lighthouse and Prysm optimisations gain prominence.

Memory Usage

Lighthouse has always experienced a memory usage pattern where it starts at a low value (~500 MB), slowly creeps to a larger value over time (4 GB or more) and then plateaus. Whilst we have identified that this is not a dangerous memory leak caused by lost pointers, we've never been able to identify which component of Lighthouse is consuming this memory.

After some very interesting discoveries by Sean (Sigma Prime), we've identified that we can eliminate this creep by setting the following environment variable before starting Lighthouse:

export MALLOC_ARENA_MAX=1

In our experiments this has reduced memory consumption from ~3 GB to ~1 GB. This is very exciting, but one must ask: why, and at what cost?

Let us start with the question: what is malloc? The term "malloc" comes from the C malloc() function, which is used to allocate system memory; the name itself is a portmanteau of "memory" and "allocate". The "malloc" term is a little overloaded and without clear definition these days, but when we're using the MALLOC_ARENA_MAX variable, we're using "malloc" to refer to the GNU Malloc library: the behind-the-scenes code used when Lighthouse allocates memory on most Linux systems. So, the "malloc" we're referring to is a program which manages memory allocations for Lighthouse.

Knowing that this variable targets the GNU Malloc library, we can refer to its tunable parameters documentation to understand MALLOC_ARENA_MAX:

This parameter sets the number of arenas to use regardless of the number of cores in the system.

The default value of this tunable is 0, meaning that the limit on the number of arenas is determined by the number of CPU cores online. For 32-bit systems the limit is twice the number of cores online and on 64-bit systems, it is eight times the number of cores online. Note that the default value is not derived from the default value of M_ARENA_TEST and is computed independently.

This parameter can also be set for the process at startup by setting the environment variable MALLOC_ARENA_MAX to the desired value.

So, it seems that setting MALLOC_ARENA_MAX=1 has reduced the number of "arenas" that are created by GNU Malloc. Great, but what is an arena? Let's consult the GNU Malloc Internals documentation:

[An arena is a] structure that is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists.

This definition is not immediately enlightening, but further reading reveals that an arena is an area of memory that GNU Malloc uses to store the variables created by Lighthouse. We also come across the following paragraph:

As pressure from thread collisions increases, additional arenas are created via mmap to relieve the pressure. The number of arenas is capped at eight times the number of CPUs in the system (unless the user specifies otherwise, see mallopt), which means a heavily threaded application will still see some contention, but the trade-off is that there will be less fragmentation.

This paragraph teaches us that GNU Malloc will create more arenas if it detects multiple threads vying for access to a single arena. This reduces memory contention (hopefully making the program faster), but the trade-off is memory fragmentation: inefficient use of memory.
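
To put the default cap in concrete terms, here is the arithmetic from the quoted documentation for a 64-bit system (the core count is just an example):

// Default glibc arena cap on 64-bit systems: eight arenas per online CPU core.
fn default_arena_cap_64bit(online_cores: u32) -> u32 {
    8 * online_cores
}

fn main() {
    // A modest 8-core server may therefore create up to 64 arenas, each of
    // which holds on to heap memory of its own.
    assert_eq!(default_arena_cap_64bit(8), 64);
}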

Putting it all together, when we set MALLOC_ARENA_MAX=1 we're telling GNU Malloc to stop trying to "optimise" our program by consuming more memory to make it run faster. The reduced memory usage is great, but we must ask: do fewer arenas slow down Lighthouse? Presently, our (fairly immature) understanding is that it has no impact on performance. Reducing the arena count seems to have no impact on tree-hashing, block importing or state transition timings, so far. However, the very low-level and far-reaching nature of this change means we need to do more analysis on testnet and mainnet nodes before we can recommend it to users.
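
For completeness, glibc also exposes the same limit programmatically via mallopt(). The sketch below is not something we ship or currently recommend; it assumes a Linux/glibc target and mirrors the M_ARENA_MAX constant from glibc's <malloc.h>.

// A hedged sketch of an in-process alternative to the environment variable.
use std::os::raw::c_int;

// From glibc's <malloc.h>: M_ARENA_MAX is mallopt() parameter -8.
const M_ARENA_MAX: c_int = -8;

extern "C" {
    // int mallopt(int param, int value); returns 1 on success, 0 on failure.
    fn mallopt(param: c_int, value: c_int) -> c_int;
}

fn main() {
    // Call as early as possible, before worker threads are spawned, so that
    // additional arenas are never created in the first place.
    let ok = unsafe { mallopt(M_ARENA_MAX, 1) } == 1;
    if !ok {
        eprintln!("warning: failed to cap glibc malloc arenas");
    }
    // ... start the rest of the application ...
}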

This large reduction in memory usage is highly desirable; lower memory usage results in cheaper servers and more headroom for times of high network contention, as well as for the additional functionality required for the merge and sharding. Whilst we're not ready to include these changes in a release, we're actively experimenting with a more flexible and fragmentation-resistant replacement for GNU Malloc called jemalloc. You can follow our progress at this PR.