Lighthouse Update #19 - Public Testnet

Announcing our public testnet

Announcement

We are now hosting publicly-accessible Lighthouse nodes and provide instructions for running your own beacon node and/or validator client. In other words, we have a public testnet that you can join today. Get started with our documentation

Before you decide to run a node, I kindly ask you to read the rest of this section (feel free to skip the following technical section).

Our testnet has two notable characteristics:

  1. It uses the mainnet specification (with slight modifications to increase inactive validator churn and decrease lag when following the Eth1 chain).
  2. It has over 16,384 validators.

When launching a beacon chain testnet, you can pick and choose your spec (mainnet or minimal) and the number of validators (typically something above 64). If you want to do an accurate simulation the computational load of running in production, you need to choose the mainnet spec and have at least 16,384 validators (i.e. the minimum number of validators required to launch the Beacon Chain mainnet). If you're not concerned with simulating the actual computational load (e.g., you're demonstrating APIs) then you'll likely choose minimal spec and a validator count in the tens or hundreds.

At the time of writing (and as far as we know), this is the first mainnet testnet with 16,384 validators. It has been a huge undertaking to get this testnet running and the Lighthouse team is proud of this achievment.

Choosing the mainnet spec means that the BeaconState object is much larger; some fields grow from a length of 64 to 8,192. Merkle hashing, serialization, database interactions and copying in memory become orders of magnitude more onerous. Additionally, choosing a higher validator count means even more Merkle hashing and more BLS signatures. BLS is a primary bottleneck for block and attestation verification.

Taking these challenges into consideration, we ask that you bear with us whilst we improve our sync times. We're syncing at about 4-8 blocks/sec on a consumer laptop at the moment, but we've seen successive major improvements over the past week as we focus on optimization. To give an idea of how fast we're progressing, less than a week ago we were syncing at less than 0.2 blocks/sec. Although sync is presently slow, we're comfortably running a Lighthouse node on an Amazon t2.medium that's managing 4,096 validators. Performance is reasonable once synced.

Additionally, we have our validators highly concentrated and we're expecting to deploy malicious nodes over the coming weeks. We're going to start trying to crash this testnet and I suspect we'll be successful. If you decide to run a node and contribute to the project by reporting bugs and making suggestions, we'll be very grateful. If we need to reboot the testnet, just reach out if you need more Goerli ETH.

The Technical Part

What's missing?

Although we're seeking to simulate the production beacon chain, we still don't have all features. Here's a list of how we diverge from what we can expect to run in production:

Attestation Aggregation

Presently we are not using the attestation aggregation scheme in the spec. Instead, we are running our validators across a handful of nodes and these nodes are aggregating the attestations (mitigating the need for a distributed aggregation scheme).

Once we implement the attestation aggregation scheme, we can expect to see an increase in network traffic, computational load and code complexity. Expect to see PRs for the aggregation scheme in the coming weeks.

Slashing Protection or Detection

Our validator client is presently without validator slashing protection. Whilst we have an implementation in this PR, we decided not to make it a priority for this testnet. We chose this because it's not expected to have a significant impact on computational load and it's also interesting to see how the network can survive validators casting conflicting votes.

Expect to see slashing protection in the master branch in the next two weeks.

Large Node Counts

Presently we have less than 10 nodes on the network at any given time. On the one hand, this makes problems with syncing much more obvious because there's no one else to fall back on. It also makes the network more unstable, which helps us detect bugs easier. On the other hand, it fails to trigger a whole other set of bugs that arise from large DHTs, noisy networks, etc.

Our intention in the next few weeks is to use some cloud container service (e.g., AWS ECS) to spin up hundreds or thousands of nodes on this network and observe the results.

Optimizing State Transition

When moving over to the mainnet spec with 16k validators, we are primarily concerned with block import times (this involves verifying the block and storing it in the DB). Specifically, we are interested in the time it takes to process a block in two scenarios;

  1. When syncing from some old block (e.g., first boot).
  2. When following the head (i.e., when we've finished syncing).

They are different beasts because (1) involves importing lots of successive blocks really fast and (2) involves processing a block or two in short bursts. For Lighthouse, we can surprisingly import blocks much faster in scenario (2) than in (1). We're using LevelDB and are noticing more than a 10x slowdown in write times when importing multiple blocks. We only identified this as an issue on Sunday and we will work to solve it this week.

When we first ran with mainnet specs, we found the SSZ serialization was far too slow (multiple seconds). This was due to two main factors;

  1. We were decoding all BLS public keys into actual curve co-ordinates
  2. We were storing our tree hash (Merkle root) caches to disk.

We solved (1) by just treating public keys as byte arrays until the point we need to do actual cryptographic operations (@protolambda and I have been talking about this for months). We solved (2) by simply not storing the tree-hash caches to disk any more, our in-memory caches are sufficient for normal operation. The trade-off here is that if someone builds off weird blocks from the past, we'll have to do more work to compute the tree hash (presently about 300ms). We can refine this approach in later releases.

Our "fork choice" times are also quite notable (hundreds of ms). This time not only involves running the fork choice algorithm but also updating the head, caches and persisting the current beacon chain state to disk (e.g., the canonical head, other known heads, the state of fork choice, etc.). We have a clear path forward to reduce these times:

  • Store some ancestors for each block in our reduced-tree fork choice (this means less state reads).
  • Be more granular in when we persist the beacon chain to disk (e.g., only store things that have changed, not everything always).

I'm confident there's still a lot of fat in block processing, I think it's safe to expect another order of magnitude improvement in the coming weeks. Time will tell.

This marks the first official release of Lighthouse (v0.1.0).