How to make a cluster server on an ARM processor and test VPS on AWS Graviton

Started by searchcandy, Aug 05, 2022, 12:53 PM

searchcandy (topic starter)

Creating a Server Cluster on Raspberry/Banana/Orange Pi

Today there are many options for assembling a cluster from single-board computers. You can build one yourself or buy a ready-made case to fill with Raspberry Pi CoM modules.

For personal and research purposes, such clusters have their place. But for the mass market this approach is not acceptable, for the following reasons:

    Low performance of the used ARM processors.
    In most cases, data storage is only on eMMC or microSD.
    A large set of unnecessary peripherals, such as Wi-Fi and Bluetooth radios, HDMI video outputs, etc., which raises the cost of the board and increases power consumption.
    Dense arrangement of modules is not possible.
    A huge number of extra wires for connecting power lines and Ethernet.

To overcome these disadvantages, it is necessary to switch to the Computer-on-Module (CoM) concept.

Computer on Module (CoM)

A Computer on Module (CoM) is a board that integrates the main elements of a data processing system and connects to a carrier board through connectors or soldering. It houses the main processor, RAM, and supporting chips. The module carries the core compute functions, while the carrier board implements device-specific functions, such as connections to external devices through various interfaces.

The convenience of CoM lies in the ability to create your own carrier board design. Designing a carrier board is much easier than designing a CoM. On the carrier board you can place the power supply and the Ethernet switch, thereby reducing the size of the system and eliminating unnecessary wires.

The startup miniNodes develops low-cost ARM servers for small cloud services, websites, and IoT applications. It has developed a micro-server consisting of a carrier board and 5 Raspberry Pi 3 CoM modules. The carrier board contains an integrated gigabit switch that provides connectivity to all 5 modules. On the opposite side of the board is a power supply unit for the modules. Each compute module also has its own power on/off switch. The Raspberry Pi CM3+ modules come with 8GB, 16GB, or 32GB of eMMC onboard, so there is no need for SD cards as on regular Raspberry Pi boards. Fully loaded, there are 5 nodes with 4 cores, 1GB RAM, and up to 32GB eMMC each, for a total of 20 cores, 5GB RAM, and up to 160GB of storage.
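
The totals for a fully loaded board follow directly from the per-node figures. A minimal sketch of that arithmetic, assuming 5 identical nodes with the largest (32GB) eMMC option; the `Node` class and `cluster_totals` function are hypothetical names used only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Per-module resources, as quoted for the Raspberry Pi CM3+."""
    cores: int
    ram_gb: int
    emmc_gb: int


def cluster_totals(nodes):
    """Sum cores, RAM and eMMC storage over all nodes on the carrier board."""
    return {
        "cores": sum(n.cores for n in nodes),
        "ram_gb": sum(n.ram_gb for n in nodes),
        "emmc_gb": sum(n.emmc_gb for n in nodes),
    }


# Assumption: every slot is populated with the 32GB eMMC variant.
nodes = [Node(cores=4, ram_gb=1, emmc_gb=32)] * 5
totals = cluster_totals(nodes)
print(totals)  # {'cores': 20, 'ram_gb': 5, 'emmc_gb': 160}
```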

This approach to building micro-servers on the Raspberry Pi is certainly good, but it is unlikely to be taken seriously by businesses. Any business is primarily interested in how long the purchased system will be supported, so solutions from startups are not an option. In addition, the closed nature of the Broadcom processors used in the Raspberry Pi does not help its popularity in this market segment.

Blade server from Firefly on Rockchip processors

Among the available solutions, processors from Rockchip stand out. Compared to processors from other manufacturers, more Linux drivers are available for them, and their datasheets are publicly available. Firefly develops CoM modules based on Rockchip processors and builds blade servers from them. The company's catalog contains 9 CoM modules of varying performance that can be used to populate a server. The modules connect through a standard SODIMM interface.

The company has developed the Cluster Server R1, a 1U server that can hold up to 11 CoM modules.
The server is designed for running Linux applications, cloud gaming, virtual desktops, and mobile application testing (up to 110 virtual Android phones). It can run Linux and Android.

CoM modules are available for configuration:

    RK3399(AI) Core Board (Core-3399-JD4): two A72 cores + four A53 cores, frequency up to 1.8GHz
    RK3328 Core Board (Core-3328-JD4): Quad-core A53, frequency up to 1.6GHz
    RK1808(AI) Core Board (Core-1808-JD4): dual-core A35, frequency up to 1.7GHz

The front panel of the blade server has 4 Gigabit Ethernet ports, HDMI, two USB 2.0 ports, OTG, and an additional hot-swappable 3.5-inch bay for a SATA hard drive or SSD; a SIM card slot is available for 4G-LTE modules.

Nodes are managed through a BMC (Baseboard Management Controller), which provides power on/off, remote access, status monitoring, and hardware configuration management.

At the beginning of this year, the second version of the cluster server, Cluster Server R2, was introduced. Cluster Server R2 comes in a 2U form factor and includes:

    9 blade nodes (each node contains 8 CoM modules).
    Two 3.5" SATA hard drives/SSDs.
    4 Gigabit Ethernet ports.
    Two USB 3.0 ports, a USB 2.0 port, and an HDMI port.

The cluster server runs Android, Ubuntu, or other Linux distributions. Use cases: "cloud phone", virtual desktop, cloud gaming, cloud storage, blockchain, multi-channel video decoding, etc. The presence of an NPU (Neural Processing Unit) for AI makes the cluster comparable to the SolidRun Janux GS31 Edge AI server, which is designed for real-time inference on many video streams for smart city and infrastructure monitoring, intelligent corporate/industrial video surveillance, object detection, recognition and classification, intelligent visual analysis, etc.

What can be run on an ARM processor?

Running applications with Docker and Kubernetes has long been the de facto standard on Linux. So let's look at the most popular containers that can be run on ARM:

    A tool for managing and monitoring containers through a web interface
    OpenVPN is the most popular free VPN server
    SoftEther VPN is a multi-protocol VPN server with a graphical interface for Windows.
    Databases - all official Docker images are also built for the ARM architecture: PostgreSQL, MariaDB, MongoDB.
    Nginx-proxy - Nginx proxy server, Alpine-based image
    Traefik is a reverse proxy alternative to Nginx
    WordPress is a popular CMS
    Elasticsearch - Java search engine
    Asterisk PBX - open source computer telephony (including VoIP) from Digium
    Zabbix is a system for monitoring and tracking the status of various computer network services, servers and network equipment

It is also worth noting that there is a project which builds containers for the most popular Linux applications, and its builds are prepared for the ARM architecture as well. Popular Linux applications are therefore already available as containers for ARM systems, so you can start testing right away.
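
Before pulling an image onto an ARM node, it helps to check whether the image is actually published for arm64. `docker manifest inspect <image>` prints a JSON manifest list with one entry per platform; the sketch below only parses such a document. The `SAMPLE` data is an illustrative, trimmed-down assumption and not real registry output, and the function names are hypothetical:

```python
import json


def supported_architectures(manifest_list):
    """Return the sorted CPU architectures named in a parsed manifest list."""
    archs = set()
    for entry in manifest_list.get("manifests", []):
        arch = entry.get("platform", {}).get("architecture")
        if arch and arch != "unknown":
            archs.add(arch)
    return sorted(archs)


def runs_on_arm64(manifest_list):
    """True if the image has a 64-bit ARM variant."""
    return "arm64" in supported_architectures(manifest_list)


# Illustrative sample, reduced to the fields used above.
SAMPLE = json.loads("""
{
  "schemaVersion": 2,
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
    {"platform": {"architecture": "arm64", "os": "linux", "variant": "v8"}},
    {"platform": {"architecture": "arm", "os": "linux", "variant": "v7"}}
  ]
}
""")

print(supported_architectures(SAMPLE))  # ['amd64', 'arm', 'arm64']
print(runs_on_arm64(SAMPLE))            # True
```

Note that Docker reports 64-bit ARM as `arm64`, while the Linux kernel and `lscpu` call the same architecture `aarch64`.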

Testing VPS on AWS Graviton2

Amazon provides test instances based on its ARM processor, AWS Graviton2. This is a great opportunity to test software for compatibility with the ARM architecture for free and simply to gain experience operating a system on an ARM processor. The t4g.micro instance is free 24x7 until June 30, 2021. For testing, it is enough to register and deploy an instance with Ubuntu Server 20.04 LTS.
t4g.micro instance configuration:

    2 vCPUs 2.5 GHz
    1 GiB memory
    16GB SSD

The t4g.micro instance can be deployed in various regions. The closest one to us is Europe (Frankfurt) eu-central-1; the average ping from St. Petersburg is 78 ms.

The lscpu command gives the following information:

ubuntu@host:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
BogoMIPS: 243.75
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 4 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0,1
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

root@host:/home/ubuntu# inxi -b
System: Host: host Kernel: 5.4.0-1038-aws aarch64 bits: 64 Console: tty 1
            Distro: Ubuntu 20.04.2 LTS (Focal Fossa)
Machine: Type: Other-vm? System: Amazon EC2 product: t4g.micro v: N/A serial: ec21ba8c-f1b0-3f47-74e0-c648a84383c4
            Mobo: Amazon EC2 model: N/A serial: N/A UEFI: Amazon EC2 v: 1.0 date: 11/1/2018
CPU: Dual Core: Model N/A type: MCP speed: 0
Graphics: Message: No Device data found.
            Display: server: No display server data found. headless machine? tty: 130x42
            Message: Advanced graphics data unavailable for root.
Network: Device-1: Elastic Network Adapter driver: ena
Drives: Local Storage: total: 8.00 GiB used: 2.52 GiB (31.5%)
Info: Processes: 145 Uptime: 2d 23h 46m Memory: 952.5 MiB used: 336.2 MiB (35.3%) Shell: bash inxi: 3.0.38
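
When testing software for ARM compatibility, scripts often need to know which architecture they are running on. A minimal sketch using the standard library's `platform.machine()`, which returns `aarch64` on a 64-bit ARM Linux host such as the one above (and `arm64` on some other systems); the `is_arm64` helper is a hypothetical name:

```python
import platform

# Names under which 64-bit ARM is commonly reported.
ARM64_NAMES = {"aarch64", "arm64"}


def is_arm64(machine=None):
    """Check a machine string (default: the current host) for 64-bit ARM."""
    name = machine if machine is not None else platform.machine()
    return name.lower() in ARM64_NAMES


print(platform.machine(), is_arm64())
```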

CPU test

We test the processor with sysbench:
root@host:/home/ubuntu# sysbench --test=cpu --cpu-max-prime=20000 --num-threads=1 run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with the following options:
Number of threads: 1
Initializing random number generator from current time
Prime number limit: 20000
cpu speed:
    events per second: 1097.02
General statistics:
    total time: 10.0002s
    total number of events: 10972
Latency (ms):
         min: 0.91
         avg: 0.91
         max: 0.95
         95th percentile: 0.92
         sum: 9998.11
thread fairness:
    events (avg/stddev): 10972.0000/0.00
    execution time (avg/stddev): 9.9981/0.00
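
To compare runs across instances it is convenient to pull the key figures out of the sysbench report. A minimal sketch, assuming the sysbench 1.0 text layout shown above; `parse_sysbench_cpu` is a hypothetical helper, and `SAMPLE` is copied from that output:

```python
import re

SAMPLE = """\
cpu speed:
    events per second: 1097.02
General statistics:
    total time: 10.0002s
    total number of events: 10972
"""

PATTERNS = {
    "events_per_second": r"events per second:\s*([\d.]+)",
    "total_time_s": r"total time:\s*([\d.]+)s",
    "total_events": r"total number of events:\s*(\d+)",
}


def parse_sysbench_cpu(text):
    """Extract the headline numbers from a sysbench CPU report."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"field not found: {key}")
        result[key] = float(match.group(1))
    return result


stats = parse_sysbench_cpu(SAMPLE)
# Cross-check: events divided by wall time should be close to the reported rate.
derived = stats["total_events"] / stats["total_time_s"]
print(stats["events_per_second"], round(derived, 2))
```

The derived rate differs slightly from the reported one, presumably because sysbench measures the interval a little differently from the rounded total time it prints.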

RAM test:

root@host:/home/ubuntu# sysbench --test=memory --num-threads=4 --memory-total-size=512MB run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Total operations: 524288 (3814836.42 per second)
512.00 MiB transferred (3725.43 MiB/sec)
General statistics:
    total time: 0.1360s
    total number of events: 524288
Latency (ms):
         min: 0.00
         avg: 0.00
         max: 8.00
         95th percentile: 0.00
         sum: 315.89

Disk test
Let's see what dd shows. The speed falls short of a full-fledged SSD, but is already at the level of a full-fledged SATA drive:

root@home:/home/ubuntu# dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.03664 s, 153 MB/s
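
The figure dd prints can be reproduced from the command's parameters: 16k blocks of 64k each, divided by the elapsed time. A small sketch of that arithmetic; `dd_throughput` is a hypothetical helper, and dd reports decimal megabytes (1 MB = 10^6 bytes):

```python
KIB = 1024


def dd_throughput(block_size, count, seconds):
    """Return (total bytes written, throughput in decimal MB/s)."""
    total_bytes = block_size * count
    return total_bytes, total_bytes / seconds / 1e6


# bs=64k count=16k, elapsed time taken from the dd output above.
total, mb_per_s = dd_throughput(64 * KIB, 16 * KIB, 7.03664)
print(total, round(mb_per_s))  # 1073741824 153
```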

7-zip benchmark results:

root@host:/home/ubuntu# 7za b
7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs LE)

CPU Freq: - - - - - - 512000000 - -

RAM size: 1024 MB, # CPU hardware threads: 2
RAM usage: 441 MB, # Benchmark threads: 2

                       Compression  |                  Decompression
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       6957   167   4054   6768  |       8353   199    358    713
23:        757   172    448    772  |      22509   200    975   1948
24:       7073   185   4118   7606  |      80370   200   3535   7056
25:       6831   185   4227   7800  |      77906   199   3480   6934
----------------------------------  | ------------------------------
Avr:             177   3212   5736  |              199   2087   4163
Tot:             188   2649   4950

Comparing the cost of a t4g.micro instance with x86 VPS

Let's use the pricing calculator to work out how much cheaper ARM is than an x86 VPS on AWS. The calculation is for the Europe (Frankfurt) eu-central-1 region. The closest x86 analogue in terms of specifications is the t2.micro instance.

t2.micro instance configuration:

    1 vCPU
    1 GiB memory
    8GB SSD

Let's say the instance runs around the clock for a year: 365 days * 24 hours = 8760 hours.

The monthly payment for renting an on-demand instance, excluding traffic, will be:

    t4g.micro (ARM): 5.33 USD
    t2.micro (x86): 7.96 USD

It turns out that the ARM server is 33% cheaper than its x86 analogue. Network traffic must be calculated separately; its cost does not depend on the instance architecture. The first gigabyte of outgoing traffic per month is free, and after that volumes up to 11 TB are billed at $0.09 per GB. Incoming traffic is not charged. Assuming an average network load of 150 GB of outgoing traffic per month, the traffic cost will be:

    150 GB * $0.09 = $13.50

As a result, a t4g.micro (ARM) VPS instance with 150 GB of traffic per month will cost about $19 / month ($5.33 + $13.50 = $18.83).
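
The comparison above can be reproduced with a few lines of arithmetic. A sketch using only the prices quoted in this post ($5.33 and $7.96 per month, $0.09 per GB); the function names and the `free_gb` parameter are hypothetical and exist only to show the effect of the free first gigabyte:

```python
def percent_cheaper(price, reference):
    """How much cheaper `price` is than `reference`, in percent."""
    return (reference - price) / reference * 100


def monthly_total(instance_usd, traffic_gb, usd_per_gb=0.09, free_gb=0):
    """Instance rent plus outgoing traffic, with an optional free allowance."""
    billable_gb = max(traffic_gb - free_gb, 0)
    return instance_usd + billable_gb * usd_per_gb


ARM_USD = 5.33  # t4g.micro, per month
X86_USD = 7.96  # t2.micro, per month

saving = percent_cheaper(ARM_USD, X86_USD)
total = monthly_total(ARM_USD, traffic_gb=150)
total_with_free_gb = monthly_total(ARM_USD, traffic_gb=150, free_gb=1)

print(f"ARM is {saving:.0f}% cheaper")
print(f"total: ${total:.2f}, with the free first GB: ${total_with_free_gb:.2f}")
```

Counting the free first gigabyte changes the total by only $0.09, so it does not affect the conclusion.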

If, for comparison, we take a VPS from VDSina for $5 / month (1 core, 30 GB NVMe, 32 TB of traffic), then ARM servers still have a long way to go before they gain a competitive advantage.

The lack of native software for the ARM architecture will hold back a mass transition for a long time to come. But one way or another, a partial transition to ARM servers is already a trend. Three powerful factors are driving it. The first is independence from the major x86 processor vendors: you can choose the solution that suits you best. The second is the possibility of maximum "optimization for yourself": excluding everything unnecessary and adding specialized blocks such as NPUs, FPGAs, etc. The third is the openness and availability of Linux.
If we look at servers for the mass consumer sector, the x86 architecture will dominate this segment for a long time. Most likely we will see the creation of a new market segment for special tasks that play to the advantages of the ARM architecture, such as servers with FPGA blocks or NPUs.


If my memory serves me right, in AWS one ARM vCPU of a virtual machine corresponds to one physical core of the chip (because Graviton has no HT), whereas one x86 vCPU corresponds to half a physical core (a whole one if you are lucky and the neighbor does not load its logical core), because AWS sells logical cores.

Thus, t4g.micro gives you 4 times more cores (2 physical ARM cores) than the more expensive t2.micro (0.5 physical x86 cores).

Perhaps the author was not being disingenuous when he said that he always sells innovations cheaply so as not to create excess margins that attract competitors. He cites Apple as a counterexample: from the very first days it began milking the iPhone hard and thereby attracted many strong competitors with its gigantic margins.