Silicon Perception is building a new type of chip to enable agile autonomous robots.
To achieve the lowest possible latency and highest throughput for real-time streaming AI inference, we make the following hardware architecture choices. First, the model weights are stored entirely on-chip, distributed across a large number of memory instances. Second, we execute the model row by row, using only row buffers for intermediate storage in each layer. By fully distributing both the model parameters and tensor storage on-chip, we eliminate off-chip memory access bottlenecks while reducing overall footprint and power consumption.
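As a rough illustration of the row-streaming idea, the sketch below is a functional PyTorch emulation, not the hardware implementation: it keeps only the last k input rows in a rolling buffer and emits one output row at a time. All names and shapes are illustrative.

import torch
import torch.nn.functional as F

def stream_conv2d_rows(rows, weight, bias=None):
    # Emulate row-streaming Conv2d: consume input rows one at a time,
    # keep only a k-row buffer, and yield output rows as they become valid.
    # rows: iterable of [C_in, W] tensors; weight: [C_out, C_in, k, k].
    k = weight.shape[2]
    pad = k // 2
    buffer = []  # rolling buffer holding the most recent k input rows
    for row in rows:
        buffer.append(row)
        if len(buffer) > k:
            buffer.pop(0)  # discard rows that are no longer needed
        if len(buffer) == k:
            window = torch.stack(buffer, dim=1).unsqueeze(0)                # [1, C_in, k, W]
            out = F.conv2d(window, weight, bias, padding=(0, pad))          # [1, C_out, 1, W]
            yield out[0, :, 0, :]                                           # one output row [C_out, W]

# Example: 3x3 conv over a 16-row, 32-column, 3-channel input, streamed row by row
w = torch.randn(8, 3, 3, 3)
for out_row in stream_conv2d_rows((torch.randn(3, 32) for _ in range(16)), w):
    pass  # each out_row has shape [8, 32]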
To support ultra-high-performance streaming CNN models, we use three levels of parallel execution. First, each Conv2d layer is executed in parallel by dedicated, pipelined hardware. Second, within each pipelined layer, rows are evaluated in 1..32 parallel strips. Third, within each layer and strip, output-channel dot products are computed by 1..512 parallel dedicated FPU MAC units. Together, these three degrees of parallelism enable a >10,000x speedup over serial execution of Conv2d.
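A back-of-the-envelope view of how the three degrees of parallelism compound; the pipelining depth below is an assumed example value, not a published figure.

# Illustrative upper-bound arithmetic only; actual speedup depends on the model.
strips_per_layer = 32    # parallel row strips within a layer (1..32)
macs_per_strip   = 512   # parallel FPU MAC units per strip (1..512)
pipelined_layers = 8     # assumed number of Conv2d layers executing concurrently

parallel_ops = strips_per_layer * macs_per_strip * pipelined_layers
print(parallel_ops)      # 131072 concurrent MACs: a theoretical bound well above 10,000x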
To generate optimal synthesizable Verilog code for a given CNN model, we implemented a compiler and optimizer that accept the input row rate and the Verilog clock rate as parameters. Specifying the maximum row rate guarantees hard real-time execution, and specifying the maximum clock rate makes it straightforward to target the generated Verilog to either FPGA or ASIC silicon platforms.
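The relationship between the two parameters can be illustrated with a rough sizing calculation. All numbers below are assumptions for illustration only; this is not the compiler's actual interface or output.

# Hypothetical sizing of one Conv2d layer, not the actual compiler interface.
clock_hz = 400e6                    # assumed target clock rate
row_rate = 896 * 120                # input rows per second (896 rows/frame at 120 frames/s)
cycles_per_row = clock_hz / row_rate            # clock cycles available per input row

# Per-row work for an example 3x3 layer: W output columns, C_in -> C_out channels
W, C_in, C_out, k = 112, 32, 32, 3
macs_per_row = W * C_out * C_in * k * k

# Parallel MAC units needed to finish each row within its real-time budget
mac_units = -(-macs_per_row // int(cycles_per_row))    # ceiling division
print(mac_units)    # ~278 here; larger layers would split the work across parallel row strips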
As a reference image encoder for general-purpose downstream tasks, we are releasing, under the MIT license, the IE120R single-chip image encoder as a drop-in replacement for a Resnet-18 feature extractor. The IE120R accepts an 896x896 RGB input image and runs at 120 frames/s with <2 ms constant latency. It produces output feature maps with shapes [512,7,7], [256,14,14], [128,28,28], [64,56,56], and [64,112,112], which are compatible with Resnet-18. The model fits in the Agilex AGF 027 device and supports flexible camera and downstream interfaces.
The IE120R model is trained from scratch to replicate a pretrained Resnet-18 CNN backbone, using a pairwise distance loss function and a cosine learning rate schedule. The result is a Resnet-18 clone running at 120 frames/s with <2 ms latency, available as a single-chip solution.
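For intuition, below is a minimal sketch of one possible pairwise-distance distillation loss; the exact formulation, feature selection, and normalization used to train IE120R are not specified here and may differ.

import torch
import torch.nn.functional as F

def pairwise_distance_loss(student_feats, teacher_feats):
    # Match the pairwise distance structure of student and teacher embeddings
    # within a batch, rather than matching the embeddings directly.
    # student_feats, teacher_feats: [B, D] pooled feature vectors.
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    d_s = torch.cdist(s, s)          # [B, B] student pairwise distances
    d_t = torch.cdist(t, t)          # [B, B] teacher pairwise distances
    return F.mse_loss(d_s, d_t)

# Example with illustrative shapes: distill toward a frozen Resnet-18 teacher
student = torch.randn(16, 512, requires_grad=True)
teacher = torch.randn(16, 512)
loss = pairwise_distance_loss(student, teacher)
loss.backward()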
To get started with the pretrained Silicon Perception image encoders, first install the PyTorch and HuggingFace dependencies, then install the Silicon Perception package.
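The dependencies can be installed with pip; the exact package set may vary by environment, but torch, torchinfo, and huggingface_hub cover the examples below.
pip install torch torchinfo huggingface_hub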
pip install siliconperception --upgrade
The pretrained IE120R model can be loaded as follows.
import torch
import torchinfo
from siliconperception.IE120R import IE120R
# Load the pretrained weights from the HuggingFace Hub
encoder = IE120R.from_pretrained("siliconperception/IE120R")
# Print a layer-by-layer summary for a single 896x896 RGB input
torchinfo.summary(encoder, input_size=(1,3,896,896))
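A minimal inference sketch follows; it assumes the encoder returns the five Resnet-18-compatible feature maps listed above (the actual return type of the forward pass may differ).

x = torch.randn(1, 3, 896, 896)     # dummy 896x896 RGB input batch
encoder.eval()
with torch.no_grad():
    features = encoder(x)           # assumed: sequence of multi-scale feature maps
for f in features:
    print(tuple(f.shape))           # e.g. (1, 512, 7, 7), (1, 256, 14, 14), ...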
Properly pretraining a foundation vision backbone is a challenging task that requires the right combination of loss function, optimizer, learning rate schedule, and training data distribution. We encourage community involvement to improve the pretrained weights. The Silicon Perception image encoder models and weights are released under the MIT license.
Today, deep learning inference engines are typically workloads on top of general-purpose SoC platforms such as multicore CPUs or GPUs, using a software stack such as PyTorch or TensorFlow. This approach is ubiquitous and provides extensive flexibility to run any model, at the expense of power, performance, and silicon area.
In embedded applications where the model and input shape are fixed, the flexibility of a software implementation is unnecessary. Silicon Perception has chosen to remove the software stack completely, converting deep models into single-chip hardware components with programmable weights. These components have streaming interfaces and execute in hard real time, with the highest throughput and lowest latency in the industry.
Recently, the HumanPlus project at Stanford [paper] demonstrated that imitation learning with ego-centric vision and pose shadowing is effective. At the same time, Google DeepMind demonstrated a specialized table tennis robot with human-level hand-eye coordination [paper]. To achieve their full commercial potential, agile humanoid robots will need to be able to catch a ball and pass a vision test. We propose to achieve this by combining the shadowing-based imitation learning approach with high-speed, low-latency inference on new silicon.
Silicon Perception is offering a new type of silicon component that functions as a trainable image encoder for low-latency, high-throughput, and power-constrained applications such as autonomous humanoid robots. For real-time robotics applications, pretrained Resnet-18 (~12M weights) and Resnet-50 (~26M weights) are popular image encoder feature extractors. The pretrained Silicon Perception IE120R image encoder contains 13M FP8 weights, is fully compatible with Resnet-18, and achieves accuracy exceeding Resnet-18 on a linear probe classification task.
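For reference, a linear probe freezes the encoder and trains only a linear classifier on its features. The sketch below assumes the coarsest [512,7,7] feature map is global-average-pooled; the actual evaluation protocol is not specified here and may differ.

import torch
import torch.nn as nn
from siliconperception.IE120R import IE120R

encoder = IE120R.from_pretrained("siliconperception/IE120R")
for p in encoder.parameters():
    p.requires_grad = False          # freeze the pretrained backbone

probe = nn.Linear(512, 1000)         # linear classifier on pooled features
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_logits(images):
    with torch.no_grad():
        feats = encoder(images)      # assumed: list of feature maps, coarsest first
    pooled = feats[0].mean(dim=(2, 3))   # global average pool [B,512,7,7] -> [B,512]
    return probe(pooled)

# Training step sketch, with images [B,3,896,896] and labels [B]:
#   loss = nn.functional.cross_entropy(probe_logits(images), labels)
#   loss.backward(); opt.step(); opt.zero_grad()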
The emerging market for autonomous humanoid robots is perhaps the largest TAM in history. At Silicon Perception, we believe that humanoid robots will primarily consist of differentiable components which perform the image and signal encoder, motion encoder, pose decoder, and language model functions. The perception encoders and pose decoder will run at high speed and low latency, while the multimodal language model will condition the pose decoder at a reduced rate. This approach can also incorporate perception embeddings for tactile, inertial, proprioceptive, and audio sensors. If you are a company or lab working on humanoid robots and would like to collaborate on a prototype, please contact us.
896x896 RGB image input
7x7x512 feature map output
Resnet-18 drop-in replacement
120 frames/s, <2 ms latency, hard real time
Single chip solution, Agilex AGF 027
We are currently seeking investment for lab space and equipment