Silicon Perception is building a new type of chip to enable agile autonomous robots.
To achieve the lowest latency and highest throughput for real-time streaming AI inference, we make the following hardware architecture choices. First, the model weights are stored entirely on-chip, distributed across a large number of memory instances. Second, we execute the model row by row, with each layer keeping only small row buffers as intermediate storage. By fully distributing both the model parameters and the tensor storage on-chip, we eliminate off-chip memory accesses and the bandwidth bottlenecks they create, while reducing overall footprint and power consumption.
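A minimal Python sketch of the row-buffer idea, assuming a 3x3, stride-1 convolution with 'valid' padding: only the last three input rows of a layer need to be buffered to emit one output row, so no layer ever stores a full feature map. The names and NumPy formulation are illustrative, not the hardware implementation.

```python
import numpy as np

def conv2d_row_streamed(rows_in, weights, bias):
    """Row-streamed 3x3 convolution (stride 1, 'valid' padding), illustrative only.

    rows_in : iterable of input rows, each an array of shape [W, C_in]
    weights : array of shape [3, 3, C_in, C_out]
    Only the last 3 input rows are buffered; no full feature map is ever stored.
    """
    kh, kw, c_in, c_out = weights.shape
    row_buf = []                                  # sliding buffer of the last kh rows
    for row in rows_in:
        row_buf.append(row)
        if len(row_buf) < kh:
            continue                              # not enough rows yet for an output row
        window = np.stack(row_buf)                # [kh, W, C_in]
        width = window.shape[1]
        out_row = np.empty((width - kw + 1, c_out))
        for x in range(width - kw + 1):
            patch = window[:, x:x + kw, :]        # [kh, kw, C_in]
            out_row[x] = np.tensordot(patch, weights, axes=3) + bias
        yield out_row                             # streamed to the next layer's row buffer
        row_buf.pop(0)                            # slide the buffer down by one row

# Example: stream a 16x16x8 image through one layer without materializing it in full.
rows = (np.random.rand(16, 8) for _ in range(16))
w, b = np.random.rand(3, 3, 8, 4), np.zeros(4)
for out_row in conv2d_row_streamed(rows, w, b):
    assert out_row.shape == (14, 4)
```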
To support ultra-high-performance streaming CNN models, we use three forms of parallel execution. First, every Conv2d layer executes concurrently on dedicated, pipelined hardware. Second, within each pipelined layer, rows are evaluated in 1..32 parallel strips. Third, within each layer and strip, the output-channel dot products are computed by 1..512 parallel dedicated FPU MAC units. Together, these three degrees of parallelism enable a >10,000x speedup over serial execution of a Conv2d model.
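As a rough illustration of how the three degrees multiply, the sketch below combines assumed per-layer figures; none of these numbers are published device parameters.

```python
# Back-of-the-envelope only; the figures below are illustrative assumptions,
# not published parameters of any Silicon Perception device.
num_pipelined_layers = 20      # every Conv2d layer runs concurrently in the pipeline
row_strips_per_layer = 16      # parallel row strips (1..32 supported)
macs_per_strip = 64            # parallel FPU MAC units per strip (1..512 supported)

parallel_factor = num_pipelined_layers * row_strips_per_layer * macs_per_strip
print(f"combined parallelism ~ {parallel_factor:,}x over serial execution")  # ~20,480x
```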
To generate optimal synthesizable Verilog for a given CNN model, we implemented a compiler and optimizer that accept the input row rate and the clock rate as parameters and emit Verilog that exactly meets the performance requirements. The maximum row rate guarantees hard real-time execution, and the maximum clock rate makes it straightforward to target the same model to either FPGA or ASIC silicon platforms.
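To make the parameterization concrete, here is a hedged sketch of the sizing arithmetic such a compiler might perform; the function name, formula, and layer dimensions are our assumptions, not the actual tool.

```python
import math

def required_macs_per_layer(clock_hz, row_rate_hz, out_width, c_in, c_out, kh=3, kw=3):
    """Sizing math a compiler of this kind might perform (illustrative formula)."""
    cycles_per_row = clock_hz / row_rate_hz            # cycle budget to emit one output row
    macs_per_row = out_width * c_out * kh * kw * c_in  # MACs needed for one output row
    return math.ceil(macs_per_row / cycles_per_row)

# Illustrative numbers: 250 MHz clock, 896 rows x 100 frames/s = 89,600 rows/s,
# and a hypothetical 3x3 layer with 16 input and 32 output channels, 224-wide output rows.
print(required_macs_per_layer(250e6, 896 * 100, out_width=224, c_in=16, c_out=32))
# -> about 370 parallel MAC units for this layer
```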
We are releasing the IE120R single-chip image encoder as a drop-in replacement for a ResNet-18 feature extractor. The IE120R accepts an 896x896 RGB input image and runs at 100 frames/s with 11.5 ms constant latency, using a power-efficient 250 MHz clock. It produces an output feature map of shape [7,7,512], compatible with ResNet-18. The model fits in the Agilex 027 device and supports flexible camera and downstream interfaces.
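To illustrate the drop-in claim, the snippet below (ours, not vendor code) shows that a standard torchvision ResNet-18 backbone produces the same 7x7x512 feature map from a 224x224 input, so a downstream head written against ResNet-18 features can consume the IE120R's output stream unchanged.

```python
import torch
from torchvision.models import resnet18

# Strip avgpool and fc to keep only the convolutional backbone.
backbone = torch.nn.Sequential(*list(resnet18(weights=None).children())[:-2])
feats = backbone(torch.randn(1, 3, 224, 224))
print(feats.shape)   # torch.Size([1, 512, 7, 7]) in NCHW; the IE120R streams [7, 7, 512]
```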
The IE120R model is trained from scratch to replicate a pretrained ResNet-18 CNN backbone using a pairwise distance loss function. It is effectively a high-resolution ResNet-18 clone, running at 100 frames/s with 11.5 ms latency, available as a single-chip solution.
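One way such a pairwise distance objective could look in PyTorch is sketched below; the exact loss used to train the IE120R is not specified here, and the teacher/student names are placeholders.

```python
import torch
import torch.nn.functional as F

def pairwise_distance_loss(student_feats, teacher_feats):
    """Match the student's within-batch feature distances to the teacher's (one common form)."""
    s = student_feats.flatten(1)           # [B, D]
    t = teacher_feats.flatten(1)
    return F.mse_loss(torch.cdist(s, s), torch.cdist(t, t))

# Training-step sketch (names assumed): teacher = frozen pretrained ResNet-18 on 224x224
# crops, student = the IE120R model on the corresponding 896x896 frames.
#   loss = pairwise_distance_loss(student(hi_res_batch), teacher(low_res_batch))
#   loss.backward(); optimizer.step()
```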
As a companion device, we are also releasing the DX120P single-chip pose decoder, which maps a dense feature trajectory to a low-dimensional pose vector. The decoder runs at 100 frames/s with 8.5 ms fixed latency. It is a 14-layer CNN with 8.4M weights and also fits in the Agilex 027. Together with a 2 Hz LLM, the IE120R and DX120P form a concept for an agile autonomous robot.
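A hedged, system-level sketch of the loop this implies, with stand-in functions for the IE120R, DX120P, and the 2 Hz LLM (none of the names below are real APIs):

```python
import collections
import numpy as np

FRAME_HZ, LLM_HZ = 100, 2
TRAJ_LEN = 16                              # frames kept in the feature trajectory (assumed)

def encode(frame):            return np.zeros((7, 7, 512))  # stand-in for the IE120R
def decode_pose(traj, cond):  return np.zeros(32)           # stand-in for the DX120P
def llm_update(traj):         return np.zeros(256)          # stand-in for the 2 Hz LLM

def control_loop(frames):
    feature_traj = collections.deque(maxlen=TRAJ_LEN)
    conditioning = None
    for i, frame in enumerate(frames):
        feature_traj.append(encode(frame))                  # 100 Hz, 11.5 ms latency
        if i % (FRAME_HZ // LLM_HZ) == 0:
            conditioning = llm_update(feature_traj)         # slow path, every 50th frame
        yield decode_pose(feature_traj, conditioning)       # 100 Hz, 8.5 ms latency

# Drive the loop with dummy frames (shapes reduced for the example).
for pose in control_loop(np.zeros((200, 8, 8, 3))):
    pass
```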
Additional documentation is available here.
Properly pretraining a foundation vision backbone or pose decoder is a challenging task which requires the right combination of loss function, optimizer and learning rate schedule, and training data. We encourage community involvement to improve the pretrained weights. The Silicon Perception IE120R image encoder model, DX120P pose decoder model, pretrained weights, and corresponding Verilog code are released under MIT license.
Today, deep learning inference typically runs as a software workload on general-purpose platforms such as multicore CPUs or GPUs, using a stack such as PyTorch or TensorFlow. This approach is ubiquitous and provides the flexibility to run any model, at the expense of power, performance, and silicon area.
In embedded applications where the model and input shape are fixed, the flexibility of a software implementation is unnecessary. Silicon Perception has chosen to remove the software stack completely, converting deep models into single chip hardware components with programmable weights. These components have streaming interfaces and execute in hard real time, with the highest throughput and lowest latency in the industry.
Recently, the HumanPlus project at Stanford [paper] demonstrated that imitation learning with ego-centric vision and pose shadowing is effective. At the same time, Google DeepMind demonstrated a specialized table tennis robot with human level hand-eye coordination [paper]. In order to achieve their full commercial potential, agile humanoid robots will need to be able to catch a ball and pass a vision test. We propose to achieve this by following the shadow learning approach with high-speed low-latency inference, using new silicon.
Silicon Perception is offering a new type of silicon component that functions as a trainable image encoder for low-latency, high-throughput, power-constrained applications such as autonomous humanoid robots. For real-time robotics applications, pretrained ResNet-18 (~12M weights) and ResNet-50 (~26M weights) are popular image encoder feature extractors. The pretrained Silicon Perception IE120R image encoder contains 10M bfloat18 weights, is compatible with ResNet-18, and achieves accuracy comparable to ResNet-18 on a linear probe classification task.
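For context, a linear probe evaluation of the kind referred to above typically freezes the encoder and trains only a linear classifier on pooled features; a minimal PyTorch sketch, with the dataset, class count, and hyperparameters all assumed:

```python
import torch

num_classes = 1000                                   # assumed (e.g. ImageNet-1k)
probe = torch.nn.Linear(512, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(frozen_features, labels):
    """frozen_features: [B, 7, 7, 512] encoder output (no gradients), labels: [B]."""
    pooled = frozen_features.mean(dim=(1, 2))        # global average pool -> [B, 512]
    loss = torch.nn.functional.cross_entropy(probe(pooled), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```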
The emerging market for autonomous humanoid robots is perhaps the largest TAM in history. At Silicon Perception, we believe that humanoid robots will primarily consist of differentiable components performing image encoder, pose decoder, and language model functions. The image encoder and pose decoder run at high speed and low latency, while a multimodal language model conditions the pose decoder at a reduced rate. This approach can also incorporate perception embeddings for tactile, inertial, proprioceptive, and audio sensors through the use of spectrograms. If you are a company or lab working on autonomous robots and would like to collaborate on a prototype, please contact us.
We are currently seeking partners for lab space and equipment.