Audio-visual systems enable applications such as teleconferencing and presentation systems. Techniques like keyword spotting and speech transcription help computers react to vocal commands, and computer vision enables computers to infer meaning from imagery. Combining these two types of sensor data enables a wide range of multimedia applications.
This demo shows an audio-visual system in which audio is analyzed for specific keywords. These keywords control which area of the camera view is in focus, so that specific regions can be zoomed in on to better show a speaker. The keyword spotting uses a neural network running on the Arm CPUs. In tandem, a deep learning model runs on images from the camera to recognize where people's faces are, using the C7xMMA deep learning accelerator. The detected faces are cropped out and displayed in a dedicated section of the screen.
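The sketch below illustrates the general flow described above: a keyword-spotting model selects a crop region, and a face-detection model finds faces in the (possibly cropped) camera frame. This is an illustrative sketch only, not the demo's actual source code; the model files, label list, keyword-to-region mapping, and audio capture are hypothetical placeholders, and real use on AM62A would offload the vision model to the C7xMMA via TI's TIDL tooling.

```python
# Illustrative sketch only -- not the demo's actual implementation.
# Model paths, labels, and region mapping are hypothetical.
import numpy as np
import cv2
import tflite_runtime.interpreter as tflite

# Keyword spotting runs on the Arm CPUs; the face detector could be
# offloaded to the C7xMMA accelerator (delegate setup omitted here).
kws = tflite.Interpreter(model_path="kws_model.tflite")        # hypothetical model
kws.allocate_tensors()
face_det = tflite.Interpreter(model_path="face_det.tflite")    # hypothetical model
face_det.allocate_tensors()

# Keyword -> (x, y, w, h) crop region, assuming a 1920x1080 camera frame.
KEYWORD_TO_REGION = {
    "left":  (0,   0, 960, 1080),
    "right": (960, 0, 960, 1080),
}

def spot_keyword(audio_window: np.ndarray) -> str:
    """Run the keyword-spotting network on one window of audio samples."""
    inp = kws.get_input_details()[0]
    kws.set_tensor(inp["index"], audio_window.astype(np.float32)[np.newaxis, :])
    kws.invoke()
    scores = kws.get_tensor(kws.get_output_details()[0]["index"])[0]
    labels = ["silence", "left", "right"]          # hypothetical label set
    return labels[int(np.argmax(scores))]

def detect_faces(frame: np.ndarray) -> np.ndarray:
    """Return raw face-detector output for one camera frame."""
    inp = face_det.get_input_details()[0]
    h, w = inp["shape"][1], inp["shape"][2]
    resized = cv2.resize(frame, (w, h)).astype(np.float32)[np.newaxis, ...]
    face_det.set_tensor(inp["index"], resized)
    face_det.invoke()
    return face_det.get_tensor(face_det.get_output_details()[0]["index"])[0]

# Main loop: zoom to the region named by the last keyword and run face detection.
cap = cv2.VideoCapture(0)
region = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Placeholder audio window; a real pipeline would capture from a microphone.
    keyword = spot_keyword(np.zeros(16000, dtype=np.float32))
    region = KEYWORD_TO_REGION.get(keyword, region)
    if region:
        x, y, w, h = region
        frame = frame[y:y + h, x:x + w]
    _ = detect_faces(frame)   # boxes would drive the face panel on the display
    cv2.imshow("audio-visual demo", frame)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
cap.release()
```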
Source code is available on the Texas Instruments GitHub under the edgeai-demo-audio-visual repository.
The following resources are useful for reproducing the demo. These steps were validated on the 9.0 Edge AI Linux SDK.
| Purpose | Link |
|---|---|
| Edge AI Studio: Model Analyzer and Model Composer | https://dev.ti.com/edgeaistudio/ |
| Top-level GitHub page for Edge AI | https://github.com/TexasInstruments/edgeai |
| AM62A Datasheet (superset device) | https://www.ti.com/product/AM62A7 |
| AM62A Academy (basic Linux training/bringup) | https://dev.ti.com/tirex/explore/node?node=A__AB.GCF6kV.FoXARl2aj.wg__AM62A-ACADEMY__WeZ9SsL__LATEST |
| Support Forums (see Processors -> AM62A7) | https://e2e.ti.com |