KPU benchmark



  • I want to measure DNN performance on the KPU. The SDK lacks a benchmark tool/config for an end-to-end network, and it would also be useful to measure the hardware execution time of each layer in the DNN.

    How can I implement such a tool/config? Does anyone have an idea?

    SDK info: kendryte-standalone-sdk, branch: feature/kpu
    demo: kendryte-standalone-demo/face_detect, branch: master



  • I use a 3x3 kernel, input size 3x320x240, output 12x320x240, with batchnorm and leakyrelu included, and I got about 130 GMAC/s. Does the add cost half the time? Some demo boards advertise 230 GMUL/s.
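
    For reference, the MAC count behind that number works out as follows; the per-layer time in this sketch is illustrative, not a measurement:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* one 3x3 conv layer: H * W * Cout * Cin * K * K multiply-accumulates */
        uint64_t macs = 320ULL * 240 * 12 * 3 * 3 * 3; /* 24,883,200 MACs */
        double layer_us = 191.0; /* illustrative per-layer time in microseconds */
        /* GMAC/s = MACs / (time in seconds) / 1e9 = MACs / (time in us * 1e3) */
        printf("%llu MACs -> %.1f GMAC/s\n",
               (unsigned long long)macs, (double)macs / (layer_us * 1e3));
        return 0;
    }

    If the advertised 230 GMUL/s counts only multiplies and the hardware spends comparable time on the accumulate, landing near half of that figure would be expected; that reading is a guess, not a confirmed spec.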



  • As I thought, the calculation and the parameter loading are streamed, so running a single layer on its own is slower than when it runs inside the stream. Would I get a more accurate result by running the layer 100 times and sending out only a small result set?
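
    A minimal sketch of that averaging idea, reusing kpu_run() and g_ai_done_flag from the face_detect demo (whether the same kpu_task_t can be re-run without re-initialization depends on the SDK version, so treat that as an assumption):

    #define N_RUNS 100

    uint64_t start = read_cycle();
    for (int i = 0; i < N_RUNS; i++)
    {
        kpu_run(&task);         /* stream all layers through the KPU */
        while (!g_ai_done_flag)
            ;                   /* wait for the AI-done flag set in the ISR */
        g_ai_done_flag = 0;
    }
    uint64_t avg = (read_cycle() - start) / N_RUNS;
    printf("average: %lu cycles per inference\n", (unsigned long)avg);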



  • @manageryzy Thanks! So you think the method of counting cycles layer by layer is correct, but the KPU may take less than 1 ms per layer, especially for the pooling layer used by the kpu demo, so the millisecond results are not displayed correctly. Am I right?


  • Staff

    @Yu-Jiageng You can't write C code like that. That code will not work if the KPU takes less than 1 ms.
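
    For example (the clock frequency here is illustrative): at a 400 MHz CPU clock, freq/1000 is 400,000 cycles per millisecond, so any layer shorter than that truncates to 0:

    uint64_t freq = 400000000;     /* e.g. 400 MHz CPU clock */
    uint64_t layer_cycle = 120000; /* a 0.3 ms layer */
    /* integer division: 120000 / 400000 == 0, so "0 ms" is printed */
    printf("%lu ms\n", (unsigned long)(layer_cycle / (freq / 1000)));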



  • help!

    In order to measure the hardware execution time of each layer, my idea is to set LAYER_BURST_SIZE = 1, so that an IRQN_AI_INTERRUPT is triggered when each layer's calculation is done. Then I can count the cycles between two consecutive IRQN_AI_INTERRUPT events.

    But it seems to give the wrong results:

    layer[0] execute time = 0 ms
    layer[1] execute time = 0 ms
    layer[2] execute time = 0 ms
    layer[3] execute time = 0 ms
    layer[4] execute time = 0 ms
    layer[5] execute time = 0 ms
    layer[6] execute time = 0 ms
    layer[7] execute time = 0 ms
    layer[8] execute time = 0 ms
    layer[9] execute time = 0 ms
    layer[10] execute time = 0 ms
    layer[11] execute time = 0 ms
    layer[12] execute time = 0 ms
    layer[13] execute time = 2 ms
    layer[14] execute time = 0 ms
    layer[15] execute time = 2 ms
    

    I don't know why. I have already set kpu->fifo_threshold.data.fifo_full_threshold = 1.

    Could someone help me point out what's wrong with my approach? Thanks.
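
    Given the millisecond truncation discussed above, one fix (a sketch, not SDK code) is to print microseconds, or even raw cycles:

    /* microsecond resolution avoids truncating sub-ms layers to 0 */
    uint64_t us = layer_cycle /
                  (sysctl_clock_get_freq(SYSCTL_CLOCK_CPU) / 1000000);
    printf("layer execute time = %lu us\n", (unsigned long)us);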





  • Thanks for the message. Can it be achieved like this?

    sdk: kendryte-standalone-sdk develop
    demo: kendryte-standalone-demo/kpu v0.5.3rc

    network benchmark:

    diff --git a/kpu/main.c b/kpu/main.c
    index 8119023..438e4f1 100644
    --- a/kpu/main.c
    +++ b/kpu/main.c
    @@ -27,6 +27,8 @@
     
     #define CLASS_NUMBER 20
     
    +#define AI_NETWORK_BENCHMARK 1
    +
     kpu_task_t task;
     static region_layer_t detect_rl;
     
    @@ -319,9 +321,16 @@ int main(void)
                 ;
     
             /* start to calculate */
    +#if AI_NETWORK_BENCHMARK
    +        uint64_t cycle = read_cycle();
    +#endif // AI_NETWORK_BENCHMARK
             kpu_run(&task);
             while(!g_ai_done_flag);
             g_ai_done_flag = 0;
    +#if AI_NETWORK_BENCHMARK
    +        cycle = read_cycle() - cycle;
    +        printf("\nKPU network execute time = %lu ms\n", cycle/(sysctl_clock_get_freq(SYSCTL_CLOCK_CPU)/1000));
    +#endif // AI_NETWORK_BENCHMARK
     
             /* start region layer */
             region_layer_run(&detect_rl, NULL);
    

    layer-by-layer benchmark:

    diff --git a/lib/drivers/kpu.c b/lib/drivers/kpu.c
    index 3338b42..f71754a 100644
    --- a/lib/drivers/kpu.c
    +++ b/lib/drivers/kpu.c
    @@ -9,7 +9,14 @@
     #include <string.h>
     #include "bsp.h"
     
    +#define AI_LAYER_BENCHMARK 1
    +
    +#if AI_LAYER_BENCHMARK
    +#define LAYER_BURST_SIZE 1
    +static uint64_t layer_cycle;
    +#else
     #define LAYER_BURST_SIZE 12
    +#endif // AI_LAYER_BENCHMARK
     
     volatile kpu_config_t *const kpu = (volatile kpu_config_t *)AI_BASE_ADDR;
     static volatile uint32_t kpu_status;
    @@ -27,6 +34,23 @@ static int kpu_config_input(void *ctx)
         kpu_task_t *task = (kpu_task_t *)ctx;
         kpu->interrupt_clear.reg = 7;
     
    +#if AI_LAYER_BENCHMARK
    +    if (task->remain_layers_length < task->layers_length)
    +    {
    +        layer_cycle = read_cycle() - layer_cycle;
    +        printf("layer[%u] execute time = %lu ms\n",
    +                task->layers_length - task->remain_layers_length - 1,
    +                layer_cycle/(sysctl_clock_get_freq(SYSCTL_CLOCK_CPU)/1000));
    +        if (task->remain_layers_length == 0)
    +        {
    +            kpu->interrupt_mask.reg = 7;
    +            return 0;
    +        }
    +    }
    +    // layer_cycle will be updated at the next AI interrupt
    +    layer_cycle = read_cycle();
    +#endif // AI_LAYER_BENCHMARK
    +
         if (task->remain_layers_length <= LAYER_BURST_SIZE)
         {
             for (uint32_t i = 0; i < task->remain_layers_length; i++)
    @@ -45,7 +69,9 @@ static int kpu_config_input(void *ctx)
                 kpu->layer_argument_fifo = task->remain_layers[i].dma_parameter.reg;
             }
             task->remain_layers_length = 0;
    +#if !AI_LAYER_BENCHMARK
             kpu->interrupt_mask.reg = 7;
    +#endif // AI_LAYER_BENCHMARK
         }
         else
         {
    
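    One refinement worth considering on top of this diff (a sketch, not SDK code): the printf inside the interrupt callback adds latency that leaks into the next layer's measurement, so it is cleaner to record raw timestamps in the callback and print the deltas after the run finishes. MAX_LAYERS is an assumed bound, not an SDK constant.

    #define MAX_LAYERS 64
    static uint64_t layer_stamp[MAX_LAYERS + 1];
    static uint32_t stamp_count;

    /* in the interrupt callback, instead of printf: */
    if (stamp_count <= MAX_LAYERS)
        layer_stamp[stamp_count++] = read_cycle();

    /* after the network finishes: */
    for (uint32_t i = 1; i < stamp_count; i++)
        printf("layer[%u] = %lu cycles\n", i - 1,
               (unsigned long)(layer_stamp[i] - layer_stamp[i - 1]));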

  • Staff | Mod

    We are making a K210 USB stick to benchmark, test, and debug the KPU.
    If you want to measure the time consumption of each layer, you can enable the KPU's interrupt for each layer and count the cycles between two consecutive KPU interrupts.
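
    A minimal sketch of that approach, assuming LAYER_BURST_SIZE is 1 so the AI interrupt fires once per layer; on_layer_done is a hypothetical callback name, while read_cycle() and sysctl_clock_get_freq() are the SDK calls used earlier in this thread:

    /* hypothetical per-layer callback: cycle delta between consecutive
     * AI interrupts, reported in microseconds */
    static uint64_t last_cycle;
    static uint32_t layer_index;

    static void on_layer_done(void)
    {
        uint64_t now = read_cycle();
        uint64_t us = (now - last_cycle) /
                      (sysctl_clock_get_freq(SYSCTL_CLOCK_CPU) / 1000000);
        printf("layer[%u]: %lu us\n", layer_index++, (unsigned long)us);
        last_cycle = now; /* start of the next layer's window */
    }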