Notes on Linux DRM GPU scheduler

yaongtime 2021-01-20 22:32:14
notes linux drm gpu scheduler

Kernel documentation:
The GPU scheduler provides entities which allow userspace to push jobs into software queues which are then scheduled on a hardware run queue. The software queues have a priority among them. The scheduler selects the entities from the run queue using a FIFO. The scheduler provides dependency handling features among jobs. The driver is supposed to provide callback functions for backend operations to the scheduler like submitting a job to hardware run queue, returning the dependencies of a job etc.
The organisation of the scheduler is the following:
1. Each hw run queue has one scheduler
2. Each scheduler has multiple run queues with different priorities (e.g., HIGH_HW, HIGH_SW, KERNEL, NORMAL)
3. Each scheduler run queue has a queue of entities to schedule
4. Entities themselves maintain a queue of jobs that will be scheduled on the hardware.
The jobs in an entity are always scheduled in the order in which they were pushed.
Principle overview:
As everyone knows, modern GPUs expose a command-stream interface to the CPU. These command streams control the GPU hardware, issue shader programs, and deliver the state values required by OpenGL or Vulkan.
The GPU scheduler in Linux schedules GPU command streams; this code was factored out of the AMD GPU driver into common code.
As the kernel documentation describes, the GPU scheduler provides entities to user programs. User programs submit jobs to these entities; the jobs are first added to a software queue and then dispatched by the scheduler onto the hardware (GPU).
Each command-stream channel on the GPU corresponds to one GPU scheduler.
The GPU scheduler combines two scheduling policies: scheduling by priority, and first-queued-first-scheduled (FIFO) within the same priority. The GPU scheduler submits jobs to different hardware through callback functions.
Before jobs are committed to hardware, the GPU scheduler provides dependency checking: only when all of a job's dependencies are available will the job be submitted to the hardware.
Again: each command-stream channel on the GPU corresponds to one GPU scheduler, and one GPU scheduler has multiple run queues representing different priorities.
When a new job needs to be submitted to the GPU, it is first submitted to an entity. A load-balancing algorithm then determines which GPU scheduler the entity will be dispatched to; the entity is added to the selected scheduler's run-queue list, where it waits to be scheduled.
Basic usage:
1. Before the GPU scheduler can work, it must be initialized and given callback functions for the hardware-specific operations. One command-stream channel on the GPU corresponds to one scheduler.
2. The scheduler contains software run queues corresponding to different priorities; higher-priority run queues are scheduled first.
3. A new job is first submitted to an entity, and the entity is then added to the queue of the scheduler's run queue. Within the same priority, jobs and entities are scheduled in FIFO order.
4. When scheduling begins, the scheduler first selects the earliest-queued entity from the highest-priority run queue, and then selects the earliest-queued job from that entity.
5. Before a job can be submitted to the GPU hardware, its dependencies must be checked, e.g. whether the framebuffer about to be rendered to is available (dependency checking is also implemented through callbacks).
6. The scheduler submits the selected job to the GPU's hardware run queue using the callback provided at initialization.
7. After the GPU finishes a job, it notifies the GPU scheduler through a dma_fence callback, and the finish fence is signaled.
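The selection logic in steps 2-4 can be sketched in a few lines of user-space C. This is a toy model with invented types and names, not the kernel API: the scheduler scans run queues from the highest priority down, takes the earliest-queued entity that still has jobs, and pops that entity's oldest job.

```c
#include <assert.h>

#define NUM_PRIOS 4   /* index 0 = lowest, NUM_PRIOS-1 = highest priority */
#define MAX_ITEMS 8

/* An entity owns a FIFO of job ids. */
struct toy_entity {
	int jobs[MAX_ITEMS];
	int head, tail;
};

/* A scheduler owns one FIFO of entities per priority level. */
struct toy_sched {
	struct toy_entity *entities[NUM_PRIOS][MAX_ITEMS];
	int count[NUM_PRIOS];
};

static void entity_push_job(struct toy_entity *e, int job_id)
{
	e->jobs[e->tail++] = job_id;
}

static void sched_add_entity(struct toy_sched *s, int prio, struct toy_entity *e)
{
	s->entities[prio][s->count[prio]++] = e;
}

/* Pick the next job: highest-priority run queue first, then the
 * earliest-queued entity in it that still has jobs, then that
 * entity's oldest job. Returns -1 if nothing is runnable. */
static int sched_pick_job(struct toy_sched *s)
{
	for (int prio = NUM_PRIOS - 1; prio >= 0; prio--) {
		for (int i = 0; i < s->count[prio]; i++) {
			struct toy_entity *e = s->entities[prio][i];
			if (e->head < e->tail)
				return e->jobs[e->head++];
		}
	}
	return -1;
}
```

With a low-priority entity holding jobs 1 and 2 and a high-priority entity holding job 3, the pick order is 3, 1, 2: priority wins first, FIFO order within a priority.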
1. Register the scheduler
struct drm_gpu_scheduler;
int drm_sched_init(struct drm_gpu_scheduler *sched, //sched: scheduler instance
const struct drm_sched_backend_ops *ops, //ops: backend operations for this scheduler
unsigned hw_submission, //hw_submission: number of hw submissions that can be in flight
unsigned hang_limit, //hang_limit: number of times to allow a job to hang before dropping it
long timeout, //timeout: timeout value in jiffies for the scheduler
const char *name) //name: name used for debugging
The function drm_sched_init() initializes a struct drm_gpu_scheduler *sched.
Once initialization is done, a kernel thread is started that implements the main scheduling logic.
The kernel thread waits until a new job is submitted, then wakes up and schedules it.
hw_submission: specifies the number of commands that a single command channel on the GPU can have in flight simultaneously.
So drm_sched_init() initializes one GPU scheduler, and the parameter const struct drm_sched_backend_ops *ops provides the platform-specific callback interface.
struct drm_sched_backend_ops {
	struct dma_fence *(*dependency)(struct drm_sched_job *sched_job,
					struct drm_sched_entity *s_entity);
	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
	void (*timedout_job)(struct drm_sched_job *sched_job);
	void (*free_job)(struct drm_sched_job *sched_job);
};
dependency: called when a job is selected as the next to be scheduled. If the job has a dependency, return a pointer to its dma_fence; the GPU scheduler adds a wake-up action to that fence's callback list, so that once the fence is signaled the scheduler is woken up again. If there are no dependencies, return NULL.
run_job: called once all of a job's dependencies have become available. This interface implements the hardware-specific command submission. After successfully submitting the command to the GPU, it returns a dma_fence; the scheduler adds a finish-fence wake-up operation to that fence's callback list. This dma_fence is usually signaled by the GPU after the job completes.
timedout_job: called when a job submitted to the GPU takes too long to execute, to trigger the GPU recovery process.
free_job: releases job-related resources after the job has been processed.
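The dependency/run_job contract can be mimicked in user-space C. This is an illustration with toy types, not the real API: if the dependency callback hands back an unsignaled fence, the scheduler registers a wake-up callback on it instead of running the job, and signaling the fence later lets the job run.

```c
#include <assert.h>
#include <stddef.h>

/* Toy fence with a single callback slot (the real dma_fence keeps a list). */
struct toy_fence {
	int signaled;
	void (*cb)(void *arg);
	void *cb_arg;
};

static void fence_add_callback(struct toy_fence *f, void (*cb)(void *), void *arg)
{
	f->cb = cb;
	f->cb_arg = arg;
}

static void fence_signal(struct toy_fence *f)
{
	f->signaled = 1;
	if (f->cb)
		f->cb(f->cb_arg);
}

struct toy_job {
	struct toy_fence *dep;  /* what the .dependency callback would return */
	int ran;
};

static void run_job(struct toy_job *job)  /* stands in for .run_job */
{
	job->ran = 1;
}

static void wakeup_cb(void *arg)
{
	run_job(arg);  /* "wake the scheduler" and run the now-unblocked job */
}

/* One scheduling attempt, mirroring the dependency-callback contract. */
static void sched_try_run(struct toy_job *job)
{
	struct toy_fence *dep = job->dep;
	if (dep && !dep->signaled) {
		fence_add_callback(dep, wakeup_cb, job); /* defer until signal */
		return;
	}
	run_job(job);
}
```

A job with an unsignaled dependency does not run on the first attempt; it runs as a side effect of the fence being signaled, which is exactly the wake-up path described above.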
2. Initialize an entity
int drm_sched_entity_init(struct drm_sched_entity *entity,
enum drm_sched_priority priority,
struct drm_gpu_scheduler **sched_list,
unsigned int num_sched_list,
atomic_t *guilty)
Initializes a struct drm_sched_entity *entity.
priority: specifies the entity's priority; the supported levels, from low to high, correspond to the run-queue priorities listed in the kernel documentation above (e.g. NORMAL, HIGH_SW, HIGH_HW, KERNEL).
sched_list: the list of GPU schedulers that jobs in this entity can be submitted to. When the GPU has multiple command-stream channels, the same job may have several candidate channels it could be submitted to (hardware dependent; the GPU must support it); sched_list holds these candidate channels.
When an entity has multiple GPU schedulers, the DRM scheduler supports a load-balancing algorithm.
num_sched_list: specifies the number of GPU schedulers in sched_list.
drm_sched_entity_init() can be called from the driver's open() hook, so when multiple applications each call open(), the driver creates one entity per application.
As mentioned earlier, the entity is the submission point for jobs, and one command-stream channel on the GPU corresponds to one GPU scheduler. When multiple running applications submit jobs to the same command-stream channel, the jobs are first added to their respective entities and wait for unified scheduling by the GPU scheduler.
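The load-balancing idea can be sketched as follows. This is a toy model (the kernel's actual selection is based on a per-scheduler score, and the names here are invented): with more than one scheduler in sched_list, place the entity on the least-loaded one.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a per-channel scheduler with a load score. */
struct toy_sched {
	const char *name;
	int queued;              /* number of jobs currently queued */
};

/* Pick the scheduler with the fewest queued jobs: the
 * "least-loaded wins" idea behind the load balancing. */
static struct toy_sched *pick_best(struct toy_sched **sched_list,
				   unsigned int num_sched_list)
{
	struct toy_sched *best = NULL;

	for (unsigned int i = 0; i < num_sched_list; i++)
		if (!best || sched_list[i]->queued < best->queued)
			best = sched_list[i];
	return best;
}
```

Re-running the selection as loads change picks a different channel, which is why an entity with several schedulers in its sched_list can migrate between channels.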
3. Initialize a job
int drm_sched_job_init(struct drm_sched_job *job,
struct drm_sched_entity *entity,
void *owner)
entity: specifies the entity the job will be submitted to. If the entity's sched_list contains more than one scheduler, the load-balancing algorithm selects the best GPU scheduler from the entity's sched_list to schedule the job.
drm_sched_job_init() initializes two dma_fences for the job: scheduled and finished. When the scheduled fence is signaled, the job has been sent to the GPU; when the finished fence is signaled, the GPU has completed the job.
Through these two fences, the outside world can learn the job's current state.
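The two fences map to a simple three-state view of a job. A toy model (the state names are invented for illustration):

```c
#include <assert.h>

/* Toy view of the two out-fences drm_sched_job_init() creates. */
struct toy_job {
	int scheduled_signaled;  /* "scheduled" fence: sent to the GPU */
	int finished_signaled;   /* "finished" fence: completed on the GPU */
};

enum job_state { JOB_QUEUED, JOB_ON_GPU, JOB_DONE };

/* Outside observers derive the job's state from the two fences. */
static enum job_state job_state(const struct toy_job *j)
{
	if (j->finished_signaled)
		return JOB_DONE;
	if (j->scheduled_signaled)
		return JOB_ON_GPU;
	return JOB_QUEUED;
}
```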
4. Submit a job
void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
struct drm_sched_entity *entity)
After a job has been initialized by drm_sched_job_init(), it can be submitted to the entity's job_queue with drm_sched_entity_push_job().
If this is the first job submitted to the entity's job_queue, the entity is added to the GPU scheduler's run queue and the scheduler thread is woken up.
5. The role of dma_fence
A DMA fence is the Linux primitive for synchronizing DMA operations between different kernel modules, commonly used for GPU rendering, display buffers, and so on.
Using DMA fences reduces waiting in user space by letting the data synchronization happen in the kernel. Take the synchronization between GPU rendering and the display: rendering writes data into a framebuffer, and the display reads that framebuffer and shows it on screen. The display must wait for rendering to finish before reading the framebuffer (and vice versa: rendering must wait until the display is done before drawing the next image into that framebuffer). We could synchronize the two in the application, i.e. wait for rendering to end before calling the display module, but the application tends to sleep while waiting and can do nothing else (such as preparing the next framebuffer for rendering), and calling GPU rendering again only after the display finishes leaves the GPU under-utilized and idle.
With DMA fences, the synchronization between rendering and the display moves into the kernel. After the application calls GPU rendering, it gets back an out-fence; without waiting for rendering to complete, it can call the display and pass rendering's out-fence to the display module as an in-fence. In the kernel, the display module waits for this in-fence to be signaled before scanning out, and once GPU rendering completes, the fence is signaled, with no further application involvement.
For a more detailed explanation, refer to this article:
In the GPU scheduler, a job's dependencies must be checked before it is scheduled, and after it is scheduled its state must be reported to the outside; both are implemented with fences.
A job's dependencies are called its in-fences, and the job's own status report is its out-fence; they are essentially the same data structure.
When an in-fence exists and is not in the signaled state, the job must keep waiting (not by calling dma_fence_wait()) until all in-fences are signaled. Before waiting, the GPU scheduler registers a callback (dma_fence_cb) on the in-fence, which includes the code to wake the scheduler; when the in-fence is signaled, this callback is invoked and scheduling of the job resumes.
A job has two out-fences: scheduled and finished. When the scheduled fence is signaled, the job has been sent to the GPU; when the finished fence is signaled, the GPU has completed the job.
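The rendering/display handoff described above can be modeled in a few lines of user-space C (toy types, invented names): the renderer's out-fence is handed to the display as its in-fence, and the display flips only once that fence has signaled, with no application wait in between.

```c
#include <assert.h>
#include <stddef.h>

struct toy_fence { int signaled; };

/* The display holds the renderer's out-fence as its in-fence. */
struct toy_display {
	struct toy_fence *in_fence;
	int shown;
};

/* The "kernel-side" wait: flip only when the in-fence has signaled. */
static void display_try_flip(struct toy_display *d)
{
	if (d->in_fence && d->in_fence->signaled)
		d->shown = 1;
}
```

The application can queue the flip immediately after queueing the render; the frame appears only after the render out-fence signals.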
Test code on VC4:
The vc4 driver does not use the DRM GPU scheduler. Its handling of in-fences is blocking, i.e. the application blocks there, which does not seem to fit the purpose of fences. The current driver also supports only a single in-fence, while Vulkan can pass in multiple dependencies.
Actual testing found no significant performance change compared to the original driver, but using the GPU scheduler solves the fence problem. Of course, the main purpose here is practice; the code mainly refers to the v3d driver.
1. First, call drm_sched_init() to create the schedulers. VC4 has two command entries, bin and render, so two schedulers are created here.
static const struct drm_sched_backend_ops vc4_bin_sched_ops = {
	.dependency = vc4_job_dependency,
	.run_job = vc4_bin_job_run,
	.timedout_job = NULL,
	.free_job = vc4_job_free,
};

static const struct drm_sched_backend_ops vc4_render_sched_ops = {
	.dependency = vc4_job_dependency,
	.run_job = vc4_render_job_run,
	.timedout_job = NULL,
	.free_job = vc4_job_free,
};

int vc4_sched_init(struct vc4_dev *vc4)
{
	int hw_jobs_limit = 1;
	int job_hang_limit = 0;
	int hang_limit_ms = 500;
	int ret;

	ret = drm_sched_init(&vc4->queue[VC4_BIN].sched,
			     &vc4_bin_sched_ops,
			     hw_jobs_limit,
			     job_hang_limit,
			     msecs_to_jiffies(hang_limit_ms),
			     "vc4_bin");
	if (ret) {
		dev_err(vc4->base.dev, "Failed to create bin scheduler: %d.", ret);
		return ret;
	}

	ret = drm_sched_init(&vc4->queue[VC4_RENDER].sched,
			     &vc4_render_sched_ops,
			     hw_jobs_limit,
			     job_hang_limit,
			     msecs_to_jiffies(hang_limit_ms),
			     "vc4_render");
	if (ret) {
		dev_err(vc4->base.dev, "Failed to create render scheduler: %d.", ret);
		vc4_sched_fini(vc4);
		return ret;
	}

	return ret;
}
2. In the drm driver's open callback, add the entity initialization code.
static int vc4_open(struct drm_device *dev, struct drm_file *file)
{
	struct vc4_dev *vc4 = to_vc4_dev(dev);
	struct vc4_file *vc4file;
	struct drm_gpu_scheduler *sched;
	int i;

	vc4file = kzalloc(sizeof(*vc4file), GFP_KERNEL);
	if (!vc4file)
		return -ENOMEM;

	vc4_perfmon_open_file(vc4file);

	for (i = 0; i < VC4_MAX_QUEUES; i++) {
		sched = &vc4->queue[i].sched;
		drm_sched_entity_init(&vc4file->sched_entity[i],
				      DRM_SCHED_PRIORITY_NORMAL,
				      &sched, 1,
				      NULL);
	}

	file->driver_priv = vc4file;

	return 0;
}
3. After the driver has finished packing a job, the job can be submitted to the entity.
static void vc4_job_free(struct kref *ref)
{
	struct vc4_job *job = container_of(ref, struct vc4_job, refcount);
	struct vc4_dev *vc4 = job->dev;
	struct vc4_exec_info *exec = job->exec;
	struct vc4_seqno_cb *cb, *cb_temp;
	struct dma_fence *fence;
	unsigned long index;
	unsigned long irqflags;

	xa_for_each(&job->deps, index, fence) {
		dma_fence_put(fence);
	}
	xa_destroy(&job->deps);

	dma_fence_put(job->irq_fence);
	dma_fence_put(job->done_fence);

	if (exec)
		vc4_complete_exec(&job->dev->base, exec);

	spin_lock_irqsave(&vc4->job_lock, irqflags);
	list_for_each_entry_safe(cb, cb_temp, &vc4->seqno_cb_list, work.entry) {
		if (cb->seqno <= vc4->finished_seqno) {
			list_del_init(&cb->work.entry);
			schedule_work(&cb->work);
		}
	}
	spin_unlock_irqrestore(&vc4->job_lock, irqflags);

	kfree(job);
}

void vc4_job_put(struct vc4_job *job)
{
	kref_put(&job->refcount, job->free);
}

static int vc4_job_init(struct vc4_dev *vc4, struct drm_file *file_priv,
			struct vc4_job *job, void (*free)(struct kref *ref), u32 in_sync)
{
	struct dma_fence *in_fence = NULL;
	int ret;

	xa_init_flags(&job->deps, XA_FLAGS_ALLOC);

	if (in_sync) {
		ret = drm_syncobj_find_fence(file_priv, in_sync, 0, 0, &in_fence);
		if (ret == -EINVAL)
			goto fail;

		ret = drm_gem_fence_array_add(&job->deps, in_fence);
		if (ret) {
			dma_fence_put(in_fence);
			goto fail;
		}
	}

	kref_init(&job->refcount);
	job->free = free;

	return 0;

fail:
	xa_destroy(&job->deps);
	return ret;
}

static int vc4_push_job(struct drm_file *file_priv, struct vc4_job *job, enum vc4_queue queue)
{
	struct vc4_file *vc4file = file_priv->driver_priv;
	int ret;

	ret = drm_sched_job_init(&job->base, &vc4file->sched_entity[queue], vc4file);
	if (ret)
		return ret;

	job->done_fence = dma_fence_get(&job->base.s_fence->finished);

	kref_get(&job->refcount);

	drm_sched_entity_push_job(&job->base, &vc4file->sched_entity[queue]);

	return 0;
}
/* Queues a struct vc4_exec_info for execution. If no job is
 * currently executing, then submits it.
 *
 * Unlike most GPUs, our hardware only handles one command list at a
 * time. To queue multiple jobs at once, we'd need to edit the
 * previous command list to have a jump to the new one at the end, and
 * then bump the end address. That's a change for a later date,
 * though.
 */
static int
vc4_queue_submit_to_scheduler(struct drm_device *dev,
			      struct drm_file *file_priv,
			      struct vc4_exec_info *exec,
			      struct ww_acquire_ctx *acquire_ctx)
{
	struct vc4_dev *vc4 = to_vc4_dev(dev);
	struct drm_vc4_submit_cl *args = exec->args;
	struct vc4_job *bin = NULL;
	struct vc4_job *render = NULL;
	struct drm_syncobj *out_sync;
	uint64_t seqno;
	unsigned long irqflags;
	int ret;

	spin_lock_irqsave(&vc4->job_lock, irqflags);
	seqno = ++vc4->emit_seqno;
	exec->seqno = seqno;
	spin_unlock_irqrestore(&vc4->job_lock, irqflags);

	render = kcalloc(1, sizeof(*render), GFP_KERNEL);
	if (!render)
		return -ENOMEM;

	render->exec = exec;

	ret = vc4_job_init(vc4, file_priv, render, vc4_job_free, args->in_sync);
	if (ret) {
		kfree(render);
		return ret;
	}

	if (args->bin_cl_size != 0) {
		bin = kcalloc(1, sizeof(*bin), GFP_KERNEL);
		if (!bin) {
			vc4_job_put(render);
			return -ENOMEM;
		}

		bin->exec = exec;

		ret = vc4_job_init(vc4, file_priv, bin, vc4_job_free, args->in_sync);
		if (ret) {
			vc4_job_put(render);
			kfree(bin);
			return ret;
		}
	}

	mutex_lock(&vc4->sched_lock);

	if (bin) {
		ret = vc4_push_job(file_priv, bin, VC4_BIN);
		if (ret)
			goto fail_unlock;

		ret = drm_gem_fence_array_add(&render->deps, dma_fence_get(bin->done_fence));
		if (ret)
			goto fail_unlock;
	}

	vc4_push_job(file_priv, render, VC4_RENDER);

	mutex_unlock(&vc4->sched_lock);

	if (args->out_sync) {
		out_sync = drm_syncobj_find(file_priv, args->out_sync);
		if (!out_sync) {
			ret = -EINVAL;
			goto fail;
		}

		/* Export the render job's finished fence as the out-fence. */
		drm_syncobj_replace_fence(out_sync, render->done_fence);
		exec->fence = render->done_fence;

		drm_syncobj_put(out_sync);
	}

	vc4_update_bo_seqnos(exec, seqno);

	vc4_unlock_bo_reservations(dev, exec, acquire_ctx);

	if (bin)
		vc4_job_put(bin);
	vc4_job_put(render);

	return 0;

fail_unlock:
	mutex_unlock(&vc4->sched_lock);	/* don't leak the lock on error */
fail:
	return ret;
}

