Notes on Linux DRM GPU scheduler

itread01 2021-01-21 01:46:24


From the core documentation:
 
Overview
 
The GPU scheduler provides entities which allow userspace to push jobs into software queues which are then scheduled on a hardware run queue. The software queues have a priority among them. The scheduler selects the entities from the run queue using a FIFO. The scheduler provides dependency handling features among jobs. The driver is supposed to provide callback functions for backend operations to the scheduler like submitting a job to hardware run queue, returning the dependencies of a job etc.
 
The organisation of the scheduler is the following:
1. Each hw run queue has one scheduler
2. Each scheduler has multiple run queues with different priorities (e.g., HIGH_HW, HIGH_SW, KERNEL, NORMAL)
3. Each scheduler run queue has a queue of entities to schedule
4. Entities themselves maintain a queue of jobs that will be scheduled on the hardware.
The jobs in an entity are always scheduled in the order that they were pushed.
 
Principle overview:
 
Modern GPUs expose a command stream interface to the CPU. These command streams drive the GPU hardware: they issue shader programs and deliver the state values required by OpenGL or Vulkan.
The GPU scheduler in Linux schedules these GPU command streams. The code was originally part of the AMD GPU driver and was later split out as an independent component.
As the core documentation describes, the GPU scheduler provides entities through which user programs submit jobs. Jobs are first added to software queues and are then dispatched by the scheduler onto the hardware (GPU).
Each command stream channel on the GPU corresponds to one GPU scheduler.
The GPU scheduler combines two scheduling policies: entities are scheduled in priority order, and within the same priority in first-in first-out (FIFO) order. The GPU scheduler submits jobs to different hardware through driver-supplied callback functions.
Before a job is committed to hardware, the GPU scheduler performs dependency checking: a job is submitted to the hardware only once all of its dependencies have become available.
 
To recap: each command stream channel on the GPU corresponds to one gpu scheduler, and one gpu scheduler owns multiple run queues, which represent different priorities.
When a new job needs to be submitted to the GPU, it is first submitted to an entity. A load balancing algorithm then determines which gpu scheduler the entity will ultimately be scheduled on; the entity is added to the run queue list of the selected gpu scheduler and waits there to be scheduled.
 
 
Basic usage:
1. Before the GPU scheduler can work it must be initialized, at which point the hardware-related callback functions are also provided; one command stream channel on the GPU corresponds to one scheduler.
2. Inside the scheduler there are software run queues corresponding to different priorities; higher-priority run queues are scheduled first.
3. A new job is first submitted to an entity, and the entity is then added to the queue of the scheduler's run queue. Within the same priority, jobs and entities are scheduled on a first-in first-out (FIFO) basis.
4. When the scheduler starts scheduling, it first picks the earliest-queued entity from the highest-priority run queue, and then picks the earliest-queued job from the chosen entity.
5. Before a job can be submitted to the GPU hardware, dependency checking is required, for example whether the framebuffer about to be rendered is available (dependency checking is also implemented through callback functions).
6. The scheduler submits the selected job to the GPU's hardware run queue using the callback functions provided at initialization.
7. After the GPU finishes a job, the GPU scheduler is notified through a dma_fence callback, and the finish fence is signaled.
 
1. Registering the scheduler
 
struct drm_gpu_scheduler;

int drm_sched_init(struct drm_gpu_scheduler *sched,        /* scheduler instance */
                   const struct drm_sched_backend_ops *ops, /* backend operations for this scheduler */
                   unsigned hw_submission,                  /* number of hw submissions that can be in flight */
                   unsigned hang_limit,                     /* number of times to allow a job to hang before dropping it */
                   long timeout,                            /* timeout value in jiffies for the scheduler */
                   const char *name);                       /* name used for debugging */
The function drm_sched_init() performs the initialization of a struct drm_gpu_scheduler *sched.
Once it completes, a kernel thread is started; this thread implements the main scheduling logic.
The thread sleeps until a new job is submitted, which wakes up scheduling.
 
hw_submission: specifies how many commands a single command channel on the GPU can have in flight at the same time.
 
drm_sched_init() thus initializes a GPU scheduler, with the argument const struct drm_sched_backend_ops *ops providing the platform-specific callback interface.
 
struct drm_sched_backend_ops {
        struct dma_fence *(*dependency)(struct drm_sched_job *sched_job,
                                        struct drm_sched_entity *s_entity);
        struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
        void (*timedout_job)(struct drm_sched_job *sched_job);
        void (*free_job)(struct drm_sched_job *sched_job);
};

 

dependency: called when a job is considered as the next object to be scheduled. If the job has dependencies, this callback must return a dma_fence pointer; the GPU scheduler adds a wake-up operation to that fence's callback list, so that once the fence is signaled the GPU scheduler is woken up again. If there are no dependencies, it returns NULL.
run_job: called once all of a job's dependencies have become available. Its main purpose is to implement the GPU-hardware-specific command submission. After successfully submitting the command to the GPU it returns a dma_fence; the gpu scheduler adds a wake-up operation for the finish fence to that dma_fence's callback list. This dma_fence is usually signaled after the GPU has completed the job.
timedout_job: called when a job submitted to the GPU has been executing for too long, to trigger the GPU recovery process.
free_job: releases the job-related resources after the job has been processed.
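To make this callback contract concrete, here is a minimal sketch (not from the original article) of how a driver might fill in these ops. All mydrv_* names, the deps xarray, and the helpers mydrv_hw_submit()/mydrv_job_free() are hypothetical; handing back one unsignaled dependency fence at a time mirrors what real drivers such as v3d do.

/* Minimal sketch of a driver's backend ops; every mydrv_* name is
 * hypothetical and only illustrates the callback contract above. */
struct mydrv_job {
	struct drm_sched_job base;
	struct xarray deps;        /* in-fences this job depends on */
	unsigned long last_dep;    /* iteration cursor for ->dependency() */
	struct dma_fence *done_fence;
};

static struct dma_fence *mydrv_job_dependency(struct drm_sched_job *sched_job,
					      struct drm_sched_entity *s_entity)
{
	struct mydrv_job *job = container_of(sched_job, struct mydrv_job, base);

	/* Hand the scheduler one unsignaled fence at a time; the scheduler
	 * hooks its wake-up callback onto it. NULL means "no dependencies
	 * left, the job may run". */
	if (!xa_empty(&job->deps))
		return xa_erase(&job->deps, job->last_dep++);

	return NULL;
}

static struct dma_fence *mydrv_run_job(struct drm_sched_job *sched_job)
{
	struct mydrv_job *job = container_of(sched_job, struct mydrv_job, base);

	/* Kick the hardware, and return a fence that the driver's IRQ
	 * handler will signal when the hardware finishes this job. */
	return mydrv_hw_submit(job);    /* hypothetical HW-specific helper */
}

static const struct drm_sched_backend_ops mydrv_sched_ops = {
	.dependency   = mydrv_job_dependency,
	.run_job      = mydrv_run_job,
	.timedout_job = NULL,             /* hook GPU reset here if supported */
	.free_job     = mydrv_job_free,   /* hypothetical: drop refs, kfree */
};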
 
2. Initializing an entity
int drm_sched_entity_init(struct drm_sched_entity *entity,
                          enum drm_sched_priority priority,
                          struct drm_gpu_scheduler **sched_list,
                          unsigned int num_sched_list,
                          atomic_t *guilty)
This initializes a struct drm_sched_entity *entity.
priority: specifies the entity's priority. The currently supported priorities are (from low to high):
  
DRM_SCHED_PRIORITY_MIN
DRM_SCHED_PRIORITY_NORMAL
DRM_SCHED_PRIORITY_HIGH
DRM_SCHED_PRIORITY_KERNEL

 

sched_list: the list of gpu schedulers to which the entity's jobs may be submitted. When the GPU has multiple command stream channels, the same job may potentially be submitted through several of them (this is hardware-specific and requires GPU support); sched_list holds these candidate channels.
When an entity has multiple gpu schedulers, the drm scheduler supports load balancing between them.
num_sched_list: specifies the number of gpu schedulers in sched_list.
 
drm_sched_entity_init() can be called from the driver's open() hook, so when several applications each call open(), the driver creates one entity per application.
As mentioned earlier, the entity is the submission point for jobs, and one command stream channel on the GPU corresponds to one gpu scheduler. When multiple running applications submit jobs to the same GPU command stream channel, the jobs are first added to their respective entities and wait there for the gpu scheduler's unified scheduling. A sketch of the multi-scheduler case follows.
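As an illustration of the multi-scheduler case, here is a sketch assuming a hypothetical device with two identical command channels (mydev, myfile and the queue layout are invented; the VC4 code later in this article passes a single-entry list instead):

/* Sketch, e.g. inside an open() hook: one entity whose jobs can be
 * load-balanced across two identical command channels. */
struct drm_gpu_scheduler *sched_list[] = {
	&mydev->queue[0].sched,     /* hypothetical device layout */
	&mydev->queue[1].sched,
};
int ret;

ret = drm_sched_entity_init(&myfile->entity,
			    DRM_SCHED_PRIORITY_NORMAL,
			    sched_list, ARRAY_SIZE(sched_list),
			    NULL /* guilty */);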
 
3. Initializing a job
int drm_sched_job_init(struct drm_sched_job *job,
                       struct drm_sched_entity *entity,
                       void *owner)
entity: specifies the entity the job will be submitted to. If the entity's sched_list contains more than one scheduler, the load balancing algorithm is invoked to choose the most suitable gpu scheduler from the entity's sched_list to schedule the job.
drm_sched_job_init() initializes two dma_fences for the job, scheduled and finished. When the scheduled fence is signaled, the job is about to be sent to the GPU; when the finished fence is signaled, the job has already finished executing on the GPU.
These two fences are how the job's current state is reported to the outside world, as the sketch below illustrates.
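For illustration, a small sketch (not from the article) of how these two fences can be queried through the ordinary dma_fence API; "job" is any struct drm_sched_job that has been through drm_sched_job_init():

/* Sketch: reporting a job's state from its two out-fences, which live
 * in job->s_fence after drm_sched_job_init(). */
static void report_job_state(struct drm_sched_job *job)
{
	struct drm_sched_fence *f = job->s_fence;

	if (dma_fence_is_signaled(&f->scheduled))
		pr_info("job has been picked and sent to the GPU\n");
	if (dma_fence_is_signaled(&f->finished))
		pr_info("job has finished on the GPU\n");
}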
 
4. Submitting a job
 
void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
                               struct drm_sched_entity *entity)
Once a job has been initialized by drm_sched_job_init(), the function drm_sched_entity_push_job() submits it to the entity's job_queue.
If this is the first job ever pushed onto the entity's job_queue, the entity is added to the gpu scheduler's run queue and the gpu scheduler's scheduling thread is woken up. The sketch below shows this submit sequence.
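Here is a sketch of that sequence, reusing the hypothetical mydrv_job from the earlier sketch (compare vc4_push_job() in the test code below):

/* Sketch of the submit sequence; mydrv_file and its entity member are
 * hypothetical. */
static int mydrv_submit(struct drm_file *file_priv, struct mydrv_job *job)
{
	struct mydrv_file *mf = file_priv->driver_priv;   /* hypothetical */
	int ret;

	ret = drm_sched_job_init(&job->base, &mf->entity, mf);
	if (ret)
		return ret;

	/* Keep a reference to the finished fence so completion can be
	 * observed later (e.g. exported to userspace via a syncobj). */
	job->done_fence = dma_fence_get(&job->base.s_fence->finished);

	/* Queue the job on the entity; if the entity was idle this also
	 * adds it to the scheduler's run queue and wakes the thread. */
	drm_sched_entity_push_job(&job->base, &mf->entity);

	return 0;
}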
 
5. The role of dma_fence
 
A DMA fence is the Linux primitive for synchronizing DMA operations across different kernel modules; it is commonly used between GPU rendering, display buffers, and so on.
Using DMA fences reduces waiting in user space by letting data synchronization happen in the kernel. Take the synchronization of GPU rendering and buffer display as an example: GPU rendering writes rendered data into a framebuffer, while the display is responsible for putting the framebuffer's contents on screen. The display therefore has to wait for GPU rendering to finish before it can read the framebuffer (and the reverse is also true: GPU rendering has to wait until the display is done before drawing the next image into the framebuffer). We could synchronize the two in the application, waiting for rendering to finish before calling the display module, but the application typically sleeps while waiting and can do nothing else (such as preparing the rendering of the next framebuffer); and calling GPU rendering only after the display finishes also leaves the GPU unsaturated and idle.
With DMA fences, the synchronization of GPU rendering and display moves into the kernel. After the application issues GPU rendering it gets back an out fence; without waiting for rendering to complete, it calls the display and passes the rendering's out fence to the display module as an in fence. Inside the kernel, the display module waits until that in fence is signaled before displaying, and the fence is signaled as soon as GPU rendering completes. Application participation is no longer required.
A more detailed explanation can be found at https://www.collabora.com/news-and-blog/blog/2016/09/13/mainline-explicit-fencing-part-1/.
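A sketch of the kernel side of this pattern: dma_fence_add_callback() is the real dma-fence API for running work when a fence signals, while all mydisp_* display names are hypothetical.

/* Sketch: a display driver defers a page-flip until the render job's
 * out fence (its in fence here) signals. */
struct mydisp_flip {
	struct dma_fence_cb cb;
	struct mydisp_crtc *crtc;
};

static void mydisp_fence_signaled(struct dma_fence *fence,
				  struct dma_fence_cb *cb)
{
	struct mydisp_flip *flip = container_of(cb, struct mydisp_flip, cb);

	mydisp_program_flip(flip->crtc);   /* hypothetical: latch the new fb */
}

static void mydisp_queue_flip(struct mydisp_flip *flip,
			      struct dma_fence *in_fence)
{
	/* dma_fence_add_callback() returns -ENOENT if the fence is already
	 * signaled; in that case flip immediately instead of waiting. */
	if (dma_fence_add_callback(in_fence, &flip->cb, mydisp_fence_signaled))
		mydisp_program_flip(flip->crtc);
}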
 
In the gpu scheduler, a job's dependencies must be satisfied before it is scheduled, and after being scheduled a job needs to report its current state to the outside world; both are implemented through fences.
A job's dependencies are called in fences, and a job's own state reporting uses out fences; they are essentially the same data structure.
When a job has in fences that are not yet in the signaled state, the job has to keep waiting (the waiting here is not done by calling dma_fence_wait()) until all in fences are signaled. Before waiting, the gpu scheduler registers a new callback, a dma_fence_cb (which contains the code to wake up the gpu scheduler's scheduling thread), on the in fence; when the in fence is signaled, this callback is invoked and scheduling of the job is woken up again.
A job has two out fences, scheduled and finished. When the scheduled fence is signaled, the job is about to be sent to the GPU; when the finished fence is signaled, the job has already finished executing on the GPU.
 
Testing this on VC4:
The vc4 driver does not use drm's gpu scheduler as its scheduler. VC4's handling of in fences is blocking, i.e. the application blocks there, which does not really fit the purpose of fences. Moreover, the current driver supports only a single in fence, while Vulkan can pass in multiple dependencies.
Actual testing showed no significant performance change compared with the original driver, but using the gpu scheduler does solve the fence problems above. Of course, the main purpose here was practice; the code mainly follows the v3d driver.
 
1. The first step is to call drm_sched_init() to create the schedulers. VC4 has two command entry points, bin and render, so two schedulers are created here.
 
static const struct drm_sched_backend_ops vc4_bin_sched_ops = {
	.dependency = vc4_job_dependency,
	.run_job = vc4_bin_job_run,
	.timedout_job = NULL,
	.free_job = vc4_job_free,
};

static const struct drm_sched_backend_ops vc4_render_sched_ops = {
	.dependency = vc4_job_dependency,
	.run_job = vc4_render_job_run,
	.timedout_job = NULL,
	.free_job = vc4_job_free,
};

int vc4_sched_init(struct vc4_dev *vc4)
{
	int hw_jobs_limit = 1;
	int job_hang_limit = 0;
	int hang_limit_ms = 500;
	int ret;

	ret = drm_sched_init(&vc4->queue[VC4_BIN].sched,
			     &vc4_bin_sched_ops,
			     hw_jobs_limit,
			     job_hang_limit,
			     msecs_to_jiffies(hang_limit_ms),
			     "vc4_bin");
	if (ret) {
		dev_err(vc4->base.dev, "Failed to create bin scheduler: %d.", ret);
		return ret;
	}

	ret = drm_sched_init(&vc4->queue[VC4_RENDER].sched,
			     &vc4_render_sched_ops,
			     hw_jobs_limit,
			     job_hang_limit,
			     msecs_to_jiffies(hang_limit_ms),
			     "vc4_render");
	if (ret) {
		dev_err(vc4->base.dev, "Failed to create render scheduler: %d.", ret);
		vc4_sched_fini(vc4);
		return ret;
	}

	return ret;
}
 
2. Add the entity initialization code to the drm driver's open callback.
static int vc4_open(struct drm_device *dev, struct drm_file *file)
{
	struct vc4_dev *vc4 = to_vc4_dev(dev);
	struct vc4_file *vc4file;
	struct drm_gpu_scheduler *sched;
	int i;

	vc4file = kzalloc(sizeof(*vc4file), GFP_KERNEL);
	if (!vc4file)
		return -ENOMEM;

	vc4_perfmon_open_file(vc4file);

	for (i = 0; i < VC4_MAX_QUEUES; i++) {
		sched = &vc4->queue[i].sched;
		drm_sched_entity_init(&vc4file->sched_entity[i],
				      DRM_SCHED_PRIORITY_NORMAL,
				      &sched, 1,
				      NULL);
	}

	file->driver_priv = vc4file;

	return 0;
}
 
3. After the driver has packaged a job, it can submit the job to the entity.
static void vc4_job_free(struct kref *ref)
{
	struct vc4_job *job = container_of(ref, struct vc4_job, refcount);
	struct vc4_dev *vc4 = job->dev;
	struct vc4_exec_info *exec = job->exec;
	struct vc4_seqno_cb *cb, *cb_temp;
	struct dma_fence *fence;
	unsigned long index;
	unsigned long irqflags;

	xa_for_each(&job->deps, index, fence) {
		dma_fence_put(fence);
	}
	xa_destroy(&job->deps);

	dma_fence_put(job->irq_fence);
	dma_fence_put(job->done_fence);

	if (exec)
		vc4_complete_exec(&job->dev->base, exec);

	spin_lock_irqsave(&vc4->job_lock, irqflags);
	list_for_each_entry_safe(cb, cb_temp, &vc4->seqno_cb_list, work.entry) {
		if (cb->seqno <= vc4->finished_seqno) {
			list_del_init(&cb->work.entry);
			schedule_work(&cb->work);
		}
	}
	spin_unlock_irqrestore(&vc4->job_lock, irqflags);

	kfree(job);
}

void vc4_job_put(struct vc4_job *job)
{
	kref_put(&job->refcount, job->free);
}

static int vc4_job_init(struct vc4_dev *vc4, struct drm_file *file_priv,
			struct vc4_job *job, void (*free)(struct kref *ref), u32 in_sync)
{
	struct dma_fence *in_fence = NULL;
	int ret;

	xa_init_flags(&job->deps, XA_FLAGS_ALLOC);

	if (in_sync) {
		ret = drm_syncobj_find_fence(file_priv, in_sync, 0, 0, &in_fence);
		if (ret == -EINVAL)
			goto fail;

		ret = drm_gem_fence_array_add(&job->deps, in_fence);
		if (ret) {
			dma_fence_put(in_fence);
			goto fail;
		}
	}

	kref_init(&job->refcount);
	job->free = free;

	return 0;

fail:
	xa_destroy(&job->deps);
	return ret;
}

static int vc4_push_job(struct drm_file *file_priv, struct vc4_job *job, enum vc4_queue queue)
{
	struct vc4_file *vc4file = file_priv->driver_priv;
	int ret;

	ret = drm_sched_job_init(&job->base, &vc4file->sched_entity[queue], vc4file);
	if (ret)
		return ret;

	job->done_fence = dma_fence_get(&job->base.s_fence->finished);

	kref_get(&job->refcount);

	drm_sched_entity_push_job(&job->base, &vc4file->sched_entity[queue]);

	return 0;
}

/* Queues a struct vc4_exec_info for execution. If no job is
 * currently executing, then submits it.
 *
 * Unlike most GPUs, our hardware only handles one command list at a
 * time. To queue multiple jobs at once, we'd need to edit the
 * previous command list to have a jump to the new one at the end, and
 * then bump the end address. That's a change for a later date,
 * though.
 */
static int
vc4_queue_submit_to_scheduler(struct drm_device *dev,
			      struct drm_file *file_priv,
			      struct vc4_exec_info *exec,
			      struct ww_acquire_ctx *acquire_ctx)
{
	struct vc4_dev *vc4 = to_vc4_dev(dev);
	struct drm_vc4_submit_cl *args = exec->args;
	struct vc4_job *bin = NULL;
	struct vc4_job *render = NULL;
	struct drm_syncobj *out_sync;
	uint64_t seqno;
	unsigned long irqflags;
	int ret;

	spin_lock_irqsave(&vc4->job_lock, irqflags);
	seqno = ++vc4->emit_seqno;
	exec->seqno = seqno;
	spin_unlock_irqrestore(&vc4->job_lock, irqflags);

	render = kcalloc(1, sizeof(*render), GFP_KERNEL);
	if (!render)
		return -ENOMEM;

	render->exec = exec;

	ret = vc4_job_init(vc4, file_priv, render, vc4_job_free, args->in_sync);
	if (ret) {
		kfree(render);
		return ret;
	}

	if (args->bin_cl_size != 0) {
		bin = kcalloc(1, sizeof(*bin), GFP_KERNEL);
		if (!bin) {
			vc4_job_put(render);
			return -ENOMEM;
		}

		bin->exec = exec;

		ret = vc4_job_init(vc4, file_priv, bin, vc4_job_free, args->in_sync);
		if (ret) {
			vc4_job_put(render);
			kfree(bin);
			return ret;
		}
	}

	mutex_lock(&vc4->sched_lock);

	if (bin) {
		ret = vc4_push_job(file_priv, bin, VC4_BIN);
		if (ret)
			goto fail_unlock;

		ret = drm_gem_fence_array_add(&render->deps, dma_fence_get(bin->done_fence));
		if (ret)
			goto fail_unlock;
	}

	vc4_push_job(file_priv, render, VC4_RENDER);

	mutex_unlock(&vc4->sched_lock);

	if (args->out_sync) {
		out_sync = drm_syncobj_find(file_priv, args->out_sync);
		if (!out_sync) {
			ret = -EINVAL;
			goto fail;
		}

		drm_syncobj_replace_fence(out_sync, &bin->base.s_fence->scheduled);
		exec->fence = render->done_fence;

		drm_syncobj_put(out_sync);
	}

	vc4_update_bo_seqnos(exec, seqno);

	vc4_unlock_bo_reservations(dev, exec, acquire_ctx);

	if (bin)
		vc4_job_put(bin);
	vc4_job_put(render);

	return 0;

fail_unlock:
	mutex_unlock(&vc4->sched_lock);
fail:
	return ret;
}

 

References:
 
https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#gpu-scheduler
https://rosenzweig.io/blog/from-bifrost-to-panfrost.html
https://www.collabora.com/news-and-blog/blog/2017/01/26/mainline-explicit-fencing-part-3/
Copyright notice: this article was created by [itread01]; please include a link to the original when reposting.
https://javamana.com/2021/01/20210121014355157l.html
