update template

jackfiled 2024-11-24 20:27:02 +08:00
parent 5e5275b73b
commit a5a2b78429
6 changed files with 371 additions and 22 deletions


@ -1,4 +1,4 @@
\usepackage{ctex} % invoke CJKfntef
\usepackage[fontset=windows]{ctex} % invoke CJKfntef
\usepackage{xeCJKfntef}
\usepackage{setspace} % spacing
\usepackage{xcolor} % color
@ -104,21 +104,23 @@ pdfborder=001, linkcolor=black, citecolor=black, urlcolor=black]{hyperref} %
\usepackage{titletoc}
\newif{\ifpagenumber}
\pagenumbertrue
\renewcommand\contentsname{\centerline{\toctitlefont{目\qquad{}录}}}
\titlecontents{chapter}[0em]{\tocchapterfont\vspace{0.4mm}} {%
\renewcommand\contentsname{\centerline{\heiti\zihao{3}\textbf{{目\qquad{}录}}}}
\titlecontents{chapter}[0em]{\heiti\zihao{-4}\vspace{0.4mm}} {%
\ifpagenumber \CTEXnumber{\CJKsection}{\thecontentslabel}{第\CJKsection{章}\quad{}} \fi}{%
} {%
\ifpagenumber \hspace{.5em}\titlerule*[6pt]{$\cdot$}\contentspage\fi}
\titlecontents{section}[1em]{\tocsectionfont\vspace{0.4mm}}{%
\titlecontents{section}[1em]{\songti\zihao{-4}\vspace{0.4mm}}{%
\thecontentslabel\quad{}}{}{%
\ifpagenumber \hspace{.5em}\titlerule*[6pt]{$\cdot$}\contentspage\fi}%
\titlecontents{subsection}[2em]{\tocsubsectionfont\vspace{0.4mm}}{%
\titlecontents{subsection}[2em]{\songti\zihao{-4}\vspace{0.4mm}}{%
\thecontentslabel\quad{}}{}{%
\ifpagenumber \hspace{.5em}\titlerule*[6pt]{$\cdot$}\contentspage\fi}%
\titlecontents{subsubsection}[3em]{\tocsubsubsectionfont\vspace{0.4mm}}{%
\titlecontents{subsubsection}[3em]{\songti\zihao{-4}\vspace{0.4mm}}{%
\thecontentslabel\quad{}}{}{%
\ifpagenumber \hspace{.5em}\titlerule*[6pt]{$\cdot$}\contentspage\fi}%
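The \ifpagenumber switch defined above gates both the chapter label and the dotted leader with its page number in every \titlecontents rule. A minimal usage sketch (an assumption about the intended use, not part of the commit):

```latex
% Sketch (assumed usage): toggle TOC page numbers via the \ifpagenumber
% switch defined in the preamble above.
\pagenumberfalse      % suppress labels and dotted page-number leaders
\tableofcontents
\clearpage
\pagenumbertrue       % restore the preamble default
```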
\makeatletter % Content Page style
% Content Page style
\makeatletter
\renewcommand\frontmatter{%
\if@openright\cleardoublepage\else\clearpage\fi%
\@mainmatterfalse%
@ -127,7 +129,7 @@ pdfborder=001, linkcolor=black, citecolor=black, urlcolor=black]{hyperref} %
}
\makeatother%
\makeatletter % Content Page style
\makeatletter
\renewcommand\mainmatter{%
\if@openright\cleardoublepage\else\clearpage\fi%
\@mainmattertrue%
@ -146,12 +148,12 @@ pdfborder=001, linkcolor=black, citecolor=black, urlcolor=black]{hyperref} %
% Text style
\usepackage{titlesec}
\titleformat{\chapter}[hang]{\heiti\zihao{3}\centering\textbf}{第\chinese{chapter}章}{1em}{}
\titleformat{\chapter}[hang]{\heiti\zihao{3}\centering\bfseries}{第\chinese{chapter}章}{1em}{}
% 缩短章节标题的上边距
\titlespacing{\chapter}{0pt}{-20pt}{12pt}
\titleformat{\section}{\heiti\zihao{4}\textbf}{\thesection}{1em}{}
\titleformat{\subsection}{\heiti\zihao{-4}\textbf}{\qquad{}\thesubsection}{1em}{}
\titleformat{\subsubsection}{\heiti\zihao{-4}\textbf}{\qquad{}\thesubsubsection}{1em}{}
\titleformat{\section}{\heiti\zihao{4}\bfseries}{\thesection}{1em}{}
\titleformat{\subsection}{\heiti\zihao{-4}\bfseries}{\qquad{}\thesubsection}{1em}{}
\titleformat{\subsubsection}{\heiti\zihao{-4}\bfseries}{\qquad{}\thesubsubsection}{1em}{}
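As a quick illustration of the heading formats defined above (a sketch; the heading texts are placeholders), chapter numbers are rendered in Chinese via \chinese{chapter}:

```latex
% Sketch: headings as styled by the \titleformat rules above.
\chapter{绪论}        % -> 第一章 绪论 (heiti, size 3, bold, centered)
\section{研究背景}     % -> 1.1 研究背景 (heiti, size 4, bold)
\subsection{研究现状}  % -> 1.1.1 研究现状 (heiti, small size 4, bold)
```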
% Reference style
% \usepackage[numbers,sort&compress]{natbib}
@ -164,9 +166,12 @@ pdfborder=001, linkcolor=black, citecolor=black, urlcolor=black]{hyperref} %
% Appendix
\usepackage{appendix}
%% Some tweaking/features/styles
% Figure & Table
\usepackage{graphicx}
\usepackage{array,booktabs,multirow} % multirow, multicolumn and more professional format support
\usepackage{tabularx} % deal with text wrapping in tables
\usepackage{longtable}
\usepackage{caption}
\usepackage[position=t,singlelinecheck=off]{subfig}
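A hedged sketch of a three-line table built on the packages just loaded (booktabs rules plus a tabularx X column for wrapped text); the caption and cell contents are illustrative only:

```latex
% Sketch: a three-line (booktabs) table with an auto-wrapping X column.
\begin{table}[htbp]
  \centering
  \caption{示例表格}
  \begin{tabularx}{\textwidth}{lX}
    \toprule
    项目 & 说明 \\
    \midrule
    模板 & X 列按页宽自动换行,适合较长的说明文字 \\
    \bottomrule
  \end{tabularx}
\end{table}
```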


@ -24,15 +24,18 @@
\normalsize
这里撰写摘要,大概 500 字左右。
本科毕业设计(论文)是本科专业培养方案的重要组成部分,是人才培养的关键环节,是实现教学、科研与社会实践相结合的重要结合点。为全面贯彻党的教育方针,围绕立德树人根本任务,构建面向产出的本科毕业设计(论文)管理体系,提高人才培养质量,进一步加强我校毕业设计(论文)教学管理,依据《北京邮电大学本科毕业设计(论文)管理办法》(校发〔2023〕14 号)、《北京邮电大学本科毕业设计(论文)质量管理实施细则(试行)》(党政办发〔2023〕14 号)等相关文件要求,特制定 2025 届本科毕业设计(论文)指导手册。
现参照 2025 届本科毕业设计(论文)指导手册制作毕业设计(论文)模板。
\quad{}
\par\noindent\heiti\zihao{-4}\textbf{关键词}\quad{}
{
\songti\zihao{-4}
关键词一\quad
关键词二\quad
毕业设计\quad
论文模板\quad
LaTeX\quad
}
\end{titlepage}
@ -61,15 +64,18 @@
\normalsize
This is the English abstract, and its content should be the same as that of the Chinese abstract.
The final project (dissertation) is an important part of the undergraduate professional training program and a key link in talent cultivation. It is an important convergence point for integrating teaching, scientific research, and social practice. To comprehensively implement the Party's educational principles and fulfill the fundamental task of cultivating people with integrity, Beijing University of Posts and Telecommunications (BUPT) has established a graduation project (dissertation) management system oriented towards output to enhance the quality of talent cultivation. In order to further strengthen the management of graduation project (dissertation) teaching at BUPT, this manual is formulated based on the relevant documents, including the \textit{Regulations on the Management of Undergraduate Final Projects (Dissertations) at Beijing University of Posts and Telecommunications} (No. 14, School of BUPT, 2023) and the \textit{Implementation Rules for Quality Management of Undergraduate Final Projects (Dissertations) at Beijing University of Posts and Telecommunications (for Trial Implementation)} (No. 14, Office of the Party Committee and the Office of the President, BUPT, 2023), etc.
Now, with reference to the 2025 Graduation Thesis (Paper) Guidance Manual, we have created this graduation thesis template.
\quad{}
\par\noindent\zihao{-4}\textbf{KEY WORDS}\quad{}%
{
\zihao{-4}
Keyword1\quad
Keyword2\quad
Graduate Design\quad
Thesis Template\quad
LaTeX\quad
}
\end{titlepage}


@ -2,6 +2,77 @@
\begin{document}
\chapter{设计模板的背景和意义}
\normalsize
\chapter{引言}
引言部分是毕业设计论文的开篇,旨在为读者提供研究背景、研究的必要性和目的、研究问题以及论文的结构安排。以下是引言部分的详细撰写。
\section{研究背景}
\begin{itemize}
\item 阐述本科毕业设计论文在学术教育中的作用。
\item 分析当前本科毕业设计论文模板存在的问题和不足。
\end{itemize}
随着高等教育的普及和学术研究的深入,本科毕业设计论文已成为衡量学生综合运用所学知识解决实际问题能力的重要标准。一个结构合理、格式规范的论文模板对于指导学生如何撰写高质量的学术论文具有重要意义。然而,当前的毕业设计论文模板存在诸多不足,如格式不统一、指导性不强、缺乏灵活性等问题,这些问题影响了论文撰写的效率和质量。因此,设计一种高效、规范且易于操作的本科毕业设计论文模板显得尤为迫切。\cite{cai_coala_2024}
\section{研究意义}
本研究旨在设计并实现一种本科毕业设计论文模板,以提高学生的论文撰写能力,规范学术写作标准,并为教师提供有效的教学辅助工具。通过本研究,我们期望能够:
\begin{itemize}
\item 提升学生对学术规范的认识和遵循;
\item 增强论文的可读性和专业性;
\item 为教师提供便捷的论文评阅和指导途径;
\item 促进学术交流和知识传播。
\end{itemize}
\section{研究目标和问题}
本研究的主要目标是设计并实现一种本科毕业设计论文模板,该模板应满足以下要求:
\begin{itemize}
\item 规范性:模板应遵循学术写作的通用规范和标准。
\item 易用性:模板应易于学生理解和使用,减少学习成本。
\item 灵活性:模板应能够适应不同学科和研究类型的需要。
\item 指导性:模板应提供明确的写作指导和格式要求。
\end{itemize}
研究问题包括:
\begin{itemize}
\item 如何设计一个符合学术规范的毕业设计论文模板?
\item 如何确保模板的易用性和灵活性?
\item 如何在模板中嵌入有效的写作指导?
\end{itemize}
\section{论文结构}
本文将按照以下结构进行:第2章为文献综述,分析国内外毕业设计论文模板的研究现状和发展趋势。
第3章介绍毕业设计论文模板设计的理论基础和设计原则。
第4章详细描述模板的设计方案和实现过程。
第5章通过案例研究展示模板的应用效果。
第6章讨论研究的局限性、贡献和未来研究方向。
第7章总结研究的主要发现和实践意义。
\section{文献综述}
\subsection{国内外研究现状}
在国内外学术界,毕业设计论文模板的研究主要集中在以下几个方面:
模板设计原则:研究者们探讨了设计毕业设计论文模板时应遵循的原则,如一致性、可读性、可访问性和国际化等。\cite{dubach_compiling_nodate}
模板功能与要求:文献中提出了毕业设计论文模板应具备的基本功能,包括格式规范、结构指导、参考文献管理等。\cite{auerbach_lime_nodate}
模板的实现技术:随着信息技术的发展,研究者们开始探索使用LaTeX、Word等工具实现模板的自动化和智能化。\cite{besard_effective_2019}
模板的用户体验:用户体验在模板设计中的重要性日益凸显,研究者们分析了如何通过模板设计提升用户的写作体验。\cite{faingnaert_flexible_2022}
\subsection{毕业设计论文模板的功能与要求}
毕业设计论文模板应满足以下功能和要求:\cite{tiotto_experiences_2024}
格式规范:模板应符合学术出版的标准格式,包括页边距、字体大小、行间距等。\cite{perez_user-driven_2023}
结构指导:模板应提供清晰的论文结构指导,帮助学生理解论文的组织方式。\cite{hutchison_accull_2012}
参考文献管理:模板应支持主流的参考文献格式,如APA、MLA等,以便于学生管理和引用文献。\cite{malawski_sycl-bench_2020}
图表和附录:模板应提供图表、附录等附加内容的插入和管理指南。\cite{y_y_2014}
\end{document}

chapters/chapter2.tex Normal file

@ -0,0 +1,13 @@
\documentclass[../main.tex]{subfiles}
\begin{document}
\chapter{文献综述}
\section{国内外研究现状}
\section{毕业设计论文模板的功能与要求}
\section{研究差距与挑战}
\end{document}


@ -33,6 +33,8 @@
\subfile{chapters/chapter1}
\subfile{chapters/chapter2}
% 参考文献
\clearpage
\phantomsection\addcontentsline{toc}{chapter}{参考文献}
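The \clearpage + \phantomsection + \addcontentsline sequence used here is the standard way to give an unnumbered chapter a hyperref-safe TOC entry; the same pattern can be sketched for any other unnumbered back-matter chapter (the title 致谢 is illustrative, not from the commit):

```latex
% Sketch: the same TOC-entry pattern for another unnumbered chapter.
\clearpage
\phantomsection\addcontentsline{toc}{chapter}{致谢}
\chapter*{致谢}
```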
@ -44,7 +46,21 @@
\chapter*{\qquad{}}
\phantomsection\addcontentsline{toc}{chapter}{\qquad{}}
Thanks to everyone who helped me while I was finishing my bachelor thesis.
\normalsize
在本研究工作即将完成之际,我首先要表达我最深切的感激之情。感谢所有在这段学术旅程中给予我支持、指导和帮助的人。
首先,我要感谢我的导师,[导师姓名]教授。在论文的选题、研究设计、实验过程以及论文撰写的各个阶段,[导师姓名]教授都给予了我悉心的指导和无私的帮助。[导师姓名]教授严谨的学术态度、深邃的学术见解和对科研工作的热爱深深地影响了我,为我今后的学术道路奠定了坚实的基础。在此,我向[导师姓名]教授表示最诚挚的敬意和感谢。
我还要感谢[学院/系]的所有老师和同学。在学习和研究过程中,他们提供了宝贵的建议和帮助。特别是[同学/同事姓名],在实验设计和数据分析方面给予了我极大的帮助,使我的研究工作得以顺利进行。我们共同讨论问题、分享研究成果,这些经历将成为我宝贵的记忆。
感谢[实验室/研究组]的全体成员,他们在日常实验和研究中给予了我许多帮助和支持。我们共同度过了许多难忘的日夜,一起面对挑战,分享成功的喜悦。这段经历不仅让我在学术上有所收获,也让我学会了团队合作和相互支持的重要性。
我还要感谢我的家人,他们一直是我最坚强的后盾。在我遇到困难和挫折时,他们总是给予我鼓励和支持,让我有勇气继续前行。没有他们的理解和支持,我不可能完成这项研究工作。在此,我要向他们表达我最深切的爱意和感激。
最后,我要感谢所有参与和支持我研究工作的人员和机构。感谢[资助机构]为我的研究提供资金支持,感谢[合作单位]提供的实验资源和帮助。每一位给予我帮助和支持的人,我都铭记在心,感激不尽。
在未来的学术道路上,我将带着这份感激之情,继续努力,不断探索,以期取得更多的研究成果,回报所有关心和支持我的人。
% 附录
\setcounter{figure}{0}

ref.bib

@ -0,0 +1,238 @@
@inproceedings{auerbach_compiler_2012,
address = {San Francisco California},
title = {A compiler and runtime for heterogeneous computing},
isbn = {978-1-4503-1199-1},
url = {https://dl.acm.org/doi/10.1145/2228360.2228411},
doi = {10.1145/2228360.2228411},
abstract = {Heterogeneous systems show a lot of promise for extracting highperformance by combining the benefits of conventional architectures with specialized accelerators in the form of graphics processors (GPUs) and reconfigurable hardware (FPGAs). Extracting this performance often entails programming in disparate languages and models, making it hard for a programmer to work equally well on all aspects of an application. Further, relatively little attention is paid to co-execution—the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together.},
language = {en},
urldate = {2024-07-16},
booktitle = {Proceedings of the 49th {Annual} {Design} {Automation} {Conference}},
publisher = {ACM},
author = {Auerbach, Joshua and Bacon, David F. and Burcea, Ioana and Cheng, Perry and Fink, Stephen J. and Rabbah, Rodric and Shukla, Sunil},
month = jun,
year = {2012},
pages = {271--276},
}
@article{y_y_2014,
title = {异构并行编程模型研究与进展},
volume = {25},
issn = {1000-9825},
url = {https://kns.cnki.net/kcms2/article/abstract?v=Dm4VI7mKrXMfvAZUNMUgX8reCA9i2gYJadV_oeNwrIXov3W3N3cznGwXoHcCBEa4U5IUycTU9RRAyeLGki8bNkCldPuZc4yQ0E68KW7fvo9-mj97g39uJA==&uniplatform=NZKPT&language=gb},
doi = {10.13328/j.cnki.jos.004608},
abstract = {近年来,异构系统硬件飞速发展.为了解决相应的编程和执行效率问题,异构并行编程模型已被广泛使用和研究.从异构并行编程接口与编译/运行时支持系统两个角度总结了异构并行编程模型最新的研究成果,它们为异构架构和上层应用带来的技术挑战提供了相应的解决方案.最后,结合目前的研究现状以及异构系统的发展,提出了异构并行编程模型的未来方向.},
language = {中文;},
number = {7},
journal = {软件学报},
author = {刘, 颖 and 吕, 方 and 王, 蕾 and 陈, 莉 and 崔, 慧敏 and 冯, 晓兵},
year = {2014},
keywords = {GPU, 异构并行编程模型, 异构系统, 编程接口, 编译, 运行时系统},
pages = {1459--1475},
}
@article{cai_coala_2024,
title = {{COALA}: {A} {Compiler}-{Assisted} {Adaptive} {Library} {Routines} {Allocation} {Framework} for {Heterogeneous} {Systems}},
volume = {73},
copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
issn = {0018-9340, 1557-9956, 2326-3814},
shorttitle = {{COALA}},
url = {https://ieeexplore.ieee.org/document/10495065/},
doi = {10.1109/TC.2024.3385269},
language = {en},
number = {7},
urldate = {2024-10-14},
journal = {IEEE Transactions on Computers},
author = {Cai, Qinyun and Tan, Guanghua and Yang, Wangdong and He, Xianhao and Yan, Yuwei and Li, Keqin and Li, Kenli},
month = jul,
year = {2024},
pages = {1724--1737},
}
@article{dubach_compiling_nodate,
title = {Compiling a high-level language for {GPUs}: (via language support for architectures and compilers)},
abstract = {Languages such as OpenCL and CUDA offer a standard interface for general-purpose programming of GPUs. However, with these languages, programmers must explicitly manage numerous lowlevel details involving communication and synchronization. This burden makes programming GPUs difficult and error-prone, rendering these powerful devices inaccessible to most programmers.},
language = {en},
author = {Dubach, Christophe and Cheng, Perry and Rabbah, Rodric and Bacon, David F and Fink, Stephen J},
}
@article{auerbach_lime_nodate,
title = {Lime: a {Java}-compatible and synthesizable language for heterogeneous architectures},
abstract = {The halt in clock frequency scaling has forced architects and language designers to look elsewhere for continued improvements in performance. We believe that extracting maximum performance will require compilation to highly heterogeneous architectures that include reconfigurable hardware.},
language = {en},
author = {Auerbach, Joshua and Bacon, David F and Cheng, Perry and Rabbah, Rodric},
}
@article{besard_effective_2019,
title = {Effective {Extensible} {Programming}: {Unleashing} {Julia} on {GPUs}},
volume = {30},
copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
issn = {1045-9219, 1558-2183, 2161-9883},
shorttitle = {Effective {Extensible} {Programming}},
url = {https://ieeexplore.ieee.org/document/8471188/},
doi = {10.1109/TPDS.2018.2872064},
abstract = {GPUs and other accelerators are popular devices for accelerating compute-intensive, parallelizable applications. However, programming these devices is a difficult task. Writing efficient device code is challenging, and is typically done in a low-level programming language. High-level languages are rarely supported, or do not integrate with the rest of the high-level language ecosystem. To overcome this, we propose compiler infrastructure to efficiently add support for new hardware or environments to an existing programming language. We evaluate our approach by adding support for NVIDIA GPUs to the Julia programming language. By integrating with the existing compiler, we significantly lower the cost to implement and maintain the new compiler, and facilitate reuse of existing application code. Moreover, use of the high-level Julia programming language enables new and dynamic approaches for GPU programming. This greatly improves programmer productivity, while maintaining application performance similar to that of the official NVIDIA CUDA toolkit.},
language = {en},
number = {4},
urldate = {2024-10-20},
journal = {IEEE Transactions on Parallel and Distributed Systems},
author = {Besard, Tim and Foket, Christophe and De Sutter, Bjorn},
month = apr,
year = {2019},
pages = {827--841},
}
@article{faingnaert_flexible_2022,
title = {Flexible {Performant} {GEMM} {Kernels} on {GPUs}},
volume = {33},
copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
issn = {1045-9219, 1558-2183, 2161-9883},
url = {https://ieeexplore.ieee.org/document/9655458/},
doi = {10.1109/TPDS.2021.3136457},
abstract = {General Matrix Multiplication or GEMM kernels take centre place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIAs Tensor Cores. Their exploitation is hampered by the twolanguage problem: it requires either low-level programming which implies low programmer productivity or using libraries that only offer a limited set of components. Because rephrasing algorithms in terms of established components often introduces overhead, the libraries lack of flexibility limits the freedom to explore new algorithms. Researchers using GEMMs can hence not enjoy programming productivity, high performance, and research flexibility at once. In this paper we solve this problem. We present three sets of abstractions and interfaces to program GEMMs within the scientific Julia programming language. The interfaces and abstractions are co-designed for researchers needs and Julias features to achieve sufficient separation of concerns and flexibility to easily extend basic GEMMs in many different ways without paying a performance price. Comparing our GEMMs to state-of-the-art libraries cuBLAS and CUTLASS, we demonstrate that our performance is in the same ballpark of the libraries, and in some cases even exceeds it, without having to write a single line of code in CUDA C++ or assembly, and without facing flexibility limitations.},
language = {en},
number = {9},
urldate = {2024-10-20},
journal = {IEEE Transactions on Parallel and Distributed Systems},
author = {Faingnaert, Thomas and Besard, Tim and De Sutter, Bjorn},
month = sep,
year = {2022},
pages = {2230--2248},
}
@inproceedings{tiotto_experiences_2024,
address = {Edinburgh, United Kingdom},
title = {Experiences {Building} an {MLIR}-{Based} {SYCL} {Compiler}},
copyright = {https://doi.org/10.15223/policy-029},
isbn = {9798350395099},
url = {https://ieeexplore.ieee.org/document/10444866/},
doi = {10.1109/CGO57630.2024.10444866},
abstract = {Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges.},
language = {en},
urldate = {2024-10-29},
booktitle = {2024 {IEEE}/{ACM} {International} {Symposium} on {Code} {Generation} and {Optimization} ({CGO})},
publisher = {IEEE},
author = {Tiotto, Ettore and Pérez, Víctor and Tsang, Whitney and Sommer, Lukas and Oppermann, Julian and Lomüller, Victor and Goli, Mehdi and Brodman, James},
month = mar,
year = {2024},
pages = {399--410},
}
@article{perez_user-driven_2023,
title = {User-driven {Online} {Kernel} {Fusion} for {SYCL}},
volume = {20},
issn = {1544-3566, 1544-3973},
url = {https://dl.acm.org/doi/10.1145/3571284},
doi = {10.1145/3571284},
abstract = {Heterogeneous programming models are becoming increasingly popular to support the ever-evolving hardware architectures, especially for new and emerging specialized accelerators optimizing specific tasks. While such programs provide performance portability of the existing applications across various heterogeneous architectures to some extent, short-running device kernels can affect an application performance due to overheads of data transfer, synchronization, and kernel launch. While in applications with one or two short-running kernels the overhead can be negligible, it can be noticeable when these short-running kernels dominate the overall number of kernels in an application, as it is the case in graph-based neural network models, where there are several small memory-bound nodes alongside few large compute-bound nodes.
To reduce the overhead, combining several kernels into a single, more optimized kernel is an active area of research. However, this task can be time-consuming and error-prone given the huge set of potential combinations. This can push programmers to seek a tradeoff between (a) task-specific kernels with low overhead but hard to maintain and (b) smaller modular kernels with higher overhead but easier to maintain. While there are DSL-based approaches, such as those provided for machine learning frameworks, which offer the possibility of such a fusion, they are limited to a particular domain and exploit specific knowledge of that domain and, as a consequence, are hard to port elsewhere. This study explores the feasibility of a user-driven
kernel fusion
through an extension to the SYCL API to address the automation of kernel fusion. The proposed solution requires programmers to define the subgraph regions that are potentially suitable for fusion without any modification to the kernel code or the function signature. We evaluate the performance benefit of our approach on common neural networks and study the performance improvement in detail.},
language = {en},
number = {2},
urldate = {2024-10-29},
journal = {ACM Transactions on Architecture and Code Optimization},
author = {Pérez, Víctor and Sommer, Lukas and Lomüller, Victor and Narasimhan, Kumudha and Goli, Mehdi},
month = jun,
year = {2023},
pages = {1--25},
}
@incollection{hutchison_accull_2012,
address = {Berlin, Heidelberg},
title = {{accULL}: {An} {OpenACC} {Implementation} with {CUDA} and {OpenCL} {Support}},
volume = {7484},
isbn = {978-3-642-32819-0 978-3-642-32820-6},
shorttitle = {{accULL}},
url = {http://link.springer.com/10.1007/978-3-642-32820-6_86},
abstract = {The irruption in the HPC scene of hardware accelerators, like GPUs, has made available unprecedented performance to developers. However, even expert developers may not be ready to exploit the new complex processor hierarchies. We need to find a way to leverage the programming effort in these devices at programming language level, otherwise, developers will spend most of their time focusing on device-specific code instead of implementing algorithmic enhancements. The recent advent of the OpenACC standard for heterogeneous computing represents an effort in this direction. This initiative, combined with future releases of the OpenMP standard, will converge into a fully heterogeneous framework that will cope the programming requirements of future computer architectures. In this work we present accULL, a novel implementation of the OpenACC standard, based on the combination of a source to source compiler and a runtime library. To our knowledge, our approach is the first providing support for both OpenCL and CUDA platforms under this new standard.},
language = {en},
urldate = {2024-11-06},
booktitle = {Euro-{Par} 2012 {Parallel} {Processing}},
publisher = {Springer Berlin Heidelberg},
author = {Reyes, Ruymán and López-Rodríguez, Iván and Fumero, Juan J. and De Sande, Francisco},
editor = {Hutchison, David and Kanade, Takeo and Kittler, Josef and Kleinberg, Jon M. and Mattern, Friedemann and Mitchell, John C. and Naor, Moni and Nierstrasz, Oscar and Pandu Rangan, C. and Steffen, Bernhard and Sudan, Madhu and Terzopoulos, Demetri and Tygar, Doug and Vardi, Moshe Y. and Weikum, Gerhard and Kaklamanis, Christos and Papatheodorou, Theodore and Spirakis, Paul G.},
year = {2012},
doi = {10.1007/978-3-642-32820-6_86},
note = {Series Title: Lecture Notes in Computer Science},
pages = {871--882},
}
@incollection{malawski_sycl-bench_2020,
address = {Cham},
title = {{SYCL}-{Bench}: {A} {Versatile} {Cross}-{Platform} {Benchmark} {Suite} for {Heterogeneous} {Computing}},
volume = {12247},
isbn = {978-3-030-57674-5 978-3-030-57675-2},
shorttitle = {{SYCL}-{Bench}},
url = {https://link.springer.com/10.1007/978-3-030-57675-2_39},
abstract = {The SYCL standard promises to enable high productivity in heterogeneous programming of a broad range of parallel devices, including multicore CPUs, GPUs, and FPGAs. Its modern and expressive C++ API design, as well as flexible task graph execution model give rise to ample optimization opportunities at run-time, such as the overlapping of data transfers and kernel execution. However, it is not clear which of the existing SYCL implementations perform such scheduling optimizations, and to what extent. Furthermore, SYCLs high level of abstraction may raise concerns about sacrificing performance for ease of use. Benchmarks are required to accurately assess the performance behavior of high-level programming models such as SYCL. To this end, we present SYCLBench, a versatile benchmark suite for device characterization and runtime benchmarking, written in SYCL. We experimentally demonstrate the effectiveness of SYCL-Bench by performing device characterization of the NVIDIA TITAN X GPU, and by evaluating the efficiency of the hipSYCL and ComputeCpp SYCL implementations.},
language = {en},
urldate = {2024-11-11},
booktitle = {Euro-{Par} 2020: {Parallel} {Processing}},
publisher = {Springer International Publishing},
author = {Lal, Sohan and Alpay, Aksel and Salzmann, Philip and Cosenza, Biagio and Hirsch, Alexander and Stawinoga, Nicolai and Thoman, Peter and Fahringer, Thomas and Heuveline, Vincent},
editor = {Malawski, Maciej and Rzadca, Krzysztof},
year = {2020},
doi = {10.1007/978-3-030-57675-2_39},
note = {Series Title: Lecture Notes in Computer Science},
pages = {629--644},
}
@inproceedings{dagli_shared_2024,
address = {Edinburgh United Kingdom},
title = {Shared {Memory}-contention-aware {Concurrent} {DNN} {Execution} for {Diversely} {Heterogeneous} {System}-on-{Chips}},
isbn = {9798400704352},
url = {https://dl.acm.org/doi/10.1145/3627535.3638502},
doi = {10.1145/3627535.3638502},
abstract = {Two distinguishing features of state-of-the-art mobile and autonomous systems are: 1) There are often multiple workloads, mainly deep neural network (DNN) inference, running concurrently and continuously. 2) They operate on shared memory System-on-Chips (SoC) that embed heterogeneous accelerators tailored for specific operations. State-of-the-art systems lack efficient performance and resource management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers in concurrently executing DNN inference workloads to a diverse set of accelerators within an SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN can minimize memory contention by up to 45\% and improve total latency and throughput by up to 32\% and 29\%, respectively, compared to the state-of-the-art.},
language = {en},
urldate = {2024-11-18},
booktitle = {Proceedings of the 29th {ACM} {SIGPLAN} {Annual} {Symposium} on {Principles} and {Practice} of {Parallel} {Programming}},
publisher = {ACM},
author = {Dagli, Ismet and Belviranli, Mehmet E.},
month = mar,
year = {2024},
pages = {243--256},
}
@article{zhou_deeptm_2024,
title = {{DeepTM}: {Efficient} {Tensor} {Management} in {Heterogeneous} {Memory} for {DNN} {Training}},
volume = {35},
copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
issn = {1045-9219, 1558-2183, 2161-9883},
shorttitle = {{DeepTM}},
url = {https://ieeexplore.ieee.org/document/10606082/},
doi = {10.1109/TPDS.2024.3431910},
abstract = {Deep Neural Networks (DNNs) have gained widespread adoption in diverse fields, including image classification, object detection, and natural language processing. However, training large-scale DNN models often encounters significant memory bottlenecks, which ask for efficient management of extensive tensors. Heterogeneous memory system, which combines persistent memory (PM) modules with traditional DRAM, offers an economically viable solution to address tensor management challenges during DNN training. However, existing memory management methods on heterogeneous memory systems often lead to low PM access efficiency, low bandwidth utilization, and incomplete analysis of model characteristics. To overcome these hurdles, we introduce an efficient tensor management approach, DeepTM, tailored for heterogeneous memory to alleviate memory bottlenecks during DNN training. DeepTM employs page-level tensor aggregation to enhance PM read and write performance and executes contiguous page migration to increase memory bandwidth. Through an analysis of tensor access patterns and model characteristics, we quantify the overall performance and transform the performance optimization problem into the framework of Integer Linear Programming. Additionally, we achieve tensor heat recognition by dynamically adjusting the weights of four key tensor characteristics and develop a global optimization strategy using Deep Reinforcement Learning. To validate the efficacy of our approach, we implement and evaluate DeepTM, utilizing the TensorFlow framework running on a PMbased heterogeneous memory system. The experimental results demonstrate that DeepTM achieves performance improvements of up to 36\% and 49\% compared to the current state-of-the-art memory management strategies AutoTM and Sentinel, respectively.},
language = {en},
number = {11},
urldate = {2024-11-18},
journal = {IEEE Transactions on Parallel and Distributed Systems},
author = {Zhou, Haoran and Rang, Wei and Chen, Hongyang and Zhou, Xiaobo and Cheng, Dazhao},
month = nov,
year = {2024},
pages = {1920--1935},
}
@article{yao_memory-constraint-aware_nodate,
title = {A {Memory}-{Constraint}-{Aware} {List} {Scheduling} {Algorithm} for {Memory}-{Constraint} {Heterogeneous} {Muti}-{Processor} {System}},
abstract = {An effective scheduling algorithm is vital for the execution efficiency of applications on Heterogeneous Muti-Processor System (HMPS), especially Memory-Constraint Heterogeneous Muti-Processor System (MCHMPS). Stringent local and external memory constraints have significant impact on the execution performance of applications executed on MCHMPS, predictability is also a critical factor for task scheduling on MCHMPS. Therefore, a novel list scheduling algorithm termed Memory-constraint-aware Improved Predict Priority and Optimistic Processor Selection Scheduling (MIPPOSS), essentially a heuristic search optimization algorithm, is proposed in this paper. In MIPPOSS, a predictive approach is applied for task prioritization and processor selection, and a novel memory-constraint-aware approach is employed in the processor selection phase. MIPPOSS has polynomial complexity and produces better results for application scheduling on target architecture. Randomly generated DAGs and 3 real-world applications experiments, including Cybershake, LIGO, and Montage, show that MIPPOSS outperforms the other five competing algorithms by a large margin.},
language = {en},
author = {Yao, Yu and Song, Yukun and Huang, Ying and Ni, Wei and Zhang, Duoli},
}