My proposal for TableGen Formatter got selected for the 2023 LLVM Developers’ Meeting! I will be giving a short technical talk on this project alongside my teammate, Nikhil. Shout-out to Min-Yih Hsu (now Dr. Min-Yih Hsu) for his guidance!
About
TableGen is a domain-specific language and code generator used in LLVM compiler infrastructure. It allows developers to define target-specific instruction sets, register classes, and machine descriptions in a concise and declarative manner.
TableGen’s main purpose is to simplify the process of generating code for instruction selection, register allocation, and other compiler optimizations. By utilizing TableGen, LLVM can efficiently target a wide range of hardware architectures and provide a flexible and modular framework for compiler development.
Here is an example of TableGen code:
// MyTargetInstrInfo.td
let InstructionSet = "MyTarget";
def ADD : MyInstruction<"add", OpRegClass, (outs RRegs:$dst), (ins RRegs:$src1, RRegs:$src2),
"add $dst, $src1, $src2">;
def SUB : MyInstruction<"sub", OpRegClass, (outs RRegs:$dst), (ins RRegs:$src1, RRegs:$src2),
"sub $dst, $src1, $src2">;
def MUL : MyInstruction<"mul", OpRegClass, (outs RRegs:$dst), (ins RRegs:$src1, RRegs:$src2),
"mul $dst, $src1, $src2">;
Problem Statement
TableGen is used widely across LLVM subprojects to specify target-specific information, and it is also used in MLIR. Overall, LLVM contains a substantial amount of TableGen code (340+ KLOC), yet the language has no code formatter of its own.
Manually formatting TableGen code poses significant challenges for developers: inconsistent code styles, time-consuming formatting work, and more errors, all of which hinder code readability and maintenance.
These difficulties necessitate a solution that automates the formatting process, enabling developers to focus on the core aspects of TableGen development while ensuring clean and professional-looking code.
Solution Overview
The aim of this project was to create a code formatter for the TableGen language that improves the readability, maintainability, and consistency of TableGen code.
The TableGen Formatter is a dedicated tool designed specifically to address the complexities of formatting TableGen code, enabling developers to produce consistent and readable code.
A further aim is to upstream these changes to the LLVM repo so that the community can benefit from the formatter.
Implementation
- Getting familiar with LLVM
The first step was to get familiar with LLVM and its project structure. I learned not just how to build LLVM but how to build it efficiently, as I knew we would have to build it hundreds, if not thousands, of times.
- Understanding TableGen - its syntax, lexer and parser
In this step, I learned the basics of the TableGen language. I went through TableGen's grammar to understand its syntax, and then studied its lexer and parser. This was needed because I had to decide whether I could reuse the lexer to develop the formatting tool.
- Understanding how Clang-Format works under the hood
Since the aim was to build a formatting tool, I started looking into how Clang-Format works. I picked one of the Clang-Format rules, RemoveSemicolon, and started tracing its flow through the Clang-Format libraries. Doing this was essential because it helped me understand the different parts of a code formatter. Some of the most important ones are:
  - TokenAnnotator - the component that keeps track of the different *flags* for a token. For example, an *Optional* flag set on a token would indicate that the token is not necessary in the final output.
  - UnwrappedLineParser - the logic that takes care of unwrapping a line of code that can possibly span multiple lines (since the language is whitespace insensitive).
  - WhitespaceManager - the component that takes care of tracking the whitespace required in the formatted code.
  - TokenAnalyzer - the logic that does the actual formatting after analyzing the different flags set for a token.
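To make the "unwrapping" idea concrete, here is a minimal sketch in C++. It is not Clang-Format's UnwrappedLineParser: `unwrapLines` is a hypothetical helper that assumes a logical line simply ends at a token finishing in `;`, and it ignores strings, comments, and brace-delimited bodies.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, NOT part of Clang-Format.
// Splits the input on whitespace and groups the tokens into logical lines,
// closing a line at every token that ends in ';'. Tokens within a logical
// line are joined by a single space, so the original line breaks and
// indentation are discarded.
std::vector<std::string> unwrapLines(const std::string &Source) {
  std::vector<std::string> Lines;
  std::string Current;
  std::istringstream Stream(Source);
  std::string Token;
  while (Stream >> Token) {
    if (!Current.empty())
      Current += ' ';
    Current += Token;
    // Simplifying assumption: ';' always terminates a logical line.
    if (Token.back() == ';') {
      Lines.push_back(Current);
      Current.clear();
    }
  }
  // Keep any trailing tokens that were not terminated by ';'.
  if (!Current.empty())
    Lines.push_back(Current);
  return Lines;
}
```

Calling `unwrapLines` on a `def` that is split across two physical lines yields a single logical line, which is the unit a real formatter then decides how to re-wrap.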
- Brainstorming the formatter design - should we reinvent the wheel?
This was probably the most crucial step of this project. Three options were explored:
  - Building everything from scratch - The idea here was to take a slightly radical approach and build our own parse tree. This would free the TableGen formatter of any dependency on Clang-Format while giving more control over the implementation details.
    *Pros - no dependency on Clang-Format!*
    *Cons - this would require a lot (read: a LOT LOT LOT LOT) more work*
  - TableGen Formatter as a wrapper over Clang-Format - The idea here was to pull the necessary Clang-Format libraries out of Clang and make them available LLVM-wide. Then, write a wrapper over these libraries to format TableGen.
    *Pros - there's little dependency on Clang-Format while allowing control over implementation by writing a wrapper*
    *Cons - this was too big a change, and it was very difficult to pull any Clang-Format libraries out of Clang.*
  - Add Support for TableGen in Clang-Format - Clang-Format supports formatting for several languages (C/C++/Java/JavaScript/JSON/Objective-C/Protobuf/C#). Adding TableGen support to it would require the fewest changes of the three options.
    *Pros - more realistic approach to achieve required results + reusing the Clang-Format Style options, making it easier to use*
    *Cons - there's a lot of dependency on Clang-Format.*
- Adding Support for TableGen in Clang-Format
I started out by picking the most commonly used Clang-Format Style options and adding TableGen support for each of them. Soon enough, I realized this was not the best approach. I then shifted to focusing on specific TableGen constructs first, such as loops and conditional statements, and ensured that Clang-Format was able to identify them. Once Clang-Format could recognize these constructs, I found that it was already able to handle a lot of basic formatting.
- Current and Future Scope
We have covered the following in the current scope:
  - Recognizing keywords such as *def*, *defvar*, *multiclass*, *let*
  - Merging/Treating TableGen DAG operands as single tokens
  - Recognizing and treating TableGen bang operators such as *!and*, *!if*, etc. as single tokens
  - Support for macros
  - Support for *#* as a concatenation operator
  - Support for specific Clang-Format options such as *SeparateDefinitionBlocks*, *InsertBraces*, *RemoveBracesLLVM*
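As an illustration of what "treating bang operators as single tokens" means, here is a small sketch of a toy lexer in C++. It is not the code we added to Clang-Format: `lexTableGen` is a hypothetical function that handles only identifiers, bang operators, and single-character punctuation (which is how `#` ends up as its own token). It does not cover DAG operands, strings, or comments.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy lexer, NOT Clang-Format's implementation.
// Emits identifiers as one token, '!' followed by a name (e.g. "!if") as
// one token, and every other non-space character as its own token.
std::vector<std::string> lexTableGen(const std::string &Text) {
  auto IsIdentChar = [](char C) {
    return std::isalnum(static_cast<unsigned char>(C)) || C == '_';
  };
  std::vector<std::string> Tokens;
  std::size_t I = 0;
  while (I < Text.size()) {
    char C = Text[I];
    if (std::isspace(static_cast<unsigned char>(C))) {
      ++I; // Whitespace only separates tokens.
      continue;
    }
    std::size_t Start = I;
    if (C == '!' && I + 1 < Text.size() &&
        std::isalpha(static_cast<unsigned char>(Text[I + 1]))) {
      // Bang operator: consume '!' and the name that follows it together.
      ++I;
      while (I < Text.size() && IsIdentChar(Text[I]))
        ++I;
    } else if (IsIdentChar(C)) {
      // Plain identifier or number.
      while (I < Text.size() && IsIdentChar(Text[I]))
        ++I;
    } else {
      // Single-character punctuation, including the '#' paste operator.
      ++I;
    }
    Tokens.push_back(Text.substr(Start, I - Start));
  }
  return Tokens;
}
```

Without the bang-operator branch, `!if` would be split into `!` and `if`, and a formatter tuned for C-like languages could then treat it as a negation followed by a keyword.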
In the future, the current strategy can be used to add support for the remaining relevant Clang-Format Style options for TableGen.
TableGen Formatter in Action
What I learned
The key takeaways for me from this project are:
- Working with a large codebase - LLVM is one of the largest codebases that is being maintained by the open source community. Working on such a large code repository has its own challenges.
- Importance of testing - It quickly became clear to me how important tests are when working with such a large codebase. Every single change needs to be covered by tests because you never know which change might break existing functionality.
- (Potentially) Contributing to the open source communities - Working on this project made me appreciate the open source communities even more and I hope to contribute to other open source communities too!
- Perseverance is the key to success - When trying to modify existing code, it is very important to have the patience to understand what the code does and what the repercussions of adding or removing lines of code are. Having chosen to add TableGen support within Clang-Format, it took a long time before we saw results, and the only way to achieve what we achieved was to persevere through it.