2024-03-19 04:20:07

As per title, I would like to modify LLVM such that, when compiling a C program, it can only select from the ARM instructions that I allow for. Other instructions, if possible, should be replaced by those in my set. If this is not possible, an error should be returned.


I have tried looking this up and have found virtually nothing. One problem I have is that when looking at stuff like the TableGen files in llvm/lib/Target/ARM, it's not exactly clear what, if anything, to modify. llvm/lib/IR contains some more stuff, but again, I'm not sure what to touch there.

I found a SO answer that (I'm pretty sure) modified the IR and inserted a call to expand a given instruction if possible. This is unreasonable, as my subset is relatively small. I'd have to include a *lot* of exclusion/expansion calls.

I also found some notes that suggest writing a pass to go and call erase from parent or something along those lines, but again, I'm not entirely sure what this means.


I'm familiar with the general architecture of how LLVM works, that is, frontend, transformation into IR, optimization passes, codegen, but that's about it. I'm not afraid of reading, either, so feel free to throw terms or links at me. I've briefly skimmed the manual and found the section on adding a new instruction, but I'm still somewhat confused as to how to lock out instructions like, say, fadd (floating-point add) so that using it in C is an error in this modified compiler.


Thank you for the help.

2024-03-19 04:53:02

You can't necessarily make the use of an operation an error in the compiler itself, at least not unless you're in a language like Ada (where you can disable features of the language with configuration pragmas) or you're willing to go modifying the compiler's parser/code generator. What you can do is tell LLVM to use, or not use, certain instruction set extensions, which is probably what you're looking for. For instance, if you turn off software floating point and disable all floating-point instruction sets on x86, I'm pretty sure your code will still compile up to the point where LLVM needs to emit code, whereupon it'll complain (though probably not in the most helpful of ways). However, the best solution is to avoid generating said instructions in the first place. If you don't want floating-point instructions, don't use any floating-point code/types. If you don't want SSE/AVX, don't use vector types and, for GCC, pass -mgeneral-regs-only, which forces GCC to use only general-purpose registers (I don't think this stops you from writing floating-point code, but I believe it does disable the vector/mask/floating-point builtins). In general, GCC/Clang won't use AVX/AVX-512 unless you explicitly enable them via options like -mavx or an appropriate -march=. Note that on x86-64, SSE/SSE2 are part of the baseline and are used by default, and once an extension is enabled the optimizer may choose to use it on its own.

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2024-03-19 05:28:48

I mean sure, the obvious answer is to just not use the instructions within my program, but I have no way of telling people this. The goal is to toss in a C program I know nothing about and have the compiler spit out either a fixed version or an error. I was aware of disabling certain subsets of instructions, but that's, unfortunately, not granular enough. My subset consists of roughly 40 instructions like STP, STUR, LDP, LDUR, ..., all of them seeming to fall within a general category rather than some specific domain.

I am willing to modify codegen/parser if that's what this comes down to, I was just hoping that folks would have more direct, "Here's what you can look up," type of pointers. So far it has been kinda stumbling around with little idea what I'm exactly searching for (nothing I googled gave terribly instructive responses).

2024-03-19 07:04:29

@3, which CPU architecture are you targeting? If you're targeting something like RISC-V, where the architecture is very modular, specifying architecture extensions (or just explicitly specifying "use only the baseline instruction set") via command-line options is good enough. Beyond that I'm honestly unsure what to tell you. I have very little knowledge of how LLVM works myself, so the only thing I could suggest is writing a custom pass that tries to filter out instructions or something.

2024-03-19 15:39:18

I am unfortunately targeting ARMv8.

2024-03-19 20:30:58

Yeah ok so.  You don't need LLVM for this exactly.  If your instructions are literally a random set of instructions all over the place nothing and I mean nothing is going to save you.  But most of the time it's either a set of instructions from an extension or a set of instructions from an architecture version and you can disable it by telling the compiler to disable said extensions or use an older version of the architecture.

In GCC, and probably also Clang, this is controlled through the various -march and -mtune options.

If you need to just flat out error at some stage in the process if these instructions get used rather than convincing the compiler to emit something else, run a disassembler on the command line and then grep for the forbidden instructions with a regex.
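A minimal sketch of that disassemble-and-check idea, in Python rather than grep. It assumes `objdump -d`-style text for AArch64, where each instruction line is an address, a colon, one 8-hex-digit encoding, and then the mnemonic. The function name, the sample listing, and the allowed set are made up for illustration. Since the requirement in this thread is an allowlist, it reports everything outside the allowed set instead of matching a list of forbidden names.

```python
import re

# One disassembly line: "<hex address>: <8 hex digit encoding> <mnemonic> ..."
# Assumption: fixed-width 32-bit encodings, as on AArch64.
_INSN = re.compile(r"^\s*([0-9a-f]+):\s+[0-9a-f]{8}\s+(\S+)")


def find_disallowed(disassembly, allowed):
    """Return (address, mnemonic) pairs whose mnemonic is not in `allowed`."""
    offenders = []
    for line in disassembly.splitlines():
        match = _INSN.match(line)
        if match is None:
            continue  # symbol headers, blank lines, anything that is not an instruction
        address = int(match.group(1), 16)
        mnemonic = match.group(2).lower()
        if mnemonic not in allowed:
            offenders.append((address, mnemonic))
    return offenders


# Hypothetical listing, written by hand to look like disassembler output.
SAMPLE = """\
0000000000000000 <f>:
   0:   a9bf7bfd    stp   x29, x30, [sp, #-16]!
   4:   1e202820    fadd  s0, s1, s0
   8:   a8c17bfd    ldp   x29, x30, [sp], #16
   c:   d65f03c0    ret
"""

for address, mnemonic in find_disallowed(SAMPLE, {"stp", "ldp", "ret"}):
    print(f"{address:#x}: {mnemonic} is not in the allowed set")
```

In practice you would feed it the text produced by something like `objdump -d prog.o`. The exact column layout differs between tools and versions, so the regex may need adjusting. This only detects violations after the fact; it does not make the compiler pick different instructions.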

I don't know the llvm codebase but if none of this is good enough and you do really need to ban a specific random set of instructions then...rethink your project because it basically ain't happening.  You're not finding an instruction listing you can turn on and off because there isn't one.  Compilers use rather complex models of the processors, it's not a table lookup, if you need to flat out remove support for a single instruction...good luck, you'll really really need it.

If none of that is good enough and you are still dead set on doing this, and if you don't need modern optimizations, then maybe look into lcc or tcc or any of the other various tiny educational C compilers and modify one of those.  Unfortunately I believe most of them are C89.

My Blog
Twitter: @ajhicks1992

2024-03-19 22:10:03

You misunderstand: it's not that I need to exclude particular instructions. It's that I have a particular subset of instructions that my program *must* use. In that sense, I would need to exclude *everything else*, which I guess answers the next question, but I'm still going to ask: is that still equally difficult? Can you explain why, exactly, an instruction listing that you can turn on and off would not work for a compiler? After all, we have the lexer, which doesn't care; it can still produce tokens to its heart's content. We have the parser, which turns the tokens into IR (or rather, into some sort of internal representation). Modifying the codegen phase sounds simple, then. Why, exactly, is it not?


Optimizations are not an issue, either.
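To make the table idea above concrete, here is a toy sketch in Python that works on textual assembly lines, not on anything inside LLVM. Allowed instructions pass through, a few disallowed ones are rewritten into allowed ones, and everything else is an error. The allowed set, the rewrite table, and all names are hypothetical. The two rewrites rely on `neg` and `cmp` being aliases of `sub`/`subs` with the zero register in A64, and they assume plain two-operand forms.

```python
class UnsupportedInstruction(Exception):
    """Raised when an instruction is neither allowed nor rewritable."""


def _zero_reg(reg):
    # 32-bit registers (w0..w30) pair with wzr, 64-bit ones with xzr.
    return "wzr" if reg.startswith("w") else "xzr"


def _neg(ops):
    rd, rm = ops
    return [f"sub {rd}, {_zero_reg(rd)}, {rm}"]


def _cmp(ops):
    rn, op2 = ops
    return [f"subs {_zero_reg(rn)}, {rn}, {op2}"]


# Hypothetical rewrite table: disallowed mnemonic -> function producing replacements.
REWRITES = {"neg": _neg, "cmp": _cmp}


def legalize(lines, allowed):
    """Return assembly lines that use only mnemonics from `allowed`, or raise."""
    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        mnemonic = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if mnemonic in allowed:
            out.append(line)
            continue
        ops = [op.strip() for op in rest.split(",")]
        rewrite = REWRITES.get(mnemonic)
        if rewrite is None or len(ops) != 2:
            raise UnsupportedInstruction(line)
        replacement = rewrite(ops)
        # A rewrite is only useful if everything it produces is itself allowed.
        if any(r.split()[0] not in allowed for r in replacement):
            raise UnsupportedInstruction(line)
        out.extend(replacement)
    return out


print(legalize(["neg x0, x1", "cmp w2, w3", "ret"], {"sub", "subs", "ret"}))
```

Even this toy has to know about operand forms and register widths for each rewrite, and it ignores flags, register pressure, and addressing modes entirely.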

2024-03-19 23:05:44

Well, for x86 for example, the time an instruction takes depends on what it is, whether or not there are dependencies on previous instructions, which execution ports it uses, and what the last instruction in source order is. The last is a separate factor from the dependency one because it applies whether or not there are dependencies. All of this is exposed, if you're wondering, via something like 100 performance counters if you want to profile off it. On top of that, not all instructions can use all registers, some instructions implicitly modify state (for example comparisons, or detecting overflows), and there are special cases documented in the Intel manuals, such as xor register, register being faster for zeroing a register than moving zero into it (to pick the most famous example). Those manuals, in the case of x86, are the size of an average college textbook.

On top of all of this, x86 has registers that are actually views of parts of other registers (for example, al is the low byte of ax, which is itself the low half of eax), and the set of available registers (not just the instructions) depends on CPU ISA extensions such as AVX. There are also sometimes subtle differences between AMD and Intel, most notably around cpuid, and in some cases multiple valid encodings for instructions. Though it is not so true nowadays, code size also matters because of caches, and it used to be the case that alignment of loops (as in the memory address of the code) mattered for performance. Compilers factor all of this in; you can't easily just go smash an instruction out if it's one that the compiler thinks should be present based on your specified architecture and ISAs. This isn't even the full list of considerations either. I'm not an assembly expert; these are just the ones I know because I've done SIMD for my DSP projects.
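As a small illustration of the "registers that are views of other registers" point, here is a toy Python model of one x86-64 register and its narrower names. The class and attribute layout are made up for this sketch. It only encodes two well-known rules: 8-bit and 16-bit writes leave the remaining bits alone, while 32-bit writes zero the upper half of the 64-bit register.

```python
class Rax:
    """Toy model of rax and its narrower views eax, ax, and al."""

    def __init__(self):
        self.rax = 0

    @property
    def eax(self):
        return self.rax & 0xFFFF_FFFF

    @eax.setter
    def eax(self, value):
        # Writing the 32-bit view clears bits 32..63 of rax.
        self.rax = value & 0xFFFF_FFFF

    @property
    def ax(self):
        return self.rax & 0xFFFF

    @ax.setter
    def ax(self, value):
        # Writing the 16-bit view keeps bits 16..63 unchanged.
        self.rax = (self.rax & ~0xFFFF) | (value & 0xFFFF)

    @property
    def al(self):
        return self.rax & 0xFF

    @al.setter
    def al(self, value):
        # Writing the 8-bit view keeps bits 8..63 unchanged.
        self.rax = (self.rax & ~0xFF) | (value & 0xFF)


r = Rax()
r.rax = 0x1122334455667788
r.al = 0xFF
print(hex(r.rax))
r.eax = 1
print(hex(r.rax))
```

A compiler's register allocator has to track this kind of overlap for every register, which is one reason the target description is more than a flat list of instructions.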

Arm is less bad in terms of special cases but worse in terms of variable performance because Arm chips may or may not be using the official circuitry from--god, I forget what company, they were softbank for a while but I think they're actually just arm now.  One good case of that is Apple, who implements the architecture but does so with their own from-scratch chips.

So...I mean yeah, good luck with your simple tables.  I'll wait.  It did use to be like that but...

Maybe tell us what you're doing?  The standard answer to this is that if you are limited to a subset of instructions then either that means you're limited to a specific version of the architecture, in which case you just have to figure out what that is, or you're using something that probably has a custom C compiler from the manufacturer.  If this is college or something, I would ask the professor what the standard solution you're supposed to use is, because I really doubt it's "hack Clang".


2024-03-19 23:28:45 (edited by amerikranian 2024-03-19 23:31:10)

We're writing an out-of-order (OOO) processor that only supports a subset of ARM. Unfortunately, the base implementation of the listed ISA is, as my professor put it, "only worth 90 points." One of the extensions listed on the assignment page was, "Modify gcc to compile C programs into exclusively listed subset assembly code." Now we're here. I suppose I can just abandon the idea, but other extensions like implementing a "modern" branch predictor do not look... fun. This is all in Verilog, by the way.

So, I looked into GCC. Found next to nothing. Looked at LLVM. Found a lot more documentation about it. Now we're here.

2024-03-20 00:54:12

If the professor seriously expects you to modify gcc and is not willing to provide guidance, then I don't know what the fuck they're thinking.  All compilers on the market basically rewrite their internals every 5 years or so tops, whenever the state of the art of enough things advances that the old way starts getting shaky.  If you're going to pass the class otherwise, I'd just take the hit.  If you have the option of modifying any C compiler and it does not have to be gcc, then tcc is probably a better bet.

Now that said, what you might be able to do is take one of these, gcc, tcc, clang, whatever, and add a new architecture to it.  I don't know how and I wouldn't want to but people do it every time they invent a new chip so...

You're making a mistake about LLVM. It isn't documented.  The external API surface area is documented because it is a "compiler in a library" and what it is, sort of, exposing is C.  Kinda.  It's not C but it's not assembly, it's "like C until it's not but it's usually not" let's say.  Think of it more like a programming language and the docs are documenting the programming language because that's exactly what it is.  For most users of it, the internals are, and are supposed to be, mostly opaque.  This said, by all accounts it is an easier codebase to modify because GCC drags a whole lot of serious tech debt along for the ride, but I haven't tried to modify either and that may or may not be true.

From time to time I have seen machine-parsable databases of instructions. If it is the case that you can get LLVM to do what you want by listing every possible instruction except yours, then the solution is to find one of those and write a small Python script to spit out the big block of boring code for you.
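A rough sketch of what such a generator script could look like. Everything here is hypothetical: the input is just a list of mnemonics (a real instruction database would need its own parsing), and the emitted C++ array with the made-up name `kForbiddenMnemonics` is only a stand-in for whatever "boring code" a real compiler change would actually need.

```python
def forbidden_mnemonics(all_mnemonics, allowed):
    """Return every known mnemonic that is not in the allowed subset, sorted."""
    known = set(all_mnemonics)
    unknown = set(allowed) - known
    if unknown:
        # Catch typos in the allowlist early instead of silently ignoring them.
        raise ValueError(f"allowed mnemonics missing from database: {sorted(unknown)}")
    return sorted(known - set(allowed))


def emit_cpp_table(names, array_name="kForbiddenMnemonics"):
    """Render the names as a C++ string array (illustrative output format only)."""
    lines = [f"static const char *const {array_name}[] = {{"]
    lines += [f'    "{name}",' for name in names]
    lines.append("};")
    return "\n".join(lines)


# Tiny made-up "database" and allowlist, just to show the shape of the output.
database = ["add", "fadd", "ldp", "ldur", "stp", "stur"]
allowed = ["ldp", "ldur", "stp", "stur"]
print(emit_cpp_table(forbidden_mnemonics(database, allowed)))
```

Generating the complement from an allowlist keeps the hand-maintained part down to the roughly 40 permitted instructions.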


2024-03-20 01:12:02

I’m not familiar with this specific topic, but this reminds me of a school assignment I failed in middle school. We got 16 questions. 15 of them were absolutely ridiculous, and question 16 was a statement that basically boiled down to "Ignore questions 1 through 15." I failed because I didn’t get the hint.

Could your teacher intentionally be giving you a ridiculous assignment to test you on whether or not you'll actually do it, with the goal of improving critical thinking skills? Maybe you're supposed to question them on this one.

I hope this was helpful.

Discord: dangero#0750
Steam: dangero2000

2024-03-20 01:38:16

No, it's of reasonable difficulty.  If and only if the professor is like "you're modifying gcc 10 look here here and here".  But if there's no plan other than meh go figure it out...shrug. Don't think it's a trick personally. Think the professor is just one of those dicks who doesn't believe in giving out 100s.


2024-03-20 03:16:40 (edited by Ethin 2024-03-20 03:18:13)

Yeah asking you to modify a compiler without providing you any assistance or information is absolutely insane. Modern compilers are crazy complicated. Messing around with a compiler like TCC or LCC would be easier but it's questionable whether he'd accept that since he clearly expects you to be a master of compiler engineering and ancient codebases that are 20-30 years old and have a ton of technical debt. I'd just take the hit if I were you. Not even my professors were that evil.

2024-03-21 07:38:59

Hello,
At least Clang is newer than GCC, if modifying a compiler is the way you go.
But I suppose you could write a new architecture (target) for it.
Pick one of its existing architectures (ARM, I suppose, in your case), modify it so it doesn't emit any instructions except yours, and then adapt how LLVM IR is handled for your arch.
Maybe your other bet is to use its pass manager? (I suppose you'd need to write a huge set of transformations on your own, as the list of instructions is huge and you'd have to handle each of them.)
I've never really been into LLVM internals, so my answers may be wrong.
But in your case you have a lot to do, as those internals are very hard to modify.
If Clang is an option and your professor accepts it, go for it instead of modifying GCC. Maybe you can create a new arch for LLVM and use it as the basis of your machine arch.