STM32 without CubeIDE (Part 1): The bare necessities

Ever since I started programming microcontrollers, I have almost exclusively done so using a vendor-provided (usually Eclipse-based) IDE, which does a lot of stuff automagically behind the scenes. But since I like to know how stuff works, I figured I would ditch the IDE and try starting from scratch. In this blog post series I am going to go back to basics and explore how to program an STM32 microcontroller, starting as simple as possible and then gradually adding more and more stuff along the way. In this first post I am going to start off with nothing more than an editor and the command line. I will be writing my own linker script, startup code and a simple blink application without using any vendor-provided tools or drivers. Then I will build an executable and flash it to an STM32 microcontroller using a few command-line tools. I will be using Visual Studio Code as my text editor, but if you are going to follow along, feel free to use whatever editor you prefer.

All code from this blog post is available on GitHub.

Throughout the series I will be using a NUCLEO-F410RB development board which features an Arm Cortex-M4-based STM32F410RB microcontroller and an integrated ST-LINK for programming over USB. However, it should be a fairly straightforward task to port my examples to any other STM32 MCU or development board. I am using Windows, but I will almost exclusively be using tools that are also available on Linux (and perhaps Mac).

Let’s get started by first installing some of these tools.

Installing the development tools

The STM32 microcontrollers are built around the Arm Cortex-M processor. To convert our code – whether it be C, C++ or assembly – to executable code that the processor understands, we are going to need the Arm GNU Toolchain. This toolchain contains, among other things, a cross-compiler for C, namely arm-none-eabi-gcc. The term cross-compiler simply means that the compiler runs on one architecture (e.g. x86_64 for Windows) but creates executables that run on another architecture (e.g. ARMv7E-M for a Cortex-M4).

The Arm GNU Toolchain can be downloaded for Windows, Linux or Mac here, but since I will be using other Unix utilities (such as make) later on in this blog post series, I will instead install it through MSYS2. MSYS2 (“minimal system 2”) is a small collection of Unix tools for Windows, based on MinGW and Cygwin. It features the pacman package manager, making it really simple to install everything we need. After installing MSYS2, we can open an MSYS2 MINGW64 terminal and install the Arm GNU Toolchain with the command pacman -S mingw-w64-x86_64-arm-none-eabi-gcc. Then just add C:\msys64\mingw64\bin and C:\msys64\usr\bin (or wherever you installed MSYS2) to our path and we can now invoke the compiler directly from a command prompt.

To load our executable onto the microcontroller, we are going to use OpenOCD. You can clone the GitHub repository and compile the source code yourself or just get the pre-compiled Windows distribution under Releases. Whatever you choose, remember to add the openocd/bin/ folder to your path.

The build process explained

Before diving into the code, let’s first review how source code is turned into machine code and uploaded to the microcontroller. The figure below shows a simplified overview of this process:

First, our source files are passed into the (cross-)compiler which turns them into individual object files. The object files are then “linked together” by the linker to form an executable. We then use a programmer to upload the executable to the microcontroller, which then executes our code.

Let’s go into a bit more detail with a concrete example. We are going to need the following files:

a linker script: linker_script.ld
startup code: startup.c
application code: main.c

The .c source files are passed to the C compiler (arm-none-eabi-gcc) which creates an object file for each source file. These object files are then passed to the linker (arm-none-eabi-ld) along with a linker script, telling the linker where each section of the object files should be placed in memory on the microcontroller. The output of the linker is our final executable binary file blink.elf. In practice, we will not be invoking gcc and ld separately, since pre-processing, compiling and linking are all handled by just invoking gcc. If we are interested in any of the intermediate output, there are several command-line options available, for example -c to output the compiled object files without linking or -Wl,-Map=filename.map to instruct the linker to output the memory map file.

After building the executable, we are ready to flash it to the target. Our host PC communicates through OpenOCD with the ST-LINK programmer, which in turn communicates with the microcontroller. Our executable is stored in non-volatile flash memory as specified in the linker script. When the microcontroller boots up, our startup code will ensure that the initialized data section (.data) is copied to SRAM and the uninitialized data section (.bss) is filled with zeroes. Then the main() function is called and our application is running.

The linker script – telling the linker where to place things

The purpose of the linker script is basically to tell the linker:

Where to start executing code when the processor boots, i.e. the program entry point
Which memory regions exist on the microcontroller
Where to place the different code sections in memory

The script is written in Linker Command Language, which you can read more about here. However, we will only need to use a few commands for this simple script.

The entry point is specified using the ENTRY() command. In the startup code we are going to create a function called reset_handler() which will handle all memory initialization, so this will be our entry point:

ENTRY(reset_handler)

Next, to determine the memory layout of the MCU we will consult the STM32F410RB reference manual section 2, “System and memory overview”. We need to figure out the starting address and the size for each memory region on the MCU. Figure 1 shows an overview of the system architecture, where we see that there is 128 KB of flash and 32 KB of SRAM available:

If we scroll down a bit further, we will find figure 2 which shows the memory map. We see that the flash memory starts at address 0x08000000 and SRAM starts at 0x20000000:

Now, we have all the information we need to define the memory regions:

MEMORY
{
  FLASH (rx): ORIGIN = 0x08000000, LENGTH = 128K
  SRAM (rwx): ORIGIN = 0x20000000, LENGTH = 32K
}

The next step is to specify where each memory section of our object files should be placed in these memory regions. Let’s take another look at the MCU part in the figure of the build process presented earlier:

The first section in flash memory is the interrupt service routine vector (or just interrupt vector) which we will call .isr_vector. As stated in the STM32 Cortex-M4 programming manual, the vector table is located at address 0x00000000 by default, which is aliased to the start of flash memory (0x08000000) when the MCU boots from flash. We have no reason to relocate it, although it is possible. Next we have the .text section which contains the actual instructions of our program. Then we have the .rodata section, which contains read-only data, and finally the .data section, which contains initialized variable data that will be copied from flash to SRAM during startup. In SRAM we also see a .bss section which is meant for uninitialized variable data and should be zero-filled during startup.

To check which memory sections an .elf or .o file contains, use arm-none-eabi-objdump -h <filename>

Using the SECTIONS command to define the sections, we end up with this:

SECTIONS
{
  .isr_vector :
  {
    KEEP(*(.isr_vector))
  } >FLASH

  .text :
  {
    . = ALIGN(4);
		
    *(.text)
    *(.rodata)
		
    . = ALIGN(4);
    _etext = .;
  } >FLASH

  .data :
  {
    . = ALIGN(4);
    _sdata = .;
		
    *(.data)

    . = ALIGN(4);
    _edata = .;
  } >SRAM AT> FLASH

  .bss :
  {
    . = ALIGN(4);
    _sbss = .;
		
    *(.bss)
		
    . = ALIGN(4);
    _ebss = .;
  } >SRAM
}

After each section definition, we are specifying which memory region the section should be placed in, i.e. FLASH or SRAM. The .data section differs from the other sections in that it has two addresses: >SRAM AT> FLASH. The first address is called the virtual memory address (or relocation address) and is the memory address that the section will have when the program is executed. The second address is the load memory address, which is where the section is loaded onto the target. Since data does not persist in SRAM when the target is powered off, we have no choice but to save the data in flash and then copy it to SRAM during startup. All the other sections only have a single address defined, meaning that the virtual memory address and the load memory address are the same.

Notice that I have also defined the symbols _etext, _sdata, _edata, _sbss and _ebss using the location counter (.). We will use these symbols in the startup code to ensure that we copy and zero-fill the correct memory addresses. I have also made sure to align everything to a 4-byte boundary as recommended in the programming guide. This is to avoid unaligned memory accesses, which are only permitted for certain instructions, are slower than aligned accesses and will trigger a usage fault exception if used illegally.
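To make the effect of ". = ALIGN(4);" concrete, here is a minimal C sketch of the same rounding rule. The helper name align_up is my own invention and not part of the linker script or the toolchain, and the sketch assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the post): rounds addr up to the next
 * multiple of alignment, mirroring what ". = ALIGN(4);" does to the
 * location counter. Assumes alignment is a non-zero power of two. */
uint32_t align_up(uint32_t addr, uint32_t alignment)
{
  assert(alignment != 0U && (alignment & (alignment - 1U)) == 0U);
  return (addr + alignment - 1U) & ~(alignment - 1U);
}
```

For example, align_up(0x20000001, 4) yields 0x20000004, while an address that is already aligned, such as 0x20000004, is returned unchanged.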

That’s all for the linker script – let’s continue with the startup code.

Startup code – getting things ready before main()

In startup.c we must do the following:

Initialize the main stack pointer and interrupt vector table in the .isr_vector section
Copy the .data section from flash to SRAM
Zero-fill the .bss section

The main stack pointer should be initialized to point at the end of SRAM and will then move “down” in memory as data is pushed to the stack. Since we know both the starting address and size of the SRAM region, we can find the end of SRAM:

#define SRAM_START (0x20000000U)
#define SRAM_SIZE (32U * 1024U)
#define SRAM_END (SRAM_START + SRAM_SIZE)
#define STACK_POINTER_INIT_ADDRESS (SRAM_END)
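As a sanity check, the resulting value can be verified at compile time. This is only a sketch and assumes a C11 compiler (for static_assert). The alignment check is my own addition, reflecting that the Arm procedure call standard expects the stack pointer to be 8-byte aligned at function call boundaries.

```c
#include <assert.h> /* static_assert (C11) */

#define SRAM_START (0x20000000U)
#define SRAM_SIZE (32U * 1024U)
#define SRAM_END (SRAM_START + SRAM_SIZE)
#define STACK_POINTER_INIT_ADDRESS (SRAM_END)

/* 32 KB above 0x20000000 is 0x20008000, one byte past the last SRAM byte. */
static_assert(STACK_POINTER_INIT_ADDRESS == 0x20008000U,
              "unexpected initial stack pointer");

/* Added assumption: keep the initial stack pointer 8-byte aligned. */
static_assert((STACK_POINTER_INIT_ADDRESS % 8U) == 0U,
              "initial stack pointer is not 8-byte aligned");
```

Pointing just past the end of SRAM is fine because the Cortex-M stack is full-descending: the stack pointer is decremented before the first value is stored.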

The main stack pointer address must then be stored as the first word in the interrupt vector table. When the processor boots, this address is copied to the stack pointer register (R13) in the processor core.

Besides the initial address of the main stack pointer, the interrupt vector table must also contain 15 words for the Cortex-M system exception handlers and 98 words for STM32F410RB interrupt handlers. Some of these words are reserved so we will simply put a zero at those indices. The entire vector table can be found in the reference manual in section 9, “Interrupts and events”. I am not going to go through the process of implementing all of these interrupt handlers – just the reset_handler() and the default_handler(). Then I will alias the rest of the interrupt handlers to the default handler and declare them weak, so they can be overridden later in the application code as necessary. In the code below I have declared all the system exception handlers, but have left out the STM32 interrupt handlers for brevity. Also notice the section attribute given to the isr_vector[] array to ensure that it ends up in the correct memory section.

#include <stdint.h>
#define ISR_VECTOR_SIZE_WORDS 114

void reset_handler(void);
void default_handler(void);
void nmi_handler(void) __attribute__((weak, alias("default_handler")));
void hard_fault_handler(void) __attribute__((weak, alias("default_handler")));
void mem_manage_handler(void) __attribute__((weak, alias("default_handler")));
void bus_fault_handler(void) __attribute__((weak, alias("default_handler")));
void usage_fault_handler(void) __attribute__((weak, alias("default_handler")));
void svcall_handler(void) __attribute__((weak, alias("default_handler")));
void debug_monitor_handler(void) __attribute__((weak, alias("default_handler")));
void pendsv_handler(void) __attribute__((weak, alias("default_handler")));
void systick_handler(void) __attribute__((weak, alias("default_handler")));
// continue adding device interrupt handlers

uint32_t isr_vector[ISR_VECTOR_SIZE_WORDS] __attribute__((section(".isr_vector"))) = {
  STACK_POINTER_INIT_ADDRESS,
  (uint32_t)&reset_handler,
  (uint32_t)&nmi_handler,
  (uint32_t)&hard_fault_handler,
  (uint32_t)&mem_manage_handler, // index 4: MemManage comes before BusFault
  (uint32_t)&bus_fault_handler,
  (uint32_t)&usage_fault_handler,
  0, // indices 7-10 are reserved
  0,
  0,
  0,
  (uint32_t)&svcall_handler,
  (uint32_t)&debug_monitor_handler,
  0,
  (uint32_t)&pendsv_handler,
  (uint32_t)&systick_handler,
  // continue adding device interrupt handlers
};

void default_handler(void)
{
  while(1);
}
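Because the handlers are declared weak, application code can replace any of them simply by defining a function with the same name. The sketch below overrides systick_handler(). The tick_count variable is a hypothetical name of my own, and on the target the handler will only run once the SysTick timer has been configured, which this post does not do.

```c
#include <stdint.h>

/* Hypothetical counter, incremented once per SysTick interrupt. Declared
 * volatile because it is modified in interrupt context and read elsewhere. */
volatile uint32_t tick_count = 0;

/* A normal (strong) definition takes precedence over the weak alias to
 * default_handler() declared in startup.c. */
void systick_handler(void)
{
  tick_count++;
}
```

No change to startup.c is needed for this: the linker picks the strong definition over the weak one when both are present.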

The last thing left to do in the startup code is to implement the reset_handler(), which we specified as our program entry point in the linker script. Here we will use the symbols we defined in the linker script to copy the .data section from flash (starting at _etext) to SRAM (starting at _sdata), and also write zeros to the entire .bss section in SRAM (from _sbss to _ebss).

extern uint32_t _etext, _sdata, _edata, _sbss, _ebss;
void main(void);

void reset_handler(void)
{
  // Copy .data from FLASH to SRAM
  uint32_t data_size = (uint32_t)&_edata - (uint32_t)&_sdata;
  uint8_t *flash_data = (uint8_t*) &_etext;
  uint8_t *sram_data = (uint8_t*) &_sdata;
  
  for (uint32_t i = 0; i < data_size; i++)
  {
    sram_data[i] = flash_data[i];
  }

  // Zero-fill .bss section in SRAM
  uint32_t bss_size = (uint32_t)&_ebss - (uint32_t)&_sbss;
  uint8_t *bss = (uint8_t*) &_sbss;

  for (uint32_t i = 0; i < bss_size; i++)
  {
    bss[i] = 0;
  }
  
  main();
}

Finally, we will call our main() function.
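The copy and zero-fill logic can also be tried out on a PC. In the following sketch the linker symbols are replaced by ordinary pointer parameters, so the same loops run against plain arrays. The function name init_memory and its parameters are assumptions of mine and not part of the startup code above.

```c
#include <stdint.h>

/* Hypothetical host-side version of the loops in reset_handler():
 * load_addr plays the role of &_etext, and the start/end pointers play the
 * roles of &_sdata/&_edata and &_sbss/&_ebss. */
void init_memory(const uint8_t *load_addr,
                 uint8_t *data_start, uint8_t *data_end,
                 uint8_t *bss_start, uint8_t *bss_end)
{
  /* Copy the initial values of .data from its load address. */
  uint32_t data_size = (uint32_t)(data_end - data_start);
  for (uint32_t i = 0; i < data_size; i++)
  {
    data_start[i] = load_addr[i];
  }

  /* Zero-fill .bss. */
  uint32_t bss_size = (uint32_t)(bss_end - bss_start);
  for (uint32_t i = 0; i < bss_size; i++)
  {
    bss_start[i] = 0;
  }
}
```

Since all pointers here are byte pointers, the pointer differences are already byte counts, which matches the byte-wise loops.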

A minimal blink application

In the main() function, our goal is to implement the classic embedded system equivalent of “Hello, World!”, namely blinking an LED. On the development board we have the anode of a green LED (LD2) connected to PA5 and the cathode connected to GND via a resistor. In order to get the LED to blink we must:

Enable the peripheral clock for GPIO port A
Set PA5 as a push-pull output
Toggle the pin at a fixed interval in the super loop

To figure out how to access these peripherals, let’s take a look at table 1 in section 2 (“System and memory overview”) of the reference manual. We see that all peripherals are memory-mapped starting from address 0x40000000:

If we scroll down to table 1, we find that both GPIOA and RCC are connected to the advanced high-performance bus 1 (AHB1) which is offset by 0x20000. GPIOA and RCC are mapped to the relative AHB1 addresses 0x0 and 0x3800, respectively:

Let’s write some #defines for these addresses:

#define PERIPHERAL_BASE (0x40000000U)
#define AHB1_BASE (PERIPHERAL_BASE + 0x20000U)
#define GPIOA_BASE (AHB1_BASE + 0x0U)
#define RCC_BASE (AHB1_BASE + 0x3800U)
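Resolving the macros by hand gives the absolute addresses, which can then be cross-checked against the reference manual. A minimal sketch, assuming a C11 compiler for static_assert:

```c
#include <assert.h> /* static_assert (C11) */

#define PERIPHERAL_BASE (0x40000000U)
#define AHB1_BASE (PERIPHERAL_BASE + 0x20000U)
#define GPIOA_BASE (AHB1_BASE + 0x0U)
#define RCC_BASE (AHB1_BASE + 0x3800U)

/* 0x40000000 + 0x20000 + 0x0 = 0x40020000 */
static_assert(GPIOA_BASE == 0x40020000U, "unexpected GPIOA base address");

/* 0x40000000 + 0x20000 + 0x3800 = 0x40023800 */
static_assert(RCC_BASE == 0x40023800U, "unexpected RCC base address");
```

If either offset is mistyped, the build fails instead of the program silently writing to the wrong address.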

If we scroll through the RCC registers in the reference manual, we find the “RCC AHB1 peripheral clock enable register” at offset 0x30 where bit 0 enables GPIOA.

For the GPIO we need to set the pin as an output (it is push-pull by default) and then control its output state. This is done in the mode register and output data register, which are offset by 0x0 and 0x14, respectively. In the mode register there are two mode bits for each pin, so pin 5 is configured with bits [11:10] where the value 0b01 sets the pin in general purpose output mode. In the output data register the state is just one bit for each output, so pin 5 is at bit 5. We’ll create some more defines and then move on to implementing the main() function:

#define RCC_AHB1ENR_OFFSET (0x30U)
#define RCC_AHB1ENR ((volatile uint32_t*) (RCC_BASE + RCC_AHB1ENR_OFFSET))
#define RCC_AHB1ENR_GPIOAEN (0x00U)

#define GPIO_MODER_OFFSET (0x00U)
#define GPIOA_MODER ((volatile uint32_t*) (GPIOA_BASE + GPIO_MODER_OFFSET))
#define GPIO_MODER_MODER5 (10U)
#define GPIO_ODR_OFFSET (0x14U)
#define GPIOA_ODR ((volatile uint32_t*) (GPIOA_BASE + GPIO_ODR_OFFSET))

#define LED_PIN 5

Notice that I’ve declared the registers as pointers to volatile to make sure the compiler does not optimize out any seemingly useless accesses to these addresses. See my post on the volatile qualifier.
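When writing a multi-bit field such as the two mode bits of a pin, it is generally safer to clear the field first, since OR-ing alone only gives the intended result if the field is already zero. The following is a minimal sketch of that pattern. The helper name gpio_set_mode is hypothetical, and it takes a pointer to volatile so the same code can be exercised against an ordinary variable on a PC.

```c
#include <stdint.h>

/* Hypothetical helper: sets the two MODER bits of one pin (0-15).
 * mode is 0b00 (input), 0b01 (general purpose output),
 * 0b10 (alternate function) or 0b11 (analog). */
void gpio_set_mode(volatile uint32_t *moder, uint32_t pin, uint32_t mode)
{
  uint32_t shift = pin * 2U;       /* two mode bits per pin */
  uint32_t value = *moder;         /* single volatile read */

  value &= ~(0x3U << shift);       /* clear the old mode */
  value |= (mode & 0x3U) << shift; /* set the new mode */

  *moder = value;                  /* single volatile write */
}
```

With the defines above, the call on the target would be gpio_set_mode(GPIOA_MODER, LED_PIN, 1U).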

Now we just have to write the main function where we enable the peripheral clock for GPIOA (and wait a few cycles for the peripheral to actually be enabled), set pin 5 as an output and then toggle the pin on and off in the super loop:

void main(void)
{
  *RCC_AHB1ENR |= (1 << RCC_AHB1ENR_GPIOAEN);

  // do two dummy reads after enabling the peripheral clock, as per the errata
  volatile uint32_t dummy;
  dummy = *(RCC_AHB1ENR);
  dummy = *(RCC_AHB1ENR);

  *GPIOA_MODER |= (1 << GPIO_MODER_MODER5);
  
  while(1)
  {
    *GPIOA_ODR ^= (1 << LED_PIN);
    for (uint32_t i = 0; i < 1000000; i++);
  }

}
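One caveat with the empty delay loop: it has no side effects, so a compiler is free to remove it once optimization is enabled (the build command used in this post does not enable optimization). Below is a minimal sketch of a delay that survives optimization by using a volatile loop counter. The name busy_wait is my own, and the actual delay still depends on the clock frequency and the generated code, so it is not calibrated in any way.

```c
#include <stdint.h>

/* Hypothetical delay helper: the volatile counter forces the compiler to
 * perform every increment and comparison, so the loop is not optimized
 * away. Returns the number of iterations performed. */
uint32_t busy_wait(uint32_t iterations)
{
  volatile uint32_t i;

  for (i = 0; i < iterations; i++)
  {
    /* intentionally empty */
  }

  return i;
}
```

In the super loop, the call busy_wait(1000000U) would take the place of the empty for loop.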

Building the executable

Now that we have all the source code, we are ready to build our executable. As I mentioned previously, we do not need to invoke the compiler and linker separately – just invoke arm-none-eabi-gcc:

$ arm-none-eabi-gcc main.c startup.c -T linker_script.ld -o blink.elf -mcpu=cortex-m4 -mthumb -nostdlib

We pass in the source files main.c and startup.c as input. Then we pass the linker script to the linker with the -T option and specify the output file with the -o option. Since we are cross-compiling for another CPU architecture, we also need to tell the compiler which architecture and instruction set to compile for (-mcpu=cortex-m4 -mthumb). To avoid linking the standard library and standard system startup code into our executable, we’ll also use the -nostdlib flag. After running the command, blink.elf appears in our project folder and we are ready to flash the binary to the microcontroller.

Loading the program onto the microcontroller

To connect to the microcontroller using OpenOCD, we need to pass in a few configuration files specifying the programmer (or interface) and target microcontroller. We can browse through these files in the openocd/share/openocd/scripts/ folder to find something that matches our setup (e.g. interface/stlink.cfg and target/stm32f4x.cfg). This folder is searched automatically by OpenOCD when specifying input files as a command-line argument, so we do not need to copy anything to our project folder. To actually flash the .elf file to the target, we can specify an OpenOCD command with the -c option to perform the following actions:

Program the binary to the target
Verify that it was programmed correctly
Reset the target
Exit OpenOCD

$ openocd -f interface/stlink.cfg -f target/stm32f4x.cfg -c "program blink.elf verify reset exit"

At last! If everything went smoothly, we should now see the LED blinking!

Next steps

In part 2 we will be making our lives a bit easier by adding some basic CMSIS components including all register definitions for the MCU – we do not want to write all of those ourselves! Also, instead of invoking the compiler directly, we will write a Makefile and use make to handle the build process.

5 thoughts on “STM32 without CubeIDE (Part 1): The bare necessities”

Pingback: STM32 without CubeIDE (Part 2): CMSIS, make and clock configuration - Klein Embedded
Hannes says:

April 4, 2023 at 06:49

In the startup.c, the segment where the .bss is zero’d:

uint32_t bss_size = &_ebss - &_sbss;
uint8_t *bss = (uint8_t*) &_sbss;

for (uint32_t i = 0; i < bss_size; i++)
{
bss[i] = 0;
}

I had to change

"uint8_t *bss = (uint8_t*) &_sbss;"

to

"uint32_t *bss = (uint32_t*) &_sbss;"

because otherwise just a quarter of the memory assigned to .bss would be zero'd.

1. Kristian Klein-Wengel says:
  
  April 5, 2023 at 19:25
  
  Thanks for pointing that out. I see my mistake is not casting the &_ebss and &_sbss pointers to uint32_t before performing the subtraction for bss_size. Will edit that in the post.
  
  Your solution also works as long as the .bss section is 4-byte aligned, which it is in our case. If not, you risk zeroing a few bytes of whatever comes after the .bss section in memory.
  
otann says:

June 12, 2023 at 14:28

could it be, that “_etext = .;” should stand between “*(.text)” and “*(.rodata)”? I mean something like that:
.text :
{
. = ALIGN(4);
*(.text)
_etext = .;
*(.rodata)
. = ALIGN(4);
} >FLASH

Jay says:

January 14, 2026 at 00:18

In your reset_handler you take the address of all the variables declared in the linker script and then cast them to pointers to memory locations in the different sections. I believe that is unnecessary, because they should just hold their location in memory as their data. Or I could be wrong.

```
void reset_handler(void)
{
// Copy .data from FLASH to SRAM
uint32_t data_size = (uint32_t) &_edata - (uint32_t) &_sdata;
uint8_t *flash_data = (uint8_t*) &_etext;
uint8_t *sram_data = (uint8_t*) &_sdata;

for (uint32_t i = 0; i < data_size; i++)
{
sram_data[i] = flash_data[i];
}

// Zero-fill .bss section in SRAM
uint32_t bss_size = (uint32_t) &_ebss - (uint32_t) &_sbss;
uint8_t *bss = (uint8_t*) &_sbss;

for (uint32_t i = 0; i < bss_size; i++)
{
bss[i] = 0;
}

main();
}
```
