If you’re new to this series, I’ve been documenting the process I went through upgrading my old PXA166-based Chumby 8’s 2.6.28 Linux kernel to a modern 6.x version. Here are links to parts 1, 2, 3, 4, 5, 6, 7, and 8. At this point in the project, all of the main hardware peripherals were working great. I noticed something odd when running top though. The CPU usage was always really high, and it wasn’t obvious why.

Mem: 47888K used, 55968K free, 168K shrd, 3116K buff, 27480K cached
CPU: 100% usr 0% sys 0% nic 0% idle 0% io 0% irq 0% sirq
Load average: 0.00 0.00 0.00 2/51 269
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
267 200 root R 2936 3% 100% top
100 1 root S 12240 12% 0% /sbin/udevd -d
1 0 root S 2936 3% 0% init
200 1 root S 2936 3% 0% -sh
65 1 root S 2936 3% 0% /sbin/syslogd -n
71 1 root S 2936 3% 0% /sbin/klogd -n
34 2 root SW 0 0% 0% [irq/56-mmc0]
10 2 root IW 0 0% 0% [kworker/0:1-eve]
11 2 root IW 0 0% 0% [kworker/u2:0-ev]
41 2 root IW< 0 0% 0% [kworker/0:2H-kb]
8 2 root IW 0 0% 0% [kworker/0:0-lib]
22 2 root IW< 0 0% 0% [kworker/0:1H-mm]
32 2 root SW 0 0% 0% [irq/55-mmc1]
14 2 root IW 0 0% 0% [rcu_preempt]
17 2 root IW 0 0% 0% [kworker/u2:1-ev]
15 2 root SW 0 0% 0% [kdevtmpfs]
27 2 root IW 0 0% 0% [kworker/0:2-pm]
2 0 root SW 0 0% 0% [kthreadd]
13 2 root SW 0 0% 0% [ksoftirqd/0]
3 2 root SW 0 0% 0% [pool_workqueue_]

That’s really weird! Why would top be using all of my CPU? It says 100% usr in the second line. Sometimes the usage showed up as 50% usr and 50% sys. Other times it would show up as 100% sys. And very rarely, it would show 100% idle. In that rare case, top would actually show up with 0% usage as I would expect. The 2.6.28 kernel did not have this problem, so it was something different about my newer kernel.

I started theorizing about what might be wrong here. My ideas were nothing more than wild guesses: maybe one of the drivers I got working earlier was totally monopolizing the CPU and making it look like other processes were using all the CPU. Or perhaps I was missing some kind of CPU idle support so the processor was stuck running at 100% at all times, never able to take a break. Maybe there was some power management support missing from the mainline kernel or something. I needed to figure out a way to narrow down the possibilities.

My first step in diagnosing this problem was to go back in time. Several years ago, I had played with getting Linux 3.13 working on my Chumby 8 before I gave up on the project due to not having enough time or experience. This ended up being useful in the present day because I could try booting the old 3.13 kernel to see if it had the same issue.

I quickly discovered that the 3.13 kernel had the same exact problem. This was definitely a useful data point. It meant nothing recent had broken it. Also, I didn’t have many drivers working in that old kernel, so it ruled out a lot of possible causes. I said goodbye to the old kernel and thanked it for its insight.

Next, I tried profiling my modern kernel by enabling CONFIG_PROFILING. Then I added profile=2 to my kernel command line. Lastly, I made sure System.map from my Linux build directory was copied over to /boot on my Chumby. Now I was ready to do profiling.

readprofile -r    # reset the counters
top               # run top and wait for a while, then type q to exit
readprofile       # print the final profiling results

The readprofile results were interesting. The entire output was far too long to put here, but here is the bottom of what it spit out.

...
38 lock_is_held_type 0.0888
8 debug_lockdep_rcu_enabled 0.1000
4 __schedule 0.0019
1 preempt_schedule_irq 0.0068
1 __mutex_unlock_slowpath 0.0012
1 mutex_trylock 0.0022
3 __mutex_lock 0.0017
1 down_read_killable 0.0093
3974 default_idle_call 24.2317
4 _raw_spin_lock 0.0435
1 _raw_spin_lock_irq 0.0078
1 _raw_spin_lock_irqsave 0.0081
29 _raw_spin_unlock_irq 0.2788
118 _raw_spin_unlock_irqrestore 0.8429
0 *unknown*
4722 total 0.0006

As you can see based on the total, default_idle_call used up the vast majority of the time. This seemed normal to me. After this test I felt pretty confident that I shouldn’t be blaming any of the random peripheral drivers I had worked on. The CPU was definitely idle, but why didn’t top think so?

I traced my way through what happens in default_idle_call. The main thing it does is call arch_cpu_idle. On ARM, this calls arm_pm_idle if it exists, and otherwise calls cpu_do_idle. arm_pm_idle doesn’t seem to be used on this particular ARM machine, so I kept tracing through cpu_do_idle. On the PXA168 it ultimately resolves to cpu_mohawk_do_idle, which I confirmed by disassembling the final vmlinux file with objdump.

This function doesn’t seem very complex (this is from the Linux 6.8 code):

ENTRY(cpu_mohawk_do_idle)
	mov	r0, #0
	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
	mcr	p15, 0, r0, c7, c0, 4		@ wait for interrupt
	ret	lr

Here’s the corresponding function in the 2.6.28 kernel source:

ENTRY(cpu_mohawk_do_idle)
	mov	r0, #0
#ifdef CONFIG_ENABLE_COREIDLE
	mcr	p15, 0, r0, c7, c0, 4		@ Wait for interrupt
#endif
	mov	pc, lr

So they’re pretty much the same thing, except the newer kernel also drains the write buffer. I observed that CONFIG_ENABLE_COREIDLE was not defined on the 2.6.28 kernel though, so that was quite a big difference. I double checked the 2.6.28 kernel with objdump just to make sure:

c0207000 <cpu_mohawk_do_idle>:
c0207000:       e3a00000        mov     r0, #0
c0207004:       e1a0f00e        mov     pc, lr

Yeah, the wait for interrupt was definitely not there in 2.6.28. This gave me some ideas to try and I was starting to feel some hope that I might have found something. I tinkered with making the modern kernel’s cpu_mohawk_do_idle function identical to the old version. This honestly didn’t make logical sense though. Wouldn’t removing a “wait for interrupt” instruction actually make things worse in terms of idling the CPU? I shut off my brain and tried it out anyway. Why not?

Even though I thought I was making great progress, this whole branch of research turned out to be a red herring. No matter what tinkering I did in cpu_mohawk_do_idle, top was taking up 100% CPU. I went from thinking I was on the cusp of solving the problem, right back to square one.

After walking away from the problem for a while, I realized I had been throwing around random guesses instead of using logic. I decided to attack this problem from another angle instead. One simple question led me in a new direction: how does top calculate the CPU usage it displays? I realized I had no freaking clue how top worked. It was all just magic to me. I looked at BusyBox’s source code to try to understand how it works.

It turns out that all of top’s information comes from procfs, mounted at /proc. At the beginning of the source file you can see a comment that says “At startup this changes to /proc, all the reads are then relative to that.” This is confirmed on line 1150 where it changes the working directory.

What it does is read several files: /proc/stat, /proc/meminfo, and all of the /proc/<pid>/stat files — one for each running process. Iterating through all of the processes is handled by procps_scan in BusyBox’s implementation of top. A comment in the source helpfully informs us that man 5 proc gives more info about the content of these files. I focused on /proc/stat, which is responsible for providing the info in the second line of top’s output showing how the CPU load is distributed between user, system, idle, etc.

The man page says that the lines beginning with “cpu” in /proc/stat contain a bunch of numbers corresponding to the amount of time spent in a bunch of different states. Here is some example output from my desktop computer:

cpu  33032 1025 8433 426488 1422 0 314 0 0 0

The numbers, in order from left to right, represent time spent in: user mode, nice (user mode low priority), system, idle, iowait, irq, softirq, steal, guest, and guest_nice. They are in units of USER_HZ, which the man page says is usually 1/100 of a second but can vary. I can see that is correct on my x86_64 machine:

$ getconf CLK_TCK
100

I didn’t have getconf available on my buildroot-generated Chumby rootfs. I probably could have turned on a package to add it, but it’s easy enough to just throw together a quick C program to figure it out instead:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	/* _SC_CLK_TCK is the number of clock ticks per second (USER_HZ) */
	printf("%ld\n", sysconf(_SC_CLK_TCK));
	return 0;
}

Surprise surprise, it’s the same as my desktop PC.

# /tmp/test
100

Now that I knew my units were 1/100ths of a second, I tried a few experiments. It appears that top simply checks the values in /proc/stat periodically, and calculates the load by looking at the differences in the values between each iteration. So I theorized that if the CPU mostly has nothing going on, I should be able to look at the file content, sleep 10 seconds, look at the file content again, and expect the idle count to go up by about 1000. That’s because 1000 ticks × 1/100 of a second per tick = 10 seconds. I decided to try it out, starting on my desktop PC:

$ cat /proc/stat | grep 'cpu' ; sleep 10 ; cat /proc/stat | grep 'cpu'
cpu 49815 370 14743 25000206 3471 0 169 0 0 0
cpu0 2606 12 713 1563406 121 0 2 0 0 0
cpu1 2605 31 680 1563428 194 0 0 0 0 0
cpu2 4065 56 804 1562009 106 0 2 0 0 0
cpu3 2688 11 812 1563209 160 0 1 0 0 0
cpu4 3100 10 840 1562442 505 0 0 0 0 0
cpu5 2820 13 1032 1562568 150 0 1 0 0 0
cpu6 2714 42 821 1562976 241 0 1 0 0 0
cpu7 4316 25 1250 1560486 175 0 2 0 0 0
cpu8 3397 6 1132 1562333 117 0 2 0 0 0
cpu9 3366 16 1035 1562254 75 0 5 0 0 0
cpu10 3435 34 748 1562466 207 0 6 0 0 0
cpu11 3044 6 803 1562672 67 0 98 0 0 0
cpu12 3202 17 1007 1562161 287 0 2 0 0 0
cpu13 2612 18 1158 1562700 400 0 0 0 0 0
cpu14 2827 49 940 1562505 422 0 39 0 0 0
cpu15 3014 18 963 1562585 238 0 2 0 0 0

…10 seconds passed by, and then this printed out:

cpu  49844 370 14756 25016161 3472 0 169 0 0 0
cpu0 2606 12 713 1564405 121 0 2 0 0 0
cpu1 2606 31 680 1564426 194 0 0 0 0 0
cpu2 4066 56 804 1563008 106 0 2 0 0 0
cpu3 2689 11 815 1564204 160 0 1 0 0 0
cpu4 3101 10 840 1563440 505 0 0 0 0 0
cpu5 2822 13 1032 1563566 150 0 1 0 0 0
cpu6 2714 42 821 1563976 241 0 1 0 0 0
cpu7 4316 25 1250 1561484 175 0 2 0 0 0
cpu8 3398 6 1132 1563332 117 0 2 0 0 0
cpu9 3367 16 1035 1563252 75 0 5 0 0 0
cpu10 3441 34 750 1563458 207 0 6 0 0 0
cpu11 3047 6 806 1563666 67 0 98 0 0 0
cpu12 3206 17 1007 1563156 287 0 2 0 0 0
cpu13 2614 18 1160 1563696 400 0 0 0 0 0
cpu14 2829 49 942 1563503 422 0 39 0 0 0
cpu15 3015 18 964 1563583 238 0 2 0 0 0

Aha, so there’s a total, along with a separate line for each CPU. Remember that the idle count is the 4th column. Let’s look at a before and after comparison of cpu13, for example:

cpu13 2612 18 1158 1562700 400 0 0 0 0 0
cpu13 2614 18 1160 1563696 400 0 0 0 0 0

That looks about right! The idle count for that CPU increased by 996, which is about 1000. Here’s a comparison between the “total CPU” lines:

cpu  49815 370 14743 25000206 3471 0 169 0 0 0
cpu 49844 370 14756 25016161 3472 0 169 0 0 0

The total idle count increased by 15955, which is about 1000 times 16. This is my AMD Ryzen 5700G with 8 cores and 16 threads. Each thread is treated as a CPU as far as Linux is concerned. Okay, the math checked out! I felt like I understood how top does its calculations.

Next, I performed a similar comparison in my modern Chumby kernel. There’s only one CPU, so it prints out cpu and cpu0 lines that are identical. I decided to filter it further to only show the total CPU:

# cat /proc/stat | grep 'cpu ' ; sleep 10 ; cat /proc/stat | grep 'cpu '
cpu 420 0 3204 4 0 0 599 0 0 0
cpu 421 0 3213 6 0 0 600 0 0 0

I had finally made some real progress in tracking down the issue. This output is just plain wrong. Even though 10 seconds elapsed between the two lines being printed, the idle count only went up by 2, indicating that the CPU had spent only 0.02 seconds idling during that period. That result made absolutely no sense, because the other columns only increased by a total of 1 + 9 + 1 = 11 ticks = 0.11 seconds. All of the columns combined accounted for just 0.13 of the 10 seconds that had elapsed.

At this point, I excitedly booted up the old 2.6.28 kernel which seemed to behave perfectly fine with top, and repeated the experiment:

# cat /proc/stat | grep 'cpu ' ; sleep 10 ; cat /proc/stat | grep 'cpu '
cpu 4 0 198 4067 5 1 16 0 0
cpu 4 0 199 5067 5 1 17 0 0

The idle tick count increased by exactly 1000 ticks, or 10 seconds, just as I would have expected! This gave me something to actually start looking at in my new kernel. It wasn’t incrementing the idle tick counter for some reason. Well, occasionally it did, but not very often. This perfectly explained why top would almost always show the CPU as being 100% busy: top calculates each column’s percentage as the ticks added to that column divided by the total ticks added across all columns since the last time it checked. With the difference in idle ticks being 0 or close to 0 every time, it makes sense that the CPU was usually showing up as 100% busy.
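To make that calculation concrete, here is a minimal sketch of the arithmetic in C. It is not BusyBox’s actual code; the function names are made up for illustration, and it assumes only the first eight columns (user through steal) should be summed, since guest time is already counted inside the user and nice columns.

```c
#include <stdio.h>
#include <string.h>

/* Columns summed for the total: user, nice, system, idle, iowait, irq, softirq, steal. */
#define NUM_STATES 8
#define STATE_IDLE 3

/*
 * Parse the aggregate "cpu" line of /proc/stat into ticks[].
 * Columns missing on older kernels are left at 0.
 * Returns the number of counters parsed, or -1 if this is not the aggregate line.
 */
static int parse_cpu_line(const char *line, unsigned long long ticks[NUM_STATES])
{
	int n;

	for (n = 0; n < NUM_STATES; n++)
		ticks[n] = 0;

	/* Require "cpu" followed by a space so that "cpu0", "cpu1", ... are rejected. */
	if (strncmp(line, "cpu ", 4) != 0)
		return -1;

	n = sscanf(line + 4, "%llu %llu %llu %llu %llu %llu %llu %llu",
		   &ticks[0], &ticks[1], &ticks[2], &ticks[3],
		   &ticks[4], &ticks[5], &ticks[6], &ticks[7]);
	return n > 0 ? n : -1;
}

/*
 * Share of the elapsed ticks spent in one state between two samples,
 * as a percentage. Returns 0.0 if no ticks were accounted for at all.
 */
static double state_percent(const unsigned long long before[NUM_STATES],
			    const unsigned long long after[NUM_STATES],
			    int state)
{
	unsigned long long total = 0;
	int i;

	for (i = 0; i < NUM_STATES; i++)
		total += after[i] - before[i];

	if (total == 0)
		return 0.0;
	return 100.0 * (double)(after[state] - before[state]) / (double)total;
}
```

Fed the two 2.6.28 samples above, the idle share works out to 1000 of 1002 ticks. Fed the two broken samples from the modern kernel, only 13 ticks are accounted for in total, so the percentages depend almost entirely on which column happened to receive those few ticks.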

Now that I knew what was happening, the next question was: why? Where in the kernel would I find the code that increments the idle tick count? What could possibly be broken to only cause this to happen on the PXA16x?

I searched around a bit and happened to find this commit to the OLPC kernel. It’s very straightforward: it disables CONFIG_NO_HZ in xo_4_defconfig. The commit message by Jon Nettleton mentioned a problem that seemed very similar to mine:

There appears to be an accounting problem when CONFIG_NO_HZ is enabled in our kernel that leads to high system cpu percentage reporting. This in turn breaks fast user suspend.

Interesting! The commit title starts with “arm: mmp3” which means it’s another Marvell CPU newer than mine, so I was intrigued by this. CONFIG_NO_HZ is explained in detail in the kernel documentation. What it does is disable the periodic timer interrupts used by the scheduler when the CPU is idle. It turns out that my kernel had CONFIG_NO_HZ_IDLE=y, which is effectively the same thing as what used to be CONFIG_NO_HZ=y. So I had a config similar to the config OLPC was using when they experienced this problem!

This left me with some new ideas to try. Luckily, the kernel documentation linked above explains old and new names for all of the relevant config options. In particular, it told me that the equivalent of disabling CONFIG_NO_HZ on a newer kernel would be to set CONFIG_HZ_PERIODIC=y instead. It also informed me that I could boot with an extra kernel command line option “nohz=off” to temporarily try it without recompiling. So I booted with nohz=off and reran my tests:

Mem: 50240K used, 53616K free, 200K shrd, 3116K buff, 27276K cached
CPU: 0% usr 0% sys 0% nic 99% idle 0% io 0% irq 0% sirq

I immediately knew the problem was gone, because top was no longer showing 100% CPU usage. That “99% idle” section above was music to my ears. And obviously, running the same test with /proc/stat worked correctly too:

# cat /proc/stat | grep 'cpu ' ; sleep 10 ; cat /proc/stat | grep 'cpu '
cpu 345 0 731 7170 67 0 0 0 0 0
cpu 347 0 733 8167 67 0 0 0 0 0

OLPC’s workaround fixed the problem for me too!

This was exciting, but I didn’t feel satisfied with it. It felt a bit like magic. I knew I needed to dig further to discover exactly what the problem actually was. CONFIG_NO_HZ_IDLE should have been working fine, so I was pretty sure there was a bug somewhere deeper. I wanted to be able to use my CPU in dyntick-idle mode!

I decided to trace backwards starting from /proc/stat. Where is the code in the kernel that provides /proc/stat, and where does it get its idle time number?

The content of /proc/stat is provided by the show_stat function in fs/proc/stat.c. At the bottom of the file you can see the call to proc_create which sets up “stat” as a file in /proc. Anyway, show_stat calls get_idle_time, which calls get_cpu_idle_time_us. This last function is definitely interesting because the comment above it informs us that “This time is measured via accounting rather than sampling.”

It looks at the time by calling get_cpu_sleep_time_us, which itself calls ktime_get to find the current time to use in its accounting. I couldn’t imagine that any of the actual math being done here was wrong. Otherwise, wouldn’t there be a huge uproar, with lots of people on different architectures all seeing this? The fact that the same thing had happened on another Marvell ARCH_MMP processor was a major clue.

What does ktime_get return and how does it work? The idea behind it is simple. It returns a ktime_t which represents a monotonically increasing timer stored in nanoseconds. Specifically, the documentation says that the time returned by ktime_get starts at system boot but pauses during suspend. At some point it’s going to have to call into hardware-specific code, which is where I suspected the problem was.

If you follow the chain of calls from ktime_get, you end up going through some functions in kernel/time/timekeeping.c, eventually ending up at tk_clock_read which calls the read function provided by a clocksource. A clocksource is, as the kernel source code says, a “hardware abstraction for a free running counter.” Finally, I had found the path to some hardware-specific code.

I searched in the kernel’s arch/arm/mach-mmp source directory for anything related to clocksource. Bingo:

time.c: *   Support for clocksource and clockevents
time.c:static u64 clksrc_read(struct clocksource *cs)
time.c:static struct clocksource cksrc = {
time.c: .name = "clocksource",
time.c: clocksource_register_hz(&cksrc, rate);

Looking further at the clocksource that is registered in this file, I saw that the relevant function for grabbing the current time would be clksrc_read:

static struct clocksource cksrc = {
	.name		= "clocksource",
	.rating		= 200,
	.read		= clksrc_read,
	.mask		= CLOCKSOURCE_MASK(32),
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
};

That is simply a wrapper for timer_read, which is the actual code that reads from the hardware timer:

/*
 * FIXME: the timer needs some delay to stablize the counter capture
 */
static inline uint32_t timer_read(void)
{
	int delay = 100;

	__raw_writel(1, mmp_timer_base + TMR_CVWR(1));

	while (delay--)
		cpu_relax();

	return __raw_readl(mmp_timer_base + TMR_CVWR(1));
}

This code is more complicated than what I expected to see. I was thinking it would just be a simple register read. Instead, it has to write a 1 to the register, and then delay for a while, and then read back the same register. There was also a very noticeable FIXME in the comment for the function, which definitely raised a red flag in my mind.
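For context on what the timekeeping code does with the value this function returns, here is a rough sketch in C. It assumes a 32-bit counter, as suggested by the CLOCKSOURCE_MASK(32) in the struct above, and takes the counter rate as a parameter. The helper names are hypothetical, and the plain division is a simplification: as far as I understand, the kernel converts cycles to nanoseconds with a precomputed multiply and shift instead.

```c
#include <stdint.h>

/* Mask for an N-bit free-running counter, in the spirit of CLOCKSOURCE_MASK(). */
#define COUNTER_MASK(bits) \
	((uint64_t)((bits) < 64 ? ((1ULL << (bits)) - 1) : ~0ULL))

/*
 * Cycles elapsed between two raw counter reads. Masking the difference
 * gives the right answer even if the counter wrapped once in between.
 */
static uint64_t cycles_elapsed(uint64_t last, uint64_t now, uint64_t mask)
{
	return (now - last) & mask;
}

/*
 * Convert a cycle count to nanoseconds for a counter running at rate_hz.
 * Safe from overflow for cycle counts that fit in 32 bits.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t rate_hz)
{
	return cycles * 1000000000ULL / rate_hz;
}
```

At a rate of 3.25 MHz, a 32-bit counter wraps roughly every 22 minutes, so the masked subtraction matters. It also means the timekeeping code trusts every read to reflect the counter at the moment of the read; a stale value turns directly into a wrong elapsed time.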

I decided to go back to the Armada 16x software manual to read up on the hardware timers. My first thought was perhaps I had the timer configured for the wrong clock rate, but I was able to confirm that the timer was indeed configured for 3.25 MHz just like the code said it was. At the bottom of page 683 in the PDF, I found an interesting blurb about the timer count registers:

The timer values are read under risk of metastability. Therefore, reading each one of the CR Register values in the timer clock time base is accomplished by either of the following: Double-read procedure and comparing the two read values to ensure that the value is valid using the CVWR Register, especially effective for fast clock timers.

Some formatting or punctuation must have been lost during creation of the manual. I had to read the above paragraph about 10 times before I figured out what I think it was trying to say. Here’s my edited version:

The timer values are read under risk of metastability. Therefore, reading each one of the CR Register values in the timer clock time base is accomplished by either of the following:

  1. Double-read procedure and comparing the two read values to ensure that the value is valid.
  2. Using the CVWR Register, especially effective for fast clock timers.
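For reference, the first option could look something like the following sketch. This is only my reading of what a “double-read procedure” means, not code from Marvell or from any kernel. The cr pointer is a hypothetical stand-in for the memory-mapped counter register, and the retry limit is an arbitrary choice.

```c
#include <stdint.h>

/*
 * Double-read procedure: read the counter repeatedly until two consecutive
 * reads agree, giving up after max_tries additional reads.
 * Returns 0 and stores the agreed value in *value on success,
 * or -1 if no two consecutive reads matched.
 */
static int read_counter_stable(const volatile uint32_t *cr, int max_tries,
			       uint32_t *value)
{
	uint32_t prev = *cr;

	while (max_tries-- > 0) {
		uint32_t cur = *cr;

		if (cur == prev) {
			*value = cur;
			return 0;
		}
		/* The counter changed (or glitched) between reads; try again. */
		prev = cur;
	}
	return -1;
}
```

This only has a chance of working when the counter ticks slowly compared to two back-to-back register reads. With a fast timer clock the value may legitimately change between every pair of reads, which is presumably why the manual recommends the CVWR register for fast clock timers.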

Clearly the code in the Linux kernel was implementing the second solution. I checked out Marvell’s documentation for the CVWR register. Let’s just say the documentation about the register was quite sparse:

This register prevents the risk of instability on counter value reading

Write:
0x0: No effect
0x1: Capture value of CRn
Read:
Returns the captured value of CRn Register

Yeah, that’s definitely what the kernel was trying to do. On a whim, I decided to try replacing the existing code with a simple read of the counter register instead, ignoring the metastability risk altogether:

static inline uint32_t timer_read(void)
{
	return __raw_readl(mmp_timer_base + TMR_CR(1));
}

It worked! With this change, top showed good idle CPU time. I was confident I had found the source of the problem. I went back to the original broken code reading from the CVWR register and played around some more. If I increased the delay from 100 to 500 iterations, the original code started working. Tweaking it further, 200 didn’t work but 300 did.

I had finally found the root problem. My initial guesses about the issue were totally wrong. It was a simple timing problem during register reads. The existing code wasn’t waiting long enough after asking the timer to capture the latest value, so it was returning the value from the previous read attempt instead. The FIXME comment above the function was correct. Too bad the software manual didn’t give me any info about what the delay should actually be.
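Here is a toy model in C of why a stale read could make idle time vanish. It is an illustration built on assumptions, not the kernel’s accounting code and not a description of the real hardware: it assumes that a read which doesn’t wait long enough returns the result of the previous capture request, and that idle time is measured by taking one timestamp on entering idle and another on leaving it. The struct and function names are invented for the example.

```c
#include <stdint.h>

/* Toy model of the timer; not the real hardware behaviour. */
struct fake_timer {
	uint32_t counter;	/* free-running count, advanced by the caller */
	uint32_t latched;	/* what the capture register currently holds */
};

/*
 * Model a read that does not wait long enough: a capture is requested,
 * but the value returned is still the one from the previous request.
 */
static uint32_t read_too_early(struct fake_timer *t)
{
	uint32_t stale = t->latched;

	t->latched = t->counter;	/* the new capture lands after the read */
	return stale;
}

/* Model a read that waits long enough for the new capture to land. */
static uint32_t read_after_wait(struct fake_timer *t)
{
	t->latched = t->counter;
	return t->latched;
}

/* Idle accounting by timestamps: stamp on entry, stamp on exit, subtract. */
static uint32_t measure_idle(struct fake_timer *t, uint32_t idle_cycles,
			     uint32_t (*read_timer)(struct fake_timer *))
{
	uint32_t entry, leave;

	entry = read_timer(t);
	t->counter += idle_cycles;	/* the CPU sleeps; the timer keeps counting */
	leave = read_timer(t);
	return leave - entry;
}
```

In this model, if the timer was last read 2 cycles before entering idle and the CPU then sleeps for 32500 cycles (10 ms at 3.25 MHz), the stale version measures 2 cycles of idle time while the patient version measures all 32500. The missing time isn’t lost from the clock; it would show up in whatever interval gets measured next, which might explain why the busy time seemed to land in different columns from one run to the next.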

Around this point, I remembered that Chumby’s original 2.6.28 kernel worked correctly, so I decided to take a look at its version of this section of code. Not surprisingly, it had a different version of timer_read:

/*
 * Note: the timer needs some delay to stablize the counter capture
 */
static inline uint32_t timer_read(void)
{
	volatile int delay = 4;
	unsigned long flags;
	uint32_t val = 0;

	local_irq_save(flags);
	__raw_writel(1, TIMERS_VIRT_BASE + TMR_CVWR(0));

	while (delay--) {
		val = __raw_readl(TIMERS_VIRT_BASE + TMR_CVWR(0));
	}

	val = __raw_readl(TIMERS_VIRT_BASE + TMR_CVWR(0));
	local_irq_restore(flags);

	return val;
}

First of all, it was using timer 0 instead of timer 1, but the big difference was that it implemented the delay by performing multiple reads from the register rather than spinning in an empty loop. It was also disabling interrupts during the capture.

I searched for more Marvell vendor kernels to see if they did the timer read any differently:

The kernel at the first link seemed to be implementing the double-read approach that Marvell had suggested, but only if the timer was configured for 32.768 kHz. Otherwise it did something similar to the Chumby 2.6.28 kernel, but without disabling interrupts. The second link seemed to just ignore the risk of metastability altogether when in 32.768 kHz mode, and otherwise also implemented a solution similar to the old Chumby kernel.

This bug has been in the kernel ever since the MMP architecture was added. The original patch submission adding PXA168 support in early 2009 contained the problematic code, but it was using timer 0 instead of timer 1. So it’s kind of a hybrid of the Chumby kernel’s code and the newer mainline code. At this point it didn’t have FIXME in the comment:

/*
 * Note: the timer needs some delay to stablize the counter capture
 */
static inline uint32_t timer_read(void)
{
	int delay = 100;

	__raw_writel(1, TIMERS_VIRT_BASE + TMR_CVWR(0));

	while (delay--)
		cpu_relax();

	return __raw_readl(TIMERS_VIRT_BASE + TMR_CVWR(0));
}

However, the FIXME is in the initial commit in the kernel git history, so someone at the time must have known that something was still broken in the code. It appears that changing to timer 1 instead of 0 was a change added later to ensure that clockevents had their own timer instead, fixing a bug observed on the OLPC XO-1.75.

I was shocked that nobody had ever noticed this bug in the mainline kernel, but that’s what happened. I guess to be more accurate, the OLPC project did notice it at one point, but since they had a workaround it wasn’t a huge deal. I decided to go ahead and fix the underlying problem.

Because Marvell vendor kernels read from the register multiple times to implement the delay for the CVWR approach, that’s what I went for. I had to decide how many iterations to delay. Chumby’s kernel did a total of 5 reads of the CVWR register. The other two kernels did a total of 3 reads. I opted to use 4 as a middle ground, just in case Chumby had a real reason to have more iterations. I also decided against disabling interrupts during the delay. It seemed pointless — what’s the worst that could happen? Two timer reads happen nearly simultaneously and the first read ends up returning a slightly later time value than it would have otherwise returned? Or maybe the first read attempt is canceled and returns the old timer value instead? I suppose that could be problematic. Oh well, the existing code wasn’t disabling interrupts either.
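The idea can be sketched in plain C against a fake timer, so that it runs anywhere. This is not the patch that was merged. The fake_timer struct, its helpers, and especially the notion that the capture latency can be counted in register reads are assumptions made purely for illustration; on real hardware the latency is a matter of time, and each register read simply takes a while.

```c
#include <stdint.h>

/* Toy stand-in for the timer; measuring latency in reads is an assumption. */
struct fake_timer {
	uint32_t counter;	/* current value of the free-running counter */
	uint32_t pending;	/* value picked up by the latest capture request */
	uint32_t latched;	/* value the capture register currently returns */
	int latency;		/* register reads needed before a capture lands */
	int reads_left;		/* reads remaining until the pending value lands */
};

/* Model of writing 1 to the capture register. */
static void fake_capture(struct fake_timer *t)
{
	t->pending = t->counter;
	t->reads_left = t->latency;
}

/* Model of reading the capture register. */
static uint32_t fake_read(struct fake_timer *t)
{
	if (t->reads_left > 0) {
		t->reads_left--;
		return t->latched;	/* the capture has not landed yet */
	}
	t->latched = t->pending;
	return t->latched;
}

/* Request a capture, then read the register `reads` times and keep the last value. */
static uint32_t timer_read_sketch(struct fake_timer *t, int reads)
{
	uint32_t val = 0;

	fake_capture(t);
	while (reads-- > 0)
		val = fake_read(t);
	return val;
}
```

With a modeled latency of 2 reads, calling timer_read_sketch with a single read returns the old latched value, while 4 reads return the freshly captured one. The appeal of using reads as the delay is that each one is a real bus access to the timer block, so the wait should scale with the hardware rather than with how quickly the CPU can spin through an empty loop.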

I first submitted my fix for this problem in September 2022, but I didn’t receive any responses. I ended up resubmitting it later that year and CCing the main SoC maintainers the second time, and they took care of merging my fix. It was finally released with Linux 6.2 and was also backported to several 4.x, 5.x, and 6.x kernels. Ever since I implemented this fix, I haven’t noticed any problems with CPU time reporting on my Chumby.

That’s the story of how I figured out why the CPU usage was always showing up as 100% on my Chumby 8. It was quite the journey through the source code of BusyBox, /proc, and multiple layers of kernel code to find a little gremlin in the code that reads from the PXA168’s timer count register. It was very satisfying to be able to fix the problem! The time I spent was totally worth it too. I learned all about how procfs works and how top gets its info about CPU usage. I still feel like I know almost nothing about the internals of the Linux kernel, but solving a problem like this was a fantastic way to dip my toes into it.

In the next post in my Chumby kernel upgrade series, I’m going to go over how I got the real-time clock working, so that it would remember the date and time while it was turned off.

9 comments

  1. Man, you really have a knack at keeping the reader engaged! All while digging through the most obscure corners of very niche HW. Regards!

  2. Thank you Oleg! I really appreciate your kind comment about my posts!

  3. This was a great read! Just as Oleg mentioned, it was captivating through the very end and I felt like I learned something. Thanks for sharing 🙂

  4. I love these kind of bugs where the fix seems simple but the journey to get there is so long and interesting! Thanks!

  5. Well written. I rarely read through to the end but I had to know what caused it. Thank you for an engaging writeup

  6. Thank you all for the nice comments about this post! I’m so glad you enjoyed reading it. It was definitely a fun one to write, with lots of twists and turns.

  7. Wow, thanks for this article. I also had consternation about why my CPU was 100% used on different embedded boards. I tried to find an explanation without success; it seemed too big a task for me. Now, thanks to you, I understand how top works and some nice stuff about idle.

  8. What does metastability mean for a register read? Could it be handled by the hardware instead?

  9. I’m glad the article was useful, mztulip!

    Anon, I am not sure about that either. I’m far from a hardware expert, but I wonder if it’s some kind of issue with crossing clock domains or something along those lines. It’s the first time I’ve ever run into a timer count register that can’t just be read directly.