(*This content is the English version of the article posted in eSOL’s Japanese Blog in August.)
Continuing from the previous article, I will walk through a troubleshooting example for a corrupted FAT on an SD memory card.
The hypothesis the veteran support engineer formulated was that the file system initiates memory reading before the DMA controller completes the transfer of data from the SD controller to the memory.
Figure 1: The veteran support engineer’s hypothesis
Even after hearing that, I still had doubts. The hypothesis didn't seem to provide a plausible explanation for the observed symptoms.
First, I had embedded trap code in the SD driver's receive processing. If the data was already corrupted on reception, as the hypothesis suggested, this trap code should have caught it. Looking into this, I found that under the new hypothesis the received data was indeed corrupted, but not in the way the trap expected, so the trap code was never triggered. The observed symptoms had a distinctive feature: only one specific 32-byte segment within the contiguous FAT was corrupted, with no damage to the surrounding data. I had therefore set the trap condition to fire when the areas before and after the segment were in use as FAT while only the portion in between was free. If the hypothesis was correct, however, at the time of reception in the SD driver the preceding portion would be in use as FAT while everything after it would be free. The trap code would therefore be bypassed (as shown in Figure 1, step (2)).
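The trap condition described above can be sketched roughly as follows. This is a minimal illustration, not the actual trap code from the SD driver: it assumes 32-bit FAT entries (FAT32 style, where a value of 0 marks a free cluster) and treats the FAT buffer as a sequence of 32-byte segments. The names `fat_trap_find`, `seg_is_free`, and `seg_is_used` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SEG_BYTES   32u                                /* size of the corrupted segment */
#define SEG_ENTRIES (SEG_BYTES / sizeof(uint32_t))     /* 8 entries per segment (assumed FAT32) */
#define FAT32_MASK  0x0FFFFFFFu                        /* upper 4 bits of an entry are reserved */

/* Returns 1 if every entry in segment `seg` is free (value 0). */
static int seg_is_free(const uint32_t *fat, size_t seg)
{
    for (size_t i = 0; i < SEG_ENTRIES; i++) {
        if ((fat[seg * SEG_ENTRIES + i] & FAT32_MASK) != 0) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if every entry in segment `seg` is in use (non-zero). */
static int seg_is_used(const uint32_t *fat, size_t seg)
{
    for (size_t i = 0; i < SEG_ENTRIES; i++) {
        if ((fat[seg * SEG_ENTRIES + i] & FAT32_MASK) == 0) {
            return 0;
        }
    }
    return 1;
}

/*
 * Trap condition: a fully free 32-byte segment whose neighbouring
 * segments are both fully in use. Returns the index of the first
 * such segment, or -1 if the condition is not met.
 */
long fat_trap_find(const uint32_t *fat, size_t n_entries)
{
    size_t n_segs = n_entries / SEG_ENTRIES;

    for (size_t s = 1; s + 1 < n_segs; s++) {
        if (seg_is_used(fat, s - 1) && seg_is_free(fat, s) && seg_is_used(fat, s + 1)) {
            return (long)s;
        }
    }
    return -1;
}
```

With a buffer whose middle segment is free and whose neighbours are in use, `fat_trap_find` returns that segment's index. With a partially received buffer, where everything after the received portion is still zero, it returns -1, which is why a trap of this kind would be bypassed under the hypothesis.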
I had other questions as well. If the data from the DMA arrived after the FAT had been updated, wouldn't the updated FAT data be overwritten by it? Taking the cache into account, this could be explained. Memory access from the CPU goes through the cache. When the CPU read the received data, it was placed in the cache (Figure 1, step (2)), and the updated FAT content was first written to the cache (Figure 1, step (3)). The remaining data from the DMA arrived after that. Generally, memory access from DMA does not go through the cache (though this depends on the hardware specifications and the cache level). In other words, the data from the DMA was written directly to memory, so the updated FAT data held in the cache was not overwritten (Figure 1, step (4)).
However, I still had questions. The last 32 bytes of the updated FAT data in the cache were all zeros, whereas the last 32 bytes of the data written back to the SD memory card contained the continuation of the FAT data. When the cache was flushed (written back to memory) and the memory content was then written to the SD memory card, wouldn't the last 32 bytes also be written as zeros? This could be explained by the behavior of the cache's Dirty bit. In a write-back cache, the Dirty bit is a hardware flag indicating that a CPU write has updated only the cache line and has not yet been written to main memory. During a cache flush, the cache controller writes back the cache lines whose Dirty bit is set; lines whose Dirty bit is not set are not written back. When the cache and memory contents become inconsistent because DMA updated the memory, with no write from the CPU, the Dirty bit is not set. Consequently, even when the cache was flushed, the content of that cache line was not written to memory, because its Dirty bit was not set. Hence, as shown in Figure 1, step (5), the value visible to the CPU (in the cache) was 0, but "4444" was written to the SD memory card.
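The interaction between CPU accesses, DMA writes, and the Dirty bit described above can be replayed with a toy model. The sketch below is only an illustration under simplifying assumptions: a write-back, write-allocate cache with two 32-byte lines, a DMA engine that bypasses the cache entirely, and made-up byte values (0x44 stands in for the "4444" data). All names (`toy_system`, `cpu_read`, `cpu_write`, `dma_write_line`, `cache_flush`, `replay_early_completion`) are hypothetical, and the step numbers only loosely follow the description above.

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32   /* assumed cache line size, equal to the 32-byte segment */
#define NUM_LINES 2    /* line 0: FAT area being updated, line 1: the last 32 bytes */

struct cache_line {
    int     valid;
    int     dirty;
    uint8_t data[LINE_SIZE];
};

struct toy_system {
    uint8_t           memory[NUM_LINES][LINE_SIZE];  /* main memory, the DMA target */
    struct cache_line cache[NUM_LINES];              /* write-back cache seen by the CPU */
};

/* CPU read: on a miss the line is filled from memory and marked clean. */
uint8_t cpu_read(struct toy_system *s, int line, int off)
{
    struct cache_line *c = &s->cache[line];
    if (!c->valid) {
        memcpy(c->data, s->memory[line], LINE_SIZE);
        c->valid = 1;
        c->dirty = 0;
    }
    return c->data[off];
}

/* CPU write: updates the cache only and sets the Dirty bit. */
void cpu_write(struct toy_system *s, int line, int off, uint8_t value)
{
    (void)cpu_read(s, line, off);        /* write-allocate: make sure the line is cached */
    s->cache[line].data[off] = value;
    s->cache[line].dirty = 1;
}

/* DMA write: goes straight to memory and touches neither valid nor dirty. */
void dma_write_line(struct toy_system *s, int line, uint8_t value)
{
    memset(s->memory[line], value, LINE_SIZE);
}

/* Flush: writes back only the lines whose Dirty bit is set. */
void cache_flush(struct toy_system *s)
{
    for (int i = 0; i < NUM_LINES; i++) {
        struct cache_line *c = &s->cache[i];
        if (c->valid && c->dirty) {
            memcpy(s->memory[i], c->data, LINE_SIZE);
            c->dirty = 0;
        }
    }
}

/* Replays the hypothesised sequence on a zero-initialised system. */
void replay_early_completion(struct toy_system *s)
{
    memset(s, 0, sizeof *s);
    dma_write_line(s, 0, 0x33);   /* (1) completion is signalled with only line 0 in memory */
    (void)cpu_read(s, 0, 0);      /* (2) the file system reads the buffer ...               */
    (void)cpu_read(s, 1, 0);      /*     ... and caches stale zeros for line 1              */
    cpu_write(s, 0, 0, 0x55);     /* (3) the FAT update lands in the cache, line 0 is dirty */
    dma_write_line(s, 1, 0x44);   /* (4) the late DMA data goes directly to memory          */
    cache_flush(s);               /* (5) only dirty line 0 is written back                  */
}
```

After `replay_early_completion`, memory line 1 holds the late DMA data (0x44) because the clean cache line was never written back, while a CPU read of line 1 still returns the stale zeros held in the cache. This mirrors the observation that the CPU saw zeros while the FAT data reached the SD memory card intact.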
In the end, investigating along the lines of this hypothesis revealed a hardware issue in which the transfer-completion interrupt could occur before the DMA transfer had actually completed. What are your thoughts on this? Would you have been able to arrive at this hypothesis from the symptoms alone?
Now, let's analyze the key points that allowed the veteran support engineer to come up with this hypothesis. By reflecting on this, I can gain insights for future troubleshooting. Looking back, I believe there are three main points to consider regarding this hypothesis.
1. Inference Based on Similar Past Cases
In fact, the senior engineer mentioned having encountered a similar issue in the past. This time the cause turned out to be a hardware malfunction, but the earlier issue was due to a software implementation error. In that case, the SD controller transferred data from the SD memory card to a FIFO, and DMA then transferred it from the FIFO to memory. By mistake, the receive API returned when the SD controller signaled that the transfer to the FIFO was complete, which was too early. Since the transfer to memory was handled by DMA, the return should have been triggered by DMA transfer completion. The ability to propose the hypothesis that the system started reading the received data before reception was complete owed a great deal to having experienced a similar issue before. In other words, identifying the root cause of a problem at first glance is challenging, but it is often easier to pinpoint the cause when you have encountered a similar issue before. This is why I decided to share this case. I believe that knowing about past cases makes it easier to form hypotheses when troubleshooting future issues, so I intend to continue sharing such cases.
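The software error in that earlier case can be pictured with a small sketch. It is not the code from that incident: the structure `xfer_status`, its flags, and the functions `rx_wait_buggy` and `rx_wait_fixed` are hypothetical, and in a real driver the flags would be set by interrupt handlers. The only point is which completion event the receive path waits on.

```c
#include <stdbool.h>

/* Hypothetical completion flags; a real driver would set these from interrupt handlers. */
struct xfer_status {
    volatile bool sd_fifo_done;   /* SD controller finished moving data: card -> FIFO */
    volatile bool dma_done;       /* DMA finished moving data: FIFO -> memory          */
};

enum rx_result { RX_OK = 0, RX_TIMEOUT = -1 };

/*
 * Error pattern: returns as soon as the card -> FIFO transfer is done.
 * The data may not have reached memory yet, so the caller can read a
 * partially filled buffer.
 */
enum rx_result rx_wait_buggy(const struct xfer_status *st, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (st->sd_fifo_done) {
            return RX_OK;
        }
    }
    return RX_TIMEOUT;
}

/*
 * Intended behaviour: the buffer in memory is filled by DMA, so the
 * return is gated on DMA completion instead.
 */
enum rx_result rx_wait_fixed(const struct xfer_status *st, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (st->dma_done) {
            return RX_OK;
        }
    }
    return RX_TIMEOUT;
}
```

When `sd_fifo_done` is set but `dma_done` is not, `rx_wait_buggy` reports success while `rx_wait_fixed` keeps waiting. Note that in the hardware case discussed in this article, the completion interrupt itself could arrive before the DMA transfer had finished, so a check of this kind alone would not have prevented it.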
2. Having a Strong Understanding of the Relationship Between CPU, DMA, Cache, and Memory
This hypothesis wouldn't have been derived without a good understanding of the relationship between CPU, DMA, and cache.
Considering only the data on the SD memory card and in memory wouldn't explain this hypothesis. Taking cache into account helps in understanding and explaining the hypothesis. The hypothesis outlined in steps (1) to (4) in Figure 1 was established with the consideration that the content in the cache visible to the CPU might not align with the content written to memory by DMA. The relationship between CPU and DMA, as well as between cache and memory, is a critical aspect that often poses challenges, not only in this case but in various scenarios. I believe it's crucial for embedded software engineers to have a solid understanding of these dynamics.
3. Proficiency in Cache Specifications
One of the intriguing aspects of the observed symptoms this time was that only a portion in the middle of the FAT was being altered, while the data before and after remained unaffected.
If the file system had received the data when only part of it had been transferred, as shown in Figure 1, one would expect the section with "4444" to be overwritten with zeros. However, the "4444" read from the SD memory card was written back as it was. Step (5) in Figure 1 explains this, and the explanation only makes sense with an understanding of the cache's Dirty bit. Among hardware mechanisms, the cache is particularly closely involved in device driver implementation, so understanding details such as the role of the Dirty bit is crucial for embedded engineers.
In this article, I presented a troubleshooting scenario guided by a veteran support engineer. I've also detailed how this engineer successfully resolved the issue, highlighting three key points. I trust that fellow engineers reading this article will benefit from this simulated experience, enhancing their troubleshooting capabilities.
Moreover, eSOL’s support & maintenance service allows customers who have purchased products to consult with eSOL engineers when facing various issues during the product development process, as in this case.
At eSOL, our in-house developed products are supported by knowledgeable and experienced engineers who assist in ensuring a smooth development process for our global customers. Our support structure has received high praise from our clients.
For more information on our Support & Maintenance, please contact eSOL.