Loader bug.


Re: Loader bug.

Postby joseluisquiroga » Fri Apr 07, 2017 7:52 pm

You have already reproduced the bug when you admit that the code provided

https://github.com/joseluisquiroga/para ... lq-test-22

fails the assert when run after boot.

If you look at the 672-line file

https://github.com/joseluisquiroga/para ... ader_znq.c

you will see two functions


uint8_t*
bj_memload(uint8_t* dest, const uint8_t* src, bj_size_t sz){
    bj_size_t idx = 0;
    for(idx = 0; idx < sz; idx++){
        bj_force_assig(dest[idx], src[idx]);
    }
    return dest;
}


void
bj_ck_memload(uint8_t* dst, uint8_t* src, size_t sz){
    bool ok = true;
    size_t aa; /* same type as sz, avoids a signed/unsigned comparison */
    for(aa = 0; aa < sz; aa++){
        if(dst[aa] != src[aa]){
            ok = false;
            break;
        }
    }
    if(! ok){
        /* dump both buffers for inspection, then abort */
        write_file("SOURCE_shd_mem_dump.dat", src, sz, false);
        write_file("DEST_shd_mem_dump.dat", dst, sz, false);
        bjh_abort_func(9, "bj_ck_memload() FAILED !! CODE_LOADING_FAILED !!\n");
    }
}

Where bj_force_assig is just:

#define bj_force_assig(var, val) do { (var) = (val); } while((var) != (val));

As you can see these are trivial implementations of memcpy and memcmp.
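For readers who want to try the same copy-then-check idea in isolation, here is a minimal sketch. It is not the code from the repository: the names copy_verified and first_mismatch are hypothetical, and the use of volatile pointers is an assumption added here, because without volatile a compiler is allowed to assume that a byte still holds the value just stored and may drop the read-back in the do/while loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original loader): copy byte by byte
 * through a volatile pointer so every store and every read-back really
 * goes to memory, then retry the store until the read-back matches. */
static uint8_t *
copy_verified(volatile uint8_t *dest, const uint8_t *src, size_t sz)
{
    for (size_t idx = 0; idx < sz; idx++) {
        do {
            dest[idx] = src[idx];
        } while (dest[idx] != src[idx]);
    }
    return (uint8_t *)dest;
}

/* Hypothetical helper: return the index of the first byte that differs,
 * or sz when both regions hold the same bytes. */
static size_t
first_mismatch(const volatile uint8_t *dst, const uint8_t *src, size_t sz)
{
    for (size_t idx = 0; idx < sz; idx++) {
        if (dst[idx] != src[idx]) {
            return idx;
        }
    }
    return sz;
}
```

Calling first_mismatch right after copy_verified on the same buffers should return sz. Any smaller value means the destination changed, or never held the data, between the copy and the check, and the returned index says where to look in a dump.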

The rest of the code is almost pure copy-paste plus some renaming, with NO CHANGES to the VALUES PASSED to the memcpy call in the original function e_process_elf (there called bjl_process_elf) for shared mem code.

You can copy-paste these functions and the define into the original code of e_load_group, make the obvious changes so that it compiles, and then use them together with the PROVIDED elf if you want a reproduction with the ORIGINAL code. We want these and NOT memcpy NOR memcmp because the latter have optimizations that MAY NOT ALLOW the process to be interrupted. We want that interruption so that the memory gets corrupted by some other process (a kernel process corrupting the NON insulated shared memory of the kernel process controlling /dev/epiphany/mesh0), and so the assert fails and REPRODUCES the bug with the ORIGINAL code. The only user processes that should have access to the shared memory are the ones that have opened /dev/epiphany/mesh0. In the given examples I am VERY confident that does NOT happen.

REMEMBER I make "NO CHANGES to the VALUES PASSED to the memcpy call". The failure in the given examples happens between my "memcpy" and my "memcmp". What has JUST been "memcpy" ed is DIFFERENT when "memcmp" ed !!!!

It is the driver my friend.

Before all this I already knew there was a bug in e_load_group, because I had checked, from the epiphany side, the loaded machine code of a function of mine in a weird case that was happening to my code: my program blocked the epiphany when linked with a certain file and did not when not linked with that same file (nothing from that file was being called or used in the running code). So I decided to check from the epiphany side what the loaded code in shared memory was for that function, and found that the machine code (the memory in shared mem) was corrupted when linking with the big file. So I decided to use the source code of e_load_group to track what was happening. You can see what I was doing in the first example. I was already planning to implement my own loader for my code (because I want my code to be able to load different epiphany elfs and a common base of shared code), so I renamed things and made some other changes.

Yes, I thought initially that the problem was with the mmap call. I WAS WRONG. That is why I load the elf file into the heap (to check if it was the mmap func) in the second example. The mems cannot overlap if the file is in the heap (well, they could, but that would be another story). So then I thought it was in the e_alloc call of e_load_group. The one that is already well documented in the specs. It was not. I WAS WRONG. That is why there is a copy-paste of the e_alloc in the last example.

I have neither the experience, nor the time, nor the interest to get into your code, and if I were you I would not get into my code either.

I have never implemented a driver in my life so literally "I cannot help you with that".
Here is what I found on the internet about this kind of driver (shared mem):

http://myclipnotes.blogspot.com.co/2015 ... odule.html

I do not see a call to remap_pfn_range in your code, but "I have never implemented a driver in my life". So what do I know about whether there should be one.

REMEMBER I make "NO CHANGES to the VALUES PASSED to the memcpy call". The failure in the given examples happens between my "memcpy" and my "memcmp". What has JUST been "memcpy" ed is DIFFERENT when "memcmp" ed right AFTER the "memcpy" !!!!

It is the driver my friend.

And that is my take on this issue.

Shalom.

JLQ.
joseluisquiroga
 
Posts: 24
Joined: Fri Dec 09, 2016 4:41 pm
Location: Bogota, Colombia

Re: Loader bug.

Postby GreggChandler » Tue Apr 25, 2017 5:19 am

Have you verified that enough stack space is left for your epiphany program? With a few of my applications, I have seen seemingly random behavior when not enough stack space is left after the build process. The Epiphany can't automatically grow the stack like the host can. The e-size tool with the -A option is useful to me in verifying the sizes of various program segments and their locations in the address space. Insufficient stack can cause a lot of strange, inexplicable behavior. In my case, I created a new linker script and moved some initialization code and data to the external shared memory. It runs slower, but leaves more room for the higher performance code that needs to run in core memory.
GreggChandler
 
Posts: 66
Joined: Sun Feb 12, 2017 1:56 am

Re: Loader bug.

Postby joseluisquiroga » Sun May 14, 2017 11:41 pm

It seems that I make people waste their time looking at my mistakes.

I apologize for that. I hope next time I do a better job.

Olajep, in a private communication, showed me that I WAS WRONG yet AGAIN. These examples show nothing except that I make mistakes and think they are others'.

Ola Jeppsson: Thank you for your time and patience.

Here is the corrected code that loads ok:

https://github.com/joseluisquiroga/para ... -test-22ok

I still have to find out why the shared memory was corrupted when I was using the original e_load_group. Probably another mistake of mine. But, as I said earlier, "I hope" not. Just so that I do not lose all credibility, if that has not already happened.

Sorry to everyone who looked at this post again. And thanks again to Ola Jeppsson.

JLQ
