ESPUSB - Chat about the software components

By cnlohr
#52752
_UserExceptionVector and _UserExceptionVector_1 are actually compiled to different sections, so the compiler doesn't know the address of _UserExceptionVector_1 at compile time.


It sounds like you're implying 'j' cannot be retargeted at link time, and that it must be a 'call0' for that reason. I never really considered it, and never noticed because I do everything all-together.

Actually, writing this out makes me think that you might be better off creating your own set of vectors as well - just like the esp-open-rtos approach.


This feels like a very heavy approach, and feels like it might not be as compatible with other systems. Though I agree it is a lot cleaner. Speaking of that...

This also means you don't have to --wrap anything at all, or memcpy random bits of memory.


I did... and it works... sort of. It can enumerate and it works. My latency went from 1.86us to 0.78us, a savings of about 74 cycles! (Latency can be decreased further on my end.) I say "sort of" because, unless I also attach my handler to the existing interrupt system, everything breaks. My guess is that sometimes an interrupt is already being handled, and if the levels change while in their handler, everything explodes because the handler is null. (Which is OKAY! I can simply throw out the interrupt.)

(in a .S, anywhere...)
Code:
//This code will be memcpy'd over top of _UserExceptionVector, since I can't figure out how to override it with GCC.
.global replacement_user_vect
.align 4
replacement_user_vect:
//Original code:
   _wsr.excsave1   a0
   _rsr.interrupt a0
   _bbci a0, 4, not_a_gpio_interrupt
   _nop //Will get replaced with a "call0" to my code. (TODO: Can this be a jump?)
not_a_gpio_interrupt:
   _call0 _UserExceptionVector_1
   _ill
   _ill //Zero padding to make it so we can see it clearly in a memory dump.


And here's my memcpy surgery nastiness, not yet edited or commented. It builds the 3-byte call0 opcodes by hand, since we can't rely on the compiler to relocate them nicely.
Code:
   int i;
   uint8_t * ovect = (uint8_t*)0x40100050;
   uint32_t * ovect32 = (uint32_t*)0x40100050;
   uint8_t * replacevect8 = (uint8_t*)(&replacement_user_vect);
   uint8_t * targ8 = (uint8_t*)(&_UserExceptionVector_1);
   uint8_t vect8copy[16]; //We only need 16 bytes.
   ets_memcpy( vect8copy, (&replacement_user_vect), sizeof(vect8copy) ); //Copy exactly as many bytes as the buffer holds.

   //+1 to +4   (When 'call' instruction is at +3)
   //+5 to +8   (When 'call' instruction is at +5)
   //+5 to +8   (When 'call' instruction is at +6)
   //+9 to +12  (When 'call' instruction is at +9)
   //+13 to +16 (When 'call' instruction is at +12)
   int delta_gp = ((uint8_t*)&gpio_intr) - (ovect+11);
   int delta_ue =               targ8 - (ovect+14);

   delta_ue = (delta_ue & ~0x03)<<4;
   delta_gp = (delta_gp & ~0x03)<<4;

   //for call0 to the gpio handler.
   vect8copy[9] = 0x05 | (delta_gp & 0xff); //lsb of jump
   vect8copy[10] = (delta_gp >> 8)&0xff; //...
   vect8copy[11] = (delta_gp >> 16)&0xff; //msb of jump.

   //For call0 to the normal handler
   vect8copy[12] = 0x05 | (delta_ue & 0xff); //lsb of jump
   vect8copy[13] = (delta_ue >> 8)&0xff; //...
   vect8copy[14] = (delta_ue >> 16)&0xff; //msb of jump.


   printf( "%08x %08x %08x\n", ovect, replacevect8, delta_ue);
   for( i = 0; i <0x10; i++ )
   {
      printf( "%02x ", vect8copy[i] );
   }

   ovect32[0] = ((uint32_t*)vect8copy)[0];
   ovect32[1] = ((uint32_t*)vect8copy)[1];
   ovect32[2] = ((uint32_t*)vect8copy)[2];
   ovect32[3] = ((uint32_t*)vect8copy)[3];


Regarding the xtos stuff... cool. I didn't see any of that, but I made a random guess and restored the state myself with:

Code:
   _rsr.excsave1 a0
   rsync
   rfe


There's a real gotcha with understanding Xtensa: exceptions and interrupts are two different things.


That is a small problem. I still have enough room in my vector to handle that if I have to. It shouldn't be too bad to read in EXCCAUSE and jump to the regular handler if it's not set to 4. But... one of the nice things about this USB mess is that it's okay if I miss interrupts or sometimes call the interrupt without anything legitimate. It would be better to be clean, but for initial testing, things can be very dirty. Especially in USB full speed! With full speed, I can tell if the interrupt was spurious in about 0.3-0.4us.
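
As a rough illustration of the EXCCAUSE check described above, here is a sketch of the dispatch decision in plain C. The real check would live in the assembly vector itself. `take_fast_gpio_path` and both macro names are hypothetical; the cause value 4 comes from this post and the bit number 4 from the `bbci a0, 4` test in the vector code earlier.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXCCAUSE_FAST_PATH  4u  /* assumed: the EXCCAUSE value "4" named in the post */
#define GPIO_INTERRUPT_BIT  4u  /* bit tested by "bbci a0, 4" in the vector */

/*
 * Hypothetical model of the vector's decision: take the fast USB/GPIO path
 * only when the exception cause is 4 and the GPIO bit is set in the
 * INTERRUPT register; anything else goes to the regular handler.
 */
static bool take_fast_gpio_path(uint32_t exccause, uint32_t interrupt_reg)
{
    if (exccause != EXCCAUSE_FAST_PATH)
        return false;                      /* a real exception, not an interrupt */
    return ((interrupt_reg >> GPIO_INTERRUPT_BIT) & 1u) != 0;
}
```

A call such as `take_fast_gpio_path(4, 0x10)` selects the fast path, while any other cause value falls through to the stock handler regardless of the pending bits.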


If you push the non-working wrap code & makefile you have to a branch somewhere, I can probably take a look. But like I said above, maybe you're better off just writing your own set of vectors and leaving the SDK ones as-is.


I don't know if it's worth your time - I have already asked on Stack Overflow etc. I think I have everything I critically need now to start investigating full-speed.


P.S. I really appreciate your willingness to work with me on this, especially since there's a very high chance it will all be for naught.
By forlotto
#52753 Subscribed to thread; interesting to see if it will be done.

-forlotto
Last edited by forlotto on Fri Aug 12, 2016 10:33 pm, edited 1 time in total.
By projectgus
#52760
cnlohr wrote:
_UserExceptionVector and _UserExceptionVector_1 are actually compiled to different sections, so the compiler doesn't know the address of _UserExceptionVector_1 at compile time.


It sounds like you're implying 'j' cannot be retargeted at link time, and that it must be a 'call0' for that reason. I never really considered it, and never noticed because I do everything all-together.


Before the objects are linked, their final addresses aren't known. If they're in the same section of the same object file, their relative addresses are known (so the compiler/assembler can generate a PC-relative jump). But if they're in different sections, the compiler can't do this - because until link-time it doesn't know where the two sections may get placed relative to each other - they may end up too far apart.

That said, the linker can often "relax" the call back to a jump after the link pass finishes, if it then looks at the two addresses and they're close enough to do a PC-relative jump. This saves some cycles because you don't need to load the literal any more, but it doesn't really save code size (I don't think).

You'd have to look at the final linked .elf output to know if this happens in this case or not - it might.

cnlohr wrote:I don't know if it's worth your time - I have already asked stack overflow etc. I think I have everything I critically need now to start investigating full-speed.


OK. Good to hear. :)

P.S. I really appreciate your willingness to work with me on this, especially since there's a very high chance it will all be for naught.


No problems, I think this is awesome. I know a few people had thought about it, but it's great you're actually doing it and it's working! If you can achieve full speed that'll be really amazing.

Let me know if I can help with anything else.
By cnlohr
#52820 Just wanted to let y'all know, turns out you can just fix the linker script! @vogelchr on Twitter pointed this out. Now it's a lot cleaner, and I can tweak out a few more clock cycles by doing the l32r right before doing the jump.

Code:
EXTERN(replacement_user_vect)


Code:
  .text : ALIGN(4)
  {
    _stext = .;
    _text_start = ABSOLUTE(.);
    *(.UserEnter.text)
    . = ALIGN(16);
    *(.DebugExceptionVector.text)
    . = ALIGN(16);
    *(.NMIExceptionVector.text)
    . = ALIGN(16);
    *(.KernelExceptionVector.text)
    LONG(0)
    LONG(0)
    LONG(0)
    LONG(0)
    . = ALIGN(16);
    *(.relocvec.text)  /* <<< Right here. */
    LONG(0)
    LONG(0)
    . = ALIGN(16);
    *(.DoubleExceptionVector.text)
    LONG(0)
    LONG(0)
    LONG(0)
    LONG(0)
    . = ALIGN (16);
    *(.entry.text)
    *(.init.literal)
    *(.init)
    *(.literal .text .literal.* .text.* .stub .gnu.warning .gnu.linkonce.literal.* .gnu.linkonce.t.*.literal .gnu.linkonce.t.*)
    *(.fini.literal)
    *(.fini)
    *(.gnu.version)
    _text_end = ABSOLUTE(.);
    _etext = .;
  } >iram1_0_seg :iram1_0_phdr