The 5th Whole Brain Architecture Hackathon Forum


> Positional coding, what-where data structure, action output

I'm finding it a little difficult to understand the implementation for the visual pathway.

I may have missed it in the wiki, but I'm curious if anyone can provide clarification on the following:

1. How can I interpret the positional encoding? I see the test image but I'm not sure how to understand it. E.g. why are the pe_x and pe_y figures different heights? From the description, I was expecting this to be a mapping of a floating point to a 2D coordinate in screen space.
2. What is the "what-where" data structure? That is, how are the positional encodings and outputs of visual cortex concatenated/transformed into a tensor representation prior to being fed into PFC and MTL?

Then, separately, I have a high-level question about the RL setup. I can see that the DM2S environment has 3 "choice actions" (one for each button press, plus no action) in addition to N integers representing saccades to on-screen grid locations (e.g. one of a 5x5 grid). My understanding is that the actor produces exactly one action on each step, which may be either a button action or a gaze action, but not both. Further, a non-zero reward is provided once, when the correct button action is chosen during the STATE_PLAY_SHOW state. Is that correct?

Thanks!

Posted by: jrgordon @ June 23, 2021, 8:48 p.m.

The purpose of the Positional Encoding (PE) is to provide an embedding of gaze position to the higher brain centres (e.g. MTL and PFC). Since these are neural networks, a scalar floating point number per dimension isn't usable as input; it must be a vector representation. We used the PE concept from Transformers (see the original paper, Attention Is All You Need, or blog posts explaining PE) because it has desirable properties: it's deterministic, each position maps to a unique vector, and it preserves relative position information.
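For reference, the standard sinusoidal PE from the Transformer paper looks roughly like the sketch below. The dimension size and function name are illustrative, not what the hackathon code uses; the pe_x and pe_y figures correspond to encoding each gaze coordinate separately.

```python
import numpy as np

def positional_encoding(position, d_model=16):
    """Sinusoidal positional encoding, as in 'Attention Is All You Need'.

    Maps a scalar position (e.g. gaze x or y) to a d_model-dimensional vector.
    d_model=16 is an illustrative size, not necessarily the hackathon value.
    """
    i = np.arange(d_model // 2)
    angles = position / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(angles)   # even indices: sine
    pe[1::2] = np.cos(angles)   # odd indices: cosine
    return pe

# e.g. encode gaze x and y separately, then concatenate into one embedding
pe_x = positional_encoding(3)   # gaze column
pe_y = positional_encoding(1)   # gaze row
gaze_embedding = np.concatenate([pe_x, pe_y])
```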

The what-where data structure is simply a dictionary. What and Where are each fields in that dictionary.
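Conceptually, something like the following sketch, where the key names and shapes are assumptions for illustration rather than the repo's exact code:

```python
import numpy as np

# Stand-in tensors; real shapes come from the visual cortex and PE modules.
visual_features = np.random.rand(128)   # "what": visual cortex output
gaze_embedding = np.random.rand(32)     # "where": positional encoding of gaze

# Hypothetical structure -- actual field names in the repo may differ.
what_where = {
    'what': visual_features,
    'where': gaze_embedding,
}

# Downstream modules (e.g. MTL/PFC input) can then combine the fields,
# for example by concatenating them into a single vector:
mtl_input = np.concatenate([what_where['what'], what_where['where']])
```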

RE High Level RL. Your understanding is correct.
The actor can output only one action per step: either an eye movement or a DM2S 'choice', depending on the value of the integer. If you want to handle each differently in downstream processing, such as an implementation of the Superior Colliculus, you'll need to implement some routing code that checks which range the integer falls into (see the sketch below). Alternatively, it may be desirable to create separate agents (one for eye movement, one for 'choice').
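A rough sketch of such routing is shown here. The action-space layout (3 choice actions followed by 25 gaze actions for a 5x5 grid) is taken from the question above; the exact sizes and names may differ in the actual environment.

```python
NUM_CHOICE_ACTIONS = 3      # e.g. left button, right button, no action
GRID_SIZE = 5               # 5x5 gaze grid

def route_action(action: int):
    """Split a single integer action into a 'choice' or a 'gaze' command."""
    if action < NUM_CHOICE_ACTIONS:
        return ('choice', action)
    gaze_index = action - NUM_CHOICE_ACTIONS
    row, col = divmod(gaze_index, GRID_SIZE)
    return ('gaze', (row, col))

# e.g. route_action(1) -> ('choice', 1); route_action(7) -> ('gaze', (0, 4))
```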

Hope that helps. Don't hesitate to ask more questions or jump in the Slack for more interactive discussion.

Posted by: affogato @ June 24, 2021, 11:26 a.m.

Thanks for the detailed response!

Posted by: jrgordon @ June 24, 2021, 6:57 p.m.