I see how tensor parallelism does something like this at the algebraic level. I will read more about ensemble learning to understand it better, thanks.
For the sub-network part, here is what I meant, using this picture as an example:
https://drive.google.com/file/d/1OajxveZqxz06yDvzwh5rsaoKVJZ3AIOf/view?usp=sharing
We start from a network like the one in the top drawing.
Then we split the nodes into two groups, the top and bottom parts (bottom-left drawing). The bottom nodes receive values through the green lines, while the top nodes receive them through the red ones.
The information each node receives from the other group consists of the "frozen values" set at the beginning (starting from an initial guess, bottom-right drawing). In the forward stage these would only change the bias of the node, since the frozen inputs contribute a constant term; in the backward stage I think the analogous quantity would be the gradient (I need to think more about that one)...
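To spell out the forward-stage claim (notation mine, assuming a standard weighted-sum node), the frozen values $\bar{x}_k$ arriving from the other half form a constant that can be absorbed into an effective bias:

$$a_j = \sigma\Big(\sum_{i \in \text{own half}} w_{ji}\, x_i \;+\; \underbrace{\sum_{k \in \text{other half}} w_{jk}\, \bar{x}_k + b_j}_{\tilde{b}_j}\Big)$$

so while one half is being trained, each of its nodes just sees a shifted bias $\tilde{b}_j$.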
Once one half has converged to an optimum, the other half is trained again, but with the "frozen values" updated from that new optimum.
It sounds quite complicated, but it is the basic idea of substructuring and domain decomposition methods: a fixed-point iteration between the two sub-networks.
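Since the alternating part is easiest to see in code, here is a minimal sketch of one way it could look, written as block-coordinate descent in PyTorch. The toy model, the data, and the gradient-masking trick are all my own assumptions for illustration, not your method from the drawing:

```python
# Alternating ("fixed-point") training of two halves of a hidden layer,
# sketched as block-coordinate descent. Toy model and data are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical regression data.
X = torch.randn(256, 10)
y = torch.randn(256, 1)

# One hidden layer whose 16 units we split into "top" and "bottom" halves,
# mimicking the two node groups in the picture.
model = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 1))
hidden, out = model[0], model[2]
loss_fn = nn.MSELoss()

top = torch.arange(0, 8)      # units belonging to the "top" half
bottom = torch.arange(8, 16)  # units belonging to the "bottom" half

def train_half(rows, steps=200, lr=1e-2):
    """Update only the hidden units in `rows`; the other half stays frozen
    at the values left by the previous half-solve."""
    mask = torch.zeros(16)
    mask[rows] = 1.0
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = None
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        # Zero out the gradients of the frozen half so only `rows` move.
        hidden.weight.grad *= mask.unsqueeze(1)  # rows of the (16, 10) matrix
        hidden.bias.grad *= mask                 # (16,) biases
        out.weight.grad *= mask                  # columns of the (1, 16) readout
        # out.bias is shared by both halves and stays trainable in every sweep.
        opt.step()
    return loss.item()

# Outer fixed-point loop: alternate the two halves (Gauss-Seidel style)
# until a full sweep no longer improves the loss.
prev = float("inf")
for sweep in range(20):
    train_half(top)
    cur = train_half(bottom)
    print(f"sweep {sweep}: loss = {cur:.5f}")
    if abs(prev - cur) < 1e-6:
        break
    prev = cur
```

Each call to train_half is one sub-network "solve", and the outer loop is the fixed-point iteration: when a full sweep stops changing the loss, the two halves are mutually consistent, which matches the Gauss-Seidel flavor of the domain decomposition idea described above.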