0 votes
in Séquence 17
Do iterative methods of this type exist, where one distributes the nodes of each layer across workers (so the layers are cut "horizontally": every sub-network keeps the same number of layers, but with fewer nodes in the "vertical" dimension), trains each sub-network independently, and "at the end" makes them communicate and retrains them? Are there, or could there be, any important advantages/disadvantages/concerns with this approach?

1 Answer

0 votes
by Vétéran du GPU 🐋 (20.4k points)
selected by
Best answer
Tensor parallelism is a method that distributes a network by cutting it horizontally. And training different networks and making them "work together" is what we call ensemble learning. But I don't see what you mean by "distribute a network and train each subnetwork independently". Why would you need to distribute a network if you are going to train each subnetwork independently? You could just train several small networks, no?
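For illustration, here is a minimal NumPy sketch (a toy of my own, not any framework's actual API) of what cutting a single linear layer "horizontally" looks like: each worker owns half of the output neurons, computes only its slice of the activations, and the slices are gathered back together.

```python
import numpy as np

# Toy tensor parallelism: one linear layer y = x @ W + b, with the output
# neurons split "horizontally" across two workers (a column split of W).
rng = np.random.default_rng(0)

x = rng.standard_normal((4, 8))   # batch of 4 samples, 8 input features
W = rng.standard_normal((8, 6))   # full weight matrix: 8 inputs -> 6 outputs
b = rng.standard_normal(6)

# Worker 0 owns output neurons 0..2, worker 1 owns neurons 3..5.
W0, W1 = W[:, :3], W[:, 3:]
b0, b1 = b[:3], b[3:]

# Each worker computes only its slice of the activations...
y0 = x @ W0 + b0
y1 = x @ W1 + b1

# ...and gathering the slices (here a simple concatenation) rebuilds the output.
y_parallel = np.concatenate([y0, y1], axis=1)
y_full = x @ W + b
assert np.allclose(y_parallel, y_full)
```

Real tensor-parallel implementations (Megatron-LM, for example) do the same kind of split across GPUs and replace the concatenation with collective communication.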
by
edited
I see how tensor parallelism does something like this at the algebraic level. I will read more about ensemble learning to understand it, thanks.

For the subnetwork part, what I meant is illustrated in this picture:

https://drive.google.com/file/d/1OajxveZqxz06yDvzwh5rsaoKVJZ3AIOf/view?usp=sharing

We start from a network like the drawing on top.

Then we split the nodes into two groups, a top part and a bottom part (bottom-left drawing). The bottom nodes receive values through the green connections, while the top nodes receive them through the red ones.

The information each node receives consists of the "frozen values" set at the beginning (starting from an initial guess, bottom-right drawing). For the forward pass this would only change the bias of the node, and for the backward pass I think it would be the gradient (I need to think more about that one)...

Once one half has converged to an optimum, the other half is trained again, but with the "frozen values" updated from that new optimum.

It sounds quite complicated, but it is the basic idea behind substructuring and domain decomposition methods: a fixed-point iteration between the two sub-networks.
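Just to make the idea concrete, here is a deliberately simplified toy sketch in PyTorch (my own illustration, not an established method): instead of freezing the cross-connections inside every layer, the two halves are only coupled additively at the output, which is enough to show the alternating / fixed-point structure (essentially a block-coordinate iteration between the two halves).

```python
import torch

# Toy version of the proposed scheme: a one-hidden-layer MLP whose hidden
# units are split into a "top" and a "bottom" half. Each half is trained
# while the other half's contribution to the output is frozen, then the
# roles are swapped (alternating / fixed-point sweeps).
torch.manual_seed(0)
X = torch.randn(64, 10)
y = torch.randn(64, 1)

def make_half():
    # One half of the hidden layer (8 of the 16 conceptual hidden units)
    # together with its share of the output weights.
    return torch.nn.Sequential(torch.nn.Linear(10, 8), torch.nn.Tanh(),
                               torch.nn.Linear(8, 1))

top, bottom = make_half(), make_half()
loss_fn = torch.nn.MSELoss()

for sweep in range(10):                        # outer fixed-point sweeps
    for train_half, frozen_half in [(bottom, top), (top, bottom)]:
        with torch.no_grad():                  # "frozen values" from the other half
            frozen_out = frozen_half(X)
        opt = torch.optim.SGD(train_half.parameters(), lr=0.05)
        for step in range(100):                # train this half to (near) convergence
            opt.zero_grad()
            pred = train_half(X) + frozen_out  # full output = sum of both halves
            loss = loss_fn(pred, y)
            loss.backward()
            opt.step()
    print(f"sweep {sweep}: loss = {loss.item():.4f}")
```

In the full scheme each layer would be split and the frozen cross-activations exchanged at every layer, but the alternating structure between the two halves would be the same.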
by Vétéran du GPU 🐋 (68.8k points)
I think I understand what you are suggesting. We are falling back on techniques used in HPC.
To my knowledge, this is not a method that is used for deep learning training.
On your example it would probably be easy to implement, but I am sure it gets trickier on larger architectures.
In terms of possible performance, I cannot make an estimate; there are many factors to take into account, such as the number of GPUs, network asynchrony, learning stability...

This is really the kind of approach that would mostly be of interest for giant models.

PS: I should check whether existing parallelism implementations aren't already doing more or less the same thing.