well, in school I learned that matrix multiplication goes this way:
construct a list of matrices:
  multiply each cell with the corresponding cell of the other matrix
  shift the rows resp. columns of the two matrices by one position
  repeat
once finished, apply the summation algorithm to the resulting matrices.
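
to make the steps above concrete, here is a minimal sketch in C++. it assumes square matrices of equal size and reads "shift the rows resp. columns" as a cyclic shift (row i of the first matrix and column j of the second are rotated so that matching cells line up). the names Matrix, shifted_product and multiply are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// One entry of the list: the cell-by-cell product of the two matrices
// after `s` shift steps. Cell (i, j) pairs a[i][k] with b[k][j],
// where k = (i + j + s) mod n.
Matrix shifted_product(const Matrix& a, const Matrix& b, std::size_t s) {
    const std::size_t n = a.size();
    Matrix p(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const std::size_t k = (i + j + s) % n;
            p[i][j] = a[i][k] * b[k][j];
        }
    }
    return p;
}

// Build the list of n product matrices, then sum them cell by cell.
// Assumption: both matrices are square and have the same size.
Matrix multiply(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    assert(b.size() == n);

    std::vector<Matrix> products;
    for (std::size_t s = 0; s < n; ++s) {
        products.push_back(shifted_product(a, b, s));
    }

    Matrix c(n, std::vector<double>(n, 0.0));
    for (const Matrix& p : products) {
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                c[i][j] += p[i][j];
            }
        }
    }
    return c;
}
```

every cell of every matrix in the list is computed independently of all the others, which is what makes the first phase easy to spread over many cores. only the final summation couples them.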
so even if the matrix has far fewer entries than you have cores,
no core really has to sit idle, since the cell products of all the matrices in the list are independent and the whole work can be spread across the cores.
as for the summation algorithm itself, here too you can make use of the cores you have in excess of the number of matrix entries. it's as you said: summation is commutative (and associative), so additional cores can simply calculate partial sums.
the really tricky part is balancing the load so that the cores are occupied with actual calculations at all times and the buses are kept permanently busy with data. help from the computer would be nice for this, but it is something that would need to be done at the level of a preprocessor.
have you seen any preprocessor that is prepared to do the actual scheduling? have you seen any tool at all that can display the actual idle time for each core and lets you manually rearrange the program code so that the idle time is filled? if you can't do that manually, why expect that a computer could do it automatically?
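
the partial-sum idea can be sketched roughly like this in C++. this is only an illustration under simple assumptions: the input is cut into equally sized chunks, one std::async task per chunk, and the function name parallel_sum is invented here. it does nothing about the load-balancing problem described above, it only shows that the partial sums can be computed independently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sum `values` by letting up to `workers` tasks each add up one chunk,
// then combining the partial sums. Reordering the additions like this
// is only valid because addition is associative and commutative
// (floating-point rounding may still differ slightly between orders).
double parallel_sum(const std::vector<double>& values, std::size_t workers) {
    assert(workers > 0);
    const std::size_t chunk = (values.size() + workers - 1) / workers;

    std::vector<std::future<double>> partials;
    for (std::size_t begin = 0; begin < values.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, values.size());
        partials.push_back(std::async(std::launch::async, [&values, begin, end] {
            return std::accumulate(values.begin() + begin,
                                   values.begin() + end, 0.0);
        }));
    }

    double total = 0.0;
    for (auto& partial : partials) {
        total += partial.get();  // waits for that chunk to finish
    }
    return total;
}
```

equal chunks are the simplest static split. if the chunks take unequal time, some cores finish early and wait, which is exactly the idle time the questions above are about.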
as for languages, I agree: a programmer should choose the language based on how well it fits their own preferences, not based on the problem. however, there are not always bindings for the needed libs in your favourite language. even OOP in assembler needs some sort of preprocessor support to reduce the work and to get the lib bindings right. either you create a huge collection of those things for your programming language, or you use the same language most others use: C or C++!
the only problem with those two is that they are not compatible with each other. apart from the problem of reusing each other's headers, there is also the problem that C uses a preprocessor where C++ uses templates and such for the same purpose. so in effect you will end up programming in a language instead of into it.
OOP especially requires something like struct to be present in a language, and the preprocessor of C doesn't have that. C++ could be extended to reuse the language's own struct within templates (i.e. template< reinterpret_cast<quot>({3,5}).den > should be compiled as template<5>)...
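
for what it's worth, something close to that last example can already be written with constexpr aggregates, just not with the reinterpret_cast spelling. a small sketch, assuming quot is a plain struct with num and den fields (as the example suggests) and with fixed as a made-up template for illustration:

```cpp
#include <type_traits>

// Plain aggregate; field names follow the example in the post.
struct quot {
    int num;
    int den;
};

// Hypothetical template that takes an integer argument.
template <int N>
struct fixed {
    static constexpr int value = N;
};

// quot{3, 5} is built at compile time, so its member `den` is a
// constant expression and can be used as a template argument.
using from_member = fixed<quot{3, 5}.den>;

static_assert(from_member::value == 5, "den of {3,5} should be 5");
static_assert(std::is_same<from_member, fixed<5>>::value,
              "fixed<quot{3, 5}.den> is the same type as fixed<5>");
```

C++20 goes a step further and also accepts suitable struct types themselves as template parameters, so the struct does not have to be taken apart first.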