Running in parallel, while writing 'regular' code

Hi, I'm watching the online lectures, currently at lecture 4.2. While I consider myself a complete newbie, I notice myself getting frustrated: not because of the information provided, or the "complexity" (which is very doable), or its sporadic vagueness (like introducing matrix multiplication without a use case), or his stutter... And I know this is all for CUDA and Nvidia chips. But while he introduces the methods and steps you need to be aware of as a parallel programmer, it is all limited to (what I call) "best practice", which is in fact a set of (automatic) "optimizations" on top of normal procedural programming.
So what came to mind: can we build an 'optimisator' for Parallella? A [LANGUAGE]*-compiler that implements all or most of those steps we currently have to _write_ ourselves as programmers, so we can worry about our program logic instead of memory management, blocks/tiles/threads, and caches of shared memory/registers and such. Our 'regular' code would then be executed in parallel (even where we didn't think of implementing it as parallel [yet]).
*) a language like C, Python, Perl, or even a web script in PHP, or any other.
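For what it's worth, something close to this already exists for ordinary multicore CPUs: with OpenMP you write plain C plus a single directive, and the compiler/runtime decides how to create threads and split the work. It's not Parallella/Epiphany code, just a point of reference for the idea; the file name and the gcc invocation are my own assumptions about a GCC-like toolchain:

```c
/* saxpy.c -- a minimal sketch of "regular code, parallel execution".
 * Build (assuming a GCC-like compiler): gcc -fopenmp saxpy.c -o saxpy
 */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    const float a = 2.0f;

    /* Ordinary sequential setup, nothing parallel here. */
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    /* The only "parallel" line in the program: the compiler and
     * runtime decide how to split the loop across threads; no
     * explicit blocks/tiles/threads or shared-memory management. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[42] = %f (threads available: %d)\n",
           y[42], omp_get_max_threads());
    return 0;
}
```

The catch, and the reason fully automatic parallelization (no directive at all) is so hard, is that the compiler must prove the loop iterations are independent before it can safely run them concurrently; the pragma is the programmer vouching for that.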