Parallel processing is intended to increase throughput by reducing the queuing delays experienced by "ready" units of work that are waiting for access to a processor. Each processor is essentially a hardware server for instructions to be processed. In modern computers there are actually multiple points of parallelism and overlapped processing, but the primary purpose of each is to avoid delays.
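The queuing effect described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: all units of work are ready at time zero, each has a fixed service time, and the function name average_wait is hypothetical rather than taken from any real scheduler.

```python
import heapq

def average_wait(service_times, processors):
    """Average time that ready work waits for a free processor.

    Assumes every unit of work is ready at time 0 and is dispatched
    in order to whichever processor becomes free first.
    """
    free_at = [0.0] * processors      # time at which each processor is next free
    heapq.heapify(free_at)
    total_wait = 0.0
    for service in service_times:
        start = heapq.heappop(free_at)            # earliest available processor
        total_wait += start                       # time spent in the ready queue
        heapq.heappush(free_at, start + service)  # processor is busy until then
    return total_wait / len(service_times)

if __name__ == "__main__":
    work = [1.0] * 8                  # eight units of work, one time unit each
    for n in (1, 2, 4, 8):
        print(n, "processor(s): average wait", average_wait(work, n))
```

In this model the service time of each unit of work never changes; adding processors only shortens the time spent waiting in the queue.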
Instruction processing relies on an architectural technique known as pipelining. It works like an assembly line: instructions are fetched from memory into the level one (L1) cache [1], data operands are fetched, and the instructions are decoded, interpreted, and placed on a queue for actual execution.
In some designs the instruction sequence can be processed so that there is a small degree of parallelism when instructions are detected that don't conflict with each other. In those cases, instructions may be processed simultaneously (and potentially out of sequence) with the order of operations being preserved when the results of the operation are committed. This form of parallelism can occur without any programming effort and is used to improve throughput at the processor level.
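As a rough sketch of how non-conflicting instructions might be detected, the toy scheduler below groups instructions into cycles so that no instruction shares a cycle with an earlier instruction it conflicts with. The Instr type, the register names, and the conflict rules are assumptions made for illustration; real processors use far more elaborate mechanisms (register renaming, for example).

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction

def conflicts(earlier, later):
    """True if `later` must wait for `earlier` in this toy model."""
    reads_result = earlier.dest in later.srcs     # later needs earlier's result
    same_target = earlier.dest == later.dest      # both write the same register
    overwrites_input = later.dest in earlier.srcs # later overwrites earlier's input
    return reads_result or same_target or overwrites_input

def schedule(program):
    """Group instruction indexes into cycles; each group can run together."""
    cycle_of = []
    for i, instr in enumerate(program):
        cycle = 0
        for j in range(i):
            if conflicts(program[j], instr):
                cycle = max(cycle, cycle_of[j] + 1)  # run after the conflicting one
        cycle_of.append(cycle)
    groups = [[] for _ in range(max(cycle_of, default=-1) + 1)]
    for i, cycle in enumerate(cycle_of):
        groups[cycle].append(i)
    return groups

if __name__ == "__main__":
    program = [
        Instr("a", ("b", "c")),   # 0: a = b + c
        Instr("d", ("e", "f")),   # 1: d = e + f
        Instr("g", ("a", "d")),   # 2: g = a + d (needs results of 0 and 1)
        Instr("h", ("b", "e")),   # 3: h = b + e (independent of 0, 1 and 2)
    ]
    print(schedule(program))
```

For this four-instruction program the sketch produces two groups: instructions 0, 1 and 3 can be processed together, while instruction 2 must wait for the results of 0 and 1.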
Creating longer, more complex instructions does not solve the throughput problem either, because a complex instruction must still perform the same basic underlying operations, and there are no shortcuts in performing them. Complex instructions may be implemented in microcode or millicode, but the basic operations are still the same.
The only way an individual program can gain a performance increase is if the software is written with the express purpose of dividing up its functions so that they run in parallel. The important distinction is that the speed of each operation hasn't increased; rather, we have an "effective" increase in the application's throughput from performing two or more operations at the same time.
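A minimal sketch of what "dividing up its functions" can look like in practice is shown below, using Python's standard concurrent.futures module. The function names (chunked, sum_of_squares, parallel_sum_of_squares) are hypothetical. Note the assumption in the comments: threads are used here only to keep the example simple, and in CPython a CPU-bound task like this one would need a process-based executor to actually run on several processors at once.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))   # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def sum_of_squares(chunk):
    """The unit of work: each call is independent of the others."""
    return sum(v * v for v in chunk)

def parallel_sum_of_squares(items, workers=4, executor_cls=ThreadPoolExecutor):
    """Divide the work, run the pieces concurrently, then combine the results.

    Assumption: ThreadPoolExecutor shows the structure only. For real
    CPU parallelism in CPython, pass ProcessPoolExecutor instead.
    """
    chunks = chunked(items, workers)
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), workers=3))
```

Each call to sum_of_squares takes just as long as it would in a sequential program; any gain comes only from several calls being in progress at once.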
By analogy, one can imagine traffic on a highway. Adding lanes increases the number of vehicles that can move simultaneously, but it cannot improve the speed of any individual vehicle except by removing obstacles or "competitors". If we imagine that an application is moving passengers, then we can envision a vehicle transporting one or more individuals and then returning to transport more. If we have more vehicles, then we can transfer more people simultaneously and gain a performance improvement. However, one cannot simply suggest something like a bus or train, because that would mean increasing the amount of work done by a single operation, and there is no corollary for that in computing.
While much has been made of parallelism as a way of achieving improved performance, the ability of applications to exploit parallel architectures can be a problem. Many computing problems simply don't lend themselves to parallelism, especially when operations within a program are dependent on results obtained from earlier operations. In those cases, it makes no sense to try to exploit parallelism because the units of work must wait on other tasks to complete. In addition, many problems may offer opportunities for one or even several parallel units of work, but this isn't nearly sufficient to capitalize on the processing power available. Even if a program can be rendered 100% parallel (which isn't likely), its speed-up is limited to the number of simultaneous tasks that can be executed. Three (3) tasks would therefore run at most three times as fast, a 200% improvement [2]. While this may sound significant, it is trivial when workloads are defined in tens of thousands of programs per hour. At present, excepting mathematically intensive problems (scientific programming), the most widely used applications that can exploit high degrees of parallelism are databases and those using graphics [3].
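The limit described above can be worked through with Amdahl's Law as given in the footnotes, S = 1 / ((1 - x) + x/p). The sketch below simply evaluates that formula; the function name amdahl_speedup is an illustrative choice, and p is taken here to be the number of simultaneous tasks.

```python
def amdahl_speedup(x, p):
    """Speed-up factor S = 1 / ((1 - x) + x / p).

    x: fraction of the process that can run in parallel (0.0 to 1.0)
    p: speed increase of that fraction (here, the number of simultaneous tasks)
    """
    return 1.0 / ((1.0 - x) + x / p)

if __name__ == "__main__":
    for x in (1.0, 0.9, 0.5):
        print(f"parallel fraction {x:.0%}, 3 tasks: "
              f"speed-up factor {amdahl_speedup(x, 3):.2f}")
```

With three simultaneous tasks, a fully parallel program reaches a speed-up factor of 3, while a program that is only half parallel reaches a factor of 1.5, however many additional processors are available for the parallel half.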
While processor parallelism is already being heavily used to manage queuing for multiple units of work (multi-programming), the ability to apply it to individual applications has limited utility and will not likely change in any appreciable fashion in the future.
Another point often raised is that newer programming languages will resolve the issue of parallel computing, but this is also not likely to happen. In the first place, all programming languages must be resolved to actual machine language instruction streams. Therefore, whatever the programmer doesn't explicitly code must be generated by the language interpreter or compiler. What makes this approach less promising is that, despite the hype, higher-level languages are seldom as efficient as their low-level equivalents. While they are certainly easier for programmers and novices to use, they are significantly more resource intensive and rather heavy-handed in the solutions they generate.
Parallel computing can certainly be a significant benefit for problems that can be programmed in that fashion. Unfortunately, much general computing doesn't lend itself to such techniques and consequently cannot derive much benefit from them.
[1] This is a greatly simplified explanation of instruction processing and should be understood as a very rough approximation of what occurs.
[2] Amdahl's Law
S = 1 / ((1 - x) + x/p)
S = Speed-up Factor
x = Fraction of Process Affected
p = Speed Increase of Process
[3] Special effects, gaming, and simulators are prime candidates for parallel processing because of their need to perpetually calculate spatial coordinates.