Unparalleled Value
SENSOR + Simultaneous Computing
= True Parallel Efficiency
For a summary of this page, see Q & A p. 1 question 5. Also see:
From Wikipedia, "Parallel
Algorithm":
"In
computer science, a parallel algorithm, as
opposed to a traditional
serial algorithm, is an
algorithm which can be executed a piece at a time
on many different processing devices, and then combined together again at
the end to get the correct result
[1].
.......
Parallelizability
Some problems
cannot be split up into parallel portions, as they require the results from
a preceding step to effectively carry on with the next step – these are
called inherently serial problems.
Examples include iterative
numerical methods, such as
Newton's method, iterative solutions to the
three-body problem, and most of the available
algorithms to compute
pi
(π)."
In general, the goal of our simulation
efforts is global predictive optimization of our recovery operations.
Virtually any question that we might attempt to answer towards that end
through simulation involves the evaluation of large
numbers of sensitivities of recovery to variables that we can observe and
control, and to variables that are uncertain. Evaluation of each
sensitivity generally requires a simulation run (one run for each discrete
time period for any dynamic control variables). Some optimization
methods may not require all these specific evaluations, but most if not all
create the need to make significant numbers of independent runs as quickly
as possible. If there were 16 CPUs available instead
of just one, and many runs to make, should they be used to collectively make
one run at a time 10 times faster, when they could be used individually, in
a much simpler and more reliable manner, to make 16 runs at a time 16
times faster? Of course not!
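The arithmetic behind that rhetorical question is worth making explicit. Below is a minimal sketch, using the hypothetical figures above (16 CPUs, a 10x speedup for one parallel run), of the time needed to complete 16 runs either collectively or individually:

```python
# Hypothetical figures from the question above: 16 CPUs, 16 runs to make,
# each run taking T time units serially, and a 10x speedup when one run
# uses all 16 CPUs collectively.
T = 1.0
n_runs = 16
parallel_speedup = 10.0

# Option A: all 16 CPUs work on one run at a time.
time_collective = n_runs * (T / parallel_speedup)   # 16 * T/10 = 1.6 T

# Option B: 16 independent serial runs, one per CPU, all started at once.
time_individual = T                                 # everything finishes together

print(time_collective, time_individual)             # 1.6 vs 1.0
```

Even with a generous 10x parallel speedup, the individual serial runs finish the batch 1.6 times sooner.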
Parallel reservoir simulation through
domain decomposition has allowed larger problems
to be studied, and it has reduced the time required to make individual runs
in the simulators it has been applied to. But if you have significant
numbers of independent runs to make,
it is much simpler, faster, and cheaper to make them using a serial
simulator. The main reasons for that are parallel overhead and solver
degradation, and the cost, complexity, and inefficiency of parallel hardware and software.
In the past, distributed memory
parallel reservoir simulation enabled larger models to be run because of 32
bit hardware limitations. But those limitations were practically eliminated
with 64 bit technology, which has a theoretical limit of 8 TB of memory on a
single processor and which would allow Sensor black oil Impes cases up to
5 billion cells. The maximum amount currently available is
192 GB,
allowing Sensor Impes black oil cases up to 120 million cells,
or up to 12 simultaneous 10 million cell cases.
A few thousand dollars buys a 64 bit machine with 16 GB RAM that can run
cases up to 10 million cells, or up to 10 simultaneous
one million cell cases. Practical reservoir model
size at sufficient levels of grid resolution is even now generally limited
by run time, not by memory, regardless of the numbers of processors or cores
available and their configuration.
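The memory figures quoted above are mutually consistent at roughly 1.7 kB per cell for a Sensor Impes black oil case. That per-cell ratio is inferred here for illustration only, not a published specification; a quick check:

```python
# Quick consistency check of the quoted memory limits. The ~1.7 kB/cell
# ratio is inferred from the figures in the text, not an official number.
GB = 2**30

cases = [(8 * 1024, 5e9),    # 8 TB theoretical limit -> ~5 billion cells
         (192, 120e6),       # 192 GB                 -> ~120 million cells
         (16, 10e6)]         # 16 GB                  -> ~10 million cells

for ram_gb, cells in cases:
    print(f"{ram_gb:>5} GB / {cells:.0e} cells = {ram_gb * GB / cells:,.0f} bytes per cell")
```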
If your workflow can generate and
execute only one case at a time, parallel reservoir simulators may have high
value, if they are much faster than Sensor. But in any realistic
reservoir study, the number of possibilities represented by the possible
values of the unknowns and their combinations is infinite.
Modification of that workflow to eliminate those restrictions and to
simultaneously and iteratively consider a large number of those
possibilities has the potential to extract orders of magnitude more
information in the same time. In this case, potential improvement in
speed, and/or number and resolution of variables, is limited only by the
number of realizations that your modified workflow and
hardware can simultaneously generate, execute, and evaluate within a given
iteration.
"Parallel efficiency" is conventionally
measured and reported in reservoir simulation* as the ratio of
same-application parallel speedup (serial/parallel wall clock time) to the
number of nodes (processors) used, for a single run. Equivalently,
parallel efficiency is the ratio of the time required to make n serial runs
simultaneously on n independent nodes to the time required to make those n runs
in parallel on the same nodes (whether one at a time using all n nodes, or in groups). Parallel efficiency is almost always much less than one. High linear solver and overall parallel
efficiency in reservoir simulation requires weakly coupled static (or
possibly even dynamic) domains. Those simply do not exist in most real
cases. As a result of the domain couplings**, reservoir simulation is
not well suited to parallel processing through domain decomposition - it
exhibits poor scalability, quickly reaching a processor count beyond which
additional processors yield little or no increase in parallel speedup. Efficient
massively parallel simulation through domain decomposition is virtually
impossible, as opposed to massively parallel execution of serial jobs, which
has unlimited and potentially perfect scalability.
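For concreteness, here is a small sketch, using made-up single-run timings, of the conventional definition and of the batch-time equivalence described above:

```python
def parallel_efficiency(t_serial, t_parallel, n_nodes):
    # Conventional definition: single-run speedup divided by node count.
    return (t_serial / t_parallel) / n_nodes

# Hypothetical wall-clock times for one run: 16 hours serial, 2 hours on 16 nodes.
t_serial, t_parallel, n = 16.0, 2.0, 16
E = parallel_efficiency(t_serial, t_parallel, n)     # 8x speedup / 16 nodes = 0.5

# Equivalence: n serial runs made simultaneously on n nodes, versus
# n parallel runs made back to back, each using all n nodes.
t_batch_serial = t_serial                            # all n runs finish together
t_batch_parallel = n * t_parallel                    # runs queue up behind each other
print(E, t_batch_serial / t_batch_parallel)          # both equal 0.5
```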
Sure, a parallel reservoir simulator using
16 processors might execute a single job 8 or 10 times faster than it does
running in serial on one processor, but anything less than ideal parallel speedup of 16, or
parallel efficiency of 1, is less than worthless when large numbers of runs
need to be made.
If you can run at least as many
independent simulations at one time as you have compute nodes (processors, or
theoretically cores, if they can be made efficient), then running
serial jobs in a distributed computing, grid computing, cluster, or
multiprocessor (assuming no hardware inefficiencies)*** environment is faster than running parallel jobs, using the
same hardware and application, by a factor equal to 1.0 divided by the parallel
efficiency. If those serial simulations are run using Sensor, then the
time required to run them is additionally reduced by a factor equal to
Sensor's serial speedup (up to 15x, and much higher in certain cases).
Serial speedup is far more valuable than parallel speedup. It is 100%
efficient and massively parallelizable.
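Under those assumptions, the two gains multiply. The sketch below uses illustrative values (0.5 parallel efficiency, a 5x Sensor serial speedup within the range quoted above) rather than measured ones:

```python
# Throughput gain of n simultaneous serial Sensor runs over n parallel runs
# of another simulator made on the same n nodes. Values are illustrative.
parallel_efficiency = 0.5      # typical single-run efficiency, per the text
sensor_serial_speedup = 5.0    # within the "up to 15x" range quoted above

gain_from_serial_batching = 1.0 / parallel_efficiency          # 2x
total_gain = gain_from_serial_batching * sensor_serial_speedup
print(gain_from_serial_batching, total_gain)                   # 2.0, 10.0
```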
What if you have more compute nodes
than independent runs to make? In that case, the extra nodes can be
used for additional studies, or to consider a greater number of possible
history matches for better quantification of uncertainty in optimizations,
or to increase accuracy through consideration of a much larger number of,
and/or a more discrete description of, uncertain / history matching /
optimal control variables. There is no way around overall parallel
inefficiency in making large numbers of runs, especially when comparing to
Sensor.
History Matching
How many runs need to be made in a
history matching study? That number, and the quality of the match,
depends on the complexity of the field, available data and its quality, the
expertise of the user, and the effectiveness of the method used to evaluate
results and to select and adjust the model parameters. But in general,
the answer is “a large number”. The increasingly recognized need to
account for uncertainty in predictions by generating multiple viable history
matches is making that number even higher, while proportionately increasing
the number of independent simulations that can be made simultaneously.
Prediction
How many runs need to be made in an
optimization study? That number depends on the number of control
variables to optimize, the expertise of the user, and the effectiveness of
the method used to optimize (evaluate results and adjust) the controls.
Since the number of possible control variables is generally large, and the
dynamic control variables are discretely optimized as a function of time, in
general the answer again is “a large number”, which is also increased
greatly when uncertainty is quantified.
Advanced Workflows
The difficulty of reservoir
characterization and manual history matching and optimization has led to the
development and increasing use of advanced (assisted or automated) workflows
around reservoir simulation. These tools promise to reduce the time
requirements for reservoir studies, improve match quality, reduce errors, and
quantify uncertainty for improved optimization of predictions, and to
hugely increase the number of runs made in a given time. Most of the algorithms used, including
stochastic, gradient-based, and genetic methods, inherently provide the
ability to make large numbers of runs simultaneously, and efficient
implementation requires that they do so.
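In practice, making large numbers of runs simultaneously requires little more than launching ordinary serial jobs concurrently. A minimal sketch follows; the executable name sensor.exe and the data-file naming are assumptions for illustration, not part of any particular workflow tool:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical realizations generated by an assisted workflow.
data_files = [f"realization_{i:03d}.dat" for i in range(64)]

def run_case(datafile):
    # Each case is an ordinary serial simulator job; the executable name
    # "sensor.exe" is assumed here for illustration.
    result = subprocess.run(["sensor.exe", datafile], capture_output=True)
    return result.returncode

# One worker per available core or node; the jobs run concurrently.
with ThreadPoolExecutor(max_workers=16) as pool:
    return_codes = list(pool.map(run_case, data_files))

print(sum(rc == 0 for rc in return_codes), "of", len(data_files), "runs succeeded")
```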
The Bottom Line
Maximum productivity in any realistic reservoir study
requires that you evaluate differences in results between a large number of
well-designed runs as quickly and effectively as possible. Even a
simple, manual sensitivity study provides large numbers of independent runs
that can be made simultaneously. Advanced workflows provide tools that
can be used to generate, submit, and evaluate the results of very large
numbers of simultaneous runs. That number of runs could easily reach
many thousands even for reservoirs of moderate complexity and resolution of
unknowns.
Obviously, 16 machines running a serial
reservoir simulator will run twice as many jobs in the same time as a single
16 node cluster running that same simulator in parallel (at 50% parallel
efficiency), with much less complexity. Add to this the fact that
Sensor is usually 3 to 10 times faster than any other model serially.
On average, Sensor running a single job on a $2000 to $5000 PC (depending on
memory, 2 to 16 GB) will outperform available parallel reservoir simulators
running on an 8-node cluster. Sensor running serial simulations in a simultaneous
computing environment is by far the simplest, fastest, cheapest, and most
accurate and reliable 'parallel system' available that is sufficient to study the large majority
of the world’s reservoirs that are subjected to isothermal recovery
processes.
In making advanced reservoir studies
today, "parallel efficiency" is an oxymoron.
A more appropriate term and measure
is "parallel inefficiency".
A Simple Example
* The original and true definition
of "parallel efficiency" arose in the field of computational mathematics,
and is based on the speed of the fastest serial algorithm. To
our knowledge, technical and marketing publications on commercial parallel
reservoir simulator performance report no absolute timings for reproducible
examples from which true parallel efficiency might be computed.
** Parallel computing through
domain decomposition was originally designed for, and is in general
applicable to, systems that can be divided into independent domains. In
reservoir models, decomposition of the coupled system
of equations requires an additional outer loop around
the domain solves in the linear solver, which must be iterated to convergence of the global
system. This is because in each domain solve, the values of the
off-domain variables appearing in the domain equations must be 'lagged', i.e. set equal to the previous
iterate value from that off-domain solve. This inability to
simultaneously resolve all variable interdependencies causes degradation of
solver performance that increases rapidly with the number of domains used.
And any failure to fully converge the global system leads to error in the
parallel solution. Although solver degradation is the main problem with reservoir model parallelization,
additional inefficiencies and extreme complexities exist due to
communication, load balancing, and synchronization requirements.
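The outer iteration and 'lagging' described above can be illustrated with a toy example: a two-domain, block-Jacobi style decomposition of a small coupled linear system. This is only a schematic sketch of the idea, not the solver of Sensor or of any commercial simulator:

```python
import numpy as np

# Toy coupled system: a 1-D diffusion-like matrix split into two "domains".
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
domains = [slice(0, n // 2), slice(n // 2, n)]

x = np.zeros(n)
for outer in range(500):                      # outer loop around the domain solves
    x_old = x.copy()
    for d in domains:
        # Off-domain unknowns are "lagged" at their previous iterate values.
        rhs = b[d] - A[d, :] @ x_old + A[d, d] @ x_old[d]
        x[d] = np.linalg.solve(A[d, d], rhs)
    if np.linalg.norm(x - x_old) < 1e-10:     # global convergence check
        break

# Over a hundred outer iterations are needed here, even though each domain
# solve is exact; a direct solve of the full coupled system needs none.
print(outer + 1, np.linalg.norm(A @ x - b))
```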
*** Multiprocessor and multicore
hardware inefficiencies currently exist, affecting simultaneous serial and
parallel jobs to the same degree, due to shared memory, cache, and system bus. Perfect
multiprocessor/multicore hardware efficiency requires that each processor or
core have independently efficient access to cache and to sufficient memory.
Hardware inefficiency is reflected in reduced parallel efficiency on these
machines, and for simultaneous serial jobs causes n simultaneous simulations
to take longer than one running by itself, by a fraction increasing with n,
for n equal to 2 up to the number of processors or cores. Our
observation is that the efficiencies of cores used beyond the number of
processors are very poor for memory-intensive applications in high
performance computing, such as for field-scale cases in reservoir simulation. The table below is an example of
Sensor64 timings obtained
running 1, 2, 3, and 4 simultaneous cases using data file spe10_case2.dat
(141,000 cell black oil) on Machine 2 of our Benchmarks page (dual processor, dual core Intel Woodcrest 5160, running Windows XP
Professional x64 with 16 GB RAM):
Simultaneous Jobs | CPU Seconds | Elapsed Seconds | Runs/Hour
1 | 1785 | 1792 | 2.01
2 | 2028, 2020 | 2036, 2029 | 3.53
3 | 2674, 2653, 2286 | 2687, 2666, 2298 | 4.08
4 | 3313, 3307, 3302, 3306 | 3331, 3320, 3316, 3317 | 4.32
If the cores were independently efficient, the number of
runs made per hour would be equal to 2.01 times the number of simultaneous
jobs. Running two simultaneous jobs makes use of 2 cores, one on each
processor. The incremental efficiency in going from 1 to 2 simultaneous
jobs (due to use of the second processor and core) is (3.53-2.01) / 2.01 = .76.
The incremental efficiency due to use of the third core is (4.08-3.53) /
2.01 = .27 while that due to use of the fourth is only (4.32-4.08) / 2.01 = .12.
Overall hardware efficiency of the machine running 4 simultaneous jobs is
4.32 / 8.04 = .54, barely better than what would be obtained from only
2 completely independent single-core processors (.5). Parallel applications would
pay this penalty, and in addition all of the other parallel-related
penalties discussed above.
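The efficiencies quoted above follow directly from the runs/hour column; a short script reproducing the arithmetic from the reported values:

```python
# Runs per hour for 1..4 simultaneous jobs, as reported in the table above.
runs_per_hour = {1: 2.01, 2: 3.53, 3: 4.08, 4: 4.32}
base = runs_per_hour[1]          # throughput of a single job using one core

for n in (2, 3, 4):
    # Incremental efficiency of the n-th core put to work.
    incremental = (runs_per_hour[n] - runs_per_hour[n - 1]) / base
    print(n, round(incremental, 2))                    # 0.76, 0.27, 0.12

# Overall hardware efficiency with all four cores busy.
print(round(runs_per_hour[4] / (4 * base), 2))         # 0.54
```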