
Unparalleled Value

SENSOR + Simultaneous Computing = True Parallel Efficiency

For a summary of this page, see Q & A p. 1 question 5.

Also see:

From Wikipedia, "Parallel Algorithm":

"In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result [1].

.......

Parallelizability

Some problems cannot be split up into parallel portions, as they require the results from a preceding step to effectively carry on with the next step – these are called inherently serial problems. Examples include iterative numerical methods, such as Newton's method, iterative solutions to the three-body problem, and most of the available algorithms to compute pi (π)."

In general, the goal of our simulation efforts is global predictive optimization of our recovery operations.  Virtually any question that we might attempt to answer towards that end through simulation involves the evaluation of large numbers of sensitivities of recovery to variables that we can observe and control, and to variables that are uncertain.  Evaluation of each sensitivity generally requires a simulation run (one run for each discrete time period for any dynamic control variables).  Some optimization methods may not require all these specific evaluations, but most if not all create the need to make significant numbers of independent runs as quickly as possible.  If there were 16 CPUs available instead of just one, and many runs to make, should they be used to collectively make one run at a time 10 times faster, when they could be used individually, in a much simpler and more reliable manner, to make 16 runs at a time 16 times faster?  Of course not!
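The 16-CPU question can be checked with a few lines of arithmetic. This is a minimal sketch: the batch of 64 runs is an assumed figure, times are expressed in units of one serial run on one CPU, and the function names are illustrative only.

```python
import math

def batch_time_parallel(n_runs, parallel_speedup):
    """Wall-clock time when all CPUs cooperate on one run at a time."""
    return n_runs / parallel_speedup

def batch_time_simultaneous(n_runs, n_cpus):
    """Wall-clock time when each CPU makes its own serial run.

    Assumes ideal hardware: n simultaneous runs each take as long as one run alone.
    """
    return math.ceil(n_runs / n_cpus) * 1.0

n_runs = 64              # assumed batch size, not from the text
n_cpus = 16
parallel_speedup = 10.0  # "one run at a time 10 times faster"

t_parallel = batch_time_parallel(n_runs, parallel_speedup)   # 6.4 serial-run units
t_simultaneous = batch_time_simultaneous(n_runs, n_cpus)     # 4.0 serial-run units
print(t_parallel, t_simultaneous)
```

Under these assumptions the simultaneous serial runs finish the batch in 4 units of time against 6.4 for the parallel runs.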

Parallel reservoir simulation through domain decomposition has allowed larger problems to be studied, and it has reduced the time required to make individual runs in the simulators it has been applied to.  But if you have significant numbers of independent runs to make, it is much simpler, faster, and cheaper to make them using a serial simulator.  The main reasons for that are parallel overhead and solver degradation, and hardware and software cost, complexity, and efficiency.

In the past, distributed memory parallel reservoir simulation enabled larger models to be run because of 32 bit hardware limitations.  But those limitations were practically eliminated with 64 bit technology, which has a theoretical limit of 8 TB of memory on a single processor and which would allow Sensor black oil Impes cases up to 5 billion cells.  The maximum amount currently available is 192 GB, allowing Sensor Impes black oil cases up to 120 million cells, or up to 12 simultaneous 10 million cell cases.  A few thousand dollars buys a 64 bit machine with 16 GB RAM that can run cases up to 10 million cells, or up to 10 simultaneous million cell cases.  Practical reservoir model size at sufficient levels of grid resolution is even now generally limited by run time, not by memory, regardless of the numbers of processors or cores available and their configuration.
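The memory figures above are mutually consistent if memory use scales linearly with cell count. A rough check, assuming decimal units (1 GB = 1e9 bytes) and deriving a per-cell figure from the 192 GB / 120 million cell case (the per-cell number is inferred here, not quoted from Sensor documentation):

```python
BYTES_PER_GB = 1e9  # assumption: decimal gigabytes

# Inferred from "192 GB ... up to 120 million cells"
bytes_per_cell = 192 * BYTES_PER_GB / 120e6

def max_cells(memory_gb):
    """Largest black oil Impes case that fits, assuming linear scaling with memory."""
    return memory_gb * BYTES_PER_GB / bytes_per_cell

cells_8tb = max_cells(8000)   # about 5 billion cells
cells_16gb = max_cells(16)    # about 10 million cells
print(bytes_per_cell, cells_8tb, cells_16gb)
```

The inferred requirement is about 1,600 bytes per cell, which reproduces both the 5 billion cell figure for 8 TB and the 10 million cell figure for a 16 GB machine.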

If your workflow can generate and execute only one case at a time, parallel reservoir simulators may have high value, if they are much faster than Sensor.  But in any realistic reservoir study, the number of possibilities represented by the possible values of the unknowns and their combinations is infinite.  Modification of that workflow to eliminate those restrictions and to simultaneously and iteratively consider a large number of those possibilities has the potential to extract orders of magnitude more information in the same time.  In this case, potential improvement in speed, and/or number and resolution of variables, is limited only by the number of realizations that your modified workflow and hardware can simultaneously generate, execute, and evaluate within a given iteration.

"Parallel efficiency" is conventionally measured and reported in reservoir simulation* as the ratio of same-application parallel speedup (serial/parallel wall clock time) to the number of nodes (processors) used, for a single run.  Equivalently, parallel efficiency is the ratio of the time required to make n serial runs simultaneously on n independent nodes to the time required to make n parallel runs on those nodes in groups.  Parallel efficiency is almost always much less than one.  High linear solver and overall parallel efficiency in reservoir simulation requires weakly coupled static (or possibly even dynamic) domains.  Those simply do not exist in most real cases.  As a result of the domain couplings**, reservoir simulation is not well suited to parallel processing through domain decomposition - it exhibits poor scalability, rapidly reaching a number of additional processors for which little or no increase in parallel speedup is achieved.  Efficient massively parallel simulation through domain decomposition is virtually impossible, as opposed to massively parallel execution of serial jobs, which has unlimited and potentially perfect scalability.
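The conventional definition can be written out directly. In this sketch the timings are hypothetical values chosen for illustration, not measurements of any simulator:

```python
def parallel_speedup(serial_seconds, parallel_seconds):
    """Same-application speedup: serial wall clock time / parallel wall clock time."""
    return serial_seconds / parallel_seconds

def parallel_efficiency(serial_seconds, parallel_seconds, n_nodes):
    """Speedup divided by the number of nodes used for the single run."""
    return parallel_speedup(serial_seconds, parallel_seconds) / n_nodes

# Hypothetical timings: 1600 s in serial, 200 s on 16 nodes
speedup = parallel_speedup(1600.0, 200.0)             # 8.0
efficiency = parallel_efficiency(1600.0, 200.0, 16)   # 0.5
print(speedup, efficiency)
```

A speedup of 8 on 16 nodes corresponds to a parallel efficiency of 0.5: half of the available node-hours produce no additional result.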

Sure, a parallel reservoir simulator using 16 processors might execute a single job 8 or 10 times faster than it does running in serial on one processor, but anything less than ideal parallel speedup of 16, or parallel efficiency of 1, is less than worthless when large numbers of runs need to be made.

If you can run at least as many independent simulations at one time as you have compute nodes (processors, or theoretically cores, if they can be made efficient), then running serial jobs in a distributed computing, grid computing, cluster, or multiprocessor (assuming no hardware inefficiencies)*** environment is faster than running parallel jobs, using the same hardware and application, by a factor equal to 1.0 / parallel efficiency.  If those serial simulations are run using Sensor, then the time required to run them is additionally reduced by a factor equal to Sensor's (up to 15x, and much higher in certain cases) serial speedup.  Serial speedup is far more valuable than parallel speedup.  It is 100% efficient and massively parallelizable.
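The two factors multiply. On n nodes, simultaneous serial jobs complete n runs per serial run time, while parallel jobs complete n x (parallel efficiency) runs in the same time, so the throughput ratio is the reciprocal of the parallel efficiency; a faster serial simulator then scales that ratio again. A minimal sketch, with example input values that are assumptions:

```python
def throughput_advantage(parallel_efficiency, serial_speedup=1.0):
    """Ratio of runs per hour: simultaneous serial jobs vs. parallel jobs.

    parallel_efficiency: efficiency of the parallel application (0 < e <= 1)
    serial_speedup: how much faster the serial simulator is than the
                    parallel application running in serial (1.0 = same code)
    Assumes ideal hardware, i.e. simultaneous jobs do not slow each other down.
    """
    return serial_speedup / parallel_efficiency

same_code = throughput_advantage(0.5)              # 2.0
faster_serial_code = throughput_advantage(0.5, 3)  # 6.0
print(same_code, faster_serial_code)
```

At 50% parallel efficiency the serial jobs deliver twice the throughput with the same application, and six times the throughput if the serial simulator is also three times faster.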

What if you have more compute nodes than independent runs to make?  In that case, the extra nodes can be used for additional studies, or to consider a greater number of possible history matches for better quantification of uncertainty in optimizations, or to increase accuracy through consideration of a much larger number of, and/or a more discrete description of, uncertain / history matching / optimal control variables.  There is no way around overall parallel inefficiency in making large numbers of runs, especially when comparing to Sensor.

History Matching

How many runs need to be made in a history matching study?  That number, and the quality of the match, depend on the complexity of the field, available data and its quality, the expertise of the user, and the effectiveness of the method used to evaluate results and to select and adjust the model parameters.  But in general, the answer is “a large number”.  The increasingly recognized need to account for uncertainty in predictions by generating multiple viable history matches is making that number even higher, while proportionately increasing the number of independent simulations that can be made simultaneously.

Prediction

How many runs need to be made in an optimization study?  That number depends on the number of control variables to optimize, the expertise of the user, and the effectiveness of the method used to optimize (evaluate results and adjust) the controls.  Since the number of possible control variables is generally large, and the dynamic control variables are discretely optimized as a function of time, in general the answer again is “a large number”, which is also increased greatly when uncertainty is quantified.

Advanced Workflows

The difficulty of reservoir characterization and manual history matching and optimization has led to the development and increasing use of advanced (assisted or automated) workflows around reservoir simulation.  These tools promise to reduce the time requirements for reservoir studies, improve match quality, reduce errors and quantify uncertainty with improved optimizations in predictions, and to hugely increase the number of runs made in a given time.  Most of the algorithms used, including stochastic, gradient-based, and genetic methods, inherently provide the ability to make large numbers of runs simultaneously, and efficient implementation requires that they do so.
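The pattern such workflows rely on, submitting many independent cases at once and collecting their results, can be sketched with the Python standard library. Everything here is illustrative: `run_case` is a stand-in that returns a made-up misfit, the parameter name `perm_multiplier` is hypothetical, and a real workflow would launch the simulator on each case's data file at that point (for example through `subprocess.run`) and parse its output.

```python
from concurrent.futures import ThreadPoolExecutor

def run_case(case):
    """Placeholder for one independent serial simulation run.

    Returns a made-up history-match misfit so the sketch is runnable;
    a real implementation would start the simulator and read its results.
    """
    return (case["perm_multiplier"] - 1.3) ** 2

# One generation of candidate cases (hypothetical parameter sweep)
cases = [{"perm_multiplier": 0.5 + 0.1 * i} for i in range(16)]

# Run up to 4 cases at a time; each case is independent of the others
with ThreadPoolExecutor(max_workers=4) as pool:
    misfits = list(pool.map(run_case, cases))

# Pick the best case of this generation to guide the next one
best = min(range(len(cases)), key=misfits.__getitem__)
print(best, cases[best])
```

Threads are sufficient here because each worker would spend its time waiting on an external simulator process; `max_workers` would normally be set to the number of nodes or processors available.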

The Bottom Line

Maximum productivity in any realistic reservoir study requires that you evaluate differences in results between a large number of well-designed runs as quickly and effectively as possible.  Even a simple, manual sensitivity study provides large numbers of independent runs that can be made simultaneously.  Advanced workflows provide tools that can be used to generate, submit, and evaluate the results of very large numbers of simultaneous runs.  That number of runs could easily reach many thousands even for reservoirs of moderate complexity and resolution of unknowns.

Obviously, 16 machines running a serial reservoir simulator will run twice as many jobs in the same time as a single 16 node cluster running that same simulator in parallel (at 50% parallel efficiency), with much less complexity.  Add to this the fact that Sensor is usually 3 to 10 times faster than any other model serially.  On average, Sensor running a single job on a $2000 to $5000 PC (depending on memory, 2 to 16 GB) will outperform available parallel reservoir simulators running on an 8-node cluster.  Sensor running serial simulations in a simultaneous computing environment is by far the simplest, fastest, cheapest, and most accurate and reliable 'parallel system' available that is sufficient to study the large majority of the world’s reservoirs that are subjected to isothermal recovery processes.

In making advanced reservoir studies today, "parallel efficiency" is an oxymoron.

A more appropriate term and measure is "parallel inefficiency".

A Simple Example

 

 

* The original and true definition of "parallel efficiency" arose in the field of computational mathematics, and is based on the speed of the fastest serial algorithm.  To our knowledge, there are no reports of absolute timings for reproducible examples in technical and marketing publications of commercial parallel reservoir model performance, from which true parallel efficiency might be computed.

** Parallel computing through domain decomposition was originally designed for, and is in general applicable to, systems that can be divided into independent domains.  In reservoir models, decomposition of the coupled system of equations requires an additional outer loop around the domain solves in the linear solver, which must be iterated to convergence of the global system.  This is because in each domain solve, the values of the off-domain variables appearing in the domain equations must be 'lagged', i.e. set equal to the previous iterate value from that off-domain solve.  This inability to simultaneously resolve all variable interdependencies causes degradation of solver performance that increases rapidly with the number of domains used.  And any failure to fully converge the global system leads to error in the parallel solution.  Although solver degradation is the main problem with reservoir model parallelization, additional inefficiencies and extreme complexities exist due to communication, load balancing, and synchronization requirements.
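The effect of lagging off-domain values can be seen in a minimal sketch. It assumes a 1D model problem (a tridiagonal system with 2 on the diagonal and -1 off it) solved by plain block Jacobi iteration over equal-sized domains. This is not the linear solver of any reservoir simulator, and production solvers are more sophisticated; the function names are hypothetical.

```python
def solve_tridiagonal(rhs):
    """Thomas algorithm for the matrix with 2 on the diagonal and -1 off it."""
    n = len(rhs)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = -0.5
    d[0] = rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def max_residual(x, b):
    """Largest absolute residual of the global (undecomposed) system."""
    n = len(x)
    worst = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(b[i] - (2.0 * x[i] - left - right)))
    return worst

def outer_iterations(n_cells, n_domains, tol=1e-8, max_iter=100000):
    """Count outer iterations needed to converge the global system.

    Assumes n_cells is divisible by n_domains.
    """
    b = [1.0] * n_cells
    x = [0.0] * n_cells
    size = n_cells // n_domains
    for it in range(1, max_iter + 1):
        x_old = x[:]  # every domain sees only the previous iterate
        for dom in range(n_domains):
            lo, hi = dom * size, (dom + 1) * size
            rhs = b[lo:hi]
            if lo > 0:
                rhs[0] += x_old[lo - 1]   # lagged value from the left neighbor
            if hi < n_cells:
                rhs[-1] += x_old[hi]      # lagged value from the right neighbor
            x[lo:hi] = solve_tridiagonal(rhs)
        if max_residual(x, b) < tol:
            return it
    return max_iter

iters = {k: outer_iterations(48, k) for k in (1, 2, 4, 8)}
print(iters)
```

With a single domain the solve is direct and one pass suffices. With 2, 4, and 8 domains the outer loop needs progressively more passes to reach the same global residual, because each domain solve works with stale neighbor values.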

*** Multiprocessor and multicore hardware inefficiencies currently exist, affecting simultaneous serial and parallel jobs to the same degree, due to shared memory, cache, and system bus.  Perfect multiprocessor/multicore hardware efficiency requires that each processor or core have independently efficient access to cache and to sufficient memory.  Hardware inefficiency is reflected in reduced parallel efficiency on these machines, and for simultaneous serial jobs causes n simultaneous simulations to take longer than one running by itself, by a fraction increasing with n, for n equal to 2 up to the number of processors or cores.  Our observation is that the efficiencies of cores used beyond the number of processors are very poor for memory-intensive applications in high performance computing, such as for field-scale cases in reservoir simulation.  The table below is an example of Sensor64 timings obtained running 1, 2, 3, and 4 simultaneous cases using data file spe10_case2.dat (141,000 cell black oil) on Machine 2 of our Benchmarks page (dual processor, dual core Intel Woodcrest 5160, running Windows XP Professional x64 with 16 GB RAM):

Simultaneous Jobs    CPU Seconds (per job)     Elapsed Seconds (per job)    Runs/Hour
1                    1785                      1792                         2.01
2                    2028, 2020                2036, 2029                   3.53
3                    2674, 2653, 2286          2687, 2666, 2298             4.08
4                    3313, 3307, 3302, 3306    3331, 3320, 3316, 3317       4.32

If the cores were independently efficient, the number of runs made per hour would be equal to 2.01 times the number of simultaneous jobs.  Running two simultaneous jobs makes use of 2 cores, one on each processor.  The incremental efficiency in going from 1 to 2 simultaneous jobs (due to use of the second processor and core) is (3.53-2.01) / 2.01 = .76.  The incremental efficiency due to use of the third core is (4.08-3.53) / 2.01 = .27 while that due to use of the fourth is only (4.32-4.08) / 2.01 = .12.  Overall hardware efficiency of the machine running 4 simultaneous jobs is 4.32 / 8.04 = .54, barely better than what would be obtained from  only 2 completely independent single-core processors (.5).  Parallel applications would pay this penalty, and in addition all of the other parallel-related penalties discussed above.
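The efficiency figures quoted above follow directly from the Runs/Hour values reported for spe10_case2.dat. A short sketch of the arithmetic (variable names are illustrative):

```python
# Runs/Hour for 1 to 4 simultaneous jobs, as reported in the table
runs_per_hour = {1: 2.01, 2: 3.53, 3: 4.08, 4: 4.32}
base = runs_per_hour[1]

# Incremental efficiency of each added core, relative to one job running alone
incremental = {
    n: (runs_per_hour[n] - runs_per_hour[n - 1]) / base
    for n in (2, 3, 4)
}

# Overall hardware efficiency with all 4 cores busy
overall = runs_per_hour[4] / (4 * base)

for n, value in incremental.items():
    print(n, round(value, 2))
print("overall", round(overall, 2))
```

Rounded to two decimals this gives .76, .27, and .12 for the second, third, and fourth jobs, and .54 overall.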


© 2000 - 2017 Coats Engineering, Inc.