Hybrid Parallel AD (Part 3/?) #1294

Merged Aug 7, 2021, with 47 commits.

Commits (changes from all commits):
bb27a0b  CoDiPack update. (jblueh, Jun 1, 2021)
183c3ca  CoDiPack tape choice via build options. (jblueh, Jun 11, 2021)
f501dc1  Fix for the disc_adj_fsi problem. (jblueh, Jun 11, 2021)
3b0ebd3  Merge branch 'develop' into hybrid_parallel_ad4 (pcarruscag, Jun 24, 2021)
6f3c86a  work estimate for OpenMP scheduling of preconditioners based on num n… (pcarruscag, Jun 24, 2021)
d5f8ac9  small fix (pcarruscag, Jun 24, 2021)
53bd274  update tests (pcarruscag, Jun 24, 2021)
0fe1e67  add hybrid AD regressions (pcarruscag, Jun 24, 2021)
5017a90  set reference residuals (pcarruscag, Jun 25, 2021)
b7b3dd7  remove heat case because heat solver does not have openmp (pcarruscag, Jun 25, 2021)
967704c  Make preaccumulation threadprivate. (jblueh, Jun 28, 2021)
0a72b67  Re-enable parallel preaccumulation. (jblueh, Jun 28, 2021)
c38bf14  Remove unused variable. (jblueh, Jun 28, 2021)
b61684b  PreaccActive was never reset. (jblueh, Jun 28, 2021)
781092a  Identify some faulty preaccumulation regions. (jblueh, Jun 29, 2021)
77aa7d0  Disable preaccumulation for parallel boundary numerics. (jblueh, Jun 30, 2021)
742118d  Add assert. (jblueh, Jun 30, 2021)
9b09003  Merge remote-tracking branch 'upstream/develop' into hybrid_parallel_ad4 (pcarruscag, Jul 5, 2021)
1d2c206  disable preacc when coloring fails (pcarruscag, Jul 5, 2021)
a573f9a  small fix (pcarruscag, Jul 5, 2021)
bc90f74  Add shared reading switches. (jblueh, Jul 6, 2021)
cba486d  Apply some shared reading optimizations. (jblueh, Jul 6, 2021)
8e7a9c6  Apply suggestions. (jblueh, Jul 6, 2021)
e03f11b  Update Arina2K regression (pcarruscag, Jul 6, 2021)
15d3666  Remove redundant init. (jblueh, Jul 7, 2021)
e10abcc  OpDiLib update. (jblueh, Jul 13, 2021)
2726ca6  Add build option for shared reading optimization. (jblueh, Jul 13, 2021)
f8fe252  Fix. (jblueh, Jul 13, 2021)
ab91794  Merge branch 'develop' into hybrid_parallel_ad4 (pcarruscag, Jul 15, 2021)
d8656aa  update discadj_fea (hybrid AD) (pcarruscag, Jul 15, 2021)
ac18c09  Merge branch 'develop' into hybrid_parallel_ad4 (pcarruscag, Jul 20, 2021)
3c84ad1  Missing barrier. (jblueh, Jul 20, 2021)
c8ff857  CoDiPack update. (jblueh, Jul 20, 2021)
5901d8b  Merge remote-tracking branch 'su2github/hybrid_parallel_ad4' into hyb… (jblueh, Jul 20, 2021)
fcc39ce  Move barrier inside HandleTemporariesOut. (jblueh, Jul 22, 2021)
028d1e0  Further shared reading optimizations. (jblueh, Jul 28, 2021)
7acc44f  Test without boundary treatment. (jblueh, Jul 28, 2021)
7586e7c  Merge branch 'develop' into hybrid_parallel_ad4 (jblueh, Jul 28, 2021)
2830dea  Source_Residual shared reading optimizations. (jblueh, Jul 28, 2021)
25ba4e3  revise some shared readings and add others (pcarruscag, Aug 1, 2021)
3b4a018  Merge branch 'develop' into hybrid_parallel_ad4 (pcarruscag, Aug 1, 2021)
a9466bb  Suggestion for an option that disabled OpDiLib. (jblueh, Aug 2, 2021)
1ce5115  OpDiLib update. (jblueh, Aug 4, 2021)
e81a8ff  Revert "Suggestion for an option that disabled OpDiLib." (jblueh, Aug 4, 2021)
3f81059  Replace assert by warning. (jblueh, Aug 4, 2021)
d64d620  Update SU2_CFD/src/drivers/CDiscAdjMultizoneDriver.cpp (pcarruscag, Aug 5, 2021)
597c637  Merge branch 'develop' into hybrid_parallel_ad4 (pcarruscag, Aug 7, 2021)
4 changes: 3 additions & 1 deletion .github/workflows/regression.yml
@@ -60,7 +60,7 @@ jobs:
strategy:
fail-fast: false
matrix:
testscript: ['tutorials.py', 'parallel_regression.py', 'parallel_regression_AD.py', 'serial_regression.py', 'serial_regression_AD.py', 'hybrid_regression.py']
testscript: ['tutorials.py', 'parallel_regression.py', 'parallel_regression_AD.py', 'serial_regression.py', 'serial_regression_AD.py', 'hybrid_regression.py', 'hybrid_regression_AD.py']
include:
- testscript: 'tutorials.py'
tag: MPI
@@ -74,6 +74,8 @@
tag: NoMPI
- testscript: 'hybrid_regression.py'
tag: OMP
- testscript: 'hybrid_regression_AD.py'
tag: OMP
steps:
- name: Download All artifact
uses: actions/download-artifact@v2
66 changes: 63 additions & 3 deletions Common/include/basic_types/ad_structure.hpp
@@ -252,7 +252,7 @@ namespace AD{

/*!
* \brief Start a passive region, i.e. stop recording.
* \return True is tape was active.
* \return True if tape was active.
*/
inline bool BeginPassive() { return false; }

@@ -262,6 +262,28 @@
*/
inline void EndPassive(bool wasActive) {}

/*!
* \brief Pause the use of preaccumulation.
* \return True if preaccumulation was active.
*/
inline bool PausePreaccumulation() { return false; }

/*!
* \brief Resume the use of preaccumulation.
* \param[in] wasActive - Whether preaccumulation was active before pausing.
*/
inline void ResumePreaccumulation(bool wasActive) {}

/*!
* \brief Begin a hybrid parallel adjoint evaluation mode that assumes an inherently safe reverse path.
*/
inline void StartNoSharedReading() {}

/*!
* \brief End the "no shared reading" adjoint evaluation mode.
*/
inline void EndNoSharedReading() {}

#else
using CheckpointHandler = codi::DataStore;

@@ -271,9 +293,10 @@

extern ExtFuncHelper* FuncHelper;

extern bool Status;

extern bool PreaccActive;
#ifdef HAVE_OPDI
SU2_OMP(threadprivate(PreaccActive))
#endif

extern bool PreaccEnabled;

@@ -290,6 +313,9 @@
extern std::vector<TapePosition> TapePositions;

extern codi::PreaccumulationHelper<su2double> PreaccHelper;
#ifdef HAVE_OPDI
SU2_OMP(threadprivate(PreaccHelper))
#endif

/*--- Reference to the tape. ---*/

@@ -446,6 +472,7 @@ namespace AD{
FORCEINLINE void EndPreacc(){
if (PreaccActive) {
PreaccHelper.finish(false);
PreaccActive = false;
}
}
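
A note on the fix above: resetting PreaccActive (commit b61684b) keeps the flag consistent with the usual call pattern, sketched below. SetPreaccOut is assumed as the output-side counterpart of the SetPreaccIn calls that appear later in this diff, and the variable names are illustrative:

  AD::StartPreacc();                 // open a preaccumulation region (no-op if recording is off)
  AD::SetPreaccIn(coord_i, nDim);    // register the region's inputs
  // ... compute outputs that depend only on the registered inputs ...
  AD::SetPreaccOut(residual, nVar);  // register the outputs (assumed name)
  AD::EndPreacc();                   // store the local Jacobian instead of the full statement tape

Without the reset, a later EndPreacc could try to finish a region that was never successfully started.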

@@ -522,6 +549,39 @@

FORCEINLINE void EndPassive(bool wasActive) { if(wasActive) StartRecording(); }

FORCEINLINE bool PausePreaccumulation() {
const auto current = PreaccEnabled;
if (!current) return false;
SU2_OMP_BARRIER
SU2_OMP_MASTER
PreaccEnabled = false;
END_SU2_OMP_MASTER
SU2_OMP_BARRIER
return true;
}

FORCEINLINE void ResumePreaccumulation(bool wasActive) {
if (!wasActive) return;
SU2_OMP_BARRIER
SU2_OMP_MASTER
PreaccEnabled = true;
END_SU2_OMP_MASTER
SU2_OMP_BARRIER
}
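
The barrier/master/barrier structure above is the standard OpenMP idiom for toggling shared state; stripped of the SU2 macros it reads (illustrative only, not part of the diff):

  #pragma omp barrier     // all threads stop touching preaccumulation first
  #pragma omp master
  PreaccEnabled = false;  // a single thread flips the shared flag
  #pragma omp barrier     // every thread observes the new value before continuing

The first barrier guarantees no thread is still inside a preaccumulation region when the flag changes; the second guarantees no thread races ahead with a stale value.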

FORCEINLINE void StartNoSharedReading() {
#ifdef HAVE_OPDI
opdi::logic->setAdjointAccessMode(opdi::LogicInterface::AdjointAccessMode::Classical);
opdi::logic->addReverseBarrier();
#endif
}

FORCEINLINE void EndNoSharedReading() {
#ifdef HAVE_OPDI
opdi::logic->setAdjointAccessMode(opdi::LogicInterface::AdjointAccessMode::Atomic);
opdi::logic->addReverseBarrier();
#endif
}
#endif // CODI_REVERSE_TYPE

} // namespace AD
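Taken together, the new switches are used as in CFVMFlowSolverBase::EdgeFluxResidual further down in this diff; condensed (ReducerStrategy is the solver's existing fallback flag):

  bool pausePreacc = false;
  if (ReducerStrategy) pausePreacc = AD::PausePreaccumulation();  // shared reads possible
  else AD::StartNoSharedReading();  // reverse path is race-free, use classical adjoint access

  // ... OpenMP-parallel recording loop ...

  AD::ResumePreaccumulation(pausePreacc);
  if (!ReducerStrategy) AD::EndNoSharedReading();

With OpDiLib, StartNoSharedReading switches the adjoint access mode from Atomic to Classical between reverse barriers, so sections whose recorded statements never read shared variables skip the cost of atomic adjoint updates.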
20 changes: 5 additions & 15 deletions Common/include/code_config.hpp
@@ -79,25 +79,15 @@ using su2conditional_t = typename su2conditional<B,T,F>::type;
#include "codi.hpp"
#include "codi/tools/dataStore.hpp"

#ifndef CODI_INDEX_TAPE
#define CODI_INDEX_TAPE 0
#endif
#ifndef CODI_PRIMAL_TAPE
#define CODI_PRIMAL_TAPE 0
#endif
#ifndef CODI_PRIMAL_INDEX_TAPE
#define CODI_PRIMAL_INDEX_TAPE 0
#endif

#if defined(HAVE_OMP)
using su2double = codi::RealReverseIndexParallel;
#else
#if CODI_INDEX_TAPE
#if defined(CODI_INDEX_TAPE)
using su2double = codi::RealReverseIndex;
#elif CODI_PRIMAL_TAPE
using su2double = codi::RealReversePrimal;
#elif CODI_PRIMAL_INDEX_TAPE
using su2double = codi::RealReversePrimalIndex;
//#elif defined(CODI_PRIMAL_TAPE)
//using su2double = codi::RealReversePrimal;
//#elif defined(CODI_PRIMAL_INDEX_TAPE)
//using su2double = codi::RealReversePrimalIndex;
#else
using su2double = codi::RealReverse;
#endif
2 changes: 2 additions & 0 deletions Common/include/linear_algebra/CSysSolve.hpp
@@ -256,6 +256,7 @@ class CSysSolve {
void HandleTemporariesOut(CSysVector<OtherType>& LinSysSol) {

/*--- Reset the pointers. ---*/
SU2_OMP_BARRIER
SU2_OMP_MASTER {
LinSysRes_ptr = nullptr;
LinSysSol_ptr = nullptr;
@@ -276,6 +277,7 @@
LinSysSol.PassiveCopy(LinSysSol_tmp);

/*--- Reset the pointers. ---*/
SU2_OMP_BARRIER
SU2_OMP_MASTER {
LinSysRes_ptr = nullptr;
LinSysSol_ptr = nullptr;
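The added SU2_OMP_BARRIER closes a race (commits 3c84d1 is "Missing barrier.", fcc39ce moved it here): SU2_OMP_MASTER has no implied barrier, so without it the master thread could null the shared pointers while other threads are still using them. Schematically (illustrative two-thread timeline):

  // without the barrier:
  //   thread 1: ... still reading through LinSysSol_ptr from the preceding work
  //   thread 0: LinSysSol_ptr = nullptr;  // master resets too early
  // with the barrier, all threads finish with the pointers before the reset.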
2 changes: 1 addition & 1 deletion Common/include/toolboxes/graph_toolbox.hpp
@@ -527,7 +527,7 @@ T createNaturalColoring(Index_t numInnerIndexes)
* \param[out] indexColor - Optional, vector with colors given to the outer indices.
* \return Coloring in the same type of the input pattern.
*/
template<class T, typename Color_t = char, size_t MaxColors = 32, size_t MaxMB = 128>
template<class T, typename Color_t = char, size_t MaxColors = 64, size_t MaxMB = 128>
T colorSparsePattern(const T& pattern, size_t groupSize = 1, bool balanceColors = false,
std::vector<Color_t>* indexColor = nullptr)
{
4 changes: 0 additions & 4 deletions Common/src/CConfig.cpp
@@ -4473,11 +4473,7 @@ void CConfig::SetPostprocessing(SU2_COMPONENT val_software, unsigned short val_i
#if defined CODI_REVERSE_TYPE
AD_Mode = YES;

#if defined HAVE_OMP
AD::PreaccEnabled = false;
#else
AD::PreaccEnabled = AD_Preaccumulation;
#endif

#else
if (AD_Mode == YES) {
7 changes: 7 additions & 0 deletions Common/src/basic_types/ad_structure.cpp
@@ -35,9 +35,16 @@ namespace AD {
std::vector<TapePosition> TapePositions;

bool PreaccActive = false;
#ifdef HAVE_OPDI
SU2_OMP(threadprivate(PreaccActive))
#endif

bool PreaccEnabled = true;

codi::PreaccumulationHelper<su2double> PreaccHelper;
#ifdef HAVE_OPDI
SU2_OMP(threadprivate(PreaccHelper))
#endif

ExtFuncHelper* FuncHelper;

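SU2_OMP(threadprivate(...)) expands to an OpenMP threadprivate directive, so each thread owns its copy of the preaccumulation state (commit 967704c) and per-thread regions cannot clobber one another; in plain OpenMP terms:

  static bool PreaccActive = false;
  #pragma omp threadprivate(PreaccActive)  // one private instance per thread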
3 changes: 2 additions & 1 deletion Common/src/geometry/CPhysicalGeometry.cpp
@@ -7701,7 +7701,8 @@ void CPhysicalGeometry::SetBoundControlVolume(const CConfig *config, unsigned sh

const auto nNodes = bound[iMarker][iElem]->GetnNodes();

AD::StartPreacc();
/*--- Cannot preaccumulate if hybrid parallel due to shared reading. ---*/
if (omp_get_num_threads() == 1) AD::StartPreacc();

/*--- Get pointers to the coordinates of all the element nodes ---*/
array<const su2double*, N_POINTS_MAXIMUM> Coord;
15 changes: 11 additions & 4 deletions Common/src/linear_algebra/CSysMatrix.cpp
@@ -185,10 +185,17 @@ void CSysMatrix<ScalarType>::Initialize(unsigned long npoint, unsigned long npoi
/*--- This is akin to the row_ptr. ---*/
omp_partitions = new unsigned long [omp_num_parts+1];

/// TODO: Use a work estimate to produce more balanced partitions.
auto pts_per_part = roundUpDiv(nPointDomain, omp_num_parts);
for(auto part = 0ul; part < omp_num_parts; ++part)
omp_partitions[part] = part * pts_per_part;
/*--- Work estimate based on non-zeros to produce balanced partitions. ---*/

const auto row_ptr_prec = ilu_needed? row_ptr_ilu : row_ptr;
const auto nnz_prec = row_ptr_prec[nPointDomain];

const auto nnz_per_part = roundUpDiv(nnz_prec, omp_num_parts);

for (auto iPoint = 0ul, part = 0ul; iPoint < nPointDomain; ++iPoint) {
if (row_ptr_prec[iPoint] >= part*nnz_per_part)
omp_partitions[part++] = iPoint;
}
omp_partitions[omp_num_parts] = nPointDomain;

/*--- Generate MKL Kernels ---*/
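A worked example of the new partitioning, with hypothetical numbers: nPointDomain = 4, row_ptr = {0, 12, 14, 16, 24} (row 0 alone holds 12 of the 24 nonzeros), omp_num_parts = 2, hence nnz_per_part = 12:

  // iPoint = 0: row_ptr[0] = 0  >= 0*12  ->  omp_partitions[0] = 0
  // iPoint = 1: row_ptr[1] = 12 >= 1*12  ->  omp_partitions[1] = 1
  // afterwards:                              omp_partitions[2] = 4
  // Partition 0 = {row 0} with 12 nonzeros, partition 1 = {rows 1,2,3} with 12.
  // The previous row-count split would give 14 vs 10 nonzeros for this pattern.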
3 changes: 2 additions & 1 deletion SU2_CFD/include/gradients/computeGradientsGreenGauss.hpp
@@ -76,7 +76,8 @@ void computeGradientsGreenGauss(CSolver* solver,
{
auto nodes = geometry.nodes;

AD::StartPreacc();
/*--- Cannot preaccumulate if hybrid parallel due to shared reading. ---*/
if (omp_get_num_threads() == 1) AD::StartPreacc();
AD::SetPreaccIn(nodes->GetVolume(iPoint));
AD::SetPreaccIn(nodes->GetPeriodicVolume(iPoint));

3 changes: 2 additions & 1 deletion SU2_CFD/include/gradients/computeGradientsLeastSquares.hpp
@@ -203,7 +203,8 @@
auto nodes = geometry.nodes;
const auto coord_i = nodes->GetCoord(iPoint);

AD::StartPreacc();
/*--- Cannot preaccumulate if hybrid parallel due to shared reading. ---*/
if (omp_get_num_threads() == 1) AD::StartPreacc();
AD::SetPreaccIn(coord_i, nDim);

for (size_t iVar = varBegin; iVar < varEnd; ++iVar)
3 changes: 2 additions & 1 deletion SU2_CFD/include/limiters/computeLimiters_impl.hpp
@@ -132,7 +132,8 @@ void computeLimiters_impl(CSolver* solver,
auto nodes = geometry.nodes;
const auto coord_i = nodes->GetCoord(iPoint);

AD::StartPreacc();
/*--- Cannot preaccumulate if hybrid parallel due to shared reading. ---*/
if (omp_get_num_threads() == 1) AD::StartPreacc();
AD::SetPreaccIn(coord_i, nDim);

for (size_t iVar = varBegin; iVar < varEnd; ++iVar)
24 changes: 23 additions & 1 deletion SU2_CFD/include/solvers/CFVMFlowSolverBase.inl
@@ -319,7 +319,11 @@ void CFVMFlowSolverBase<V, R>::HybridParallelInitialization(const CConfig& confi
cout << "WARNING: On " << numRanksUsingReducer << " MPI ranks the coloring efficiency was less than "
<< COLORING_EFF_THRESH << " (min value was " << minEff << ").\n"
<< " Those ranks will now use a fallback strategy, better performance may be possible\n"
<< " with a different value of config option EDGE_COLORING_GROUP_SIZE (default 512)." << endl;
<< " with a different value of config option EDGE_COLORING_GROUP_SIZE (default 512)."
#ifdef HAVE_OPDI
<< "\n The memory usage of the discrete adjoint solver is higher when using the fallback."
#endif
<< endl;
}

if (config.GetUseVectorization() && (omp_get_max_threads() > 1) &&
@@ -1531,6 +1535,12 @@ void CFVMFlowSolverBase<V, R>::EdgeFluxResidual(const CGeometry *geometry,
InstantiateEdgeNumerics(solvers, config);
}

/*--- For hybrid parallel AD, pause preaccumulation if there is shared reading of
* variables, otherwise switch to the faster adjoint evaluation mode. ---*/
bool pausePreacc = false;
if (ReducerStrategy) pausePreacc = AD::PausePreaccumulation();
else AD::StartNoSharedReading();

/*--- Loop over edge colors. ---*/
for (auto color : EdgeColoring) {
/*--- Chunk size is at least OMP_MIN_SIZE and a multiple of the color group size. ---*/
@@ -1553,6 +1563,10 @@
END_SU2_OMP_FOR
}

/*--- Restore preaccumulation and adjoint evaluation state. ---*/
AD::ResumePreaccumulation(pausePreacc);
if (!ReducerStrategy) AD::EndNoSharedReading();

if (ReducerStrategy) {
SumEdgeFluxes(geometry);
if (config->GetKind_TimeIntScheme() == EULER_IMPLICIT) {
@@ -1607,6 +1621,8 @@ void CFVMFlowSolverBase<V, FlowRegime>::SetResidual_DualTime(CGeometry *geometry

/*--- Loop over all nodes (excluding halos) ---*/

AD::StartNoSharedReading();

SU2_OMP_FOR_STAT(omp_chunk_size)
for (iPoint = 0; iPoint < nPointDomain; iPoint++) {

@@ -1642,6 +1658,8 @@ void CFVMFlowSolverBase<V, FlowRegime>::SetResidual_DualTime(CGeometry *geometry
}
END_SU2_OMP_FOR

AD::EndNoSharedReading();

Review comment (PR author): Some of these shared reading optimizations depend on TimeStep being passive. Does this always hold true?

Reply (member): It's a fair-enough assumption. I added some more over the Primitive loops and removed some over smaller loops where the performance benefit might not justify the increased maintenance.

}

else {
@@ -1719,6 +1737,8 @@ void CFVMFlowSolverBase<V, FlowRegime>::SetResidual_DualTime(CGeometry *geometry
/*--- Loop over all nodes (excluding halos) to compute the remainder
of the dual time-stepping source term. ---*/

AD::StartNoSharedReading();

SU2_OMP_FOR_STAT(omp_chunk_size)
for (iPoint = 0; iPoint < nPointDomain; iPoint++) {

Expand Down Expand Up @@ -1756,6 +1776,8 @@ void CFVMFlowSolverBase<V, FlowRegime>::SetResidual_DualTime(CGeometry *geometry
}
}
END_SU2_OMP_FOR

AD::EndNoSharedReading();
}

}
5 changes: 0 additions & 5 deletions SU2_CFD/src/SU2_CFD.cpp
@@ -73,11 +73,6 @@ int main(int argc, char *argv[]) {
#endif
SU2_MPI::Comm MPICommunicator = SU2_MPI::GetComm();

/*--- AD initialization ---*/
#ifdef HAVE_OPDI
AD::getGlobalTape().initialize();
#endif

/*--- Uncomment the following line if runtime NaN catching is desired. ---*/
// feenableexcept(FE_INVALID | FE_OVERFLOW | FE_DIVBYZERO );

6 changes: 6 additions & 0 deletions SU2_CFD/src/drivers/CDiscAdjMultizoneDriver.cpp
@@ -880,6 +880,12 @@ void CDiscAdjMultizoneDriver::SetAdj_ObjFunction() {

void CDiscAdjMultizoneDriver::ComputeAdjoints(unsigned short iZone, bool eval_transfer) {

#if defined(CODI_INDEX_TAPE) || defined(HAVE_OPDI)
if (nZone > 1 && rank == MASTER_NODE) {
std::cout << "WARNING: Index AD types do not support multiple zones." << std::endl;
}
#endif

AD::ClearAdjoints();

/*--- Initialize the adjoints in iZone ---*/
6 changes: 6 additions & 0 deletions SU2_CFD/src/integration/CIntegration.cpp
@@ -76,6 +76,10 @@ void CIntegration::Space_Integration(CGeometry *geometry,
CNumerics* conv_bound_numerics = numerics[CONV_BOUND_TERM + omp_get_thread_num()*MAX_TERMS];
CNumerics* visc_bound_numerics = numerics[VISC_BOUND_TERM + omp_get_thread_num()*MAX_TERMS];

/*--- Pause preaccumulation in boundary conditions for hybrid parallel AD. ---*/
/// TODO: Check if this is really needed.
//const auto pausePreacc = (omp_get_num_threads() > 1) && AD::PausePreaccumulation();

/*--- Boundary conditions that depend on other boundaries (they require MPI synchronization) ---*/

solver_container[MainSolver]->BC_Fluid_Interface(geometry, solver_container, conv_bound_numerics, visc_bound_numerics, config);
@@ -181,6 +185,8 @@
solver_container[MainSolver]->BC_Periodic(geometry, solver_container, conv_bound_numerics, config);
}

//AD::ResumePreaccumulation(pausePreacc);

}

void CIntegration::Time_Integration(CGeometry *geometry, CSolver **solver_container, CConfig *config,