Convolution1D and Deconvolution1D layers #4811

Open
magicse opened this issue Jun 18, 2023 · 20 comments

Comments

@magicse
Contributor

magicse commented Jun 18, 2023

Simple question.
My model has many Convolution1D and Deconvolution1D layers, and the execution time on CPU and Vulkan is about the same. Does ncnn support Vulkan acceleration for Convolution1D and Deconvolution1D layers?

magicse changed the title from "Convolution1D" to "Convolution1D and Deconvolution1D layers" on Jun 18, 2023
@nihui
Member

nihui commented Jun 18, 2023

currently, no vulkan conv1d / deconv1d

@magicse
Contributor Author

magicse commented Jun 18, 2023

Thank you, @nihui.
Is there an example of a custom layer template that uses Vulkan somewhere?
Something like implement-custom-layer-step-by-step, but for Vulkan.
I want to try to make my own custom Conv1d layer with Vulkan support,
because without Vulkan my HiFi-GAN vocoder is quite slow: a 3-second vocal phrase takes 36 seconds to generate.

@Baiyuetribe
Contributor

Additionally, this holds true for the vocoders of both VITS and DiffSinger. In short, all TTS synthesis relies on this.

@magicse
Contributor Author

magicse commented Jun 21, 2023

I had to create
Convolution1D_vulkan.cpp

#include "Convolution1D_vulkan.h"
#include "layer_shader_type.h"
#include "layer_type.h"

Convolution1D_vulkan::Convolution1D_vulkan()
{
    one_blob_only = true;
    support_vulkan = true;
    support_image_storage = true;
    pipeline_convolution1d = 0;
    reshape_w = 0;
}
int Convolution1D_vulkan::create_pipeline(const Option& _opt)
{
 ...
}
int Convolution1D_vulkan::destroy_pipeline(const Option&)
{
	//
}
int Convolution1D_vulkan::upload_model(VkTransfer& cmd, const Option& opt)
{
....
}
int Convolution1D_vulkan::forward(const VkMat& bottom_blob, VkMat& top_blob, VkCompute& cmd, const Option& opt) const
{
...
}

Convolution1D_vulkan.h

All needed implementations

Main.cpp

#include "Convolution1D_vulkan.h"
DEFINE_LAYER_CREATOR(Convolution1D_vulkan)
...
ncnn::Net HIFIVOICE;
HIFIVOICE.register_custom_layer("Convolution1D_vulkan", Convolution1D_vulkan_layer_creator);

Everything compiled fine.
I also have convolution1d.comp, and I had to create convolution1d.text2hex.txt and convolution1d.hex.h from it.
As far as I can see, the native ncnn Vulkan shaders are referenced by index:

        int shader_type_index = -1;
        if (elempack == 1 && out_elempack == 1) shader_type_index = LayerShaderType::convolution;
        if (elempack == 4 && out_elempack == 4) shader_type_index = LayerShaderType::convolution_pack4;

        pipeline_convolution1d = new Pipeline(vkdev);
        pipeline_convolution1d->set_optimal_local_size_xyz(local_size_xyz);
        pipeline_convolution1d->create(shader_type_index, opt, specializations);

But I don't know how to implement this in my custom layer without layer_shader_type.h and layer_shader_type_enum.h.

@magicse
Contributor Author

magicse commented Jun 24, 2023

I found out how to do this:

    static std::vector<uint32_t> spirv;

    static ncnn::Mutex lock;
    {
        ncnn::MutexLockGuard guard(lock);
        if (spirv.empty())
        {
            compile_spirv_module(convolution1d_comp_data, sizeof(convolution1d_comp_data), opt, spirv);
        }

    }

        std::vector<vk_specialization_type> specializations(7 + 10);
        specializations[0].i = kernel_w;
        specializations[1].i = dilation_w;
        specializations[2].i = stride_w;
        specializations[3].i = bias_term;
        specializations[4].i = activation_type;
        specializations[5].f = activation_params.w >= 1 ? activation_params[0] : 0.f;
        specializations[6].f = activation_params.w == 2 ? activation_params[1] : 0.f;
        specializations[7 + 0].i = shape_bordered_packed.dims;
        specializations[7 + 1].i = shape_bordered_packed.w;
        specializations[7 + 2].i = shape_bordered_packed.h;
        specializations[7 + 3].i = shape_bordered_packed.c;
        specializations[7 + 4].i = shape_bordered_packed.cstep;
        specializations[7 + 5].i = out_shape_packed.dims;
        specializations[7 + 6].i = out_shape_packed.w;
        specializations[7 + 7].i = out_shape_packed.h;
        specializations[7 + 8].i = out_shape_packed.c;
        specializations[7 + 9].i = out_shape_packed.cstep;

        Mat local_size_xyz(8, 8, std::min(4, (num_output / out_elempack + 1) / 2), (void*)0);
        if (out_shape_packed.dims != 0)
        {
            local_size_xyz.w = std::min(8, out_shape_packed.w);
            local_size_xyz.h = std::min(8, out_shape_packed.h);
            local_size_xyz.c = std::min(4, (out_shape_packed.c + 1) / 2);
        }


        pipeline_convolution1d = new Pipeline(vkdev);
        pipeline_convolution1d->set_optimal_local_size_xyz(local_size_xyz);
        pipeline_convolution1d->create(spirv.data(), spirv.size() * 4, specializations);
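
The pattern in the snippet above (compile the shader to SPIR-V once, guard the cache with a lock, reuse it for every layer instance) can be sketched with only the standard library. This is an illustration under assumptions, not ncnn code: `fake_compile` is a hypothetical stand-in for `compile_spirv_module`, and `std::mutex` stands in for `ncnn::Mutex`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Counts how many times the (pretend) compiler ran.
static int g_compile_calls = 0;

// Hypothetical stand-in for compile_spirv_module(): fills `out` with dummy words.
static void fake_compile(std::vector<uint32_t>& out)
{
    ++g_compile_calls;
    out.assign({0x07230203u, 1u, 2u, 3u}); // first word is the SPIR-V magic number
}

// Returns the cached module, compiling it on first use only.
static const std::vector<uint32_t>& get_spirv()
{
    static std::vector<uint32_t> spirv;
    static std::mutex lock;

    std::lock_guard<std::mutex> guard(lock);
    if (spirv.empty())
        fake_compile(spirv);
    assert(!spirv.empty());
    return spirv;
}

// Size in bytes, matching the spirv.size() * 4 passed to create() above.
static std::size_t spirv_size_bytes()
{
    return get_spirv().size() * sizeof(uint32_t);
}
```

One consequence of a single static cache is that the module is built with whichever options arrive first. If different layers were created with different options (for example fp16 on and off), a per-option cache would probably be needed.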

@magicse
Contributor Author

magicse commented Jun 24, 2023

I have only one question.
In the shaders I saw "afp v0", sfp, "psc(c)" and "afpvec4".
I don't understand what afp, psc, sfp and afpvec4 are, and I couldn't find any information about them.

| GLSL data type | C data type | Description |
|---|---|---|
| bool | int | A conditional type, taking on values of true or false. |
| int | int | Signed integer. |
| float | float | Single floating-point scalar. |
| vec2 | float [2] | Two component floating-point vector. |
| vec3 | float [3] | Three component floating-point vector. |
| vec4 | float [4] | Four component floating-point vector. |
| bvec2 | int [2] | Two component Boolean vector. |
| bvec3 | int [3] | Three component Boolean vector. |
| bvec4 | int [4] | Four component Boolean vector. |
| ivec2 | int [2] | Two component signed integer vector. |
| ivec3 | int [3] | Three component signed integer vector. |
| ivec4 | int [4] | Four component signed integer vector. |
| mat2 | float [4] | 2×2 floating-point matrix. |
| mat3 | float [9] | 3×3 floating-point matrix. |
| mat4 | float [16] | 4×4 floating-point matrix. |
| sampler1D | int | Handle for accessing a 1D texture. |
| sampler2D | int | Handle for accessing a 2D texture. |
| sampler3D | int | Handle for accessing a 3D texture. |
| samplerCube | int | Handle for accessing a cubemap texture. |
| sampler1DShadow | int | A handle for accessing a 1D depth texture with comparison. |
| sampler2DShadow | int | A handle for accessing a 2D depth texture with comparison. |

@magicse
Contributor Author

magicse commented Jun 25, 2023

I found the declarations here: gpu.cpp

@nihui
Member

nihui commented Jun 25, 2023

under construction ...
https://github.com/nihui/ncnn/blob/doc-glsl-ext/docs/developer-guide/glsl-extension.md

@magicse
Contributor Author

magicse commented Jun 25, 2023

Hi @nihui, thank you for the link and the help.
I'm now trying to write the conv1d shader. Maybe it will be ready soon )))

@nihui
Member

nihui commented Jun 26, 2023

Hi, if you use QQ you can join the ncnn QQ group (see the ncnn README), through which I can provide more timely help.

@nihui
Member

nihui commented Jun 26, 2023

https://github.com/Tencent/ncnn/wiki/glsl-extension
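
As far as I understand the page linked above, `sfp` is the type used for values stored in buffers, `afp` is the type used for arithmetic, and `psc(x)` picks the specialization constant `x` when it is non-zero and falls back to the push constant `p.x` otherwise. A small C++ analogy of that selection follows; the names `psc_select`, `Shape` and `blob_elements` are made up for illustration and are not ncnn API.

```cpp
#include <cassert>

// Shape values as a shader would see them: one copy baked in at pipeline
// creation (specialization constants), one copy supplied per dispatch
// (push constants).
struct Shape
{
    int w;
    int h;
};

// Analogy of psc(x): use the baked-in value unless it is 0 (meaning "unknown
// at pipeline creation"), in which case use the per-dispatch value.
static int psc_select(int spec_const, int push_const)
{
    return spec_const == 0 ? push_const : spec_const;
}

// Number of elements in a 2D blob, resolved the same way for each dimension.
static int blob_elements(const Shape& spec, const Shape& push)
{
    const int w = psc_select(spec.w, push.w);
    const int h = psc_select(spec.h, push.h);
    assert(w >= 0 && h >= 0);
    return w * h;
}
```

Read this way, a pipeline whose shape constants are all 0 should still run, because every `psc()` then resolves to the push constant supplied at dispatch time.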

@magicse
Contributor Author

magicse commented Jun 30, 2023

Work in progress

convolution1d.comp for kernel_w > 1 and elempack 1

#version 450

#if NCNN_fp16_storage
#extension GL_EXT_shader_16bit_storage: require
#endif
#if NCNN_fp16_arithmetic
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
#endif


#extension GL_EXT_debug_printf : enable
#extension GL_GOOGLE_include_directive: enable
#include "vulkan_activation.comp"

layout (constant_id = 0) const int kernel_w = 1;
layout (constant_id = 1) const int dilation_w = 1;
layout (constant_id = 2) const int stride_w = 1;
layout (constant_id = 3) const int bias_term = 0;
layout (constant_id = 4) const int activation_type = 0;
layout (constant_id = 5) const float activation_param_0 = 0;
layout (constant_id = 6) const float activation_param_1 = 0;

#define shape_constant_id_offset 7
layout (constant_id = shape_constant_id_offset + 0) const int dims = 0;
layout (constant_id = shape_constant_id_offset + 1) const int w = 0;
layout (constant_id = shape_constant_id_offset + 2) const int h = 0;
layout (constant_id = shape_constant_id_offset + 3) const int c = 0;
layout (constant_id = shape_constant_id_offset + 4) const int cstep = 0;

layout (constant_id = shape_constant_id_offset + 5) const int outdims = 0;
layout (constant_id = shape_constant_id_offset + 6) const int outw = 0;
layout (constant_id = shape_constant_id_offset + 7) const int outh = 0;
layout (constant_id = shape_constant_id_offset + 8) const int outc = 0;
layout (constant_id = shape_constant_id_offset + 9) const int outcstep = 0;

#if NCNN_image_shader
layout (binding = 0) uniform unfp sampler2D bottom_blob;
layout (binding = 1, imfmtc1) writeonly uniform unfp image2D top_blob;
layout (binding = 2) uniform unfp sampler3D weight_blob;
layout (binding = 3) uniform unfp sampler3D bias_blob;
#else
layout (binding = 0) readonly buffer bottom_blob { sfp bottom_blob_data[]; };
layout (binding = 1) writeonly buffer top_blob { sfp top_blob_data[]; };
layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };
layout (binding = 3) readonly buffer bias_blob { sfp bias_data[]; };
#endif

layout (push_constant) uniform parameter
{
   int dims;
   int w;
   int h;
   int c;
   int cstep;

   int outdims;
   int outw;
   int outh;
   int outc;
   int outcstep;
} p;

void print_bottom_blob()
{    
	int gx = int(gl_GlobalInvocationID.x);
	int gy = int(gl_GlobalInvocationID.y);
	int gz = int(gl_GlobalInvocationID.z);
	if (gx >= 1 || gy >= 1)
			return;
	debugPrintfEXT("Hello %i, %i\n", gx, gy);
	for (int i = 0; i < psc(w); ++i) {
		for (int j = 0; j < psc(h); ++j) {
		debugPrintfEXT("Elem %d %d: %f ", i, j, bottom_blob_data[i*psc(h)+j]);
		}
		debugPrintfEXT("\n");
	}
}

void main()
{

    int gx = int(gl_GlobalInvocationID.x) * 2;
    int gy = int(gl_GlobalInvocationID.y) * 2;
    int gz = int(gl_GlobalInvocationID.z) * 2;
    //print_bottom_blob();

    if (gx >= psc(outw) || gy >= psc(outh) || gz >= psc(outc))
        return;

    const ivec2 gx2 = gx + ivec2(0, 1);
    const ivec2 gy2 = gy + ivec2(0, 1);

    afp sum0 = afp(0.0f);
    afp sum1 = afp(0.0f);
    afp sum2 = afp(0.0f);
    afp sum3 = afp(0.0f);	

    if (bias_term == 1)
    {
#if NCNN_image_shader
        //sum = image2d_ld1(bias_blob, ivec2(gx, 0));
#else
	sum0 = buffer_ld1(bias_data, gy2.x);
	sum2 = buffer_ld1(bias_data, gy2.y);
	sum1 = sum0;
	sum3 = sum2;
#endif
    }

#if NCNN_image_shader
  //
#else
	ivec2 w_offsetv = kernel_w * psc(h) * gy2; //  weight offset
	for (int iny = 0; iny < psc(h); iny++)
	{
		ivec2 v_offsetv = iny * psc(w) + gx2 * stride_w; // value offset
		for (int x = 0; x < kernel_w; x++)
		{
			afp v0 = buffer_ld1(bottom_blob_data, v_offsetv.x + x * dilation_w); // Load the value +0
			afp v1 = buffer_ld1(bottom_blob_data, v_offsetv.y + x * dilation_w); // Load the value +1
			afp k0 = buffer_ld1(weight_data, w_offsetv.x + x); // Load the weight value +0
			afp k1 = buffer_ld1(weight_data, w_offsetv.y + x); // Load the weight value +1

			sum0 += v0 * k0;
			sum1 += v1 * k0;
			sum2 += v0 * k1;
			sum3 += v1 * k1;
		}
		w_offsetv += kernel_w; // Move to the next set of weights
	}
#endif	
	sum0 = activation_afp(sum0, activation_type, activation_param_0, activation_param_1);
	sum1 = activation_afp(sum1, activation_type, activation_param_0, activation_param_1);
	sum2 = activation_afp(sum2, activation_type, activation_param_0, activation_param_1);
	sum3 = activation_afp(sum3, activation_type, activation_param_0, activation_param_1);
	
#if NCNN_image_shader
    //image2d_st1(top_blob, ivec3(gx2.x, gy2.x, gz2.x), sum0);
    //image2d_st1(top_blob, ivec3(gx2.y, gy2.x, gz2.x), sum1);
    //image2d_st1(top_blob, ivec3(gx2.x, gy2.y, gz2.x), sum2);
    //image2d_st1(top_blob, ivec3(gx2.y, gy2.y, gz2.x), sum3);
#else
	// gx and gy are already known to be in range; guard only the +1 neighbours
	buffer_st1(top_blob_data, gy2.x * psc(outw) + gx2.x, sum0);
	if (gx + 1 < psc(outw)) buffer_st1(top_blob_data, gy2.x * psc(outw) + gx2.y, sum1);
	if (gy + 1 < psc(outh)) buffer_st1(top_blob_data, gy2.y * psc(outw) + gx2.x, sum2);
	if (gy + 1 < psc(outh) && gx + 1 < psc(outw)) buffer_st1(top_blob_data, gy2.y * psc(outw) + gx2.y, sum3);
#endif
}
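
To check a shader like this against known values, a plain CPU reference with the same indexing is handy. The sketch below assumes the layouts the shader implies: input stored as `[in_channels][w]`, weights as `[out_channels][in_channels][kernel_w]`, input already padded, and no activation. The function name `conv1d_ref` is made up here and is not part of ncnn.

```cpp
#include <cassert>
#include <vector>

static std::vector<float> conv1d_ref(const std::vector<float>& in, int w, int inch,
                                     const std::vector<float>& weight,
                                     const std::vector<float>& bias, // empty means no bias
                                     int outch, int kernel_w, int stride_w, int dilation_w)
{
    assert((int)in.size() == w * inch);
    assert((int)weight.size() == outch * inch * kernel_w);

    // Output length for an already padded input.
    const int kernel_extent = dilation_w * (kernel_w - 1) + 1;
    assert(w >= kernel_extent);
    const int outw = (w - kernel_extent) / stride_w + 1;

    std::vector<float> out(outw * outch, 0.f);
    for (int oc = 0; oc < outch; oc++)
    {
        for (int ox = 0; ox < outw; ox++)
        {
            float sum = bias.empty() ? 0.f : bias[oc];
            for (int ic = 0; ic < inch; ic++)
            {
                const int v_offset = ic * w + ox * stride_w;      // value offset
                const int w_offset = (oc * inch + ic) * kernel_w; // weight offset
                for (int k = 0; k < kernel_w; k++)
                    sum += in[v_offset + k * dilation_w] * weight[w_offset + k];
            }
            out[oc * outw + ox] = sum;
        }
    }
    return out;
}
```

The `v_offset` and `w_offset` expressions mirror `v_offsetv` and `w_offsetv` in the shader, so a mismatch between the two outputs points at the 2x2 tiling or the store guards rather than at the indexing.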

@magicse
Contributor Author

magicse commented Jul 5, 2023

My convolution1d.comp for kernel_w > 1 and elempack 1 (unpacked fp32) works and produces correct results.
Convolution1D_vulkan.cpp is arranged as a custom layer, and everything works correctly.
But I have one problem: Convolution1D_vulkan::create_pipeline(const Option& _opt) receives all parameters correctly except bottom_shapes and top_shapes, which are always empty.
Main.cpp

#include "Convolution1D_vulkan.h"
DEFINE_LAYER_CREATOR(Convolution1D_vulkan)
....
ncnn::Net HIFIVOICE;
HIFIVOICE.opt.use_fp16_packed = false;
HIFIVOICE.opt.use_fp16_storage = false;
HIFIVOICE.opt.use_fp16_arithmetic = false;
HIFIVOICE.opt.use_int8_storage = false;
HIFIVOICE.opt.use_int8_arithmetic = false;
HIFIVOICE.opt.use_int8_packed = false;
HIFIVOICE.opt.use_vulkan_compute = true;
HIFIVOICE.register_custom_layer("Convolution1D_vulkan", Convolution1D_vulkan_layer_creator);
....
int Convolution1D_vulkan::create_pipeline(const Option& _opt)
{
    std::cout << "=== Create Pipeline: ===" << std::endl;
    if (dynamic_weight)
    {
        support_vulkan = false;
        support_image_storage = false;
        return 0;
    }

    // Create a convolution pipeline using Vulkan
    Option opt = _opt;
    const Mat& shape = bottom_shapes.empty() ? Mat() : bottom_shapes[0];
    const Mat& out_shape = top_shapes.empty() ? Mat() : top_shapes[0];
    std::cout << "=== Create Pipeline: ===" << std::endl;
    std::cout << "=== Create Pipeline: New shape from bottom_shapes WxHxC x Dims ===" << shape.w << " " << shape.h << " " << shape.c << " " << shape.d <<std::endl;
    std::cout << "=== Create Pipeline: New out_shape from top_shapes WxHxC x Dims ===" << out_shape.w << " " << out_shape.h << " " << out_shape.c << " " << out_shape.d <<std::endl;

Output

=== Create Pipeline: ===
=== Create Pipeline: New shape from bottom_shapes WxHxC x Dims ===0 0 0 0
=== Create Pipeline: New out_shape from top_shapes WxHxC x Dims ===0 0 0 0

=== Create Pipeline: padding pipeline pad_left pad_right : ===3 3
=== Create Pipeline: data_packed.create maxk : ===7
=== Create Pipeline: data_packed.create num_input : ===5120
=== Create Pipeline: data_packed.create elempack : ===4
=== Create Pipeline: data_packed.create num_input / elempack : ===1280
=== Create Pipeline: data_packed.create num_output : ===8
=== Create Pipeline: data_packed.create out_elempack : ===4
=== Create Pipeline: data_packed.create num_output / out_elempack : ===2
=== Create Pipeline: data_packed.create (size_t)4 * elempack * out_elempack : ===64
=== Create Pipeline: data_packed.create elempack * out_elempack : ===16
=== Create Pipeline: weight_data WxHxC : === 286720 x 1 x 1
=== Create Pipeline: weight_data WxHxC reshaped : === 7 x 5120 x 8
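
The numbers in the log above are consistent with each other, which a little arithmetic confirms. The sketch assumes the usual rule that a channel count divisible by 4 is packed by 4 and anything else is left unpacked. All three function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed packing rule: use pack4 when the channel count is a multiple of 4.
static int pick_elempack(int channels)
{
    assert(channels > 0);
    return channels % 4 == 0 ? 4 : 1;
}

// Total number of weights in a 1D convolution: kernel_w x num_input x num_output.
static int weight_count(int kernel_w, int num_input, int num_output)
{
    return kernel_w * num_input * num_output;
}

// Bytes per packed fp32 weight element (an elempack x out_elempack tile).
static std::size_t packed_elemsize(int elempack, int out_elempack)
{
    return sizeof(float) * elempack * out_elempack;
}
```

With kernel_w = 7, num_input = 5120 and num_output = 8 this gives 286720 weights, 1280 packed input channels, 2 packed output channels and 64 bytes per packed element, matching the log.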

@magicse
Contributor Author

magicse commented Jul 9, 2023

I made a Convolution1D for the GPU (for fp32 pack1 and pack4 blobs with unpacked weights).
Now a 7-second voice phrase is generated in 7 seconds.
Good results.
[image: melgram_flipped]

Inference duration for a 7-second voice phrase without the GPU:

------------------------
Inference duration: 177 seconds
Out matrix size W x H = 173312 x 1 number of channels 1

Inference duration for a 7-second voice phrase with the GPU:

------------------------
Inference duration: 7 seconds
Out matrix size W x H = 173312 x 1 number of channels 1

This is realtime.
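
Whether a vocoder is "realtime" can be expressed as a real-time factor (RTF): inference time divided by audio duration, where an RTF of 1 or less means the audio is produced at least as fast as it plays. A minimal sketch using the durations reported above (the function names are illustrative):

```cpp
#include <cassert>

// Real-time factor: seconds of compute per second of audio.
static double real_time_factor(double inference_seconds, double audio_seconds)
{
    assert(audio_seconds > 0.0);
    return inference_seconds / audio_seconds;
}

// How many times faster the second run is than the first.
static double speedup(double before_seconds, double after_seconds)
{
    assert(after_seconds > 0.0);
    return before_seconds / after_seconds;
}
```

With these numbers the CPU run has an RTF of about 25.3 (177 / 7), the GPU run has an RTF of 1.0 (7 / 7), and the speedup is roughly 25x.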

@magicse
Contributor Author

magicse commented Jul 9, 2023

convolution1d_pack4.comp (float 32 pack4 blobs and input unpacked weights)

#version 450

#if NCNN_fp16_storage
#extension GL_EXT_shader_16bit_storage: require
#endif
#if NCNN_fp16_arithmetic
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
#endif


//#extension GL_EXT_debug_printf : enable
#extension GL_GOOGLE_include_directive: enable
#include "vulkan_activation.comp"

layout (constant_id = 0) const int kernel_w = 1;
layout (constant_id = 1) const int dilation_w = 1;
layout (constant_id = 2) const int stride_w = 1;
layout (constant_id = 3) const int bias_term = 0;
layout (constant_id = 4) const int activation_type = 0;
layout (constant_id = 5) const float activation_param_0 = 0;
layout (constant_id = 6) const float activation_param_1 = 0;

#define shape_constant_id_offset 7
layout (constant_id = shape_constant_id_offset + 0) const int dims = 0;
layout (constant_id = shape_constant_id_offset + 1) const int w = 0;
layout (constant_id = shape_constant_id_offset + 2) const int h = 0;
layout (constant_id = shape_constant_id_offset + 3) const int c = 0;
layout (constant_id = shape_constant_id_offset + 4) const int cstep = 0;

layout (constant_id = shape_constant_id_offset + 5) const int outdims = 0;
layout (constant_id = shape_constant_id_offset + 6) const int outw = 0;
layout (constant_id = shape_constant_id_offset + 7) const int outh = 0;
layout (constant_id = shape_constant_id_offset + 8) const int outc = 0;
layout (constant_id = shape_constant_id_offset + 9) const int outcstep = 0;

#if NCNN_image_shader
layout (binding = 0) uniform unfp sampler3D bottom_blob;
layout (binding = 1, imfmtc4) writeonly uniform unfp image3D top_blob;
layout (binding = 2) uniform unfp sampler3D weight_blob;
layout (binding = 3) uniform unfp sampler3D bias_blob;
#else
//layout (binding = 0) readonly buffer bottom_blob { sfp bottom_blob_data[]; };
layout (binding = 0) readonly buffer bottom_blob { sfpvec4 bottom_blob_data[]; };

//layout (binding = 1) writeonly buffer top_blob { sfp top_blob_data[]; };
layout (binding = 1) writeonly buffer top_blob { sfpvec4 top_blob_data[]; };

//layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };
//layout (binding = 3) readonly buffer bias_blob { sfp bias_data[]; };
#if NCNN_fp16_packed || (NCNN_fp16_storage && !NCNN_fp16_arithmetic)
layout (binding = 2) readonly buffer weight_blob { sfpvec4 weight_data[]; };
#else
//layout (binding = 2) readonly buffer weight_blob { sfpmat4 weight_data[]; };
//layout (binding = 2) readonly buffer weight_blob { sfpvec4 weight_data[]; };
layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };

#endif
layout (binding = 3) readonly buffer bias_blob { sfpvec4 bias_data[]; };

#endif

layout (push_constant) uniform parameter
{
    int dims;
    int w;
    int h;
    int c;
    int cstep;

    int outdims;
    int outw;
    int outh;
    int outc;
    int outcstep;
} p;

/*
void print_bottblob()
{    
	int gx = int(gl_GlobalInvocationID.x);
    int gy = int(gl_GlobalInvocationID.y);
    int gz = int(gl_GlobalInvocationID.z);
	if (gx >= 1 || gy >= 1 || gz >= 1)
			return;
	//debugPrintfEXT("Hello %i, %i\n", gx, gy);
	for (int i = 0; i < psc(h)/4; ++i) {
		for (int j = 0; j < psc(w); ++j) {
		//for (int j = 0; j < psc(h); ++j) {
		//afp v = buffer_ld1(bottom_blob_data, 3);
		//debugPrintfEXT("Elem %d %d: %f ", i, j, v);
		
		//debugPrintfEXT("Bot_Blob %d %d: %f ", i, j, bottom_blob_data[i*psc(h)+j]);
		
		afpvec4 test = buffer_ld4(bottom_blob_data, i*psc(w)+j);
		debugPrintfEXT(" Top_Blob %d %d: %v4f ", i, j, test);
		
		//afpvec4 value;
		//value = buffer_ld4(bottom_blob_data, i*psc(h)+j );		
		//debugPrintfEXT("Bot_Blob %d %d: %f ", i, j, value);

		}
		debugPrintfEXT("\n");
	}
}

void print_weight()
{    
	int gx = int(gl_GlobalInvocationID.x);
    int gy = int(gl_GlobalInvocationID.y);
    int gz = int(gl_GlobalInvocationID.z);
	if (gx >= 1 || gy >= 1 || gz >= 1)
			return;
	debugPrintfEXT("Hello %i, %i\n", gx, gy);
	for (int i = 0; i < psc(outh)*4; ++i) {
		for (int j = 0; j < psc(outw)*kernel_w; ++j) {
		//afp v = buffer_ld1(bottom_blob_data, 3);
		//debugPrintfEXT("Elem %d %d: %f ", i, j, v);
		debugPrintfEXT("Weight %d %d: %f ", i, j, weight_data[i*psc(outw)*kernel_w+j]);
		//afpvec4 test = buffer_ld4(weight_data, i*psc(outw)+j);
		//debugPrintfEXT(" Weight %d %d: %v4f ", i, j, test);
		}
		debugPrintfEXT("\n");
	}
}

*/

void main()
{

    int gx = int(gl_GlobalInvocationID.x) * 2;
    int gy = int(gl_GlobalInvocationID.y) * 2;
    int gz = int(gl_GlobalInvocationID.z) * 2;

	//print_bottblob();
	//print_weight();

    if (gx >= psc(outw) || gy >= psc(outh) || gz >= psc(outc))
        return;

    const ivec2 gx2 = gx + ivec2(0, 1);
    const ivec2 gy2 = gy + ivec2(0, 1);
    const ivec2 gy4 = gy*4 + ivec2(0, 4);
    const ivec2 gz2 = gz + ivec2(0, 1);

	afpvec4 sum0 = afpvec4(0.0f);
	afpvec4 sum1 = afpvec4(0.0f);
	afpvec4 sum2 = afpvec4(0.0f);
	afpvec4 sum3 = afpvec4(0.0f);	
	
	afpvec4 sum4 = afpvec4(0.0f);
	afpvec4 sum5 = afpvec4(0.0f);

	afpvec4 sum6 = afpvec4(0.0f);
	afpvec4 sum7 = afpvec4(0.0f);
	afpvec4 sum8 = afpvec4(0.0f);
	afpvec4 sum9 = afpvec4(0.0f);	
	
	
	afpvec4 sum10 = afpvec4(0.0f);
	afpvec4 sum11 = afpvec4(0.0f);
	afpvec4 sum12 = afpvec4(0.0f);
	afpvec4 sum13 = afpvec4(0.0f);
	
	afpvec4 sum14 = afpvec4(0.0f);
	afpvec4 sum15 = afpvec4(0.0f);
	
	afpvec4 sum16 = afpvec4(0.0f);
	afpvec4 sum17 = afpvec4(0.0f);
	afpvec4 sum18 = afpvec4(0.0f);
	afpvec4 sum19 = afpvec4(0.0f);

    if (bias_term == 1)
    {
#if NCNN_image_shader
        //sum = image2d_ld1(bias_blob, ivec2(gx, 0));
#else
		sum4 = buffer_ld4(bias_data, gy2.x);
		sum5 = sum4;
		sum14 = buffer_ld4(bias_data, gy2.y);
		sum15 = sum14;

#endif
    }

#if NCNN_image_shader
	//
#else
		

			ivec4 gy4_0 = gy4.x + ivec4(0, 1, 2, 3);
			ivec4 gy4_1 = gy4.y + ivec4(0, 1, 2, 3);

			ivec4 w_offsetv4_0;
			ivec4 w_offsetv4_1;
			w_offsetv4_0 = kernel_w * psc(h) * 4 * gy4_0;
			w_offsetv4_1 = kernel_w * psc(h) * 4 * gy4_1;
			
			for (int iny = 0; iny < psc(h); iny++)
			{
				
				ivec2 v_offsetv = iny * psc(w) + gx2 * stride_w;
				
				for (int x = 0; x < kernel_w; x++)
				{
					
					afpvec4 v0 = buffer_ld4(bottom_blob_data, v_offsetv.x + x * dilation_w);
					afpvec4 v1 = buffer_ld4(bottom_blob_data, v_offsetv.y + x * dilation_w);
					
					
					afp k0 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 0); // Load the weight value
					afp k1 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 1); // Load the weight value
					afp k2 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 2); // Load the weight value
					afp k3 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 3); // Load the weight value
					
					afp k4 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 0); // Load the weight value
					afp k5 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 1); // Load the weight value
					afp k6 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 2); // Load the weight value
					afp k7 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 3); // Load the weight value
					
					afp k8 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 0); // Load the weight value
					afp k9 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 1); // Load the weight value
					afp k10 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 2); // Load the weight value
					afp k11 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 3); // Load the weight value
					
					afp k12 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 0); // Load the weight value
					afp k13 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 1); // Load the weight value
					afp k14 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 2); // Load the weight value
					afp k15 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 3); // Load the weight value
					
					
					afp k16 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 0); // Load the weight value
					afp k17 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 1); // Load the weight value
					afp k18 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 2); // Load the weight value
					afp k19 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 3); // Load the weight value
					
					afp k20 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 0); // Load the weight value
					afp k21 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 1); // Load the weight value
					afp k22 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 2); // Load the weight value
					afp k23 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 3); // Load the weight value
					
					afp k24 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 0); // Load the weight value
					afp k25 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 1); // Load the weight value
					afp k26 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 2); // Load the weight value
					afp k27 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 3); // Load the weight value
					
					afp k28 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 0); // Load the weight value
					afp k29 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 1); // Load the weight value
					afp k30 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 2); // Load the weight value
					afp k31 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 3); // Load the weight value
					
					

#if NCNN_fp16_packed || (NCNN_fp16_storage && !NCNN_fp16_arithmetic)
                // GL_EXT_shader_16bit_storage does not define f16mat4 type :(
                afpmat4 k0 = afpmat4(
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 0),
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 1),
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 2),
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 3)
                );
                afpmat4 k1 = afpmat4(
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 0),
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 1),
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 2),
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 3)
                );
#else

				
#endif
					//debugPrintfEXT(" k0, k1, k2, k3 %f, %f, %f, %f \n", k0, k1, k2, k3);
					//debugPrintfEXT(" k4, k5, k6, k7 %f, %f, %f, %f \n", k4, k5, k6, k7);
					sum0 += v0 * afpvec4(k0, k1, k2, k3); //* k0;
					sum1 += v1 * afpvec4(k0, k1, k2, k3); //* k0;
					sum2 += v0 * afpvec4(k4, k5, k6, k7); //* k1;
					sum3 += v1 * afpvec4(k4, k5, k6, k7); //* k1;
					
					sum6 += v0 * afpvec4(k8, k9, k10, k11); //* k0;
					sum7 += v1 * afpvec4(k8, k9, k10, k11); //* k0;
					sum8 += v0 * afpvec4(k12, k13, k14, k15); //* k1;
					sum9 += v1 * afpvec4(k12, k13, k14, k15); //* k1;
					
					sum10 += v0 * afpvec4(k16, k17, k18, k19); //* k0;
					sum11 += v1 * afpvec4(k16, k17, k18, k19); //* k0;
					sum12 += v0 * afpvec4(k20, k21, k22, k23); //* k1;
					sum13 += v1 * afpvec4(k20, k21, k22, k23); //* k1;
					
					sum16 += v0 * afpvec4(k24, k25, k26, k27); //* k0;
					sum17 += v1 * afpvec4(k24, k25, k26, k27); //* k0;
					sum18 += v0 * afpvec4(k28, k29, k30, k31); //* k1;
					sum19 += v1 * afpvec4(k28, k29, k30, k31); //* k1;
					

				}

				w_offsetv4_0 += kernel_w*4;
				w_offsetv4_1 += kernel_w*4;
			}
			
			sum4.x += sum0.x + sum0.y + sum0.z + sum0.w;
			sum4.y += sum2.x + sum2.y + sum2.z + sum2.w;
			sum4.z += sum6.x + sum6.y + sum6.z + sum6.w;
			sum4.w += sum8.x + sum8.y + sum8.z + sum8.w;
			
			sum5.x += sum1.x + sum1.y + sum1.z + sum1.w;
			sum5.y += sum3.x + sum3.y + sum3.z + sum3.w;
			sum5.z += sum7.x + sum7.y + sum7.z + sum7.w;
			sum5.w += sum9.x + sum9.y + sum9.z + sum9.w;

			sum14.x += sum10.x + sum10.y + sum10.z + sum10.w;
			sum14.y += sum12.x + sum12.y + sum12.z + sum12.w;
			sum14.z += sum16.x + sum16.y + sum16.z + sum16.w;
			sum14.w += sum18.x + sum18.y + sum18.z + sum18.w;
			
			sum15.x += sum11.x + sum11.y + sum11.z + sum11.w;
			sum15.y += sum13.x + sum13.y + sum13.z + sum13.w;
			sum15.z += sum17.x + sum17.y + sum17.z + sum17.w;
			sum15.w += sum19.x + sum19.y + sum19.z + sum19.w;			

#endif	
	sum4 = activation_afpvec4(sum4, activation_type, activation_param_0, activation_param_1);
	sum5 = activation_afpvec4(sum5, activation_type, activation_param_0, activation_param_1);
	sum14 = activation_afpvec4(sum14, activation_type, activation_param_0, activation_param_1);
	sum15 = activation_afpvec4(sum15, activation_type, activation_param_0, activation_param_1);
#if NCNN_image_shader
    //image2d_st1(top_blob, ivec3(gx2.x, gy2.x, gz2.x), sum0);
    //image2d_st1(top_blob, ivec3(gx2.y, gy2.x, gz2.x), sum1);
    //image2d_st1(top_blob, ivec3(gx2.x, gy2.y, gz2.x), sum2);
    //image2d_st1(top_blob, ivec3(gx2.y, gy2.y, gz2.x), sum3);
#else
	// gx and gy are already known to be in range; guard only the +1 neighbours
	buffer_st4(top_blob_data, gy2.x * psc(outw) + gx2.x, sum4);
	if (gx + 1 < psc(outw)) buffer_st4(top_blob_data, gy2.x * psc(outw) + gx2.y, sum5);
	if (gy + 1 < psc(outh)) buffer_st4(top_blob_data, gy2.y * psc(outw) + gx2.x, sum14);
	if (gy + 1 < psc(outh) && gx + 1 < psc(outw)) buffer_st4(top_blob_data, gy2.y * psc(outw) + gx2.y, sum15);
#endif

}
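
For readers unfamiliar with the pack4 layout that `buffer_ld4` reads here: four consecutive channels are interleaved, so one `vec4` holds the same position `x` from channels `4q` to `4q+3`. Below is a CPU sketch of that repacking for a 2D blob, assuming the channel count is divisible by 4. `pack1_to_pack4` is an illustrative name, not an ncnn function.

```cpp
#include <cassert>
#include <vector>

// in:  [channels][w]        (pack1, one float per element)
// out: [channels / 4][w][4] (pack4, four channels interleaved per element)
static std::vector<float> pack1_to_pack4(const std::vector<float>& in, int w, int channels)
{
    assert(channels % 4 == 0);
    assert((int)in.size() == w * channels);

    std::vector<float> out(in.size());
    for (int q = 0; q < channels / 4; q++)
        for (int x = 0; x < w; x++)
            for (int lane = 0; lane < 4; lane++)
                out[(q * w + x) * 4 + lane] = in[(q * 4 + lane) * w + x];
    return out;
}
```

In the shader, index `iny * psc(w) + x` into `bottom_blob_data` corresponds to `(q * w + x)` here, with the four lanes of the loaded `vec4` being the four interleaved channels.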

@magicse
Contributor Author

magicse commented Jul 24, 2023

I have finished a working convolution1d_vulkan for fp32:

opt.use_fp16_packed = false;
opt.use_fp16_storage = false;
opt.use_fp16_arithmetic = false;
opt.use_int8_storage = false;
opt.use_int8_arithmetic = false;
opt.use_int8_packed = false;

convolution1d.comp
convolution1d_pack1to4.comp
convolution1d_pack4.comp
convolution1d_pack4to1.comp (in progress)

Inference duration for this mel spectrogram: 5 seconds

[image: melgram_flipped]

[video: output.mp4]

@nihui
Member

nihui commented Oct 20, 2023

vulkan conv1d #5060

@magicse
Contributor Author

magicse commented Oct 21, 2023

Hi @nihui,
I tried the new conv1d shaders and layer (convolution1d_vulkan.cpp, convolution1d_vulkan.h) from https://github.com/Tencent/ncnn/pull/5060/files and they don't work correctly: I get no sound, the output is just noise.
Try my model ncnn-hifi-GAN with opt.use_vulkan_compute = true (I get noise) and opt.use_vulkan_compute = false (I get sound).
With my own fp32 shaders it worked correctly and I got sound:
Convolution1D_vulkan.cpp
Convolution1D_vulkan.h
convolution1d.comp
convolution1d_pack1to4.comp
convolution1d_pack4.comp

@nihui
Member

nihui commented Oct 23, 2023

Try disabling fp16.

The following test prints the same result on CPU and GPU:

int main()
{
    ncnn::Net net;

    net.opt.use_vulkan_compute = true;
    // net.opt.use_vulkan_compute = false;

    net.opt.use_fp16_packed = false;
    net.opt.use_fp16_storage = false;
    net.opt.use_fp16_arithmetic = false;

    net.load_param("/home/nihui/osd/ncnn-nihui/mytools/hifivoice.ncnn.param");
    net.load_model("/home/nihui/osd/ncnn-nihui/mytools/hifivoice.ncnn.bin");

    {
        ncnn::Extractor ex = net.create_extractor();
        ex.set_vulkan_compute(false);

        ncnn::Mat in0 = RandomMat(64, 80);

        ex.input("in0", in0);

        ncnn::Mat out0;
        ex.extract("out0", out0);

        fprintf(stderr, "out0 %d %d %d %d %d\n", out0.dims, out0.w, out0.h, out0.d, out0.c);

        fprintf(stderr, "out0 %f %f %f %f %f %f\n", out0[0], out0[1], out0[10], out0[20], out0[200], out0[1020]);
    }

    {
        ncnn::Extractor ex = net.create_extractor();
        ex.set_vulkan_compute(true);

        ncnn::Mat in0 = RandomMat(64, 80);

        ex.input("in0", in0);

        ncnn::Mat out0;
        ex.extract("out0", out0);

        fprintf(stderr, "out0 %d %d %d %d %d\n", out0.dims, out0.w, out0.h, out0.d, out0.c);

        fprintf(stderr, "out0 %f %f %f %f %f %f\n", out0[0], out0[1], out0[10], out0[20], out0[200], out0[1020]);
    }

    return 0;
}
[nihui@nihuini-LC2 mytools]$ ./testnet 
[0 AMD Radeon Graphics (RADV NAVI14)]  queueC=1[4]  queueG=0[1]  queueT=0[1]
[0 AMD Radeon Graphics (RADV NAVI14)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 AMD Radeon Graphics (RADV NAVI14)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1
[0 AMD Radeon Graphics (RADV NAVI14)]  subgroup=64  basic/vote/ballot/shuffle=1/1/1/1
[0 AMD Radeon Graphics (RADV NAVI14)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  queueC=0[1]  queueG=0[1]  queueT=0[1]
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  subgroup=8  basic/vote/ballot/shuffle=1/1/1/1
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0
out0 2 16384 1 1 1
out0 0.048641 0.074993 -0.038795 0.100711 -0.048371 0.117223
out0 2 16384 1 1 1
out0 0.048641 0.074993 -0.038795 0.100711 -0.048371 0.117224
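
One way to read the two `out0` lines above: the CPU and Vulkan samples agree to within about 1e-6, which is ordinary fp32 rounding. A minimal sketch of that comparison, using only the six values printed above. The helper names `max_abs_diff` and `outputs_match` and the 1e-5 tolerance are assumptions made for illustration; they are not part of ncnn.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an ncnn function): largest element-wise
// absolute difference between two equally sized sample vectors.
double max_abs_diff(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); i++)
    {
        const double d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}

// The six samples printed by the test above (CPU run first, then Vulkan run).
const std::vector<double> cpu_samples = {0.048641, 0.074993, -0.038795, 0.100711, -0.048371, 0.117223};
const std::vector<double> gpu_samples = {0.048641, 0.074993, -0.038795, 0.100711, -0.048371, 0.117224};

// Assumed tolerance for calling two fp32 runs "the same".
bool outputs_match(double tolerance = 1e-5)
{
    return max_abs_diff(cpu_samples, gpu_samples) <= tolerance;
}
```

Comparing with a tolerance rather than for exact equality matters here because the last sample differs in its final printed digit between the two runs.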

@magicse
Contributor Author

magicse commented Oct 29, 2023

Hi @nihui, thank you for your work. Now ncnn is open to new directions such as sound synthesis, voice conversion, music synthesis and TTS.
I checked your code and, of course, I get correct results.

Z:\AI_SDK\VAE-GAN\HIFIVoice_cpp>hifivoice.exe -i melgram_flipped.jpg

Input option value=melgram_flipped.jpg
path = melgram_flipped.jpgimagepath0: melgram_flipped.jpg
argv[0]: mel
argv[1]: melgram_flipped.jpg
[0 NVIDIA GeForce RTX 3060]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[0 NVIDIA GeForce RTX 3060]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 NVIDIA GeForce RTX 3060]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1
[0 NVIDIA GeForce RTX 3060]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
in0 2 64 80 1 1
out0 2 16384 1 1 1
out0 0.048641 0.074994 -0.038795 0.100711 -0.048369 0.117221
out0 2 16384 1 1 1
out0 0.048641 0.074994 -0.038795 0.100711 -0.048372 0.117223
Max mel magnitude val: 1.89804
Min mel magnitude val: -11
[677 x 80]; ch: 1
MelIn 3 677 80 1 1
Inference duration: 6 seconds
Out matrix size W x H = 173312 x 1 number of channels 1

Final
Z:\AI_SDK\VAE-GAN\HIFIVoice_cpp>

I also found what the problem was.
convolution1d with vulkan=true and convolution1d with vulkan=false handle an ncnn::Mat with the wrong number of dimensions differently.

For example, convolution1d expects a 2-dimensional input, and I passed an ncnn::Mat with 3 dimensions.

With vulkan=false, convolution1d treats the 3-dimensional ncnn::Mat correctly, as if it were 2-dimensional, but with vulkan=true it produces the wrong result.
My code was like this:

     ncnn::Mat MelIn(melscpectro.cols, melscpectro.rows, 1, (void*)melscpectro.data);
     fprintf(stderr, "MelIn %d %d %d %d %d\n", MelIn.dims, MelIn.w, MelIn.h, MelIn.d, MelIn.c);

and I was getting a wrong result with vulkan=true because dims=3:

MelIn 3 677 80 1 1

Now I have changed the code to:

     ncnn::Mat MelIn(melscpectro.cols, melscpectro.rows, (void*)melscpectro.data);
     fprintf(stderr, "MelIn %d %d %d %d %d\n", MelIn.dims, MelIn.w, MelIn.h, MelIn.d, MelIn.c);

Output:

MelIn 2 677 80 1 1

and I get the correct result with convolution1d and vulkan=true.
Thank you again @nihui for your work!
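
The difference between the two constructor calls above comes down to how many size arguments are passed: three sizes give dims=3 even when the last one is 1, while two sizes give dims=2. A minimal sketch of that rule, plus a guard that could run before feeding a 1D-convolution model. `Shape`, `make_shape` and `squeeze_to_2d` are hypothetical stand-ins written for this illustration; they are not ncnn types or functions.

```cpp
#include <cassert>

// Hypothetical stand-in for the shape fields printed in the log lines
// ("MelIn dims w h d c"); this is not ncnn::Mat.
struct Shape
{
    int dims;
    int w;
    int h;
    int d;
    int c;
};

// Two sizes: a 2-dimensional shape, like the fixed MelIn (2 677 80 1 1).
Shape make_shape(int w, int h)
{
    assert(w > 0 && h > 0);
    return Shape{2, w, h, 1, 1};
}

// Three sizes: a 3-dimensional shape even when c == 1,
// like the original MelIn (3 677 80 1 1).
Shape make_shape(int w, int h, int c)
{
    assert(w > 0 && h > 0 && c > 0);
    return Shape{3, w, h, 1, c};
}

// Guard: drop a trailing channel axis of size 1 so the shape becomes the
// 2-dimensional one that the 1D convolution input is expected to have.
// Shapes that cannot be squeezed are returned unchanged.
Shape squeeze_to_2d(const Shape& s)
{
    if (s.dims == 3 && s.c == 1)
        return make_shape(s.w, s.h);
    return s;
}
```

In real ncnn code the equivalent fix is the one already shown above: build the input with the two-size constructor so that dims is 2 from the start.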
