__tile_dpfp16ps
Classification
AMX, Application-Targeted, CPUID Test: AMX-FP16
Header File
immintrin.h
Instruction
TDPFP16PS tmm, tmm, tmm
Synopsis
void __tile_dpfp16ps(__tile1024i* dst, __tile1024i src0, __tile1024i src1);
Description
Compute the dot product of FP16 (16-bit) floating-point pairs in tiles "src0" and "src1", accumulate the intermediate single-precision (32-bit) floating-point products into the corresponding elements of "dst", and store the 32-bit results back to tile "dst". The shape of each tile is specified in its __tile1024i struct. The tile register is allocated by the compiler.
Operation
FOR m := 0 TO dst.rows - 1
	tmp := dst.row[m]
	FOR k := 0 TO (src0.colsb / 4) - 1
		FOR n := 0 TO (dst.colsb / 4) - 1
			tmp.fp32[n] += FP32(src0.row[m].fp16[2*k+0]) * FP32(src1.row[k].fp16[2*n+0])
			tmp.fp32[n] += FP32(src0.row[m].fp16[2*k+1]) * FP32(src1.row[k].fp16[2*n+1])
		ENDFOR
	ENDFOR
	write_row_and_zero(dst, m, tmp, dst.colsb)
ENDFOR
zero_upper_rows(dst, dst.rows)
zero_tileconfig_start()
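
Example
The Operation pseudocode above can be sketched as a scalar reference model in plain C, which may be useful for checking results on machines without AMX-FP16. The names acc_tile, fp16_tile, fp16_to_fp32 and ref_dpfp16ps are hypothetical and are not part of immintrin.h. The sketch assumes tiles of at most 16 rows by 64 bytes and IEEE 754 binary16 inputs; it does not attempt to reproduce the hardware's rounding or denormal handling.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tile models (not the real __tile1024i layout). */
typedef struct { int rows, colsb; float    fp32[16][16]; } acc_tile;
typedef struct { int rows, colsb; uint16_t fp16[16][32]; } fp16_tile;

/* Hypothetical helper: convert a binary16 bit pattern to float. */
static float fp16_to_fp32(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0 && man == 0) {
        bits = sign;                                  /* signed zero */
    } else if (exp == 0) {
        int shift = 0;                                /* subnormal: normalize */
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        bits = sign | ((uint32_t)(113 - shift) << 23) | ((man & 0x3FFu) << 13);
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scalar model of the Operation section: dst += src0 * src1 on FP16 pairs. */
static void ref_dpfp16ps(acc_tile *dst, const fp16_tile *src0,
                         const fp16_tile *src1)
{
    assert(dst->rows <= 16 && dst->colsb <= 64 && src0->colsb <= 64);
    int ncols = dst->colsb / 4;

    for (int m = 0; m < dst->rows; m++) {
        float tmp[16] = {0};
        memcpy(tmp, dst->fp32[m], (size_t)ncols * sizeof(float));
        for (int k = 0; k < src0->colsb / 4; k++) {
            for (int n = 0; n < ncols; n++) {
                tmp[n] += fp16_to_fp32(src0->fp16[m][2*k+0])
                        * fp16_to_fp32(src1->fp16[k][2*n+0]);
                tmp[n] += fp16_to_fp32(src0->fp16[m][2*k+1])
                        * fp16_to_fp32(src1->fp16[k][2*n+1]);
            }
        }
        /* write_row_and_zero: elements past dst.colsb are already zero in tmp */
        memcpy(dst->fp32[m], tmp, sizeof tmp);
    }
    /* zero_upper_rows: clear every row at or beyond dst.rows */
    for (int m = dst->rows; m < 16; m++)
        memset(dst->fp32[m], 0, sizeof dst->fp32[m]);
}
```

Note that each row of "src1" holds FP16 pairs interleaved along the column index, so "src1" is expected in the pair-packed layout rather than as a plain row-major matrix. The summation order in this model follows the pseudocode and may differ from the order used by hardware.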