GLdc/GL/perfctr.h
Hayden Kowalchuk 3a4f09bef2 feat: implement proper chanegs from profiling
- math
- inlining
2020-02-18 11:48:37 -05:00

// ---- perfctr.h - SH7091 Performance Counter Module Header ----
//
// This file is part of the DreamHAL project, a hardware abstraction library
// primarily intended for use on the SH7091 found in hardware such as the SEGA
// Dreamcast game console.
//
// The performance counter module is hereby released into the public domain in
// the hope that it may prove useful. Now go profile some code and hit 60 fps! :)
//
// --Moopthehedgehog
//
#ifndef __PERFCTR_H__
#define __PERFCTR_H__
//
// -- General SH4 Performance Counter Notes --
//
// There are 2 performance counters that can measure elapsed time. They are each
// 48-bit counters. They are part of the so-called "ASE" subsystem, which you can
// read about in chapter 13 of the "SuperH™ (SH) 32-bit RISC series SH-4, ST40
// system architecture, volume 1: system":
// https://www.st.com/content/ccc/resource/technical/document/user_manual/36/75/05/ac/e8/7e/42/2d/CD00147163.pdf/files/CD00147163.pdf/jcr:content/translations/en.CD00147163.pdf
//
// They can count cycles, so that's 199.5MHz (not 200MHz!!) a.k.a. roughly 5 ns
// increments. At 5 ns increments, a 48-bit cycle counter can run continuously
// for 16.33 days. It's actually 16 days, 7 hours, 55 minutes, and 2 seconds,
// depending on how close the bus clock is to 99.75MHz. There is also a second
// mode that counts cycles according to a ratio between the CPU frequency and
// the system bus clock. At the stock 2:1 CPU-to-bus ratio, the counter advances
// by 12 every CPU cycle, which is 24 every bus cycle.
// This second mode is detailed in the description for PMCR_CLOCK_TYPE in this
// file, and it is recommended for use when the CPU frequency is not a runtime
// constant.
//
// Side note: The counters don't have an overflow interrupt or overflow bit.
// (I did actually run one to 48-bit overflow in elapsed time mode using the
// ratio method to check this. They don't appear to sign-extend the upper 16
// bits in elapsed time mode, either.)
//
// The two counters are functionally identical. I would recommend using the
// PMCR_Init() function to start one (or both) up the first time.
//
// -- Configuration Address Info --
//
// Addresses for these counters can be easily seen here, in lxdream's source code:
// https://github.com/lutris/lxdream/blob/master/src/sh4/sh4mmio.h
//
// They are also on display in the Linux kernel, but at the time of writing appear
// to be set incorrectly (the clock mode at bit 0x100 is never set or cleared,
// for example, so they're at the mercy of whatever the hardware defaults are):
// http://git.lpclinux.com/cgit/linux-2.6.28.2-lpc313x/plain/arch/sh/oprofile/op_model_sh7750.c
// https://github.com/torvalds/linux/blob/master/arch/sh/kernel/cpu/sh4/perf_event.c
// ...It also appears as though they may not be handling bus ratio mode correctly,
// which appears to be the default mode on the Dreamcast in all my tests.
//
// You can also find these addresses by ripping a copy of Virtua Fighter 3 that
// you own for Dreamcast and looking at the raw byte code (or a raw disassembly)
// of its main program binary. It would appear as though they were timing a loop
// with the low half of perf counter 1 in elapsed time mode. Definitely seems
// like a good thing to do when targeting 60fps! Shenmue Disc 4 also uses the
// same configuration, but what's being timed is not as clear.
//
// Another place you can actually find both control addresses 0xFF00008x and all
// data addresses 0xFF10000x is in binaries of ancient, freely available versions
// of CodeScape. Literally all you need to do is open an SH7750-related DLL in a
// hex editor and do a search to find the control register addresses, and the
// data addresses are equally plain to see in any relevant performance profiling
// firmware. There's no effort or decryption required to find them whatsoever;
// all you need is an old trial version and a hex editor.
//
// However, something even better than all of that is if you search for "SH4
// 0xFF000084" (without quotes) online you'll find an old forum where some logs
// were posted of the terminal/command prompt output from some STMicro JTAG tool,
// which not only lists the register addresses but also clearly characterizes
// the control registers as 16-bit:
// https://www.multimediaforum.de/threads/36260834-alice-hsn-3800tw-usb-jtag-ft4232h/page2
//
// -- Event Mode Info --
//
// Specific information on each counter mode can be found in the document titled
// "SuperH™ Family E10A-USB Emulator: Additional Document for Users Manual:
// Supplementary Information on Using the SH7750R Renesas Microcomputer Development Environment System"
// which is available on Renesas's website, in the "Documents" section of the
// E10A-USB product page:
// https://www.renesas.com/us/en/products/software-tools/tools/emulator/e10a-usb.html
// At the time of writing (12/2019), the E10A-USB adapter is still available
// for purchase, and it is priced around $1200 (USD).
//
// Appendix C of the "ST40 Micro Toolset Manual" also has these modes documented:
// https://www.st.com/content/ccc/resource/technical/document/user_manual/c5/98/11/89/50/68/41/66/CD17379953.pdf/files/CD17379953.pdf/jcr:content/translations/en.CD17379953.pdf
//
// See here for the hexadecimal values corresponding to each mode (pg. 370):
// http://www.macmadigan.com/BusaECU/Renesas%20documents/Hitachi_codescape_CS40_light_userguides.pdf
// You can also find the same "Counter Description Table" in user's guide PDFs
// bundled in ancient demo versions of CodeScape 3 from 2000 (e.g.
// CSDemo_272.exe), which can still be found in the Internet Archive.
// http://web.archive.org/web/*/http://codescape.com/dl/CSDemo/*
//
// See here for a support document on Lauterbach's SH2, SH3, and SH4 debugger,
// which contains units for each mode (e.g. which measure time and which just
// count): https://www.lauterbach.com/frames.html?home.html (It's in Downloads
// -> Trace32 Help System -> it's the file called "SH2, SH3 and SH4 Debugger"
// with the filename debugger_sh4.pdf).
//
//
// --- Performance Counter Registers ---
//
// These registers are 16 bits only and configure the performance counters
#define PMCR1_CTRL_REG 0xFF000084
#define PMCR2_CTRL_REG 0xFF000088
// These registers are 32 bits each and hold the high and low parts of each counter
#define PMCTR1H_REG 0xFF100004
#define PMCTR1L_REG 0xFF100008
#define PMCTR2H_REG 0xFF10000C
#define PMCTR2L_REG 0xFF100010
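//
// A minimal sketch of how the two 32-bit halves might be merged into a single
// 48-bit count. The helper name example_pmctr_combine is hypothetical and not
// part of this header. Masking the upper 16 bits of the high word is an
// assumption: their contents are not documented, so the sketch discards them.
// On hardware the inputs would come from volatile 32-bit reads of the
// PMCTRxH_REG and PMCTRxL_REG addresses above.

```c
#include <stdint.h>

// Hypothetical helper: merge the high and low register values into one
// 48-bit count held in a 64-bit integer. Only the low 16 bits of the high
// word are kept (assumption: the upper 16 bits carry no counter data).
static inline uint64_t example_pmctr_combine(uint32_t high, uint32_t low)
{
    return ((uint64_t)(high & 0xFFFFu) << 32) | (uint64_t)low;
}
```

// Reading the two halves is not atomic, so a caller that samples a running
// counter may want to stop it first or re-read the high word to detect a carry.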
//
// --- Performance Counter Configuration Flags ---
//
// These bits' functions are currently unknown, but they may simply be reserved.
// It's possible that there's a [maybe expired?] patent that details the
// configuration registers, though I haven't been able to find one. Places to
// check would be Google Patents and the Japanese Patent Office--maybe someone
// else can find something?
//
// Some notes:
// Writing 1 to all of these bits reads back as 0, so it looks like they aren't
// config bits. It's possible they are write-only like the stop bit, though,
// or that they're just reserved-write-0-only. It appears that they are always
// written with zeros in software that uses them, so that's confirmed safe to do.
//
// Also, after running counter 1 to overflow, it appears there's no overflow bit
// (maybe the designers thought 48-bits would be so much to count to that they
// didn't bother implementing one?). The upper 16-bits of the counter high
// register are also not sign-extension bits. They may be a hidden config area,
// but probably not because big endian mode would swap the byte order.
#define PMCR_UNKNOWN_BIT_0040 0x0040
#define PMCR_UNKNOWN_BIT_0080 0x0080
#define PMCR_UNKNOWN_BIT_0200 0x0200
#define PMCR_UNKNOWN_BIT_0400 0x0400
#define PMCR_UNKNOWN_BIT_0800 0x0800
#define PMCR_UNKNOWN_BIT_1000 0x1000
// PMCR_MODE_CLEAR_INVERTED is the mask of the 6-bit event mode field. AND a
// control value with ~PMCR_MODE_CLEAR_INVERTED to clear the current event mode.
// The event modes themselves are listed below.
#define PMCR_MODE_CLEAR_INVERTED 0x003f
// PMCR_CLOCK_TYPE sets the counters to count clock cycles or CPU/bus ratio mode
// cycles (where T = C x B / 24 and T is time, C is count, and B is time
// of one bus cycle). Note: B = 1/99753008 or so, but it may vary, as mine is
// actually 1/99749010-ish; the target frequency is probably meant to be 99.75MHz.
//
// See the ST40 or Renesas SH7750R documents described in the above "Event Mode
// Info" section for more details about that formula.
//
// Set PMCR_CLOCK_TYPE to 0 for CPU cycle counting, where 1 count = 1 cycle, or
// set it to 1 to use the above formula. Renesas documentation recommends using
// the ratio version (set the bit to 1) when user programs alter CPU clock
// frequencies. This header has some definitions later on to help with this.
#define PMCR_CLOCK_TYPE 0x0100
#define PMCR_CLOCK_TYPE_SHIFT 8
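//
// The formula T = C x B / 24 can be sketched as a conversion from a ratio-mode
// count to nanoseconds. The function name example_ratio_count_to_ns and its
// parameters are hypothetical. The bus frequency is passed in because, as
// noted above, the real value varies slightly from console to console.

```c
#include <stdint.h>

// Hypothetical helper: convert a CPU/bus ratio mode count to nanoseconds
// using T = C x B / 24, where B = 1 / bus_hz. The result is truncated.
// The count is split into whole seconds plus a remainder so that the
// multiplication by 1e9 cannot overflow 64 bits for a 48-bit count.
// bus_hz must be nonzero.
static inline uint64_t example_ratio_count_to_ns(uint64_t count, uint32_t bus_hz)
{
    const uint64_t counts_per_second = (uint64_t)bus_hz * 24u;
    const uint64_t whole_seconds = count / counts_per_second;
    const uint64_t remainder = count % counts_per_second;

    return whole_seconds * 1000000000ull
         + (remainder * 1000000000ull) / counts_per_second;
}
```

// With a 99.75MHz bus, 24 counts correspond to one bus cycle (about 10 ns) and
// 2394000000 counts correspond to one second.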
// PMCR_STOP_COUNTER is write-only, as it always reads back as 0. It does what
// the name suggests: when this bit is written to, the counter stops. However,
// if written to while the counter is disabled or stopped, the counter's high
// and low registers are reset to 0.
//
// Using PMCR_STOP_COUNTER to stop the counter has the effect of holding the
// data in the data registers while stopped, unlike PMCR_DISABLE_COUNTER, and
// this bit needs to be written to again (e.g. on next start) in order to
// actually clear the counter data for another run. If not explicitly cleared,
// the counter will continue from where it left off before being stopped.
#define PMCR_STOP_COUNTER 0x2000
#define PMCR_RESET_COUNTER_SHIFT 13
// Bits 0xC000 both need to be set to 1 for the counters to actually begin
// counting. I have seen that the Linux kernel actually separates them out into
// two separate labelled bits (PMEN and PMST) for some reason, however they do
// not appear to do anything separately. Perhaps this is a two-bit mode where
// 1-1 is run, 1-0 and 0-1 are ???, and 0-0 is off.
#define PMCR_RUN_COUNTER 0xC000
#define PMCR_RUN_SHIFT 14
// Interestingly, the output here writes 0x6000 to the counter config registers,
// which would be the "PMST" bit and the "RESET" bit:
// https://www.multimediaforum.de/threads/36260834-alice-hsn-3800tw-usb-jtag-ft4232h/page2
// To disable a counter, just write 0 to its config register. This will not
// reset the counter to 0, as that requires an explicit clear via setting the
// PMCR_STOP_COUNTER bit. What's odd is that a disabled counter's data
// registers read back as all 0, but re-enabling it without a clear will
// continue from the last value before disabling.
#define PMCR_DISABLE_COUNTER 0x0000
// These definitions merely separate out the two PMCR_RUN_COUNTER bits, and
// they are included here for documentation purposes.
// PMST may mean PMCR START. It's consistently used to enable the counter.
// I'm just calling it PMST here for lack of a better name, since this is what
// the Linux kernel and lxdream call it. It could also have something to do with
// a mode specific to STMicroelectronics.
#define PMCR_PMST_BIT 0x4000
#define PMCR_PMST_SHIFT 14
// Likewise PMEN may mean PMCR ENABLE
#define PMCR_PMEN_BIT 0x8000
#define PMCR_PMEN_SHIFT 15
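//
// As an illustration of how the fields above fit together, here is a sketch
// that builds a 16-bit control value for a running counter. The function name
// example_pmcr_run_word is hypothetical, and this is not necessarily how the
// functions in perfctr.c compose the value. The literals mirror
// PMCR_MODE_CLEAR_INVERTED (0x003f), PMCR_CLOCK_TYPE_SHIFT (8), and
// PMCR_RUN_COUNTER (0xC000).

```c
#include <stdint.h>

// Hypothetical helper: compose a control word from an event mode and a
// clock type (0 = CPU cycles, 1 = CPU/bus ratio), with both run bits set.
static inline uint16_t example_pmcr_run_word(uint16_t mode, unsigned count_type)
{
    uint16_t word = (uint16_t)(mode & 0x003Fu);       // event mode, low 6 bits
    word |= (uint16_t)((count_type & 1u) << 8);       // clock type bit, 0x0100
    word |= (uint16_t)0xC000u;                        // PMEN and PMST together
    return word;
}
```

// For example, elapsed time mode (0x23) in ratio mode yields 0xC123. The
// unknown bits and the stop bit are left at 0, which the notes above describe
// as safe.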
//
// --- Performance Counter Event Code Definitions ---
//
// Interestingly enough, it so happens that the SEGA Dreamcast's CPU seems to
// contain the same performance counter functionality as SH4 debug adapters for
// the SH7750R. Awesome!
//
// MODE DEFINITION VALUE MEASUREMENT TYPE & NOTES
#define PMCR_INIT_NO_MODE 0x00 // None; Just here to be complete
#define PMCR_OPERAND_READ_ACCESS_MODE 0x01 // Quantity; With cache
#define PMCR_OPERAND_WRITE_ACCESS_MODE 0x02 // Quantity; With cache
#define PMCR_UTLB_MISS_MODE 0x03 // Quantity
#define PMCR_OPERAND_CACHE_READ_MISS_MODE 0x04 // Quantity
#define PMCR_OPERAND_CACHE_WRITE_MISS_MODE 0x05 // Quantity
#define PMCR_INSTRUCTION_FETCH_MODE 0x06 // Quantity; With cache
#define PMCR_INSTRUCTION_TLB_MISS_MODE 0x07 // Quantity
#define PMCR_INSTRUCTION_CACHE_MISS_MODE 0x08 // Quantity
#define PMCR_ALL_OPERAND_ACCESS_MODE 0x09 // Quantity
#define PMCR_ALL_INSTRUCTION_FETCH_MODE 0x0a // Quantity
#define PMCR_ON_CHIP_RAM_OPERAND_ACCESS_MODE 0x0b // Quantity
// No 0x0c
#define PMCR_ON_CHIP_IO_ACCESS_MODE 0x0d // Quantity
#define PMCR_OPERAND_ACCESS_MODE 0x0e // Quantity; With cache, counts both reads and writes
#define PMCR_OPERAND_CACHE_MISS_MODE 0x0f // Quantity
#define PMCR_BRANCH_ISSUED_MODE 0x10 // Quantity; Not the same as branch taken!
#define PMCR_BRANCH_TAKEN_MODE 0x11 // Quantity
#define PMCR_SUBROUTINE_ISSUED_MODE 0x12 // Quantity; Issued a BSR, BSRF, JSR, JSR/N
#define PMCR_INSTRUCTION_ISSUED_MODE 0x13 // Quantity
#define PMCR_PARALLEL_INSTRUCTION_ISSUED_MODE 0x14 // Quantity
#define PMCR_FPU_INSTRUCTION_ISSUED_MODE 0x15 // Quantity
#define PMCR_INTERRUPT_COUNTER_MODE 0x16 // Quantity
#define PMCR_NMI_COUNTER_MODE 0x17 // Quantity
#define PMCR_TRAPA_INSTRUCTION_COUNTER_MODE 0x18 // Quantity
#define PMCR_UBC_A_MATCH_MODE 0x19 // Quantity
#define PMCR_UBC_B_MATCH_MODE 0x1a // Quantity
// No 0x1b-0x20
#define PMCR_INSTRUCTION_CACHE_FILL_MODE 0x21 // Cycles
#define PMCR_OPERAND_CACHE_FILL_MODE 0x22 // Cycles
#define PMCR_ELAPSED_TIME_MODE 0x23 // Cycles; For 200MHz CPU: 5ns per count in 1 cycle = 1 count mode, or around 417.715ps per count (increments by 12) in CPU/bus ratio mode
#define PMCR_PIPELINE_FREEZE_BY_ICACHE_MISS_MODE 0x24 // Cycles
#define PMCR_PIPELINE_FREEZE_BY_DCACHE_MISS_MODE 0x25 // Cycles
// No 0x26
#define PMCR_PIPELINE_FREEZE_BY_BRANCH_MODE 0x27 // Cycles
#define PMCR_PIPELINE_FREEZE_BY_CPU_REGISTER_MODE 0x28 // Cycles
#define PMCR_PIPELINE_FREEZE_BY_FPU_MODE 0x29 // Cycles
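//
// The table above has gaps (0x0c, 0x1b-0x20, 0x26) and two kinds of units.
// A small sketch of how a caller might validate a mode code and tell the
// cycle-measuring modes from the event-counting ones follows. Both function
// names are hypothetical, and the classification is taken only from the
// comments in the table, not from hardware documentation.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: true if the code appears in the table above.
static inline bool example_pmcr_mode_is_listed(uint8_t mode)
{
    if (mode > 0x29) return false;                    // past the last entry
    if (mode == 0x0c || mode == 0x26) return false;   // single-code gaps
    if (mode >= 0x1b && mode <= 0x20) return false;   // gap 0x1b-0x20
    return true;
}

// Hypothetical helper: true if the table marks the mode as "Cycles"
// (the listed codes from 0x21 upward), false for "Quantity" or "None".
static inline bool example_pmcr_mode_counts_cycles(uint8_t mode)
{
    return example_pmcr_mode_is_listed(mode) && mode >= 0x21;
}
```

// A check like this could run before a mode is written to a control register,
// so that an unlisted code is rejected instead of silently programmed.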
//
// --- Performance Counter Support Definitions ---
//
// This definition can be passed as the init/enable/restart functions'
// count_type parameter to use the 1 cycle = 1 count mode. This is how the
// counter can be made to run for 16.3 days.
#define PMCR_COUNT_CPU_CYCLES 0
// Likewise this uses the CPU/bus ratio method
#define PMCR_COUNT_RATIO_CYCLES 1
// These definitions are for the enable function and specify whether to reset
// a counter to 0 or to continue from where it left off
#define PMCR_CONTINUE_COUNTER 0
#define PMCR_RESET_COUNTER 1
//
// --- Performance Counter Miscellaneous Definitions ---
//
// For convenience; assume stock bus clock of 99.75MHz
// (Bus clock is the external CPU clock, not the peripheral bus clock)
//
#define PMCR_SH4_CPU_FREQUENCY 199500000
#define PMCR_CPU_CYCLES_MAX_SECONDS 1410902
#define PMCR_SH4_BUS_FREQUENCY 99750000
#define PMCR_SH4_BUS_FREQUENCY_SCALED 2394000000 // 99.75MHz x 24
#define PMCR_BUS_RATIO_MAX_SECONDS 117575
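//
// The two *_MAX_SECONDS values can be reproduced from the 48-bit counter width.
// A minimal sketch, assuming the stock frequencies above. The function name
// example_counter_max_seconds is hypothetical.

```c
#include <stdint.h>

// Hypothetical helper: whole seconds a 48-bit counter can run before it
// wraps, given how many counts it accumulates per second (must be nonzero).
static inline uint64_t example_counter_max_seconds(uint64_t counts_per_second)
{
    return ((uint64_t)1 << 48) / counts_per_second;
}
```

// At 199500000 counts per second (CPU cycle mode) this gives 1410902 seconds,
// or about 16.33 days. At 2394000000 counts per second (ratio mode) it gives
// 117575 seconds, or about 1.36 days.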
//
// --- Performance Counter Functions ---
//
// See perfctr.c file for more details about each function and some more usage notes.
//
// Note: PMCR_Init() and PMCR_Enable() will do nothing if the perf counter is already running!
//
// Clear counter and enable
void PMCR_Init(int which, unsigned short mode, unsigned char count_type);
// Enable one or both of these "undocumented" performance counters.
void PMCR_Enable(int which, unsigned short mode, unsigned char count_type, unsigned char reset_counter);
// Disable, clear, and re-enable with new mode (or same mode)
void PMCR_Restart(int which, unsigned short mode, unsigned char count_type);
// Read a counter
// out_array is specifically uint32 out_array[2] -- 48-bit value needs a 64-bit storage unit
void PMCR_Read(int which, volatile unsigned int *out_array);
// Stop counter(s) (without clearing)
void PMCR_Stop(int which);
// Disable counter(s) (without clearing)
void PMCR_Disable(int which);
#endif /* __PERFCTR_H__ */