#+TITLE: =ppt= Production Performance Telemetry 
#+AUTHOR: Lally Singh

AKA the Portable Performance Tool.  But really, the performance tool of last resort.

/Too cool to find on Google!  We live in the shadows of powerpoint!/

* Purpose
=ppt= is a performance tracing tool.  You define the types of individual frames
you want to record, and =ppt= generates code that you call from, and link into,
your executable.  Then =ppt= lets you attach to the running executable to
collect these traces.

This is quite useful when you want to know why your code performs a certain way
in real-world scenarios.

A frame is essentially a C =struct= that you fill out and save to a circular
buffer in shared memory, called a =buffer=.  You can add any members you like
to a frame to record relevant details about the event.  =ppt= also has built-in
support for using timers and CPU performance counters.

=ppt= prioritizes low overhead above all else.

When =ppt= isn't attached to a running process, there is no buffer and saving a
frame is a no-op.

* How do I build it?
  You need =stack=, the Haskell build tool.  [[https://docs.haskellstack.org/en/stable/README/][Download instructions are here]].  Only tested on Linux.  I'm quite sure it won't work on Mac OS X without serious changes (e.g., the ELF dependency).
  
** With Nix
  With =nix=, it's just =stack install=.  The build output will indicate the directory (=~/.local/bin=) where the binaries are installed.  
  
** Without Nix
  =nix= would install a few dependencies for you, but not many are needed.  Without it, you just need to install two libraries:
  
  | Library | Ubuntu Packages      |
  |---------+----------------------|
  | zlib    | zlib1g, zlib1g-dev   |
  | libpfm4 | libpfm4, libpfm4-dev |
  
  Once they're installed, just run =stack install --no-nix=.  The build output will indicate the directory (=~/.local/bin=) where the binaries are installed.  You can ignore the =elf= binary; only =ppt= matters.
  
  
* Why another performance tool?
** Why not dtrace or gprof?
   Because they solve different problems.  In a dreaded car analogy, =gprof= is
   a dynamometer check, =dtrace= is a PMU logger, and =ppt= is custom
   telemetry.  All useful, all separate.

   =dtrace= is an excellent tool.  It instruments unaltered binaries using
   several built-in "providers", and you can alter the program to give =dtrace=
   even more data.  But it has two drawbacks:
   - Attachment time: the more instrumentation you want on the execution, the
     longer it takes to attach to the process.  This can leave the process
     unresponsive for longer than you may like.
   - Statistics and histograms only: no access to raw data.  Instead, you can
     get histograms of collected data (or simple statistics).  But the
     underlying raw data is captured only (AFAIK) with a relatively expensive
     print statement.

   =gprof= is a good tool.  It provides a simple way to get a rough idea of
   where your CPU time is taken.  But it sacrifices a few things to get there:
   - It chooses where the overhead goes -- =gprof= instruments at the function
     level, which biases it towards overreporting the cost of small functions
     (because, as a percentage of their runtime, =gprof=-induced overhead is
     higher).
   - You don't get as much detail.  =gprof= gives you basic statistics on a
     program execution, but much detail can be hidden behind those statistics.
     For example, the program performance could be dominated by some
     infrequent - but very expensive - situation.

** Why =ppt=? Get everything you want, how you want it.
   In contrast, =ppt= gives you the whole distribution of your data (instead of
   summary statistics), and you choose what and when to instrument in the
   program's execution.

   The transfer protocol for sending a frame from the process under observation
   to =ppt= for recording is a pair of memory barriers, a copy of the data, and
   some additional writes to the sequence number.  There's no flow-control or
   other synchronization between =ppt= and the instrumented app.  =ppt= can't
   slow down your app, even if you ^Z =ppt=.
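
   For intuition, here is a rough, seqlock-style sketch of that handoff.  This
   is only an illustration of the pattern described above, not =ppt='s actual
   generated code: the slot layout and names are invented, and a production
   implementation has to be more careful about the C++ memory model than a
   README example can be.
#+begin_src cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// One slot of a hypothetical circular buffer living in shared memory.
struct Slot {
    std::atomic<uint64_t> seq{0};   // even = consistent, odd = write in progress
    unsigned char payload[64];      // the frame bytes
};

// Writer side: a pair of barriers around the copy, plus writes to the
// sequence word before and after.  Nothing here ever waits on the reader,
// so a slow (or ^Z'd) ppt cannot slow the instrumented process down.
inline void publish(Slot& s, const void* frame, size_t len, uint64_t frame_seqno) {
    s.seq.store(2 * frame_seqno + 1, std::memory_order_relaxed);  // mark "in progress"
    std::atomic_thread_fence(std::memory_order_release);          // barrier #1
    std::memcpy(s.payload, frame, len);                           // copy of the data
    std::atomic_thread_fence(std::memory_order_release);          // barrier #2
    s.seq.store(2 * frame_seqno + 2, std::memory_order_relaxed);  // mark "consistent"
}

// Reader side (roughly what ppt does): read seq, copy the payload, read seq
// again.  If the two reads differ, or the value is odd, the slot was being
// overwritten and the copy is thrown away.
#+end_src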

** Use =ppt= for regression analysis.
   The best use-case for =ppt= is for regression analysis of state or inputs to
   performance values.  Embed input values, loop counts, and whatever else you
   want (add more than you think you'll need, the instrumentation's cheap) and
   measure what matters.  The resulting records will directly contain cause and
   effect.

* Can it lose data?
  Yes, by design: this is the flow-control policy.  =ppt= /does not/ let the writer (the
  process being instrumented) wait for the reader (=ppt=) to catch up.
  Instead, some data will be overwritten in the shared buffer before it's read
  by =ppt= and saved to disk.  

  Each frame in a =ppt= buffer has a sequence number that increases
  monotonically.  If =ppt= doesn't capture some frames in time, the captured frames
  will have jumps in their sequence numbers.
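
  If you want to check a capture for loss programmatically, a gap scan over the
  captured sequence numbers is enough.  A minimal sketch (how you extract the
  sequence numbers from the decoded output is up to you; the vector below is
  assumed to hold them in capture order):
#+begin_src cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Report every gap in the captured sequence numbers, i.e. every run of frames
// that was overwritten in the shared buffer before ppt could save it.
void report_gaps(const std::vector<uint64_t>& seqnos) {
    for (size_t i = 1; i < seqnos.size(); ++i) {
        uint64_t expected = seqnos[i - 1] + 1;
        if (seqnos[i] != expected) {
            std::printf("lost %llu frame(s) between seqno %llu and %llu\n",
                        (unsigned long long)(seqnos[i] - expected),
                        (unsigned long long)seqnos[i - 1],
                        (unsigned long long)seqnos[i]);
        }
    }
}
#+end_src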

* How do I use it?
  Roughly:
  1. Describe the frames you want to collect in a .spec file.
  2. Generate source for those frames.
  3. Add instrumentation to your program to fill in frames and save them to the
     buffer.
  4. Attach to the running program to collect your data.
  5. Convert the collected data to CSV for analysis.

  It's a lot of steps, but it has a few advantages:
  - During your program's run, =ppt= isn't doing anything more than saving a
    shared memory buffer to disk periodically.
  - You get lots of control in your instrumentation.
  - You get raw data at the end, for deeper analysis.

** Describing the Frames
   #+begin_src filename:minimal.spec
emit C++;

buffer Minimal 512;

option time timespec realtime;

frame first {
   int a, b, c;
   interval time duration;
   interval counter events;
}

frame second {
   interval counter foos;
}
   #+end_src
   See the [[doc/buffer_syntax.md][Buffer Syntax Reference]] for details.

*** Performance Counters
    Notice above that you can use the =counter= type.  Its recorded value is a
    =uint64_t=, but the actual counter used is unspecified.  When attaching,
    use the =-c= flag to specify which actual counters to use.

    The counter names are as-specified by =libpfm4=.  You can use the
    =showevtinfo= command from that distribution to list the (many) counters
    available on your machine.  Some highlights:

    | Counter               | Description                                                                                                                                                                    |
    |-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    | INSTRUCTION_RETIRED   | count the number of instructions at retirement. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction  |
    | LLC_MISSES            | count each cache miss condition for references to the last level cache. The event count may include speculation, but excludes cache line fills due to hardware prefetch        |
    | DTLB_LOAD_MISSES      | DTLB Load misses.  See all modifiers in =showevtinfo= for details.                                                                                                             |
    | ITLB_MISSES           | Instruction TLB misses.                                                                                                                                                        |
    | L1-DCACHE-LOAD-MISSES | L1 data cache load misses.                                                                                                                                                     |
    | L1-ICACHE-LOAD-MISSES | L1 instruction cache misses                                                                                                                                                    |

    Many, many more are available, but =showevtinfo= does a much better job than
    we can of explaining what counters you can use.

    =ppt= allocates enough space for 3 counters' worth of data in each
    =counter= member of a frame.  Twice that for =interval counter=.  You
    specify which counters you want on the command line in =ppt attach=.  You
    can attach to the same process several times (sequentially) with different
    counters to get more than 3.

    The current implementation uses a system call to read the counters in the
    generated code, so be aware of the performance impact of each measurement.
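
    For a sense of where that cost comes from, here is the generic Linux
    pattern for reading a hardware counter (a plain =perf_event_open(2)=
    sketch, not =ppt='s generated code): every snapshot ends up as a =read(2)=
    on a perf file descriptor.
#+begin_src cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Open one hardware counter for the calling thread.  There is no glibc
// wrapper for perf_event_open, so it goes through syscall(2) directly.
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;            // e.g. PERF_COUNT_HW_INSTRUCTIONS
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr,
                        0 /* this thread */, -1 /* any cpu */,
                        -1 /* no group */, 0 /* flags */);
}

int main() {
    int fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    uint64_t value = 0;
    // Each counter snapshot is a read(2) on the fd -- a full system call,
    // which is why frequent snapshot_*() calls aren't free.
    read(fd, &value, sizeof(value));
    std::printf("instructions retired so far: %llu\n", (unsigned long long)value);
    close(fd);
    return 0;
}
#+end_src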

** Generating Source for Frames
#+begin_src sh
$ ppt generate ./minimal.spec
#+end_src
   This generates source with the following public API:
#+begin_src cpp
namespace ppt { namespace Minimal {
class first {
public:
    struct timespec duration_start;
    struct timespec duration_end;
    uint64_t events_0_start= 0;
    uint64_t events_0_end= 0;
    uint64_t events_1_start= 0;
    uint64_t events_1_end= 0;
    uint64_t events_2_start= 0;
    uint64_t events_2_end= 0;
    int a= 0;
    int b= 0;
    int c= 0;

    void save();
    void snapshot_duration_start();
    void snapshot_duration_end();
    void snapshot_events_start();
    void snapshot_events_end();
};
class second {
public:
    uint64_t foos_0_start= 0;
    uint64_t foos_0_end= 0;
    uint64_t foos_1_start= 0;
    uint64_t foos_1_end= 0;
    uint64_t foos_2_start= 0;
    uint64_t foos_2_end= 0;

    void save();
    void snapshot_foos_start();
    void snapshot_foos_end();
};
}} // namespace ppt::Minimal
#+end_src

   There are additional members saved in each =class= that =ppt= uses
   internally.  Check out the generated code for details.

   Generally:
   - For simple scalar frame members, you should see a matching =class= member
     of the same name and type.  Directly assign to it.
   - For =time= members, a method is provided to save the current time, called
     =snapshot_MEMBER()=.
   - The same goes for =counter= members, except that they save to 3 members at
     a time.
   - For =interval= types, =ppt= adds =_start= and =_end= suffixes to
     distinguish the members for the start and end of the interval.
   - When your code is done filling in the frame, call =save()= to make it
     available for an attached =ppt= process to capture.

** Instrumenting your Program
#+begin_src makefile
minimal-client: ppt-Minimal.hh ppt-Minimal.cc minimal-client.cc
	g++ -o minimal-client minimal-client.cc ppt-Minimal.cc
#+end_src

#+begin_src cpp
  #include "ppt-Minimal.hh"

  int main() {
     int acount = 0;
     while (1) {
         ppt::Minimal::first record;
         // collect timestamp of when this starts.
         record.snapshot_duration_start();
         // snapshot performance counters
         record.snapshot_events_start();
         // Do anything you want here.  For example, saving relevant parts of
         // input, loop counts, etc.
         record.a = 0xaaaa0000 + acount++;
         record.b = acount - record.a;
         record.c = 0xcccccccc;
         // snapshot performance counters.
         record.snapshot_events_end();
         // snapshot timestamp.
         record.snapshot_duration_end();
         // save to buffer.
         record.save();
     }
     return 0;
  }
#+end_src

   Roughly: make an instance of the frame type you want, fill it in, and then
   =save()= it.  You can reuse the instance across =save()= calls.  The members
   are directly accessible.

   For example, if you're processing a batch of events, you may not want to pay
   the overhead of repeated =snapshot_duration_start()= and
   =snapshot_duration_end()= calls.  You can call just one of them and copy the
   member over to the other:
#+begin_src cpp
   ppt::Minimal::first record;
   bool first = true;
   while (1) {
      if (first) {
         record.snapshot_duration_start();
         first = false;
      } else {
         record.duration_start = record.duration_end;
      }
      // .. same innards as above.
      record.snapshot_duration_end();
      record.save();
   }
#+end_src

   You can do the same for the counters, but you may end up with a frame of
   distorted data if you detach and reattach (with different counters).
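
   For instance, with the generated =first= class above, reusing the previous
   end-of-interval counter values as the next start would look like this sketch:
#+begin_src cpp
   ppt::Minimal::first record;
   bool first_pass = true;
   while (1) {
      if (first_pass) {
         record.snapshot_events_start();
         first_pass = false;
      } else {
         // Reuse the previous end-of-interval counter values as this
         // iteration's start, saving one snapshot per frame.  Note the
         // detach/reattach caveat above.
         record.events_0_start = record.events_0_end;
         record.events_1_start = record.events_1_end;
         record.events_2_start = record.events_2_end;
      }
      // .. same innards as above.
      record.snapshot_events_end();
      record.save();
   }
#+end_src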

** Attaching to your Program
#+begin_src sh
$ ./minimal-client &
$ ppt attach -p $(pgrep minimal-client) -o output.bin
#+end_src
** Converting Data for Analysis
#+begin_src sh
$ ppt convert output.bin 
$ ls output.bin_output
minimal.csv
#+end_src

* How do I use it effectively?
  =ppt= is handy in two phases of your program's lifecycle:
  1. During development, it provides good feedback on the implementation's
     performance characteristics.  This is very useful for:
     - Determining performance trade-offs.
     - Improving the performance of the system.
  2. During operations, it provides a good way to monitor the application.
     - Significantly faster to log than text.
     - Easier to analyze/plot.
     Note that more operational support is planned, once I get around to
     learning (n)curses.

** When optimizing my program
   First, you'll clearly have to figure out what you want to optimize.  The
   latency in response to an event?  The time through the main loop?  Time to
   complete N items of work?
   - Sort that out and come back.  We'll wait.
   - Got it?  Good.  Here we go:

*** What to instrument
    First figure out what you're measuring:
    - The /performance metric/: the number you want to make better.  This can
      be, for example: frame rate (go higher!), latency (go lower), or
      throughput (higher again!).  You don't have to make it time-based.  If
      you want to measure how many I/Os you do instead, you can do that. 
    - The /unit/ of work.  What does a single measurement look like?  For frame
      rates, this would be the time for a single frame.  For latency, a single
      time interval.  For throughput, the time for a single item (or, if
      batching, two numbers: the number of items in the batch and the time
      taken for the batch).
 
    Now, here's what you want to instrument:
    - The /load/ on your program for this unit.  The event you processed, some
      characteristic of how much data you processed in that iteration of your
      main loop, or the type/parameters of the item processed.
    - The /effort/ expended on this unit.  This is where you do most of your
      instrumentation.  Things to record:
      - How many iterations of each loop you run
      - Which major branches (if conditions) you take
      - Key performance counters
        - Mostly we're talking about cache misses
      - Synchronization overhead
        - How long you spent waiting for a mutex, for example.

   Generally, you can start off with a rough breakdown of where you're spending
   your effort, and drill down with more instrumentation once you see where the
   effort's really going.
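
   Concretely, a per-unit frame that captures both load and effort might be
   filled in like this.  This is only a sketch: the =work_unit= frame, its
   members, and the surrounding types are hypothetical, standing in for
   whatever you define in your own .spec file.
#+begin_src cpp
#include <vector>
#include <cstdint>
#include "ppt-MyBuffer.hh"   // hypothetical generated header for a buffer "MyBuffer"

struct Request { std::vector<uint8_t> bytes; bool needs_fallback; };

void handle(const Request& req) {
    // Hypothetical frame with a load member (bytes_in), effort members
    // (loop_iters, slow_path_taken), and an interval timer (duration).
    ppt::MyBuffer::work_unit rec;
    rec.snapshot_duration_start();

    rec.bytes_in = (int)req.bytes.size();      // load: how much data this unit carries

    int iters = 0;
    for (uint8_t b : req.bytes) {              // effort: count loop iterations
        (void)b;                               // ... real per-byte work goes here ...
        ++iters;
    }
    rec.loop_iters = iters;

    rec.slow_path_taken = req.needs_fallback ? 1 : 0;   // effort: which major branch ran

    rec.snapshot_duration_end();
    rec.save();
}
#+end_src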

*** Setting up a benchmark
    When optimizing a program, you can't be sure that you've actually improved
    the performance without a /benchmark/ for comparison.  This doesn't have to
    be hard; it can be set up like a unit test.
    - Take some input that's characteristic of reality.
    - Run it through your code.
    - Collect results.

   Run before/after each change you want to compare, and you can tell if you're
   doing better.  As to how many times you want to run it: generally run it a
   bunch of times to get clean data, then investigate why it varies.

   Then repeatedly go back, change your instrumentation, and re-benchmark until
   you can predict the performance of slightly different code or input load.
   Once you have a real mapping between your load, your implementation, and its
   performance, you can start to alter the code to perform the way you want.

*** Data Analysis
    =ppt decode= emits CSV files, one per frame type.  You can use Jupyter, R,
    or Excel to great success with that data.  Yes, other output formats would
    be good too; I just haven't had a use case to write more.  CSV just keeps
    working, no matter how much it kinda sucks.

** When monitoring a program
   This intended use case doesn't have the support it deserves in =ppt= yet.
   Generally, you can define separate buffers for monitoring, and in the future
   =ppt= should have a monitor mode that presents a live-decoded view of the
   data in those buffers.

   The essential issue is that we need to present the monitor data in a way
   that scales up easily.  This may just be =ppt= emitting JSON on =stdout=.

* Limitations
  - =ppt= is only maintained for x86_64 Linux.  Not very portable, I know.  It
    used to be used mostly on Solaris, which doesn't really count.  If anyone
    asks why I still call it portable, I do almost all the development for it
    while on public transportation.

  - =ppt= attaches to a running process via =ptrace(2)=, which means that you
    can't be debugging the process at the same time.  As =ppt= only uses
    =ptrace(2)= when attaching and detaching, you can start the process, have
    ppt attach to it, and then have =gdb= attach to it.  It should work, but
    isn't exactly convenient.

  - C++ code gen only.  It's what I use, so it's what I developed this for.  As
    long as the generated code follows the same format and protocol, and emits
    the same symbols into the executable, other languages should be
    straightforward.
    - Back when this was used on Solaris, the generator was C based.  It
      wouldn't be hard to bring back if that were useful.
    - VM-based languages are probably not as straightforward.
    - python3 support is work-in-progress.

  - =save()= checks whether a =ppt= process has just attached or detached.
    When one has, it takes a few system calls to set up or tear down the
    internal buffering and performance counter measurement apparatus.  See the
    generated code (it's readable!) for details.

  - The data transfer between =ppt= and the instrumented process is /lossy/.
    - You can detect loss by watching sequence numbers in the frames.
    - You can increase the buffer size to reduce loss, but you incur more cache
      misses that way.  OTOH, larger buffers mean =ppt= doesn't have to wake up
      as often to get data.  If you have a lot of cache churn anyway, you may
      prefer to keep that CPU core idle more often.
      - Perhaps non-temporal stores could be used in the future to mitigate this.
    - This is the price you pay to keep the process under observation from
      blocking when the reader falls behind.
      
  - Performance counters are limited to 3 at a time, and require a system call
    per use.  There is /experimental/ code to avoid the system call interface
    when the counters are /architectural/ and can be read with =rdpmc=
    directly.
    - By /experimental/, I mean that it does what I think it's supposed to, but
      still seems to crash the process a lot.  Suggestions, pull requests,
      etc., more than welcome.
