OpenLava Reporting Tool – breport

Overview

The OpenLava Reporting Tool generates job-related reports from OpenLava accounting and job event data. It is available in the OpenLava Enterprise Edition only.

The tool outputs data in CSV format by default so that it can easily be imported into other tools. It also integrates with the open-source gnuplot tool to plot data graphically, as illustrated below.

Average job slot usage by user

The OpenLava Reporting Tool can generate the following reports. All reports are based on a specified time period.

  • Number of submitted jobs
  • Number of done, exited, or completed jobs
  • Average number of pending jobs
  • Average number of suspended jobs
  • Average job slot usage

Data can be separated by:

  • Users
  • Queues
  • Projects
  • User groups
  • Resource requirement strings

Reports can also be narrowed down by specifying one or more of the following:

  • User names
  • Queue names
  • Project names
  • User group names
  • Host names
  • Time period

Installation

If you don’t need to generate gnuplot charts, simply copy the breport binary to a directory, preferably one included in your $PATH, and run the command from there.

If you want to create gnuplot charts, you’ll also need to install gnuplot.

  • Install gnuplot 5 and some required fonts. Refer to the section “Gnuplot setup”.
  • Copy the gnuplot template file into the current working directory or $LSF_ENVDIR.

Syntax and options

Command syntax: breport [-h] [-f "file_name ..." | -f "path[+]"] [-C time0,time1] [-m "host_name ..."]
                        [-P "project_name ..."] [-q "queue_name ..."] [-G "user_group ..."]
                        [-u "user_name ..."] [-j "jobId ..."] [-l | -r report_spec [-i report_interval]]
                        [-p gnuplot_output_file] [-z timezone] [-V]

Every parameter is optional.

Table 1: breport options

Option: -h
Description and argument: Displays the usage information.
Default value: Displayed if there is an error in any argument.

Option: -V
Description and argument: Displays the version information.
Default value: None.

Option: -f "file_name ..." or -f path[+]
Description and argument: Specifies OpenLava accounting or event file names. Multiple file names can be specified within quotes, separated by spaces, e.g. "lsb.acct lsb.acct1". If the argument is a directory path, all lsb.acct* files in the directory are used. If the argument is a directory path followed by a '+' character (e.g. /opt/openlava-3.1/work/logdir+), both the lsb.acct* files and the lsb.events* files in that directory are read.
Default value: The lsb.acct file in the currently installed OpenLava environment.

Option: -C time0,time1
Description and argument: Specifies the reporting time period. The syntax is the same as the time syntax of the bhist command.
Default value: The entire time period covered by the events and/or accounting files.

Option: -m "host_name ..."
Description and argument: Uses data of jobs running on the specified hosts. Multiple hosts can be specified within quotes, separated by spaces.
Default value: All hosts in the OpenLava system.

Option: -P "project_name ..."
Description and argument: Uses data of jobs with the specified projects. Multiple projects can be specified within quotes, separated by spaces.
Default value: All projects in the OpenLava system.

Option: -q "queue_name ..."
Description and argument: Uses data of jobs running in the specified queues. Multiple queues can be specified within quotes, separated by spaces.
Default value: All queues in the OpenLava system.

Option: -u "user_name ..."
Description and argument: Uses data of jobs submitted by the specified users. Multiple users can be specified within quotes, separated by spaces.
Default value: All users in the OpenLava system.

Option: -G "user_group ..."
Description and argument: Uses data of jobs running with the specified user groups. Multiple user groups can be specified within quotes, separated by spaces. Note that the accounting file does not contain user group information, so this option cannot be used when only accounting data is read.
Default value: All user groups in the OpenLava system.

Option: -j "jobID ..."
Description and argument: Uses data of jobs with the specified job IDs. Multiple job IDs can be specified within quotes, separated by spaces. For array jobs, only the job ID can be specified.
Default value: All jobs in the OpenLava system.

Option: -l
Description and argument: Lists all reported jobs without generating the report.
Default value: Jobs are not listed.

Option: -p gnuplot_output_file
Description and argument: Generates a report chart by using gnuplot.
Default value: CSV output.

Option: -r report_type
Description and argument: Specifies the report type. The report type name has the format "statistics:category".
  Statistics is one of the following names:
    submit - number of submitted jobs
    done - number of done jobs
    exited - number of exited jobs
    completed - number of done and exited jobs
    run - average slot usage
    pending - average number of pending jobs
    suspend - average number of suspended jobs
  Category is one of the following names:
    user - data categorized by users
    queue - data categorized by queues
    project - data categorized by projects
    resreq - data categorized by resource requirement string
    ugroup - data categorized by user groups
    all - data is not categorized
Default value: run:all

Option: -i interval[unit]
Description and argument: Specifies the time interval between two consecutive report data points. Without a unit, the interval is in seconds; for example, "-i 3" means 3 seconds. The unit is an optional character:
    'm': minute
    'h': hour
    'd': day
Default value: The interval is calculated automatically so that about 160 report data points are generated for the entire reporting period.
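
The -i argument and its automatic default can be illustrated with a short Python sketch. The helper names parse_interval and default_interval are hypothetical and are not part of breport. The rounding used for the default is an assumption, since the description only states that about 160 data points are produced.

```python
# Hypothetical helpers illustrating the -i semantics described above;
# they are not part of breport.
UNIT_SECONDS = {"": 1, "m": 60, "h": 3600, "d": 86400}

def parse_interval(arg: str) -> int:
    """Convert an -i argument such as '3', '5m', '1h' or '1d' to seconds."""
    unit = arg[-1] if arg[-1:].isalpha() else ""
    number = arg[: len(arg) - len(unit)]
    if unit not in UNIT_SECONDS or not number.isdigit() or int(number) < 1:
        raise ValueError(f"invalid interval: {arg!r}")
    return int(number) * UNIT_SECONDS[unit]

def default_interval(period_seconds: int, points: int = 160) -> int:
    """Assumed default: split the reporting period into about 160 intervals,
    never going below the 1-second minimum."""
    return max(1, round(period_seconds / points))

print(parse_interval("3"))            # interval in seconds for "-i 3"
print(parse_interval("1h"))           # interval in seconds for "-i 1h"
print(default_interval(31 * 86400))   # assumed default for a 31-day period
```

Under these assumptions, a 31-day reporting period yields a default interval of a little over four and a half hours per data point.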

Using the OpenLava Reporting Tool

(1) Utilization by user report

Command example:

# breport -r run:user -p run-user.png

Report: Job slot usage by users over time. This report shows the share of job slots allocated by OpenLava to users over time.

The command reads data from the lsb.acct file in the OpenLava cluster, and calls gnuplot to generate a report image file run-user.png.

Average job slot usage by user

In this example, a fairshare scheduling policy is in use, and the user shares configured in lsb.users are: u0:1, u1:2, u2:3, u3:8, u4:8, u5:8, u6:6, u7:12. The chart shows that while every user has jobs, the OpenLava fairshare algorithm allocates job slots in the expected ratios.

Jobs from u7 (black) finished first, followed by the jobs of u3, u4, u5, and u6. The jobs of u0 finished last because u0 was allocated the smallest share of the cluster job slots.
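
To make the expected ratios concrete, the following Python sketch divides a number of job slots in proportion to the shares listed above. It only computes the static ratio implied by the configured shares; it is not the OpenLava fairshare algorithm. The function name expected_slots and the cluster size of 96 slots are assumptions chosen for illustration.

```python
# User shares as listed in the example above (from lsb.users).
shares = {"u0": 1, "u1": 2, "u2": 3, "u3": 8, "u4": 8, "u5": 8, "u6": 6, "u7": 12}

def expected_slots(shares, total_slots):
    """Split total_slots among users in proportion to their shares."""
    total_shares = sum(shares.values())
    return {user: total_slots * s / total_shares for user, s in shares.items()}

# 96 is a hypothetical cluster size chosen for illustration only.
for user, slots in expected_slots(shares, 96).items():
    print(f"{user}: {slots:g} slots")
```

The shares add up to 48, so in this sketch u7 is entitled to 12/48 of the slots and u0 to 1/48, which is consistent with u7 finishing first and u0 last.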

(2) Cluster utilization report

Similarly, the above graph also shows how the OpenLava cluster has been performing.

The top line in the chart represents the total job slots used over time. It is close to 100% usage when there are jobs running.

If you want to see an output showing cluster utilization only, not broken down by user, you can just run:

# breport -r run -p run.png

Or, if you want to import the data into a spreadsheet or another tool, you can let the tool write the data to stdout and redirect it to a file. For example:

# breport -r run > run.csv
# breport -r run:user > run-user.csv
# breport -r run:queue > run-queue.csv

(3) Job throughput report

To report on the job throughput of the OpenLava system, run the following command to show the rate at which jobs are completed:

# breport -r comp:user -p comp-user.png

Number of finished and exited jobs by queue

In the above chart, we see the number of jobs completed (done or exited) per hour, which is essentially the job throughput.

(4) Job failure report

If you want to examine how many jobs exited with a non-zero status indicating a failure, you can run the following command:

# breport -r exit -C 2015/10/30 -i 1h

This reports the number of exited jobs on October 30, 2015 with one row per hour. As shown in Table 1, you can generate statistics like this in intervals of seconds, minutes, hours, days or months.

(5) Charge back accounting report

To report charge back accounting information by individual users on a daily basis for the month of October, run:

# breport -r run:user -C 2015/10 -i 1d

This will report average job slot usage by user by day. You can import the generated data into a spreadsheet, then add all rows together for a user. The result represents the slot-day usage for each user in October 2015.
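
The spreadsheet step can also be scripted. The Python sketch below sums each user's daily average slot usage to obtain slot-days. The CSV layout is an assumption (a header row, a first column holding the time, and one column per user); the actual breport output may differ, and the sample values are invented for illustration. The function name slot_days is hypothetical.

```python
import csv
import io

def slot_days(csv_text: str) -> dict:
    """Sum each user's daily average slot usage over all rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    users = header[1:]                 # assumed: first column is the time
    totals = dict.fromkeys(users, 0.0)
    for row in reader:
        if not row:                    # skip blank lines
            continue
        for user, value in zip(users, row[1:]):
            totals[user] += float(value)
    return totals

# Invented sample data in the assumed layout, one row per day.
sample = """time,u0,u1
2015/10/01,2.0,4.5
2015/10/02,1.0,3.5
2015/10/03,0.0,4.0
"""
print(slot_days(sample))
```

Because each row covers one day, adding the daily averages gives slot-days directly. With a different -i value, each row would first have to be weighted by the interval length.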

(6) Scheduler performance report

Typically, when there are enough pending jobs, the slot usage should be high in the system. Sometimes, jobs are not dispatching while there are free job slots available. One of the common reasons for a job not being dispatched is unmet resource requirements.

First, we generate a report for the job slot usage by resource requirement. If we see empty spaces, we may generate another report showing pending jobs by resource requirement.

Average job slot usage by resreq string

In the above chart, we see some empty areas where job slots were not in use. Let’s look at the pending job report.

Average number of pending jobs by resreq string

There were many pending jobs during the period shown. The fact that there were pending jobs while the cluster was not fully utilized suggests that something else blocked the jobs from being scheduled. We may look at the OpenLava configuration to further our investigation.

(7) Zooming in on a time period of interest

To zoom in on a short time period, specify the time period with the -C option. To get more data points, you may also shorten the interval between reported data points using the -i option. The smallest interval allowed is 1 second.

You may also narrow down the data by specifying queue names, project names, user names, user group names, host names, job IDs, or any combination of these.

Time specification:

The “time0,time1” argument of the -C option must conform to the following syntax:

       time_argument = ptime,ptime | ptime, | ,ptime | itime
       ptime = day | /day | month/ | year/month/day | year/month/day/ | hour: | month/day | year/month/day/hour:
                   | year/month/day/hour:minute | day/hour: | month/day/hour: | day/hour:minute | hour:minute | month/day/hour:minute
                   | -itime
       itime = ptime
       day, month, hour, minute = two digits

where ‘ptime’ stands for a specific point of time, ‘itime’ stands for a specific interval of time, and ‘.’ stands for the current month/day/hour:minute.

Keeping the following rules in mind will help you to specify the time freely:

  • year must be 4 digits and followed by a /
  • month must be followed by a /
  • day must be preceded by a /
  • hour must be followed by a :
  • minute must be preceded by a :
  • The / before day can be omitted when day stands alone or when day is followed by /hour:
  • No spaces are allowed in the time format, that is, the time must be a single string.

The above time format is designed for easy and flexible time specification.

See the following examples:

Suppose the current time is Mar 9 17:06:30 2015.

1,8                   from Mar 1 00:00:00 2015 to Mar 8 23:59:00 2015;

,4 or ,/4             from the time when first job was logged to Mar 4 23:59:00 2015;

6 or /6               from Mar 6 00:00:00 2015 to Mar 6 23:59:00 2015;

2/                    from Feb 1 00:00:00 2015 to Feb 28 23:59:00 2015;

12:                   from Mar 9 12:00:00 2015 to Mar 9 12:59:00 2015;

2/1                   from Feb 1 00:00:00 2015 to Feb 1 23:59:00 2015;

2/1,                  from Feb 1 00:00:00 to the current time;

,. or ,               from the time when first job was logged to the current time;

,.-2                  from the time when first job was logged to Mar 7 17:06:30 2015;

,.-2/                 from the time when first job was logged to Jan 9 17:06:30 2015;

,2/10:                from the time when first job was logged to Mar 2 10:59:00 2015;

2014/11/25,2015/1/25  from Nov 25 00:00:00 2014 to Jan 25 23:59:00 2015;
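
As an illustration of how these forms expand, the Python sketch below handles four of the single-value forms from the examples above (day, month/, month/day and hour:), relative to a given current time. It is not the breport or bhist parser; the function name expand is hypothetical, and the other forms, including ranges and the '.' notation, are deliberately left out.

```python
import calendar
import re
from datetime import datetime

def expand(spec: str, now: datetime):
    """Return (start, end) for a few single-value forms; raise otherwise."""
    m = re.fullmatch(r"/?(\d{1,2})", spec)            # day or /day
    if m:
        day = int(m.group(1))
        return (datetime(now.year, now.month, day, 0, 0),
                datetime(now.year, now.month, day, 23, 59))
    m = re.fullmatch(r"(\d{1,2})/", spec)             # month/
    if m:
        month = int(m.group(1))
        last_day = calendar.monthrange(now.year, month)[1]
        return (datetime(now.year, month, 1, 0, 0),
                datetime(now.year, month, last_day, 23, 59))
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})", spec)    # month/day
    if m:
        month, day = int(m.group(1)), int(m.group(2))
        return (datetime(now.year, month, day, 0, 0),
                datetime(now.year, month, day, 23, 59))
    m = re.fullmatch(r"(\d{1,2}):", spec)             # hour:
    if m:
        hour = int(m.group(1))
        return (datetime(now.year, now.month, now.day, hour, 0),
                datetime(now.year, now.month, now.day, hour, 59))
    raise ValueError(f"form not covered by this sketch: {spec!r}")

# Current time used in the examples above.
now = datetime(2015, 3, 9, 17, 6, 30)
for spec in ("6", "/6", "2/", "12:", "2/1"):
    start, end = expand(spec, now)
    print(f"{spec:5} from {start} to {end}")
```

Each form takes the missing fields (year, month or day) from the current time, which is why the same string can select different periods on different days.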

Gnuplot setup:

For breport to generate gnuplot charts, gnuplot 5.0 is required. You can download the source, then compile and install it.

Before compiling and installing gnuplot, prepare the environment as follows:

  • For gnuplot to generate png image files, make sure you install the following packages:
    # yum install gd-devel cairo-devel pango-devel
  • Install some fonts. The breport gnuplot template uses fonts from the gnu-free-sans-fonts package:
    # yum install gnu-free-sans-fonts

Compile and install gnuplot. Run the following as root:

# tar xvfz gnuplot-5.0.1.tar.gz
# cd gnuplot-5.0.1
# ./configure --prefix=/usr
# make
# make install

If you want to change the look and feel of the gnuplot chart, you may modify the template file bgnu.temp. Do so only if you are familiar with gnuplot, because an invalid template can prevent charts from being generated.

Limitations:

Currently, breport uses only the OpenLava accounting and events data files as its source of information. As a result, it has the following limitations:

  • No OpenLava configuration information is captured in the OpenLava accounting and events files, so the tool can’t report slot utilization as a percentage.
  • If you use only lsb.acct (the accounting file) as the data source, no user group information is available. Job suspension and resume information is also missing.
  • There is no data for job pending reasons.
  • There is no data for system loads.
  • The time values in the report are in local time. If the data was created in a time zone different from the one where breport runs, be aware of the time difference. For example, for jobs run in Singapore at 11:00pm November 1, a report run in North America EST shows 10:00am November 1.
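
The time zone example in the last item can be checked with a few lines of Python. The sketch uses fixed UTC offsets (UTC+8 for Singapore, UTC-5 for EST); the year 2015 is an assumption, since the example does not state one.

```python
from datetime import datetime, timedelta, timezone

SGT = timezone(timedelta(hours=8), "SGT")    # Singapore, UTC+8
EST = timezone(timedelta(hours=-5), "EST")   # North America EST, UTC-5

# Job time as recorded in Singapore (year assumed for illustration).
job_time = datetime(2015, 11, 1, 23, 0, tzinfo=SGT)

# The same instant as seen on a machine running in EST.
report_time = job_time.astimezone(EST)
print(report_time.strftime("%Y-%m-%d %H:%M %Z"))
```

The two zones are 13 hours apart, so 11:00pm in Singapore corresponds to 10:00am on the same date in EST, as in the example above.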