Monday, May 20, 2019

Filtering data in CSV file using Python

As promised in my previous post, Creating Excel spreadsheet with IBM i library report using Python, I will in this article show you how to use Python to remove rows from a comma-separated values (CSV) file based on the value of a column. This is a situation I was in some time ago, where I had to reduce a CSV file containing millions of rows by filtering out rows based on the value in one of the columns. I tried several tools before ending up with a solution that was surprisingly simple.

There are several ways to do the job of filtering a comma-separated file on IBM i besides using Python - for example you can

  • import the file to a table using CPYFRMIMPF, remove unwanted rows using SQL and export the table using CPYTOIMPF
  • use the (g)awk tool in a PASE shell
  • use the DATA-INTO operation in RPG together with a CSV parser
  • move the file to a PC, filter using a PC tool and move the file back to IBM i
  • and you can probably add to this list...

However, the reason I chose Python for the job was its simplicity - I only needed to code four lines of Python! Another reason was that the actual file to be filtered was quite big - more than 1 GB in size. Not many PC tools can handle text files that large, and even fewer can understand and filter the data in a CSV file.

The simplicity of the solution was due not only to Python itself but even more to the Python module called Pandas. Pandas is a high-performance library for data analysis that is very easy to use. It supports several file formats including JSON, Excel and database (SQL) - as well as flat files, which include CSV files. The Pandas package has been ported to IBM i by IBM and can easily be installed using either the Open Source Package Management in Access Client Solutions or the YUM package manager in a shell. Here is the yum command to install Python 3 and Pandas in a single command:

yum install python3 python3-pandas

Pandas depends on date and timezone packages, so you must also install these for Pandas to work:

yum install python3-pip python3-dateutil
pip3 install pytz

A central item in the Pandas package is the DataFrame object. This is a two-dimensional data structure, comparable to an SQL table. This fits a CSV file perfectly, since a CSV file is just rows of columns separated by a delimiter. The first row in our CSV file contains the names of the columns, and these names will be used as column labels in the DataFrame.
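
To get a feel for the concept, here is a tiny sketch - the item data is made up for this illustration - showing how a DataFrame behaves like a small table with labelled columns:

import pandas as pd

# Build a DataFrame from a dictionary - each key becomes a column label.
df = pd.DataFrame({'ITEM': ['A', 'B'], 'PRICE': [100, 120]})
print(df)           # two rows with an auto-generated index 0, 1
print(df['PRICE'])  # a single column is accessed by its label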

In this post I will use a very simple CSV file as input. It looks like this:

ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
C;Item C;30;1000;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE

Now, the full Python program to filter this CSV file looks like this:

import pandas as pd
data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
new_data = data.query( 'PRICE < 1000' )
new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )

I will go through the lines one by one and explain what is going on.

import pandas as pd

To make the Pandas library accessible to the Python program, we import the package. The alias "pd" is a shortcut we use when referencing the library later in the program.

data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )

Here we read the CSV file into a Pandas DataFrame, which we call "data". The parameters for the "read_csv" function are the following:
  1. the name of the CSV file to read
  2. the number of rows to read (None means all rows)
  3. the delimiter (in this case the semicolon ";") and
  4. the memory usage strategy (low_memory=False means Pandas should not process the file in internal chunks while inferring the column data types, which gives consistent types at the cost of more memory).
There are many more parameters available in the read_csv function, and you can read about them in the API reference.
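
As an aside, if a file were too big to hold in memory in one piece, read_csv can also deliver the data in chunks. This is only a sketch of the idea - the chunk size of 100000 rows is an arbitrary choice:

import pandas as pd

# Read, filter and concatenate the file in chunks of 100000 rows each.
chunks = pd.read_csv('input.csv', delimiter=';', chunksize=100000)
filtered = pd.concat(chunk.query('PRICE < 1000') for chunk in chunks)
filtered.to_csv('output.csv', sep=';', index=False, float_format='%.f')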

When the line has executed, the DataFrame object "data" holds the entire content of the CSV file. We can now query and manipulate the DataFrame as we like. In this example we will only query the DataFrame:

new_data = data.query( 'PRICE < 1000' )

We call the "query" method for the "data" DataFrame object and use it to filter the data. In this example we include only rows where the value in PRICE is less than 1000. The output of the operation is a new DataFrame object called "new_data".

new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )

Finally we write the DataFrame with the filtered data to a new CSV file. We do this by calling the method "to_csv" on the "new_data" object, specifying several parameters:
  1. the name of the new CSV file
  2. what separator (delimiter) to use
  3. not to include the index of the DataFrame and
  4. how to format floating point values.
The first two parameters are obvious, but the third and fourth are not.

A DataFrame has an index (row labels) and columns (column labels), but we are only using the column labels (retrieved from the header row in the file "input.csv"). The index has been automatically generated by the "read_csv" function (think of it as a row id), so to avoid having the index written to the new CSV file, we specify "index=False" on the "to_csv" call.

You may wonder why we have to deal with floating point numbers at all, since our CSV example only contains numeric values of type integer. This is due to the way Pandas works when it encounters missing values in an integer column - it converts the column data type from integer to float, because the NaN (Not-a-Number) value in Pandas is a float. Item D in row 5 of our example has no stock value (the item has not been released yet), so the STOCK column will be converted to float. When we write the DataFrame to the CSV file, we specify '%.f' as the format, which will print the integer values without any decimals and the NaN values as null (empty).
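
As another aside, newer versions of Pandas also offer a nullable integer data type called "Int64" that can represent missing values without converting the column to float. A sketch of how the STOCK column could be kept as integers, assuming a Pandas version that supports this dtype in read_csv:

import pandas as pd

# Ask read_csv to load STOCK as a nullable integer column instead of float.
data = pd.read_csv('input.csv', delimiter=';', dtype={'STOCK': 'Int64'})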

The new CSV file created by the Python program looks like this:

ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE

Item C has been filtered out and is not in the output CSV file. Its price of 1000 did not satisfy the query condition, so the row was removed - which was the purpose of this exercise.

You can also use the Pandas functions interactively by starting Python 3 and entering the lines one by one. If, for example, you want to see the data that Pandas has imported after the "read_csv" function, you simply type "data" and press Enter, and Python will show the DataFrame:

>>> import pandas as pd
>>> data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
>>> data
  ITEM DESCRIPTION  STOCK  PRICE          STATUS
0    A      Item A   10.0    100          ACTIVE
1    B      Item B   20.0    120          ACTIVE
2    C      Item C   30.0   1000          ACTIVE
3    D      Item D    NaN    100  TO BE RELEASED
4    E      Item E   45.0    105          ACTIVE

I hope this article has given you an idea of what a brilliant and powerful tool Pandas is for filtering and analyzing large amounts of data, and I encourage you to read the Pandas API reference and play with its functions. The library can do so much - with great performance - and we have not even scratched the surface in this post. The documentation is excellent and includes a quick introduction to Pandas as well as tutorials. If you ever work with data sets and data analysis, this library can help you work with your data. And remember: it also works with data in Excel and databases - and it can be automated very easily.

And best of all - it is available on IBM i!

Thank you for reading.

Friday, May 03, 2019

Creating Excel spreadsheet with IBM i library report using Python

A few days ago Mike Larsen published a great article titled "Creating Excel Spreadsheets With Python", in which he described how easy it is to populate Excel spreadsheets with data from IBM i. Mike demonstrated several features like formatting, tab colors and conditional formatting in his example and showed how few lines of Python code are needed to get a good-looking result.

His article reminded me of another Python script that I wrote last year using the very same technique. My Python script is called "lib_report.py" and creates an Excel spreadsheet containing extracted information about the objects in a library, including

  • Object count
  • Object size total
  • Object size total by age
  • Object name, type, description, size and last used date

One of the most useful features of the generated spreadsheet is the filter on the detail information, which lets the user filter and sort the information. This Python script demonstrates some very useful Excel features not shown in Mike's article, so I am publishing my script to complement his.

The source for the Python script can be found here and is a mere 100 lines - including comments! The Python modules used are again 'xlsxwriter' and 'ibm_db_dbi' in addition to the 'sys' module. I will go through the script and explain what the code does and how the Excel features are implemented:

#!/usr/bin/env python3

This line is a so-called "shebang", telling the shell how to run the script. I won't go into details (and you can read more about shebangs here), but it is a nice way to make text files into executables.

import xlsxwriter
import ibm_db_dbi as dbi
import sys

Here we import the modules used in the Python script.

lib = sys.argv[1].upper()

A Python variable called "lib" is set to the value (in uppercase) of the first argument to the script. You specify what library to examine by calling the script and adding the name of the library as a parameter, e.g. "lib_report.py qgpl".
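
The script assumes that a library name is always passed. Purely as a defensive sketch (not part of the original script), a check could be added like this:

import sys

# Stop with a usage message if no library name was passed.
if len(sys.argv) < 2:
    sys.exit('Usage: lib_report.py <library>')

lib = sys.argv[1].upper()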

output_filename = 'lib_report_' + lib + '.xlsx'

Here we create the filename of the Excel spreadsheet. The filename will be of the form "lib_report_<library name>.xlsx".

conn = dbi.connect(dsn=None, database='*LOCAL', user=None, password=None)
cur = conn.cursor()

These lines connect to the database and create a cursor variable.

sql_allobj = """\
select objname, objtype, objtext, objsize,
       coalesce( date( last_used_timestamp ), date( objcreated ) ) as objdate
  from table( qsys2.object_statistics( ?, '*ALL' ) ) os
 order by objsize desc
"""

Here we define the SQL statement that will extract the object information from the library. Note that we use the SQL service OBJECT_STATISTICS to do the job, and that the statement will receive the library name as a parameter - the "?" in the statement text.

print('Getting object statistics for ' + lib + '.')

sql_parms = [ lib ]
cur.execute(sql_allobj, sql_parms)
headers = [descr[0] for descr in cur.description]

We print status information on the console, set the parameter for the SQL statement and execute the statement. Then the column headers from the result set are extracted so we can write them to the Excel sheet later.

print('Creating Excel file with object statistics.')

workbook = xlsxwriter.Workbook(output_filename)
worksheet_summary = workbook.add_worksheet('Summary')
worksheet_details = workbook.add_worksheet('Detail')

Again we print status to the console before creating the Excel file. Then we add two sheets to the spreadsheet - one for the summary and another for the detailed information.

# Add a bold format for the headers.
bold = workbook.add_format({'bold': True})
bold_right = workbook.add_format({'bold': True, 'align': 'right'})

Some formats are created to be used with the headers, like bold and right-justified. You can't add formatting directly to one or more cells; you have to create a format object first and then apply it to the cell(s). This can be cumbersome, because you can only apply one format object to a cell and thus have to create a format object for each distinct combination of formats used in the Excel spreadsheet.

# Make the columns wider.
worksheet_details.set_column('A:A', 14)
worksheet_details.set_column('B:B', 10)
worksheet_details.set_column('C:C', 48)
# Make the header rows larger and bold.
worksheet_details.set_row(0, 20, bold)
worksheet_details.set_row(1, 20, bold)
worksheet_details.set_row(2, 20, bold)
# Write the headers.
worksheet_details.write_row('A3', headers)

We set the column width and row height of the header rows on the detail tab, and then we write the headers which we extracted from the result set earlier.

size_format = workbook.add_format({'num_format': 0x03})
size_bold_format = workbook.add_format({'num_format': 0x03, 'bold': True})
objdate_format = workbook.add_format({'num_format': 'yyyy-mm-dd'})

# Display size as formatted integer.
worksheet_details.set_column('D:D', 16, size_format)
# Display used as formatted date.
worksheet_details.set_column('E:E', 14, objdate_format)

We create some formats for the object size and date and apply them to the columns D and E.

row = 2
for row_data in cur:
    # Move on to the next worksheet row.
    row += 1
    col = 0
    for col_data in row_data:
        worksheet_details.write(row, col, col_data)
        # Move on to the next worksheet column.
        col += 1

object_count = row - 2

Now we populate the rows in the detail sheet. First we initialize the row counter to 2 (rows and columns are indexed starting with zero, and the counter is incremented before each write, so the first data row lands at index 3 - just below the headers at row index 2). Then we loop through the rows in the result set: increment the row counter, initialize the column counter and loop through the columns in the row, writing each value to its cell. After the loops have completed, we derive the count of objects from the final row counter.
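
As a side note, the same loops can be written a bit more compactly with Python's enumerate function - a sketch of an equivalent version, assuming the result set contains at least one row (otherwise "row" would be undefined):

# Enumerate the result set rows starting at worksheet row index 3.
for row, row_data in enumerate(cur, start=3):
    # Enumerate the columns starting at worksheet column index 0.
    for col, col_data in enumerate(row_data):
        worksheet_details.write(row, col, col_data)

object_count = row - 2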

workbook.define_name('Detail!Objsizes', '=Detail!$D$4:$D$' + str(object_count + 3))
workbook.define_name('Detail!Objdates', '=Detail!$E$4:$E$' + str(object_count + 3))

Here we define two cell ranges in the Excel sheet. One cell range is called 'Detail!Objsizes' and covers all the values that we have written to column D. The same goes for the second cell range 'Detail!Objdates', which covers the values in column E.

worksheet_details.write(0, 0, 'Object count', bold)
worksheet_details.write(0, 1, '=SUBTOTAL(102, Detail!Objsizes)', size_bold_format)
worksheet_details.write(0, 2, 'Object size total', bold_right)
worksheet_details.write(0, 3, '=SUBTOTAL(109, Detail!Objsizes)', size_bold_format)

Now we write some text and formulas to the first row in the detail sheet. The formulas use the Excel function SUBTOTAL, which can calculate statistical values like count and sum according to the first parameter value passed to the function. The value "102" gives a count and "109" a sum. Both variants make the SUBTOTAL function exclude hidden cells from the result - this allows the user to hide one or more rows in the sheet and recalculate without those rows. A very dynamic function indeed!

# Set autofilter on.
worksheet_details.autofilter('A3:E' + str(object_count + 3))

As the comment says, we turn filtering on. This instruction enables autofiltering in the detail sheet on the cell range from column A to E and from row 3 to the count of objects plus 3, i.e. all the object detail information. With filtering the user can easily select or omit certain values in the data, e.g. one or more years among the object dates, or sort the rows by one of the columns in ascending or descending order. Again, a very useful feature if the user wants to examine the object statistics further.

worksheet_summary.set_column('A:A', 16, bold)
worksheet_summary.set_column('B:E', 18, size_format)
worksheet_summary.write_row(0, 0, ('Library name', lib))
worksheet_summary.write_row(1, 0, ('Object count', '=COUNT(Detail!Objsizes)'))
worksheet_summary.write_row(2, 0, ('Object size total', '=SUM(Detail!Objsizes)'))
worksheet_summary.write(2, 2, '=CONCATENATE(TEXT(SUM(Detail!Objsizes)/1024/1024/1024, "#.##0"), " GB")', bold_right)

Now we populate the summary sheet. First we set the column sizes and formats, and then we write some rows with the library name, the object count, the object size total and the object size total in gigabytes. The object count and size totals are formulas that again reference the detail cell ranges we defined above.

worksheet_summary.write_row(3, 0, ('Object age', '0-3 months', '3-6 months', '6-12 months', '>12 months'), bold_right)
worksheet_summary.write(4, 1, '=SUMIFS(Detail!Objsizes, Detail!Objdates, ">=" & EDATE(TODAY(), -3))', size_format)
worksheet_summary.write(4, 2, '=SUMIFS(Detail!Objsizes, Detail!Objdates, "<" & EDATE(TODAY(), -3), Detail!Objdates, ">=" & EDATE(TODAY(), -6))', size_format)
worksheet_summary.write(4, 3, '=SUMIFS(Detail!Objsizes, Detail!Objdates, "<" & EDATE(TODAY(), -6), Detail!Objdates, ">=" & EDATE(TODAY(), -12))', size_format)
worksheet_summary.write(4, 4, '=SUMIFS(Detail!Objsizes, Detail!Objdates, "<" & EDATE(TODAY(), -12))', size_format)

Next in the summary sheet we write a row that shows the sum of the object sizes by object age. This involves the Excel function SUMIFS, which can summarize conditionally. We write four formulas: the size of objects created or used within the last 3 months, from 3 to 6 months, from 6 to 12 months and finally older than 12 months.

workbook.close()

Finally we close the Excel workbook. Until now the workbook has existed only in memory; closing it writes the actual file.

...and we're done! We now have an Excel spreadsheet file containing summary and detailed information about the objects in a library, and the Excel user can filter and sort the data for further analysis.

The generated Excel files could use some polishing and brushing-up, but I will leave that up to you. Feel free to use this Python script as you like, maybe as a basis for further exploration of how to generate advanced Excel spreadsheets on IBM i. The one thing to remember is that you have all the components necessary to do this without any cost - you don't need to buy any additional software. Just install the Python RPM if you have not already done so.

IBM has done a great job adding open source technology to the platform, and Python is one of the best tools in many situations. This is one example of such a situation - in 100 lines of Python we have generated an advanced Excel spreadsheet file. Later I will show you another example where Python comes to the rescue - removing rows from a comma-separated file based on the value of a column. Coming soon...

Friday, January 27, 2017

Making system history log searches go faster

In my previous post IBM i 7.1 is not dead - yet... I introduced the SQL function HISTORY_LOG_INFO. This function has been created by IBM for IBM i 7.3 and 7.2, and because this function is such a powerful tool for a system administrator doing system log analysis, I decided to create a similar function for 7.1.

I promised those of you following this blog that enhancements to the function would follow shortly. Now the updated function is ready and I can finally write about it. The reason for the update is that the system history log (QHST) can contain a great amount of data, and searching it can be quite CPU intensive and take longer than anticipated.

The API behind my function is called Open List of History Log Messages (QMHOLHST), and you can specify some selection parameters when calling the API. You can select specific message identifiers, message types and/or jobs, and I have added these selection parameters to my version of the function. You may argue that you already have this selection just by adding a WHERE clause to the SQL statement containing the HISTORY_LOG_INFO table function, and that is true indeed. However, the WHERE clause is only applied after the table function has formatted and returned all its data to the DB2 engine. By letting the user pass the selection criteria to the table function as parameters, the selection can be forwarded to the API, and much less data will be returned and formatted, thus saving time and CPU.

The first version of my function had two parameters - the starting and ending timestamps. The new version adds three parameters - a list of message identifiers, a list of message types and finally a list of jobs to find. Does that sound familiar? Well, they are exactly the same parameters as those on the DSPLOG command!

The list of message identifiers can contain up to 200 items, each separated by a comma or a blank. The identifiers can be entered in upper or lower case and can be generic. For example, CPF0000 selects all CPF messages that meet the specifications of the other parameters.

The list of message types can contain up to 9 types, again separated by commas or blanks and entered in upper or lower case.

And finally, the list of jobs can contain up to 5 jobs, more or less qualified. A qualified job name consists of three elements: job number, job user and job name, separated by slashes. You may specify just the job name, but you can add the user (and optionally the job number) if you like. No generic job values are allowed.

The new version of the function is found in my Bitbucket git repository, just like the previous one. The README documentation has been updated with a description of the new parameters and some additional examples of how to use them. Go and have a look.
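
Purely as an illustration - the timestamps, message identifiers, types and job names below are made up, and the README describes the exact syntax - a call with all five parameters from Python using the ibm_db_dbi module could look like this:

import ibm_db_dbi as dbi

conn = dbi.connect(dsn=None, database='*LOCAL', user=None, password=None)
cur = conn.cursor()
# All selection values below are made-up examples.
cur.execute("""
select * from table( history_log_info(
         current timestamp - 24 hours,          -- starting timestamp
         current timestamp,                     -- ending timestamp
         'CPF0000 CPI0000',                     -- message identifiers (generic allowed)
         '*ESCAPE *INFO',                       -- message types
         'QZDASOINIT 123456/QUSER/MYJOB' ) ) x  -- jobs, more or less qualified
""")
for row in cur:
    print(row)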

I hope that these enhancements will help make your system log searches much faster, thus saving time and CPU. If you have any questions or suggestions, you are welcome to comment on this post, or you can enter your question or suggestion in the Issues section of the Bitbucket git repository.

Thursday, December 08, 2016

IBM i 7.1 is not dead - yet...

IBM i 7.1 is riding towards the sunset - but there is still some daylight left. This popular version of the IBM i operating system was released by IBM in April 2010 and has been available for more than six years now, completely stable from the beginning and throughout its lifetime. Although it has been succeeded by IBM i 7.2 in May 2014 and 7.3 in April 2016, it is still very popular on many Power servers - thanks to the many enhancements and new functions added to the version since its birth.

The enhancements and functions added to IBM i 7.1 were provided through a new concept introduced by IBM shortly after the arrival of 7.1 - the Technology Refresh. A Technology Refresh (just referred to as "TR" by many IBM i professionals) is a way for IBM to add new functions to the operating system without having to create a full new version of it. Installing a new version of IBM i can be a huge task, because replacing the entire bottom of the software stack always requires extensive testing of all software on top of the operating system, which makes many customers refrain from upgrading. A Technology Refresh is a much smaller package than a full new version and is easy to install, since no other piece of software is affected. The small size also means that new functions can be made available to customers much more quickly - we don't have to wait several years for a new version to arrive, only the time it takes IBM to produce the next TR.

A new TR for IBM i 7.1 was released every six months from April 2010 onwards, even after IBM i 7.2 became available in 2014. Many functions added to the newer version 7.2 were backported to 7.1 through a Technology Refresh, thus breathing more life into the aging release. And when 7.2 got a new TR, there was a TR for 7.1 as well. But no more - shortly before the announcement of IBM i 7.3 in 2016, IBM told us that no more TRs would be delivered for 7.1. TR11 was the last Technology Refresh that IBM i 7.1 would get.

As I described in my previous post called "DB2 for i Services - a great tool for system administration", IBM is working hard to extend SQL with great new services to facilitate system management. One of the new functions added to IBM i 7.3 in TR1 (and IBM i 7.2 in TR5) was a table function called HISTORY_LOG_INFO that returns information from the history log. The history log is a central piece of the operating system and important to system administrators because it contains information about every job on the system. Important system messages are also sent to the history log. Previously it was difficult to search the history log for information because there was no easy way to read the messages in the log programmatically (i.e. no output file option on the DSPLOG command), but this new function makes it easy to pull the information from the history log and analyze it.

Unfortunately for those of us still on IBM i 7.1 this new and very useful function will never be delivered by IBM to our release, since no new TR is coming. I'm not disappointed by IBM - they must focus on the newer versions and not waste resources on keeping old software alive. IBM i 7.3 is very attractive and offers so much in terms of new functionality and stability, and it would be nice to jump to 7.3 right away and get access to the HISTORY_LOG_INFO function. But this is not an option for most of us on IBM i 7.1 at this time, so what to do? Well, if we can't make IBM give us this function we'll have to do it ourselves...

So I decided to build a similar table function for the old but trusted IBM i 7.1! The building blocks were already available: SQL user-defined table functions and an API to get data from the history log. I took the liberty of using the same name, parameters and result columns for my function as IBM has used for theirs, so when we finally make it to IBM i 7.3, we won't have to recode our use of this function.

You will find the new table function HISTORY_LOG_INFO for IBM i 7.1 in my git repository on Bitbucket - just click here. The welcome page contains information about how to create the function on your own system, and the source code is open source and can be modified to your own needs, if so desired.

The table function is defined in SQL, but the service program utilized by the SQL function is written in RPG. The function takes two optional parameters, the starting and ending timestamps for the messages to be retrieved. It then calls the system API QMHOLHST to get a list of history log messages between the starting and ending timestamps. Each message is then read and the data formatted before being returned to the SQL function.

Formatting the data can be very CPU intensive, so be careful when you use this function! As you probably have experienced in the past, working with the history log can be a time-consuming process because of the often huge amount of messages in it. And this function is not a silver bullet - all that data still has to be formatted by the service program before you get the result from SQL.

Feel free to use this function as is or modify it to your needs. And be patient - some important enhancements will be made to the function shortly. They will be covered in an upcoming blog post.

Friday, December 02, 2016

DB2 for i Services - a great tool for system administration

System management on IBM i has always been easy - there are commands to list and work with every object type on the system. And when you need to work with a list of objects, most of the commands support an output file parameter, so you can dump the list into a file and extract and manipulate it from there as you wish.

However, some of the commands used for system management do not have the option to write the list to a file. Instead you had to use an API to retrieve the list, which could mean reading complex API documentation, understanding all the parameters of the API and knowing how to work with user spaces or wherever the output from the API was placed. This was tedious and error prone, and few people had the knowledge, time and energy for it.

IBM wanted to make it easier to manage IBM i for people without extensive knowledge of the system, so they combined the existing APIs with an excellent tool for handling and manipulating lists - SQL! IBM added SQL views and table functions to list object and system information, and now the information is easy to work with - just fire up an SQL prompt and select the data from the view or table function that returns the information you want. Since you are using SQL to retrieve the list, all the SQL facilities like selecting, grouping, ordering, joining etc. are at your disposal. IBM even found a great name for these new system management functions: "DB2 for i Services". You may also see them named "IBM i Services" or "SQL Services".
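
As a small taste, here is a sketch using one such service - the OBJECT_STATISTICS table function - combined with ordinary SQL grouping and ordering, run from Python with the ibm_db_dbi module (the library QGPL is just an example):

import ibm_db_dbi as dbi

conn = dbi.connect(dsn=None, database='*LOCAL', user=None, password=None)
cur = conn.cursor()
# Count objects and sum object sizes per object type in library QGPL -
# plain SQL does all the work.
cur.execute("""
select objtype, count(*) as objects, sum(objsize) as total_size
  from table( qsys2.object_statistics( 'QGPL', '*ALL' ) ) x
 group by objtype
 order by total_size desc
""")
for row in cur:
    print(row)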

The services are documented at the DeveloperWorks website. In the beginning the number of services was small, but it has grown tremendously over the years. Take a look at the services available on DeveloperWorks and you will surely find one or more that you can use to great benefit. Even if you are not on the latest and greatest version of IBM i, there are services available to you - even on IBM i 6.1!

IBM has even provided samples of the DB2 for i Services in the Access Client Solutions product - the replacement for IBM i Access for Windows. The Run SQL Scripts plugin has a whole bunch of samples ready to be inserted into the SQL editor. Just go to the Edit menu and select "Insert from Examples", and you will get a long list like this:


Click on the one you want, and a sample SQL statement returning the selected information will be inserted into the typing area of the Run SQL Scripts window.


Go ahead and explore the services and see what IBM has made available for you. In an upcoming post I will show how you can make your own services like the ones that IBM has created.

Until then - have fun!


Wednesday, February 24, 2016

Why?

Why this blog? Well, I wanted a place to express my passion for the best computer operating system ever built, a place where I could explain the system to others who will hopefully also come to find the operating system as superior as I do.

This operating system is a hidden gem. Not many people in the IT world are aware of it, but those who know about IBM i are mostly passionate about it. There was a project conducted with some young IT people who had never heard of IBM i, and they immediately fell in love with the system. Their report can be read here. Another story about how IBM i can impress people can be found here.

The reason why this great operating system is not known to the majority of the industry is that it just works! You rarely hear about something that just runs and runs without any problems at all - and that is exactly the case here. Since the system is designed as a multi-user system serving up to thousands of users, it is also a bit difficult to get learning access - you can't just install it on your personal computer like other operating systems and software.

But once you have learned about IBM i, you will love it!