Monday, May 20, 2019

Filtering data in CSV file using Python

As promised in my previous post, Creating Excel spreadsheet with IBM i library report using Python, I will in this article show you how to use Python to remove rows from a comma-separated values (CSV) file based on the value of a column. I found myself in this situation some time ago, when I had to reduce a CSV file containing millions of rows by filtering the rows on the value of one of the columns. I tried several tools before ending up with a solution that was surprisingly simple.

There are several ways to do the job of filtering a comma-separated file on IBM i besides using Python - for example you can

  • import the file to a table using CPYFRMIMPF, remove unwanted rows using SQL and export the table using CPYTOIMPF
  • use the (g)awk tool in a PASE shell
  • use the DATA-INTO operation in RPG together with a CSV parser
  • move the file to a PC, filter using a PC tool and move the file back to IBM i
  • and you can probably add to this list...

However, I chose Python for the job because of its simplicity - I only needed to write four lines of Python! Another reason was that the actual file to be filtered was quite big - more than 1 GB in size. Not many PC tools can handle text files that large, and even fewer can understand and filter the data in a CSV file.

The simplicity of the solution was due not only to Python but even more to the Python module called Pandas. Pandas is a high-performance library for data analysis that is very easy to use. It supports several file formats including JSON, Excel, database (SQL) - and flat files, which include CSV files. The Pandas package has been ported to IBM i by IBM and can easily be installed using either the Open Source Package Management in Access Client Solutions or the YUM package manager in a shell. Here is how to install Python 3 and Pandas in one single yum command:

yum install python3 python3-pandas

Pandas depends on date and timezone packages, so you must also install these for Pandas to work:

yum install python3-pip python3-dateutil
pip3 install pytz

A central item in the Pandas package is the DataFrame object. This is a two-dimensional data structure that you can compare to an SQL table. That fits a CSV file perfectly, since a CSV file is just rows with columns separated by a delimiter. The first row in our CSV file contains the names of the columns, and these names will be used as column labels in the DataFrame.
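
As a quick illustration (a minimal sketch, not part of the filtering program), you can build a small DataFrame by hand and see the table-like structure with column labels and an automatically generated index:

import pandas as pd

# A tiny hand-built DataFrame - two columns, two rows.
df = pd.DataFrame({'ITEM': ['A', 'B'], 'PRICE': [100, 120]})
print(df)
#   ITEM  PRICE
# 0    A    100
# 1    B    120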

In this post I will use a very simple CSV file as input. It looks like this:

ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
C;Item C;30;1000;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE

Now, the full Python program to filter this CSV file looks like this:

import pandas as pd
data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
new_data = data.query( 'PRICE < 1000' )
new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )

I will go through the lines one by one and explain what is going on.

import pandas as pd

In order to make the Pandas library accessible to the Python program, we import the package. The "pd" is a short alias for the library that we use when referencing it later in the program.

data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )

Here we read the CSV file into a Pandas DataFrame which we call "data". The parameters for the function "read_csv" are the following:
  1. the name of the CSV file to read
  2. the number of rows to read (None means all rows)
  3. the delimiter (in this case the semicolon ";") and
  4. the memory allowed to be used (low_memory=False means Pandas should not process the file in chunks).
There are many more parameters available in the read_csv function, and you can read about them in the API reference.
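
One parameter worth mentioning for very large files like the one that motivated this post is "chunksize". The following is a minimal sketch (not the method used in this post) of how the same filtering could be done in chunks, so the whole file never has to fit in memory at once:

# Read and filter the CSV file in chunks of 100,000 rows, appending
# each filtered chunk to the output file.
first = True
for chunk in pd.read_csv('input.csv', delimiter=';', chunksize=100000):
    chunk.query('PRICE < 1000').to_csv('output.csv', sep=';', index=False,
                                       header=first, float_format='%.f',
                                       mode='w' if first else 'a')
    first = False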

When the line has executed, the DataFrame object "data" has all content of the CSV file. We can now query and manipulate the DataFrame as we like. In this example we will only query the DataFrame:

new_data = data.query( 'PRICE < 1000' )

We call the "query" method for the "data" DataFrame object and use it to filter the data. In this example we include only rows where the value in PRICE is less than 1000. The output of the operation is a new DataFrame object called "new_data".

new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )

Finally we write the DataFrame with the filtered data to a new CSV file. We do this by calling the method "to_csv" for the "new_data" object, specifying several parameters:
  1. the name of the new CSV file
  2. what separator (delimiter) to use
  3. not to include the index of the DataFrame and
  4. how to format floating point values.
The first two parameters are obvious, but the third and fourth parameters are not.

A DataFrame has an index (row labels) and columns (column labels), but we are only using the column labels (retrieved from the header row in the file "input.csv"). The index has been automatically generated by the "read_csv" function (think of it like a row id), so to avoid having the index written to the new CSV file, we specify "index=False" on the "to_csv" function.

You may wonder why we have to deal with floating point numbers since there are only numeric values of type integer in our CSV example. This is due to the way Pandas works when it encounters missing values in an integer column - it converts the column data type from integer to float because the NaN (Not-a-Number) value in Pandas is a float. Item D in row 5 in our example has no stock value (the item has not been released yet), so the STOCK column will be converted to float. When we write the DataFrame to the CSV file, we specify '%.f' as the format which will print the integer values without any decimals and the NaN values as null (empty).
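
If you would rather avoid the float conversion altogether, a minimal sketch, assuming a reasonably recent Pandas version (0.24 or later) is installed:

# The nullable 'Int64' dtype keeps STOCK as integers even though the
# column contains a missing value, so float_format is not needed.
data = pd.read_csv('input.csv', delimiter=';', dtype={'STOCK': 'Int64'})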

The new CSV file created by the Python program looks like this:

ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE

Item C has been filtered out and is not in the output CSV file. Its price was 1000, so it did not satisfy the query condition and was filtered out - which was the purpose of this exercise.

You can use the Pandas functions interactively by starting Python 3 and entering the lines one by one. If, for example, you want to see the data that Pandas has imported after the "read_csv" function, you simply type "data" and press Enter, and Python will show the DataFrame:

>>> data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
>>> data
  ITEM DESCRIPTION  STOCK  PRICE          STATUS
0    A      Item A   10.0    100          ACTIVE
1    B      Item B   20.0    120          ACTIVE
2    C      Item C   30.0   1000          ACTIVE
3    D      Item D    NaN    100  TO BE RELEASED
4    E      Item E   45.0    105          ACTIVE

I hope this article has given you an idea of what a brilliant and powerful tool Pandas is for filtering and analyzing large amounts of data, and I encourage you to read the Pandas API reference and play with the functions in Pandas. The library can do so much - with great performance - and we have not even scratched the surface in this post. The documentation is great and includes a quick introduction to Pandas as well as tutorials. If you ever work with data sets and data analysis, this library can help you work with your data. And remember: it also works with data in Excel and databases - and it can be automated very easily.
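
As a taste of that, here is a hedged sketch with made-up file and column names (reading Excel files also requires an engine package such as openpyxl):

# Illustrative names only - filter a spreadsheet the same way as a CSV.
df = pd.read_excel('sales.xlsx')
df.query('AMOUNT > 500').to_excel('sales_filtered.xlsx', index=False)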

And best of all - it is available on IBM i!

Thank you for reading.

Friday, May 03, 2019

Creating Excel spreadsheet with IBM i library report using Python

A few days ago Mike Larsen published a great article titled "Creating Excel Spreadsheets With Python", in which he described how easy it is to populate Excel spreadsheets with data from IBM i. Mike demonstrated several features like formatting, tab colors and conditional formatting in his example and showed how few Python lines of code are needed in order to get a good-looking result.

His article reminded me of another Python script that I wrote last year using the very same technique. My Python script is called "lib_report.py" and creates an Excel spreadsheet containing extracted information about objects in a library, including

  • Object count
  • Object size total
  • Object size total by age
  • Object name, type, description, size and last used date

One of the most useful features of the generated spreadsheet is the filter on the detail information, which lets the user filter and sort the data. This Python script demonstrates some very useful Excel features not shown in Mike's article, so I am publishing my script to complement his.

The source for the Python script can be found here and is a mere 100 lines - including comments! The Python modules used are again 'xlsxwriter' and 'ibm_db_dbi' in addition to the 'sys' module. I will go through the script and explain what the code does and how the Excel features are implemented:

#!/usr/bin/env python3

This line is a so-called "shebang", telling the shell how to run the script. I won't go into details (and you can read more about shebangs here), but it is a nice way to make text files into executables.

import xlsxwriter
import ibm_db_dbi as dbi
import sys

Here we import the modules used in the Python script.

lib = sys.argv[1].upper()

A Python variable called "lib" is set to the value (in uppercase) of the first argument to the script. You specify what library to examine by calling the script and adding the name of the library as a parameter, e.g. "lib_report.py qgpl".
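
The script assumes the argument is present; a small defensive check (my addition, not in the original script) could make the error friendlier:

# Exit with a usage message if no library name was given.
if len(sys.argv) < 2:
    sys.exit('Usage: lib_report.py <library>')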

output_filename = 'lib_report_' + lib + '.xlsx'

Here we create the filename of the Excel spreadsheet. The filename will be of the form "lib_report_<library name>.xlsx".

conn = dbi.connect(dsn=None, database='*LOCAL', user=None, password=None)
cur = conn.cursor()

These lines connect to the database and create a cursor variable.

sql_allobj = """\
select objname, objtype, objtext, objsize,
       coalesce( date( last_used_timestamp ), date( objcreated ) ) as objdate
  from table( qsys2.object_statistics( ?, '*ALL' ) ) os
 order by objsize desc
"""

Here we define the SQL statement that will extract the object information from the library. Note that we use the SQL service OBJECT_STATISTICS to do the job, and that the statement will receive the library name as a parameter - the "?" in the statement text.

print('Getting object statistics for ' + lib + '.')

sql_parms = [ lib ]
cur.execute(sql_allobj, sql_parms)
headers = [descr[0] for descr in cur.description]

We print status information on the console, set the parameter for the SQL statement and execute it. The column headers from the result set are extracted so we can write them to the Excel sheet later.
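
Each entry in "cur.description" is a tuple whose first element is the column name, so for the statement above "headers" should come out roughly like this (exact names depend on the driver):

print(headers)
# ['OBJNAME', 'OBJTYPE', 'OBJTEXT', 'OBJSIZE', 'OBJDATE']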

print('Creating Excel file with object statistics.')

workbook = xlsxwriter.Workbook(output_filename)
worksheet_summary = workbook.add_worksheet('Summary')
worksheet_details = workbook.add_worksheet('Detail')

Again we print status to the console before creating the Excel file. Then we add two sheets to the spreadsheet - one for the summary and another for the detailed information.

# Add a bold format for the headers.
bold = workbook.add_format({'bold': True})
bold_right = workbook.add_format({'bold': True, 'align': 'right'})

Some formats are created to be used with the headers, such as bold and right-justified. You can't add formatting directly to one or more cells; you have to create a format object first and then apply that format object to the cell(s). This can be very cumbersome, because you can only apply one format object to a cell and thus have to create a format object for each distinct combination of formats used in the Excel spreadsheet.
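
For example, a cell that should be both bold and show a formatted number needs its own combined format object - a small sketch (the script itself does exactly this below with "size_bold_format"):

# One format object per distinct combination of properties.
bold_number = workbook.add_format({'bold': True, 'num_format': '#,##0'})
worksheet_details.write(0, 1, 1234567, bold_number)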

# Make the columns wider.
worksheet_details.set_column('A:A', 14)
worksheet_details.set_column('B:B', 10)
worksheet_details.set_column('C:C', 48)
# Make the header rows larger and bold.
worksheet_details.set_row(0, 20, bold)
worksheet_details.set_row(1, 20, bold)
worksheet_details.set_row(2, 20, bold)
# Write the headers.
worksheet_details.write_row('A3', headers)

We set the column width and row height of the header rows on the detail tab, and then we write the headers which we extracted from the result set earlier.

size_format = workbook.add_format({'num_format': 0x03})
size_bold_format = workbook.add_format({'num_format': 0x03, 'bold': True})
objdate_format = workbook.add_format({'num_format': 'yyyy-mm-dd'})

# Display size as formatted integer.
worksheet_details.set_column('D:D', 16, size_format)
# Display used as formatted date.
worksheet_details.set_column('E:E', 14, objdate_format)

We create some formats for the object size and date and apply them to the columns D and E.

row = 2
for row_data in cur:
    # Move on to the next worksheet row.
    row += 1
    col = 0
    for col_data in row_data:
        worksheet_details.write(row, col, col_data)
        # Move on to the next worksheet column.
        col += 1

object_count = row - 2

Now we populate the rows in the detail sheet. First we initialize the row counter to 2 (for row number 3 - xlsxwriter rows and columns are indexed starting from zero). Then we loop through the rows in the result set, increment the row counter, initialize the column counter and loop through the columns in the row, writing each one to a cell and incrementing the column counter. After the loops are completed we calculate the count of objects.

workbook.define_name('Detail!Objsizes', '=Detail!$D$4:$D$' + str(object_count + 3))
workbook.define_name('Detail!Objdates', '=Detail!$E$4:$E$' + str(object_count + 3))

Here we define two cell ranges in the Excel sheet. One cell range is called 'Detail!Objsizes' and covers all the values that we have written to column D. The same goes for the second cell range 'Detail!Objdates', which covers the values in column E.

worksheet_details.write(0, 0, 'Object count', bold)
worksheet_details.write(0, 1, '=SUBTOTAL(102, Detail!Objsizes)', size_bold_format)
worksheet_details.write(0, 2, 'Object size total', bold_right)
worksheet_details.write(0, 3, '=SUBTOTAL(109, Detail!Objsizes)', size_bold_format)

Now we write some text and formulas to the first row in the detail sheet. The formulas use the Excel function SUBTOTAL, which can calculate statistical values like count and sum according to the first parameter value sent to the function. Value "102" is a count and "109" is a sum. Both values make the SUBTOTAL function exclude any hidden cells in the result - this allows the user to hide one or more rows in the sheet and calculate the result without these rows. A very dynamic function indeed!

# Set autofilter on.
worksheet_details.autofilter('A3:E' + str(object_count + 3))

As the comment says, we turn filtering on. This instruction enables autofiltering in the detail sheet on the values in the cell range from column A to E and from row 3 to the count of objects plus 3, i.e. all the object detail information. With filtering, the user can easily select or omit certain values in the data, e.g. one or more years in the object date values, or sort the rows by one of the columns in ascending or descending order. Again, a very useful feature if the user wants to examine the object statistics further.

worksheet_summary.set_column('A:A', 16, bold)
worksheet_summary.set_column('B:E', 18, size_format)
worksheet_summary.write_row(0, 0, ('Library name', lib))
worksheet_summary.write_row(1, 0, ('Object count', '=COUNT(Detail!Objsizes)'))
worksheet_summary.write_row(2, 0, ('Object size total', '=SUM(Detail!Objsizes)'))
worksheet_summary.write(2, 2, '=CONCATENATE(TEXT(SUM(Detail!Objsizes)/1024/1024/1024, "#.##0"), " GB")', bold_right)

Now we populate the summary sheet. First we set the column sizes and formats, and then we write some rows with the library name, object count, object size total and object size total in gigabytes. The object count and size total are formulas that again reference the detail cell ranges we defined above.

worksheet_summary.write_row(3, 0, ('Object age', '0-3 months', '3-6 months', '6-12 months', '>12 months'), bold_right)
worksheet_summary.write(4, 1, '=SUMIFS(Detail!Objsizes, Detail!Objdates, ">=" & EDATE(TODAY(), -3))', size_format)
worksheet_summary.write(4, 2, '=SUMIFS(Detail!Objsizes, Detail!Objdates, "<" & EDATE(TODAY(), -3), Detail!Objdates, ">=" & EDATE(TODAY(), -6))', size_format)
worksheet_summary.write(4, 3, '=SUMIFS(Detail!Objsizes, Detail!Objdates, "<" & EDATE(TODAY(), -6), Detail!Objdates, ">=" & EDATE(TODAY(), -12))', size_format)
worksheet_summary.write(4, 4, '=SUMIFS(Detail!Objsizes, Detail!Objdates, "<" & EDATE(TODAY(), -12))', size_format)

Next in the summary sheet we write rows that show the sum of the object sizes in relation to the object age. This involves the Excel function SUMIFS, which can sum conditionally. We write four formulas: the size of objects created or used within the last 3 months, from 3 to 6 months, from 6 to 12 months and finally older than 12 months.

workbook.close()

Finally we close the Excel workbook. Until now the workbook has existed only in memory, and closing it is what writes it to the file.
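
As a side note, xlsxwriter workbooks can also be used as context managers, which close (and thus write) the file automatically when the block ends - a minimal sketch:

# The 'with' statement closes the workbook when the block ends.
with xlsxwriter.Workbook('example.xlsx') as wb:
    wb.add_worksheet('Sheet1').write(0, 0, 'Hello')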

...and we're done! We now have an Excel spreadsheet file containing summary and detailed information about the objects in a library, and the Excel user can filter and sort the data for further analysis.

The generated Excel files could use some polishing and brushing-up, but I will leave that up to you. Feel free to use this Python script as you like, maybe as a basis for further exploration of how to generate advanced Excel spreadsheets on IBM i. The one thing to remember is that you have all the components necessary to do this without any cost - you don't need to buy any additional software. Just install the Python RPM if you have not already done so.
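
If you haven't, a single yum command should do it (package names as they appear in the IBM i open source repository at the time of writing - verify with yum search if in doubt):

yum install python3 python3-xlsxwriter python3-ibm_db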

IBM has done a great job adding open source technology to the platform, and Python is one of the best tools in many situations. This is one example of such a situation - in 100 lines of Python we have generated an advanced Excel spreadsheet file. Later I will show you another example where Python comes to the rescue - removing rows from a comma-separated file based on the value of a column. Coming soon...