There are several ways to do the job of filtering a comma-separated file on IBM i besides using Python - for example you can
- import the file to a table using CPYFRMIMPF, remove unwanted rows using SQL and export the table using CPYTOIMPF
- use the (g)awk tool in a PASE shell
- use the DATA-INTO operation in RPG together with a CSV parser
- move the file to a PC, filter using a PC tool and move the file back to IBM i
- and you can probably add to this list...
However, I chose Python for its simplicity - the whole job took only four lines of Python code! Another reason was that the file to be filtered was quite big - more than 1 GB in size. Not many PC tools can handle text files that large, and even fewer can understand and filter the data in a CSV file.
The simplicity of the solution was due not only to Python but even more to the Python module called Pandas. Pandas is a high-performance library for data analysis that is very easy to use. It supports several file formats including JSON, Excel, database (SQL) - and flat files, which include CSV files. The Pandas package has been ported to IBM i by IBM and can easily be installed using either the Open Source Package Management in Access Client Solutions or the YUM package manager in a shell. Here is the yum command to install Python3 and Pandas in one single command:
yum install python3 python3-pandas
Pandas depends on date and time zone packages, so you must also install these for Pandas to work:
yum install python3-pip python3-dateutil
pip3 install pytz
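After installing, you can verify that Pandas is available from a Python3 session with a quick sanity check (this is just a one-off verification, not part of the filtering program):

```python
# Quick check that the Pandas installation is usable.
import pandas as pd

print(pd.__version__)  # prints the installed Pandas version
```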
A central item in the Pandas package is the DataFrame object. This is a two-dimensional data-structure and you can compare it to a SQL table. This fits perfectly with us having a CSV file, which is just some rows with columns separated by a delimiter. The first row in our CSV file contains the name of the columns, and these names will be used as column labels in the DataFrame.
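To see this table-like structure, here is a minimal sketch that builds a small DataFrame by hand (the column names are illustrative, not taken from the article's file):

```python
import pandas as pd

# A DataFrame behaves like a small SQL table: named columns, rows of data.
df = pd.DataFrame({
    'ITEM':  ['A', 'B'],
    'PRICE': [100, 120],
})

print(df.columns.tolist())  # column labels: ['ITEM', 'PRICE']
print(df.shape)             # (rows, columns): (2, 2)
```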
In this post I will use a very simple CSV file as input. It looks like this:
ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
C;Item C;30;1000;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE
Now, the full Python program to filter this CSV file looks like this:
import pandas as pd
data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
new_data = data.query( 'PRICE < 1000' )
new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )
I will go through the lines one by one and explain what is going on.
import pandas as pd
Importing the package makes the Pandas library accessible to the Python program. The "pd" is an alias for the library, used when we reference it later in the program.
data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
Here we read the CSV file into a Pandas DataFrame which we call "data". The parameters for the function "read_csv" are the following:
- the name of the CSV file to read
- the number of rows to read (None means all rows)
- the delimiter (in this case the semicolon ";") and
- how much memory may be used (low_memory=False tells Pandas to read the whole file at once instead of processing it internally in chunks, which gives more reliable column type detection).
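As a small experiment, these parameters can be tried on an in-memory sample, where io.StringIO stands in for the real file (the data below is illustrative, not the article's file):

```python
import io
import pandas as pd

# A tiny in-memory stand-in for 'input.csv'.
csv_text = "ITEM;PRICE\nA;100\nB;120\nC;1000\n"

# nrows=2 reads only the first two data rows; delimiter=';' matches the format.
data = pd.read_csv(io.StringIO(csv_text), nrows=2, delimiter=';', low_memory=False)
print(data.shape)  # (2, 2) - only two rows were read
```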
When the line has executed, the DataFrame object "data" has all content of the CSV file. We can now query and manipulate the DataFrame as we like. In this example we will only query the DataFrame:
new_data = data.query( 'PRICE < 1000' )
We call the "query" method for the "data" DataFrame object and use it to filter the data. In this example we include only rows where the value in PRICE is less than 1000. The output of the operation is a new DataFrame object called "new_data".
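For comparison, the same filter can be written with boolean-mask indexing instead of a query string; both produce the same result (sample data below is illustrative):

```python
import pandas as pd

data = pd.DataFrame({'ITEM': ['A', 'C'], 'PRICE': [100, 1000]})

# query() takes the condition as a string expression...
by_query = data.query('PRICE < 1000')

# ...which is equivalent to boolean-mask indexing:
by_mask = data[data['PRICE'] < 1000]

print(by_query.equals(by_mask))  # True
```

The query() form is often more readable when combining several conditions, while the mask form avoids embedding column names in strings.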
new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )
Finally we write the DataFrame with the filtered data to a new CSV file. We do this by calling the method "to_csv" for the "new_data" object, specifying several parameters:
- the name of the new CSV file
- what separator (delimiter) to use
- not to include the index of the DataFrame and
- how to format floating point values.
A DataFrame has an index (row labels) and columns (column labels), but we are only using the column labels (retrieved from the header row of "input.csv"). The index has been generated automatically by the "read_csv" function (think of it as a row id), so to avoid having the index written to the new CSV file, we specify "index=False" on the "to_csv" function.
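The effect of "index=False" is easy to see, since to_csv returns the CSV text as a string when no file name is given (sample data below is illustrative):

```python
import pandas as pd

data = pd.DataFrame({'ITEM': ['A', 'B'], 'PRICE': [100, 120]})

with_index    = data.to_csv(sep=';')               # index written as first column
without_index = data.to_csv(sep=';', index=False)  # index omitted

print(with_index.splitlines()[1])     # '0;A;100' - leading auto-generated row id
print(without_index.splitlines()[1])  # 'A;100'
```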
You may wonder why we have to deal with floating point numbers when our CSV example contains only integer values. This is due to the way Pandas handles missing values in an integer column - it converts the column data type from integer to float, because the NaN (Not-a-Number) value in Pandas is a float. Item D in row 5 of our example has no stock value (the item has not been released yet), so the STOCK column is converted to float. When we write the DataFrame to the CSV file, we specify '%.f' as the format, which prints the integer values without any decimals and the NaN values as empty fields.
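You can observe this conversion directly. As a side note, newer Pandas versions also offer a nullable integer dtype ("Int64") that keeps integers intact in the presence of missing values, though the article's approach with float_format works fine here:

```python
import pandas as pd

stock = pd.Series([10, 20, None])   # one missing value, like Item D's STOCK
print(stock.dtype)                  # float64 - the NaN forces the column to float

# The nullable integer dtype avoids the conversion:
stock_int = pd.Series([10, 20, None], dtype='Int64')
print(stock_int.dtype)              # Int64
```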
The new CSV file created by the Python program looks like this:
ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE
Item C has been filtered out and is not in the output CSV file. Its price was 1000, which does not satisfy the query condition PRICE < 1000 - which was the purpose of this exercise.
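Since the original motivation was a file of more than 1 GB, it is worth mentioning that read_csv also accepts a "chunksize" parameter, which returns the file as an iterator of smaller DataFrames so the whole file never has to fit in memory at once. Here is a hedged sketch of the same filter in chunked form (io.StringIO stands in for the large file, and the data is illustrative):

```python
import io
import pandas as pd

# Stand-in for a very large 'input.csv'.
csv_text = "ITEM;PRICE\nA;100\nC;1000\nE;105\n"

first = True
# chunksize=2 yields DataFrames of at most 2 rows each; a real file
# would use a much larger chunk size (e.g. 100000).
for chunk in pd.read_csv(io.StringIO(csv_text), delimiter=';', chunksize=2):
    filtered = chunk.query('PRICE < 1000')
    # Write the header and create the file for the first chunk,
    # then append the remaining chunks.
    filtered.to_csv('output.csv', sep=';', index=False, float_format='%.f',
                    mode='w' if first else 'a', header=first)
    first = False
```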
You can use the Pandas functions interactively by starting Python3 and entering the lines one by one. If, for example, you want to see the data that Pandas has imported after the "read_csv" function, simply type "data" and press Enter, and Python will show the DataFrame:
>>> import pandas as pd
>>> data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
>>> data
ITEM DESCRIPTION STOCK PRICE STATUS
0 A Item A 10.0 100 ACTIVE
1 B Item B 20.0 120 ACTIVE
2 C Item C 30.0 1000 ACTIVE
3 D Item D NaN 100 TO BE RELEASED
4 E Item E 45.0 105 ACTIVE
And best of all - it is available on IBM i!
Thank you for reading.