Monday, May 20, 2019

Filtering data in CSV file using Python

As promised in my previous post, Creating Excel spreadsheet with IBM i library report using Python, in this article I will show you how to use Python to remove rows from a comma-separated values (CSV) file based on the value of a column. This is a situation I found myself in some time ago, where I had to reduce a CSV file containing millions of rows by filtering on a certain value in one of the columns. I tried several tools before ending up with a solution that was surprisingly simple.

There are several ways to do the job of filtering a comma-separated file on IBM i besides using Python - for example you can

  • import the file to a table using CPYFRMIMPF, remove unwanted rows using SQL and export the table using CPYTOIMPF
  • use the (g)awk tool in a PASE shell
  • use the DATA-INTO operation in RPG together with a CSV parser
  • move the file to a PC, filter using a PC tool and move the file back to IBM i
  • and you can probably add to this list...

However, I chose Python for the job because of its simplicity - I only needed to code four lines of Python! Another reason was that the file to be filtered was quite big - more than 1 GB in size. Not many PC tools can handle text files that large, and even fewer can understand and filter the data in a CSV file.

The simplicity of the solution was not only due to Python but even more to the Python module called Pandas. Pandas is a high-performance library for data analysis that is very easy to use. It supports several file formats including JSON, Excel, database (SQL) - and flat file, which includes CSV files. The Pandas package has been ported to IBM i by IBM and can easily be installed using either the Open Source Package Management in Access Client Solutions or the YUM package manager in a shell. Here is the yum command to install Python3 and Pandas in a single command:

yum install python3 python3-pandas

Pandas depends on date and timezone packages, so you must also install these for Pandas to work:

yum install python3-pip python3-dateutil
pip3 install pytz

A central item in the Pandas package is the DataFrame object. This is a two-dimensional data-structure and you can compare it to a SQL table. This fits perfectly with us having a CSV file, which is just some rows with columns separated by a delimiter. The first row in our CSV file contains the name of the columns, and these names will be used as column labels in the DataFrame.
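To make the comparison with a SQL table concrete, here is a minimal sketch (with made-up data, not the CSV file from this post) showing that a DataFrame is just rows and labeled columns:

```python
import pandas as pd

# Build a small DataFrame in memory; the dictionary keys become
# the column labels, much like column names in a SQL table.
df = pd.DataFrame({
    "ITEM": ["A", "B"],
    "PRICE": [100, 120],
})

print(df.columns.tolist())  # the column labels: ['ITEM', 'PRICE']
print(df.shape)             # (rows, columns): (2, 2)
```

When reading a CSV file, read_csv fills in these column labels from the header row for us.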

In this post I will use a very simple CSV file as input. It looks like this:

ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
C;Item C;30;1000;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE

Now, the full Python program to filter this CSV file looks like this:

import pandas as pd
data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
new_data = data.query( 'PRICE < 1000' )
new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )

I will go through the lines one by one and explain what is going on.

import pandas as pd

To make the Pandas library accessible to the Python program, we import the package. The alias "pd" is a shorthand we use when referencing the library later in the program.

data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )

Here we read the CSV file into a Pandas DataFrame which we call "data". The parameters for the function "read_csv" are the following:
  1. the name of the CSV file to read
  2. the number of rows to read (None means all rows)
  3. the delimiter (in this case the semicolon ";") and
  4. the memory allowed to be used (low_memory=False means Pandas should not process the file in chunks).
There are many more parameters available in the read_csv function, and you can read about them in the API reference.
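One of those extra parameters is worth mentioning for very large files like the 1 GB one that prompted this post: chunksize makes read_csv return an iterator of smaller DataFrames, so the whole file never has to fit in memory at once. A sketch, using an in-memory string as a stand-in for a big file on disk:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "ITEM;PRICE\nA;100\nB;120\nC;1000\nD;100\nE;105\n"

# With chunksize set, read_csv yields DataFrames of at most that
# many rows, so each chunk can be filtered and discarded in turn.
filtered_chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), delimiter=";", chunksize=2):
    filtered_chunks.append(chunk.query("PRICE < 1000"))

# Stitch the surviving rows back together into one DataFrame.
result = pd.concat(filtered_chunks, ignore_index=True)
print(len(result))  # 4 - the row with PRICE 1000 was dropped
```

For a file that does fit in memory, the single read_csv call in the program above is simpler and what I would use.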

When the line has executed, the DataFrame object "data" has all content of the CSV file. We can now query and manipulate the DataFrame as we like. In this example we will only query the DataFrame:

new_data = data.query( 'PRICE < 1000' )

We call the "query" method for the "data" DataFrame object and use it to filter the data. In this example we include only rows where the value in PRICE is less than 1000. The output of the operation is a new DataFrame object called "new_data".
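The query method takes its condition as an expression string; Pandas also supports the equivalent boolean-mask style of filtering, and the two produce the same result. A small sketch with made-up data:

```python
import pandas as pd

data = pd.DataFrame({
    "ITEM": ["A", "B", "C"],
    "PRICE": [100, 120, 1000],
})

# query() with an expression string...
by_query = data.query("PRICE < 1000")

# ...is equivalent to boolean-mask indexing:
by_mask = data[data["PRICE"] < 1000]

print(by_query.equals(by_mask))  # True
```

query is often the more readable choice when the condition combines several columns.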

new_data.to_csv( 'output.csv', sep=';', index=False, float_format='%.f' )

Finally we write the DataFrame with the filtered data to a new CSV file. We do this by calling the method "to_csv" for the "new_data" object, specifying several parameters:
  1. the name of the new CSV file
  2. what separator (delimiter) to use
  3. not to include the index of the DataFrame and
  4. how to format floating point values.
The first two parameters are obvious, but the third and fourth parameters are not.

A DataFrame has an index (row labels) and columns (column labels), but we are only using the column labels (retrieved from the header row of the file "input.csv"). The index has been automatically generated by the "read_csv" function (think of it as a row id), so to avoid having the index written to the new CSV file, we specify "index=False" on the "to_csv" function.

You may wonder why we have to deal with floating point numbers since there are only numeric values of type integer in our CSV example. This is due to the way Pandas works when it encounters missing values in an integer column - it converts the column data type from integer to float because the NaN (Not-a-Number) value in Pandas is a float. Item D in row 5 in our example has no stock value (the item has not been released yet), so the STOCK column will be converted to float. When we write the DataFrame to the CSV file, we specify '%.f' as the format which will print the integer values without any decimals and the NaN values as null (empty).
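The float conversion is easy to demonstrate in isolation. The sketch below also shows an alternative available in newer Pandas versions (0.24 and later): the nullable integer dtype "Int64", which keeps whole numbers and represents the gap as a missing value instead of NaN:

```python
import pandas as pd

# A column with a missing value is upcast to float,
# because NaN (Not-a-Number) is itself a float.
s = pd.Series([10, 20, None, 45])
print(s.dtype)  # float64

# Pandas 0.24+ offers a nullable integer dtype as an alternative,
# so the values stay integers and the gap becomes <NA>:
s_int = s.astype("Int64")
print(s_int.dtype)  # Int64
```

In this post I stay with the default float behavior and simply format the output with float_format='%.f'.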

The new CSV file created by the Python program looks like this:

ITEM;DESCRIPTION;STOCK;PRICE;STATUS
A;Item A;10;100;ACTIVE
B;Item B;20;120;ACTIVE
D;Item D;;100;TO BE RELEASED
E;Item E;45;105;ACTIVE

Item C has been filtered out and is not in the output CSV file: its price was 1000, which does not satisfy the condition PRICE < 1000 - which was the whole point of this exercise.

You can use the Pandas functions interactively by starting Python3 and entering the lines one by one (starting with "import pandas as pd"). If, for example, you want to see the data that Pandas has imported after the "read_csv" function, you simply type "data" and press Enter, and Python will show the DataFrame:

>>> data = pd.read_csv( 'input.csv', nrows=None, delimiter=';', low_memory=False )
>>> data
  ITEM DESCRIPTION  STOCK  PRICE          STATUS
0    A      Item A   10.0    100          ACTIVE
1    B      Item B   20.0    120          ACTIVE
2    C      Item C   30.0   1000          ACTIVE
3    D      Item D    NaN    100  TO BE RELEASED
4    E      Item E   45.0    105          ACTIVE

I hope this article has given you an idea of what a brilliant and powerful tool Pandas is for filtering and analyzing large amounts of data, and I encourage you to read the Pandas API reference and play with its functions. The library can do so much - with great performance - and we have not even scratched the surface in this post. The documentation is great and includes a quick introduction to Pandas as well as tutorials. If you ever work with data sets and data analysis, this library can help you work with your data. And remember: it also works with data in Excel and databases - and it can be automated very easily.

And best of all - it is available on IBM i!

Thank you for reading.
