Rahul Anand's Blog

If my mind can conceive it, and my heart can believe it, I know I can achieve it.


Tuesday, April 2, 2013 #

Coming from a C# background, I assumed a certain behavior of the split() method on strings: no matter what, the returned array will have delimiter count + 1 elements. That assumption cost me a few hours of debugging to fix a simple issue!

Java's split() behaves differently and provides several options, configurable through a second parameter.

The single-parameter overload of split() behaves as if the second parameter were 0, which means trailing empty strings are discarded.

[CODE]

        String line = "A|B|C||||";
        String [] fields = line.split("\\|");
        System.out.println(fields.length);

[/CODE]

The output is 3.

This is the same as the following code:

[CODE]

String line = "A|B|C||||";
String [] fields = line.split("\\|", 0);
System.out.println(fields.length);

[/CODE]

In case you are wondering why "\\|" instead of "|": the '|' character must be escaped because it has a special meaning in a regex, and the first parameter to split() is expected to be a regex.

The documentation of split() says it all:

 

          public String[] split(String regex, int limit)

Splits this string around matches of the given regular expression.

The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
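For example, applying a positive limit to the same input (per the documentation, the pattern is applied at most limit - 1 times and the last element keeps the rest of the string):

[CODE]

String line = "A|B|C||||";
String [] fields = line.split("\\|", 2);  // limit 2: split at most once
System.out.println(fields.length);        // prints 2
System.out.println(fields[1]);            // prints B|C||||

[/CODE]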

In my case I passed -1 (negative) so that all elements are returned, even trailing empty strings.

[CODE]

String line = "A|B|C||||";
String [] fields = line.split("\\|", -1);
System.out.println(fields.length);

[/CODE]

The output is 7. But soon I realized the default behavior exists for a reason: discarding trailing empty strings keeps the memory footprint low, which helps when dealing with huge volumes of data. Realizing this, I modified my code to access the returned array with a length check instead of assuming a constant array size, and it improved performance by up to 1.5x.
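As a minimal sketch of that change (the field handling here is just illustrative), access the array defensively:

[CODE]

String line = "A|B|C||||";
String [] fields = line.split("\\|");  // default limit 0: only 3 elements, trailing empty strings dropped
// Check the length instead of assuming delimiter count + 1 elements.
String fourthField = fields.length > 3 ? fields[3] : "";
System.out.println("Fourth field: '" + fourthField + "'");  // prints an empty value here

[/CODE]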

 

Reference:

http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split(java.lang.String, int)


Monday, March 4, 2013 #

JavaBeans are reusable software components written in Java. A JavaBean is not a technology but a standard, a convention of class design. Following it lets an object's state be bundled into a single serializable object (the bean), which can be passed around easily.

A JavaBean is a Java object that (1) is serializable, (2) has a no-argument constructor, and (3) exposes its properties through getter and setter methods. These three requirements are mandatory.
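As a minimal illustrative sketch (the Person class and its properties are made up, not from any library), a class following the JavaBean convention looks like this:

[CODE]

import java.io.Serializable;

public class Person implements Serializable {

    private String name;   // properties are private, accessed via getters/setters
    private int age;

    public Person() { }    // mandatory no-argument constructor

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

[/CODE]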


Monday, February 18, 2013 #

Linux/Unix offers many text processing tools which are very powerful and can be used together to process data and extract information from files.

Some of these are listed below with few commonly used examples:

  1. head -- display first lines of a file
  2. tail -- display the last part of a file
  3. cat -- concatenate and print files
  4. less -- paged output, with support to move forward and backward
  5. cut -- cut out selected portions of each line of a file
  6. sort -- sort lines of text files
  7. uniq -- report or filter out repeated lines in a file
  8. paste -- merge corresponding or subsequent lines of files
  9. join -- relational database operator
  10. diff -- compare files line by line
  11. awk -- pattern-directed scanning and processing language
  12. grep -- print lines matching a pattern
  13. find -- search for files in a directory hierarchy
  14. wc -- word, line, character, and byte count
  15. tr -- translate characters
  16. sed -- stream editor

Examples:

1. Display the top 20 lines of a big text-based file

head -n 20 bigfile.txt

2. Display 20 lines after skipping the first line (you may want to skip the first row as it contains headers) from a big file

tail -n +2 bigfile.txt | head -n 20

3. Concatenate two files.

cat file1 file2 > combinedfile

Use zcat to read a compressed file.

4. Display a file with options to move backward/forward.

less bigfile.txt

Type 'ctrl+f' to move forward and 'ctrl+b' to move backward. Type '/searchstring' to search forward from the current position and '?searchstring' to search backward. Type 'n' to repeat the last search (capital 'N' repeats it in the opposite direction).

5. Cut out the first column (read a compressed file, skip the header row, take the next 10 rows and project the first field)

zcat monthlyfile.zip | tail -n +2 | head | cut -f1

6. Sort the extracted first column

zcat monthlyfile.zip | tail -n +2 | head | cut -f1 | sort

7. Display duplicates with a count

zcat monthlyfile.zip | tail -n +2 | head | cut -f1 | sort | uniq -c

8. Create a colon-separated list of directories named bin, suitable for use in the PATH environment variable:

find / -name bin -type d | paste -s -d : -

9. Find the extra rows in the first file which are not present in the second file (note that join expects both inputs to be sorted on the join field)

join -v 1 <(zcat monthly_feb.zip | tail -n +2) <(zcat monthly_jan.zip | tail -n +2)

10. Find the differences between two files

diff -y -W 80 firstfile.txt secondfile.txt

11. Print the first two fields from a file (with tab-separated fields) along with the row number, and at the end print the total row count.

awk -F '\t' '{print "Row:", NR, $1, $2} END {print "Total Record Count:", NR}' bigfile.txt

12. Output the lines which contain the word 'error' or 'warning' from the log file.

grep -iwE "error|warning" 20130218.log

13. List files with the extension .gz that are greater than 1GB

find / -type f -name "*.gz" -size +1G -exec ls -l {} \;

14. Count number of processes currently running on a server

ps -ef | wc -l

15. Translate the contents of a file to upper case

tr "[:lower:]" "[:upper:]" < file1.txt

16. Get the user names from the /etc/passwd file

sed 's/\([^:]*\).*/\1/' /etc/passwd

 

Advanced usage:

Disk usage per directory
du -sk * | sort -n | while read size fname; do for unit in k M G T P E Z Y; do if [ $size -lt 1024 ]; then echo -e "${size}${unit}\t${fname}"; break; fi; size=$((size/1024)); done; done

Directory listing with number of files
find . -type f -exec dirname {} \; | awk -F '/' '{arr[$2]++} END { for (i in arr) { print i, arr[i] } }' | sort

Compare files and print lines of the first file which are not present in the second
awk 'NR==FNR{a[$0]++;next} !($0 in a){print $0}' file2 file1

 

References:

http://www.thegeekstuff.com/

http://www.freebsd.org/

http://tldp.org/LDP/abs/html/textproc.html


Wednesday, February 13, 2013 #

I was looking for a feature in Ingres Vectorwise similar to 'Linked Server' in MS SQL Server and 'Database Link' in Oracle. But unfortunately it seems Ingres Vectorwise 2.5 does not offer anything similar.

 

Ingres has 'Register Table as Link' (Ingres Star), which is currently not supported in Vectorwise. So I think the only option available as of today is to copy the tables into one database on one server and then do the cross-database joins there.

 

Another way would be to create linked servers in MS SQL Server pointing to different instances/databases of Ingres Vectorwise, but this would be a very inefficient solution. It may be OK for some manual data analysis.


Tuesday, February 12, 2013 #

There are different tools available to connect to the Ingres Vectorwise database server.

Actian (the parent company) offers the following tools:

1. DBA Tools (current version is 10.1.0)

2. Director Client (current version is 1.0.0)

These can be downloaded from http://esd.actian.com/platform/Windows_32-Bit/Administration_Tools/

With the server installation you will also get the text-based tools: sql and isql.

Alternatively you can also use the open source Squirrel SQL Client (http://squirrel-sql.sourceforge.net/), which works over a JDBC connection. All of these tools let you execute queries and get the output. DBA Tools (a set of tools) offers more features related to management and administration of Ingres Vectorwise. Director is a new offering from Actian and provides a nice GUI-based query editor.


Monday, February 11, 2013 #

I have been programming in C# and .NET for a decade now. I love the concepts of LINQ and lambda expressions, which introduce a layer of separation between the request to process a task and its actual execution, which can then be carried out differently on different systems. This is especially beneficial in today's world of multi-core machines, as it puts the power in the hands of the execution engine to parallelize queries according to the available cores.

The Java community has realized these benefits, and there is already a plan to include 'Closures' and 'Default Methods' in Java SE 8. In the Java world the 'Lambda Expression' concept is referred to as a 'Closure', and 'Extension Methods' are referred to as 'Default Methods', 'Virtual Extension Methods' or 'Defender Methods'. The concepts are similar but may differ at a granular level; a short illustrative sketch follows the links below. Please check the links for more details:

http://openjdk.java.net/projects/lambda/

http://cr.openjdk.java.net/~briangoetz/lambda/sotc3.html
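As a minimal sketch of these two ideas in the Java 8 style (the Greeter interface and its methods are made up for illustration, not part of the JDK):

[CODE]

interface Greeter {
    String greet(String name);  // single abstract method, so a lambda/closure can implement it

    // Default method: a method body supplied by the interface itself.
    default String greetLoudly(String name) {
        return greet(name).toUpperCase();
    }
}

public class LambdaDemo {
    public static void main(String[] args) {
        Greeter greeter = name -> "Hello, " + name;  // lambda expression
        System.out.println(greeter.greet("World"));
        System.out.println(greeter.greetLoudly("World"));
    }
}

[/CODE]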


Recently I started working with Linux, and on the very first day I needed to let my remote session continue running while I disconnected my laptop and left for home. Searching around for such a session management tool, I found 'Screen', which is quite powerful and helped achieve my objective.

As defined by Adam Lazur at http://www.linuxjournal.com/article/6340:

Screen is a terminal multiplexer that allows you to manage many processes through one physical terminal. Each process gets its own virtual window, and you can bounce between virtual windows interacting with each process. The processes managed by screen continue to run when their window is not active.

Most of the details on how to use Screen can be found in the links below:

http://www.rackaid.com/resources/linux-screen-tutorial-and-how-to/

http://www.linuxjournal.com/article/6340

http://linux.die.net/man/1/screen


Friday, February 8, 2013 #

For the last few days I had been wondering why this internal variable (Internal.Job.Filename.Directory) was always empty in my development environment.

The documentation at http://wiki.pentaho.com/display/EAI/.07+Variables says these variables will always be defined. But that was not the case for me.

After a lot of time spent debugging and breaking my head, I found that these variables are set only if you do not use the repository. In my case I was using a file-based repository, which was causing the issue. Disconnecting from the repository and importing the job directly solved the problem.

Now I am wondering: are we not supposed to use these variables to define the paths of sub-jobs or transformations when using a repository?


Many articles on implementing a loop in PDI suggest putting in a 'Wait For' step and joining its output hop back to a previous step.

Something as suggested by the screenshot below:

[Screenshot: a job where a 'Wait For' step's output hop loops back to a previous step]

But I would like to warn that this implementation of looping causes recursive stack allocation in the JVM during job execution, and the system may run out of memory after a large number of iterations (depending on the available memory).

In a test run on my laptop, continuously running the above job with a 'Wait For' interval of 1 second, the JVM crashed within minutes. I also noticed a continuous increase in JVM memory usage before it crashed.

So, to conclude: it is OK to run a few iterations (for example, retrying on an error condition) with the above mechanism, but make sure you do so for a finite number of iterations (perhaps fewer than 100), not as an infinite loop.

So what is the suggested approach for infinite looping? One possible way is to use the settings of the 'Start' step: set the 'Repeat' flag and add an interval configuration. This causes the job to be completely re-initialized as a new instance on every run and does not cause any memory issue.

 

[Screenshot: 'Start' step settings with the 'Repeat' flag set and an interval configured]


Wednesday, February 6, 2013 #

Context:

Recently I got a chance to be introduced to Kettle, through a new project which requires ingesting huge, dynamic data files into a data warehouse using PDI. I have good experience with the Microsoft technology stack - .NET Framework, C#, SQL Server, SSIS. The idea of open source has always fascinated me, but I have not worked much with open source tools.

So, working with PDI is quite a different experience for me. I liked the continuous evolution of PDI and the capabilities it offers.

The first thing I fell in love with is the 'ETL Metadata Injection' component. It is really helpful in making your ETL packages completely dynamic. The rest of this post talks about the ETL Metadata Injection component, listed under the 'Flow' category in the design steps for a transformation.

Requirement:

In my case the requirement was to load 100+ data files with dynamic columns. The columns are not fixed and are defined by a metadata file. The data files and the metadata files are delivered daily through FTP as flat delimited files. Obviously I needed to normalize the data and store the column name as the attribute name and the value as the corresponding attribute value. Each line in a data file also contains an entity id for which the attributes are provided.

To visualize the data you can think of following structure for the data file:

Entity Id, Attribute 1, Attribute 2, Attribute 3, …, Attribute N

1, ABC Corp, A1, 21.31, …,Y

2, BCD Corp, A2, 20.44, …,Y

3, CDE Corp, A3, 22.78, …,N

4, EFG Corp, A4, 24.21, …,Y

5, GEH Corp, A5, 25.67, …,Y

20000, XYZ Corp, Z3000, 10.50, …, Y

N (the number of attributes) can grow as large as 40000. To manage file size, the data provider has split the file vertically into multiple files, each containing 200+ columns, and repeated the Entity Id in each file to allow joining the data back together.

Solution:

My first impression was that I would have to write a custom component to handle this dynamic scenario, but the 'ETL Metadata Injection' component came to the rescue and allowed me to load this data without writing a custom component.

The 'ETL Metadata Injection' component, as the name suggests, allows you to inject the metadata of a supported component at run time. For example, say you want to define the metadata of a 'CSV File Input' component, which might include the file name, column names and the data type of each column; you can do so by editing the 'CSV File Input' component, but then it is fixed at design time. Of course, some of the metadata can be loaded from variables, but not all: the file name can be defined using a variable, but not the column details. Since the requirement was to load dynamic files, I could not use the simple approach of splitting the data and consuming it through multiple 'CSV File Input' instances with fixed column details. If the number of columns in each file were fixed at 200 and the number of attributes fixed at 40000, I could have done so with 200 different instances of 'CSV File Input' - a dirty approach, but it could have worked. But in my case the number of attributes is growing and the number of columns in the files keeps changing. Therefore I decided to load the column details into memory and inject them into the 'CSV File Input' using the 'ETL Metadata Injection' component. A similar approach was used to normalize the data with 'Row Normalizer' before saving it through 'Table Output'.
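To illustrate what the normalization effectively does to one row of the wide file, here is a small, hypothetical Java sketch (the field names are illustrative and this is not the PDI API):

[CODE]

import java.util.ArrayList;
import java.util.List;

public class NormalizeSketch {
    public static void main(String[] args) {
        // One wide row: entity id followed by its attribute values.
        String[] header = {"EntityId", "Attribute 1", "Attribute 2", "Attribute 3"};
        String[] row    = {"1", "ABC Corp", "A1", "21.31"};

        // Normalized form: one (entityId, attributeName, attributeValue) record per attribute.
        List<String[]> normalized = new ArrayList<String[]>();
        for (int i = 1; i < row.length; i++) {
            normalized.add(new String[] { row[0], header[i], row[i] });
        }

        for (String[] record : normalized) {
            System.out.println(record[0] + " | " + record[1] + " | " + record[2]);
        }
    }
}

[/CODE]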

The rest of this post goes under the cover to understand more about the ETL Metadata Injection.

Under the cover:

Under the cover, the ETL Metadata Injection component actually reads the XML of your package and manipulates it to update the sections which define the column details and other metadata. Please note that only a few properties are exposed to be written through metadata injection, and not all components support it. As of this writing only the following components are supported for metadata injection, but the list is continuously growing:

1. CSV Input

2. Row Denormalizer

3. Excel Input

4. Split Field

5. Fixed file input

6. Row Normalizer

7. Select Values

After the metadata is injected, this component executes the modified package. A future version is expected to support writing the modified XML to disk, with an option to choose whether or not to execute it.

If you enable detailed logging in Spoon, you will be able to see the modified XML in the logs.

References:

http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection

http://www.ibridge.be/?p=194