JULY SOFT .NET BLOG

About GEYSIR ENTERPRISE SEARCH, .NET, TECHNOLOGY and MORE

Hadoop 1 Master & 2 Slaves Setup

Why is Hadoop important in handling Big Data?

Hadoop manages big data by supporting the processing of large data sets in a distributed computing environment. It is designed to scale from a single server to thousands of machines, each providing computation and storage. Its distributed file system supports rapid data transfer among nodes and lets the system keep operating when a node fails, which lowers the risk of catastrophic system failure even if a significant number of nodes go out of action. This makes Hadoop very valuable for large-scale businesses.

Hadoop installation scenario on 3 Ubuntu machines:

ub1 is the master node; ub2 and ub3 are the slave nodes.

Steps:

  1. We will install Hadoop on master node ub1

  2. Hadoop is a Java-based framework, so we will install Java first:

      • sudo add-apt-repository ppa:webupd8team/java

      • sudo apt-get update

      • sudo apt-get install default-jdk

      • sudo apt-get install oracle-java8-installer

  • The last command installs Java at "/usr/lib/jvm/java-8-oracle". To check that the installation succeeded, use the following command:

      • /usr/lib/jvm/java-8-oracle/bin/java -version

  • Create a hadoop group and "hduser" user as system user:

      • sudo addgroup hadoop

      • sudo adduser --ingroup hadoop hduser

  • Install SSH for securely accessing one machine from another (used by Hadoop for accessing the slave nodes):

      • sudo apt-get install openssh-server

  • Configure SSH. Login with hduser:

      • sudo su hduser

  • Generate SSH key for hduser:

      • ssh-keygen -t rsa -P ""

  • Copy id_rsa.pub to authorized keys from hduser:

      • cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

  • Add "hduser" to sudoers:

      • sudo adduser hduser sudo

  • Hadoop doesn't work on IPv6, so IPv6 must be disabled:

      • sudo apt install gksu

      • sudo apt install gedit

      • sudoedit /etc/sysctl.conf

    • Add the following settings to that file:

# disable ipv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1

      • CTRL+X -> yes
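The saved settings are loaded at the next reboot, or immediately with `sudo sysctl -p`. As a quick sanity check before moving on, the sketch below parses the text of `/etc/sysctl.conf` and reports which of the three IPv6 keys are still missing or not set to 1. The helper name `missing_ipv6_settings` is only illustrative; it is not part of Ubuntu or Hadoop.

```python
# Keys that the step above adds to /etc/sysctl.conf.
REQUIRED_KEYS = (
    "net.ipv6.conf.all.disable_ipv6",
    "net.ipv6.conf.default.disable_ipv6",
    "net.ipv6.conf.lo.disable_ipv6",
)


def missing_ipv6_settings(sysctl_text):
    """Return the required keys that are absent or not set to 1."""
    values = {}
    for line in sysctl_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (sysctl.conf allows '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return [key for key in REQUIRED_KEYS if values.get(key) != "1"]
```

Calling `missing_ipv6_settings(open("/etc/sysctl.conf").read())` should return an empty list once all three lines are in place.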

  • Go to the Hadoop installation parent directory:

      • cd /usr/local/

  • Download Hadoop:

      • sudo wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

  • Extract Hadoop sources:

      • sudo tar -xzvf hadoop-2.7.3.tar.gz

  • Move hadoop-2.7.3 to hadoop folder:

      • sudo mv hadoop-2.7.3 /usr/local/hadoop

  • Assign ownership of this folder to Hadoop user hduser:

      • sudo chown hduser:hadoop -R /usr/local/hadoop

  • Create Hadoop temp dirs for namenode and datanode:

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

  • Assign ownership of this Hadoop temp folder to Hadoop user:

      • sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

  • Check JAVA_HOME path:

      • update-alternatives --config java

  • Edit hadoop configuration files. Edit the ".bashrc" file in hduser's home directory:

      • sudoedit ~/.bashrc

      • add into it:

# -- HADOOP ENVIRONMENT VARIABLES START -- #

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

# -- HADOOP ENVIRONMENT VARIABLES END -- #

  • Edit "hadoop-env.sh":

      • cd /usr/local/hadoop/etc/hadoop

      • sudoedit hadoop-env.sh

      • add into above file:

        export JAVA_HOME=/usr/lib/jvm/java-8-oracle

  • Edit "core-site.xml":

      • cd /usr/local/hadoop/etc/hadoop

      • sudoedit core-site.xml

      • Add into above file, inside the <configuration> element:

        <property>

        <name>fs.default.name</name>

        <value>hdfs://localhost:9000</value>

        </property>

  • Edit "hdfs-site.xml":

    • cd /usr/local/hadoop/etc/hadoop

    • sudoedit hdfs-site.xml

    • add into above file:

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>

</property>

<property>

<name>dfs.datanode.data.dir</name>

<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>

</property>

  • Edit "yarn-site.xml":

    • cd /usr/local/hadoop/etc/hadoop

    • sudoedit yarn-site.xml

    • Add into above file:

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

  • Copy template of mapred-site.xml.template file:

      • cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

  • Edit "mapred-site.xml":

    • cd /usr/local/hadoop/etc/hadoop

    • sudoedit mapred-site.xml

    • Add into above file:

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>
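The four files edited above share one shape: every `<property>` block sits inside the root `<configuration>` element. A minimal sketch of that structure, rendering a site file from a dictionary with Python's standard library only. The helper `render_site_xml` is an illustrative name, not a Hadoop tool.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a Hadoop *-site.xml document from a {name: value} dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Example: the content of mapred-site.xml from the step above.
mapred_site = render_site_xml({"mapreduce.framework.name": "yarn"})
```

Parsing an edited file back with `ET.parse(path)` is also a cheap way to catch an unclosed tag before starting the daemons.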

  • Restart the PC and open the terminal again as hduser. Format the namenode:

      • cd /usr/local/hadoop/etc/hadoop

      • hdfs namenode -format

  • Start all Hadoop daemons:

      • cd /usr/local/hadoop/

      • start-dfs.sh

      • start-yarn.sh

  • Verify hadoop daemons:

      • jps

  • Now we will extend the Hadoop setup to the slave nodes.

  • Add all host names to the /etc/hosts file on all machines (master and slave nodes). You can find each PC's IP address using the ifconfig command

    • on UB1 / then on UB2 / then on UB3:

      • sudo vim /etc/hosts

      • if vim is not installed, install it using:

          • sudo apt-get update

          • sudo apt-get install vim

      • Add into above file:

10.0.3.15 UB1

10.0.3.16 UB2

10.0.3.17 UB3

  • Create the hadoop group and the hduser user on all slave PCs

      • sudo addgroup hadoop

      • sudo adduser --ingroup hadoop hduser

      • sudo usermod -a -G sudo hduser (or edit "/etc/sudoers" with "sudo visudo" and add: hduser ALL=(ALL:ALL) ALL)

  • Install rsync for sharing hadoop source on all PCs

      • sudo apt-get install rsync

      • sudo reboot

  • Edit core-site.xml on master PC:

      • cd /usr/local/hadoop/etc/hadoop

      • sudo vim core-site.xml

      • replace localhost with UB1

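  • Edit hdfs-site.xml on master and change the replication factor (dfs.replication) from 1 to 3. Note that with only two DataNodes (UB2 and UB3) each block can have at most 2 replicas, so HDFS will report blocks as under-replicated; a value of 2 matches this cluster.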

  • Edit yarn-site.xml on master:

<property>

<name>yarn.resourcemanager.resource-tracker.address</name>

<value>UB1:8025</value>

</property>

<property>

<name>yarn.resourcemanager.scheduler.address</name>

<value>UB1:8035</value>

</property>

<property>

<name>yarn.resourcemanager.address</name>

<value>UB1:8050</value>

</property>

  • Edit mapred-site.xml on master and add new entry:

    <property>

<name>mapreduce.job.tracker</name>

<value>UB1:5431</value>

</property>

  • Edit the masters file on the master node:

      • cd /usr/local/hadoop/etc/hadoop

      • sudo vim masters

      • add line:

        ## Add name of master nodes

        UB1

  • Update slaves on master:

    • cd /usr/local/hadoop/etc/hadoop

    • sudo vim slaves

## Add name of slave nodes

UB2

UB3

  • Prepare each slave PC before using rsync:

    • Install SSH:

        • sudo apt-get install openssh-server

    • Generate SSH key for hduser:

        • ssh-keygen -t rsa -P ""

    • Copy id_rsa.pub to authorized keys from hduser:

        • cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    • Disable IPv6 as above

    • Create the Hadoop folder and assign it to hduser:

        • cd /usr/local

        • sudo mkdir hadoop

        • sudo chown hduser:hadoop -R /usr/local/hadoop

  • Use rsync on master to copy Hadoop to each slave:

    • sudo rsync -avxP /usr/local/hadoop/ hduser@UB2:/usr/local/hadoop/

    • sudo rsync -avxP /usr/local/hadoop/ hduser@UB3:/usr/local/hadoop/

  • On master:

      • sudo rm -rf /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode

      • sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

  • On each slave node:

      • sudo rm -rf /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

      • sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

  • Execute on master:

      • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@UB2

      • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@UB3

      • Java must be installed on every node. If there are errors at any step, first install Java on each node (as in step 2) and re-execute the commands above

  • Execute on master:

      • cd /usr/local/hadoop/

      • hdfs namenode -format

      • start-dfs.sh

      • start-yarn.sh

      • jps

  • Execute on each slave:

      • jps
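Each line of `jps` output is a process id followed by the Java main class name. Assuming default settings for this Hadoop 2.x layout, the master should show NameNode, SecondaryNameNode and ResourceManager, and each slave should show DataNode and NodeManager. The sketch below compares pasted `jps` output against that list; the `EXPECTED` table and the helper name are assumptions made for illustration.

```python
# Daemons expected per role in this 1-master / 2-slave layout
# (assumes Hadoop 2.x defaults; adjust if daemons were moved).
EXPECTED = {
    "master": {"NameNode", "SecondaryNameNode", "ResourceManager"},
    "slave": {"DataNode", "NodeManager"},
}


def missing_daemons(jps_output, role):
    """Return the expected daemons that do not appear in `jps` output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line is "<pid> <main class name>"; the name can be absent.
        if len(parts) >= 2:
            running.add(parts[1])
    return sorted(EXPECTED[role] - running)
```

An empty list means every expected daemon for that role is running; anything else names the daemon whose log under /usr/local/hadoop/logs is worth checking.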

  • Test:

  • In order to configure WebHDFS, we need to edit hdfs-site.xml as follows:

        <property>
           <name>dfs.webhdfs.enabled</name>
           <value>true</value>
        </property>
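Once WebHDFS is enabled and HDFS has been restarted, the file system can be reached over HTTP under `/webhdfs/v1` on the NameNode web port (50070 by default in Hadoop 2.x). A minimal sketch that lists a directory through that REST interface; the helper names are illustrative, and `hduser` and `UB1` are taken from the setup above.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host, path, op, user="hduser", port=50070):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = urlencode({"op": op, "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def list_status(host, path="/"):
    """Return the entry names under `path` (needs a running cluster)."""
    with urlopen(webhdfs_url(host, path, "LISTSTATUS")) as response:
        statuses = json.load(response)["FileStatuses"]["FileStatus"]
    return [status["pathSuffix"] for status in statuses]
```

For example, `list_status("UB1", "/user/hduser")` asks the NameNode on UB1 for the contents of that directory.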
  • Copy local folder to hadoop:
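The command for this step is not shown here. A common way to copy a local folder into HDFS is `hdfs dfs -mkdir -p` followed by `hdfs dfs -put`, which copies directories recursively. The sketch below wraps those two commands; the function names and example paths are hypothetical, and it has to run as hduser on a node where the `hdfs` command is on the PATH.

```python
import subprocess


def copy_commands(local_dir, hdfs_dir):
    """Return the two `hdfs dfs` commands that upload a local folder."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],     # create the target directory
        ["hdfs", "dfs", "-put", local_dir, hdfs_dir],  # copy the folder recursively
    ]


def copy_to_hdfs(local_dir, hdfs_dir):
    """Run the commands; raises CalledProcessError if one of them fails."""
    for command in copy_commands(local_dir, hdfs_dir):
        subprocess.run(command, check=True)
```

For example, `copy_to_hdfs("/home/hduser/data", "/user/hduser")` would place the folder at `/user/hduser/data` in HDFS.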

For a Better Search,

July Soft Team - www.julysoft.net

 

Responsibility. Integrity. Passion.

Geysir Ent. Search version 1.1 – CRM Import Service from VC & Emails Free*

Geysir is a high-end Enterprise Search solution. An Enterprise Licence also includes a Document Management Solution (DMS), a Customer Relationship Solution (CRM) and an Issue Tracker Support Solution. Considering the price and the seamless integration of those systems (Geysir Search Portal can also search in them), the choice is really a no-brainer.

 

Let's say you acquired an Enterprise Licence, which includes a DMS and a CRM solution. You now want to use your CRM to make a difference for your customers, so you need to take the time to add every client, its contact data, communication history, etc.

 

We at JULY SOFT are well aware that this is a tedious operation and that you or your employees may not be able to afford the time for it. This is why we offer, for free, a service that imports your clients and contacts (initial import) from 2 main sources:

- your Emails

- your Visit Cards (business cards)

 

Yes, you got it right: we OCR and parse your Visit Cards and then import them into your new Geysir CRM, offered free with the Geysir Enterprise Full Licence! Of course, we cannot guarantee 100% accuracy because OCR is not an exact science, but in our past experience we deliver an average accuracy of 70-90% for VC import into CRM (depending on VC and scanning quality). Note that for phone numbers accuracy tends towards 100%.

 

We will give you and your IT staff full guidance to make this process as fast and smooth as possible. We guarantee that the import is done in 1-2 business days, so you can start using your CRM from day 3 with ALL your contacts and their associated history and data safely stored in your database!

 

*This service is currently free only for clients based in Bucharest / Romania

As you may expect, this operation usually requires our staff on your site, which is why we currently offer this service for free only to clients from Bucharest / Romania. There are nevertheless options to provide the same service to remote clients.

 

For a Better Search,

July Soft Team

 

Responsibility. Integrity. Passion.

Geysir Enterprise Search version 1.1 now supports OCR (Optical Character Recognition) for PDF and IMAGE Formats

As we all know, the company scanner too often does not offer OCR for scanned documents. As a result, companies often have many documents in PDF (Portable Document Format) or image files (png, jpg, tiff, bmp, etc.) that contain text and business data, but only in image/binary form and thus hardly searchable. Moreover, those documents are stored in network shares or SQL databases in the same format (image, non-OCR), with no easy way to search them by the keywords in their text content.

 

We at JULY SOFT are well aware of this fact: companies hold huge amounts of business-critical documents (PDFs or images) as non-OCR image scans. This is why, starting with version 1.1, we have introduced an OCR (Optical Character Recognition) feature for PDF and image files in every Enterprise Licence.

 

Running OCR on all those documents, updating them, and updating their associated keywords and tags takes too long, and your business cannot afford to lose time on non-productive activities. By implementing Geysir Enterprise you get not only a high-end Enterprise Search solution but also OCR out of the box, totally transparent to you. Geysir takes care of OCR for all your scanned PDFs and image files (regardless of the exact format), and you use the Geysir Search Web Portal to search the text within those images, with exactly the same experience as if they were already in text format.

 

Keep in mind that the price of a 12-month Geysir Enterprise Licence includes OCR as one feature among many dozens of others, and yes, it is way cheaper than buying an OCR library that does only OCR. Geysir is a high-end Enterprise Search solution. One Enterprise Licence also includes a Document Management Solution (DMS), a Customer Relationship Solution (CRM) and an Issue Tracker Support Solution. Considering the price and the seamless integration of those systems (Geysir Search Portal can also search in them), the choice is really a no-brainer.

 

For a Better Search,

July Soft Team

 

Responsibility. Integrity. Passion.

Geysir Enterprise Search 1.0.0.8 is out - You Get More Time Every New Version...

July Soft just rolled out Geysir Enterprise Search version 1.0.0.8.

The main new functionality is the "Get More Like This..." button.

As its name says, this new feature lets the user get all other documents that are "similar" to a given one returned by a previous search. It is available both in the Web UI (for Enterprise, Professional or Basic license owners) and in the Desktop Console UI (for all license owners, including Free Version users).

Let's take a very common example:

A lawyer in your company needs to consult "Contract Client Ben". He logs in to the Geysir Search Portal and types "Contract Ben".

Geysir returns a few results, among them "IT Services Contract - Ben.pdf".

The lawyer now needs to see other IT Services contracts to compare them with Ben's contract...

Now he can just click the "More Like This ..." button available on the first result, and he will get all IT Services contracts that exist in the company's repository.

So now any Geysir user, including Free users, can enjoy this time-saving new feature that is needed so often in any company today...

The Geysir Desktop UI has also been further polished and is starting to look as sleek and elegant as the Web UI.

Curious to see if you can Get More Like This? Just ask us for your Geysir Free download link!

 

Happy Searching,

July Soft Team

What is the benefit for My Organization of acquiring July Soft Geysir?

It's hard to measure the gain brought by an Enterprise Search system. But what would we do without such systems? Let's try to imagine a day without web search engines.

The first benefit of Geysir is increased productivity: less time spent searching for documents or re-creating those that cannot be found! New employees get up to speed more easily, costs go down, IT operations get cheaper and customers are satisfied with how quickly your support responds! It is also a real help in implementing any certification and really useful in any audit scenario!

Don't waste your time: you need to find your intranet data quickly.

It's simple, Geysir is your solution.

Feel free to ask for your GEYSIR FREE kit here.

Enterprise Search systems are more than a software product: they are a way of working to be successful!

Change your search experience with us and try Geysir,

July Soft Team – www.julysoft.net

Responsibility. Integrity. Passion.

July Soft Geysir – Take the Search Stress Out Like a Cat Does ...

Dear computer and data user,

I know you sometimes wonder, especially on extremely busy days, how you can improve your search experience and instantly find the file you need.

I know you sometimes wonder what you can do to avoid bothering colleagues with questions about where the data is, asking the network admin where an email is, or losing time re-creating documents you simply cannot find!

Just ask yourself: are you really satisfied with how quickly you find any type of electronic data inside the company?

Maybe not ... Today there is a new product, Geysir, that can find your data within seconds. There really is such a product, and this is no exaggeration!

You may find it hard to believe, but it's true: using Geysir you can find within 1 second everything you are searching for in an index of 800 GB of data files!

Convince yourself in the following ways:

      1. Geysir FREE – use for free a limited version of Geysir

      2. Skype Geysir Demo: request a free live demo of Geysir by Skype

      3. Geysir TRIAL: request the Trial kit to test Geysir for free for 7 days

by sending us an email at julysoft@runbox.com, and give the Geysir product a chance.

You have nothing to lose. On the contrary, you will find a way to take the stress out of company data search as simply as a cat does...

Change your search experience with us and try Geysir,

July Soft Team – www.julysoft.net

Responsibility. Integrity. Passion. Pet lovers.

Geysir v. 1.0.0.7 launched! New features: Search Autocomplete Suggestions and Similarity "More Like This"!

We have just launched version 1.0.0.7 of Geysir Enterprise Search.

As you can see in the above image, two brand new features have been introduced in 1.0.0.7:

1) Search Autocomplete Suggestions - very handy when users are exploring data and do not know in advance exactly what they need to find. One example is when they search for a term or name that they do not understand and try to find out what it means. The term has a context and associated keywords. This new feature does just this: it takes the user's current (incomplete) query and automatically finds the most important keywords associated with its context, thus helping the user understand the initial term better.

2) Similarity "More Like This...". More precisely, the Web UI now has a new button for each document, called "More Like This...". Starting from one document, this feature returns all similar documents in all categories. In contrast with Danube SIMILARITY, which is multi-node (it allows similarity across multiple Geysir instances), "More Like This" works only on the current instance; in return, the latter is available to both admin and normal users, while Danube SIMILARITY is only for admin users.

A very common scenario: the user types Invoice 34, an invoice for Client X. Just by clicking the "More Like This" command on the resulting document "Invoice 34", the user instantly obtains ALL invoices of Client X!

"More Like This" is available in the following license types:
a) Basic
b) Professional
c) Enterprise
Whilst Danube SIMILARITY is available only in Enterprise Licenses!

Conclusion:

Enterprise Search should help users spend less time searching for the information they need and, at the same time, help them find answers to their questions. The Similarity feature helps users get a large collection of documents similar to a given one, which, as in our example above, is very often needed in any company. The Search Autocomplete Suggestions feature helps users dive into and explore large collections of data the same way web search engines do...

Happy Searching,

July Soft Team

Why Successful Enterprise Search Requires Data De-Duplication tools like JULY SOFT KATLA!

Executive Summary:

Before Investing in Enterprise Search, Invest in Maximizing Data Quality - this way you will get the most return from your investment!

 

The above image is in fact a summary of this article. It emphasizes that, as Step #1, before Step #2 (the Enterprise Search implementation), companies should invest time in increasing data quality, because data quality directly impacts the success and return on investment of Enterprise Search tools!

Any Enterprise Search Implementation should be treated as a separate Project.

A Project is: limited in time, has measurable results, has clear goals.

So how do we measure the success of an Enterprise Search implementation project?

There are many variables here, but we will focus on two:

- Recall      - the fraction of ALL RELEVANT DOCUMENTS that appear in the RETURNED RESULTS

- Precision - the fraction of the RETURNED RESULTS that are RELEVANT DOCUMENTS

In plain English, Recall is the capacity of the search engine to "remember" all documents relevant to a user's query, while Precision is its capacity to return a high concentration of relevant documents for that query. The higher the Recall and Precision, the better!

So, a successful Enterprise Search implementation for a company can be measured by computing the values of both Recall and Precision for the most important and most frequent terms that employees use in day-to-day operations!
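As a small illustration of the two measures, the sketch below computes them for one query from two sets of document names: what the search returned and what a reviewer marked as relevant. The file names are made up, and this is the textbook calculation, not Geysir's internal code.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 5 documents are actually relevant.
returned = ["c1.pdf", "c2.pdf", "c3.pdf", "memo.doc"]
relevant = ["c1.pdf", "c2.pdf", "c3.pdf", "c4.pdf", "c5.pdf"]
precision, recall = precision_recall(returned, relevant)
```

Here 3 of the 4 returned results are relevant (precision 0.75), and 3 of the 5 relevant documents were found (recall 0.6).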

We at July Soft develop and implement GEYSIR Enterprise Search. We help companies use their time, information and workforce more efficiently and effectively.

To see more details about benefits of Geysir you may visit this link.

The success of a Geysir implementation depends on its Recall and Precision, as mentioned before. Unfortunately, those 2 variables depend not only on the quality of our software but also heavily on the quality of the input data!

Data quality can be drastically improved by:

- Eliminating Duplicated Files / Data

- Creating Quality Meta-Data (implicit or explicit)

- Organizing Data

- Eliminating old, useless data

- Etc

Just as we offer a limited GEYSIR Free Version that you may request here, we also offer, for FREE, the July Soft KATLA File Organizer and Duplicate Removal Tool.

KATLA de-duplicates files, organizes data (e.g. splits it between archive and working data), creates implicit meta-data through data auto-organization, and more.

As the header image summarizes, before investing in any Enterprise Search implementation project it is worth increasing the quality and searchability of your data. This ensures that the Recall and Precision of the implementation are as high as possible, which maximizes the return on your investment and makes your organization more efficient.

Note: If you are a technical person, or want to see in more detail WHY eliminating duplicates increases Recall and Precision, you need to be at ease with the following terms:

- Term Frequency

- Inverse Document Frequency

- TF-IDF Weighting

You may read about these terms on this Wikipedia page.

If you want more details, email us at iulia@runbox.com and we can give you a free live Geysir demo by Skype, lasting about 1 hour.
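One way to see the effect of duplicates on these weights is the sketch below, which assumes the plain inverse document frequency formula log(N / df); how Geysir actually scores documents is not described here. A term that appears in a single file is distinctive, but once copies of that file pile up the same term looks common and its weight drops. The tiny corpus is invented for the example.

```python
import math


def idf(term, documents):
    """Inverse document frequency: log(N / documents containing term)."""
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / containing) if containing else 0.0


corpus = [
    "invoice client ben",
    "meeting notes monday",
    "holiday schedule",
    "it services contract",
]
# Three extra copies of the first file, e.g. left behind by backups.
with_duplicates = corpus + [corpus[0]] * 3

idf_clean = idf("invoice", corpus)           # log(4 / 1)
idf_dupes = idf("invoice", with_duplicates)  # log(7 / 4)
```

With duplicates the weight of "invoice" falls from log(4) to log(7/4), and the copies also take up result slots that distinct documents could have filled.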

For a better search,

July Soft Team - www.julysoft.net

Responsibility. Integrity. Passion.

Geysir Enterprise Search now supports LAN Indexing

Geysir currently supports quite a few important connectors:

  • Windows Local Disks, Folders, LAN Shares - which is Disk Indexer
  • Outlook PST backups - which is PST Outlook Indexer
  • Unix & Mac OS Folders - which is SSH Indexer
  • Web Sites (both Intranet and Internet) - which is Web Indexer
  • FTP sites (secured or not) - which is FTP Indexer
  • Mailboxes, using POP3 protocol - which is Mail Indexer
  • SQL Databases Binary Files - which is SQL Indexer
  • SharePoint web sites - which is SharePoint Indexer
  • Team Foundation Server sites / tasks / documents - which is Tfs Indexer

Today we have just launched Geysir Enterprise Search Server version 1.0.0.6 that has a brand new family member:

  • LAN IP, Hosts & IP Ranges Shares - which is LAN Indexer

Using this new connector is very easy: in a few seconds and clicks, any network administrator can set up Geysir to index HUNDREDS of LAN computers.

Why so fast? Simply because we support IP ranges and cascading settings, which drastically reduce the time needed to set up a LAN search project!

How does LAN Manager work?

It indexes 1 to N IP addresses, hosts and/or N IP ranges that you set up for indexing. For every computer found in that collection, it crawls all shares set either at root level or for that particular device. We also support excluded hosts, so you can exclude servers, printers, etc.
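To make the idea of ranges with excluded hosts concrete, here is a minimal sketch that expands IPv4 ranges into the list of addresses to crawl and skips the excluded ones. It uses Python's standard `ipaddress` module and illustrates the concept only; it is not Geysir's implementation, and the function name and sample addresses are assumptions.

```python
import ipaddress


def expand_targets(ranges, excluded=()):
    """Expand (first, last) IPv4 ranges into addresses, skipping excluded hosts."""
    skip = {ipaddress.IPv4Address(host) for host in excluded}
    targets = []
    for first, last in ranges:
        start = int(ipaddress.IPv4Address(first))
        end = int(ipaddress.IPv4Address(last))
        # Walk the range inclusively, one address at a time.
        for number in range(start, end + 1):
            address = ipaddress.IPv4Address(number)
            if address not in skip:
                targets.append(str(address))
    return targets


# Hypothetical project: one range, with a printer at .17 excluded.
targets = expand_targets([("10.0.3.15", "10.0.3.18")], excluded=["10.0.3.17"])
```

One range entry plus a short exclusion list replaces a long hand-written host list, which is why range support shortens setup so much.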

And because a picture is worth 1000 words, see below a capture of the Geysir LAN Indexer Setup Settings Dashboard:

As you may notice in the above picture, you can set many IP ranges in your search project. For each one you may opt in to the root settings (Username, Password, Domain, Shares), or opt out and personalize the settings for that range, IP or host!

Just imagine how convenient it is to:

  • Be able to search, in a few seconds, within all important company documents spread over hundreds of computers
  • Be able to access a file from Bob's computer while his computer is shut down, and also see its change history
  • Allow your employees to search inside a whole LAN in a matter of a few seconds
  • Allow your employees to be more efficient by leveraging EXISTING LAN data, which you already have today but which is so difficult to use!

We hope you find this article interesting. If you have any questions about Geysir Enterprise Search Server or LAN Indexer in particular, feel free to:

Happy Searching!

July Soft Team - www.julysoft.net

Responsibility. Integrity. Passion.