
Showing posts from July, 2014

How much memory does a namenode need?

    A namenode can eat up memory, since a reference to every block of every file is maintained in memory. It’s difficult to give a precise formula, since memory usage depends on the number of blocks per file, the filename length, and the number of directories in the filesystem; plus it can change from one Hadoop release to another. The default of 1,000 MB of namenode memory is normally enough for a few million files, but as a rule of thumb for sizing purposes you can conservatively allow 1,000 MB per million blocks of storage.

    You can increase the namenode’s memory without changing the memory allocated to other Hadoop daemons by setting HADOOP_NAMENODE_OPTS in hadoop-env.sh to include a JVM option for setting the memory size. HADOOP_NAMENODE_OPTS allows you to pass extra options to the namenode’s JVM. So, for example, if using a Sun JVM, -Xmx2000m would specify that 2,000 MB of memory should be allocated to the namenode.

    If you change the namenode’s memory allocation, don’t forget to do the same for the secondary namenode (using HADOOP_SECONDARYNAMENODE_OPTS), since its memory requirements are comparable to the primary namenode’s.
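
    To make the rule of thumb concrete (the cluster numbers here are invented for illustration): a 200-node cluster with 24 TB of disk per node, a 128 MB block size, and a replication factor of 3 has room for roughly 200 × 24,000,000 MB ÷ (128 MB × 3) ≈ 12.5 million blocks, so about 12,000 MB of namenode memory is a sensible starting point. A minimal hadoop-env.sh sketch for that allocation (exact flags depend on your JVM and Hadoop version):

        # hadoop-env.sh -- sketch only; size -Xmx to your own block count
        export HADOOP_NAMENODE_OPTS="-Xmx12000m"
        export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx12000m"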

What Constitutes Progress in MapReduce?

     Progress is not always measurable, but nevertheless it tells Hadoop that a task is doing something. For example, a task writing output records is making progress, even though it cannot be expressed as a percentage of the total number that will be written, since the latter figure may not be known, even by the task producing the output. Progress reporting is important, as it means Hadoop will not fail a task that’s making progress. All of the following operations constitute progress:

    - Reading an input record (in a mapper or reducer)
    - Writing an output record (in a mapper or reducer)
    - Setting the status description on a reporter (using Reporter’s setStatus() method)
    - Incrementing a counter (using Reporter’s incrCounter() method)
    - Calling Reporter’s progress() method
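
    As a concrete illustration, here is a minimal mapper sketch using the old org.apache.hadoop.mapred API whose Reporter methods are named above (the counter enum and status text are made up for the example):

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        public class ProgressMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

          enum Records { PROCESSED }  // hypothetical counter for illustration

          @Override
          public void map(LongWritable key, Text value,
                          OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
            // Writing an output record counts as progress
            output.collect(value, new LongWritable(1));
            // So does setting the status description...
            reporter.setStatus("processed record at offset " + key.get());
            // ...incrementing a counter...
            reporter.incrCounter(Records.PROCESSED, 1);
            // ...or calling progress() directly during a long computation
            reporter.progress();
          }
        }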

Setting User Identity (Hadoop)

    The user identity that Hadoop uses for permissions in HDFS is determined by running the whoami command on the client system. Similarly, the group names are derived from the output of running groups.

    If, however, your Hadoop user identity is different from the name of your user account on your client machine, then you can explicitly set your Hadoop username and group names by setting the hadoop.job.ugi property. The username and group names are specified as a comma-separated list of strings.

    You can set the user identity that the HDFS web interface runs as by setting dfs.web.ugi using the same syntax. By default, it is webuser,webgroup, which is not a superuser, so system files are not accessible through the web interface. Notice that, by default, there is no authentication with this system.
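
    A minimal client-side sketch of setting this property (the username alice and group analysts are hypothetical, and this mechanism applies to older, pre-security versions of Hadoop that honor hadoop.job.ugi):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class IdentityExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Comma-separated list: the username first, then the group names
            conf.set("hadoop.job.ugi", "alice,analysts");
            FileSystem fs = FileSystem.get(conf);
            // HDFS operations through fs are now permission-checked as alice
            System.out.println(fs.exists(new Path("/user/alice")));
          }
        }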

Which Compression Format Should I Use for HDFS?

     Which compression format you should use depends on your application. Do you want to maximize the speed of your application, or are you more concerned about keeping storage costs down? In general, you should try different strategies for your application, and benchmark them with representative datasets to find the best approach. For large, unbounded files, like log files, the options are:

    - Store the files uncompressed.
    - Use a compression format that supports splitting, like bzip2 (although bzip2 is fairly slow), or one that can be indexed to support splitting, like LZO.
    - Split the file into chunks in the application and compress each chunk separately using any supported compression format. In this case, you should choose the chunk size so that the compressed chunks are approximately the size of an HDFS block.
    - Use a SequenceFile, which supports compression and splitting.
    - Use an Avro data file, which supports compression and splitting, just like a SequenceFile, but has the added advantage of being readable and writable from many languages, not just Java.
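
    As one concrete option from the list above, here is a minimal sketch of writing a block-compressed SequenceFile with the classic API (the path and records are hypothetical):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class SeqFileWrite {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/logs/events.seq");  // hypothetical path
            // BLOCK compression compresses batches of records together,
            // which is compact while keeping the file splittable
            SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
                IntWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
            try {
              for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("log line " + i));
              }
            } finally {
              writer.close();
            }
          }
        }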

Compilers vs. Interpreters

       There are two general methods by which a program can be executed. It can be compiled, or it can be interpreted. Although programs written in any computer language can be compiled or interpreted, some languages are designed more for one form of execution than the other. For example, Java was designed to be interpreted, and C was designed to be compiled. However, in the case of C, it is important to understand that it was specifically optimized as a compiled language. Although C interpreters have been written and are available in some environments, C was developed with compilation in mind. Therefore, you will almost certainly be using a C compiler and not a C interpreter when developing your C programs.

     Since the difference between a compiler and an interpreter may not be clear to all readers, the following brief description will clarify matters. In its simplest form, an interpreter reads the source code of your program one line at a time, performing the specific instructions contained in that line. A compiler, by contrast, reads your entire program and translates it into object code that the computer can execute directly, so the translation happens once, before the program is run.
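
    To make the distinction concrete, here is a toy line-at-a-time interpreter for an invented two-command language (everything about the language is made up for illustration):

        import java.util.HashMap;
        import java.util.Map;

        // A toy interpreter: it performs each line's instruction as it reads
        // it, with no separate translation step beforehand. A compiler would
        // instead translate the whole program to machine code up front.
        public class TinyInterpreter {
          public static void main(String[] args) {
            String[] program = { "set x 5", "set y 7", "print x", "print y" };
            Map<String, Integer> vars = new HashMap<>();
            for (String line : program) {            // read one line at a time...
              String[] tok = line.split(" ");
              if (tok[0].equals("set")) {            // ...and execute it immediately
                vars.put(tok[1], Integer.parseInt(tok[2]));
              } else if (tok[0].equals("print")) {
                System.out.println(vars.get(tok[1]));
              }
            }
          }
        }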

Preventing Brute Force Attack

    A brute force attack is an attempt to log into a secure system by making lots of attempts in the hopes of eventual success. (Brute force is also an algorithm design technique in computer science.) It’s not a sophisticated type of attack, hence the name “brute force.” For example, if you have a login process that requires a username and password, there is a limit to the possible number of username/password combinations. That limit may be in the billions or trillions, but it is still a finite number. Using algorithms and automated processes, a brute force attack repeatedly tries combinations until one succeeds.

    The best way to prevent brute force attacks from succeeding is to require users to register with good, hard-to-guess passwords: containing letters, numbers, and punctuation; both upper- and lowercase; not dictionary words; at least eight characters long; and so on. Also, don’t give indications as to why a login failed: saying that a username and password combination isn’t correct reveals less to an attacker than indicating whether it was the username or the password that was wrong.
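
    A minimal sketch of checking the password rules listed above (the class and method names are invented for the example, and the dictionary-word check is omitted since it needs a word list):

        public class PasswordPolicy {
          // True only if the password is at least eight characters and
          // mixes lowercase, uppercase, digits, and punctuation
          public static boolean isStrong(String pw) {
            return pw.length() >= 8
                && pw.matches(".*[a-z].*")        // lowercase letter
                && pw.matches(".*[A-Z].*")        // uppercase letter
                && pw.matches(".*[0-9].*")        // digit
                && pw.matches(".*[^A-Za-z0-9].*"); // punctuation/symbol
          }

          public static void main(String[] args) {
            System.out.println(isStrong("password"));   // false: fails most rules
            System.out.println(isStrong("S3cure!pw"));  // true: meets every rule
          }
        }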