2008 03 26

Jeff was the main proponent of the map-reduce model: the map-reduce guy. He reviews the infrastructure and the data sets (heterogeneous and at least petabyte scale). Their goal: maximize performance per buck. Data centers and locality are also key in the equation. Hardware is low cost (no redundant power supplies, no RAID disks, running Linux, standard network).
Software needs to be resilient to failures (nodes, disks, or racks going dead). Linux runs on all production machines. A scheduler across the cluster schedules the jobs. A cluster-wide file system (GFS) sits on top, and usually a Bigtable cell. GFS has a centralized master that manages metadata, chunk allocation, and replication (clients talk to the master and then to the chunk servers). Bigtable helps applications that need a bit more structured storage (row key, column, timestamp). It also provides garbage collection policies.
Data is broken into tablets (ranges of rows), each managed by a single machine; the system can split growing tablets.
Bigtable provides transactions and lets you specify columns to group together for locality. It allows replication policies across data centers. MapReduce is a nicely fitting model for some programs (for instance, building the reverse index).
It allows moving computation closer to the data and also implementing load balancing. GFS works OK for a cluster, but it does not have a global view across data centers. For instance, they are looking for unique naming of data and, once integrated, for letting data centers keep working if they become disconnected. They are also looking into data distribution policies.
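To make the reverse index example concrete, here is a toy single-process sketch of the map, shuffle, and reduce steps. This is my own illustration, not Google's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`) and the two sample documents are made up, and a real framework would run the phases on many machines close to the data.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word occurrence in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Collapse the document ids seen for one word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = (pair
             for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample documents, only for illustration.
docs = {"d1": "big data big table", "d2": "big cluster"}
index = build_inverted_index(docs)
print(index)
```

Each map call only needs its own document and each reduce call only needs the values for its own key, which is what lets the work be split across machines.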

2008 03 26

Energy community and HPC. It is cheaper to collect a lot of samples and run simulations to decide where to drill (the extremely costly part). A review of several efforts on modeling for science follows. They also collect data on hardware failures and maintenance cycles. Job interruptions grow linearly with the number of chips (quite an interesting result). Compute power on the Top 500 still doubles every year, and so do the failure rates. Losing a disk becomes more and more painful (regeneration currently takes about 8 hours per terabyte and is predicted to take weeks). All this calls for a change in the design. The approach is to spread data as much as possible, which can help reduce reconstruction times linearly. File systems are “object” file systems such as the Google FS and Hadoop FS. pNFS: scalable NFS, coming soon. A key goal: reuse the tools people are currently using, to speed up adoption and make it appealing to users.
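A back-of-the-envelope sketch of why spreading data helps, under my own idealized assumption that the rebuild work divides evenly among the disks holding pieces of the lost data (no contention, no network limits). The 8 hours per terabyte figure is the one quoted in the talk; the function name and the parameter `parallel_sources` are made up for the illustration.

```python
def rebuild_hours(capacity_tb, hours_per_tb=8.0, parallel_sources=1):
    """Estimate rebuild time when `parallel_sources` disks share the work.

    Idealized model: time shrinks linearly with the number of disks
    that contribute to the reconstruction.
    """
    if parallel_sources < 1:
        raise ValueError("parallel_sources must be at least 1")
    return capacity_tb * hours_per_tb / parallel_sources


# One terabyte lost, rebuilt by 1, 4, or 16 disks working at once.
for n in (1, 4, 16):
    print(n, rebuild_hours(1.0, parallel_sources=n))
```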

2008 03 26

Jill is reviewing what is going on with data and biology. There has been an explosion in the amount of data they are generating (in both volume and throughput). Simulations have also become common practice, as have robot operations, etc.: more and more data. Some numbers: their center now uses 4.8K processors and 1440+ terabytes of storage. The challenge is to give the proper tools to biologists (not CS people). The two key topics of the talk: computation paradigms and computation foundations. They rely heavily on gene expression arrays (rows are patients, columns are genes, values are expression levels). A simple example: classifying leukemias (an example of how they can be distinguished using expression arrays). Take patient samples, extract messenger RNA, and then create the expression signatures (high dimensionality, small training sample set). They repeated the same approach to predict prognosis outcomes for brain cancer, but for this problem there was no signal strong enough to make them accurate. Genes work in regulatory networks (sets of genes), and they tried to do the analysis this way, which acts as adding background knowledge to the problem, boosting the results and making the treatment possible. But the problem is that there should be an infrastructure that is easy to use and able to replicate experiments. The infrastructure should integrate and interact with components. It should support technical and non-technical users equally. Two interfaces (visual and programmatic). Access to a library of functions, pipeline writing, language agnostic, and built on web services (scalable from the laptop to clusters). Its name: GenePattern. They are collaborating with Microsoft on a tool (for Word documents) that links to the pipelines and the data (which can be run with other versions) and appends results to the document too.

2008 03 26

Dan Reed (former NCSA director, now at Microsoft Research) continues with the meeting presentations. His elevator pitch: the infrastructure needs to take into account applications and the user experience. The current trend is that monolithic data consolidation is crumbling under dispersion, changing the traditional picture. The flavors of big data can be explored along two dimensions: (regular/irregular) versus (structured/unstructured). He emphasizes focusing more on the user experience with big data, and on how you can manage resources at any given point. Cloud computing can help organically orchestrate these resources on demand. He also shows some examples of Dryad (the Microsoft take on map-reduce architectures) and DryadLINQ. Another interesting comment:

Building simple things ain’t easy.

I definitely agree with this one :D. Finally, he mentions his initiative to bring academics, business, and users together around the big data problem (PCAST NITRD review).

2008 03 26

UIUC CS professor ChengXiang Zhai reviews text information management. ChengXiang starts by reviewing the importance of text as a natural way to encode human knowledge. His main focus is how he can provide support for different usages of text information, and how they interact with models, applications, systems, and algorithms. This allowed him to motivate future research directions in information retrieval. Some of his interesting words:

Future research directions require improvements in IR and NLP (shallow: POS, partial parsing, fragmental semantic analysis), but it is fragile and domain oriented. Machine learning algorithms are still not scalable, and there is not enough training data to satisfy the algorithms' requirements. Data mining has lots of algorithms, but only for salient patterns.

ChengXiang says there is a triangle involving: (1) keyword queries (search history, complete user models), (2) bags of words (entity-relations, knowledge representation), (3) search (access, mining, and task support). That leads to personalized search, large-scale semantic analysis, and full-fledged text information management. On the road there is, for sure, scalability (he demoed the UCAIR project as a leap toward new search engines). On large-scale semantics, he emphasizes the importance of graph representations for the analysis and how you can use graph analysis techniques. Changing gears, a third topic is how you can create multi-resolution topic maps for navigation. The basic idea is a zoom-in and zoom-out strategy to drill in and aggregate during navigation.

2008 03 26

Randy opens fire reviewing models of parallelism and how Google’s Map-Reduce model (the core of Yahoo’s Hadoop) is changing the picture. He emphasizes how data is an integral part of the computational process (which has been greatly disregarded). The Map-Reduce model can greatly help because of its fault-tolerance capabilities. Now he is reviewing the two traditional parallel programming models (the shared-memory model and the message-passing model) and how these differ from map-reduce (and how this increases the IO). Initiatives like Hadoop help cut down the cost of accessing large-scale computing.

2008 03 26

I am lucky to attend the Big Data Computing Study Group 2008. The lineup of speakers is impressive. The event is held at Yahoo! Sunnyvale, and Thomas Kwan (UIUC alumnus, now at Yahoo!) is helping organize it. I will keep blogging about it the rest of the day.

2008 01 16

I just posted on my IlliGAL blog how to implement a generic genetic algorithm (GA) main loop by squeezing the dynamic behavior of Python. Pretty sleek; if you have tweaked a GA main loop, you may find this interesting ;)

2008 01 10

For the first time in 9 years, this vacation break I have done absolutely nothing. Wow, what a couch potato I have become! Well, that is not totally true: just for fun I started going over Python and, as usual for any new language, I ended up writing a simple genetic algorithm. I like the flexibility and compactness of the code (no verbosity at all). However, when I fired my first run (yes, the good old OneMax problem), I realized that some of my assumptions about coding did not directly transfer. Yes, it was a bit slow. So I started digging for a profiler and, surprise!, it comes with the Python interpreter.

Here is an example of how to run the profiler:


import cProfile
cProfile.run('main()')

The cProfile module is a profiler coded in C. If you do not have it in your install, you can run the same code with the profile module instead (highly likely to be in your install). Also, if you are using Python < 2.5 you may want to use profile instead (I read somewhere there was a bug in cProfile, but I cannot recall where I saw it). Below you can read the output of the profiler.


1246109 function calls (1096109 primitive calls) in 1.428 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.428 1.428 <string>:1(<module>)
156000/6000 0.366 0.000 0.905 0.000 copy.py:144(deepcopy)
29953 0.008 0.000 0.008 0.000 copy.py:197(_deepcopy_atomic)
6000 0.154 0.000 0.535 0.000 copy.py:223(_deepcopy_list)
6000 0.034 0.000 0.740 0.000 copy.py:250(_deepcopy_dict)
47953 0.105 0.000 0.131 0.000 copy.py:260(_keep_alive)
6000 0.040 0.000 0.861 0.000 copy.py:276(_deepcopy_inst)
3000 0.180 0.000 0.258 0.000 crossovers.py:6(uniformCrossover)
6600 0.005 0.000 0.017 0.000 fitnesses.py:5(oneMax)
6000 0.006 0.000 0.006 0.000 ind_n_pop_classes.py:16(__init__)
11 0.023 0.002 0.040 0.004 ind_n_pop_classes.py:35(evaluate)
10 0.004 0.000 1.071 0.107 ind_n_pop_classes.py:63(selection)
10 0.026 0.003 0.317 0.032 ind_n_pop_classes.py:67(crossover)
18011 0.074 0.000 0.079 0.000 random.py:147(randrange)
18011 0.023 0.000 0.102 0.000 random.py:211(randint)
10 0.081 0.008 1.067 0.107 selections.py:7(tournamentSelection)
1 0.001 0.001 1.428 1.428 test.py:39(main)
24000 0.033 0.000 0.033 0.000 {hasattr}
227953 0.042 0.000 0.042 0.000 {id}
18038 0.004 0.000 0.004 0.000 {len}
12008 0.004 0.000 0.004 0.000 {method 'add' of 'set' objects}
293953 0.085 0.000 0.085 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
203953 0.061 0.000 0.061 0.000 {method 'get' of 'dict' objects}
6000 0.002 0.000 0.002 0.000 {method 'iteritems' of 'dict' objects}
141011 0.032 0.000 0.032 0.000 {method 'random' of '_random.Random' objects}
6000 0.006 0.000 0.006 0.000 {method 'update' of 'dict' objects}
21 0.000 0.000 0.000 0.000 {range}
6600 0.012 0.000 0.012 0.000 {sum}
3000 0.017 0.000 0.017 0.000 {zip}
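The listing above is ordered by standard name, which buries the hot spots among the cheap calls. Here is a small sketch that falls back to the profile module when cProfile is missing and sorts the report by cumulative time using the standard pstats module. The `main` function here is only a stand-in workload I made up, not my GA code.

```python
import io
import pstats

try:
    import cProfile as profiler   # C implementation, lower overhead
except ImportError:
    import profile as profiler    # pure-Python fallback with the same Profile class


def main():
    # Stand-in workload (hypothetical); replace with the real entry point.
    return sum(i * i for i in range(1000))


prof = profiler.Profile()
result = prof.runcall(main)       # profile a single call and keep its return value

stream = io.StringIO()
stats = pstats.Stats(prof, stream=stream)
stats.sort_stats('cumulative').print_stats(5)   # show only the 5 most expensive entries
report = stream.getvalue()
print(report)
```

Sorted this way, the functions that dominate the run time (including everything they call) end up at the top of the report.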

Yes, I used the deepcopy method because it was nice and made my life easy. Yup, big mistake. That forced my selection to take about 75% of the overall execution time (1.071 of 1.428 CPU seconds in the listing above). Quite unacceptable. Thanks to the profiler, I now knew where to look for slowness and, more importantly, I learned which gaps in my Python knowledge need to be filled :D
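One way around the deepcopy cost, sketched under the assumption that an individual is just a flat list of bits plus a cached fitness. The `Individual` class and its `clone` method below are hypothetical stand-ins, not the classes from my code.

```python
import copy


class Individual(object):
    """Hypothetical individual: a flat list of bits plus a cached fitness."""

    def __init__(self, genome, fitness=None):
        self.genome = genome
        self.fitness = fitness

    def clone(self):
        # A slice copies a flat list of ints; no recursive traversal is needed.
        return Individual(self.genome[:], self.fitness)


parent = Individual([1, 0, 1, 1], fitness=3)

slow = copy.deepcopy(parent)   # generic: walks the instance, its dict, and its list
fast = parent.clone()          # copies only what this class actually needs

fast.genome[0] = 0             # mutating the clone leaves the parent untouched
print(parent.genome, fast.genome)
```

The generic deepcopy has to discover the structure of the object at run time on every call, which is where the `_deepcopy_inst`, `_deepcopy_dict`, and `_deepcopy_list` rows in the listing come from; a hand-written clone skips that work.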

2007 11 26

Jena 2 Inference Support is a nice introduction to the inference engines provided by the Jena package. Besides standardized reasoning for RDFS and a subset of OWL/Lite and OWL/Full ontologies, it also provides the mechanisms for creating your own rule-based inference engine using the generic rule reasoner it includes.
